The vIOMMU object is designed to represent a slice of an IOMMU HW whose virtualization features are shared with or passed to user space (mostly a VM) in a HW-accelerated way. It extends the HWPT-based design to cover more advanced virtualization features.
A vCMDQ, introduced by this series as part of the vIOMMU infrastructure, represents a HW-backed queue/buffer for a VM to use exclusively, e.g. NVIDIA's virtual command queue or AMD vIOMMU's command buffer. Either is an IOMMU HW feature that directly loads and executes cache invalidation commands issued by a guest kernel, to shoot down TLB entries that the HW cached for guest-owned stage-1 page table entries. This is a big improvement, since there is no VM Exit during an invalidation, compared to the traditional pathway of trapping a guest-owned invalidation queue and forwarding those commands/requests to the host kernel, which eventually fills a HW-owned queue to execute them.
Thus, a vCMDQ object, as an initial use case, is all about a guest-owned HW command queue that the VMM can allocate/configure depending on the request from a guest kernel. Introduce a new IOMMUFD_OBJ_VCMDQ and its allocator IOMMUFD_CMD_VCMDQ_ALLOC, allowing the VMM to forward the IOMMU-specific queue info, such as the queue base address and size.
Meanwhile, a guest-owned command queue needs the guest kernel (its command queue driver) to control the queue by reading/writing its consumer and producer indexes, which means the command queue HW must give the guest kernel direct R/W access to those registers. Introduce an mmap infrastructure in the iommufd core to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The VMA info (vm_pgoff/size) used by an mmap must be pre-allocated during IOMMUFD_CMD_VCMDQ_ALLOC, which returns that info to user space as output driver data. So, this requires driver-specific user data support on a vIOMMU object.
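As a rough user-space sketch (the field names here are illustrative, not the final uAPI): once IOMMUFD_CMD_VCMDQ_ALLOC returns an mmap offset and length in its driver-specific output data, the VMM can map that MMIO page through the iommufd and expose it to the guest as the queue's control registers:

    /* Hypothetical helper; mmap_offset/mmap_length come from the ioctl output */
    #include <stdint.h>
    #include <sys/mman.h>

    static void *map_vcmdq_mmio(int iommufd, uint64_t mmap_offset, size_t mmap_length)
    {
            void *regs = mmap(NULL, mmap_length, PROT_READ | PROT_WRITE,
                              MAP_SHARED, iommufd, (off_t)mmap_offset);

            return regs == MAP_FAILED ? NULL : regs;
    }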
As a real-world use case, this series implements vCMDQ support in the tegra241-cmdqv driver for the vCMDQ on the NVIDIA Grace CPU. In other words, this is also Part 2 (user-space support) of the Tegra CMDQV series, reworked from the previous RFC v1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW-accelerated feature on the NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in nested translation mode, which traps the CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions in invalidation time were measured by various DMA unmap tests running in a guest OS.
This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_vcmdq-v2
Pairing QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_vcmdq-v2
Changelog
v2
 * Add Reviewed-by from Jason
 * [smmu] Fix vsmmu initial value
 * [smmu] Support impl for hw_info
 * [tegra] Rename "slot" to "vsid"
 * [tegra] Update kdocs and commit logs
 * [tegra] Map/unmap LVCMDQ dynamically
 * [tegra] Refcount the previous LVCMDQ
 * [tegra] Return -EEXIST if LVCMDQ exists
 * [tegra] Simplify VINTF cleanup routine
 * [tegra] Use vmid and s2_domain in vsmmu
 * [tegra] Rename "mmap_pgoff" to "immap_id"
 * [tegra] Add more addr and length validation
 * [iommufd] Add more narrative to mmap's kdoc
 * [iommufd] Add iommufd_struct_depend/undepend()
 * [iommufd] Rename vcmdq_free op to vcmdq_destroy
 * [iommufd] Fix bug in iommu_copy_struct_to_user()
 * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
 * [iommufd] Test the queue memory for its contiguity
 * [iommufd] Return -ENXIO if address or length fails
 * [iommufd] Do not change @min_last in mock_viommu_alloc()
 * [iommufd] Generalize TEGRA241_VCMDQ data in core structure
 * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
 * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks Nicolin
Nicolin Chen (22):
  iommufd/viommu: Add driver-allocated vDEVICE support
  iommu: Pass in a driver-level user data structure to viommu_alloc op
  iommufd/viommu: Allow driver-specific user data for a vIOMMU object
  iommu: Add iommu_copy_struct_to_user helper
  iommufd: Add iommufd_struct_destroy to revert iommufd_viommu_alloc
  iommufd/selftest: Support user_data in mock_viommu_alloc
  iommufd/selftest: Add coverage for viommu data
  iommufd: Abstract iopt_pin_pages and iopt_unpin_pages helpers
  iommufd/viommu: Introduce IOMMUFD_OBJ_VCMDQ and its related struct
  iommufd/viommu: Add IOMMUFD_CMD_VCMDQ_ALLOC ioctl
  iommufd: Add for-driver helpers iommufd_vcmdq_depend/undepend()
  iommufd/selftest: Add coverage for IOMMUFD_CMD_VCMDQ_ALLOC
  iommufd: Add mmap interface
  iommufd/selftest: Add coverage for the new mmap interface
  Documentation: userspace-api: iommufd: Update vCMDQ
  iommu/arm-smmu-v3-iommufd: Add vsmmu_alloc impl op
  iommu/arm-smmu-v3-iommufd: Support implementation-defined hw_info
  iommu/tegra241-cmdqv: Use request_threaded_irq
  iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
  iommu/tegra241-cmdqv: Do not statically map LVCMDQs
  iommu/tegra241-cmdqv: Add user-space use support
  iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h    |  25 +-
 drivers/iommu/iommufd/io_pagetable.h           |   8 +
 drivers/iommu/iommufd/iommufd_private.h        |  25 +-
 drivers/iommu/iommufd/iommufd_test.h           |  20 +
 include/linux/iommu.h                          |  43 +-
 include/linux/iommufd.h                        | 146 ++++
 include/uapi/linux/iommufd.h                   | 113 ++++-
 tools/testing/selftests/iommu/iommufd_utils.h  |  51 +-
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c      |  42 +-
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c     | 451 ++++++++++++++++-
 drivers/iommu/iommufd/device.c                 | 117 +----
 drivers/iommu/iommufd/driver.c                 |  81 ++++
 drivers/iommu/iommufd/io_pagetable.c           |  95 ++++
 drivers/iommu/iommufd/main.c                   |  58 ++-
 drivers/iommu/iommufd/selftest.c               | 123 ++++-
 drivers/iommu/iommufd/viommu.c                 | 111 ++++-
 tools/testing/selftests/iommu/iommufd.c        |  93 +++-
 .../selftests/iommu/iommufd_fail_nth.c         |  11 +-
 Documentation/userspace-api/iommufd.rst        |  14 +
 19 files changed, 1436 insertions(+), 191 deletions(-)
To allow IOMMU drivers to allocate their own vDEVICE structures, move struct iommufd_vdevice to the public header and provide a pair of viommu ops.
iommufd_vdevice_alloc_ioctl() will prefer the callback from the viommu ops, i.e. a driver-allocated vDEVICE, when the driver provides one.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 8 ------ include/linux/iommufd.h | 34 +++++++++++++++++++++++++ drivers/iommu/iommufd/viommu.c | 9 ++++++- 3 files changed, 42 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 80e8c76d25f2..5c69ac05c029 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -607,14 +607,6 @@ void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj);
-struct iommufd_vdevice { - struct iommufd_object obj; - struct iommufd_ctx *ictx; - struct iommufd_viommu *viommu; - struct device *dev; - u64 id; /* per-vIOMMU virtual ID */ -}; - #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); void iommufd_selftest_destroy(struct iommufd_object *obj); diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 34b6e6ca4bfa..83e5c4dff121 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -104,6 +104,14 @@ struct iommufd_viommu { unsigned int type; };
+struct iommufd_vdevice { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommufd_viommu *viommu; + struct device *dev; + u64 id; /* per-vIOMMU virtual ID */ +}; + /** * struct iommufd_viommu_ops - vIOMMU specific operations * @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory @@ -120,6 +128,13 @@ struct iommufd_viommu { * array->entry_num to report the number of handled requests. * The data structure of the array entry must be defined in * include/uapi/linux/iommufd.h + * @vdevice_alloc: Allocate a vDEVICE object and init its driver-level structure + * or HW procedure. Note that the core-level structure is filled + * by the iommufd core after calling this op. @virt_id carries a + * per-vIOMMU virtual ID for the driver to initialize its HW. + * @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice. + * The memory of the vDEVICE will be free-ed by iommufd core + * after calling this op */ struct iommufd_viommu_ops { void (*destroy)(struct iommufd_viommu *viommu); @@ -128,6 +143,10 @@ struct iommufd_viommu_ops { const struct iommu_user_data *user_data); int (*cache_invalidate)(struct iommufd_viommu *viommu, struct iommu_user_data_array *array); + struct iommufd_vdevice *(*vdevice_alloc)(struct iommufd_viommu *viommu, + struct device *dev, + u64 virt_id); + void (*vdevice_destroy)(struct iommufd_vdevice *vdev); };
#if IS_ENABLED(CONFIG_IOMMUFD) @@ -245,4 +264,19 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret->member.ops = viommu_ops; \ ret; \ }) + +#define iommufd_vdevice_alloc(viommu, drv_struct, member) \ + ({ \ + drv_struct *ret; \ + \ + static_assert(__same_type(struct iommufd_viommu, *viommu)); \ + static_assert(__same_type(struct iommufd_vdevice, \ + ((drv_struct *)NULL)->member)); \ + static_assert(offsetof(drv_struct, member.obj) == 0); \ + ret = (drv_struct *)_iommufd_object_alloc( \ + viommu->ictx, sizeof(drv_struct), IOMMUFD_OBJ_VDEVICE);\ + if (!IS_ERR(ret)) \ + ret->member.viommu = viommu; \ + ret; \ + }) #endif diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 01df2b985f02..d4c7c5072e42 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -90,6 +90,9 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj) container_of(obj, struct iommufd_vdevice, obj); struct iommufd_viommu *viommu = vdev->viommu;
+ if (viommu->ops && viommu->ops->vdevice_destroy) + viommu->ops->vdevice_destroy(vdev); + /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */ xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL); refcount_dec(&viommu->obj.users); @@ -124,7 +127,11 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) goto out_put_idev; }
- vdev = iommufd_object_alloc(ucmd->ictx, vdev, IOMMUFD_OBJ_VDEVICE); + if (viommu->ops && viommu->ops->vdevice_alloc) + vdev = viommu->ops->vdevice_alloc(viommu, idev->dev, virt_id); + else + vdev = iommufd_object_alloc(ucmd->ictx, vdev, + IOMMUFD_OBJ_VDEVICE); if (IS_ERR(vdev)) { rc = PTR_ERR(vdev); goto out_put_idev;
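For illustration, a minimal sketch of how a driver might implement the new pair of ops on top of the iommufd_vdevice_alloc() helper above; the "my_vdevice" structure, its hw_vsid field, and my_vdevice_alloc/destroy() are hypothetical names, not part of this series:

    struct my_vdevice {
            struct iommufd_vdevice core;    /* must stay at offset 0 */
            u64 hw_vsid;                    /* driver/HW-specific state */
    };

    static struct iommufd_vdevice *
    my_vdevice_alloc(struct iommufd_viommu *viommu, struct device *dev, u64 virt_id)
    {
            struct my_vdevice *my_vdev;

            my_vdev = iommufd_vdevice_alloc(viommu, struct my_vdevice, core);
            if (IS_ERR(my_vdev))
                    return ERR_CAST(my_vdev);

            /* Use the per-vIOMMU virtual ID to set up the HW for this device */
            my_vdev->hw_vsid = virt_id;
            return &my_vdev->core;
    }

    static void my_vdevice_destroy(struct iommufd_vdevice *vdev)
    {
            struct my_vdevice *my_vdev = container_of(vdev, struct my_vdevice, core);

            /* Tear down any HW state; the memory itself is freed by the core */
            my_vdev->hw_vsid = 0;
    }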
On 4/26/25 13:57, Nicolin Chen wrote:
@@ -120,6 +128,13 @@ struct iommufd_viommu {
array->entry_num to report the number of handled requests.
The data structure of the array entry must be defined in
include/uapi/linux/iommufd.h
- @vdevice_alloc: Allocate a vDEVICE object and init its driver-level structure
or HW procedure. Note that the core-level structure is filled
by the iommufd core after calling this op. @virt_id carries a
per-vIOMMU virtual ID for the driver to initialize its HW.
I'm wondering whether the 'per-vIOMMU virtual ID' is intended to be generic for other features that might require a vdevice. I'm also not sure where this virtual ID originates when I read it here. Could it potentially come from the KVM instance? If so, how about retrieving it directly from a struct kvm pointer? My understanding is that vIOMMU in IOMMUFD acts as a handle to KVM, so perhaps we should maintain a reference to the kvm pointer within the iommufd_viommu structure?
- @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice.
The memory of the vDEVICE will be free-ed by iommufd core
after calling this op
*/ struct iommufd_viommu_ops { void (*destroy)(struct iommufd_viommu *viommu);
@@ -128,6 +143,10 @@ struct iommufd_viommu_ops { const struct iommu_user_data *user_data); int (*cache_invalidate)(struct iommufd_viommu *viommu, struct iommu_user_data_array *array);
- struct iommufd_vdevice *(*vdevice_alloc)(struct iommufd_viommu *viommu,
struct device *dev,
u64 virt_id);
- void (*vdevice_destroy)(struct iommufd_vdevice *vdev); };
Thanks, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Sunday, April 27, 2025 2:24 PM
On 4/26/25 13:57, Nicolin Chen wrote:
@@ -120,6 +128,13 @@ struct iommufd_viommu {
array->entry_num to report the number of handled requests.
The data structure of the array entry must be defined in
include/uapi/linux/iommufd.h
- @vdevice_alloc: Allocate a vDEVICE object and init its driver-level
structure
or HW procedure. Note that the core-level structure is filled
by the iommufd core after calling this op. @virt_id carries a
per-vIOMMU virtual ID for the driver to initialize its HW.
I'm wondering whether the 'per-vIOMMU virtual ID' is intended to be generic for other features that might require a vdevice. I'm also not sure where this virtual ID originates when I read it here. Could it
for PCI it's the virtual BDF in the guest PCI topology, hence provided by the VMM when calling @vdevice_alloc:
potentially come from the KVM instance? If so, how about retrieving it directly from a struct kvm pointer? My understanding is that vIOMMU in IOMMUFD acts as a handle to KVM, so perhaps we should maintain a reference to the kvm pointer within the iommufd_viommu structure?
It's OK to maintain a KVM pointer in viommu (for which I recall such discussion for confidential io), but obviously it's not the requirement in this series.
On Mon, Apr 28, 2025 at 12:41:33AM +0000, Tian, Kevin wrote:
From: Baolu Lu baolu.lu@linux.intel.com Sent: Sunday, April 27, 2025 2:24 PM
On 4/26/25 13:57, Nicolin Chen wrote:
@@ -120,6 +128,13 @@ struct iommufd_viommu {
array->entry_num to report the number of handled requests.
The data structure of the array entry must be defined in
include/uapi/linux/iommufd.h
- @vdevice_alloc: Allocate a vDEVICE object and init its driver-level
structure
or HW procedure. Note that the core-level structure is filled
by the iommufd core after calling this op. @virt_id carries a
per-vIOMMU virtual ID for the driver to initialize its HW.
I'm wondering whether the 'per-vIOMMU virtual ID' is intended to be generic for other features that might require a vdevice. I'm also not sure where this virtual ID originates when I read it here. Could it
for PCI it's the virtual BDF in the guest PCI topology, hence provided by the VMM when calling @vdevice_alloc:
The "virtual ID" here can, but not necessarily always, be BDF.
Jason had remarks when we added the ioctl: https://lore.kernel.org/linux-iommu/20241004114147.GF1365916@nvidia.com/
And the uAPI kdoc (include/uapi/linux/iommufd.h) has its description:
/**
 * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
...
 * @virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID
 *           of AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
So, yes, here we are just forwarding that from the ioctl to the viommu op. Perhaps I should add a line here:
 * @vdevice_alloc: Allocate a vDEVICE object and init its driver-level structure
 *                 or HW procedure. Note that the core-level structure is filled
 *                 by the iommufd core after calling this op. @virt_id carries a
 *                 per-vIOMMU virtual ID (refer to struct iommu_vdevice_alloc in
 *                 include/uapi/linux/iommufd.h) for the driver to initialize its
 *                 HW for an attached physical device.
Thanks Nicolin
The new type of vIOMMU for tegra241-cmdqv needs to pass in a driver-level data structure from user space via iommufd, so add a user_data to the op.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 ++- include/linux/iommu.h | 3 ++- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++- drivers/iommu/iommufd/selftest.c | 8 ++++---- drivers/iommu/iommufd/viommu.c | 2 +- 5 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index dd1ad56ce863..6b8f0d20dac3 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1062,7 +1062,8 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, struct iommu_domain *parent, struct iommufd_ctx *ictx, - unsigned int viommu_type); + unsigned int viommu_type, + const struct iommu_user_data *user_data); int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, struct arm_smmu_nested_domain *nested_domain); void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 3a8d35d41fda..ba7add27e9a0 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -662,7 +662,8 @@ struct iommu_ops {
struct iommufd_viommu *(*viommu_alloc)( struct device *dev, struct iommu_domain *parent_domain, - struct iommufd_ctx *ictx, unsigned int viommu_type); + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data);
const struct iommu_domain_ops *default_domain_ops; unsigned long pgsize_bitmap; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index e4fd8d522af8..66855cae775e 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -385,7 +385,8 @@ static const struct iommufd_viommu_ops arm_vsmmu_ops = { struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, struct iommu_domain *parent, struct iommufd_ctx *ictx, - unsigned int viommu_type) + unsigned int viommu_type, + const struct iommu_user_data *user_data) { struct arm_smmu_device *smmu = iommu_get_iommu_dev(dev, struct arm_smmu_device, iommu); diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 18d9a216eb30..8b8ba4fb91cd 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -733,10 +733,10 @@ static struct iommufd_viommu_ops mock_viommu_ops = { .cache_invalidate = mock_viommu_cache_invalidate, };
-static struct iommufd_viommu *mock_viommu_alloc(struct device *dev, - struct iommu_domain *domain, - struct iommufd_ctx *ictx, - unsigned int viommu_type) +static struct iommufd_viommu * +mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data) { struct mock_iommu_device *mock_iommu = iommu_get_iommu_dev(dev, struct mock_iommu_device, iommu_dev); diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index d4c7c5072e42..fffa57063c60 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -48,7 +48,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) }
viommu = ops->viommu_alloc(idev->dev, hwpt_paging->common.domain, - ucmd->ictx, cmd->type); + ucmd->ictx, cmd->type, NULL); if (IS_ERR(viommu)) { rc = PTR_ERR(viommu); goto out_put_hwpt;
On 4/26/25 13:57, Nicolin Chen wrote:
The new type of vIOMMU for tegra241-cmdqv needs to pass in a driver-level data structure from user space via iommufd, so add a user_data to the op.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
It would be better to add some words explaining what kind of user data can be passed when allocating a vIOMMU object and the reason why this might be necessary.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Thanks, baolu
On Sun, Apr 27, 2025 at 02:31:54PM +0800, Baolu Lu wrote:
On 4/26/25 13:57, Nicolin Chen wrote:
The new type of vIOMMU for tegra241-cmdqv needs to pass in a driver-level data structure from user space via iommufd, so add a user_data to the op.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
It would be better to add some words explaining what kind of user data can be passed when allocating a vIOMMU object and the reason why this might be necessary.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Sure. Will do something like this:
The new type of vIOMMU for tegra241-cmdqv allows user space VM to use one of its virtual command queue HW resources exclusively. This requires user space to mmap the corresponding MMIO page from kernel space for direct HW control.
To forward the mmap info (vm_pgoff and size), iommufd should add a driver specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for driver to output the info (during the allocation) back to user space.
Similar to the existing ioctls and their IOMMU handlers, add a user_data to viommu_alloc op to bridge between iommufd and drivers.
Thanks Nicolin
On Mon, Apr 28, 2025 at 10:19:21AM -0700, Nicolin Chen wrote:
On Sun, Apr 27, 2025 at 02:31:54PM +0800, Baolu Lu wrote:
On 4/26/25 13:57, Nicolin Chen wrote:
The new type of vIOMMU for tegra241-cmdqv needs to pass in a driver-level data structure from user space via iommufd, so add a user_data to the op.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
It would be better to add some words explaining what kind of user data can be passed when allocating a vIOMMU object and the reason why this might be necessary.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Sure. Will do something like this:
The new type of vIOMMU for tegra241-cmdqv allows user space VM to use one of its virtual command queue HW resources exclusively. This requires user space to mmap the corresponding MMIO page from kernel space for direct HW control.
To forward the mmap info (vm_pgoff and size), iommufd should add a driver specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for driver to output the info (during the allocation) back to user space.
Similar to the existing ioctls and their IOMMU handlers, add a user_data to viommu_alloc op to bridge between iommufd and drivers.
Ack, with this change (addressing Lu's nit).
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks Nicolin
The new type of vIOMMU for tegra241-cmdqv driver needs a driver-specific user data. So, add data_len/uptr to the iommu_viommu_alloc uAPI and pass it in via the viommu_alloc iommu op.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/uapi/linux/iommufd.h | 6 ++++++ drivers/iommu/iommufd/viommu.c | 8 +++++++- 2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f29b6c44655e..cc90299a08d9 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -965,6 +965,9 @@ enum iommu_viommu_type { * @dev_id: The device's physical IOMMU will be used to back the virtual IOMMU * @hwpt_id: ID of a nesting parent HWPT to associate to * @out_viommu_id: Output virtual IOMMU ID for the allocated object + * @data_len: Length of the type specific data + * @__reserved: Must be 0 + * @data_uptr: User pointer to an array of driver-specific virtual IOMMU data * * Allocate a virtual IOMMU object, representing the underlying physical IOMMU's * virtualization support that is a security-isolated slice of the real IOMMU HW @@ -985,6 +988,9 @@ struct iommu_viommu_alloc { __u32 dev_id; __u32 hwpt_id; __u32 out_viommu_id; + __u32 data_len; + __u32 __reserved; + __aligned_u64 data_uptr; }; #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index fffa57063c60..a65153458a26 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -17,6 +17,11 @@ void iommufd_viommu_destroy(struct iommufd_object *obj) int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) { struct iommu_viommu_alloc *cmd = ucmd->cmd; + const struct iommu_user_data user_data = { + .type = cmd->type, + .uptr = u64_to_user_ptr(cmd->data_uptr), + .len = cmd->data_len, + }; struct iommufd_hwpt_paging *hwpt_paging; struct iommufd_viommu *viommu; struct iommufd_device *idev; @@ -48,7 +53,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) }
viommu = ops->viommu_alloc(idev->dev, hwpt_paging->common.domain, - ucmd->ictx, cmd->type, NULL); + ucmd->ictx, cmd->type, + user_data.len ? &user_data : NULL); if (IS_ERR(viommu)) { rc = PTR_ERR(viommu); goto out_put_hwpt;
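For reference, a user-space call with the new fields might look like the sketch below, reusing the selftest's loopback payload (struct iommu_viommu_selftest, defined later in this series in iommufd_test.h); error handling is trimmed and the fd/IDs are assumed to be set up already:

    struct iommu_viommu_selftest data = { .in_data = 0xbeef };
    struct iommu_viommu_alloc cmd = {
            .size = sizeof(cmd),
            .type = IOMMU_VIOMMU_TYPE_SELFTEST,
            .dev_id = dev_id,
            .hwpt_id = hwpt_id,
            .data_uptr = (uintptr_t)&data,
            .data_len = sizeof(data),
    };

    if (!ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd)) {
            /* The mock driver echoes in_data back via out_data */
            assert(data.out_data == data.in_data);
            viommu_id = cmd.out_viommu_id;
    }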
On 4/26/25 13:57, Nicolin Chen wrote:
The new type of vIOMMU for tegra241-cmdqv driver needs a driver-specific user data. So, add data_len/uptr to the iommu_viommu_alloc uAPI and pass it in via the viommu_alloc iommu op.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
On Fri, Apr 25, 2025 at 10:57:58PM -0700, Nicolin Chen wrote:
The new type of vIOMMU for tegra241-cmdqv driver needs a driver-specific user data. So, add data_len/uptr to the iommu_viommu_alloc uAPI and pass it in via the viommu_alloc iommu op.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Acked-by: Pranjal Shrivastava praan@google.com
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ba7add27e9a0..634ff647888d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -562,6 +562,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size, return 0; }
+/** + * __iommu_copy_struct_to_user - Report iommu driver specific user space data + * @dst_data: Pointer to a struct iommu_user_data for user space data location + * @src_data: Pointer to an iommu driver specific user data that is defined in + * include/uapi/linux/iommufd.h + * @data_type: The data type of the @dst_data. Must match with @src_data.type + * @data_len: Length of current user data structure, i.e. sizeof(struct _src) + * @min_len: Initial length of user data structure for backward compatibility. + * This should be offsetofend using the last member in the user data + * struct that was initially added to include/uapi/linux/iommufd.h + */ +static inline int +__iommu_copy_struct_to_user(const struct iommu_user_data *dst_data, + void *src_data, unsigned int data_type, + size_t data_len, size_t min_len) +{ + if (WARN_ON(!dst_data || !src_data)) + return -EINVAL; + if (dst_data->type != data_type) + return -EINVAL; + if (dst_data->len < min_len || data_len < dst_data->len) + return -EINVAL; + return copy_struct_to_user(dst_data->uptr, dst_data->len, src_data, + data_len, NULL); +} + +/** + * iommu_copy_struct_to_user - Report iommu driver specific user space data + * @user_data: Pointer to a struct iommu_user_data for user space data location + * @ksrc: Pointer to an iommu driver specific user data that is defined in + * include/uapi/linux/iommufd.h + * @data_type: The data type of the @ksrc. Must match with @user_data->type + * @min_last: The last member of the data structure @ksrc points in the initial + * version. + * Return 0 for success, otherwise -error. + */ +#define iommu_copy_struct_to_user(user_data, ksrc, data_type, min_last) \ + __iommu_copy_struct_to_user(user_data, ksrc, data_type, sizeof(*ksrc), \ + offsetofend(typeof(*ksrc), min_last)) + /** * struct iommu_ops - iommu ops and capabilities * @capable: check capability
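As a hedged usage sketch (the structure and type value below are hypothetical, only to show the calling convention), a driver's viommu_alloc implementation could report its output data with the new helper like this:

    struct iommu_viommu_example {            /* hypothetical uAPI struct */
            __u32 out_mmap_id;
            __u32 out_queue_size;
    };

    static int example_report_to_user(const struct iommu_user_data *user_data,
                                      u32 mmap_id, u32 queue_size)
    {
            struct iommu_viommu_example out = {
                    .out_mmap_id = mmap_id,
                    .out_queue_size = queue_size,
            };

            /* out_queue_size is the last member of the initial struct version */
            return iommu_copy_struct_to_user(user_data, &out,
                                             IOMMU_VIOMMU_TYPE_EXAMPLE,
                                             out_queue_size);
    }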
On 4/26/25 13:57, Nicolin Chen wrote:
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
On Fri, Apr 25, 2025 at 10:57:59PM -0700, Nicolin Chen wrote:
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ba7add27e9a0..634ff647888d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -562,6 +562,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size, return 0; } +/**
- __iommu_copy_struct_to_user - Report iommu driver specific user space data
- @dst_data: Pointer to a struct iommu_user_data for user space data location
- @src_data: Pointer to an iommu driver specific user data that is defined in
include/uapi/linux/iommufd.h
- @data_type: The data type of the @dst_data. Must match with @src_data.type
^ Nit: Must match with @dst_data type.
- @data_len: Length of current user data structure, i.e. sizeof(struct _src)
- @min_len: Initial length of user data structure for backward compatibility.
This should be offsetofend using the last member in the user data
struct that was initially added to include/uapi/linux/iommufd.h
- */
+static inline int +__iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
void *src_data, unsigned int data_type,
size_t data_len, size_t min_len)
+{
- if (WARN_ON(!dst_data || !src_data))
return -EINVAL;
- if (dst_data->type != data_type)
return -EINVAL;
- if (dst_data->len < min_len || data_len < dst_data->len)
return -EINVAL;
- return copy_struct_to_user(dst_data->uptr, dst_data->len, src_data,
data_len, NULL);
+}
+/**
- iommu_copy_struct_to_user - Report iommu driver specific user space data
- @user_data: Pointer to a struct iommu_user_data for user space data location
- @ksrc: Pointer to an iommu driver specific user data that is defined in
include/uapi/linux/iommufd.h
- @data_type: The data type of the @ksrc. Must match with @user_data->type
- @min_last: The last member of the data structure @ksrc points in the initial
version.
- Return 0 for success, otherwise -error.
- */
+#define iommu_copy_struct_to_user(user_data, ksrc, data_type, min_last) \
- __iommu_copy_struct_to_user(user_data, ksrc, data_type, sizeof(*ksrc), \
offsetofend(typeof(*ksrc), min_last))
/**
- struct iommu_ops - iommu ops and capabilities
- @capable: check capability
With the above nit. Reviewed-by: Pranjal Shrivastava praan@google.com
-- 2.43.0
On Mon, Apr 28, 2025 at 05:50:28PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:57:59PM -0700, Nicolin Chen wrote:
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ba7add27e9a0..634ff647888d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -562,6 +562,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size, return 0; } +/**
- __iommu_copy_struct_to_user - Report iommu driver specific user space data
- @dst_data: Pointer to a struct iommu_user_data for user space data location
- @src_data: Pointer to an iommu driver specific user data that is defined in
include/uapi/linux/iommufd.h
- @data_type: The data type of the @dst_data. Must match with @src_data.type
^
Nit: Must match with @dst_data type.
Oh, that's a copy-n-paste mistake. It should be: * @data_type: The data type of the @src_data. Must match with @dst_data.type
Thanks! Nicolin
On Mon, Apr 28, 2025 at 11:21:43AM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 05:50:28PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:57:59PM -0700, Nicolin Chen wrote:
Similar to the iommu_copy_struct_from_user helper receiving data from the user space, add an iommu_copy_struct_to_user helper to report output data back to the user space data pointer.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ba7add27e9a0..634ff647888d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -562,6 +562,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size, return 0; } +/**
- __iommu_copy_struct_to_user - Report iommu driver specific user space data
- @dst_data: Pointer to a struct iommu_user_data for user space data location
- @src_data: Pointer to an iommu driver specific user data that is defined in
include/uapi/linux/iommufd.h
- @data_type: The data type of the @dst_data. Must match with @src_data.type
^
Nit: Must match with @dst_data type.
Oh, that's a copy-n-paste mistake. It should be:
- @data_type: The data type of the @src_data. Must match with @dst_data.type
Ack, yes that's what I meant!
Thanks! Nicolin
Thanks, Praan
An IOMMU driver that allocated a vIOMMU may want to revert the allocation if it encounters an internal error after the allocation. So, there needs to be a destroy helper for drivers to use.
Move iommufd_object_abort() to the driver.c file and the public header, to introduce a common iommufd_struct_destroy() helper that can abort all kinds of driver structures, not confined to iommufd_viommu but also covering the new ones being added in the future.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 1 - include/linux/iommufd.h | 15 +++++++++++++++ drivers/iommu/iommufd/driver.c | 14 ++++++++++++++ drivers/iommu/iommufd/main.c | 13 ------------- 4 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 5c69ac05c029..8d96aa514033 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -180,7 +180,6 @@ static inline void iommufd_put_object(struct iommufd_ctx *ictx, wake_up_interruptible_all(&ictx->destroy_wait); }
-void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx, struct iommufd_object *obj); void iommufd_object_finalize(struct iommufd_ctx *ictx, diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 83e5c4dff121..ef0d3c4765cf 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -211,6 +211,7 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx) struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, enum iommufd_object_type type); +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -226,6 +227,11 @@ _iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, return ERR_PTR(-EOPNOTSUPP); }
+static inline void iommufd_object_abort(struct iommufd_ctx *ictx, + struct iommufd_object *obj) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -279,4 +285,13 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret->member.viommu = viommu; \ ret; \ }) + +/* Helper for IOMMU driver to destroy structures created by allocators above */ +#define iommufd_struct_destroy(ictx, drv_struct, member) \ + ({ \ + static_assert(__same_type(struct iommufd_object, \ + drv_struct->member.obj)); \ + static_assert(offsetof(typeof(*drv_struct), member.obj) == 0); \ + iommufd_object_abort(ictx, &drv_struct->member.obj); \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 922cd1fe7ec2..7980a09761c2 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -36,6 +36,20 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, } EXPORT_SYMBOL_NS_GPL(_iommufd_object_alloc, "IOMMUFD");
+/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */ +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) +{ + XA_STATE(xas, &ictx->objects, obj->id); + void *old; + + xa_lock(&ictx->objects); + old = xas_store(&xas, NULL); + xa_unlock(&ictx->objects); + WARN_ON(old != XA_ZERO_ENTRY); + kfree(obj); +} +EXPORT_SYMBOL_NS_GPL(iommufd_object_abort, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 3df468f64e7d..2b9ee9b4a424 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -51,19 +51,6 @@ void iommufd_object_finalize(struct iommufd_ctx *ictx, WARN_ON(old != XA_ZERO_ENTRY); }
-/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */ -void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) -{ - XA_STATE(xas, &ictx->objects, obj->id); - void *old; - - xa_lock(&ictx->objects); - old = xas_store(&xas, NULL); - xa_unlock(&ictx->objects); - WARN_ON(old != XA_ZERO_ENTRY); - kfree(obj); -} - /* * Abort an object that has been fully initialized and needs destroy, but has * not been finalized.
On 4/26/25 13:58, Nicolin Chen wrote:
An IOMMU driver that allocated a vIOMMU may want to revert the allocation, if it encounters an internal error after the allocation. So, there needs a destroy helper for drivers to use.
A brief explanation or a small code snippet illustrating a typical allocation and potential abort scenario would be helpful.
Move iommufd_object_abort() to the driver.c file and the public header, to introduce common iommufd_struct_destroy() helper that will abort all kinds of driver structures, not confined to iommufd_viommu but also the new ones being added in the future.
Reviewed-by: Jason Gunthorpejgg@nvidia.com Signed-off-by: Nicolin Chennicolinc@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Thanks, baolu
On Sun, Apr 27, 2025 at 02:55:40PM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
An IOMMU driver that allocated a vIOMMU may want to revert the allocation, if it encounters an internal error after the allocation. So, there needs a destroy helper for drivers to use.
A brief explanation or a small code snippet illustrating a typical allocation and potential abort scenario would be helpful.
Will add the followings:
" For instance:
static my_viommu_alloc()
{
	...
	my_viommu = iommufd_vcmdq_alloc(viomm, struct my_viommu, core);
	...
	ret = init_my_viommu();
	if (ret) {
		iommufd_struct_destroy(viommu->ictx, my_viommu, core);
		return ERR_PTR(ret);
	}
	return &my_viommu->core;
}
"
Thanks Nicolin
Add a simple user_data for an input-to-output loopback test.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 13 +++++++++++++ drivers/iommu/iommufd/selftest.c | 19 +++++++++++++++++++ 2 files changed, 32 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 1cd7e8394129..fbf9ecb35a13 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -227,6 +227,19 @@ struct iommu_hwpt_invalidate_selftest {
#define IOMMU_VIOMMU_TYPE_SELFTEST 0xdeadbeef
+/** + * struct iommu_viommu_selftest - vIOMMU data for Mock driver + * (IOMMU_VIOMMU_TYPE_SELFTEST) + * @in_data: Input random data from user space + * @out_data: Output data (matching @in_data) to user space + * + * Simply set @out_data=@in_data for a loopback test + */ +struct iommu_viommu_selftest { + __u32 in_data; + __u32 out_data; +}; + /* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */ #define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST 0xdeadbeef #define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 8b8ba4fb91cd..b04bd2fbc53d 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -740,16 +740,35 @@ mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, { struct mock_iommu_device *mock_iommu = iommu_get_iommu_dev(dev, struct mock_iommu_device, iommu_dev); + struct iommu_viommu_selftest data; struct mock_viommu *mock_viommu; + int rc;
if (viommu_type != IOMMU_VIOMMU_TYPE_SELFTEST) return ERR_PTR(-EOPNOTSUPP);
+ if (user_data) { + rc = iommu_copy_struct_from_user( + &data, user_data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); + if (rc) + return ERR_PTR(rc); + } + mock_viommu = iommufd_viommu_alloc(ictx, struct mock_viommu, core, &mock_viommu_ops); if (IS_ERR(mock_viommu)) return ERR_CAST(mock_viommu);
+ if (user_data) { + data.out_data = data.in_data; + rc = iommu_copy_struct_to_user( + user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); + if (rc) { + iommufd_struct_destroy(ictx, mock_viommu, core); + return ERR_PTR(rc); + } + } + refcount_inc(&mock_iommu->users); return &mock_viommu->core; }
On Fri, Apr 25, 2025 at 10:58:01PM -0700, Nicolin Chen wrote:
Add a simple user_data for an input-to-output loopback test.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Builds fine for me.
Reviewed-by: Pranjal Shrivastava praan@google.com
-- 2.43.0
Extend the existing test_cmd/err_viommu_alloc helpers to accept optional user data. And add a TEST_F for a loopback test.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- tools/testing/selftests/iommu/iommufd_utils.h | 21 +++++++++----- tools/testing/selftests/iommu/iommufd.c | 29 +++++++++++++++---- .../selftests/iommu/iommufd_fail_nth.c | 5 ++-- 3 files changed, 39 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 72f6636e5d90..a5d4cbd089ba 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -897,7 +897,8 @@ static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 pasid, pasid, fault_fd))
static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, - __u32 type, __u32 flags, __u32 *viommu_id) + __u32 flags, __u32 type, void *data, + __u32 data_len, __u32 *viommu_id) { struct iommu_viommu_alloc cmd = { .size = sizeof(cmd), @@ -905,6 +906,8 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, .type = type, .dev_id = device_id, .hwpt_id = hwpt_id, + .data_uptr = (uint64_t)data, + .data_len = data_len, }; int ret;
@@ -916,13 +919,15 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, return 0; }
-#define test_cmd_viommu_alloc(device_id, hwpt_id, type, viommu_id) \ - ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ - type, 0, viommu_id)) -#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, viommu_id) \ - EXPECT_ERRNO(_errno, \ - _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ - type, 0, viommu_id)) +#define test_cmd_viommu_alloc(device_id, hwpt_id, type, data, data_len, \ + viommu_id) \ + ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \ + type, data, data_len, viommu_id)) +#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, data, \ + data_len, viommu_id) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \ + type, data, data_len, viommu_id))
static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id, __u64 virt_id, __u32 *vdev_id) diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 1a8e85afe9aa..8ebbb7fda02d 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -2688,7 +2688,7 @@ FIXTURE_SETUP(iommufd_viommu)
/* Allocate a vIOMMU taking refcount of the parent hwpt */ test_cmd_viommu_alloc(self->device_id, self->hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, &self->viommu_id);
/* Allocate a regular nested hwpt */ @@ -2727,24 +2727,27 @@ TEST_F(iommufd_viommu, viommu_negative_tests) if (self->device_id) { /* Negative test -- invalid hwpt (hwpt_id=0) */ test_err_viommu_alloc(ENOENT, device_id, 0, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL);
/* Negative test -- not a nesting parent hwpt */ test_cmd_hwpt_alloc(device_id, ioas_id, 0, &hwpt_id); test_err_viommu_alloc(EINVAL, device_id, hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL); test_ioctl_destroy(hwpt_id);
/* Negative test -- unsupported viommu type */ test_err_viommu_alloc(EOPNOTSUPP, device_id, self->hwpt_id, - 0xdead, NULL); + 0xdead, NULL, 0, NULL); EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->hwpt_id)); EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->viommu_id)); } else { test_err_viommu_alloc(ENOENT, self->device_id, self->hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + NULL); } }
@@ -2791,6 +2794,20 @@ TEST_F(iommufd_viommu, viommu_alloc_nested_iopf) } }
+TEST_F(iommufd_viommu, viommu_alloc_with_data) +{ + struct iommu_viommu_selftest data = { + .in_data = 0xbeef, + }; + + if (self->device_id) { + test_cmd_viommu_alloc(self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, &data, + sizeof(data), &self->viommu_id); + assert(data.out_data == data.in_data); + } +} + TEST_F(iommufd_viommu, vdevice_alloc) { uint32_t viommu_id = self->viommu_id; @@ -3105,7 +3122,7 @@ TEST_F(iommufd_device_pasid, pasid_attach)
/* Allocate a regular nested hwpt based on viommu */ test_cmd_viommu_alloc(self->device_id, parent_hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, &viommu_id); test_cmd_hwpt_alloc_nested(self->device_id, viommu_id, IOMMU_HWPT_ALLOC_PASID, diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c index e11ec4b121fc..f7ccf1822108 100644 --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c @@ -688,8 +688,9 @@ TEST_FAIL_NTH(basic_fail_nth, device) IOMMU_HWPT_DATA_NONE, 0, 0)) return -1;
- if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, 0, &viommu_id)) + if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, 0, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, + &viommu_id)) return -1;
if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id))
On Fri, Apr 25, 2025 at 10:58:02PM -0700, Nicolin Chen wrote:
Extend the existing test_cmd/err_viommu_alloc helpers to accept optional user data. And add a TEST_F for a loopback test.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava praan@google.com
The new vCMDQ object will be added for HW to access the guest memory for a HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned while HW accesses them and that they are contiguous in the physical address space.
This is very similar to the existing iommufd_access_pin_pages(), which outputs the pinned page list for the caller to test its contiguity.
Move this code from iommufd_access_pin/unpin_pages() and the related functions into a pair of iopt helpers that can be shared with the vCMDQ allocator. Since the vCMDQ allocator will be a user-space-triggered ioctl function, WARN_ON is not a good fit in the new iopt_unpin_pages(), so change it to WARN_ON_ONCE instead.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it in the header since iommufd_access_rw() still uses it.
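As a rough illustration of how a future vCMDQ allocator could consume the new helpers (a sketch only; the iopt/base/length variables and the contiguity policy are assumptions, not part of this patch):

	struct page **pages;
	int max_npages = DIV_ROUND_UP(length, PAGE_SIZE);
	int i, rc;

	pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Pin the guest queue memory so it cannot be unmapped underneath HW */
	rc = iopt_pin_pages(iopt, base, length, pages, 0);
	if (rc)
		goto out_free;

	/* Reject a queue whose backing pages are not physically contiguous */
	for (i = 1; i < max_npages && pages[i]; i++) {
		if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
			iopt_unpin_pages(iopt, base, length);
			rc = -EFAULT;
			break;
		}
	}
out_free:
	kfree(pages);
	return rc;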
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/io_pagetable.h | 8 ++ drivers/iommu/iommufd/iommufd_private.h | 6 ++ drivers/iommu/iommufd/device.c | 117 ++---------------------- drivers/iommu/iommufd/io_pagetable.c | 95 +++++++++++++++++++ 4 files changed, 117 insertions(+), 109 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index 10c928a9a463..4288a2b1a90f 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -114,6 +114,14 @@ static inline unsigned long iopt_area_iova_to_index(struct iopt_area *area, return iopt_area_start_byte(area, iova) / PAGE_SIZE; }
+static inline bool iopt_area_check_prot(struct iopt_area *area, + unsigned int flags) +{ + if (flags & IOMMUFD_ACCESS_RW_WRITE) + return area->iommu_prot & IOMMU_WRITE; + return area->iommu_prot & IOMMU_READ; +} + #define __make_iopt_iter(name) \ static inline struct iopt_##name *iopt_##name##_iter_first( \ struct io_pagetable *iopt, unsigned long start, \ diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 8d96aa514033..79160b039bc7 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -130,6 +130,12 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas, void iopt_enable_large_pages(struct io_pagetable *iopt); int iopt_disable_large_pages(struct io_pagetable *iopt);
+int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length, struct page **out_pages, + unsigned int flags); +void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length); + struct iommufd_ucmd { struct iommufd_ctx *ictx; void __user *ubuffer; diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 2111bad72c72..a5c6be164254 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -1240,58 +1240,17 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length) { - struct iopt_area_contig_iter iter; - struct io_pagetable *iopt; - unsigned long last_iova; - struct iopt_area *area; - - if (WARN_ON(!length) || - WARN_ON(check_add_overflow(iova, length - 1, &last_iova))) - return; - - mutex_lock(&access->ioas_lock); + guard(mutex)(&access->ioas_lock); /* * The driver must be doing something wrong if it calls this before an * iommufd_access_attach() or after an iommufd_access_detach(). */ - if (WARN_ON(!access->ioas_unpin)) { - mutex_unlock(&access->ioas_lock); + if (WARN_ON(!access->ioas_unpin)) return; - } - iopt = &access->ioas_unpin->iopt; - - down_read(&iopt->iova_rwsem); - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) - iopt_area_remove_access( - area, iopt_area_iova_to_index(area, iter.cur_iova), - iopt_area_iova_to_index( - area, - min(last_iova, iopt_area_last_iova(area)))); - WARN_ON(!iopt_area_contig_done(&iter)); - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); + iopt_unpin_pages(&access->ioas_unpin->iopt, iova, length); } EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, "IOMMUFD");
-static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) -{ - if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) - return false; - - if (!iopt_area_contig_done(iter) && - (iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) % - PAGE_SIZE) != (PAGE_SIZE - 1)) - return false; - return true; -} - -static bool check_area_prot(struct iopt_area *area, unsigned int flags) -{ - if (flags & IOMMUFD_ACCESS_RW_WRITE) - return area->iommu_prot & IOMMU_WRITE; - return area->iommu_prot & IOMMU_READ; -} - /** * iommufd_access_pin_pages() - Return a list of pages under the iova * @access: IOAS access to act on @@ -1315,76 +1274,16 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags) { - struct iopt_area_contig_iter iter; - struct io_pagetable *iopt; - unsigned long last_iova; - struct iopt_area *area; - int rc; - /* Driver's ops don't support pin_pages */ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(access->iova_alignment != PAGE_SIZE || !access->ops->unmap)) return -EINVAL;
- if (!length) - return -EINVAL; - if (check_add_overflow(iova, length - 1, &last_iova)) - return -EOVERFLOW; - - mutex_lock(&access->ioas_lock); - if (!access->ioas) { - mutex_unlock(&access->ioas_lock); + guard(mutex)(&access->ioas_lock); + if (!access->ioas) return -ENOENT; - } - iopt = &access->ioas->iopt; - - down_read(&iopt->iova_rwsem); - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) { - unsigned long last = min(last_iova, iopt_area_last_iova(area)); - unsigned long last_index = iopt_area_iova_to_index(area, last); - unsigned long index = - iopt_area_iova_to_index(area, iter.cur_iova); - - if (area->prevent_access || - !iopt_area_contig_is_aligned(&iter)) { - rc = -EINVAL; - goto err_remove; - } - - if (!check_area_prot(area, flags)) { - rc = -EPERM; - goto err_remove; - } - - rc = iopt_area_add_access(area, index, last_index, out_pages, - flags); - if (rc) - goto err_remove; - out_pages += last_index - index + 1; - } - if (!iopt_area_contig_done(&iter)) { - rc = -ENOENT; - goto err_remove; - } - - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); - return 0; - -err_remove: - if (iova < iter.cur_iova) { - last_iova = iter.cur_iova - 1; - iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) - iopt_area_remove_access( - area, - iopt_area_iova_to_index(area, iter.cur_iova), - iopt_area_iova_to_index( - area, min(last_iova, - iopt_area_last_iova(area)))); - } - up_read(&iopt->iova_rwsem); - mutex_unlock(&access->ioas_lock); - return rc; + return iopt_pin_pages(&access->ioas->iopt, iova, length, out_pages, + flags); } EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, "IOMMUFD");
@@ -1431,7 +1330,7 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, goto err_out; }
- if (!check_area_prot(area, flags)) { + if (!iopt_area_check_prot(area, flags)) { rc = -EPERM; goto err_out; } diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 8a790e597e12..160eec49af1b 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1472,3 +1472,98 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, up_write(&iopt->iova_rwsem); return rc; } + +static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) +{ + if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) + return false; + + if (!iopt_area_contig_done(iter) && + (iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) % + PAGE_SIZE) != (PAGE_SIZE - 1)) + return false; + return true; +} + +int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length, struct page **out_pages, + unsigned int flags) +{ + struct iopt_area_contig_iter iter; + unsigned long last_iova; + struct iopt_area *area; + int rc; + + if (!length) + return -EINVAL; + if (check_add_overflow(iova, length - 1, &last_iova)) + return -EOVERFLOW; + + down_read(&iopt->iova_rwsem); + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) { + unsigned long last = min(last_iova, iopt_area_last_iova(area)); + unsigned long last_index = iopt_area_iova_to_index(area, last); + unsigned long index = + iopt_area_iova_to_index(area, iter.cur_iova); + + if (area->prevent_access || + !iopt_area_contig_is_aligned(&iter)) { + rc = -EINVAL; + goto err_remove; + } + + if (!iopt_area_check_prot(area, flags)) { + rc = -EPERM; + goto err_remove; + } + + rc = iopt_area_add_access(area, index, last_index, out_pages, + flags); + if (rc) + goto err_remove; + out_pages += last_index - index + 1; + } + if (!iopt_area_contig_done(&iter)) { + rc = -ENOENT; + goto err_remove; + } + + up_read(&iopt->iova_rwsem); + return 0; + +err_remove: + if (iova < iter.cur_iova) { + last_iova = iter.cur_iova - 1; + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) + iopt_area_remove_access( + area, + iopt_area_iova_to_index(area, iter.cur_iova), + iopt_area_iova_to_index( + area, min(last_iova, + iopt_area_last_iova(area)))); + } + up_read(&iopt->iova_rwsem); + return rc; +} + +void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova, + unsigned long length) +{ + struct iopt_area_contig_iter iter; + unsigned long last_iova; + struct iopt_area *area; + + if (WARN_ON_ONCE(!length) || + WARN_ON_ONCE(check_add_overflow(iova, length - 1, &last_iova))) + return; + + down_read(&iopt->iova_rwsem); + iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) + iopt_area_remove_access( + area, iopt_area_iova_to_index(area, iter.cur_iova), + iopt_area_iova_to_index( + area, + min(last_iova, iopt_area_last_iova(area)))); + WARN_ON_ONCE(!iopt_area_contig_done(&iter)); + up_read(&iopt->iova_rwsem); +}
On 4/26/25 13:58, Nicolin Chen wrote:
The new vCMDQ object will be added for HW to access the guest memory for a HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned while HW accesses them and that they are contiguous in the physical address space.
This is very similar to the existing iommufd_access_pin_pages(), which outputs the pinned page list for the caller to test its contiguity.
Move this code from iommufd_access_pin/unpin_pages() and the related functions into a pair of iopt helpers that can be shared with the vCMDQ allocator. Since the vCMDQ allocator will be a user-space-triggered ioctl function, WARN_ON is not a good fit in the new iopt_unpin_pages(), so change it to WARN_ON_ONCE instead.
I'm uncertain, but perhaps pr_warn_ratelimited() would be a better alternative to WARN_ON() here? WARN_ON_ONCE() generates warnings with kernel call traces in the kernel log, which might lead users to believe that something serious has happened in the kernel.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it in the header since iommufd_access_rw() still uses it.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/io_pagetable.h | 8 ++ drivers/iommu/iommufd/iommufd_private.h | 6 ++ drivers/iommu/iommufd/device.c | 117 ++---------------------- drivers/iommu/iommufd/io_pagetable.c | 95 +++++++++++++++++++ 4 files changed, 117 insertions(+), 109 deletions(-)
Thanks, baolu
On Sun, Apr 27, 2025 at 03:22:13PM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
The new vCMDQ object will be added for HW to access the guest memory for a HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned while HW accesses them and that they are contiguous in the physical address space.
This is very similar to the existing iommufd_access_pin_pages(), which outputs the pinned page list for the caller to test its contiguity.
Move this code from iommufd_access_pin/unpin_pages() and the related functions into a pair of iopt helpers that can be shared with the vCMDQ allocator. Since the vCMDQ allocator will be a user-space-triggered ioctl function, WARN_ON is not a good fit in the new iopt_unpin_pages(), so change it to WARN_ON_ONCE instead.
I'm uncertain, but perhaps pr_warn_ratelimited() would be a better alternative to WARN_ON() here? WARN_ON_ONCE() generates warning messages with kernel call traces in the kernel messages, which might lead users to believe that something serious has happened in the kernel.
We already have similar practice, e.g. iommufd_hwpt_nested_alloc.
In my view, a WARN_ON/WARN_ON_ONCE means there is a kernel bug, which shouldn't occur in the first place and isn't something that user space should be concerned about. In case it is hit, a WARN_ON_ONCE only emits a single trace, which is enough for kernel folks to identify what went wrong, while pr_warn_ratelimited would likely end up with periodic warnings (more lines) that are neither relevant to user space nor useful for the kernel.
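(For illustration only, the two styles being compared would look roughly like this in iopt_unpin_pages(); a sketch, not part of the patch:)

	/* Kept by the patch: a one-shot backtrace flagging a kernel bug */
	if (WARN_ON_ONCE(!length))
		return;

	/* Alternative raised above: quieter, but repeats (rate-limited) and has no backtrace */
	if (!length) {
		pr_warn_ratelimited("iommufd: unpin called with zero length\n");
		return;
	}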
Thanks Nicolin
On Fri, Apr 25, 2025 at 10:58:03PM -0700, Nicolin Chen wrote:
The new vCMDQ object will be added for HW to access the guest memory for a HW-accelerated virtualization feature. It needs to ensure the guest memory pages are pinned while HW accesses them and that they are contiguous in the physical address space.
This is very similar to the existing iommufd_access_pin_pages(), which outputs the pinned page list for the caller to test its contiguity.
Move this code from iommufd_access_pin/unpin_pages() and the related functions into a pair of iopt helpers that can be shared with the vCMDQ allocator. Since the vCMDQ allocator will be a user-space-triggered ioctl function, WARN_ON is not a good fit in the new iopt_unpin_pages(), so change it to WARN_ON_ONCE instead.
Rename check_area_prot() to align with the existing iopt_area helpers, and inline it in the header since iommufd_access_rw() still uses it.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/io_pagetable.h | 8 ++ drivers/iommu/iommufd/iommufd_private.h | 6 ++ drivers/iommu/iommufd/device.c | 117 ++---------------------- drivers/iommu/iommufd/io_pagetable.c | 95 +++++++++++++++++++ 4 files changed, 117 insertions(+), 109 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index 10c928a9a463..4288a2b1a90f 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -114,6 +114,14 @@ static inline unsigned long iopt_area_iova_to_index(struct iopt_area *area, return iopt_area_start_byte(area, iova) / PAGE_SIZE; } +static inline bool iopt_area_check_prot(struct iopt_area *area,
unsigned int flags)
+{
- if (flags & IOMMUFD_ACCESS_RW_WRITE)
return area->iommu_prot & IOMMU_WRITE;
- return area->iommu_prot & IOMMU_READ;
+}
#define __make_iopt_iter(name) \ static inline struct iopt_##name *iopt_##name##_iter_first( \ struct io_pagetable *iopt, unsigned long start, \ diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 8d96aa514033..79160b039bc7 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -130,6 +130,12 @@ int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas, void iopt_enable_large_pages(struct io_pagetable *iopt); int iopt_disable_large_pages(struct io_pagetable *iopt); +int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova,
unsigned long length, struct page **out_pages,
unsigned int flags);
+void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova,
unsigned long length);
struct iommufd_ucmd { struct iommufd_ctx *ictx; void __user *ubuffer; diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 2111bad72c72..a5c6be164254 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -1240,58 +1240,17 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length) {
- struct iopt_area_contig_iter iter;
- struct io_pagetable *iopt;
- unsigned long last_iova;
- struct iopt_area *area;
- if (WARN_ON(!length) ||
WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
return;
- mutex_lock(&access->ioas_lock);
- guard(mutex)(&access->ioas_lock); /*
* The driver must be doing something wrong if it calls this before an
* iommufd_access_attach() or after an iommufd_access_detach().
*/
- if (WARN_ON(!access->ioas_unpin)) {
mutex_unlock(&access->ioas_lock);
- if (WARN_ON(!access->ioas_unpin)) return;
- }
- iopt = &access->ioas_unpin->iopt;
- down_read(&iopt->iova_rwsem);
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
iopt_area_remove_access(
area, iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area,
min(last_iova, iopt_area_last_iova(area))));
- WARN_ON(!iopt_area_contig_done(&iter));
- up_read(&iopt->iova_rwsem);
- mutex_unlock(&access->ioas_lock);
- iopt_unpin_pages(&access->ioas_unpin->iopt, iova, length);
} EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, "IOMMUFD"); -static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) -{
- if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
return false;
- if (!iopt_area_contig_done(iter) &&
(iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) %
PAGE_SIZE) != (PAGE_SIZE - 1))
return false;
- return true;
-}
-static bool check_area_prot(struct iopt_area *area, unsigned int flags) -{
- if (flags & IOMMUFD_ACCESS_RW_WRITE)
return area->iommu_prot & IOMMU_WRITE;
- return area->iommu_prot & IOMMU_READ;
-}
/**
- iommufd_access_pin_pages() - Return a list of pages under the iova
- @access: IOAS access to act on
@@ -1315,76 +1274,16 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags) {
- struct iopt_area_contig_iter iter;
- struct io_pagetable *iopt;
- unsigned long last_iova;
- struct iopt_area *area;
- int rc;
- /* Driver's ops don't support pin_pages */ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(access->iova_alignment != PAGE_SIZE || !access->ops->unmap)) return -EINVAL;
- if (!length)
return -EINVAL;
- if (check_add_overflow(iova, length - 1, &last_iova))
return -EOVERFLOW;
- mutex_lock(&access->ioas_lock);
- if (!access->ioas) {
mutex_unlock(&access->ioas_lock);
- guard(mutex)(&access->ioas_lock);
- if (!access->ioas) return -ENOENT;
- }
- iopt = &access->ioas->iopt;
- down_read(&iopt->iova_rwsem);
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
unsigned long last_index = iopt_area_iova_to_index(area, last);
unsigned long index =
iopt_area_iova_to_index(area, iter.cur_iova);
if (area->prevent_access ||
!iopt_area_contig_is_aligned(&iter)) {
rc = -EINVAL;
goto err_remove;
}
if (!check_area_prot(area, flags)) {
rc = -EPERM;
goto err_remove;
}
rc = iopt_area_add_access(area, index, last_index, out_pages,
flags);
if (rc)
goto err_remove;
out_pages += last_index - index + 1;
- }
- if (!iopt_area_contig_done(&iter)) {
rc = -ENOENT;
goto err_remove;
- }
- up_read(&iopt->iova_rwsem);
- mutex_unlock(&access->ioas_lock);
- return 0;
-err_remove:
- if (iova < iter.cur_iova) {
last_iova = iter.cur_iova - 1;
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
iopt_area_remove_access(
area,
iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area, min(last_iova,
iopt_area_last_iova(area))));
- }
- up_read(&iopt->iova_rwsem);
- mutex_unlock(&access->ioas_lock);
- return rc;
- return iopt_pin_pages(&access->ioas->iopt, iova, length, out_pages,
flags);
} EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, "IOMMUFD"); @@ -1431,7 +1330,7 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, goto err_out; }
if (!check_area_prot(area, flags)) {
if (!iopt_area_check_prot(area, flags)) { rc = -EPERM; goto err_out; }
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 8a790e597e12..160eec49af1b 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1472,3 +1472,98 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, up_write(&iopt->iova_rwsem); return rc; }
+static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter) +{
- if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
return false;
- if (!iopt_area_contig_done(iter) &&
(iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) %
PAGE_SIZE) != (PAGE_SIZE - 1))
return false;
- return true;
+}
+int iopt_pin_pages(struct io_pagetable *iopt, unsigned long iova,
unsigned long length, struct page **out_pages,
unsigned int flags)
+{
- struct iopt_area_contig_iter iter;
- unsigned long last_iova;
- struct iopt_area *area;
- int rc;
- if (!length)
return -EINVAL;
- if (check_add_overflow(iova, length - 1, &last_iova))
return -EOVERFLOW;
- down_read(&iopt->iova_rwsem);
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
unsigned long last_index = iopt_area_iova_to_index(area, last);
unsigned long index =
iopt_area_iova_to_index(area, iter.cur_iova);
if (area->prevent_access ||
Nit: Shouldn't we return -EBUSY or something if (area->prevent_access == 1) ? IIUC, this just means that an unmap attempt is in progress, hence avoid accessing the area.
!iopt_area_contig_is_aligned(&iter)) {
rc = -EINVAL;
goto err_remove;
}
if (!iopt_area_check_prot(area, flags)) {
rc = -EPERM;
goto err_remove;
}
rc = iopt_area_add_access(area, index, last_index, out_pages,
flags);
if (rc)
goto err_remove;
out_pages += last_index - index + 1;
- }
- if (!iopt_area_contig_done(&iter)) {
rc = -ENOENT;
goto err_remove;
- }
- up_read(&iopt->iova_rwsem);
- return 0;
+err_remove:
- if (iova < iter.cur_iova) {
last_iova = iter.cur_iova - 1;
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
iopt_area_remove_access(
area,
iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area, min(last_iova,
iopt_area_last_iova(area))));
- }
- up_read(&iopt->iova_rwsem);
- return rc;
+}
+void iopt_unpin_pages(struct io_pagetable *iopt, unsigned long iova,
unsigned long length)
+{
- struct iopt_area_contig_iter iter;
- unsigned long last_iova;
- struct iopt_area *area;
- if (WARN_ON_ONCE(!length) ||
WARN_ON_ONCE(check_add_overflow(iova, length - 1, &last_iova)))
return;
- down_read(&iopt->iova_rwsem);
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
iopt_area_remove_access(
area, iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area,
min(last_iova, iopt_area_last_iova(area))));
- WARN_ON_ONCE(!iopt_area_contig_done(&iter));
- up_read(&iopt->iova_rwsem);
+}
2.43.0
On Mon, Apr 28, 2025 at 08:14:19PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:03PM -0700, Nicolin Chen wrote:
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
unsigned long last_index = iopt_area_iova_to_index(area, last);
unsigned long index =
iopt_area_iova_to_index(area, iter.cur_iova);
if (area->prevent_access ||
Nit: Shouldn't we return -EBUSY or something if (area->prevent_access == 1) ? IIUC, this just means that an unmap attempt is in progress, hence avoid accessing the area.
Maybe. But this is what it was. So we need a different patch to change that.
Thanks Nicolin
On Mon, Apr 28, 2025 at 03:12:33PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 08:14:19PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:03PM -0700, Nicolin Chen wrote:
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
unsigned long last_index = iopt_area_iova_to_index(area, last);
unsigned long index =
iopt_area_iova_to_index(area, iter.cur_iova);
if (area->prevent_access ||
Nit: Shouldn't we return -EBUSY or something if (area->prevent_access == 1) ? IIUC, this just means that an unmap attempt is in progress, hence avoid accessing the area.
Maybe. But this is what it was. So we need a different patch to change that.
Rereading the code: prevent_access is set by an unmap(), which means there shouldn't be any pin() or rw(), as the caller should finish the unmap() first.
In the newer use case of vCMDQ, it's similar. If VMM is unmapping the stage-2 mapping, it shouldn't try to allocate a vCMDQ.
-EBUSY makes some sense, but -EINVAL could still stand.
So, I am leaving it as is, since this patch is just about moving the functions for sharing.
Nicolin
On Mon, Apr 28, 2025 at 04:34:14PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 03:12:33PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 08:14:19PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:03PM -0700, Nicolin Chen wrote:
- iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
unsigned long last_index = iopt_area_iova_to_index(area, last);
unsigned long index =
iopt_area_iova_to_index(area, iter.cur_iova);
if (area->prevent_access ||
Nit: Shouldn't we return -EBUSY or something if (area->prevent_access == 1) ? IIUC, this just means that an unmap attempt is in progress, hence avoid accessing the area.
Maybe. But this is what it was. So we need a different patch to change that.
Rereading the code. The prevent_access is set by an unmap(), which means there shouldn't be any pin() and rw() as the caller should finish unmap() first.
In the newer use case of vCMDQ, it's similar. If VMM is unmapping the stage-2 mapping, it shouldn't try to allocate a vCMDQ.
-EBUSY makes some sense, but -EINVAL could still stand.
So, I am leaving it as is, since this patch is just about moving the functions for sharing.
Ack. I don't have a strong preference either. This should be fine; we can revisit it if needed in the future.
Reviewed-by: Pranjal Shrivastava praan@google.com
Nicolin
Thanks!
Add a new IOMMUFD_OBJ_VCMDQ with an iommufd_vcmdq structure, representing a command queue type of physical HW passed to a user space VM. This vCMDQ object is a subset of a physical IOMMU's vIOMMU resources, such as: - NVIDIA's virtual command queue - AMD vIOMMU's command buffer
Introduce a struct iommufd_vcmdq and its allocator iommufd_vcmdq_alloc(). Also add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
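To illustrate the intended usage (a sketch only; all the my_* names and the type value below are hypothetical, not defined by this patch), a driver embeds struct iommufd_vcmdq as the first member of its own structure and returns it from its vcmdq_alloc op:

struct my_vcmdq {
	struct iommufd_vcmdq core;	/* must be the first member */
	u32 hw_index;
};

static struct iommufd_vcmdq *
my_vcmdq_alloc(struct iommufd_viommu *viommu, unsigned int type, u32 index,
	       dma_addr_t addr, size_t length)
{
	struct my_vcmdq *my;

	if (type != IOMMU_VCMDQ_TYPE_MY_DRIVER)	/* hypothetical type */
		return ERR_PTR(-EOPNOTSUPP);

	/* Allocates the object; the helper also sets my->core.viommu */
	my = iommufd_vcmdq_alloc(viommu, struct my_vcmdq, core);
	if (IS_ERR(my))
		return ERR_CAST(my);

	my->hw_index = index;
	/* Program addr/length into the HW queue base registers here */
	return &my->core;
}

static void my_vcmdq_destroy(struct iommufd_vcmdq *vcmdq)
{
	/* Disable the HW queue; the iommufd core frees the memory afterwards */
}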
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommufd.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ef0d3c4765cf..e91381aaec5a 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -37,6 +37,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_VIOMMU, IOMMUFD_OBJ_VDEVICE, IOMMUFD_OBJ_VEVENTQ, + IOMMUFD_OBJ_VCMDQ, #ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif @@ -112,6 +113,14 @@ struct iommufd_vdevice { u64 id; /* per-vIOMMU virtual ID */ };
+struct iommufd_vcmdq { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommufd_viommu *viommu; + dma_addr_t addr; + size_t length; +}; + /** * struct iommufd_viommu_ops - vIOMMU specific operations * @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory @@ -135,6 +144,13 @@ struct iommufd_vdevice { * @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice. * The memory of the vDEVICE will be free-ed by iommufd core * after calling this op + * @vcmdq_alloc: Allocate a @type of iommufd_vcmdq as a user space command queue + * for a @viommu. @index carries the logical vcmdq ID (for a multi- + * queue case); @addr carries the guest physical base address of + * the queue memory; @length carries the size of the queue memory + * @vcmdq_destroy: Clean up all driver-specific parts of an iommufd_vcmdq. The + * memory of the iommufd_vcmdq will be free-ed by iommufd core + * after calling this op */ struct iommufd_viommu_ops { void (*destroy)(struct iommufd_viommu *viommu); @@ -147,6 +163,10 @@ struct iommufd_viommu_ops { struct device *dev, u64 virt_id); void (*vdevice_destroy)(struct iommufd_vdevice *vdev); + struct iommufd_vcmdq *(*vcmdq_alloc)(struct iommufd_viommu *viommu, + unsigned int type, u32 index, + dma_addr_t addr, size_t length); + void (*vcmdq_destroy)(struct iommufd_vcmdq *vcmdq); };
#if IS_ENABLED(CONFIG_IOMMUFD) @@ -286,6 +306,21 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret; \ })
+#define iommufd_vcmdq_alloc(viommu, drv_struct, member) \ + ({ \ + drv_struct *ret; \ + \ + static_assert(__same_type(struct iommufd_viommu, *viommu)); \ + static_assert(__same_type(struct iommufd_vcmdq, \ + ((drv_struct *)NULL)->member)); \ + static_assert(offsetof(drv_struct, member.obj) == 0); \ + ret = (drv_struct *)_iommufd_object_alloc( \ + viommu->ictx, sizeof(drv_struct), IOMMUFD_OBJ_VCMDQ); \ + if (!IS_ERR(ret)) \ + ret->member.viommu = viommu; \ + ret; \ + }) + /* Helper for IOMMU driver to destroy structures created by allocators above */ #define iommufd_struct_destroy(ictx, drv_struct, member) \ ({ \
On 4/26/25 13:58, Nicolin Chen wrote:
Add a new IOMMUFD_OBJ_VCMDQ with an iommufd_vcmdq structure, representing a command queue type of physical HW passed to a user space VM. This vCMDQ object is a subset of a physical IOMMU's vIOMMU resources, such as:
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
Introduce a struct iommufd_vcmdq and its allocator iommufd_vcmdq_alloc(). Also add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
with a small nit below ...
include/linux/iommufd.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ef0d3c4765cf..e91381aaec5a 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -37,6 +37,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_VIOMMU, IOMMUFD_OBJ_VDEVICE, IOMMUFD_OBJ_VEVENTQ,
- IOMMUFD_OBJ_VCMDQ, #ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif
@@ -112,6 +113,14 @@ struct iommufd_vdevice { u64 id; /* per-vIOMMU virtual ID */ }; +struct iommufd_vcmdq {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommufd_viommu *viommu;
- dma_addr_t addr;
It's better to add a comment to state that @addr is a guest physical address. Or not?
- size_t length;
+};
Thanks, baolu
On Mon, Apr 28, 2025 at 09:09:19AM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
Add a new IOMMUFD_OBJ_VCMDQ with an iommufd_vcmdq structure, representing a command queue type of physical HW passed to a user space VM. This vCMDQ object is a subset of a physical IOMMU's vIOMMU resources, such as:
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
Introduce a struct iommufd_vcmdq and its allocator iommufd_vcmdq_alloc(). Also add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
with a small nit below ...
include/linux/iommufd.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ef0d3c4765cf..e91381aaec5a 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -37,6 +37,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_VIOMMU, IOMMUFD_OBJ_VDEVICE, IOMMUFD_OBJ_VEVENTQ,
- IOMMUFD_OBJ_VCMDQ, #ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif
@@ -112,6 +113,14 @@ struct iommufd_vdevice { u64 id; /* per-vIOMMU virtual ID */ }; +struct iommufd_vcmdq {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommufd_viommu *viommu;
- dma_addr_t addr;
It's better to add a comment to state that @addr is a guest physical address. Or not?
Yea. Let's add one:
dma_addr_t addr; /* in guest physical address space */
Thanks Nicolin
On Fri, Apr 25, 2025 at 10:58:04PM -0700, Nicolin Chen wrote:
Add a new IOMMUFD_OBJ_VCMDQ with an iommufd_vcmdq structure, representing a command queue type of physical HW passed to a user space VM. This vCMDQ object is a subset of a physical IOMMU's vIOMMU resources, such as:
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
Introduce a struct iommufd_vcmdq and its allocator iommufd_vcmdq_alloc(). Also add a pair of viommu ops for iommufd to forward user space ioctls to IOMMU drivers.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava praan@google.com
include/linux/iommufd.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ef0d3c4765cf..e91381aaec5a 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -37,6 +37,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_VIOMMU, IOMMUFD_OBJ_VDEVICE, IOMMUFD_OBJ_VEVENTQ,
- IOMMUFD_OBJ_VCMDQ,
#ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif @@ -112,6 +113,14 @@ struct iommufd_vdevice { u64 id; /* per-vIOMMU virtual ID */ }; +struct iommufd_vcmdq {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommufd_viommu *viommu;
- dma_addr_t addr;
- size_t length;
+};
/**
- struct iommufd_viommu_ops - vIOMMU specific operations
- @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory
@@ -135,6 +144,13 @@ struct iommufd_vdevice {
- @vdevice_destroy: Clean up all driver-specific parts of an iommufd_vdevice.
The memory of the vDEVICE will be free-ed by iommufd core
after calling this op
- @vcmdq_alloc: Allocate a @type of iommufd_vcmdq as a user space command queue
for a @viommu. @index carries the logical vcmdq ID (for a multi-
queue case); @addr carries the guest physical base address of
the queue memory; @length carries the size of the queue memory
- @vcmdq_destroy: Clean up all driver-specific parts of an iommufd_vcmdq. The
memory of the iommufd_vcmdq will be free-ed by iommufd core
after calling this op
*/
struct iommufd_viommu_ops { void (*destroy)(struct iommufd_viommu *viommu); @@ -147,6 +163,10 @@ struct iommufd_viommu_ops { struct device *dev, u64 virt_id); void (*vdevice_destroy)(struct iommufd_vdevice *vdev);
- struct iommufd_vcmdq *(*vcmdq_alloc)(struct iommufd_viommu *viommu,
unsigned int type, u32 index,
dma_addr_t addr, size_t length);
- void (*vcmdq_destroy)(struct iommufd_vcmdq *vcmdq);
}; #if IS_ENABLED(CONFIG_IOMMUFD) @@ -286,6 +306,21 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, ret; \ }) +#define iommufd_vcmdq_alloc(viommu, drv_struct, member) \
- ({ \
drv_struct *ret; \
\
static_assert(__same_type(struct iommufd_viommu, *viommu)); \
static_assert(__same_type(struct iommufd_vcmdq, \
((drv_struct *)NULL)->member)); \
static_assert(offsetof(drv_struct, member.obj) == 0); \
ret = (drv_struct *)_iommufd_object_alloc( \
viommu->ictx, sizeof(drv_struct), IOMMUFD_OBJ_VCMDQ); \
if (!IS_ERR(ret)) \
ret->member.viommu = viommu; \
ret; \
- })
/* Helper for IOMMU driver to destroy structures created by allocators above */ #define iommufd_struct_destroy(ictx, drv_struct, member) \ ({ \ -- 2.43.0
Introduce a new IOMMUFD_CMD_VCMDQ_ALLOC ioctl for user space to allocate a vCMDQ for a vIOMMU object. Simply increase the refcount of the vIOMMU.
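From the VMM side, the new ioctl would be used roughly like this (a sketch only; the iommufd/viommu_id/guest_queue_* variables and the driver-specific type name are assumptions):

	struct iommu_vcmdq_alloc cmd = {
		.size = sizeof(cmd),
		.viommu_id = viommu_id,
		.type = IOMMU_VCMDQ_TYPE_MY_DRIVER,	/* hypothetical driver type */
		.index = 0,
		.addr = guest_queue_base,	/* GPA trapped from the guest */
		.length = guest_queue_size,
	};
	int rc = ioctl(iommufd, IOMMU_VCMDQ_ALLOC, &cmd);

	if (!rc)
		vcmdq_id = cmd.out_vcmdq_id;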
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 2 + include/uapi/linux/iommufd.h | 41 +++++++++++ drivers/iommu/iommufd/main.c | 6 ++ drivers/iommu/iommufd/viommu.c | 94 +++++++++++++++++++++++++ 4 files changed, 143 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 79160b039bc7..b974c207ae8a 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -611,6 +611,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj); +int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd); +void iommufd_vcmdq_destroy(struct iommufd_object *obj);
#ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index cc90299a08d9..06a763fda47f 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -56,6 +56,7 @@ enum { IOMMUFD_CMD_VDEVICE_ALLOC = 0x91, IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92, IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93, + IOMMUFD_CMD_VCMDQ_ALLOC = 0x94, };
/** @@ -1147,4 +1148,44 @@ struct iommu_veventq_alloc { __u32 __reserved; }; #define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC) + +/** + * enum iommu_vcmdq_type - Virtual Command Queue Type + * @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use + */ +enum iommu_vcmdq_type { + IOMMU_VCMDQ_TYPE_DEFAULT = 0, +}; + +/** + * struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC) + * @size: sizeof(struct iommu_vcmdq_alloc) + * @flags: Must be 0 + * @viommu_id: Virtual IOMMU ID to associate the virtual command queue with + * @type: One of enum iommu_vcmdq_type + * @index: The logical index to the virtual command queue per virtual IOMMU, for + * a multi-queue model + * @out_vcmdq_id: The ID of the new virtual command queue + * @addr: Base address of the queue memory in the guest physical address space + * @length: Length of the queue memory in the guest physical address space + * + * Allocate a virtual command queue object for a vIOMMU-specific HW-accelerated + * feature that can access a guest queue memory described by @addr and @length. + * It's suggested for VMM to back the queue memory using a single huge page with + * a proper alignment for its contiguity in the host physical address space. The + * call will fail, if the queue memory is not contiguous in the physical address + * space. Upon success, its underlying physical pages will be pinned to prevent + * VMM from unmapping them in the IOAS, until the virtual CMDQ gets destroyed. + */ +struct iommu_vcmdq_alloc { + __u32 size; + __u32 flags; + __u32 viommu_id; + __u32 type; + __u32 index; + __u32 out_vcmdq_id; + __aligned_u64 addr; + __aligned_u64 length; +}; +#define IOMMU_VCMDQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VCMDQ_ALLOC) #endif diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 2b9ee9b4a424..ac51d5cfaa61 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -303,6 +303,7 @@ union ucmd_buffer { struct iommu_ioas_map map; struct iommu_ioas_unmap unmap; struct iommu_option option; + struct iommu_vcmdq_alloc vcmdq; struct iommu_vdevice_alloc vdev; struct iommu_veventq_alloc veventq; struct iommu_vfio_ioas vfio_ioas; @@ -358,6 +359,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap, length), IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option, val64), + IOCTL_OP(IOMMU_VCMDQ_ALLOC, iommufd_vcmdq_alloc_ioctl, + struct iommu_vcmdq_alloc, length), IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl, struct iommu_vdevice_alloc, virt_id), IOCTL_OP(IOMMU_VEVENTQ_ALLOC, iommufd_veventq_alloc, @@ -501,6 +504,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_IOAS] = { .destroy = iommufd_ioas_destroy, }, + [IOMMUFD_OBJ_VCMDQ] = { + .destroy = iommufd_vcmdq_destroy, + }, [IOMMUFD_OBJ_VDEVICE] = { .destroy = iommufd_vdevice_destroy, }, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; } + +void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{ + struct iommufd_vcmdq *vcmdq = + container_of(obj, struct iommufd_vcmdq, obj); + struct iommufd_viommu *viommu = vcmdq->viommu; + + if (viommu->ops->vcmdq_destroy) + viommu->ops->vcmdq_destroy(vcmdq); + 
iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length); + refcount_dec(&viommu->obj.users); +} + +int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{ + struct iommu_vcmdq_alloc *cmd = ucmd->cmd; + struct iommufd_viommu *viommu; + struct iommufd_vcmdq *vcmdq; + struct page **pages; + int max_npages, i; + dma_addr_t end; + int rc; + + if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT) + return -EOPNOTSUPP; + if (!cmd->addr || !cmd->length) + return -EINVAL; + if (check_add_overflow(cmd->addr, cmd->length - 1, &end)) + return -EOVERFLOW; + + max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE); + pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id); + if (IS_ERR(viommu)) { + rc = PTR_ERR(viommu); + goto out_free; + } + + if (!viommu->ops || !viommu->ops->vcmdq_alloc) { + rc = -EOPNOTSUPP; + goto out_put_viommu; + } + + /* Quick test on the base address */ + if (!iommu_iova_to_phys(viommu->hwpt->common.domain, cmd->addr)) { + rc = -ENXIO; + goto out_put_viommu; + } + + /* The underlying physical pages must be pinned in the IOAS */ + rc = iopt_pin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length, + pages, 0); + if (rc) + goto out_put_viommu; + + /* Validate if the underlying physical pages are contiguous */ + for (i = 1; i < max_npages && pages[i]; i++) { + if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1) + continue; + rc = -EFAULT; + goto out_unpin; + } + + vcmdq = viommu->ops->vcmdq_alloc(viommu, cmd->type, cmd->index, + cmd->addr, cmd->length); + if (IS_ERR(vcmdq)) { + rc = PTR_ERR(vcmdq); + goto out_unpin; + } + + vcmdq->viommu = viommu; + refcount_inc(&viommu->obj.users); + vcmdq->addr = cmd->addr; + vcmdq->ictx = ucmd->ictx; + vcmdq->length = cmd->length; + cmd->out_vcmdq_id = vcmdq->obj.id; + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); + if (rc) + iommufd_object_abort_and_destroy(ucmd->ictx, &vcmdq->obj); + else + iommufd_object_finalize(ucmd->ictx, &vcmdq->obj); + goto out_put_viommu; + +out_unpin: + iopt_unpin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length); +out_put_viommu: + iommufd_put_object(ucmd->ictx, &viommu->obj); +out_free: + kfree(pages); + return rc; +}
On 4/26/25 13:58, Nicolin Chen wrote:
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
- struct iommufd_vcmdq *vcmdq =
container_of(obj, struct iommufd_vcmdq, obj);
- struct iommufd_viommu *viommu = vcmdq->viommu;
- if (viommu->ops->vcmdq_destroy)
viommu->ops->vcmdq_destroy(vcmdq);
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length);
- refcount_dec(&viommu->obj.users);
+}
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
I don't follow the check of 'cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT' here. My understanding is that it states that "other values of type are not supported". If so, shouldn't it be,
if (cmd->flags || cmd->type != IOMMU_VCMDQ_TYPE_DEFAULT)
?
return -EOPNOTSUPP;
- if (!cmd->addr || !cmd->length)
return -EINVAL;
- if (check_add_overflow(cmd->addr, cmd->length - 1, &end))
return -EOVERFLOW;
Thanks, baolu
On Mon, Apr 28, 2025 at 09:32:04AM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
I don't follow the check of 'cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT' here. My understanding is that it states that "other values of type are not supported". If so, shouldn't it be,
if (cmd->flags || cmd->type != IOMMU_VCMDQ_TYPE_DEFAULT)
?
No. Only other (new) types will be supported. We have this: "* @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use" which means a driver should define a new type.
We have the same DEFAULT type in vIOMMU/vEVENTQ allocators by the way.
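(Illustrated with a hypothetical driver type, names made up:)

enum iommu_vcmdq_type {
	IOMMU_VCMDQ_TYPE_DEFAULT = 0,	/* reserved, always rejected by the core */
	IOMMU_VCMDQ_TYPE_MY_DRIVER = 1,	/* hypothetical, added with the driver */
};

The core only rejects the reserved value; any other type is passed to viommu->ops->vcmdq_alloc(), which returns -EOPNOTSUPP if it doesn't recognize it.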
Thanks Nicolin
On 4/29/25 02:58, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 09:32:04AM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
I don't follow the check of 'cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT' here. My understanding is that it states that "other values of type are not supported". If so, shouldn't it be,
if (cmd->flags || cmd->type != IOMMU_VCMDQ_TYPE_DEFAULT)
?
No. Only other (new) types will be supported. We have this: "* @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use" which means driver should define a new type.
We have the same DEFAULT type in vIOMMU/vEVENTQ allocators by the way.
Okay, thanks for the explanation.
The iommu driver's callback will return a failure if the type is not supported. Then it's fine.
Thanks, baolu
Hi Nicolin,
[+Suravee]
On 4/26/2025 11:28 AM, Nicolin Chen wrote:
Introduce a new IOMMUFD_CMD_VCMDQ_ALLOC ioctl for user space to allocate a vCMDQ for a vIOMMU object. Simply increase the refcount of the vIOMMU.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 2 + include/uapi/linux/iommufd.h | 41 +++++++++++ drivers/iommu/iommufd/main.c | 6 ++ drivers/iommu/iommufd/viommu.c | 94 +++++++++++++++++++++++++ 4 files changed, 143 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 79160b039bc7..b974c207ae8a 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -611,6 +611,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_viommu_destroy(struct iommufd_object *obj); int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_vdevice_destroy(struct iommufd_object *obj); +int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd); +void iommufd_vcmdq_destroy(struct iommufd_object *obj); #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index cc90299a08d9..06a763fda47f 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -56,6 +56,7 @@ enum { IOMMUFD_CMD_VDEVICE_ALLOC = 0x91, IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92, IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
- IOMMUFD_CMD_VCMDQ_ALLOC = 0x94,
}; /** @@ -1147,4 +1148,44 @@ struct iommu_veventq_alloc { __u32 __reserved; }; #define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC)
+/**
- enum iommu_vcmdq_type - Virtual Command Queue Type
- @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use
- */
+enum iommu_vcmdq_type {
- IOMMU_VCMDQ_TYPE_DEFAULT = 0,
+};
+/**
- struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
- @size: sizeof(struct iommu_vcmdq_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
- @type: One of enum iommu_vcmdq_type
- @index: The logical index to the virtual command queue per virtual IOMMU, for
a multi-queue model
- @out_vcmdq_id: The ID of the new virtual command queue
- @addr: Base address of the queue memory in the guest physical address space
Sorry. I didn't get this part.
So here `addr` is the command queue base address, like: - NVIDIA's virtual command queue - AMD vIOMMU's command buffer
.. and it will allocate a vcmdq for each buffer type. Is that the correct understanding?
In the case of AMD vIOMMU, the buffer base address is programmed in one register (e.g. MMIO Offset 0008h, Command Buffer Base Address Register) and buffer enable/disable is done via a different register (e.g. MMIO Offset 0018h, IOMMU Control Register). And we need to communicate both to the hypervisor. Not sure this API can accommodate this, as addr seems to be mandatory.
[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifi...
- @length: Length of the queue memory in the guest physical address space
- Allocate a virtual command queue object for a vIOMMU-specific HW-accelerated
- feature that can access a guest queue memory described by @addr and @length.
- It's suggested for VMM to back the queue memory using a single huge page with
- a proper alignment for its contiguity in the host physical address space. The
- call will fail, if the queue memory is not contiguous in the physical address
- space. Upon success, its underlying physical pages will be pinned to prevent
- VMM from unmapping them in the IOAS, until the virtual CMDQ gets destroyed.
- */
+struct iommu_vcmdq_alloc {
- __u32 size;
- __u32 flags;
- __u32 viommu_id;
- __u32 type;
- __u32 index;
- __u32 out_vcmdq_id;
- __aligned_u64 addr;
- __aligned_u64 length;
+}; +#define IOMMU_VCMDQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VCMDQ_ALLOC) #endif diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 2b9ee9b4a424..ac51d5cfaa61 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -303,6 +303,7 @@ union ucmd_buffer { struct iommu_ioas_map map; struct iommu_ioas_unmap unmap; struct iommu_option option;
- struct iommu_vcmdq_alloc vcmdq; struct iommu_vdevice_alloc vdev; struct iommu_veventq_alloc veventq; struct iommu_vfio_ioas vfio_ioas;
@@ -358,6 +359,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap, length), IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option, val64),
- IOCTL_OP(IOMMU_VCMDQ_ALLOC, iommufd_vcmdq_alloc_ioctl,
struct iommu_vcmdq_alloc, length),
IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl, struct iommu_vdevice_alloc, virt_id),
IOCTL_OP(IOMMU_VEVENTQ_ALLOC, iommufd_veventq_alloc,
@@ -501,6 +504,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_IOAS] = { .destroy = iommufd_ioas_destroy, },
- [IOMMUFD_OBJ_VCMDQ] = {
.destroy = iommufd_vcmdq_destroy,
- }, [IOMMUFD_OBJ_VDEVICE] = { .destroy = iommufd_vdevice_destroy, },
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
I didn't understand the destroy flow in general. Can you please help me understand:
Is the VMM expected to track all buffers and call this interface? Or will iommufd take care of it? What happens if the VM crashes?
- struct iommufd_vcmdq *vcmdq =
container_of(obj, struct iommufd_vcmdq, obj);
- struct iommufd_viommu *viommu = vcmdq->viommu;
- if (viommu->ops->vcmdq_destroy)
viommu->ops->vcmdq_destroy(vcmdq);
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length);
- refcount_dec(&viommu->obj.users);
+}
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
return -EOPNOTSUPP;
- if (!cmd->addr || !cmd->length)
return -EINVAL;
- if (check_add_overflow(cmd->addr, cmd->length - 1, &end))
return -EOVERFLOW;
- max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE);
- pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
- if (!pages)
return -ENOMEM;
- viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
- if (IS_ERR(viommu)) {
rc = PTR_ERR(viommu);
goto out_free;
- }
- if (!viommu->ops || !viommu->ops->vcmdq_alloc) {
rc = -EOPNOTSUPP;
goto out_put_viommu;
- }
- /* Quick test on the base address */
- if (!iommu_iova_to_phys(viommu->hwpt->common.domain, cmd->addr)) {
rc = -ENXIO;
goto out_put_viommu;
- }
- /* The underlying physical pages must be pinned in the IOAS */
- rc = iopt_pin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length,
pages, 0);
Why do we need this? Is it not pinned already as part of the vfio binding?
-Vasant
- if (rc)
goto out_put_viommu;
- /* Validate if the underlying physical pages are contiguous */
- for (i = 1; i < max_npages && pages[i]; i++) {
if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1)
continue;
rc = -EFAULT;
goto out_unpin;
- }
- vcmdq = viommu->ops->vcmdq_alloc(viommu, cmd->type, cmd->index,
cmd->addr, cmd->length);
- if (IS_ERR(vcmdq)) {
rc = PTR_ERR(vcmdq);
goto out_unpin;
- }
- vcmdq->viommu = viommu;
- refcount_inc(&viommu->obj.users);
- vcmdq->addr = cmd->addr;
- vcmdq->ictx = ucmd->ictx;
- vcmdq->length = cmd->length;
- cmd->out_vcmdq_id = vcmdq->obj.id;
- rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
- if (rc)
iommufd_object_abort_and_destroy(ucmd->ictx, &vcmdq->obj);
- else
iommufd_object_finalize(ucmd->ictx, &vcmdq->obj);
- goto out_put_viommu;
+out_unpin:
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length);
+out_put_viommu:
- iommufd_put_object(ucmd->ictx, &viommu->obj);
+out_free:
- kfree(pages);
- return rc;
+}
On Mon, Apr 28, 2025 at 05:42:27PM +0530, Vasant Hegde wrote:
+/**
- struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
- @size: sizeof(struct iommu_vcmdq_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
- @type: One of enum iommu_vcmdq_type
- @index: The logical index to the virtual command queue per virtual IOMMU, for
a multi-queue model
- @out_vcmdq_id: The ID of the new virtual command queue
- @addr: Base address of the queue memory in the guest physical address space
Sorry. I didn't get this part.
So here `addr` is command queue base address like
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
.. and it will allocate vcmdq for each buffer type. Is that the correct understanding?
Yes. For AMD "vIOMMU", it needs a new type for iommufd vIOMMU: IOMMU_VIOMMU_TYPE_AMD_VIOMMU,
For AMD "vIOMMU" command buffer, it needs a new type too: IOMMU_VCMDQ_TYPE_AMD_VIOMMU, /* Kdoc it to be Command Buffer */
Then, use the IOMMUFD_CMD_VIOMMU_ALLOC ioctl to allocate a vIOMMU obj, and use IOMMUFD_CMD_VCMDQ_ALLOC ioctl(s) to allocate vCMDQ objs.
In case of AMD vIOMMU, buffer base address is programmed in different register (ex: MMIO Offset 0008h Command Buffer Base Address Register) and buffer enable/disable is done via different register (ex: MMIO Offset 0018h IOMMU Control Register). And we need to communicate both to hypervisor. Not sure this API can accommodate this as addr seems to be mandatory.
NVIDIA's CMDQV has all three of them too. What we do here is let the VMM trap the buffer base address (in the guest physical address space) and forward it to the kernel using this @addr. Then, the kernel will translate this @addr to the host physical address space and program the physical address and size into the register.
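(As a concrete illustration of that trap-and-forward flow; a sketch of hypothetical VMM code, every name below is made up and not part of this series:)

/* VMM handler for a write to the guest's queue base/size register */
static void handle_queue_base_write(struct vmm_viommu *v, uint64_t val)
{
	uint64_t gpa = val & QUEUE_BASE_ADDR_MASK;	/* guest PA of the queue */
	size_t size = queue_size_from_reg(val);

	/* Forward the GPA and size; the kernel resolves and programs the host PA */
	v->vcmdq_id = vmm_iommufd_alloc_vcmdq(v->iommufd, v->viommu_id,
					      /* index */ 0, gpa, size);
}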
[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifi...
Thanks for the doc. So, AMD has:
Command Buffer Base Address Register [MMIO Offset 0008h]: "used to program the system physical base address and size of the command buffer. The command buffer occupies contiguous physical memory starting at the programmed base address, up to the programmed size."
Command Buffer Head Pointer Register [MMIO Offset 2000h]
Command Buffer Tail Pointer Register [MMIO Offset 2008h]
IIUIC, AMD should do the same: the VMM traps the VM's Command Buffer Base Address register when the guest kernel allocates a command buffer by programming that register, to capture the guest PA and size. Then, the VMM allocates a vCMDQ object (for this command buffer), forwarding the buffer address and size via @addr and @length to the host kernel. The kernel then translates the guest PA to a host PA to program the HW.
We can see that the Head/Tail registers are in a different MMIO page (offset by two 4K pages), which is very much like NVIDIA's CMDQV, which allows the VMM to mmap the MMIO page of the Head/Tail registers so the guest OS can directly control the HW (i.e. the VMM doesn't trap these two registers).
When the guest OS wants to issue a new command, the guest kernel can just fill the guest command buffer at the entry that the Head register points to, and program the Tail register (backed by an mmap'd MMIO page); then the HW will read from the programmed physical address, from the entry at Head up to the entry at Tail in the guest command buffer.
@@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &viommu->obj);
 	return rc;
 }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj)
+{
I didn't understood destroy flow in general. Can you please help me to understand:
VMM is expected to track all buffers and call this interface? OR iommufd will take care of it? What happens if VM crashes ?
In a normal routine, VMM gets a vCMDQ object ID for each vCMDQ object it allocated. So, it should track all the IDs and release them when VM shuts down.
The iommufd core does track all the objects that belong to an iommufd context (ictx), and automatically release them. But, it can't resolve certain dependency on other FD, e.g. vEVENTQ and FAULT QUEUE would return another FD that user space listens to and must be closed properly to destroy the QUEUE object.
+	/* The underlying physical pages must be pinned in the IOAS */
+	rc = iopt_pin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length,
+			    pages, 0);
Why do we need this? is it not pinned already as part of vfio binding?
I think this could be clearer:
	/*
	 * The underlying physical pages must be pinned to prevent them from
	 * being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the life cycle
	 * of the vCMDQ object.
	 */
Thanks Nicolin
Hi Nicolin,
On 4/29/2025 1:32 AM, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 05:42:27PM +0530, Vasant Hegde wrote:
+/**
- struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
- @size: sizeof(struct iommu_vcmdq_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
- @type: One of enum iommu_vcmdq_type
- @index: The logical index to the virtual command queue per virtual IOMMU, for
a multi-queue model
- @out_vcmdq_id: The ID of the new virtual command queue
- @addr: Base address of the queue memory in the guest physical address space
Sorry. I didn't get this part.
So here `addr` is command queue base address like
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
.. and it will allocate vcmdq for each buffer type. Is that the correct understanding?
Yes. For AMD "vIOMMU", it needs a new type for iommufd vIOMMU: IOMMU_VIOMMU_TYPE_AMD_VIOMMU,
For AMD "vIOMMU" command buffer, it needs a new type too: IOMMU_VCMDQ_TYPE_AMD_VIOMMU, /* Kdoc it to be Command Buffer */
You are suggesting we define one type for AMD and use it for all buffers like command buffer, event log, PPR buffer, etc., and use iommu_vcmdq_alloc->index to identify the different buffer types?
Then, use IOMMUFD_CMD_VIOMMU_ALLOC ioctl to allocate an vIOMMU obj, and use IOMMUFD_CMD_VCMDQ_ALLOC ioctl(s) to allocate vCMDQ objs.
In case of AMD vIOMMU, buffer base address is programmed in different register (ex: MMIO Offset 0008h Command Buffer Base Address Register) and buffer enable/disable is done via different register (ex: MMIO Offset 0018h IOMMU Control Register). And we need to communicate both to hypervisor. Not sure this API can accommodate this as addr seems to be mandatory.
NVIDIA's CMDQV has all three of them too. What we do here is to let VMM trap the buffer base address (in guest physical address space) and forward it to kernel using this @addr. Then, kernel will translate this @addr to host physical address space, and program the physical address and size to the register.
Right. For AMD IOMMU 1st 4K of MMIO space (which contains all buffer base address registers) is not accelerated. So we can trap it and pass GPA, size to iommufd.
.. but programming the base register (like Command Buffer base addr) is not sufficient. We have to enable the command buffer by setting a particular bit in the Control register. So at a high level the flow is something like below (@Suravee, correct me if I missed something here).
From the guest side:
  Write the command buffer base addr, size (MMIO offset 0x08)
  Set MMIO Offset 0x18[bit 12]
  Also we need to program a few other bits that are not related to these buffers, like `Completion wait interrupt enable`.
From the VMM side: We need to trap both registers and pass them to iommufd
From the Host AMD IOMMU driver: We have to program VFCntlMMIO Offset {16’b[GuestID], 6’b10_0000}
We need a way to pass the Control register details to iommufd -> AMD driver so that we can program the VF control MMIO register.
Since the iommu_vcmdq_alloc structure doesn't have user_data, how do we communicate the control register?
[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifi...
Thanks for the doc. So, AMD has:
Command Buffer Base Address Register [MMIO Offset 0008h] "used to program the system physical base address and size of the command buffer. The command buffer occupies contiguous physical memory starting at the programmed base address, up to the programmed size." Command Buffer Head Pointer Register [MMIO Offset 2000h] Command Buffer Tail Pointer Register [MMIO Offset 2008h]
IIUIC, AMD should do the same: VMM traps VM's Command Buffer Base Address register when the guest kernel allocates a command buffer by programming the VM's Command Buffer Base Address register, to capture the guest PA and size. Then, VMM allocates a vCMDQ object (for this command buffer) forwarding its buffer address and size via @addr and @length to the host kernel. Then, the kernel should translate the guest PA to host PA to program the HW.
We can see that the Head/Tail registers are in a different MMIO page (offset by two 4K pages), which is very like NVIDIA CMDQV that allows VMM to mmap that MMIO page of the Head/Tail registers for guest OS to directly control the HW (i.e. VMM doesn't trap these two registers.
When guest OS wants to issue a new command, the guest kernel can just fill the guest command buffer at the entry that the Head register points to, and program the Tail register (backed by an mmap'd MMIO page), then the HW will read the programmed physical address from the entry (Head) till the entry (Tail) in the guest command buffer.
Right.
@@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
I didn't understood destroy flow in general. Can you please help me to understand:
VMM is expected to track all buffers and call this interface? OR iommufd will take care of it? What happens if VM crashes ?
In a normal routine, VMM gets a vCMDQ object ID for each vCMDQ object it allocated. So, it should track all the IDs and release them when VM shuts down.
The iommufd core does track all the objects that belong to an iommufd context (ictx), and automatically release them. But, it can't resolve certain dependency on other FD, e.g. vEVENTQ and FAULT QUEUE would return another FD that user space listens to and must be closed properly to destroy the QUEUE object.
Got it.
- /* The underlying physical pages must be pinned in the IOAS */
- rc = iopt_pin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length,
pages, 0);
Why do we need this? is it not pinned already as part of vfio binding?
I think this could be clearer: /* * The underlying physical pages must be pinned to prevent them from * being unmapped (via IOMMUFD_CMD_IOAS_UNMAP) during the life cycle * of the vCMDQ object. */
Understood.
Thanks -Vasant
On Tue, Apr 29, 2025 at 11:04:06AM +0530, Vasant Hegde wrote:
On 4/29/2025 1:32 AM, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 05:42:27PM +0530, Vasant Hegde wrote:
+/**
- struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
- @size: sizeof(struct iommu_vcmdq_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
- @type: One of enum iommu_vcmdq_type
- @index: The logical index to the virtual command queue per virtual IOMMU, for
a multi-queue model
- @out_vcmdq_id: The ID of the new virtual command queue
- @addr: Base address of the queue memory in the guest physical address space
Sorry. I didn't get this part.
So here `addr` is command queue base address like
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
.. and it will allocate vcmdq for each buffer type. Is that the correct understanding?
Yes. For AMD "vIOMMU", it needs a new type for iommufd vIOMMU: IOMMU_VIOMMU_TYPE_AMD_VIOMMU,
For AMD "vIOMMU" command buffer, it needs a new type too: IOMMU_VCMDQ_TYPE_AMD_VIOMMU, /* Kdoc it to be Command Buffer */
You are suggesting we define one type for AMD and use it for all buffers like command buffer, event log, PPR buffet etc? and use iommu_vcmdq_alloc->index to identity different buffer type?
We have vEVENTQ for event logging and FAULT_QUEUE for PRI, but both are not for hardware accelerated use cases.
I didn't check the details of AMD's event log and PPR buffers. But they seem to be the same ring buffers and can be consumed by guest kernel directly?
Will the hardware replace the physical device ID in the event with the virtual device ID when injecting the event into a guest event/PPR queue? If so, yea, I think you can define them separately using the vCMDQ infrastructures:
 - IOMMU_VCMDQ_TYPE_AMD_VIOMMU_CMDBUF
 - IOMMU_VCMDQ_TYPE_AMD_VIOMMU_EVENTLOG
 - IOMMU_VCMDQ_TYPE_AMD_VIOMMU_PPRLOG
(@Kevin @Jason Hmm, in this case we might want to revert the naming "vCMDQ" back to "vQUEUE", once Vasant confirms.)
Each of them will be allocated on top of one vIOMMU object.
As for index, it really depends on how the vIOMMU manages those vCMDQ objects. In the NVIDIA case, each VINTF can support multiple vCMDQs of the same type (like SMP in CPU terms). If AMD is a similar case, yea, applying an index to each of them is a good idea. Otherwise, if a vIOMMU can only have three queues and their types are all different, perhaps the driver-level vIOMMU structure could just hold three pointers to manage them without using the index. A rough sketch of both options is below.
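For illustration, a minimal sketch of the two bookkeeping options above; struct my_viommu/my_vcmdq and MY_MAX_VCMDQS are made-up names, not from any driver in this series, and the embedding of the core objects follows the pattern used elsewhere in the series:

#include <linux/iommufd.h>	/* from this series */

#define MY_MAX_VCMDQS	4	/* hypothetical per-vIOMMU limit */

struct my_vcmdq {
	struct iommufd_vcmdq core;	/* embedded core object, first member */
};

struct my_viommu {
	struct iommufd_viommu core;

	/* Option A: several queues of the same type, looked up by @index */
	struct my_vcmdq *vcmdqs[MY_MAX_VCMDQS];

	/* Option B: exactly one queue per fixed type, no @index needed */
	struct my_vcmdq *cmdbuf;
	struct my_vcmdq *eventlog;
	struct my_vcmdq *pprlog;
};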
Then, use IOMMUFD_CMD_VIOMMU_ALLOC ioctl to allocate an vIOMMU obj, and use IOMMUFD_CMD_VCMDQ_ALLOC ioctl(s) to allocate vCMDQ objs.
In case of AMD vIOMMU, buffer base address is programmed in different register (ex: MMIO Offset 0008h Command Buffer Base Address Register) and buffer enable/disable is done via different register (ex: MMIO Offset 0018h IOMMU Control Register). And we need to communicate both to hypervisor. Not sure this API can accommodate this as addr seems to be mandatory.
NVIDIA's CMDQV has all three of them too. What we do here is to let VMM trap the buffer base address (in guest physical address space) and forward it to kernel using this @addr. Then, kernel will translate this @addr to host physical address space, and program the physical address and size to the register.
Right. For AMD IOMMU 1st 4K of MMIO space (which contains all buffer base address registers) is not accelerated. So we can trap it and pass GPA, size to iommufd.
Yes.
.. but programming base register (like Command buffer base addr) is not sufficient. We have to enable the command buffer by setting particular bit in Control register. So at high level flow is something like below (@Suravee, correct me if I missed something here).
From guest side : Write command bufer base addr, size (MMIO offset 0x08) Set MMIO Offset 0x18[bit 12] Also we need to program few other bits that are not related to these buffers like `Completion wait interrupt enable`.
From VMM side: We need to trap both register and pass it to iommufd
From Host AMD IOMMU driver: We have to program VFCntlMMIO Offset {16’b[GuestID], 6’b10_0000}
We need a way to pass Control register details to iommufd -> AMD driver so that we can program the VF control MMIO register.
Since iommu_vcmdq_alloc structure doesn't have user_data, how do we communicate control register?
BIT(12) is the CMD enable bit. VMM can trap that as the trigger to forward the base address/length and other info.
And you'd likely need to define a driver structure:
// Add this to struct iommu_vcmdq_alloc:
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
..
+	__u32 data_len;
+	__aligned_u64 data_uptr;
// Refer to my patch in the v1 series that handles the user_data: // https://lore.kernel.org/linux-iommu/5cd2c7c4d92c79baf0cfc59e2a6b3e1db4e86ab8...
// Define a driver structure
struct iommu_vcmdq_amd_viommu_cmdbuf {
	__u32 flags;
	__u32 data;
};
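And, for illustration only, a rough kernel-side sketch of how such user_data could be consumed, assuming the vcmdq_alloc op grows a user_data argument as in the v1 series referenced above; the op signature, amd_viommu_vcmdq_alloc and the register programming are placeholders, not real driver code:

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/iommufd.h>

static struct iommufd_vcmdq *
amd_viommu_vcmdq_alloc(struct iommufd_viommu *viommu, unsigned int type,
		       u32 index, u64 addr, u64 length,
		       const struct iommu_user_data *user_data)
{
	struct iommu_vcmdq_amd_viommu_cmdbuf arg;
	int rc;

	rc = iommu_copy_struct_from_user(&arg, user_data,
					 IOMMU_VCMDQ_TYPE_AMD_VIOMMU, data);
	if (rc)
		return ERR_PTR(rc);

	/*
	 * arg.data would carry the trapped Control register bits, to be
	 * programmed into the VF Control MMIO register along with the
	 * translated @addr/@length. Actual allocation is omitted here.
	 */
	return ERR_PTR(-EOPNOTSUPP);
}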
Thanks Nicolin
Hi Nicolin,
On 4/29/2025 12:15 PM, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 11:04:06AM +0530, Vasant Hegde wrote:
On 4/29/2025 1:32 AM, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 05:42:27PM +0530, Vasant Hegde wrote:
+/**
- struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
- @size: sizeof(struct iommu_vcmdq_alloc)
- @flags: Must be 0
- @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
- @type: One of enum iommu_vcmdq_type
- @index: The logical index to the virtual command queue per virtual IOMMU, for
a multi-queue model
- @out_vcmdq_id: The ID of the new virtual command queue
- @addr: Base address of the queue memory in the guest physical address space
Sorry. I didn't get this part.
So here `addr` is command queue base address like
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
.. and it will allocate vcmdq for each buffer type. Is that the correct understanding?
Yes. For AMD "vIOMMU", it needs a new type for iommufd vIOMMU: IOMMU_VIOMMU_TYPE_AMD_VIOMMU,
For AMD "vIOMMU" command buffer, it needs a new type too: IOMMU_VCMDQ_TYPE_AMD_VIOMMU, /* Kdoc it to be Command Buffer */
You are suggesting we define one type for AMD and use it for all buffers like command buffer, event log, PPR buffet etc? and use iommu_vcmdq_alloc->index to identity different buffer type?
We have vEVENTQ for event logging and FAULT_QUEUE for PRI, but both are not for hardware accelerated use cases.
I didn't check the details of AMD's event log and PPR buffers. But they seem to be the same ring buffers and can be consumed by guest kernel directly?
Right. Event log is accelerated and consumed by guest directly. Also we have Event Log B !
Will the hardware replace the physical device ID in the event with the virtual device ID when injecting the event to a guest event/PPR queue? If so, yea, I think you can define them separately using the> vCMDQ
infrastructures:
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_CMDBUF
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_EVENTLOG
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_PPRLOG
(@Kevin @Jason Hmm, in this case we might want to revert the naming "vCMDQ" back to "vQEUEUE", once Vasant confirms.)
Each of them will be allocated on top of one vIOMMU object.
As for index, it really depends on how vIOMMU manages those vCMDQ objects. In NVIDIA case, each VINTF can support multiple vCMDQs of the same type (like smp in CPU term). If AMD is a similar case, yea, apply an index to each of them is a good idea. Otherwise, if vIOMMU could only have three queues and their types are different, perhaps the driver-level vIOMMU structure could just hold three pointers to manage them without using the index.
Right. May be we can use index.
Then, use IOMMUFD_CMD_VIOMMU_ALLOC ioctl to allocate an vIOMMU obj, and use IOMMUFD_CMD_VCMDQ_ALLOC ioctl(s) to allocate vCMDQ objs.
In case of AMD vIOMMU, buffer base address is programmed in different register (ex: MMIO Offset 0008h Command Buffer Base Address Register) and buffer enable/disable is done via different register (ex: MMIO Offset 0018h IOMMU Control Register). And we need to communicate both to hypervisor. Not sure this API can accommodate this as addr seems to be mandatory.
NVIDIA's CMDQV has all three of them too. What we do here is to let VMM trap the buffer base address (in guest physical address space) and forward it to kernel using this @addr. Then, kernel will translate this @addr to host physical address space, and program the physical address and size to the register.
Right. For AMD IOMMU 1st 4K of MMIO space (which contains all buffer base address registers) is not accelerated. So we can trap it and pass GPA, size to iommufd.
Yes.
.. but programming base register (like Command buffer base addr) is not sufficient. We have to enable the command buffer by setting particular bit in Control register. So at high level flow is something like below (@Suravee, correct me if I missed something here).
From guest side : Write command bufer base addr, size (MMIO offset 0x08) Set MMIO Offset 0x18[bit 12] Also we need to program few other bits that are not related to these buffers like `Completion wait interrupt enable`.
From VMM side: We need to trap both register and pass it to iommufd
From Host AMD IOMMU driver: We have to program VFCntlMMIO Offset {16’b[GuestID], 6’b10_0000}
We need a way to pass Control register details to iommufd -> AMD driver so that we can program the VF control MMIO register.
Since iommu_vcmdq_alloc structure doesn't have user_data, how do we communicate control register?
BIT(12) is the CMD enable bit. VMM can trap that as the trigger to forward the base address/length and other info.
Right. For the control bits which have a corresponding buffer, we can do that. But there are a few control register bits (like completion wait interrupt enable) which don't have a buffer.
And you'd likely need to define a driver structure:
// Add this to struct iommu_vcmdq_alloc;
- @data_len: Length of the type specific data
- @data_uptr: User pointer to the type specific data
..
- __u32 data_len;
- __aligned_u64 data_uptr;
// Refer to my patch in the v1 series that handles the user_data: // https://lore.kernel.org/linux-iommu/5cd2c7c4d92c79baf0cfc59e2a6b3e1db4e86ab8...
Right. I have seen V1 series and that thought we can define driver specific data structure.
-Vasant
// Define a driver structure struct iommu_vcmdq_amd_viommu_cmdbuf { __u32 flags; __u32 data; };
Thanks Nicolin
On Tue, Apr 29, 2025 at 03:52:48PM +0530, Vasant Hegde wrote:
On 4/29/2025 12:15 PM, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 11:04:06AM +0530, Vasant Hegde wrote:
On 4/29/2025 1:32 AM, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 05:42:27PM +0530, Vasant Hegde wrote: Yes. For AMD "vIOMMU", it needs a new type for iommufd vIOMMU: IOMMU_VIOMMU_TYPE_AMD_VIOMMU,
For AMD "vIOMMU" command buffer, it needs a new type too: IOMMU_VCMDQ_TYPE_AMD_VIOMMU, /* Kdoc it to be Command Buffer */
You are suggesting we define one type for AMD and use it for all buffers like command buffer, event log, PPR buffet etc? and use iommu_vcmdq_alloc->index to identity different buffer type?
We have vEVENTQ for event logging and FAULT_QUEUE for PRI, but both are not for hardware accelerated use cases.
I didn't check the details of AMD's event log and PPR buffers. But they seem to be the same ring buffers and can be consumed by guest kernel directly?
Right. Event log is accelerated and consumed by guest directly. Also we have Event Log B !
Will the hardware replace the physical device ID in the event with the virtual device ID when injecting the event to a guest event/PPR queue? If so, yea, I think you can define them separately using the> vCMDQ
infrastructures:
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_CMDBUF
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_EVENTLOG
- IOMMU_VCMDQ_TYPE_AMD_VIOMMU_PPRLOG
(@Kevin @Jason Hmm, in this case we might want to revert the naming "vCMDQ" back to "vQEUEUE", once Vasant confirms.)
I think I should rename IOMMUFD_OBJ_VCMDQ back to IOMMUFD_OBJ_VQUEUE since the same object fits three types of queue now in the AMD case.
Or any better naming suggestion?
Thanks Nicolin
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
Introduce a new IOMMUFD_CMD_VCMDQ_ALLOC ioctl for user space to allocate a vCMDQ for a vIOMMU object. Simply increase the refcount of the vIOMMU.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/iommufd_private.h |  2 +
 include/uapi/linux/iommufd.h            | 41 +++++++++++
 drivers/iommu/iommufd/main.c            |  6 ++
 drivers/iommu/iommufd/viommu.c          | 94 +++++++++++++++++++++++++
 4 files changed, 143 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 79160b039bc7..b974c207ae8a 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -611,6 +611,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_viommu_destroy(struct iommufd_object *obj);
 int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_vdevice_destroy(struct iommufd_object *obj);
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_vcmdq_destroy(struct iommufd_object *obj);
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index cc90299a08d9..06a763fda47f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -56,6 +56,7 @@ enum {
 	IOMMUFD_CMD_VDEVICE_ALLOC = 0x91,
 	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
 	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
+	IOMMUFD_CMD_VCMDQ_ALLOC = 0x94,
 };

 /**
@@ -1147,4 +1148,44 @@ struct iommu_veventq_alloc {
 	__u32 __reserved;
 };
 #define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC)

+/**
+ * enum iommu_vcmdq_type - Virtual Command Queue Type
+ * @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use
+ */
+enum iommu_vcmdq_type {
+	IOMMU_VCMDQ_TYPE_DEFAULT = 0,
+};
+/**
+ * struct iommu_vcmdq_alloc - ioctl(IOMMU_VCMDQ_ALLOC)
+ * @size: sizeof(struct iommu_vcmdq_alloc)
+ * @flags: Must be 0
+ * @viommu_id: Virtual IOMMU ID to associate the virtual command queue with
+ * @type: One of enum iommu_vcmdq_type
+ * @index: The logical index to the virtual command queue per virtual IOMMU, for
+ *         a multi-queue model
+ * @out_vcmdq_id: The ID of the new virtual command queue
+ * @addr: Base address of the queue memory in the guest physical address space
+ * @length: Length of the queue memory in the guest physical address space
+ *
+ * Allocate a virtual command queue object for a vIOMMU-specific HW-accelerated
+ * feature that can access a guest queue memory described by @addr and @length.
+ * It's suggested for VMM to back the queue memory using a single huge page with
+ * a proper alignment for its contiguity in the host physical address space. The
+ * call will fail, if the queue memory is not contiguous in the physical address
+ * space. Upon success, its underlying physical pages will be pinned to prevent
+ * VMM from unmapping them in the IOAS, until the virtual CMDQ gets destroyed.
+ */
+struct iommu_vcmdq_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 viommu_id;
+	__u32 type;
+	__u32 index;
+	__u32 out_vcmdq_id;
+	__aligned_u64 addr;
+	__aligned_u64 length;
+};
+#define IOMMU_VCMDQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VCMDQ_ALLOC)
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 2b9ee9b4a424..ac51d5cfaa61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -303,6 +303,7 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
+	struct iommu_vcmdq_alloc vcmdq;
 	struct iommu_vdevice_alloc vdev;
 	struct iommu_veventq_alloc veventq;
 	struct iommu_vfio_ioas vfio_ioas;
@@ -358,6 +359,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option, val64),
+	IOCTL_OP(IOMMU_VCMDQ_ALLOC, iommufd_vcmdq_alloc_ioctl,
+		 struct iommu_vcmdq_alloc, length),
 	IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl,
 		 struct iommu_vdevice_alloc, virt_id),
 	IOCTL_OP(IOMMU_VEVENTQ_ALLOC, iommufd_veventq_alloc,
@@ -501,6 +504,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
+	[IOMMUFD_OBJ_VCMDQ] = {
+		.destroy = iommufd_vcmdq_destroy,
+	},
 	[IOMMUFD_OBJ_VDEVICE] = {
 		.destroy = iommufd_vdevice_destroy,
 	},
When do we expect the VMM to use this ioctl? While it's spawning a new VM? IIUC, one VINTF can have multiple LVCMDQs, and looking at the series it looks like vcmdq_alloc allocates a single LVCMDQ. Is the plan to dedicate one LVCMDQ per VM, which would mean VMs can share a VINTF? Or do we plan to trap the access every time the VM accesses an LVCMDQ base register?
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
- struct iommufd_vcmdq *vcmdq =
container_of(obj, struct iommufd_vcmdq, obj);
- struct iommufd_viommu *viommu = vcmdq->viommu;
- if (viommu->ops->vcmdq_destroy)
viommu->ops->vcmdq_destroy(vcmdq);
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length);
- refcount_dec(&viommu->obj.users);
+}
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
return -EOPNOTSUPP;
The cmd->type check is a little confusing here; I think we could re-order the series and add this check when we have the CMDQV type. Alternatively, we could keep this in place and add the driver-specific vcmdq_alloc op call when it's added/available for Tegra CMDQV, while stubbing out the rest of this function accordingly.
- if (!cmd->addr || !cmd->length)
return -EINVAL;
- if (check_add_overflow(cmd->addr, cmd->length - 1, &end))
return -EOVERFLOW;
- max_npages = DIV_ROUND_UP(cmd->length, PAGE_SIZE);
- pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
- if (!pages)
return -ENOMEM;
- viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
- if (IS_ERR(viommu)) {
rc = PTR_ERR(viommu);
goto out_free;
- }
- if (!viommu->ops || !viommu->ops->vcmdq_alloc) {
rc = -EOPNOTSUPP;
goto out_put_viommu;
- }
- /* Quick test on the base address */
- if (!iommu_iova_to_phys(viommu->hwpt->common.domain, cmd->addr)) {
rc = -ENXIO;
goto out_put_viommu;
- }
- /* The underlying physical pages must be pinned in the IOAS */
- rc = iopt_pin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length,
pages, 0);
- if (rc)
goto out_put_viommu;
- /* Validate if the underlying physical pages are contiguous */
- for (i = 1; i < max_npages && pages[i]; i++) {
if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1)
continue;
rc = -EFAULT;
goto out_unpin;
- }
- vcmdq = viommu->ops->vcmdq_alloc(viommu, cmd->type, cmd->index,
cmd->addr, cmd->length);
- if (IS_ERR(vcmdq)) {
rc = PTR_ERR(vcmdq);
goto out_unpin;
- }
- vcmdq->viommu = viommu;
- refcount_inc(&viommu->obj.users);
- vcmdq->addr = cmd->addr;
- vcmdq->ictx = ucmd->ictx;
- vcmdq->length = cmd->length;
- cmd->out_vcmdq_id = vcmdq->obj.id;
- rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
- if (rc)
iommufd_object_abort_and_destroy(ucmd->ictx, &vcmdq->obj);
- else
iommufd_object_finalize(ucmd->ictx, &vcmdq->obj);
- goto out_put_viommu;
+out_unpin:
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, cmd->addr, cmd->length);
+out_put_viommu:
- iommufd_put_object(ucmd->ictx, &viommu->obj);
+out_free:
- kfree(pages);
- return rc;
+}
2.43.0
On Mon, Apr 28, 2025 at 09:34:05PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
@@ -501,6 +504,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_IOAS] = { .destroy = iommufd_ioas_destroy, },
- [IOMMUFD_OBJ_VCMDQ] = {
.destroy = iommufd_vcmdq_destroy,
- }, [IOMMUFD_OBJ_VDEVICE] = { .destroy = iommufd_vdevice_destroy, },
When do we expect the VMM to use this ioctl? While it's spawning a new VM?
When guest OS clears the VCMDQ's base address register, or when guest OS reboots or shuts down.
IIUC, one vintf can have multiple lvcmdqs and looking at the series it looks like the vcmdq_alloc allocates a single lvcmdq. Is the plan to dedicate one lvcmdq to per VM? Which means VMs can share a vintf?
VINTF is a vSMMU instance per SMMU. Each VINTF can have multiple LVCMDQs. Each vCMDQ is allocated per IOMMUFD_CMD_VCMDQ_ALLOC. In other words, a VM can issue multiple IOMMUFD_CMD_VCMDQ_ALLOC calls for each VINTF/vSMMU.
Or do we plan to trap access to trap the access everytime the VM accesses an lvcmdq base register?
Yes. That's the only place the VMM can trap. All other register accesses go to the HW directly without trapping.
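For illustration, a hedged sketch of that direct path from the VMM side: the cover letter's mmap infrastructure lets the VMM map the MMIO page of the queue's consumer/producer registers through the iommufd and hand it to the guest, so those accesses never trap. The mmap offset/length are assumed to be reported back as driver data by IOMMU_VCMDQ_ALLOC (the "immap_id" mentioned in the changelog); the names below are hypothetical:

#include <stddef.h>
#include <sys/mman.h>

static void *vmm_map_vcmdq_mmio(int iommufd, off_t immap_offset,
				size_t immap_length)
{
	/* Map the HW page holding the queue's consumer/producer registers */
	void *mmio = mmap(NULL, immap_length, PROT_READ | PROT_WRITE,
			  MAP_SHARED, iommufd, immap_offset);

	if (mmio == MAP_FAILED)
		return NULL;
	/* The VMM then installs this page into the guest's MMIO layout */
	return mmio;
}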
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
- struct iommufd_vcmdq *vcmdq =
container_of(obj, struct iommufd_vcmdq, obj);
- struct iommufd_viommu *viommu = vcmdq->viommu;
- if (viommu->ops->vcmdq_destroy)
viommu->ops->vcmdq_destroy(vcmdq);
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length);
- refcount_dec(&viommu->obj.users);
+}
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
return -EOPNOTSUPP;
The cmd->type check is a little confusing here, I think we could re-order the series and add this check when we have the CMDQV type.
This is the patch that introduces the IOMMU_VCMDQ_TYPE_DEFAULT. So, it's natural we check it here. The thing is that we have to introduce something to fill the enum iommu_vcmdq_type, so that it wouldn't be empty.
An unsupported DEFAULT type is what we have for vIOMMU/vEVENTQ also.
A driver patch should define its own type along with the driver patch. And it's what this series does. I think it's pretty clear?
Alternatively, we could keep this in place and
[..]
add the driver-specific vcmdq_alloc op calls when it's added/available for Tegra CMDQV while stubbing out the rest of this function accordingly.
Why?
The vcmdq_alloc op is already introduced in the prior patch. It is cleaner to keep all core code in one patch. And then another tegra patch to add driver type and its support.
Thanks Nicolin
On Mon, Apr 28, 2025 at 03:44:08PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 09:34:05PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
@@ -501,6 +504,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_IOAS] = { .destroy = iommufd_ioas_destroy, },
- [IOMMUFD_OBJ_VCMDQ] = {
.destroy = iommufd_vcmdq_destroy,
- }, [IOMMUFD_OBJ_VDEVICE] = { .destroy = iommufd_vdevice_destroy, },
When do we expect the VMM to use this ioctl? While it's spawning a new VM?
When guest OS clears the VCMDQ's base address register, or when guest OS reboots or shuts down.
Ack. So, basically any write to VCMDQ_BASE is trapped?
IIUC, one vintf can have multiple lvcmdqs and looking at the series it looks like the vcmdq_alloc allocates a single lvcmdq. Is the plan to dedicate one lvcmdq to per VM? Which means VMs can share a vintf?
VINTF is a vSMMU instance per SMMU. Each VINTF can have multiple LVCMDQs. Each vCMDQ is allocated per IOMMUFD_CMD_VCMDQ_ALLOC. In other word, VM can issue multiple IOMMUFD_CMD_VCMDQ_ALLOC calls for each VTINF/vSMMU.
Ack. I'm just wondering why would a single VM want more than one vCMDQ per vSMMU?
Or do we plan to trap access to trap the access everytime the VM accesses an lvcmdq base register?
Yes. That's the only place the VMM can trap. All other register accesses are going to the HW directly without trappings.
Got it.
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index a65153458a26..02a111710ffe 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -170,3 +170,97 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &viommu->obj); return rc; }
+void iommufd_vcmdq_destroy(struct iommufd_object *obj) +{
- struct iommufd_vcmdq *vcmdq =
container_of(obj, struct iommufd_vcmdq, obj);
- struct iommufd_viommu *viommu = vcmdq->viommu;
- if (viommu->ops->vcmdq_destroy)
viommu->ops->vcmdq_destroy(vcmdq);
- iopt_unpin_pages(&viommu->hwpt->ioas->iopt, vcmdq->addr, vcmdq->length);
- refcount_dec(&viommu->obj.users);
+}
+int iommufd_vcmdq_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
- struct iommu_vcmdq_alloc *cmd = ucmd->cmd;
- struct iommufd_viommu *viommu;
- struct iommufd_vcmdq *vcmdq;
- struct page **pages;
- int max_npages, i;
- dma_addr_t end;
- int rc;
- if (cmd->flags || cmd->type == IOMMU_VCMDQ_TYPE_DEFAULT)
return -EOPNOTSUPP;
The cmd->type check is a little confusing here, I think we could re-order the series and add this check when we have the CMDQV type.
This is the patch that introduces the IOMMU_VCMDQ_TYPE_DEFAULT. So, it's natural we check it here. The thing is that we have to introduce something to fill the enum iommu_vcmdq_type, so that it wouldn't be empty.
An unsupported DEFAULT type is what we have for vIOMMU/vEVENTQ also.
A driver patch should define its own type along with the driver patch. And it's what this series does. I think it's pretty clear?
Alright. Agreed.
Alternatively, we could keep this in place and
[..]
add the driver-specific vcmdq_alloc op calls when it's added/available for Tegra CMDQV while stubbing out the rest of this function accordingly.
Why?
The vcmdq_alloc op is already introduced in the prior patch. It is cleaner to keep all core code in one patch. And then another tegra patch to add driver type and its support.
Alright.
Thanks Nicolin
Thanks, Praan
On Tue, Apr 29, 2025 at 08:28:01AM +0000, Pranjal Shrivastava wrote:
On Mon, Apr 28, 2025 at 03:44:08PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 09:34:05PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
[...]
IIUC, one vintf can have multiple lvcmdqs and looking at the series it looks like the vcmdq_alloc allocates a single lvcmdq. Is the plan to dedicate one lvcmdq to per VM? Which means VMs can share a vintf?
VINTF is a vSMMU instance per SMMU. Each VINTF can have multiple LVCMDQs. Each vCMDQ is allocated per IOMMUFD_CMD_VCMDQ_ALLOC. In other word, VM can issue multiple IOMMUFD_CMD_VCMDQ_ALLOC calls for each VTINF/vSMMU.
Ack. I'm just wondering why would a single VM want more than one vCMDQ per vSMMU? [...]
I guess the only thing on this patch from me was to understand why would a single VM want more than one vCMDQ per vSMMU? (Just curious to know :) )
Apart from that, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks Nicolin
Thanks, Praan
On Tue, Apr 29, 2025 at 06:10:31PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 08:28:01AM +0000, Pranjal Shrivastava wrote:
On Mon, Apr 28, 2025 at 03:44:08PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 09:34:05PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
[...]
IIUC, one vintf can have multiple lvcmdqs and looking at the series it looks like the vcmdq_alloc allocates a single lvcmdq. Is the plan to dedicate one lvcmdq to per VM? Which means VMs can share a vintf?
VINTF is a vSMMU instance per SMMU. Each VINTF can have multiple LVCMDQs. Each vCMDQ is allocated per IOMMUFD_CMD_VCMDQ_ALLOC. In other word, VM can issue multiple IOMMUFD_CMD_VCMDQ_ALLOC calls for each VTINF/vSMMU.
Ack. I'm just wondering why would a single VM want more than one vCMDQ per vSMMU? [...]
I guess the only thing on this patch from me was to understand why would a single VM want more than one vCMDQ per vSMMU? (Just curious to know :) )
It gives some perf gain since it has two portals to fill commands.
Nicolin
On Tue, Apr 29, 2025 at 11:15:00AM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 06:10:31PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 08:28:01AM +0000, Pranjal Shrivastava wrote:
On Mon, Apr 28, 2025 at 03:44:08PM -0700, Nicolin Chen wrote:
On Mon, Apr 28, 2025 at 09:34:05PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:05PM -0700, Nicolin Chen wrote:
[...]
IIUC, one vintf can have multiple lvcmdqs and looking at the series it looks like the vcmdq_alloc allocates a single lvcmdq. Is the plan to dedicate one lvcmdq to per VM? Which means VMs can share a vintf?
VINTF is a vSMMU instance per SMMU. Each VINTF can have multiple LVCMDQs. Each vCMDQ is allocated per IOMMUFD_CMD_VCMDQ_ALLOC. In other word, VM can issue multiple IOMMUFD_CMD_VCMDQ_ALLOC calls for each VTINF/vSMMU.
Ack. I'm just wondering why would a single VM want more than one vCMDQ per vSMMU? [...]
I guess the only thing on this patch from me was to understand why would a single VM want more than one vCMDQ per vSMMU? (Just curious to know :) )
It gives some perf gain since it has two portals to fill commands.
Ohh! I'm imagining concurrent invalidations / commands! Interesting!
Nicolin
Thanks! Praan
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU features to user space VMs. Its hardware has a strict rule when mapping and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler, however it can't do the same for an unmapping since the destroy op returns void.
Add iommufd_vcmdq_depend/undepend() for-driver helpers, allowing LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and LVCMDQ destroyer to refcount_dec().
This is a bit of a compromise, because a driver might end up abusing the API and deadlocking the objects. So restrict the API to a dependency between two driver-allocated objects of the same type, as iommufd is unlikely to build any core-level dependency in this case.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 include/linux/iommufd.h        | 47 ++++++++++++++++++++++++++++++++++
 drivers/iommu/iommufd/driver.c | 28 ++++++++++++++++++++
 2 files changed, 75 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index e91381aaec5a..5dff154e8ce1 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -232,6 +232,10 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, enum iommufd_object_type type); void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); +int iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended); +void iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -252,6 +256,17 @@ static inline void iommufd_object_abort(struct iommufd_ctx *ictx, { }
+static inline int iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + return -EOPNOTSUPP; +} + +static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -329,4 +344,36 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, static_assert(offsetof(typeof(*drv_struct), member.obj) == 0); \ iommufd_object_abort(ictx, &drv_struct->member.obj); \ }) + +/* + * Helpers for IOMMU driver to build/destroy a dependency between two sibling + * structures created by one of the allocators above + */ +#define iommufd_vcmdq_depend(vcmdq_dependent, vcmdq_depended, member) \ + ({ \ + static_assert(__same_type(struct iommufd_object, \ + vcmdq_dependent->member.obj)); \ + static_assert(offsetof(typeof(*vcmdq_dependent), \ + member.obj) == 0); \ + static_assert(__same_type(struct iommufd_object, \ + vcmdq_depended->member.obj)); \ + static_assert(offsetof(typeof(*vcmdq_depended), \ + member.obj) == 0); \ + iommufd_object_depend(&vcmdq_dependent->member.obj, \ + &vcmdq_depended->member.obj); \ + }) + +#define iommufd_vcmdq_undepend(vcmdq_dependent, vcmdq_depended, member) \ + ({ \ + static_assert(__same_type(struct iommufd_object, \ + vcmdq_dependent->member.obj)); \ + static_assert(offsetof(typeof(*vcmdq_dependent), \ + member.obj) == 0); \ + static_assert(__same_type(struct iommufd_object, \ + vcmdq_depended->member.obj)); \ + static_assert(offsetof(typeof(*vcmdq_depended), \ + member.obj) == 0); \ + iommufd_object_undepend(&vcmdq_dependent->member.obj, \ + &vcmdq_depended->member.obj); \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 7980a09761c2..fb7f8fe40f95 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -50,6 +50,34 @@ void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) } EXPORT_SYMBOL_NS_GPL(iommufd_object_abort, "IOMMUFD");
+/* A per-structure helper is available in include/linux/iommufd.h */ +int iommufd_object_depend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + /* Reject self dependency that dead locks */ + if (obj_dependent == obj_depended) + return -EINVAL; + /* Only support dependency between two objects of the same type */ + if (obj_dependent->type != obj_depended->type) + return -EINVAL; + + refcount_inc(&obj_depended->users); + return 0; +} +EXPORT_SYMBOL_NS_GPL(iommufd_object_depend, "IOMMUFD"); + +/* A per-structure helper is available in include/linux/iommufd.h */ +void iommufd_object_undepend(struct iommufd_object *obj_dependent, + struct iommufd_object *obj_depended) +{ + if (WARN_ON_ONCE(obj_dependent == obj_depended || + obj_dependent->type != obj_depended->type)) + return; + + refcount_dec(&obj_depended->users); +} +EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
On 4/26/25 13:58, Nicolin Chen wrote:
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU features to user space VMs. Its hardware has a strict rule when mapping and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler, however it can't do the same for an unmapping since the destroy op returns void.
The key point is that unmapping happens during object destroy. These depend/undepend helpers ensure a vCMDQ is not destroyed (and therefore unmapped) before any vCMDQs that depend on it. Do I get it right?
Add iommufd_vcmdq_depend/undepend() for-driver helpers, allowing LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and LVCMDQ destroyer to refcount_dec().
This is a bit of compromise, because a driver might end up with abusing the API that deadlocks the objects. So restrict the API to a dependency between two driver-allocated objects of the same type, as iommufd would unlikely build any core-level dependency in this case.
Signed-off-by: Nicolin Chennicolinc@nvidia.com
... if that's right,
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Thanks, baolu
On Mon, Apr 28, 2025 at 10:22:09AM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU features to user space VMs. Its hardware has a strict rule when mapping and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler, however it can't do the same for an unmapping since the destroy op returns void.
The key point is that unmapping happens during object destroy. These depend/undepend helpers ensure a vCMDQ is not destroyed (and therefore unmapped) before any vCMDQs that depend on it. Do I get it right?
Yea, I should add some additional words: " The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler. However, it can't do the same for an unmapping since user space could start random destroy calls breaking the rule, while the destroy op in the driver level can't reject a destroy call as it returns void.
Add iommufd_vcmdq_depend/undepend() for-driver helpers, allowing LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and LVCMDQ destroyer to refcount_dec(), so that iommufd core will help block a random destroy call that breaks the rule. "
Thanks Nicolin
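For illustration only, a minimal sketch of how an LVCMDQ allocator could chain these refcounts so that index N keeps index N-1 alive, forcing destroys to happen in N..0 order; struct my_vcmdq, my_vcmdq_link/unlink and the ->prev field are hypothetical, not the actual tegra241-cmdqv code:

#include <linux/iommufd.h>	/* provides the depend/undepend helpers in this series */

struct my_vcmdq {
	struct iommufd_vcmdq core;	/* core object, must be the first member */
	struct my_vcmdq *prev;		/* the LVCMDQ at index - 1, if any */
};

/* Called from the driver's vcmdq_alloc op */
static int my_vcmdq_link(struct my_vcmdq *new, struct my_vcmdq *prev)
{
	int rc;

	if (!prev)		/* index 0 depends on nothing */
		return 0;

	rc = iommufd_vcmdq_depend(new, prev, core);
	if (rc)
		return rc;
	new->prev = prev;
	return 0;
}

/* Called from the driver's vcmdq_destroy op */
static void my_vcmdq_unlink(struct my_vcmdq *vcmdq)
{
	if (vcmdq->prev)
		iommufd_vcmdq_undepend(vcmdq, vcmdq->prev, core);
}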
On Fri, Apr 25, 2025 at 10:58:06PM -0700, Nicolin Chen wrote:
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU features to user space VMs. Its hardware has a strict rule when mapping and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can apply the rule for a mapping in the LVCMDQ allocation handler, however it can't do the same for an unmapping since the destroy op returns void.
Add iommufd_vcmdq_depend/undepend() for-driver helpers, allowing LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and LVCMDQ destroyer to refcount_dec().
This is a bit of compromise, because a driver might end up with abusing the API that deadlocks the objects. So restrict the API to a dependency between two driver-allocated objects of the same type, as iommufd would unlikely build any core-level dependency in this case.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
include/linux/iommufd.h | 47 ++++++++++++++++++++++++++++++++++ drivers/iommu/iommufd/driver.c | 28 ++++++++++++++++++++ 2 files changed, 75 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index e91381aaec5a..5dff154e8ce1 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -232,6 +232,10 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, enum iommufd_object_type type); void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); +int iommufd_object_depend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended);
+void iommufd_object_undepend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended);
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -252,6 +256,17 @@ static inline void iommufd_object_abort(struct iommufd_ctx *ictx, { } +static inline int iommufd_object_depend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended)
+{
- return -EOPNOTSUPP;
+}
+static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended)
+{ +}
static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { @@ -329,4 +344,36 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu, static_assert(offsetof(typeof(*drv_struct), member.obj) == 0); \ iommufd_object_abort(ictx, &drv_struct->member.obj); \ })
+/*
- Helpers for IOMMU driver to build/destroy a dependency between two sibling
- structures created by one of the allocators above
- */
+#define iommufd_vcmdq_depend(vcmdq_dependent, vcmdq_depended, member) \
- ({ \
static_assert(__same_type(struct iommufd_object, \
vcmdq_dependent->member.obj)); \
static_assert(offsetof(typeof(*vcmdq_dependent), \
member.obj) == 0); \
static_assert(__same_type(struct iommufd_object, \
vcmdq_depended->member.obj)); \
static_assert(offsetof(typeof(*vcmdq_depended), \
member.obj) == 0); \
iommufd_object_depend(&vcmdq_dependent->member.obj, \
&vcmdq_depended->member.obj); \
- })
+#define iommufd_vcmdq_undepend(vcmdq_dependent, vcmdq_depended, member) \
- ({ \
static_assert(__same_type(struct iommufd_object, \
vcmdq_dependent->member.obj)); \
static_assert(offsetof(typeof(*vcmdq_dependent), \
member.obj) == 0); \
static_assert(__same_type(struct iommufd_object, \
vcmdq_depended->member.obj)); \
static_assert(offsetof(typeof(*vcmdq_depended), \
member.obj) == 0); \
iommufd_object_undepend(&vcmdq_dependent->member.obj, \
&vcmdq_depended->member.obj); \
- })
#endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 7980a09761c2..fb7f8fe40f95 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -50,6 +50,34 @@ void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj) } EXPORT_SYMBOL_NS_GPL(iommufd_object_abort, "IOMMUFD"); +/* A per-structure helper is available in include/linux/iommufd.h */ +int iommufd_object_depend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended)
+{
- /* Reject self dependency that dead locks */
- if (obj_dependent == obj_depended)
return -EINVAL;
- /* Only support dependency between two objects of the same type */
- if (obj_dependent->type != obj_depended->type)
return -EINVAL;
- refcount_inc(&obj_depended->users);
- return 0;
+} +EXPORT_SYMBOL_NS_GPL(iommufd_object_depend, "IOMMUFD");
+/* A per-structure helper is available in include/linux/iommufd.h */ +void iommufd_object_undepend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended)
+{
- if (WARN_ON_ONCE(obj_dependent == obj_depended ||
obj_dependent->type != obj_depended->type))
return;
- refcount_dec(&obj_depended->users);
+} +EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD");
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
If I'm getting this right, I think we are setting up dependencies like: vcmdq[2] -> vcmdq[1] -> vcmdq[0] based on refcounts of each object, which ensures that the unmaps happen in descending order..
If that's right, Is it fair to have iommufd_vcmdq_depend/undepend in the core code itself? Since it's a driver-level limitation, I think we should just have iommufd_object_depend/undepend in the core code and the iommufd_vcmdq_depend/undepend can move into the CMDQV driver?
-- 2.43.0
Thanks, Praan
On Tue, Apr 29, 2025 at 12:40:07PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:06PM -0700, Nicolin Chen wrote:
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
If I'm getting this right, I think we are setting up dependencies like: vcmdq[2] -> vcmdq[1] -> vcmdq[0] based on refcounts of each object, which ensures that the unmaps happen in descending order..
Yes.
If that's right, Is it fair to have iommufd_vcmdq_depend/undepend in the core code itself? Since it's a driver-level limitation, I think we should just have iommufd_object_depend/undepend in the core code and the iommufd_vcmdq_depend/undepend can move into the CMDQV driver?
The moment we added iommufd_object_depend/undepend, we already had a blurred boundary here, since we had no way to handle this in the driver and had to ask the core for help.
The iommufd_vcmdq_depend/undepend is just a pair of macros to help validate the structure inputs that are core-defined. It is quite fair to put them next to the raw functions. I also had the notes on top of the raw functions suggesting that callers use the macros instead.
Thanks Nicolin
On Tue, Apr 29, 2025 at 10:10:28AM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 12:40:07PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:06PM -0700, Nicolin Chen wrote:
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
If I'm getting this right, I think we are setting up dependencies like: vcmdq[2] -> vcmdq[1] -> vcmdq[0] based on refcounts of each object, which ensures that the unmaps happen in descending order..
Yes.
If that's right, Is it fair to have iommufd_vcmdq_depend/undepend in the core code itself? Since it's a driver-level limitation, I think we should just have iommufd_object_depend/undepend in the core code and the iommufd_vcmdq_depend/undepend can move into the CMDQV driver?
The moment we added iommufd_object_depend/undepend, we already had a blur boundary here since we had no choice to handle in the driver but to ask core for help.
The iommufd_vcmdq_depend/undepend is just a pair of macros to help validating the structure inputs that are core defined. It is quite fair to put next to the raw functions. I also had the notes on top of the raw functions suggesting callers to use the macros instead.
Well, yes.. in that case let's call the macros something else? The current names suggest that the macros only set up dependencies for vcmdq and not any "two sibling structures created by one of the allocators above" as mentioned by the note. Maybe we could rename the macro to something like: `iommufd_container_obj_depend`?
With this nit, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks Nicolin
Thanks, Praan
On Tue, Apr 29, 2025 at 05:59:32PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 10:10:28AM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 12:40:07PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:06PM -0700, Nicolin Chen wrote:
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
If I'm getting this right, I think we are setting up dependencies like: vcmdq[2] -> vcmdq[1] -> vcmdq[0] based on refcounts of each object, which ensures that the unmaps happen in descending order..
Yes.
If that's right, Is it fair to have iommufd_vcmdq_depend/undepend in the core code itself? Since it's a driver-level limitation, I think we should just have iommufd_object_depend/undepend in the core code and the iommufd_vcmdq_depend/undepend can move into the CMDQV driver?
The moment we added iommufd_object_depend/undepend, we already had a blur boundary here since we had no choice to handle in the driver but to ask core for help.
The iommufd_vcmdq_depend/undepend is just a pair of macros to help validating the structure inputs that are core defined. It is quite fair to put next to the raw functions. I also had the notes on top of the raw functions suggesting callers to use the macros instead.
Well, yes.. in that case let's call the macros something else? The current names suggest that the macros only setup dependencies for vcmdq and not any "two sibling structures created by one of the allocators above" as mentioned by the note. Maybe we could rename the macro to something like: `iommufd_container_obj_depend`?
That's the intention of the macros: to validate the vCMDQ structure and help convert a driver-defined vcmdq structure to the required core field, as we only have vCMDQ objects using them.
If we have use case for other objects in the future, we should add another iommufd_vxxxx_depend/undepend macros.
Nicolin
On Tue, Apr 29, 2025 at 11:07:42AM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 05:59:32PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 10:10:28AM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 12:40:07PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:06PM -0700, Nicolin Chen wrote:
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
If I'm getting this right, I think we are setting up dependencies like: vcmdq[2] -> vcmdq[1] -> vcmdq[0] based on refcounts of each object, which ensures that the unmaps happen in descending order..
Yes.
If that's right, Is it fair to have iommufd_vcmdq_depend/undepend in the core code itself? Since it's a driver-level limitation, I think we should just have iommufd_object_depend/undepend in the core code and the iommufd_vcmdq_depend/undepend can move into the CMDQV driver?
The moment we added iommufd_object_depend/undepend, we already had a blur boundary here since we had no choice to handle in the driver but to ask core for help.
The iommufd_vcmdq_depend/undepend is just a pair of macros to help validating the structure inputs that are core defined. It is quite fair to put next to the raw functions. I also had the notes on top of the raw functions suggesting callers to use the macros instead.
Well, yes.. in that case let's call the macros something else? The current names suggest that the macros only setup dependencies for vcmdq and not any "two sibling structures created by one of the allocators above" as mentioned by the note. Maybe we could rename the macro to something like: `iommufd_container_obj_depend`?
That's the intention of the macros: to validate the vCMDQ structure and help convert a driver-defined vcmdq structure to the required core field, as only vCMDQ objects use them.
If we have a use case for other objects in the future, we should add other iommufd_vxxxx_depend/undepend macros.
Thanks for clarifying the rationale behind the VCMDQ-specific naming.
On the point of needing new iommufd_vxxxx_depend macros for future object types, I don't think that would be required: the current static_asserts within these macros validate the container->member.obj embedding pattern, not the struct type of the container itself, which makes the macro logic inherently reusable for any other object type that adopts the same embedding.
However, if there's a strong preference against making it generic, I don't have any issues since we only use it for vCMDQs right now.
My main point was to keep the core code generic to aid other implementations in the future... today NVIDIA has CMDQV, tomorrow maybe someone else will have something for vdevice or something else. Anyway, I don't feel strongly about this. Just trying to help :)
Nicolin
Thanks, Praan
Some simple tests for IOMMUFD_CMD_VCMDQ_ALLOC infrastructure covering the new iommufd_vcmdq_depend/undepend() helpers.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 3 + tools/testing/selftests/iommu/iommufd_utils.h | 30 +++++++++ drivers/iommu/iommufd/selftest.c | 67 +++++++++++++++++++ tools/testing/selftests/iommu/iommufd.c | 59 ++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 6 ++ 5 files changed, 165 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index fbf9ecb35a13..a0831d78fef1 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -265,4 +265,7 @@ struct iommu_viommu_event_selftest { __u32 virt_id; };
+#define IOMMU_VCMDQ_TYPE_SELFTEST 0xdeadbeef +#define IOMMU_TEST_VCMDQ_MAX 2 + #endif diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index a5d4cbd089ba..d6d8fedf2226 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -956,6 +956,36 @@ static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id, _test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, \ virt_id, vdev_id))
+static int _test_cmd_vcmdq_alloc(int fd, __u32 viommu_id, __u32 type, __u32 idx, + __u64 addr, __u64 length, __u32 *vcmdq_id) +{ + struct iommu_vcmdq_alloc cmd = { + .size = sizeof(cmd), + .viommu_id = viommu_id, + .type = type, + .index = idx, + .addr = addr, + .length = length, + }; + int ret; + + ret = ioctl(fd, IOMMU_VCMDQ_ALLOC, &cmd); + if (ret) + return ret; + if (vcmdq_id) + *vcmdq_id = cmd.out_vcmdq_id; + return 0; +} + +#define test_cmd_vcmdq_alloc(viommu_id, type, idx, addr, len, vcmdq_id) \ + ASSERT_EQ(0, _test_cmd_vcmdq_alloc(self->fd, viommu_id, type, idx, \ + addr, len, vcmdq_id)) +#define test_err_vcmdq_alloc(_errno, viommu_id, type, idx, addr, len, \ + vcmdq_id) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_vcmdq_alloc(self->fd, viommu_id, type, idx, \ + addr, len, vcmdq_id)) + static int _test_cmd_veventq_alloc(int fd, __u32 viommu_id, __u32 type, __u32 *veventq_id, __u32 *veventq_fd) { diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index b04bd2fbc53d..d6cc5b78821b 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -148,6 +148,7 @@ to_mock_nested(struct iommu_domain *domain) struct mock_viommu { struct iommufd_viommu core; struct mock_iommu_domain *s2_parent; + struct mock_vcmdq *mock_vcmdq[IOMMU_TEST_VCMDQ_MAX]; };
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) @@ -155,6 +156,18 @@ static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) return container_of(viommu, struct mock_viommu, core); }
+struct mock_vcmdq { + struct iommufd_vcmdq core; + struct mock_viommu *mock_viommu; + struct mock_vcmdq *prev; + u16 index; +}; + +static inline struct mock_vcmdq *to_mock_vcmdq(struct iommufd_vcmdq *vcmdq) +{ + return container_of(vcmdq, struct mock_vcmdq, core); +} + enum selftest_obj_type { TYPE_IDEV, }; @@ -727,10 +740,64 @@ static int mock_viommu_cache_invalidate(struct iommufd_viommu *viommu, return rc; }
+/* Test iommufd_vcmdq_depend/_undepend() */ +static struct iommufd_vcmdq * +mock_vcmdq_alloc(struct iommufd_viommu *viommu, unsigned int type, u32 index, + dma_addr_t addr, size_t length) +{ + struct mock_viommu *mock_viommu = to_mock_viommu(viommu); + struct mock_vcmdq *mock_vcmdq, *prev = 0; + int rc; + + if (type != IOMMU_VCMDQ_TYPE_SELFTEST) + return ERR_PTR(-EOPNOTSUPP); + if (index >= IOMMU_TEST_VCMDQ_MAX) + return ERR_PTR(-EINVAL); + if (mock_viommu->mock_vcmdq[index]) + return ERR_PTR(-EEXIST); + if (index) { + prev = mock_viommu->mock_vcmdq[index - 1]; + if (!prev) + return ERR_PTR(-EIO); + } + + mock_vcmdq = iommufd_vcmdq_alloc(viommu, struct mock_vcmdq, core); + if (IS_ERR(mock_vcmdq)) + return ERR_CAST(mock_vcmdq); + + if (prev) { + rc = iommufd_vcmdq_depend(mock_vcmdq, prev, core); + if (rc) + goto free_vcmdq; + } + mock_vcmdq->prev = prev; + mock_vcmdq->mock_viommu = mock_viommu; + mock_viommu->mock_vcmdq[index] = mock_vcmdq; + + return &mock_vcmdq->core; +free_vcmdq: + iommufd_struct_destroy(viommu->ictx, mock_vcmdq, core); + return ERR_PTR(rc); +} + +static void mock_vcmdq_destroy(struct iommufd_vcmdq *vcmdq) +{ + struct mock_vcmdq *mock_vcmdq = to_mock_vcmdq(vcmdq); + struct mock_viommu *mock_viommu = mock_vcmdq->mock_viommu; + + mock_viommu->mock_vcmdq[mock_vcmdq->index] = NULL; + if (mock_vcmdq->prev) + iommufd_vcmdq_undepend(mock_vcmdq, mock_vcmdq->prev, core); + + /* iommufd core frees mock_vcmdq and vcmdq */ +} + static struct iommufd_viommu_ops mock_viommu_ops = { .destroy = mock_viommu_destroy, .alloc_domain_nested = mock_viommu_alloc_domain_nested, .cache_invalidate = mock_viommu_cache_invalidate, + .vcmdq_alloc = mock_vcmdq_alloc, + .vcmdq_destroy = mock_vcmdq_destroy, };
static struct iommufd_viommu * diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 8ebbb7fda02d..7c464f6eb37b 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -3031,6 +3031,65 @@ TEST_F(iommufd_viommu, vdevice_cache) } }
+TEST_F(iommufd_viommu, vcmdq) +{ + uint32_t viommu_id = self->viommu_id; + __u64 iova = MOCK_APERTURE_START; + uint32_t vcmdq_id[2]; + + if (viommu_id) { + /* Fail IOMMU_VCMDQ_TYPE_DEFAULT */ + test_err_vcmdq_alloc(EOPNOTSUPP, viommu_id, + IOMMU_VCMDQ_TYPE_DEFAULT, 0, iova, + PAGE_SIZE, &vcmdq_id[0]); + /* Fail queue addr and length */ + test_err_vcmdq_alloc(EINVAL, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, 0, PAGE_SIZE, + &vcmdq_id[0]); + test_err_vcmdq_alloc(EINVAL, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, iova, 0, + &vcmdq_id[0]); + test_err_vcmdq_alloc(EOVERFLOW, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, ~(uint64_t)0, + PAGE_SIZE, &vcmdq_id[0]); + /* Fail missing iova */ + test_err_vcmdq_alloc(ENXIO, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, iova, + PAGE_SIZE, &vcmdq_id[0]); + + /* Map iova */ + test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova); + + /* Fail index=1 and =MAX; must start from index=0 */ + test_err_vcmdq_alloc(EIO, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 1, iova, + PAGE_SIZE, &vcmdq_id[0]); + test_err_vcmdq_alloc(EINVAL, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, + IOMMU_TEST_VCMDQ_MAX, iova, PAGE_SIZE, + &vcmdq_id[0]); + + /* Allocate index=0 */ + test_cmd_vcmdq_alloc(viommu_id, IOMMU_VCMDQ_TYPE_SELFTEST, 0, + iova, PAGE_SIZE, &vcmdq_id[0]); + /* Fail duplicate */ + test_err_vcmdq_alloc(EEXIST, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, + iova, PAGE_SIZE, &vcmdq_id[0]); + + /* Allocate index=1 */ + test_cmd_vcmdq_alloc(viommu_id, IOMMU_VCMDQ_TYPE_SELFTEST, 1, + iova, PAGE_SIZE, &vcmdq_id[1]); + /* Fail to destroy, due to dependency */ + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, vcmdq_id[0])); + + /* Destroy in descending order */ + test_ioctl_destroy(vcmdq_id[1]); + test_ioctl_destroy(vcmdq_id[0]); + } +} + FIXTURE(iommufd_device_pasid) { int fd; diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c index f7ccf1822108..ffad3f2875bd 100644 --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c @@ -634,6 +634,7 @@ TEST_FAIL_NTH(basic_fail_nth, device) uint32_t idev_id; uint32_t hwpt_id; uint32_t viommu_id; + uint32_t vcmdq_id; uint32_t vdev_id; __u64 iova;
@@ -696,6 +697,11 @@ TEST_FAIL_NTH(basic_fail_nth, device) if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id)) return -1;
+ if (_test_cmd_vcmdq_alloc(self->fd, viommu_id, + IOMMU_VCMDQ_TYPE_SELFTEST, 0, iova, PAGE_SIZE, + &vcmdq_id)) + return -1; + if (_test_ioctl_fault_alloc(self->fd, &fault_id, &fault_fd)) return -1; close(fault_fd);
For vIOMMU passing through HW resources to user space (VMs), add an mmap infrastructure to map a region of hardware MMIO pages.
Maintain an mt_mmap per ictx for validations. To allow IOMMU drivers to add and delete mmappable regions to/from the mt_mmap, add a pair of new helpers: iommufd_ctx_alloc_mmap() and iommufd_ctx_free_mmap().
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 8 +++++ include/linux/iommufd.h | 15 ++++++++++ drivers/iommu/iommufd/driver.c | 39 +++++++++++++++++++++++++ drivers/iommu/iommufd/main.c | 39 +++++++++++++++++++++++++ 4 files changed, 101 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index b974c207ae8a..db5b62ec4abb 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/iommufd.h> #include <linux/iova_bitmap.h> +#include <linux/maple_tree.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -44,6 +45,7 @@ struct iommufd_ctx { struct xarray groups; wait_queue_head_t destroy_wait; struct rw_semaphore ioas_creation_lock; + struct maple_tree mt_mmap;
struct mutex sw_msi_lock; struct list_head sw_msi_list; @@ -55,6 +57,12 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; };
+/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap { + unsigned long pfn_start; + unsigned long pfn_end; +}; + /* * The IOVA to PFN map. The map automatically copies the PFNs into multiple * domains and permits sharing of PFNs between io_pagetable instances. This diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 5dff154e8ce1..d63e2d91be0d 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -236,6 +236,9 @@ int iommufd_object_depend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base, + size_t size, unsigned long *immap_id); +void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -262,11 +265,23 @@ static inline int iommufd_object_depend(struct iommufd_object *obj_dependent, return -EOPNOTSUPP; }
+static inline int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, + phys_addr_t base, size_t size, + unsigned long *immap_id) +{ + return -EOPNOTSUPP; +} + static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended) { }
+static inline void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, + unsigned long immap_id) +{ +} + static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index fb7f8fe40f95..c55336c580dc 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,45 @@ void iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD");
+/* Driver should report the output @immap_id to user space for mmap() syscall */ +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base, + size_t size, unsigned long *immap_id) +{ + struct iommufd_mmap *immap; + int rc; + + if (WARN_ON_ONCE(!immap_id)) + return -EINVAL; + if (base & ~PAGE_MASK) + return -EINVAL; + if (!size || size & ~PAGE_MASK) + return -EINVAL; + + immap = kzalloc(sizeof(*immap), GFP_KERNEL); + if (!immap) + return -ENOMEM; + immap->pfn_start = base >> PAGE_SHIFT; + immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1; + + rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap), + 0, LONG_MAX >> PAGE_SHIFT, GFP_KERNEL); + if (rc < 0) { + kfree(immap); + return rc; + } + + /* mmap() syscall will right-shift the immap_id to vma->vm_pgoff */ + *immap_id <<= PAGE_SHIFT; + return 0; +} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_alloc_mmap, "IOMMUFD"); + +void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id) +{ + kfree(mtree_erase(&ictx->mt_mmap, immap_id >> PAGE_SHIFT)); +} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_free_mmap, "IOMMUFD"); + /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index ac51d5cfaa61..4b46ea47164d 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -213,6 +213,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp) xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT); xa_init(&ictx->groups); ictx->file = filp; + mt_init_flags(&ictx->mt_mmap, MT_FLAGS_ALLOC_RANGE); init_waitqueue_head(&ictx->destroy_wait); mutex_init(&ictx->sw_msi_lock); INIT_LIST_HEAD(&ictx->sw_msi_list); @@ -410,11 +411,49 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, return ret; }
+/* + * Kernel driver must first do iommufd_ctx_alloc_mmap() to register an mmappable + * MMIO region to the iommufd core to receive an "immap_id". Then, driver should + * report to user space this immap_id and the size of the registered MMIO region + * for @vm_pgoff and @size of an mmap() call, via an IOMMU_VIOMMU_ALLOC ioctl in + * the output fields of its driver-type data structure. + * + * Note the @size is allowed to be smaller than the registered size as a partial + * mmap starting from the registered base address. + */ +static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct iommufd_ctx *ictx = filp->private_data; + size_t size = vma->vm_end - vma->vm_start; + struct iommufd_mmap *immap; + + if (size & ~PAGE_MASK) + return -EINVAL; + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + if (vma->vm_flags & VM_EXEC) + return -EPERM; + + /* vm_pgoff carries an index (immap_id) to an mtree entry (immap) */ + immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff); + if (!immap) + return -ENXIO; + if (size >> PAGE_SHIFT > immap->pfn_end - immap->pfn_start + 1) + return -ENXIO; + + vma->vm_pgoff = 0; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO); + return remap_pfn_range(vma, vma->vm_start, immap->pfn_start, size, + vma->vm_page_prot); +} + static const struct file_operations iommufd_fops = { .owner = THIS_MODULE, .open = iommufd_fops_open, .release = iommufd_fops_release, .unlocked_ioctl = iommufd_fops_ioctl, + .mmap = iommufd_fops_mmap, };
/**
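For reference, a minimal sketch of the intended driver-side usage of the two helpers introduced by the patch above, modeled on the selftest added later in this series. struct my_viommu and the function names are placeholders, not code from the series:

struct my_viommu {
	struct iommufd_viommu core;
	unsigned long immap_id;	/* page-shifted offset reported to user space */
};

static int my_viommu_expose_mmio(struct my_viommu *vi, struct iommufd_ctx *ictx,
				 phys_addr_t mmio_base, size_t mmio_size)
{
	int rc;

	/* Both base and size must be page-aligned, or -EINVAL is returned */
	rc = iommufd_ctx_alloc_mmap(ictx, mmio_base, mmio_size, &vi->immap_id);
	if (rc)
		return rc;

	/*
	 * The driver then reports vi->immap_id and mmio_size to user space,
	 * e.g. in the driver-specific output of its allocation ioctl, to be
	 * used as the offset and length of a later mmap() on the iommufd.
	 */
	return 0;
}

static void my_viommu_unexpose_mmio(struct my_viommu *vi, struct iommufd_ctx *ictx)
{
	iommufd_ctx_free_mmap(ictx, vi->immap_id);
}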
On 4/26/25 13:58, Nicolin Chen wrote:
For vIOMMU passing through HW resources to user space (VMs), add an mmap infrastructure to map a region of hardware MMIO pages.
Maintain an mt_mmap per ictx for validations. To allow IOMMU drivers to add and delete mmappable regions to/from the mt_mmap, add a pair of new helpers: iommufd_ctx_alloc_mmap() and iommufd_ctx_free_mmap().
I am wondering why the dma_buf mechanism isn't used here, considering that this also involves an export and import pattern.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 8 +++++ include/linux/iommufd.h | 15 ++++++++++ drivers/iommu/iommufd/driver.c | 39 +++++++++++++++++++++++++ drivers/iommu/iommufd/main.c | 39 +++++++++++++++++++++++++ 4 files changed, 101 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index b974c207ae8a..db5b62ec4abb 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/iommufd.h> #include <linux/iova_bitmap.h> +#include <linux/maple_tree.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -44,6 +45,7 @@ struct iommufd_ctx { struct xarray groups; wait_queue_head_t destroy_wait; struct rw_semaphore ioas_creation_lock;
- struct maple_tree mt_mmap;
struct mutex sw_msi_lock; struct list_head sw_msi_list; @@ -55,6 +57,12 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; }; +/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap {
- unsigned long pfn_start;
- unsigned long pfn_end;
+};
This structure is introduced to represent a mappable/mapped region, right? It would be better to add comments specifying whether the start and end are inclusive or exclusive.
- /*
- The IOVA to PFN map. The map automatically copies the PFNs into multiple
- domains and permits sharing of PFNs between io_pagetable instances. This
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 5dff154e8ce1..d63e2d91be0d 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -236,6 +236,9 @@ int iommufd_object_depend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id);
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -262,11 +265,23 @@ static inline int iommufd_object_depend(struct iommufd_object *obj_dependent, return -EOPNOTSUPP; } +static inline int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx,
phys_addr_t base, size_t size,
unsigned long *immap_id)
+{
- return -EOPNOTSUPP;
+}
- static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended) { }
+static inline void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx,
unsigned long immap_id)
+{ +}
- static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) {
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index fb7f8fe40f95..c55336c580dc 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,45 @@ void iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD"); +/* Driver should report the output @immap_id to user space for mmap() syscall */ +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id)
+{
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
Is it equal to PAGE_ALIGNED()?
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
0, LONG_MAX >> PAGE_SHIFT, GFP_KERNEL);
- if (rc < 0) {
kfree(immap);
return rc;
- }
- /* mmap() syscall will right-shift the immap_id to vma->vm_pgoff */
- *immap_id <<= PAGE_SHIFT;
- return 0;
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_alloc_mmap, "IOMMUFD");
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id) +{
- kfree(mtree_erase(&ictx->mt_mmap, immap_id >> PAGE_SHIFT));
MMIO lifecycle question: what happens if a region is removed from the maple tree (and is therefore no longer mappable), but is still mapped and in use by userspace?
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_free_mmap, "IOMMUFD");
- /* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
Thanks, baolu
On Mon, Apr 28, 2025 at 10:50:32AM +0800, Baolu Lu wrote:
On 4/26/25 13:58, Nicolin Chen wrote:
For vIOMMU passing through HW resources to user space (VMs), add an mmap infrastructure to map a region of hardware MMIO pages.
Maintain an mt_mmap per ictx for validations. To allow IOMMU drivers to add and delete mmappable regions to/from the mt_mmap, add a pair of new helpers: iommufd_ctx_alloc_mmap() and iommufd_ctx_free_mmap().
I am wondering why the dma_buf mechanism isn't used here, considering that this also involves an export and import pattern.
The use case here is to expose one small MMIO page for user space to directly control the HW, so mmap seems to be a good fit. What would be the benefit of using dma_buf here?
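A minimal sketch of that user space side, assuming the driver reported the mmap offset and size in the output of its allocation ioctl (as the selftest later in this series does); the function name is a placeholder:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* mmap_pgoff and mmap_size are whatever the allocation ioctl reported back */
static volatile uint32_t *map_vcmdq_page(int iommufd, uint64_t mmap_pgoff,
					 size_t mmap_size)
{
	void *p = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       iommufd, (off_t)mmap_pgoff);

	return p == MAP_FAILED ? NULL : p;
}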
@@ -55,6 +57,12 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; }; +/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap {
- unsigned long pfn_start;
- unsigned long pfn_end;
+};
This structure is introduced to represent a mappable/mapped region, right? It would be better to add comments specifying whether the start and end are inclusive or exclusive.
Yes. Sure I can add that pfn_start/pfn_end are inclusive.
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index fb7f8fe40f95..c55336c580dc 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,45 @@ void iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD"); +/* Driver should report the output @immap_id to user space for mmap() syscall */ +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id)
+{
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
Is it equal to PAGE_ALIGNED()?
Yes. Will change.
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id) +{
- kfree(mtree_erase(&ictx->mt_mmap, immap_id >> PAGE_SHIFT));
MMIO lifecycle question: what happens if a region is removed from the maple tree (and is therefore no longer mappable), but is still mapped and in use by userspace?
That's a good point!
Yeah, mmap() should refcount an object to prevent its destroy call, which is where this iommufd_ctx_free_mmap() gets called. So, these two could probably become iommufd_object_alloc_mmap/unmmap().
And I need to find some callback in the munmap path to release the reference..
Thanks Nicolin
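One possible shape for that, sketched purely for illustration and not part of the posted series: keep a back-pointer to the owning object in the mtree entry and hold a reference across the VMA lifetime via vm_operations_struct open/close, with iommufd_fops_mmap() setting vma->vm_private_data and vma->vm_ops and taking the initial reference before returning. The owner back-pointer and its users refcount are assumptions here:

struct iommufd_mmap_sketch {
	unsigned long pfn_start;
	unsigned long pfn_end;
	struct iommufd_object *owner;	/* hypothetical back-pointer */
};

static void iommufd_mmap_vma_open(struct vm_area_struct *vma)
{
	struct iommufd_mmap_sketch *immap = vma->vm_private_data;

	/* Pin the owning object for as long as this VMA exists */
	refcount_inc(&immap->owner->users);
}

static void iommufd_mmap_vma_close(struct vm_area_struct *vma)
{
	struct iommufd_mmap_sketch *immap = vma->vm_private_data;

	/* Dropped on munmap(), allowing the object to be destroyed again */
	refcount_dec(&immap->owner->users);
}

static const struct vm_operations_struct iommufd_mmap_vm_ops = {
	.open = iommufd_mmap_vma_open,
	.close = iommufd_mmap_vma_close,
};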
On Fri, Apr 25, 2025 at 10:58:08PM -0700, Nicolin Chen wrote:
For vIOMMU passing through HW resources to user space (VMs), add an mmap infrastructure to map a region of hardware MMIO pages.
Maintain an mt_mmap per ictx for validations. To allow IOMMU drivers to add and delete mmappable regions to/from the mt_mmap, add a pair of new helpers: iommufd_ctx_alloc_mmap() and iommufd_ctx_free_mmap().
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 8 +++++ include/linux/iommufd.h | 15 ++++++++++ drivers/iommu/iommufd/driver.c | 39 +++++++++++++++++++++++++ drivers/iommu/iommufd/main.c | 39 +++++++++++++++++++++++++ 4 files changed, 101 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index b974c207ae8a..db5b62ec4abb 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/iommufd.h> #include <linux/iova_bitmap.h> +#include <linux/maple_tree.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -44,6 +45,7 @@ struct iommufd_ctx { struct xarray groups; wait_queue_head_t destroy_wait; struct rw_semaphore ioas_creation_lock;
- struct maple_tree mt_mmap;
struct mutex sw_msi_lock; struct list_head sw_msi_list; @@ -55,6 +57,12 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; }; +/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap {
- unsigned long pfn_start;
- unsigned long pfn_end;
+};
/*
- The IOVA to PFN map. The map automatically copies the PFNs into multiple
- domains and permits sharing of PFNs between io_pagetable instances. This
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 5dff154e8ce1..d63e2d91be0d 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -236,6 +236,9 @@ int iommufd_object_depend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id);
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -262,11 +265,23 @@ static inline int iommufd_object_depend(struct iommufd_object *obj_dependent, return -EOPNOTSUPP; } +static inline int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx,
phys_addr_t base, size_t size,
unsigned long *immap_id)
+{
- return -EOPNOTSUPP;
+}
static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended) { } +static inline void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx,
unsigned long immap_id)
+{ +}
static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index fb7f8fe40f95..c55336c580dc 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,45 @@ void iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD"); +/* Driver should report the output @immap_id to user space for mmap() syscall */ +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id)
+{
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
I believe this should be sizeof(*immap) ?
0, LONG_MAX >> PAGE_SHIFT, GFP_KERNEL);
- if (rc < 0) {
kfree(immap);
return rc;
- }
- /* mmap() syscall will right-shift the immap_id to vma->vm_pgoff */
- *immap_id <<= PAGE_SHIFT;
- return 0;
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_alloc_mmap, "IOMMUFD");
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id) +{
- kfree(mtree_erase(&ictx->mt_mmap, immap_id >> PAGE_SHIFT));
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_free_mmap, "IOMMUFD");
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index ac51d5cfaa61..4b46ea47164d 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -213,6 +213,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp) xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT); xa_init(&ictx->groups); ictx->file = filp;
- mt_init_flags(&ictx->mt_mmap, MT_FLAGS_ALLOC_RANGE); init_waitqueue_head(&ictx->destroy_wait); mutex_init(&ictx->sw_msi_lock); INIT_LIST_HEAD(&ictx->sw_msi_list);
@@ -410,11 +411,49 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, return ret; } +/*
- Kernel driver must first do iommufd_ctx_alloc_mmap() to register an mmappable
- MMIO region to the iommufd core to receive an "immap_id". Then, driver should
- report to user space this immap_id and the size of the registered MMIO region
- for @vm_pgoff and @size of an mmap() call, via an IOMMU_VIOMMU_ALLOC ioctl in
- the output fields of its driver-type data structure.
- Note the @size is allowed to be smaller than the registered size as a partial
- mmap starting from the registered base address.
- */
+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{
- struct iommufd_ctx *ictx = filp->private_data;
- size_t size = vma->vm_end - vma->vm_start;
- struct iommufd_mmap *immap;
- if (size & ~PAGE_MASK)
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;
- if (vma->vm_flags & VM_EXEC)
return -EPERM;
- /* vm_pgoff carries an index (immap_id) to an mtree entry (immap) */
- immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
- if (!immap)
return -ENXIO;
- if (size >> PAGE_SHIFT > immap->pfn_end - immap->pfn_start + 1)
return -ENXIO;
- vma->vm_pgoff = 0;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
- return remap_pfn_range(vma, vma->vm_start, immap->pfn_start, size,
vma->vm_page_prot);
+}
static const struct file_operations iommufd_fops = { .owner = THIS_MODULE, .open = iommufd_fops_open, .release = iommufd_fops_release, .unlocked_ioctl = iommufd_fops_ioctl,
- .mmap = iommufd_fops_mmap,
}; /**
Thanks, Praan
-- 2.43.0
On Tue, Apr 29, 2025 at 08:24:33PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:08PM -0700, Nicolin Chen wrote:
For vIOMMU passing through HW resources to user space (VMs), add an mmap infrastructure to map a region of hardware MMIO pages.
Maintain an mt_mmap per ictx for validations. To allow IOMMU drivers to add and delete mmappable regions to/from the mt_mmap, add a pair of new helpers: iommufd_ctx_alloc_mmap() and iommufd_ctx_free_mmap().
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/iommufd_private.h | 8 +++++ include/linux/iommufd.h | 15 ++++++++++ drivers/iommu/iommufd/driver.c | 39 +++++++++++++++++++++++++ drivers/iommu/iommufd/main.c | 39 +++++++++++++++++++++++++ 4 files changed, 101 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index b974c207ae8a..db5b62ec4abb 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/iommufd.h> #include <linux/iova_bitmap.h> +#include <linux/maple_tree.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -44,6 +45,7 @@ struct iommufd_ctx { struct xarray groups; wait_queue_head_t destroy_wait; struct rw_semaphore ioas_creation_lock;
- struct maple_tree mt_mmap;
struct mutex sw_msi_lock; struct list_head sw_msi_list; @@ -55,6 +57,12 @@ struct iommufd_ctx { struct iommufd_ioas *vfio_ioas; }; +/* Entry for iommufd_ctx::mt_mmap */ +struct iommufd_mmap {
- unsigned long pfn_start;
- unsigned long pfn_end;
+};
/*
- The IOVA to PFN map. The map automatically copies the PFNs into multiple
- domains and permits sharing of PFNs between io_pagetable instances. This
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 5dff154e8ce1..d63e2d91be0d 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -236,6 +236,9 @@ int iommufd_object_depend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended); +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id);
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id); struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id); int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, @@ -262,11 +265,23 @@ static inline int iommufd_object_depend(struct iommufd_object *obj_dependent, return -EOPNOTSUPP; } +static inline int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx,
phys_addr_t base, size_t size,
unsigned long *immap_id)
+{
- return -EOPNOTSUPP;
+}
static inline void iommufd_object_undepend(struct iommufd_object *obj_dependent, struct iommufd_object *obj_depended) { } +static inline void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx,
unsigned long immap_id)
+{ +}
static inline struct device * iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) { diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index fb7f8fe40f95..c55336c580dc 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -78,6 +78,45 @@ void iommufd_object_undepend(struct iommufd_object *obj_dependent, } EXPORT_SYMBOL_NS_GPL(iommufd_object_undepend, "IOMMUFD"); +/* Driver should report the output @immap_id to user space for mmap() syscall */ +int iommufd_ctx_alloc_mmap(struct iommufd_ctx *ictx, phys_addr_t base,
size_t size, unsigned long *immap_id)
+{
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
I believe this should be sizeof(*immap) ?
Ugh, Sorry, shouldn't this be size >> PAGE_SHIFT (num_indices to alloc) ?
0, LONG_MAX >> PAGE_SHIFT, GFP_KERNEL);
- if (rc < 0) {
kfree(immap);
return rc;
- }
- /* mmap() syscall will right-shift the immap_id to vma->vm_pgoff */
- *immap_id <<= PAGE_SHIFT;
- return 0;
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_alloc_mmap, "IOMMUFD");
+void iommufd_ctx_free_mmap(struct iommufd_ctx *ictx, unsigned long immap_id) +{
- kfree(mtree_erase(&ictx->mt_mmap, immap_id >> PAGE_SHIFT));
+} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_free_mmap, "IOMMUFD");
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index ac51d5cfaa61..4b46ea47164d 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -213,6 +213,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp) xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT); xa_init(&ictx->groups); ictx->file = filp;
- mt_init_flags(&ictx->mt_mmap, MT_FLAGS_ALLOC_RANGE); init_waitqueue_head(&ictx->destroy_wait); mutex_init(&ictx->sw_msi_lock); INIT_LIST_HEAD(&ictx->sw_msi_list);
@@ -410,11 +411,49 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, return ret; } +/*
- Kernel driver must first do iommufd_ctx_alloc_mmap() to register an mmappable
- MMIO region to the iommufd core to receive an "immap_id". Then, driver should
- report to user space this immap_id and the size of the registered MMIO region
- for @vm_pgoff and @size of an mmap() call, via an IOMMU_VIOMMU_ALLOC ioctl in
- the output fields of its driver-type data structure.
- Note the @size is allowed to be smaller than the registered size as a partial
- mmap starting from the registered base address.
- */
+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma) +{
- struct iommufd_ctx *ictx = filp->private_data;
- size_t size = vma->vm_end - vma->vm_start;
- struct iommufd_mmap *immap;
- if (size & ~PAGE_MASK)
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;
- if (vma->vm_flags & VM_EXEC)
return -EPERM;
- /* vm_pgoff carries an index (immap_id) to an mtree entry (immap) */
- immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff);
- if (!immap)
return -ENXIO;
- if (size >> PAGE_SHIFT > immap->pfn_end - immap->pfn_start + 1)
return -ENXIO;
- vma->vm_pgoff = 0;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
- return remap_pfn_range(vma, vma->vm_start, immap->pfn_start, size,
vma->vm_page_prot);
+}
static const struct file_operations iommufd_fops = { .owner = THIS_MODULE, .open = iommufd_fops_open, .release = iommufd_fops_release, .unlocked_ioctl = iommufd_fops_ioctl,
- .mmap = iommufd_fops_mmap,
}; /**
Thanks, Praan
-- 2.43.0
On Tue, Apr 29, 2025 at 08:34:56PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 08:24:33PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:08PM -0700, Nicolin Chen wrote:
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
I believe this should be sizeof(*immap) ?
Ugh, Sorry, shouldn't this be size >> PAGE_SHIFT (num_indices to alloc) ?
mtree_load() returns a "struct iommufd_mmap *" pointer.
Nicolin
On Tue, Apr 29, 2025 at 01:39:09PM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 08:34:56PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 08:24:33PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:08PM -0700, Nicolin Chen wrote:
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
I believe this should be sizeof(*immap) ?
Ugh, Sorry, shouldn't this be size >> PAGE_SHIFT (num_indices to alloc) ?
mtree_load() returns a "struct iommufd_mmap *" pointer.
I'm not talking about mtree_load. I meant mtree_alloc_range takes in a "size" parameter, which is being passed as sizeof(immap) in this patch. IIUC, mtree_alloc_range, via mas_empty_area, gets a range that is sufficient for the given "size".
Now in this case, "size" would be the no. of pfns which are mmap-able. By passing sizeof(immap), we're simply reserving sizeof(ptr), i.e. 8 pfns on a 64-bit machine. Whereas we really just want to reserve a range for size >> PAGE_SHIFT pfns.
Nicolin
Thanks, Praan
On Tue, Apr 29, 2025 at 08:55:47PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 01:39:09PM -0700, Nicolin Chen wrote:
On Tue, Apr 29, 2025 at 08:34:56PM +0000, Pranjal Shrivastava wrote:
On Tue, Apr 29, 2025 at 08:24:33PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:08PM -0700, Nicolin Chen wrote:
- struct iommufd_mmap *immap;
- int rc;
- if (WARN_ON_ONCE(!immap_id))
return -EINVAL;
- if (base & ~PAGE_MASK)
return -EINVAL;
- if (!size || size & ~PAGE_MASK)
return -EINVAL;
- immap = kzalloc(sizeof(*immap), GFP_KERNEL);
- if (!immap)
return -ENOMEM;
- immap->pfn_start = base >> PAGE_SHIFT;
- immap->pfn_end = immap->pfn_start + (size >> PAGE_SHIFT) - 1;
- rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
I believe this should be sizeof(*immap) ?
Ugh, Sorry, shouldn't this be size >> PAGE_SHIFT (num_indices to alloc) ?
mtree_load() returns a "struct iommufd_map *" pointer.
I'm not talking about mtree_load. I meant mtree_alloc_range takes in a "size" parameter, which is being passed as sizeof(immap) in this patch. IIUC, mtree_alloc_range, via mas_empty_area, gets a range that is sufficient for the given "size".
Now in this case, "size" would be the no. of pfns which are mmap-able. By passing sizeof(immap), we're simply reserving sizeof(ptr), i.e. 8 pfns on a 64-bit machine. Whereas we really just want to reserve a range for size >> PAGE_SHIFT pfns.
But we are not storing pfns, only the immap pointer..
Nicolin
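For context on this exchange: the "size" argument of mtree_alloc_range() is the number of consecutive indices to reserve for the stored entry within [min, max], and the stored entry is always just the single pointer. A side-by-side sketch of the two readings, reusing the local variables of the posted function and not taken from the series:

	/* As posted: reserves sizeof(immap) == 8 indices for this region */
	rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap, sizeof(immap),
			       0, LONG_MAX >> PAGE_SHIFT, GFP_KERNEL);

	/*
	 * Alternative raised in review: reserve one index per mmappable page,
	 * so every page offset within the region resolves to the same entry.
	 */
	rc = mtree_alloc_range(&ictx->mt_mmap, immap_id, immap,
			       size >> PAGE_SHIFT, 0, LONG_MAX >> PAGE_SHIFT,
			       GFP_KERNEL);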
Extend the loopback test to cover a new mmap page.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 4 +++ drivers/iommu/iommufd/selftest.c | 37 ++++++++++++++++++++++--- tools/testing/selftests/iommu/iommufd.c | 5 ++++ 3 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index a0831d78fef1..86a983f59552 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -232,12 +232,16 @@ struct iommu_hwpt_invalidate_selftest { * (IOMMU_VIOMMU_TYPE_SELFTEST) * @in_data: Input random data from user space * @out_data: Output data (matching @in_data) to user space + * @out_mmap_pgoff: The offset argument for mmap syscall + * @out_mmap_pgsz: Maximum page size for mmap syscall * * Simply set @out_data=@in_data for a loopback test */ struct iommu_viommu_selftest { __u32 in_data; __u32 out_data; + __aligned_u64 out_mmap_pgoff; + __aligned_u64 out_mmap_pgsz; };
/* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */ diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index d6cc5b78821b..cd058dcd5984 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -149,6 +149,9 @@ struct mock_viommu { struct iommufd_viommu core; struct mock_iommu_domain *s2_parent; struct mock_vcmdq *mock_vcmdq[IOMMU_TEST_VCMDQ_MAX]; + + unsigned long mmap_pgoff; + u32 *page; /* Mmap page to test u32 type of in_data */ };
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu) @@ -645,9 +648,12 @@ static void mock_viommu_destroy(struct iommufd_viommu *viommu) { struct mock_iommu_device *mock_iommu = container_of( viommu->iommu_dev, struct mock_iommu_device, iommu_dev); + struct mock_viommu *mock_viommu = to_mock_viommu(viommu);
if (refcount_dec_and_test(&mock_iommu->users)) complete(&mock_iommu->complete); + iommufd_ctx_free_mmap(viommu->ictx, mock_viommu->mmap_pgoff); + free_page((unsigned long)mock_viommu->page);
/* iommufd core frees mock_viommu and viommu */ } @@ -827,17 +833,40 @@ mock_viommu_alloc(struct device *dev, struct iommu_domain *domain, return ERR_CAST(mock_viommu);
if (user_data) { + mock_viommu->page = + (u32 *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + if (!mock_viommu->page) { + rc = -ENOMEM; + goto err_destroy_struct; + } + + rc = iommufd_ctx_alloc_mmap(ictx, __pa(mock_viommu->page), + PAGE_SIZE, + &mock_viommu->mmap_pgoff); + if (rc) + goto err_free_page; + + /* For loopback tests on both the page and out_data */ + *mock_viommu->page = data.in_data; data.out_data = data.in_data; + data.out_mmap_pgsz = PAGE_SIZE; + data.out_mmap_pgoff = mock_viommu->mmap_pgoff; rc = iommu_copy_struct_to_user( user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data); - if (rc) { - iommufd_struct_destroy(ictx, mock_viommu, core); - return ERR_PTR(rc); - } + if (rc) + goto err_free_mmap; }
refcount_inc(&mock_iommu->users); return &mock_viommu->core; + +err_free_mmap: + iommufd_ctx_free_mmap(ictx, mock_viommu->mmap_pgoff); +err_free_page: + free_page((unsigned long)mock_viommu->page); +err_destroy_struct: + iommufd_struct_destroy(ictx, mock_viommu, core); + return ERR_PTR(rc); }
static const struct iommu_ops mock_ops = { diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 7c464f6eb37b..f6dbee6a352c 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -2799,12 +2799,17 @@ TEST_F(iommufd_viommu, viommu_alloc_with_data) struct iommu_viommu_selftest data = { .in_data = 0xbeef, }; + uint32_t *test;
if (self->device_id) { test_cmd_viommu_alloc(self->device_id, self->hwpt_id, IOMMU_VIOMMU_TYPE_SELFTEST, &data, sizeof(data), &self->viommu_id); assert(data.out_data == data.in_data); + test = mmap(NULL, data.out_mmap_pgsz, PROT_READ | PROT_WRITE, + MAP_SHARED, self->fd, data.out_mmap_pgoff); + assert(test && *test == data.in_data); + munmap(test, data.out_mmap_pgsz); } }
With the introduction of the new object and its infrastructure, update the doc to reflect that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- Documentation/userspace-api/iommufd.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst index b0df15865dec..afca749652ee 100644 --- a/Documentation/userspace-api/iommufd.rst +++ b/Documentation/userspace-api/iommufd.rst @@ -124,6 +124,19 @@ Following IOMMUFD objects are exposed to userspace: used to allocate a vEVENTQ. Each vIOMMU can support multiple types of vEVENTS, but is confined to one vEVENTQ per vEVENTQ type.
+- IOMMUFD_OBJ_VCMDQ, representing a hardware queue as a subset of a vIOMMU's + virtualization feature for a VM to directly execute guest-issued commands to + invalidate HW cache entries holding the mappings or translations of a guest- + owned stage-1 page table. Along with this queue object, iommufd provides the + user space an mmap interface for VMM to mmap a physical MMIO region from the + host physical address space to a guest physical address space, to exclusively + control the allocated vCMDQ HW. Thus, when allocating a vCMDQ, the VMM must + request a pair of VMA info (vm_pgoff/size) for a later mmap call. The length + argument of an mmap call could be smaller than the given size for a paritial + mmap, but the given vm_pgoff (as the addr argument of the mmap call) should + never be offsetted, which also implies that the mmap will always start from + the beginning of the physical MMIO region. + All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel @@ -270,6 +283,7 @@ User visible objects are backed by following datastructures: - iommufd_viommu for IOMMUFD_OBJ_VIOMMU. - iommufd_vdevice for IOMMUFD_OBJ_VDEVICE. - iommufd_veventq for IOMMUFD_OBJ_VEVENTQ. +- iommufd_vcmdq for IOMMUFD_OBJ_VCMDQ.
Several terminologies when looking at these datastructures:
On Fri, Apr 25, 2025 at 10:58:10PM -0700, Nicolin Chen wrote:
+- IOMMUFD_OBJ_VCMDQ, representing a hardware queue as a subset of a vIOMMU's
- virtualization feature for a VM to directly execute guest-issued commands to
- invalidate HW cache entries holding the mappings or translations of a guest-
- owned stage-1 page table. Along with this queue object, iommufd provides the
- user space an mmap interface for VMM to mmap a physical MMIO region from the
- host physical address space to a guest physical address space, to exclusively
- control the allocated vCMDQ HW. Thus, when allocating a vCMDQ, the VMM must
- request a pair of VMA info (vm_pgoff/size) for a later mmap call. The length
- argument of an mmap call could be smaller than the given size for a paritial
- mmap, but the given vm_pgoff (as the addr argument of the mmap call) should
"... partial mmap, ..."
- never be offsetted, which also implies that the mmap will always start from
- the beginning of the physical MMIO region.
Thanks.
On Mon, Apr 28, 2025 at 09:31:31PM +0700, Bagas Sanjaya wrote:
On Fri, Apr 25, 2025 at 10:58:10PM -0700, Nicolin Chen wrote:
+- IOMMUFD_OBJ_VCMDQ, representing a hardware queue as a subset of a vIOMMU's
- virtualization feature for a VM to directly execute guest-issued commands to
- invalidate HW cache entries holding the mappings or translations of a guest-
- owned stage-1 page table. Along with this queue object, iommufd provides the
- user space an mmap interface for VMM to mmap a physical MMIO region from the
- host physical address space to a guest physical address space, to exclusively
- control the allocated vCMDQ HW. Thus, when allocating a vCMDQ, the VMM must
- request a pair of VMA info (vm_pgoff/size) for a later mmap call. The length
- argument of an mmap call could be smaller than the given size for a paritial
- mmap, but the given vm_pgoff (as the addr argument of the mmap call) should
"... partial mmap, ..."
Fixed. Thanks!
Nicolin
An impl driver might support its own vIOMMU object, as tegra241-cmdqv will add IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add a vsmmu_alloc op to give the impl a try; upon failure, fall back to the standard vsmmu allocation for IOMMU_VIOMMU_TYPE_ARM_SMMUV3.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 ++++++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 17 +++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 6b8f0d20dac3..a5835af72417 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -16,6 +16,7 @@ #include <linux/sizes.h>
struct arm_smmu_device; +struct arm_smmu_domain;
/* MMIO registers */ #define ARM_SMMU_IDR0 0x0 @@ -720,6 +721,11 @@ struct arm_smmu_impl_ops { int (*init_structures)(struct arm_smmu_device *smmu); struct arm_smmu_cmdq *(*get_secondary_cmdq)( struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent); + struct arm_vsmmu *(*vsmmu_alloc)( + struct arm_smmu_device *smmu, + struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx, + unsigned int viommu_type, + const struct iommu_user_data *user_data); };
/* An SMMUv3 instance */ diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index 66855cae775e..a8a78131702d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -392,10 +392,7 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, iommu_get_iommu_dev(dev, struct arm_smmu_device, iommu); struct arm_smmu_master *master = dev_iommu_priv_get(dev); struct arm_smmu_domain *s2_parent = to_smmu_domain(parent); - struct arm_vsmmu *vsmmu; - - if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3) - return ERR_PTR(-EOPNOTSUPP); + struct arm_vsmmu *vsmmu = ERR_PTR(-EOPNOTSUPP);
if (!(smmu->features & ARM_SMMU_FEAT_NESTING)) return ERR_PTR(-EOPNOTSUPP); @@ -423,8 +420,16 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, !(smmu->features & ARM_SMMU_FEAT_S2FWB)) return ERR_PTR(-EOPNOTSUPP);
- vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core, - &arm_vsmmu_ops); + if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc) + vsmmu = master->smmu->impl_ops->vsmmu_alloc( + master->smmu, s2_parent, ictx, viommu_type, user_data); + if (PTR_ERR(vsmmu) == -EOPNOTSUPP) { + if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3) + return ERR_PTR(-EOPNOTSUPP); + /* Fallback to standard SMMUv3 type if viommu_type matches */ + vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core, + &arm_vsmmu_ops); + } if (IS_ERR(vsmmu)) return ERR_CAST(vsmmu);
Repurpose the @__reserved field in struct iommu_hw_info_arm_smmuv3 as a HW implementation-defined field @impl.
This will be used by the Tegra241 CMDQV implementation on top of a standard ARM SMMUv3 IOMMU. The @impl field will only be valid if @flags has an implementation-defined flag set.
Thus, at the driver level, add an hw_info impl op that returns such a flag indicating that the impl field is in use.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 + include/uapi/linux/iommufd.h | 4 ++-- .../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 16 +++++++++++++--- 3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index a5835af72417..bab7a9ce1283 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -726,6 +726,7 @@ struct arm_smmu_impl_ops { struct arm_smmu_domain *smmu_domain, struct iommufd_ctx *ictx, unsigned int viommu_type, const struct iommu_user_data *user_data); + u32 (*hw_info)(struct arm_smmu_device *smmu, u32 *impl); };
/* An SMMUv3 instance */ diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 06a763fda47f..b2614f0f1547 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -554,7 +554,7 @@ struct iommu_hw_info_vtd { * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) * * @flags: Must be set to 0 - * @__reserved: Must be 0 + * @impl: Must be 0 * @idr: Implemented features for ARM SMMU Non-secure programming interface * @iidr: Information about the implementation and implementer of ARM SMMU, * and architecture version supported @@ -585,7 +585,7 @@ struct iommu_hw_info_vtd { */ struct iommu_hw_info_arm_smmuv3 { __u32 flags; - __u32 __reserved; + __u32 impl; __u32 idr[6]; __u32 iidr; __u32 aidr; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index a8a78131702d..63861c60b615 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -10,7 +10,9 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type) { struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct arm_smmu_device *smmu = master->smmu; struct iommu_hw_info_arm_smmuv3 *info; + u32 flags = 0, impl = 0; u32 __iomem *base_idr; unsigned int i;
@@ -18,15 +20,23 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type) if (!info) return ERR_PTR(-ENOMEM);
- base_idr = master->smmu->base + ARM_SMMU_IDR0; + base_idr = smmu->base + ARM_SMMU_IDR0; for (i = 0; i <= 5; i++) info->idr[i] = readl_relaxed(base_idr + i); - info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR); - info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR); + info->iidr = readl_relaxed(smmu->base + ARM_SMMU_IIDR); + info->aidr = readl_relaxed(smmu->base + ARM_SMMU_AIDR);
*length = sizeof(*info); *type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
+ if (smmu->impl_ops && smmu->impl_ops->hw_info) { + flags = smmu->impl_ops->hw_info(smmu, &impl); + if (flags) { + info->impl = impl; + info->flags |= flags; + } + } + return info; }
A vIRQ can be reported only from a threaded IRQ context. Change to request_threaded_irq() to support that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index dd7d030d2e89..ba029f7d24ce 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -824,8 +824,9 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->dev = smmu->impl_dev;
if (cmdqv->irq > 0) { - ret = request_irq(irq, tegra241_cmdqv_isr, 0, "tegra241-cmdqv", - cmdqv); + ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr, + IRQF_ONESHOT, "tegra241-cmdqv", + cmdqv); if (ret) { dev_err(cmdqv->dev, "failed to request irq (%d): %d\n", cmdqv->irq, ret);
The current flow of tegra241_cmdqv_remove_vintf() is:
 1. For each LVCMDQ, tegra241_vintf_remove_lvcmdq():
    a. Disable the LVCMDQ HW
    b. Release the LVCMDQ SW resource
 2. For the current VINTF, tegra241_vintf_hw_deinit():
    c. Disable all LVCMDQ HWs
    d. Disable the VINTF HW
Obviously, steps 1.a and 2.c are redundant.
Since tegra241_vintf_hw_deinit() disables all of its LVCMDQ HWs, the flow in tegra241_cmdqv_remove_vintf() can be simplified by calling it first:
 1. For the current VINTF, tegra241_vintf_hw_deinit():
    a. Disable all LVCMDQ HWs
    b. Disable the VINTF HW
 2. Release all LVCMDQ SW resources
Drop tegra241_vintf_remove_lvcmdq(), and make tegra241_vintf_free_lvcmdq() the new step 2.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index ba029f7d24ce..8d418c131b1b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -628,24 +628,17 @@ static int tegra241_cmdqv_init_vintf(struct tegra241_cmdqv *cmdqv, u16 max_idx,
/* Remove Helpers */
-static void tegra241_vintf_remove_lvcmdq(struct tegra241_vintf *vintf, u16 lidx) -{ - tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]); - tegra241_vintf_free_lvcmdq(vintf, lidx); -} - static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx) { struct tegra241_vintf *vintf = cmdqv->vintfs[idx]; u16 lidx;
+ tegra241_vintf_hw_deinit(vintf); + /* Remove LVCMDQ resources */ for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) if (vintf->lvcmdqs[lidx]) - tegra241_vintf_remove_lvcmdq(vintf, lidx); - - /* Remove VINTF resources */ - tegra241_vintf_hw_deinit(vintf); + tegra241_vintf_free_lvcmdq(vintf, lidx);
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx); tegra241_cmdqv_deinit_vintf(cmdqv, idx);
To simplify the mappings from global VCMDQs to VINTFs' LVCMDQs, the design chose to do static allocations and mappings in the global reset function.
However, with the user-owned VINTF support, this exposes a security concern: if a user-space VM only wants one LVCMDQ for a VINTF, statically mapping two LVCMDQs leaves a hidden VCMDQ that user space could abuse for a DoS attack, by writing random stuff to overwhelm the kernel with unhandleable IRQs.
Thus, to support the user-owned VINTF feature, a LVCMDQ mapping has to be done dynamically.
HW allows pre-assigning global VCMDQs in the CMDQ_ALLOC registers, without finalizing the mappings by keeping CMDQV_CMDQ_ALLOCATED=0. So, add a pair of map/unmap helpers that simply set/clear that bit.
Delay the LVCMDQ mappings to tegra241_vintf_hw_init(), and the unmappings to tegra241_vintf_hw_deinit().
However, the dynamic LVCMDQ mapping/unmapping can complicate the timing of calling tegra241_vcmdq_hw_init/deinit(), which write to the LVCMDQ address space, i.e. they require the LVCMDQ to be mapped. Highlight that with a note at the top of each of them.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 37 +++++++++++++++++-- 1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 8d418c131b1b..869c90b660c1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -351,6 +351,7 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
/* HW Reset Functions */
+/* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) { char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); @@ -379,6 +380,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h); }
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) { char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); @@ -404,16 +406,42 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) return 0; }
+/* Unmap a global VCMDQ from the pre-assigned LVCMDQ */ +static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq) +{ + u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); + + writel(regval & ~CMDQV_CMDQ_ALLOCATED, + REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + dev_dbg(vcmdq->cmdqv->dev, "%sunmapped\n", h); +} + static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) { - u16 lidx; + u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
- for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) - if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) + /* HW requires to unmap LVCMDQs in descending order */ + while (lidx--) { + if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) { tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]); + tegra241_vcmdq_unmap_lvcmdq(vintf->lvcmdqs[lidx]); + } + } vintf_write_config(vintf, 0); }
+/* Map a global VCMDQ to the pre-assigned LVCMDQ */ +static void tegra241_vcmdq_map_lvcmdq(struct tegra241_vcmdq *vcmdq) +{ + u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + char header[64], *h = lvcmdq_error_header(vcmdq, header, 64); + + writel(regval | CMDQV_CMDQ_ALLOCATED, + REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx))); + dev_dbg(vcmdq->cmdqv->dev, "%smapped\n", h); +} + static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) { u32 regval; @@ -441,8 +469,10 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) */ vintf->hyp_own = !!(VINTF_HYP_OWN & readl(REG_VINTF(vintf, CONFIG)));
+ /* HW requires to map LVCMDQs in ascending order */ for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) { if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) { + tegra241_vcmdq_map_lvcmdq(vintf->lvcmdqs[lidx]); ret = tegra241_vcmdq_hw_init(vintf->lvcmdqs[lidx]); if (ret) { tegra241_vintf_hw_deinit(vintf); @@ -476,7 +506,6 @@ static int tegra241_cmdqv_hw_reset(struct arm_smmu_device *smmu) for (lidx = 0; lidx < cmdqv->num_lvcmdqs_per_vintf; lidx++) { regval = FIELD_PREP(CMDQV_CMDQ_ALLOC_VINTF, idx); regval |= FIELD_PREP(CMDQV_CMDQ_ALLOC_LVCMDQ, lidx); - regval |= CMDQV_CMDQ_ALLOCATED; writel_relaxed(regval, REG_CMDQV(cmdqv, CMDQ_ALLOC(qidx++))); }
On 26-04-2025 11:28, Nicolin Chen wrote:
However, with the user-owned VINTF support, it exposes a security concern: if user space VM only wants one LVCMDQ for a VINTF, statically mapping two LVCMDQs creates a hidden VCMDQ that user space could DoS attack by writing ramdon stuff to overwhelm the kernel with unhandleable IRQs.
typo ramdon -> random
Thus, to support the user-owned VINTF feature, a LVCMDQ mapping has to be done dynamically.
Thanks, Alok
The CMDQV HW supports user-space use for virtualization cases. It allows a VM to issue guest-level TLBI or ATC_INV commands directly to the queue and have them executed without a VMEXIT, as HW will replace the VMID field in a TLBI command and the SID field in an ATC_INV command with the preset VMID and SID.
This is built upon the vIOMMU infrastructure by allowing VMM to allocate a VINTF (as a vIOMMU object) and assign VCMDQs (vCMDQ objects) to the VINTF.
So firstly, replace the standard vSMMU model with the VINTF implementation but reuse the standard cache_invalidate op (for unsupported commands) and the standard alloc_domain_nested op (for standard nested STE).
Each VINTF has two 64KB MMIO pages (128B per logical vCMDQ):
 - Page0 (directly accessed by guest) has all the control and status bits.
 - Page1 (trapped by VMM) has guest-owned queue memory location/size info.
The VMM should trap the guest VM's emulated VINTF0 page1 to get the guest-level VCMDQ location/size info, and forward that to the kernel, which translates it to a physical memory location and programs the VCMDQ HW during an allocation call. Then, it should mmap the assigned VINTF's page0 into the guest VM's VINTF0 page0. This allows the guest OS to read and write the guest-owned VINTF's page0 for direct control of the VCMDQ HW.
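As a rough illustration of this flow, below is a hedged VMM-side sketch. It assumes the driver-data passthrough fields of the vIOMMU allocation ioctl (data_uptr/data_len) and that the returned pgoff/pgsz pair is passed straight to mmap() on the iommufd fd; those names and the offset units are assumptions based on this series' description, not definitions from this patch.

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/iommufd.h>

/* Hypothetical VMM-side sketch: allocate a TEGRA241_CMDQV vIOMMU and mmap
 * its VINTF page0 through the iommufd fd. Driver-data field names and the
 * exact mmap offset units are assumptions based on this series' design.
 */
static void *map_vintf_page0(int iommufd, uint32_t dev_id, uint32_t s2_hwpt_id)
{
	struct iommu_viommu_tegra241_cmdqv data = {};
	struct iommu_viommu_alloc cmd = {
		.size = sizeof(cmd),
		.type = IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
		.dev_id = dev_id,		/* a device bound to this iommufd */
		.hwpt_id = s2_hwpt_id,		/* the nesting parent HWPT */
		/* Driver-data passthrough fields: names assumed, not from this patch */
		.data_uptr = (uintptr_t)&data,
		.data_len = sizeof(data),
	};

	if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd))
		err(1, "IOMMU_VIOMMU_ALLOC");

	/* The kernel filled in the pgoff/pgsz pair during the allocation */
	return mmap(NULL, data.out_vintf_page0_pgsz, PROT_READ | PROT_WRITE,
		    MAP_SHARED, iommufd, data.out_vintf_page0_pgoff);
}

The VMM would then expose this mapping to the guest as its emulated VINTF0 page0, so that guest accesses to the consumer/producer registers reach the HW without trapping.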
For ATC invalidation commands that hold an SID, the HW requires all devices to register their virtual SIDs in the SID_MATCH registers and their physical SIDs in the paired SID_REPLACE registers, so that HW can use those as a lookup table to replace the virtual SIDs with the correct physical SIDs. Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid structure to allocate a SID_REPLACE slot and to program the SIDs accordingly.
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 15 + include/uapi/linux/iommufd.h | 49 ++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 6 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 374 +++++++++++++++++- 4 files changed, 435 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index bab7a9ce1283..d3f18a286447 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1000,6 +1000,14 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq, u64 *cmds, int n, bool sync);
+static inline phys_addr_t +arm_smmu_domain_ipa_to_pa(struct arm_smmu_domain *smmu_domain, u64 ipa) +{ + if (WARN_ON_ONCE(smmu_domain->stage != ARM_SMMU_DOMAIN_S2)) + return 0; + return iommu_iova_to_phys(&smmu_domain->domain, ipa); +} + #ifdef CONFIG_ARM_SMMU_V3_SVA bool arm_smmu_sva_supported(struct arm_smmu_device *smmu); bool arm_smmu_master_sva_supported(struct arm_smmu_master *master); @@ -1076,9 +1084,16 @@ int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state); void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master); int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt); +struct iommu_domain * +arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags, + const struct iommu_user_data *user_data); +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array); #else #define arm_smmu_hw_info NULL #define arm_vsmmu_alloc NULL +#define arm_vsmmu_alloc_domain_nested NULL +#define arm_vsmmu_cache_invalidate NULL
static inline int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index b2614f0f1547..d69e7c1d39ea 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -549,12 +549,25 @@ struct iommu_hw_info_vtd { __aligned_u64 ecap_reg; };
+/** + * enum iommu_hw_info_arm_smmuv3_flags - Flags for ARM SMMUv3 hw_info + * @IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV: Tegra241 implementation with + * CMDQV support; @impl is valid + */ +enum iommu_hw_info_arm_smmuv3_flags { + IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV = 1 << 0, +}; + /** * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) * - * @flags: Must be set to 0 - * @impl: Must be 0 + * @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags + * @impl: Implementation-defined bits when the following flags are set: + * - IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV + * Bits[15:12] - Log2 of the total number of SID replacements + * Bits[07:04] - Log2 of the total number of vCMDQs per vIOMMU + * Bits[03:00] - Version number for the CMDQ-V HW * @idr: Implemented features for ARM SMMU Non-secure programming interface * @iidr: Information about the implementation and implementer of ARM SMMU, * and architecture version supported @@ -952,10 +965,28 @@ struct iommu_fault_alloc { * enum iommu_viommu_type - Virtual IOMMU Type * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3 */ enum iommu_viommu_type { IOMMU_VIOMMU_TYPE_DEFAULT = 0, IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2, +}; + +/** + * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface + * (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV) + * @out_vintf_page0_pgoff: Offset of the VINTF page0 for mmap syscall + * @out_vintf_page0_pgsz: Size of the VINTF page0 for mmap syscall + * + * Both @out_vintf_page0_pgoff and @out_vintf_page0_pgsz are given by the kernel + * for user space to mmap the VINTF page0 from the host physical address space + * to the guest physical address space so that a guest kernel can directly R/W + * access to the VINTF page0 in order to control its virtual comamnd queues. + */ +struct iommu_viommu_tegra241_cmdqv { + __aligned_u64 out_vintf_page0_pgoff; + __aligned_u64 out_vintf_page0_pgsz; };
/** @@ -1152,9 +1183,23 @@ struct iommu_veventq_alloc { /** * enum iommu_vcmdq_type - Virtual Command Queue Type * @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use + * @IOMMU_VCMDQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3 */ enum iommu_vcmdq_type { IOMMU_VCMDQ_TYPE_DEFAULT = 0, + /* + * TEGRA241_CMDQV requirements (otherwise it will fail) + * - alloc starts from the lowest @index=0 in ascending order + * - destroy starts from the last allocated @index in descending order + * - @addr must be aligned to @length in bytes and be mmapped in IOAS + * - @length must be a power of 2, with a minimum 32 bytes and a maximum + * 1 ^ idr[1].CMDQS x 16 bytes (do GET_HW_INFO call to read idr[1] in + * struct iommu_hw_info_arm_smmuv3) + * - suggest to back the queue memory with contiguous physical pages or + * a single huge page with alignment of the queue size, limit vSMMU's + * IDR1.CMDQS to the huge page size divided by 16 bytes + */ + IOMMU_VCMDQ_TYPE_TEGRA241_CMDQV = 1, };
/** diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index 63861c60b615..40246cd04656 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -226,7 +226,7 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg, return 0; }
-static struct iommu_domain * +struct iommu_domain * arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags, const struct iommu_user_data *user_data) { @@ -337,8 +337,8 @@ static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu, return 0; }
-static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, - struct iommu_user_data_array *array) +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array) { struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core); struct arm_smmu_device *smmu = vsmmu->smmu; diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 869c90b660c1..88e2b6506b3a 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -8,7 +8,9 @@ #include <linux/dma-mapping.h> #include <linux/interrupt.h> #include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/iopoll.h> +#include <uapi/linux/iommufd.h>
#include <acpi/acpixf.h>
@@ -26,8 +28,10 @@ #define CMDQV_EN BIT(0)
#define TEGRA241_CMDQV_PARAM 0x0004 +#define CMDQV_NUM_SID_PER_VM_LOG2 GENMASK(15, 12) #define CMDQV_NUM_VINTF_LOG2 GENMASK(11, 8) #define CMDQV_NUM_VCMDQ_LOG2 GENMASK(7, 4) +#define CMDQV_VER GENMASK(3, 0)
#define TEGRA241_CMDQV_STATUS 0x0008 #define CMDQV_ENABLED BIT(0) @@ -53,6 +57,9 @@ #define VINTF_STATUS GENMASK(3, 1) #define VINTF_ENABLED BIT(0)
+#define TEGRA241_VINTF_SID_MATCH(s) (0x0040 + 0x4*(s)) +#define TEGRA241_VINTF_SID_REPLACE(s) (0x0080 + 0x4*(s)) + #define TEGRA241_VINTF_LVCMDQ_ERR_MAP_64(m) \ (0x00C0 + 0x8*(m)) #define LVCMDQ_ERR_MAP_NUM_64 2 @@ -114,16 +121,20 @@ MODULE_PARM_DESC(bypass_vcmdq,
/** * struct tegra241_vcmdq - Virtual Command Queue + * @core: Embedded iommufd_vcmdq structure * @idx: Global index in the CMDQV * @lidx: Local index in the VINTF * @enabled: Enable status * @cmdqv: Parent CMDQV pointer * @vintf: Parent VINTF pointer + * @prev: Previous LVCMDQ to depend on * @cmdq: Command Queue struct * @page0: MMIO Page0 base address * @page1: MMIO Page1 base address */ struct tegra241_vcmdq { + struct iommufd_vcmdq core; + u16 idx; u16 lidx;
@@ -131,22 +142,29 @@ struct tegra241_vcmdq {
struct tegra241_cmdqv *cmdqv; struct tegra241_vintf *vintf; + struct tegra241_vcmdq *prev; struct arm_smmu_cmdq cmdq;
void __iomem *page0; void __iomem *page1; }; +#define core_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)
/** * struct tegra241_vintf - Virtual Interface + * @vsmmu: Embedded arm_vsmmu structure * @idx: Global index in the CMDQV * @enabled: Enable status * @hyp_own: Owned by hypervisor (in-kernel) * @cmdqv: Parent CMDQV pointer * @lvcmdqs: List of logical VCMDQ pointers * @base: MMIO base address + * @immap_id: Allocated immap_id ID for mmap() call + * @sids: Stream ID replacement resources */ struct tegra241_vintf { + struct arm_vsmmu vsmmu; + u16 idx;
bool enabled; @@ -156,6 +174,24 @@ struct tegra241_vintf { struct tegra241_vcmdq **lvcmdqs;
void __iomem *base; + unsigned long immap_id; + + struct ida sids; +}; +#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core) + +/** + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID + * @vintf: Parent VINTF pointer + * @sid: Physical Stream ID + * @id: Slot index in the VINTF + */ +struct tegra241_vintf_sid { + struct iommufd_vdevice core; + struct tegra241_vintf *vintf; + u32 sid; + u8 idx; };
/** @@ -163,10 +199,12 @@ struct tegra241_vintf { * @smmu: SMMUv3 device * @dev: CMDQV device * @base: MMIO base address + * @base_phys: MMIO physical base address, for mmap * @irq: IRQ number * @num_vintfs: Total number of VINTFs * @num_vcmdqs: Total number of VCMDQs * @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF + * @num_sids_per_vintf: Total number of SID replacements per VINTF * @vintf_ids: VINTF id allocator * @vintfs: List of VINTFs */ @@ -175,12 +213,14 @@ struct tegra241_cmdqv { struct device *dev;
void __iomem *base; + phys_addr_t base_phys; int irq;
/* CMDQV Hardware Params */ u16 num_vintfs; u16 num_vcmdqs; u16 num_lvcmdqs_per_vintf; + u16 num_sids_per_vintf;
struct ida vintf_ids;
@@ -380,6 +420,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h); }
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */ +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) +{ + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); +} + /* This function is for LVCMDQ, so @vcmdq must be mapped prior */ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) { @@ -390,7 +436,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) tegra241_vcmdq_hw_deinit(vcmdq);
/* Configure and enable VCMDQ */ - writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); + _tegra241_vcmdq_hw_init(vcmdq);
ret = vcmdq_write_config(vcmdq, VCMDQ_EN); if (ret) { @@ -420,6 +466,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq) static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) { u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf; + int sidx;
/* HW requires to unmap LVCMDQs in descending order */ while (lidx--) { @@ -429,6 +476,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf) } } vintf_write_config(vintf, 0); + for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) { + writel_relaxed(0, REG_VINTF(vintf, SID_REPLACE(sidx))); + writel_relaxed(0, REG_VINTF(vintf, SID_MATCH(sidx))); + } }
/* Map a global VCMDQ to the pre-assigned LVCMDQ */ @@ -457,7 +508,8 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own) * whether enabling it here or not, as !HYP_OWN cmdq HWs only support a * restricted set of supported commands. */ - regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own); + regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own) | + FIELD_PREP(VINTF_VMID, vintf->vsmmu.vmid); writel(regval, REG_VINTF(vintf, CONFIG));
ret = vintf_write_config(vintf, regval | VINTF_EN); @@ -584,7 +636,9 @@ static void tegra241_vintf_free_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
dev_dbg(vintf->cmdqv->dev, "%sdeallocated\n", lvcmdq_error_header(vcmdq, header, 64)); - kfree(vcmdq); + /* Guest-owned VCMDQ is free-ed with vcmdq by iommufd core */ + if (vcmdq->vintf->hyp_own) + kfree(vcmdq); }
static struct tegra241_vcmdq * @@ -623,6 +677,9 @@ tegra241_vintf_alloc_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
static void tegra241_cmdqv_deinit_vintf(struct tegra241_cmdqv *cmdqv, u16 idx) { + if (cmdqv->vintfs[idx]->immap_id) + iommufd_ctx_free_mmap(cmdqv->vintfs[idx]->vsmmu.core.ictx, + cmdqv->vintfs[idx]->immap_id); kfree(cmdqv->vintfs[idx]->lvcmdqs); ida_free(&cmdqv->vintf_ids, idx); cmdqv->vintfs[idx] = NULL; @@ -671,7 +728,11 @@ static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx); tegra241_cmdqv_deinit_vintf(cmdqv, idx); - kfree(vintf); + if (!vintf->hyp_own) + ida_destroy(&vintf->sids); + /* Guest-owned VINTF is free-ed with viommu by iommufd core */ + if (vintf->hyp_own) + kfree(vintf); }
static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu) @@ -699,10 +760,32 @@ static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu) put_device(cmdqv->dev); /* smmu->impl_dev */ }
+static struct arm_vsmmu * +tegra241_cmdqv_vsmmu_alloc(struct arm_smmu_device *smmu, + struct arm_smmu_domain *smmu_domain, + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data); + +static u32 tegra241_cmdqv_hw_info(struct arm_smmu_device *smmu, u32 *impl) +{ + struct tegra241_cmdqv *cmdqv = + container_of(smmu, struct tegra241_cmdqv, smmu); + u32 regval = readl_relaxed(REG_CMDQV(cmdqv, PARAM)); + + *impl = FIELD_GET(CMDQV_VER, regval); + *impl |= FIELD_PREP(CMDQV_NUM_VCMDQ_LOG2, + ilog2(cmdqv->num_lvcmdqs_per_vintf)); + *impl |= FIELD_PREP(CMDQV_NUM_SID_PER_VM_LOG2, + ilog2(cmdqv->num_sids_per_vintf)); + return IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV; +} + static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = { .get_secondary_cmdq = tegra241_cmdqv_get_cmdq, .device_reset = tegra241_cmdqv_hw_reset, .device_remove = tegra241_cmdqv_remove, + .vsmmu_alloc = tegra241_cmdqv_vsmmu_alloc, + .hw_info = tegra241_cmdqv_hw_info, };
/* Probe Functions */ @@ -844,6 +927,7 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->irq = irq; cmdqv->base = base; cmdqv->dev = smmu->impl_dev; + cmdqv->base_phys = res->start;
if (cmdqv->irq > 0) { ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr, @@ -860,6 +944,8 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res, cmdqv->num_vintfs = 1 << FIELD_GET(CMDQV_NUM_VINTF_LOG2, regval); cmdqv->num_vcmdqs = 1 << FIELD_GET(CMDQV_NUM_VCMDQ_LOG2, regval); cmdqv->num_lvcmdqs_per_vintf = cmdqv->num_vcmdqs / cmdqv->num_vintfs; + cmdqv->num_sids_per_vintf = + 1 << FIELD_GET(CMDQV_NUM_SID_PER_VM_LOG2, regval);
cmdqv->vintfs = kcalloc(cmdqv->num_vintfs, sizeof(*cmdqv->vintfs), GFP_KERNEL); @@ -913,3 +999,283 @@ struct arm_smmu_device *tegra241_cmdqv_probe(struct arm_smmu_device *smmu) put_device(smmu->impl_dev); return ERR_PTR(-ENODEV); } + +/* User-space vIOMMU and vCMDQ Functions */ + +static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq) +{ + char header[64]; + + /* Configure the vcmdq only; User space does the enabling */ + _tegra241_vcmdq_hw_init(vcmdq); + + dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n", + lvcmdq_error_header(vcmdq, header, 64), + vcmdq->cmdq.q.q_base & VCMDQ_ADDR, + 1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE)); + return 0; +} + +static struct iommufd_vcmdq * +tegra241_vintf_alloc_lvcmdq_user(struct iommufd_viommu *viommu, + unsigned int type, u32 index, dma_addr_t addr, + size_t length) +{ + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + struct tegra241_cmdqv *cmdqv = vintf->cmdqv; + struct arm_smmu_device *smmu = &cmdqv->smmu; + struct tegra241_vcmdq *vcmdq, *prev = NULL; + u32 log2size, max_n_shift; + phys_addr_t q_base; + char header[64]; + int ret; + + if (type != IOMMU_VCMDQ_TYPE_TEGRA241_CMDQV) + return ERR_PTR(-EOPNOTSUPP); + if (index >= cmdqv->num_lvcmdqs_per_vintf) + return ERR_PTR(-EINVAL); + if (vintf->lvcmdqs[index]) + return ERR_PTR(-EEXIST); + /* + * HW requires to map LVCMDQs in ascending order, so reject if the + * previous lvcmdqs is not allocated yet. + */ + if (index) { + prev = vintf->lvcmdqs[index - 1]; + if (!prev) + return ERR_PTR(-EIO); + } + /* + * @length must be a power of 2, in range of + * [ 32, 1 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ] + */ + max_n_shift = FIELD_GET(IDR1_CMDQS, + readl_relaxed(smmu->base + ARM_SMMU_IDR1)); + if (!is_power_of_2(length) || length < 32 || + length > (1 << (max_n_shift + CMDQ_ENT_SZ_SHIFT))) + return ERR_PTR(-EINVAL); + log2size = ilog2(length) - CMDQ_ENT_SZ_SHIFT; + + /* @addr must be aligned to @length and be mapped in s2_parent domain */ + if (addr & ~VCMDQ_ADDR || addr & (length - 1)) + return ERR_PTR(-EINVAL); + q_base = arm_smmu_domain_ipa_to_pa(vintf->vsmmu.s2_parent, addr); + if (!q_base) + return ERR_PTR(-ENXIO); + + vcmdq = iommufd_vcmdq_alloc(viommu, struct tegra241_vcmdq, core); + if (!vcmdq) + return ERR_PTR(-ENOMEM); + + /* + * HW requires to unmap LVCMDQs in descending order, so destroy() must + * follow this rule. Set a dependency on its previous LVCMDQ so iommufd + * core will help enforce it. 
+ */ + if (prev) { + ret = iommufd_vcmdq_depend(vcmdq, prev, core); + if (ret) + goto free_vcmdq; + } + vcmdq->prev = prev; + + ret = tegra241_vintf_init_lvcmdq(vintf, index, vcmdq); + if (ret) + goto free_vcmdq; + + dev_dbg(cmdqv->dev, "%sallocated\n", + lvcmdq_error_header(vcmdq, header, 64)); + + tegra241_vcmdq_map_lvcmdq(vcmdq); + + vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR; + vcmdq->cmdq.q.q_base |= log2size; + + ret = tegra241_vcmdq_hw_init_user(vcmdq); + if (ret) + goto free_vcmdq; + vintf->lvcmdqs[index] = vcmdq; + + return &vcmdq->core; +free_vcmdq: + iommufd_struct_destroy(viommu->ictx, vcmdq, core); + return ERR_PTR(ret); +} + +static void tegra241_vintf_destroy_lvcmdq_user(struct iommufd_vcmdq *core) +{ + struct tegra241_vcmdq *vcmdq = core_to_vcmdq(core); + + tegra241_vcmdq_hw_deinit(vcmdq); + tegra241_vcmdq_unmap_lvcmdq(vcmdq); + tegra241_vintf_free_lvcmdq(vcmdq->vintf, vcmdq->lidx); + if (vcmdq->prev) + iommufd_vcmdq_undepend(vcmdq, vcmdq->prev, core); + + /* IOMMUFD core frees the memory of vcmdq and vcmdq */ +} + +static void tegra241_cmdqv_destroy_vintf_user(struct iommufd_viommu *viommu) +{ + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + + tegra241_cmdqv_remove_vintf(vintf->cmdqv, vintf->idx); + + /* IOMMUFD core frees the memory of vintf and viommu */ +} + +static struct iommufd_vdevice * +tegra241_vintf_alloc_vdevice(struct iommufd_viommu *viommu, struct device *dev, + u64 dev_id) +{ + struct tegra241_vintf *vintf = viommu_to_vintf(viommu); + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct arm_smmu_stream *stream = &master->streams[0]; + struct tegra241_vintf_sid *vsid; + int sidx; + + if (dev_id > UINT_MAX) + return ERR_PTR(-EINVAL); + + vsid = iommufd_vdevice_alloc(viommu, struct tegra241_vintf_sid, core); + if (!vsid) + return ERR_PTR(-ENOMEM); + + WARN_ON_ONCE(master->num_streams != 1); + + /* Find an empty pair of SID_REPLACE and SID_MATCH */ + sidx = ida_alloc_max(&vintf->sids, vintf->cmdqv->num_sids_per_vintf - 1, + GFP_KERNEL); + if (sidx < 0) { + iommufd_struct_destroy(viommu->ictx, vsid, core); + return ERR_PTR(sidx); + } + + writel_relaxed(stream->id, REG_VINTF(vintf, SID_REPLACE(sidx))); + writel_relaxed(dev_id << 1 | 0x1, REG_VINTF(vintf, SID_MATCH(sidx))); + dev_dbg(vintf->cmdqv->dev, + "VINTF%u: allocated SID_REPLACE%d for pSID=%x, vSID=%x\n", + vintf->idx, sidx, stream->id, (u32)dev_id); + + vsid->idx = sidx; + vsid->vintf = vintf; + vsid->sid = stream->id; + + return &vsid->core; +} + +static void tegra241_vintf_destroy_vdevice(struct iommufd_vdevice *vdev) +{ + struct tegra241_vintf_sid *vsid = + container_of(vdev, struct tegra241_vintf_sid, core); + struct tegra241_vintf *vintf = vsid->vintf; + + writel_relaxed(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx))); + writel_relaxed(0, REG_VINTF(vintf, SID_MATCH(vsid->idx))); + ida_free(&vintf->sids, vsid->idx); + dev_dbg(vintf->cmdqv->dev, + "VINTF%u: deallocated SID_REPLACE%d for pSID=%x\n", vintf->idx, + vsid->idx, vsid->sid); + + /* IOMMUFD core frees the memory of vsid and vdev */ +} + +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = { + .destroy = tegra241_cmdqv_destroy_vintf_user, + .alloc_domain_nested = arm_vsmmu_alloc_domain_nested, + .cache_invalidate = arm_vsmmu_cache_invalidate, + .vdevice_alloc = tegra241_vintf_alloc_vdevice, + .vdevice_destroy = tegra241_vintf_destroy_vdevice, + .vcmdq_alloc = tegra241_vintf_alloc_lvcmdq_user, + .vcmdq_destroy = tegra241_vintf_destroy_lvcmdq_user, +}; + +static struct arm_vsmmu * +tegra241_cmdqv_vsmmu_alloc(struct 
arm_smmu_device *smmu, + struct arm_smmu_domain *s2_parent, + struct iommufd_ctx *ictx, unsigned int viommu_type, + const struct iommu_user_data *user_data) +{ + struct tegra241_cmdqv *cmdqv = + container_of(smmu, struct tegra241_cmdqv, smmu); + struct iommu_viommu_tegra241_cmdqv data; + struct tegra241_vintf *vintf; + phys_addr_t page0_base; + int ret; + + if (viommu_type != IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV) + return ERR_PTR(-EOPNOTSUPP); + if (!user_data) + return ERR_PTR(-EINVAL); + + ret = iommu_copy_struct_from_user(&data, user_data, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV, + out_vintf_page0_pgsz); + if (ret) + return ERR_PTR(ret); + + vintf = iommufd_viommu_alloc(ictx, struct tegra241_vintf, vsmmu.core, + &tegra241_cmdqv_viommu_ops); + if (!vintf) + return ERR_PTR(-ENOMEM); + + ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf); + if (ret < 0) { + dev_err(cmdqv->dev, "no more available vintf\n"); + goto free_vintf; + } + + vintf->vsmmu.smmu = smmu; + vintf->vsmmu.s2_parent = s2_parent; + /* FIXME Move VMID allocation from the S2 domain allocation to here */ + vintf->vsmmu.vmid = s2_parent->s2_cfg.vmid; + + /* + * Initialize the user-owned VINTF without a LVCMDQ, because it has to + * wait for the allocation of a user-owned LVCMDQ, for security reason. + * It is different than the kernel-owned VINTF0, which had pre-assigned + * and pre-allocated global VCMDQs that would be mapped to the LVCMDQs + * by the tegra241_vintf_hw_init() call. + */ + ret = tegra241_vintf_hw_init(vintf, false); + if (ret) + goto deinit_vintf; + + vintf->lvcmdqs = kcalloc(cmdqv->num_lvcmdqs_per_vintf, + sizeof(*vintf->lvcmdqs), GFP_KERNEL); + if (!vintf->lvcmdqs) { + ret = -ENOMEM; + goto hw_deinit_vintf; + } + + page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx); + ret = iommufd_ctx_alloc_mmap(ictx, page0_base, SZ_64K, + &vintf->immap_id); + if (ret) + goto hw_deinit_vintf; + + data.out_vintf_page0_pgsz = SZ_64K; + data.out_vintf_page0_pgoff = vintf->immap_id; + ret = iommu_copy_struct_to_user(user_data, &data, + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV, + out_vintf_page0_pgsz); + if (ret) + goto free_mmap; + + ida_init(&vintf->sids); + + dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx, + vintf->vsmmu.vmid); + + return &vintf->vsmmu; + +free_mmap: + iommufd_ctx_free_mmap(ictx, vintf->immap_id); +hw_deinit_vintf: + tegra241_vintf_hw_deinit(vintf); +deinit_vintf: + tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx); +free_vintf: + iommufd_struct_destroy(ictx, vintf, vsmmu.core); + return ERR_PTR(ret); +}
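As a side note on the hw_info extension in this patch, below is a hedged sketch of how a VMM might decode the @impl word reported via IOMMU_GET_HW_INFO once IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV is set; only the bit layout comes from the uapi kdoc above, the helper macro names are illustrative.

#include <stdint.h>
#include <linux/iommufd.h>

/* Illustrative decode of iommu_hw_info_arm_smmuv3::impl for the Tegra241
 * CMDQV case; bit layout per the uapi kdoc, macro names are hypothetical.
 */
#define CMDQV_IMPL_NUM_SID_LOG2(impl)	(((impl) >> 12) & 0xf)	/* Bits[15:12] */
#define CMDQV_IMPL_NUM_VCMDQ_LOG2(impl)	(((impl) >> 4) & 0xf)	/* Bits[07:04] */
#define CMDQV_IMPL_VER(impl)		((impl) & 0xf)		/* Bits[03:00] */

static void decode_cmdqv_impl(uint32_t flags, uint32_t impl)
{
	if (!(flags & IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV))
		return;

	unsigned int num_sids = 1u << CMDQV_IMPL_NUM_SID_LOG2(impl);
	unsigned int num_vcmdqs = 1u << CMDQV_IMPL_NUM_VCMDQ_LOG2(impl);
	unsigned int version = CMDQV_IMPL_VER(impl);

	(void)num_sids; (void)num_vcmdqs; (void)version;
}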
Hi Nicolin,
On 26-04-2025 11:28, Nicolin Chen wrote:
The CMDQV HW supports a user-space use for virtualization cases. It allows the VM to issue guest-level TLBI or ATC_INV commands directly to the queue and executes them without a VMEXIT, as HW will replace the VMID field in a TLBI command and the SID field in an ATC_INV command with the preset VMID and SID.
[clip]
+/**
+ * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface
+ *                                      (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
+ * @out_vintf_page0_pgoff: Offset of the VINTF page0 for mmap syscall
+ * @out_vintf_page0_pgsz: Size of the VINTF page0 for mmap syscall
+ *
+ * Both @out_vintf_page0_pgoff and @out_vintf_page0_pgsz are given by the kernel
+ * for user space to mmap the VINTF page0 from the host physical address space
+ * to the guest physical address space so that a guest kernel can directly R/W
+ * access to the VINTF page0 in order to control its virtual comamnd queues.
typo comamnd
+ */
+struct iommu_viommu_tegra241_cmdqv {
+	__aligned_u64 out_vintf_page0_pgoff;
+	__aligned_u64 out_vintf_page0_pgsz;
+};
 /**
@@ -1152,9 +1183,23 @@ struct iommu_veventq_alloc {
 /**
  * enum iommu_vcmdq_type - Virtual Command Queue Type
  * @IOMMU_VCMDQ_TYPE_DEFAULT: Reserved for future use
+ * @IOMMU_VCMDQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
  */
 enum iommu_vcmdq_type {
 	IOMMU_VCMDQ_TYPE_DEFAULT = 0,
+	/*
+	 * TEGRA241_CMDQV requirements (otherwise it will fail)
+	 * - alloc starts from the lowest @index=0 in ascending order
+	 * - destroy starts from the last allocated @index in descending order
+	 * - @addr must be aligned to @length in bytes and be mmapped in IOAS
+	 * - @length must be a power of 2, with a minimum 32 bytes and a maximum
+	 *   1 ^ idr[1].CMDQS x 16 bytes (do GET_HW_INFO call to read idr[1] in
This line is ambiguous to me; is it intended to express a power of 2? 1 ^ idr[1].CMDQS x 16 bytes -> (2 ^ idr[1].CMDQS) x 16 bytes ?
You could consider something like this: (2 ^ idr[1].CMDQS) * 16 bytes (use a GET_HW_INFO call to read idr[1] from struct iommu_hw_info_arm_smmuv3), or, more clearly, (2 to the power of idr[1].CMDQS).
+	 *   struct iommu_hw_info_arm_smmuv3)
+	 * - suggest to back the queue memory with contiguous physical pages or
+	 *   a single huge page with alignment of the queue size, limit vSMMU's
+	 *   IDR1.CMDQS to the huge page size divided by 16 bytes
+	 */
+	IOMMU_VCMDQ_TYPE_TEGRA241_CMDQV = 1,
 };
/**
[clip]
struct arm_smmu_device *smmu = vsmmu->smmu; diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 869c90b660c1..88e2b6506b3a 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -8,7 +8,9 @@ #include <linux/dma-mapping.h> #include <linux/interrupt.h> #include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/iopoll.h> +#include <uapi/linux/iommufd.h> #include <acpi/acpixf.h>
[clip]
+/**
+ * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
+ * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
+ * @vintf: Parent VINTF pointer
+ * @sid: Physical Stream ID
+ * @id: Slot index in the VINTF
@id -> @idx
+ */
+struct tegra241_vintf_sid {
+	struct iommufd_vdevice core;
+	struct tegra241_vintf *vintf;
+	u32 sid;
+	u8 idx;
+};
[clip]
+	/*
+	 * HW requires to map LVCMDQs in ascending order, so reject if the
+	 * previous lvcmdqs is not allocated yet.
+	 */
+	if (index) {
+		prev = vintf->lvcmdqs[index - 1];
+		if (!prev)
+			return ERR_PTR(-EIO);
+	}
+	/*
+	 * @length must be a power of 2, in range of
+	 * [ 32, 1 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) or 1 << idr[1].CMDQS
+	 */
+	max_n_shift = FIELD_GET(IDR1_CMDQS,
+				readl_relaxed(smmu->base + ARM_SMMU_IDR1));
LGTM, aside from a minor cosmetic thing.
Thanks, Alok
On Wed, Apr 30, 2025 at 01:17:48AM +0530, ALOK TIWARI wrote:
+	/*
+	 * @length must be a power of 2, in range of
+	 * [ 32, 1 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) or 1 << idr[1].CMDQS
+	 */
+	max_n_shift = FIELD_GET(IDR1_CMDQS,
+				readl_relaxed(smmu->base + ARM_SMMU_IDR1));
LGTM, aside from a minor cosmetic thing.
Fixed all those and the typo in the other mail. Picked "2 ^ " btw.
Thanks Nicolin
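To make the agreed wording concrete, here is a hedged user-space sketch of the size rule under discussion: a valid guest queue length is a power of 2 between 32 bytes and (2 ^ idr[1].CMDQS) * 16 bytes. The IDR1.CMDQS bit position follows the SMMUv3 spec; the helper name is illustrative, not part of the series.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: check a guest VCMDQ length against the rule being
 * discussed. SMMU_IDR1.CMDQS lives in bits [25:21]; each command entry is
 * 16 bytes, so the maximum is (2 ^ CMDQS) * 16 bytes.
 */
static bool tegra241_vcmdq_length_ok(uint64_t length, uint32_t idr1)
{
	uint32_t cmdqs = (idr1 >> 21) & 0x1f;
	uint64_t max = 16ULL << cmdqs;

	return length >= 32 && length <= max &&
	       (length & (length - 1)) == 0;	/* power of 2 */
}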
Add a new vEVENTQ type for VINTFs that are assigned to user space. Simply report the two 64-bit LVCMDQ_ERR_MAP register values.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/uapi/linux/iommufd.h | 15 +++++++++++++ .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 22 +++++++++++++++++++ 2 files changed, 37 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index d69e7c1d39ea..d814b0f61fad 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -1113,10 +1113,12 @@ struct iommufd_vevent_header { * enum iommu_veventq_type - Virtual Event Queue Type * @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use * @IOMMU_VEVENTQ_TYPE_ARM_SMMUV3: ARM SMMUv3 Virtual Event Queue + * @IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension IRQ */ enum iommu_veventq_type { IOMMU_VEVENTQ_TYPE_DEFAULT = 0, IOMMU_VEVENTQ_TYPE_ARM_SMMUV3 = 1, + IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV = 2, };
/** @@ -1140,6 +1142,19 @@ struct iommu_vevent_arm_smmuv3 { __aligned_le64 evt[4]; };
+/** + * struct iommu_vevent_tegra241_cmdqv - Tegra241 CMDQV IRQ + * (IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV) + * @lvcmdq_err_map: 128-bit logical vcmdq error map, little-endian. + * (Refer to register LVCMDQ_ERR_MAPs per VINTF ) + * + * The 128-bit register value from HW exclusively reflect the error bits for a + * Virtual Interface represented by a vIOMMU object. Read and report directly. + */ +struct iommu_vevent_tegra241_cmdqv { + __aligned_le64 lvcmdq_err_map[2]; +}; + /** * struct iommu_veventq_alloc - ioctl(IOMMU_VEVENTQ_ALLOC) * @size: sizeof(struct iommu_veventq_alloc) diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index 88e2b6506b3a..d8830b526601 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -292,6 +292,20 @@ static inline int vcmdq_write_config(struct tegra241_vcmdq *vcmdq, u32 regval)
/* ISR Functions */
+static void tegra241_vintf_user_handle_error(struct tegra241_vintf *vintf) +{ + struct iommufd_viommu *viommu = &vintf->vsmmu.core; + struct iommu_vevent_tegra241_cmdqv vevent_data; + int i; + + for (i = 0; i < LVCMDQ_ERR_MAP_NUM_64; i++) + vevent_data.lvcmdq_err_map[i] = + readq_relaxed(REG_VINTF(vintf, LVCMDQ_ERR_MAP_64(i))); + + iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV, + &vevent_data, sizeof(vevent_data)); +} + static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf) { int i; @@ -337,6 +351,14 @@ static irqreturn_t tegra241_cmdqv_isr(int irq, void *devid) vintf_map &= ~BIT_ULL(0); }
+ /* Handle other user VINTFs and their LVCMDQs */ + while (vintf_map) { + unsigned long idx = __ffs64(vintf_map); + + tegra241_vintf_user_handle_error(cmdqv->vintfs[idx]); + vintf_map &= ~BIT_ULL(idx); + } + return IRQ_HANDLED; }
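For completeness, a hedged sketch of how a VMM consuming this vEVENTQ type might scan the reported error map. Only the struct layout comes from the uapi addition above; the vEVENTQ read loop is omitted, and the assumption that bit N maps to logical VCMDQ N of the VINTF is illustrative.

#include <endian.h>
#include <stdint.h>
#include <linux/iommufd.h>

/* Hypothetical VMM-side scan of the 128-bit error map delivered by an
 * IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV event. Bit N is assumed to correspond
 * to logical VCMDQ N of the VINTF; the actual LVCMDQ count may be smaller.
 */
static void handle_cmdqv_vevent(const struct iommu_vevent_tegra241_cmdqv *ev)
{
	unsigned int lidx;

	for (lidx = 0; lidx < 128; lidx++) {
		uint64_t half = le64toh(ev->lvcmdq_err_map[lidx / 64]);

		if (half & (1ULL << (lidx % 64))) {
			/* LVCMDQ 'lidx' reported an error; emulate/inject it */
		}
	}
}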