This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a parent HWPT typically hold one stage-2 IO pagetable and tag it with only one ID in the cache entries. When sharing one large stage-2 IO pagetable across physical IOMMU instances, that one ID may not always be available across all the IOMMU instances. In other word, it's ideal for SW to have a different container for the stage-2 IO pagetable so it can hold another ID that's available.
For this "different container", add vIOMMU, an additional layer to hold extra virtualization information: _______________________________________________________________________ | iommufd (with vIOMMU) | | | | [5] | | _____________ | | | | | | |----------------| vIOMMU | | | | | | | | | | | | | | [1] | | [4] [2] | | | ______ | | _____________ ________ | | | | | | [3] | | | | | | | | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | | | | |______| |_____________| |_____________| |________| | | | | | | | | |______|________|______________|__________________|_______________|_____| | | | | | ______v_____ | ______v_____ ______v_____ ___v__ | struct | | PFN | (paging) | | (nested) | |struct| |iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device| |____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance that is passed to or shared with a VM. That can be some HW/SW resources: - Security namespace for guest owned ID, e.g. guest-controlled cache tags - Access to a sharable nesting parent pagetable across physical IOMMUs - Virtualization of various platforms IDs, e.g. RIDs and others - Delivery of paravirtualized invalidation - Direct assigned invalidation queues - Direct assigned interrupts - Non-affiliated event reporting
On a multi-IOMMU system, the vIOMMU object must be instanced to the number of the physical IOMMUs that are passed to (via devices) a guest VM, while being able to hold the shareable parent HWPT. Each vIOMMU then just needs to allocate its own individual ID to tag its own cache: ---------------------------- ---------------- | | paging_hwpt0 | | hwpt_nested0 |--->| viommu0 ------------------ ---------------- | | IDx | ---------------------------- ---------------------------- ---------------- | | paging_hwpt0 | | hwpt_nested1 |--->| viommu1 ------------------ ---------------- | | IDy | ----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we repurposed the vIOMMU object from an earlier RFC, just for a referece: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4 (paring QEMU branch for testing will be provided with the part2 series)
Changelog v4 * Added "Reviewed-by" from Jason * Dropped IOMMU_VIOMMU_TYPE_DEFAULT support * Dropped iommufd_object_alloc_elm renamings * Renamed iommufd's viommu_api.c to driver.c * Reworked iommufd_viommu_alloc helper * Added a separate iommufd_hwpt_nested_alloc_for_viommu function for hwpt_nested allocations on a vIOMMU, and added comparison between viommu->iommu_dev->ops and dev_iommu_ops(idev->dev) * Replaced s2_parent with vsmmu in arm_smmu_nested_domain * Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in viommu_ops * Replaced wait_queue_head_t with a completion, to delay the unplug of mock_iommu_dev * Corrected documentation graph that was missing struct iommu_device * Added an iommufd_verify_unfinalized_object helper to verify driver- allocated vIOMMU/vDEVICE objects * Added missing test cases for TEST_LENGTH and fail_nth v3 https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/ * Rebased on top of Jason's nesting v3 series https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvidi... * Split the series into smaller parts * Added Jason's Reviewed-by * Added back viommu->iommu_dev * Added support for driver-allocated vIOMMU v.s. core-allocated * Dropped arm_smmu_cache_invalidate_user * Added an iommufd_test_wait_for_users() in selftest * Reworked test code to make viommu an individual FIXTURE * Added missing TEST_LENGTH case for the new ioctl command v2 https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/ * Limited vdev_id to one per idev * Added a rw_sem to protect the vdev_id list * Reworked driver-level APIs with proper lockings * Added a new viommu_api file for IOMMUFD_DRIVER config * Dropped useless iommu_dev point from the viommu structure * Added missing index numnbers to new types in the uAPI header * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one * Reworked mock_viommu_cache_invalidate() using the new iommu helper * Reordered details of set/unset_vdev_id handlers for proper lockings v1 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks! Nicolin
Nicolin Chen (11): iommufd: Move struct iommufd_object to public iommufd header iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct iommufd: Add iommufd_verify_unfinalized_object iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl iommufd: Add domain_alloc_nested op to iommufd_viommu_ops iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC iommufd/selftest: Add refcount to mock_iommu_device iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage Documentation: userspace-api: iommufd: Update vIOMMU iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++--- drivers/iommu/iommufd/iommufd_private.h | 36 ++------ drivers/iommu/iommufd/iommufd_test.h | 2 + include/linux/iommu.h | 14 +++ include/linux/iommufd.h | 89 +++++++++++++++++++ include/uapi/linux/iommufd.h | 56 ++++++++++-- tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++ .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++------ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +- drivers/iommu/iommufd/driver.c | 38 ++++++++ drivers/iommu/iommufd/hw_pagetable.c | 69 +++++++++++++- drivers/iommu/iommufd/main.c | 58 ++++++------ drivers/iommu/iommufd/selftest.c | 73 +++++++++++++-- drivers/iommu/iommufd/viommu.c | 85 ++++++++++++++++++ tools/testing/selftests/iommu/iommufd.c | 78 ++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 11 +++ Documentation/userspace-api/iommufd.rst | 69 +++++++++++++- 18 files changed, 701 insertions(+), 124 deletions(-) create mode 100644 drivers/iommu/iommufd/driver.c create mode 100644 drivers/iommu/iommufd/viommu.c
Prepare for an embedded structure design for driver-level iommufd_viommu objects: // include/linux/iommufd.h struct iommufd_viommu { struct iommufd_object obj; .... };
// Some IOMMU driver struct iommu_driver_viommu { struct iommufd_viommu core; .... };
It has to expose struct iommufd_object and enum iommufd_object_type from the core-level private header to the public iommufd header.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 25 +------------------------ include/linux/iommufd.h | 24 ++++++++++++++++++++++++ 2 files changed, 25 insertions(+), 24 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index f1d865e6fab6..1bb8c0aaecd1 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -5,8 +5,8 @@ #define __IOMMUFD_PRIVATE_H
#include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/iova_bitmap.h> -#include <linux/refcount.h> #include <linux/rwsem.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -122,29 +122,6 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd, return 0; }
-enum iommufd_object_type { - IOMMUFD_OBJ_NONE, - IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE, - IOMMUFD_OBJ_DEVICE, - IOMMUFD_OBJ_HWPT_PAGING, - IOMMUFD_OBJ_HWPT_NESTED, - IOMMUFD_OBJ_IOAS, - IOMMUFD_OBJ_ACCESS, - IOMMUFD_OBJ_FAULT, -#ifdef CONFIG_IOMMUFD_TEST - IOMMUFD_OBJ_SELFTEST, -#endif - IOMMUFD_OBJ_MAX, -}; - -/* Base struct for all objects with a userspace ID handle. */ -struct iommufd_object { - refcount_t shortterm_users; - refcount_t users; - enum iommufd_object_type type; - unsigned int id; -}; - static inline bool iommufd_lock_obj(struct iommufd_object *obj) { if (!refcount_inc_not_zero(&obj->users)) diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 30f832a60ccb..22948dd03d67 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -8,6 +8,7 @@
#include <linux/err.h> #include <linux/errno.h> +#include <linux/refcount.h> #include <linux/types.h>
struct device; @@ -18,6 +19,29 @@ struct iommufd_ctx; struct iommufd_device; struct page;
+enum iommufd_object_type { + IOMMUFD_OBJ_NONE, + IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE, + IOMMUFD_OBJ_DEVICE, + IOMMUFD_OBJ_HWPT_PAGING, + IOMMUFD_OBJ_HWPT_NESTED, + IOMMUFD_OBJ_IOAS, + IOMMUFD_OBJ_ACCESS, + IOMMUFD_OBJ_FAULT, +#ifdef CONFIG_IOMMUFD_TEST + IOMMUFD_OBJ_SELFTEST, +#endif + IOMMUFD_OBJ_MAX, +}; + +/* Base struct for all objects with a userspace ID handle. */ +struct iommufd_object { + refcount_t shortterm_users; + refcount_t users; + enum iommufd_object_type type; + unsigned int id; +}; + struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, struct device *dev, u32 *id); void iommufd_device_unbind(struct iommufd_device *idev);
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
Prepare for an embedded structure design for driver-level iommufd_viommu objects: // include/linux/iommufd.h struct iommufd_viommu { struct iommufd_object obj; .... };
// Some IOMMU driver struct iommu_driver_viommu { struct iommufd_viommu core; .... };
It has to expose struct iommufd_object and enum iommufd_object_type from the core-level private header to the public iommufd header.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent a slice of physical IOMMU device passed to or shared with a user space VM. This slice, now a vIOMMU object, is a group of virtualization resources of a physical IOMMU's, such as: - Security namespace for guest owned ID, e.g. guest-controlled cache tags - Access to a sharable nesting parent pagetable across physical IOMMUs - Virtualization of various platforms IDs, e.g. RIDs and others - Delivery of paravirtualized invalidation - Direct assigned invalidation queues - Direct assigned interrupts - Non-affiliated event reporting
Add a new viommu_alloc op in iommu_ops, for drivers to allocate their own vIOMMU structures. And this allocation also needs a free(), so add struct iommufd_viommu_ops.
To simplify a vIOMMU allocation, provide a iommufd_viommu_alloc() helper. It's suggested that a driver should embed a core-level viommu structure in its driver-level viommu struct and call the iommufd_viommu_alloc() helper, meanwhile the driver can also implement a viommu ops: struct my_driver_viommu { struct iommufd_viommu core; /* driver-owned properties/features */ .... };
static const struct iommufd_viommu_ops my_driver_viommu_ops = { .free = my_driver_viommu_free, /* future ops for virtualization features */ .... };
static struct iommufd_viommu my_driver_viommu_alloc(...) { struct my_driver_viommu *my_viommu = iommufd_viommu_alloc(ictx, my_driver_viommu, core, my_driver_viommu_ops); /* Init my_viommu and related HW feature */ .... return &my_viommu->core; }
static struct iommu_domain_ops my_driver_domain_ops = { .... .viommu_alloc = my_driver_viommu_alloc, };
To make the Kernel config work between a driver and the iommufd core, move the _iommufd_object_alloc helper into a new driver.c file that builds with CONFIG_IOMMUFD_DRIVER.
Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/iommufd/iommufd_private.h | 4 -- include/linux/iommu.h | 14 +++++++ include/linux/iommufd.h | 56 +++++++++++++++++++++++++ drivers/iommu/iommufd/driver.c | 38 +++++++++++++++++ drivers/iommu/iommufd/main.c | 32 -------------- 6 files changed, 109 insertions(+), 37 deletions(-) create mode 100644 drivers/iommu/iommufd/driver.c
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile index cf4605962bea..435124a8e1f1 100644 --- a/drivers/iommu/iommufd/Makefile +++ b/drivers/iommu/iommufd/Makefile @@ -12,4 +12,4 @@ iommufd-y := \ iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
obj-$(CONFIG_IOMMUFD) += iommufd.o -obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o +obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o driver.o diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 1bb8c0aaecd1..5bd41257f2ef 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -202,10 +202,6 @@ iommufd_object_put_and_try_destroy(struct iommufd_ctx *ictx, iommufd_object_remove(ictx, obj, obj->id, 0); }
-struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, - size_t size, - enum iommufd_object_type type); - #define __iommufd_object_alloc(ictx, ptr, type, obj) \ container_of(_iommufd_object_alloc( \ ictx, \ diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 4ad9b9ec6c9b..14f24b5cd16f 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -42,6 +42,8 @@ struct notifier_block; struct iommu_sva; struct iommu_dma_cookie; struct iommu_fault_param; +struct iommufd_ctx; +struct iommufd_viommu;
#define IOMMU_FAULT_PERM_READ (1 << 0) /* read */ #define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */ @@ -542,6 +544,14 @@ static inline int __iommu_copy_struct_from_user_array( * @remove_dev_pasid: Remove any translation configurations of a specific * pasid, so that any DMA transactions with this pasid * will be blocked by the hardware. + * @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind + * the @dev, as the set of virtualization resources shared/passed + * to user space IOMMU instance. And associate it with a nesting + * @parent_domain. The @viommu_type must be defined in the header + * include/uapi/linux/iommufd.h + * It is suggested to call iommufd_viommu_alloc() helper for + * a bundled allocation of the core and the driver structures, + * using the given @ictx pointer. * @pgsize_bitmap: bitmap of all possible supported page sizes * @owner: Driver module providing these ops * @identity_domain: An always available, always attachable identity @@ -591,6 +601,10 @@ struct iommu_ops { void (*remove_dev_pasid)(struct device *dev, ioasid_t pasid, struct iommu_domain *domain);
+ struct iommufd_viommu *(*viommu_alloc)( + struct device *dev, struct iommu_domain *parent_domain, + struct iommufd_ctx *ictx, unsigned int viommu_type); + const struct iommu_domain_ops *default_domain_ops; unsigned long pgsize_bitmap; struct module *owner; diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 22948dd03d67..55054fbc793c 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -17,6 +17,7 @@ struct iommu_group; struct iommufd_access; struct iommufd_ctx; struct iommufd_device; +struct iommufd_viommu_ops; struct page;
enum iommufd_object_type { @@ -28,6 +29,7 @@ enum iommufd_object_type { IOMMUFD_OBJ_IOAS, IOMMUFD_OBJ_ACCESS, IOMMUFD_OBJ_FAULT, + IOMMUFD_OBJ_VIOMMU, #ifdef CONFIG_IOMMUFD_TEST IOMMUFD_OBJ_SELFTEST, #endif @@ -78,6 +80,26 @@ void iommufd_access_detach(struct iommufd_access *access);
void iommufd_ctx_get(struct iommufd_ctx *ictx);
+struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommufd_hwpt_paging *hwpt; + + const struct iommufd_viommu_ops *ops; + + unsigned int type; +}; + +/** + * struct iommufd_viommu_ops - vIOMMU specific operations + * @free: Free all driver-specific parts of an iommufd_viommu. The memory of the + * vIOMMU will be free-ed by iommufd core after calling this free op. + */ +struct iommufd_viommu_ops { + void (*free)(struct iommufd_viommu *viommu); +}; + #if IS_ENABLED(CONFIG_IOMMUFD) struct iommufd_ctx *iommufd_ctx_from_file(struct file *file); struct iommufd_ctx *iommufd_ctx_from_fd(int fd); @@ -135,4 +157,38 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx) return -EOPNOTSUPP; } #endif /* CONFIG_IOMMUFD */ + +#if IS_ENABLED(CONFIG_IOMMUFD_DRIVER) +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, + size_t size, + enum iommufd_object_type type); +#else /* !CONFIG_IOMMUFD_DRIVER */ +static inline struct iommufd_object * +_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size, + enum iommufd_object_type type) +{ + return ERR_PTR(-EOPNOTSUPP); +} +#endif /* CONFIG_IOMMUFD_DRIVER */ + +/* + * Helpers for IOMMU driver to allocate driver structures that will be freed by + * the iommufd core. The free op will be called prior to freeing the memory. + */ +#define iommufd_viommu_alloc(ictx, drv_struct, member, viommu_ops) \ + ({ \ + struct drv_struct *ret; \ + \ + static_assert( \ + __same_type(struct iommufd_viommu, \ + ((struct drv_struct *)NULL)->member)); \ + static_assert(offsetof(struct drv_struct, member.obj) == 0); \ + ret = container_of( \ + _iommufd_object_alloc(ictx, sizeof(struct drv_struct), \ + IOMMUFD_OBJ_VIOMMU), \ + struct drv_struct, member.obj); \ + if (!IS_ERR(ret)) \ + ret->member.ops = viommu_ops; \ + ret; \ + }) #endif diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c new file mode 100644 index 000000000000..c0876d3f91c7 --- /dev/null +++ b/drivers/iommu/iommufd/driver.c @@ -0,0 +1,38 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES + */ + +#include "iommufd_private.h" + +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, + size_t size, + enum iommufd_object_type type) +{ + struct iommufd_object *obj; + int rc; + + obj = kzalloc(size, GFP_KERNEL_ACCOUNT); + if (!obj) + return ERR_PTR(-ENOMEM); + obj->type = type; + /* Starts out bias'd by 1 until it is removed from the xarray */ + refcount_set(&obj->shortterm_users, 1); + refcount_set(&obj->users, 1); + + /* + * Reserve an ID in the xarray but do not publish the pointer yet since + * the caller hasn't initialized it yet. Once the pointer is published + * in the xarray and visible to other threads we can't reliably destroy + * it anymore, so the caller must complete all errorable operations + * before calling iommufd_object_finalize(). + */ + rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY, xa_limit_31b, + GFP_KERNEL_ACCOUNT); + if (rc) + goto out_free; + return obj; +out_free: + kfree(obj); + return ERR_PTR(rc); +} +EXPORT_SYMBOL_NS_GPL(_iommufd_object_alloc, IOMMUFD); diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index b5f5d27ee963..92bd075108e5 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -29,38 +29,6 @@ struct iommufd_object_ops { static const struct iommufd_object_ops iommufd_object_ops[]; static struct miscdevice vfio_misc_dev;
-struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx, - size_t size, - enum iommufd_object_type type) -{ - struct iommufd_object *obj; - int rc; - - obj = kzalloc(size, GFP_KERNEL_ACCOUNT); - if (!obj) - return ERR_PTR(-ENOMEM); - obj->type = type; - /* Starts out bias'd by 1 until it is removed from the xarray */ - refcount_set(&obj->shortterm_users, 1); - refcount_set(&obj->users, 1); - - /* - * Reserve an ID in the xarray but do not publish the pointer yet since - * the caller hasn't initialized it yet. Once the pointer is published - * in the xarray and visible to other threads we can't reliably destroy - * it anymore, so the caller must complete all errorable operations - * before calling iommufd_object_finalize(). - */ - rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY, - xa_limit_31b, GFP_KERNEL_ACCOUNT); - if (rc) - goto out_free; - return obj; -out_free: - kfree(obj); - return ERR_PTR(rc); -} - /* * Allow concurrent access to the object. *
On 2024/10/22 8:19, Nicolin Chen wrote:
- @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind
the @dev, as the set of virtualization resources shared/passed
to user space IOMMU instance. And associate it with a nesting
@parent_domain. The @viommu_type must be defined in the header
include/uapi/linux/iommufd.h
It is suggested to call iommufd_viommu_alloc() helper for
a bundled allocation of the core and the driver structures,
using the given @ictx pointer.
- @pgsize_bitmap: bitmap of all possible supported page sizes
- @owner: Driver module providing these ops
- @identity_domain: An always available, always attachable identity
@@ -591,6 +601,10 @@ struct iommu_ops { void (*remove_dev_pasid)(struct device *dev, ioasid_t pasid, struct iommu_domain *domain);
- struct iommufd_viommu *(*viommu_alloc)(
struct device *dev, struct iommu_domain *parent_domain,
struct iommufd_ctx *ictx, unsigned int viommu_type);
Is the vIOMMU object limited to a parent domain?
Thanks, baolu
On Tue, Oct 22, 2024 at 10:28:30AM +0800, Baolu Lu wrote:
On 2024/10/22 8:19, Nicolin Chen wrote:
- @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind
the @dev, as the set of virtualization resources shared/passed
to user space IOMMU instance. And associate it with a nesting
@parent_domain. The @viommu_type must be defined in the header
include/uapi/linux/iommufd.h
It is suggested to call iommufd_viommu_alloc() helper for
a bundled allocation of the core and the driver structures,
using the given @ictx pointer.
- @pgsize_bitmap: bitmap of all possible supported page sizes
- @owner: Driver module providing these ops
- @identity_domain: An always available, always attachable identity
@@ -591,6 +601,10 @@ struct iommu_ops { void (*remove_dev_pasid)(struct device *dev, ioasid_t pasid, struct iommu_domain *domain);
struct iommufd_viommu *(*viommu_alloc)(
struct device *dev, struct iommu_domain *parent_domain,
struct iommufd_ctx *ictx, unsigned int viommu_type);
Is the vIOMMU object limited to a parent domain?
Yes, for each vIOMMU (a slice of physical IOMMU per VM), one S2 parent domain is enough. Typically, it has the mappings between gPAs and hPAs. If its format/compatibility allows, this single parent domain can be shared with other vIOMMUs.
Thanks Nicolin
On 2024/10/22 12:40, Nicolin Chen wrote:
On Tue, Oct 22, 2024 at 10:28:30AM +0800, Baolu Lu wrote:
On 2024/10/22 8:19, Nicolin Chen wrote:
- @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind
the @dev, as the set of virtualization resources shared/passed
to user space IOMMU instance. And associate it with a nesting
@parent_domain. The @viommu_type must be defined in the header
include/uapi/linux/iommufd.h
It is suggested to call iommufd_viommu_alloc() helper for
a bundled allocation of the core and the driver structures,
using the given @ictx pointer.
- @pgsize_bitmap: bitmap of all possible supported page sizes
- @owner: Driver module providing these ops
- @identity_domain: An always available, always attachable identity
@@ -591,6 +601,10 @@ struct iommu_ops { void (*remove_dev_pasid)(struct device *dev, ioasid_t pasid, struct iommu_domain *domain);
struct iommufd_viommu *(*viommu_alloc)(
struct device *dev, struct iommu_domain *parent_domain,
struct iommufd_ctx *ictx, unsigned int viommu_type);
Is the vIOMMU object limited to a parent domain?
Yes, for each vIOMMU (a slice of physical IOMMU per VM), one S2 parent domain is enough. Typically, it has the mappings between gPAs and hPAs. If its format/compatibility allows, this single parent domain can be shared with other vIOMMUs.
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain that translates gPAs to hPAs could also have a vIOMMU object associated with it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Thanks, baolu
On Tue, Oct 22, 2024 at 04:59:07PM +0800, Baolu Lu wrote:
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain that translates gPAs to hPAs could also have a vIOMMU object associated with it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Why? The nested parent flavour of the domain is basically free to create, what reason would be to not do that?
If the HW doesn't support it, then does the HW really need/support a VIOMMU?
I suppose it could be made up to the driver, but for now I think we should leave it as is in the core code requiring it until we have a reason to relax it.
Jason
On 2024/10/22 21:15, Jason Gunthorpe wrote:
On Tue, Oct 22, 2024 at 04:59:07PM +0800, Baolu Lu wrote:
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain that translates gPAs to hPAs could also have a vIOMMU object associated with it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Why? The nested parent flavour of the domain is basically free to create, what reason would be to not do that?
Above addressed my question. The software using vIOMMU should allocate a domain of nested parent type.
If the HW doesn't support it, then does the HW really need/support a VIOMMU?
I suppose it could be made up to the driver, but for now I think we should leave it as is in the core code requiring it until we have a reason to relax it.
Yes. In such cases, the iommu driver could always allow nested parent domain allocation, but fails to allocate a nested domain if the hardware is not capable of nesting translation.
Thanks, baolu
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, October 22, 2024 9:16 PM
On Tue, Oct 22, 2024 at 04:59:07PM +0800, Baolu Lu wrote:
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain that translates gPAs to hPAs could also have a vIOMMU object associated with it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Why? The nested parent flavour of the domain is basically free to create, what reason would be to not do that?
If the HW doesn't support it, then does the HW really need/support a VIOMMU?
Now it's agreed to build trusted I/O on top of this new vIOMMU object. format-wise probably it's free to assume that nested parent is supported on any new platform which will support trusted I/O. But I'm not sure all the conditions around allowing nested are same as for trusted I/O, e.g. for ARM nesting is allowed only for CANWBS/S2FWB. Are they always guaranteed in trusted I/O configuration?
Baolu did raise a good open to confirm given it will be used beyond nesting. 😊
On Fri, Oct 25, 2024 at 08:47:40AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, October 22, 2024 9:16 PM
On Tue, Oct 22, 2024 at 04:59:07PM +0800, Baolu Lu wrote:
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain that translates gPAs to hPAs could also have a vIOMMU object associated with it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Why? The nested parent flavour of the domain is basically free to create, what reason would be to not do that?
If the HW doesn't support it, then does the HW really need/support a VIOMMU?
Now it's agreed to build trusted I/O on top of this new vIOMMU object. format-wise probably it's free to assume that nested parent is supported on any new platform which will support trusted I/O. But I'm not sure all the conditions around allowing nested are same as for trusted I/O, e.g. for ARM nesting is allowed only for CANWBS/S2FWB. Are they always guaranteed in trusted I/O configuration?
ARM is a big ? what exactly will come, but I'm expecting that to be resolved either with continued HW support or Linux will add the cache flushing and relax the test.
Baolu did raise a good open to confirm given it will be used beyond nesting. 😊
Even CC is "nesting", it is just nested with a fixed Identity S1 in the baseline case. The S2 translation still exists and still has to be consistent with whatever the secure world is doing.
So, my feeling is that the S2 nested domain is mandatory for the viommu, especially for CC, it must exists. In the end there may be more options than just a nested parent.
For instance if the CC design relies on the secure world sharing the CPU and IOMMU page table we might need a new HWPT type to represent that configuration.
From a uapi perspective we seem OK here as the hwpt input could be anything. We might have to adjust some checks in the kernel someday.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, October 25, 2024 11:24 PM
On Fri, Oct 25, 2024 at 08:47:40AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, October 22, 2024 9:16 PM
On Tue, Oct 22, 2024 at 04:59:07PM +0800, Baolu Lu wrote:
Is it feasible to make vIOMMU object more generic, rather than strictly tying it to nested translation? For example, a normal paging domain
that
translates gPAs to hPAs could also have a vIOMMU object associated
with
it.
While we can only support vIOMMU object allocation uAPI for S2 paging domains in the context of this series, we could consider leaving the option open to associate a vIOMMU object with other normal paging domains that are not a nested parent?
Why? The nested parent flavour of the domain is basically free to create, what reason would be to not do that?
If the HW doesn't support it, then does the HW really need/support a VIOMMU?
Now it's agreed to build trusted I/O on top of this new vIOMMU object. format-wise probably it's free to assume that nested parent is supported on any new platform which will support trusted I/O. But I'm not sure all the conditions around allowing nested are same as for trusted I/O, e.g. for ARM nesting is allowed only for CANWBS/S2FWB. Are they always guaranteed in trusted I/O configuration?
ARM is a big ? what exactly will come, but I'm expecting that to be resolved either with continued HW support or Linux will add the cache flushing and relax the test.
Baolu did raise a good open to confirm given it will be used beyond nesting. 😊
Even CC is "nesting", it is just nested with a fixed Identity S1 in the baseline case. The S2 translation still exists and still has to be consistent with whatever the secure world is doing.
this is true. That is why I asked more from the conditions around enabling nested instead of the translation/format itself.
So, my feeling is that the S2 nested domain is mandatory for the viommu, especially for CC, it must exists. In the end there may be more options than just a nested parent.
For instance if the CC design relies on the secure world sharing the CPU and IOMMU page table we might need a new HWPT type to represent that configuration.
From a uapi perspective we seem OK here as the hwpt input could be anything. We might have to adjust some checks in the kernel someday.
yes, that could be extended in case of a need.
To support driver-allocated vIOMMU objects, it's suggested to call the allocator helper in IOMMU dirvers. However, there is no guarantee that drivers will all use it and allocate objects properly.
Add a helper for iommufd core to verify if an unfinalized object is at least reserved in the ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 3 +++ drivers/iommu/iommufd/main.c | 20 ++++++++++++++++++++ 2 files changed, 23 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 5bd41257f2ef..d53c1ca75532 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -152,6 +152,9 @@ static inline void iommufd_put_object(struct iommufd_ctx *ictx, wake_up_interruptible_all(&ictx->destroy_wait); }
+int iommufd_verify_unfinalized_object(struct iommufd_ctx *ictx, + struct iommufd_object *to_verify); + void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj); void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx, struct iommufd_object *obj); diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 92bd075108e5..e244fed1b7ab 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -89,6 +89,26 @@ struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id, return obj; }
+int iommufd_verify_unfinalized_object(struct iommufd_ctx *ictx, + struct iommufd_object *to_verify) +{ + XA_STATE(xas, &ictx->objects, 0); + struct iommufd_object *obj; + int rc = 0; + + if (!to_verify || !to_verify->id) + return -EINVAL; + xas.xa_index = to_verify->id; + + xa_lock(&ictx->objects); + obj = xas_load(&xas); + /* Being an unfinalized object, the loaded obj is a reserved space */ + if (obj != XA_ZERO_ENTRY) + rc = -ENOENT; + xa_unlock(&ictx->objects); + return rc; +} + static int iommufd_object_dec_wait_shortterm(struct iommufd_ctx *ictx, struct iommufd_object *to_destroy) {
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
To support driver-allocated vIOMMU objects, it's suggested to call the allocator helper in IOMMU dirvers. However, there is no guarantee that drivers will all use it and allocate objects properly.
Add a helper for iommufd core to verify if an unfinalized object is at least reserved in the ictx.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add a new ioctl for user space to do a vIOMMU allocation. It must be based on a nesting parent HWPT, so take its refcount.
If an IOMMU driver supports a driver-managed vIOMMU object, it must define its own IOMMU_VIOMMU_TYPE_ in the uAPI header and implement a viommu_alloc op in its iommu_ops.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/Makefile | 3 +- drivers/iommu/iommufd/iommufd_private.h | 3 + include/uapi/linux/iommufd.h | 40 ++++++++++++ drivers/iommu/iommufd/main.c | 6 ++ drivers/iommu/iommufd/viommu.c | 85 +++++++++++++++++++++++++ 5 files changed, 136 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/iommufd/viommu.c
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile index 435124a8e1f1..7c207c5f1eb6 100644 --- a/drivers/iommu/iommufd/Makefile +++ b/drivers/iommu/iommufd/Makefile @@ -7,7 +7,8 @@ iommufd-y := \ ioas.o \ main.o \ pages.o \ - vfio_compat.o + vfio_compat.o \ + viommu.o
iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index d53c1ca75532..9adf8d616796 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -504,6 +504,9 @@ static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev, return iommu_group_replace_domain(idev->igroup->group, hwpt->domain); }
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); +void iommufd_viommu_destroy(struct iommufd_object *obj); + #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); void iommufd_selftest_destroy(struct iommufd_object *obj); diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index cd4920886ad0..d1c99285eda0 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -51,6 +51,7 @@ enum { IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP = 0x8c, IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d, IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e, + IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f, };
/** @@ -852,4 +853,43 @@ struct iommu_fault_alloc { __u32 out_fault_fd; }; #define IOMMU_FAULT_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_FAULT_QUEUE_ALLOC) + +/** + * enum iommu_viommu_type - Virtual IOMMU Type + * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use + */ +enum iommu_viommu_type { + IOMMU_VIOMMU_TYPE_DEFAULT = 0, +}; + +/** + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC) + * @size: sizeof(struct iommu_viommu_alloc) + * @flags: Must be 0 + * @type: Type of the virtual IOMMU. Must be defined in enum iommu_viommu_type + * @dev_id: The device's physical IOMMU will be used to back the virtual IOMMU + * @hwpt_id: ID of a nesting parent HWPT to associate to + * @out_viommu_id: Output virtual IOMMU ID for the allocated object + * + * Allocate a virtual IOMMU object that represents the underlying physical + * IOMMU's virtualization support. The vIOMMU object is a security-isolated + * slice of the physical IOMMU HW that is unique to a specific VM. Operations + * global to the IOMMU are connected to the vIOMMU, such as: + * - Security namespace for guest owned ID, e.g. guest-controlled cache tags + * - Access to a sharable nesting parent pagetable across physical IOMMUs + * - Virtualization of various platforms IDs, e.g. RIDs and others + * - Delivery of paravirtualized invalidation + * - Direct assigned invalidation queues + * - Direct assigned interrupts + * - Non-affiliated event reporting + */ +struct iommu_viommu_alloc { + __u32 size; + __u32 flags; + __u32 type; + __u32 dev_id; + __u32 hwpt_id; + __u32 out_viommu_id; +}; +#define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC) #endif diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index e244fed1b7ab..ab5ee325d809 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -321,6 +321,7 @@ union ucmd_buffer { struct iommu_ioas_unmap unmap; struct iommu_option option; struct iommu_vfio_ioas vfio_ioas; + struct iommu_viommu_alloc viommu; #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif @@ -372,6 +373,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { val64), IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas, __reserved), + IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl, + struct iommu_viommu_alloc, out_viommu_id), #ifdef CONFIG_IOMMUFD_TEST IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last), #endif @@ -507,6 +510,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = { [IOMMUFD_OBJ_FAULT] = { .destroy = iommufd_fault_destroy, }, + [IOMMUFD_OBJ_VIOMMU] = { + .destroy = iommufd_viommu_destroy, + }, #ifdef CONFIG_IOMMUFD_TEST [IOMMUFD_OBJ_SELFTEST] = { .destroy = iommufd_selftest_destroy, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c new file mode 100644 index 000000000000..e612f3d539b7 --- /dev/null +++ b/drivers/iommu/iommufd/viommu.c @@ -0,0 +1,85 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES + */ + +#include "iommufd_private.h" + +void iommufd_viommu_destroy(struct iommufd_object *obj) +{ + struct iommufd_viommu *viommu = + container_of(obj, struct iommufd_viommu, obj); + + if (viommu->ops && viommu->ops->free) + viommu->ops->free(viommu); + refcount_dec(&viommu->hwpt->common.obj.users); +} + +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) +{ + struct iommu_viommu_alloc *cmd = ucmd->cmd; + struct iommufd_hwpt_paging *hwpt_paging; + struct iommufd_viommu *viommu; + struct iommufd_device *idev; + const struct iommu_ops *ops; + int rc; + + if (cmd->flags || cmd->type == IOMMU_VIOMMU_TYPE_DEFAULT) + return -EOPNOTSUPP; + + idev = iommufd_get_device(ucmd, cmd->dev_id); + if (IS_ERR(idev)) + return PTR_ERR(idev); + + ops = dev_iommu_ops(idev->dev); + if (!ops->viommu_alloc) { + rc = -EOPNOTSUPP; + goto out_put_idev; + } + + hwpt_paging = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id); + if (IS_ERR(hwpt_paging)) { + rc = PTR_ERR(hwpt_paging); + goto out_put_idev; + } + + if (!hwpt_paging->nest_parent) { + rc = -EINVAL; + goto out_put_hwpt; + } + + viommu = ops->viommu_alloc(idev->dev, hwpt_paging->common.domain, + ucmd->ictx, cmd->type); + if (IS_ERR(viommu)) { + rc = PTR_ERR(viommu); + goto out_put_hwpt; + } + + rc = iommufd_verify_unfinalized_object(ucmd->ictx, &viommu->obj); + if (rc) { + kfree(viommu); + goto out_put_hwpt; + } + + viommu->type = cmd->type; + viommu->ictx = ucmd->ictx; + viommu->hwpt = hwpt_paging; + /* Assume physical IOMMUs are unpluggable (the most likely case) */ + viommu->iommu_dev = __iommu_get_iommu_dev(idev->dev); + + refcount_inc(&viommu->hwpt->common.obj.users); + + cmd->out_viommu_id = viommu->obj.id; + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); + if (rc) + goto out_abort; + iommufd_object_finalize(ucmd->ictx, &viommu->obj); + goto out_put_hwpt; + +out_abort: + iommufd_object_abort_and_destroy(ucmd->ictx, &viommu->obj); +out_put_hwpt: + iommufd_put_object(ucmd->ictx, &hwpt_paging->common.obj); +out_put_idev: + iommufd_put_object(ucmd->ictx, &idev->obj); + return rc; +}
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
Add a new ioctl for user space to do a vIOMMU allocation. It must be based on a nesting parent HWPT, so take its refcount.
If an IOMMU driver supports a driver-managed vIOMMU object, it must define
why highlight 'driver-managed', implying a core-managed vIOMMU object some day?
+/**
- struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
- @size: sizeof(struct iommu_viommu_alloc)
- @flags: Must be 0
- @type: Type of the virtual IOMMU. Must be defined in enum
iommu_viommu_type
- @dev_id: The device's physical IOMMU will be used to back the virtual
IOMMU
- @hwpt_id: ID of a nesting parent HWPT to associate to
- @out_viommu_id: Output virtual IOMMU ID for the allocated object
- Allocate a virtual IOMMU object that represents the underlying physical
- IOMMU's virtualization support. The vIOMMU object is a security-isolated
- slice of the physical IOMMU HW that is unique to a specific VM.
the object itself is a software abstraction, while a 'slice' is a set of real hw resources.
On Fri, Oct 25, 2024 at 08:59:11AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
Add a new ioctl for user space to do a vIOMMU allocation. It must be based on a nesting parent HWPT, so take its refcount.
If an IOMMU driver supports a driver-managed vIOMMU object, it must define
why highlight 'driver-managed', implying a core-managed vIOMMU object some day?
Oh, core-managed vIOMMU is gone since this version. I should have updated the commit message here too.
+/**
- struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
- @size: sizeof(struct iommu_viommu_alloc)
- @flags: Must be 0
- @type: Type of the virtual IOMMU. Must be defined in enum
iommu_viommu_type
- @dev_id: The device's physical IOMMU will be used to back the virtual
IOMMU
- @hwpt_id: ID of a nesting parent HWPT to associate to
- @out_viommu_id: Output virtual IOMMU ID for the allocated object
- Allocate a virtual IOMMU object that represents the underlying physical
- IOMMU's virtualization support. The vIOMMU object is a security-isolated
- slice of the physical IOMMU HW that is unique to a specific VM.
the object itself is a software abstraction, while a 'slice' is a set of real hw resources.
Yea, let's do this: * Allocate a virtual IOMMU object, representing the underlying physical IOMMU's * virtualization support that is a security-isolated slice of the real IOMMU HW * that is unique to a specific VM.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
- viommu->type = cmd->type;
- viommu->ictx = ucmd->ictx;
- viommu->hwpt = hwpt_paging;
- /* Assume physical IOMMUs are unpluggable (the most likely case)
*/
- viommu->iommu_dev = __iommu_get_iommu_dev(idev->dev);
so what would happen if this assumption breaks?
On Fri, Oct 25, 2024 at 09:05:58AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
viommu->type = cmd->type;
viommu->ictx = ucmd->ictx;
viommu->hwpt = hwpt_paging;
/* Assume physical IOMMUs are unpluggable (the most likely case)
*/
viommu->iommu_dev = __iommu_get_iommu_dev(idev->dev);
so what would happen if this assumption breaks?
I had a very verbose comments previously that Alexey suggested to optimize away.. Perhaps I should add back the part that mentions adding a refcount for pluggable ones..
Nicolin
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent hwpt/domain to allocate a nested domain.
Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommufd.h | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 55054fbc793c..5c13c35952d8 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -14,6 +14,7 @@ struct device; struct file; struct iommu_group; +struct iommu_user_data; struct iommufd_access; struct iommufd_ctx; struct iommufd_device; @@ -95,9 +96,17 @@ struct iommufd_viommu { * struct iommufd_viommu_ops - vIOMMU specific operations * @free: Free all driver-specific parts of an iommufd_viommu. The memory of the * vIOMMU will be free-ed by iommufd core after calling this free op. + * @domain_alloc_nested: Allocate a IOMMU_DOMAIN_NESTED on a vIOMMU that holds a + * nesting parent domain (IOMMU_DOMAIN_PAGING). @user_data + * must be defined in include/uapi/linux/iommufd.h. + * It must fully initialize the new iommu_domain before + * returning. Upon failure, ERR_PTR must be returned. */ struct iommufd_viommu_ops { void (*free)(struct iommufd_viommu *viommu); + struct iommu_domain *(*domain_alloc_nested)( + struct iommufd_viommu *viommu, + const struct iommu_user_data *user_data); };
#if IS_ENABLED(CONFIG_IOMMUFD)
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent hwpt/domain to allocate a nested domain.
Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like that nesting parent HWPT to allocate a nested HWPT.
Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.
Also, add an iommufd_hwpt_nested_alloc_for_viommu helper to allocate a nested HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's refcount already, increase the refcount of the vIOMMU only.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_private.h | 1 + include/uapi/linux/iommufd.h | 14 ++--- drivers/iommu/iommufd/hw_pagetable.c | 69 ++++++++++++++++++++++++- 3 files changed, 77 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 9adf8d616796..8c9ab35eaea5 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -288,6 +288,7 @@ struct iommufd_hwpt_paging { struct iommufd_hwpt_nested { struct iommufd_hw_pagetable common; struct iommufd_hwpt_paging *parent; + struct iommufd_viommu *viommu; };
static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt) diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index d1c99285eda0..f835ccf4a494 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -430,7 +430,7 @@ enum iommu_hwpt_data_type { * @size: sizeof(struct iommu_hwpt_alloc) * @flags: Combination of enum iommufd_hwpt_alloc_flags * @dev_id: The device to allocate this HWPT for - * @pt_id: The IOAS or HWPT to connect this HWPT to + * @pt_id: The IOAS or HWPT or vIOMMU to connect this HWPT to * @out_hwpt_id: The ID of the new HWPT * @__reserved: Must be 0 * @data_type: One of enum iommu_hwpt_data_type @@ -449,11 +449,13 @@ enum iommu_hwpt_data_type { * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags. * - * A user-managed nested HWPT will be created from a given parent HWPT via - * @pt_id, in which the parent HWPT must be allocated previously via the - * same ioctl from a given IOAS (@pt_id). In this case, the @data_type - * must be set to a pre-defined type corresponding to an I/O page table - * type supported by the underlying IOMMU hardware. + * A user-managed nested HWPT will be created from a given vIOMMU (wrapping a + * parent HWPT) or a parent HWPT via @pt_id, in which the parent HWPT must be + * allocated previously via the same ioctl from a given IOAS (@pt_id). In this + * case, the @data_type must be set to a pre-defined type corresponding to an + * I/O page table type supported by the underlying IOMMU hardware. The device + * via @dev_id and the vIOMMU via @pt_id must be associated to the same IOMMU + * instance. * * If the @data_type is set to IOMMU_HWPT_DATA_NONE, @data_len and * @data_uptr should be zero. Otherwise, both @data_len and @data_uptr diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index d06bf6e6c19f..5314cd486ddb 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -57,7 +57,10 @@ void iommufd_hwpt_nested_destroy(struct iommufd_object *obj) container_of(obj, struct iommufd_hwpt_nested, common.obj);
__iommufd_hwpt_destroy(&hwpt_nested->common); - refcount_dec(&hwpt_nested->parent->common.obj.users); + if (hwpt_nested->viommu) + refcount_dec(&hwpt_nested->viommu->obj.users); + else + refcount_dec(&hwpt_nested->parent->common.obj.users); }
void iommufd_hwpt_nested_abort(struct iommufd_object *obj) @@ -260,6 +263,54 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx, return ERR_PTR(rc); }
+/** + * iommufd_hwpt_nested_alloc_for_viommu() - Get a hwpt_nested for a vIOMMU + * @viommu: vIOMMU ojbect to associate the hwpt_nested/domain with + * @user_data: user_data pointer. Must be valid + * + * Allocate a new IOMMU_DOMAIN_NESTED for a vIOMMU and return it as a NESTED + * hw_pagetable. + */ +static struct iommufd_hwpt_nested * +iommufd_hwpt_nested_alloc_for_viommu(struct iommufd_viommu *viommu, + const struct iommu_user_data *user_data) +{ + struct iommufd_hwpt_nested *hwpt_nested; + struct iommufd_hw_pagetable *hwpt; + int rc; + + if (!viommu->ops || !viommu->ops->domain_alloc_nested) + return ERR_PTR(-EOPNOTSUPP); + + hwpt_nested = __iommufd_object_alloc( + viommu->ictx, hwpt_nested, IOMMUFD_OBJ_HWPT_NESTED, common.obj); + if (IS_ERR(hwpt_nested)) + return ERR_CAST(hwpt_nested); + hwpt = &hwpt_nested->common; + + hwpt_nested->viommu = viommu; + hwpt_nested->parent = viommu->hwpt; + refcount_inc(&viommu->obj.users); + + hwpt->domain = viommu->ops->domain_alloc_nested(viommu, user_data); + if (IS_ERR(hwpt->domain)) { + rc = PTR_ERR(hwpt->domain); + hwpt->domain = NULL; + goto out_abort; + } + hwpt->domain->owner = viommu->iommu_dev->ops; + + if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) { + rc = -EINVAL; + goto out_abort; + } + return hwpt_nested; + +out_abort: + iommufd_object_abort_and_destroy(viommu->ictx, &hwpt->obj); + return ERR_PTR(rc); +} + int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) { struct iommu_hwpt_alloc *cmd = ucmd->cmd; @@ -316,6 +367,22 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) goto out_unlock; } hwpt = &hwpt_nested->common; + } else if (pt_obj->type == IOMMUFD_OBJ_VIOMMU) { + struct iommufd_hwpt_nested *hwpt_nested; + struct iommufd_viommu *viommu; + + viommu = container_of(pt_obj, struct iommufd_viommu, obj); + if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) { + rc = -EINVAL; + goto out_unlock; + } + hwpt_nested = iommufd_hwpt_nested_alloc_for_viommu(viommu, + &user_data); + if (IS_ERR(hwpt_nested)) { + rc = PTR_ERR(hwpt_nested); + goto out_unlock; + } + hwpt = &hwpt_nested->common; } else { rc = -EINVAL; goto out_put_pt;
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
+static struct iommufd_hwpt_nested * +iommufd_hwpt_nested_alloc_for_viommu(struct iommufd_viommu *viommu,
const struct iommu_user_data *user_data)
probably "_for" can be skipped to reduce the name length
+{
- struct iommufd_hwpt_nested *hwpt_nested;
- struct iommufd_hw_pagetable *hwpt;
- int rc;
- if (!viommu->ops || !viommu->ops->domain_alloc_nested)
return ERR_PTR(-EOPNOTSUPP);
- hwpt_nested = __iommufd_object_alloc(
viommu->ictx, hwpt_nested, IOMMUFD_OBJ_HWPT_NESTED,
common.obj);
- if (IS_ERR(hwpt_nested))
return ERR_CAST(hwpt_nested);
- hwpt = &hwpt_nested->common;
- hwpt_nested->viommu = viommu;
- hwpt_nested->parent = viommu->hwpt;
- refcount_inc(&viommu->obj.users);
- hwpt->domain = viommu->ops->domain_alloc_nested(viommu,
user_data);
- if (IS_ERR(hwpt->domain)) {
rc = PTR_ERR(hwpt->domain);
hwpt->domain = NULL;
goto out_abort;
- }
- hwpt->domain->owner = viommu->iommu_dev->ops;
- if (WARN_ON_ONCE(hwpt->domain->type !=
IOMMU_DOMAIN_NESTED)) {
rc = -EINVAL;
goto out_abort;
- }
- return hwpt_nested;
+out_abort:
- iommufd_object_abort_and_destroy(viommu->ictx, &hwpt->obj);
- return ERR_PTR(rc);
+}
looks there missed a check on flags in this path.
On Fri, Oct 25, 2024 at 09:04:15AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
+static struct iommufd_hwpt_nested * +iommufd_hwpt_nested_alloc_for_viommu(struct iommufd_viommu *viommu,
const struct iommu_user_data *user_data)
probably "_for" can be skipped to reduce the name length
That would sound like a hwpt_nested allocating vIOMMU...
It'd be probably neutral to have iommufd_viommu_alloc_hwpt_nested, yet we have iommufd_hwpt_nested_alloc (HWPT-based) to align with..
looks there missed a check on flags in this path.
Oh yes, I missed that. Will pass in the cmd->flags.
Thanks Nicolin
For an iommu_dev that can unplug (so far only this selftest does so), the viommu->iommu_dev pointer has no guarantee of its life cycle after it is copied from the idev->dev->iommu->iommu_dev.
Track the user count of the iommu_dev. Postpone the exit routine using a completion, if refcount is unbalanced. The refcount inc/dec will be added in the following patch.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/selftest.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 540437be168a..dbd2a78c1074 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -507,14 +507,17 @@ static bool mock_domain_capable(struct device *dev, enum iommu_cap cap)
static struct iopf_queue *mock_iommu_iopf_queue;
-static struct iommu_device mock_iommu_device = { -}; +static struct mock_iommu_device { + struct iommu_device iommu_dev; + struct completion complete; + refcount_t users; +} mock_iommu;
static struct iommu_device *mock_probe_device(struct device *dev) { if (dev->bus != &iommufd_mock_bus_type.bus) return ERR_PTR(-ENODEV); - return &mock_iommu_device; + return &mock_iommu.iommu_dev; }
static void mock_domain_page_response(struct device *dev, struct iopf_fault *evt, @@ -1536,24 +1539,27 @@ int __init iommufd_test_init(void) if (rc) goto err_platform;
- rc = iommu_device_sysfs_add(&mock_iommu_device, + rc = iommu_device_sysfs_add(&mock_iommu.iommu_dev, &selftest_iommu_dev->dev, NULL, "%s", dev_name(&selftest_iommu_dev->dev)); if (rc) goto err_bus;
- rc = iommu_device_register_bus(&mock_iommu_device, &mock_ops, + rc = iommu_device_register_bus(&mock_iommu.iommu_dev, &mock_ops, &iommufd_mock_bus_type.bus, &iommufd_mock_bus_type.nb); if (rc) goto err_sysfs;
+ refcount_set(&mock_iommu.users, 1); + init_completion(&mock_iommu.complete); + mock_iommu_iopf_queue = iopf_queue_alloc("mock-iopfq");
return 0;
err_sysfs: - iommu_device_sysfs_remove(&mock_iommu_device); + iommu_device_sysfs_remove(&mock_iommu.iommu_dev); err_bus: bus_unregister(&iommufd_mock_bus_type.bus); err_platform: @@ -1563,6 +1569,15 @@ int __init iommufd_test_init(void) return rc; }
+static void iommufd_test_wait_for_users(void) +{ + if (refcount_dec_and_test(&mock_iommu.users)) + return; + /* Time out waiting for iommu device user count to become 0 */ + WARN_ON(!wait_for_completion_timeout(&mock_iommu.complete, + msecs_to_jiffies(10000))); +} + void iommufd_test_exit(void) { if (mock_iommu_iopf_queue) { @@ -1570,8 +1585,9 @@ void iommufd_test_exit(void) mock_iommu_iopf_queue = NULL; }
- iommu_device_sysfs_remove(&mock_iommu_device); - iommu_device_unregister_bus(&mock_iommu_device, + iommufd_test_wait_for_users(); + iommu_device_sysfs_remove(&mock_iommu.iommu_dev); + iommu_device_unregister_bus(&mock_iommu.iommu_dev, &iommufd_mock_bus_type.bus, &iommufd_mock_bus_type.nb); bus_unregister(&iommufd_mock_bus_type.bus);
Implement the viommu alloc/free functions to increase/reduce refcount of its dependent mock iommu device. User space can verify this loop via the IOMMU_VIOMMU_TYPE_SELFTEST.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 2 ++ drivers/iommu/iommufd/selftest.c | 41 ++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index f4bc23a92f9a..edced4ac7cd3 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -180,4 +180,6 @@ struct iommu_hwpt_invalidate_selftest { __u32 iotlb_id; };
+#define IOMMU_VIOMMU_TYPE_SELFTEST 0xdeadbeef + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index dbd2a78c1074..04dd95fe24ca 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -132,6 +132,10 @@ struct mock_iommu_domain_nested { u32 iotlb[MOCK_NESTED_DOMAIN_IOTLB_NUM]; };
+struct mock_viommu { + struct iommufd_viommu core; +}; + enum selftest_obj_type { TYPE_IDEV, }; @@ -543,6 +547,42 @@ static int mock_dev_disable_feat(struct device *dev, enum iommu_dev_features fea return 0; }
+static void mock_viommu_free(struct iommufd_viommu *viommu) +{ + struct mock_iommu_device *mock_iommu = container_of( + viommu->iommu_dev, struct mock_iommu_device, iommu_dev); + + if (refcount_dec_and_test(&mock_iommu->users)) + complete(&mock_iommu->complete); + + /* iommufd core frees mock_viommu and viommu */ +} + +static struct iommufd_viommu_ops mock_viommu_ops = { + .free = mock_viommu_free, +}; + +static struct iommufd_viommu *mock_viommu_alloc(struct device *dev, + struct iommu_domain *domain, + struct iommufd_ctx *ictx, + unsigned int viommu_type) +{ + struct mock_iommu_device *mock_iommu = + iommu_get_iommu_dev(dev, struct mock_iommu_device, iommu_dev); + struct mock_viommu *mock_viommu; + + if (viommu_type != IOMMU_VIOMMU_TYPE_SELFTEST) + return ERR_PTR(-EOPNOTSUPP); + + mock_viommu = + iommufd_viommu_alloc(ictx, mock_viommu, core, &mock_viommu_ops); + if (IS_ERR(mock_viommu)) + return ERR_CAST(mock_viommu); + + refcount_inc(&mock_iommu->users); + return &mock_viommu->core; +} + static const struct iommu_ops mock_ops = { /* * IOMMU_DOMAIN_BLOCKED cannot be returned from def_domain_type() @@ -562,6 +602,7 @@ static const struct iommu_ops mock_ops = { .dev_enable_feat = mock_dev_enable_feat, .dev_disable_feat = mock_dev_disable_feat, .user_pasid_table = true, + .viommu_alloc = mock_viommu_alloc, .default_domain_ops = &(struct iommu_domain_ops){ .free = mock_domain_free,
Add a new iommufd_viommu FIXTURE and setup it up with a vIOMMU object.
Any new vIOMMU feature will be added as a TEST_F under that.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- tools/testing/selftests/iommu/iommufd_utils.h | 28 +++++++ tools/testing/selftests/iommu/iommufd.c | 78 +++++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 11 +++ 3 files changed, 117 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 40f6f14ce136..ca09308dad6a 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -762,3 +762,31 @@ static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 fault_fd)
#define test_cmd_trigger_iopf(device_id, fault_fd) \ ASSERT_EQ(0, _test_cmd_trigger_iopf(self->fd, device_id, fault_fd)) + +static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id, + __u32 type, __u32 flags, __u32 *viommu_id) +{ + struct iommu_viommu_alloc cmd = { + .size = sizeof(cmd), + .flags = flags, + .type = type, + .dev_id = device_id, + .hwpt_id = hwpt_id, + }; + int ret; + + ret = ioctl(fd, IOMMU_VIOMMU_ALLOC, &cmd); + if (ret) + return ret; + if (viommu_id) + *viommu_id = cmd.out_viommu_id; + return 0; +} + +#define test_cmd_viommu_alloc(device_id, hwpt_id, type, viommu_id) \ + ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ + type, 0, viommu_id)) +#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, viommu_id) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \ + type, 0, viommu_id)) diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 4927b9add5ad..49ac144cf5a4 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -128,6 +128,7 @@ TEST_F(iommufd, cmd_length) TEST_LENGTH(iommu_ioas_unmap, IOMMU_IOAS_UNMAP, length); TEST_LENGTH(iommu_option, IOMMU_OPTION, val64); TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS, __reserved); + TEST_LENGTH(iommu_viommu_alloc, IOMMU_VIOMMU_ALLOC, out_viommu_id); #undef TEST_LENGTH }
@@ -2386,4 +2387,81 @@ TEST_F(vfio_compat_mock_domain, huge_map) } }
+FIXTURE(iommufd_viommu) +{ + int fd; + uint32_t ioas_id; + uint32_t stdev_id; + uint32_t hwpt_id; + uint32_t device_id; + uint32_t viommu_id; +}; + +FIXTURE_VARIANT(iommufd_viommu) +{ + unsigned int viommu; +}; + +FIXTURE_SETUP(iommufd_viommu) +{ + self->fd = open("/dev/iommu", O_RDWR); + ASSERT_NE(-1, self->fd); + test_ioctl_ioas_alloc(&self->ioas_id); + test_ioctl_set_default_memory_limit(); + + if (variant->viommu) { + test_cmd_mock_domain(self->ioas_id, &self->stdev_id, NULL, + &self->device_id); + + /* Negative test -- invalid hwpt */ + test_err_viommu_alloc(ENOENT, self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + + /* Negative test -- not a nesting parent hwpt */ + test_cmd_hwpt_alloc(self->device_id, self->ioas_id, 0, + &self->hwpt_id); + test_err_viommu_alloc(EINVAL, self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + test_ioctl_destroy(self->hwpt_id); + + /* Allocate a nesting parent HWP */ + test_cmd_hwpt_alloc(self->device_id, self->ioas_id, + IOMMU_HWPT_ALLOC_NEST_PARENT, + &self->hwpt_id); + /* Negative test -- unsupported viommu type */ + test_err_viommu_alloc(EOPNOTSUPP, self->device_id, + self->hwpt_id, 0xdead, NULL); + + test_cmd_viommu_alloc(self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, + &self->viommu_id); + } else { + test_err_viommu_alloc(ENOENT, self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, NULL); + } +} + +FIXTURE_TEARDOWN(iommufd_viommu) +{ + if (variant->viommu) { + test_ioctl_destroy(self->viommu_id); + test_ioctl_destroy(self->hwpt_id); + } + teardown_iommufd(self->fd, _metadata); +} + +FIXTURE_VARIANT_ADD(iommufd_viommu, no_viommu) +{ + .viommu = 0, +}; + +FIXTURE_VARIANT_ADD(iommufd_viommu, mock_viommu) +{ + .viommu = 1, +}; + +TEST_F(iommufd_viommu, viommu_auto_destroy) +{ +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c index c5d5e69452b0..e9a980b7729b 100644 --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c @@ -582,6 +582,7 @@ TEST_FAIL_NTH(basic_fail_nth, device) uint32_t stdev_id; uint32_t idev_id; uint32_t hwpt_id; + uint32_t viommu_id; __u64 iova;
self->fd = open("/dev/iommu", O_RDWR); @@ -624,6 +625,16 @@ TEST_FAIL_NTH(basic_fail_nth, device)
if (_test_cmd_mock_domain_replace(self->fd, stdev_id, hwpt_id, NULL)) return -1; + + if (_test_cmd_hwpt_alloc(self->fd, idev_id, ioas_id, 0, + IOMMU_HWPT_ALLOC_NEST_PARENT, &hwpt_id, + IOMMU_HWPT_DATA_NONE, 0, 0)) + return -1; + + if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, 0, &viommu_id)) + return -1; + return 0; }
With the introduction of the new object and its infrastructure, update the doc to reflect that and add a new graph.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- Documentation/userspace-api/iommufd.rst | 69 ++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-)
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst index 2deba93bf159..92d16efad5b0 100644 --- a/Documentation/userspace-api/iommufd.rst +++ b/Documentation/userspace-api/iommufd.rst @@ -63,6 +63,37 @@ Following IOMMUFD objects are exposed to userspace: space usually has mappings from guest-level I/O virtual addresses to guest- level physical addresses.
+ - IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance, + passed to or shared with a VM. It may be some HW-accelerated virtualization + features and some SW resources used by the VM. For examples: + * Security namespace for guest owned ID, e.g. guest-controlled cache tags + * Access to a sharable nesting parent pagetable across physical IOMMUs + * Virtualization of various platforms IDs, e.g. RIDs and others + * Delivery of paravirtualized invalidation + * Direct assigned invalidation queues + * Direct assigned interrupts + * Non-affiliated event reporting + Such a vIOMMU object generally has the access to a nesting parent pagetable + to support some HW-accelerated virtualization features. So, a vIOMMU object + must be created given a nesting parent HWPT_PAGING object, and then it would + encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used + to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING. + + .. note:: + + The name "vIOMMU" isn't necessarily identical to a virtualized IOMMU in a + VM. A VM can have one giant virtualized IOMMU running on a machine having + multiple physical IOMMUs, in which case the VMM will dispatch the requests + or configurations from this single virtualized IOMMU instance to multiple + vIOMMU objects created for individual slices of different physical IOMMUs. + In other words, a vIOMMU object is always a representation of one physical + IOMMU, not necessarily of a virtualized IOMMU. For VMMs that want the full + virtualization features from physical IOMMUs, it is suggested to build the + same number of virtualized IOMMUs as the number of physical IOMMUs, so the + passed-through devices would be connected to their own virtualized IOMMUs + backed by corresponding vIOMMU objects, in which case a guest OS would do + the "dispatch" naturally instead of VMM trappings. + All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel @@ -101,6 +132,28 @@ creating the objects and links:: |------------>|iommu_domain|<----|iommu_domain|<----|device| |____________| |____________| |______|
+ _______________________________________________________________________ + | iommufd (with vIOMMU) | + | | + | [5] | + | _____________ | + | | | | + | |----------------| vIOMMU | | + | | | | | + | | | | | + | | [1] | | [4] [2] | + | | ______ | | _____________ ________ | + | | | | | [3] | | | | | | + | | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | | + | | |______| |_____________| |_____________| |________| | + | | | | | | | + |______|________|______________|__________________|_______________|_____| + | | | | | + ______v_____ | ______v_____ ______v_____ ___v__ + | struct | | PFN | (paging) | | (nested) | |struct| + |iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device| + |____________| storage|____________| |____________| |______| + 1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can hold multiple IOAS objects. IOAS is the most generic object and does not expose interfaces that are specific to single IOMMU drivers. All operations @@ -132,7 +185,8 @@ creating the objects and links:: flag is set.
4. IOMMUFD_OBJ_HWPT_NESTED can be only manually created via the IOMMU_HWPT_ALLOC - uAPI, provided an hwpt_id via @pt_id to associate the new HWPT_NESTED object + uAPI, provided an hwpt_id or a viommu_id of a vIOMMU object encapsulating a + nesting parent HWPT_PAGING via @pt_id to associate the new HWPT_NESTED object to the corresponding HWPT_PAGING object. The associating HWPT_PAGING object must be a nesting parent manually allocated via the same uAPI previously with an IOMMU_HWPT_ALLOC_NEST_PARENT flag, otherwise the allocation will fail. The @@ -149,6 +203,18 @@ creating the objects and links:: created via the same IOMMU_HWPT_ALLOC uAPI. The difference is at the type of the object passed in via the @pt_id field of struct iommufd_hwpt_alloc.
+5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC + uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU) + and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The + iommufd core will link the vIOMMU object to the struct iommu_device that the + struct device is behind. And an IOMMU driver can implement a viommu_alloc op + to allocate its own vIOMMU data structure embedding the core-level structure + iommufd_viommu and some driver-specific data. If necessary, the driver can + also configure its HW virtualization feature for that vIOMMU (and thus for + the VM). Successful completion of this operation sets up the linkages between + the vIOMMU object and the HWPT_PAGING, then this vIOMMU object can be used + as a nesting parent object to allocate an HWPT_NESTED object described above. + A device can only bind to an iommufd due to DMA ownership claim and attach to at most one IOAS object (no support of PASID yet).
@@ -161,6 +227,7 @@ User visible objects are backed by following datastructures: - iommufd_device for IOMMUFD_OBJ_DEVICE. - iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING. - iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED. +- iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
Several terminologies when looking at these datastructures:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:20 AM
With the introduction of the new object and its infrastructure, update the doc to reflect that and add a new graph.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add a new driver-type for ARM SMMUv3 to enum iommu_viommu_type. Implement an arm_vsmmu_alloc() with its viommu op arm_vsmmu_domain_alloc_nested(), to replace arm_smmu_domain_alloc_nesting(). As an initial step, copy the VMID from s2_parent. A later cleanup series is required to move the VMID allocation out of the stage-2 domain allocation routine to this.
After that, replace nested_domain->s2_parent with nested_domain->vsmmu.
Note that the validatting conditions for a nested_domain allocation are moved from arm_vsmmu_domain_alloc_nested to arm_vsmmu_alloc, since there is no point in creating a vIOMMU (vsmmu) from the beginning if it would not support a nested_domain.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++--- include/uapi/linux/iommufd.h | 2 + .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++++------- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-- 4 files changed, 70 insertions(+), 46 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 956c12637866..5a025d310dbe 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -10,6 +10,7 @@
#include <linux/bitfield.h> #include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/kernel.h> #include <linux/mmzone.h> #include <linux/sizes.h> @@ -835,7 +836,7 @@ struct arm_smmu_domain {
struct arm_smmu_nested_domain { struct iommu_domain domain; - struct arm_smmu_domain *s2_parent; + struct arm_vsmmu *vsmmu;
__le64 ste[2]; }; @@ -1005,21 +1006,22 @@ tegra241_cmdqv_probe(struct arm_smmu_device *smmu) } #endif /* CONFIG_TEGRA241_CMDQV */
+struct arm_vsmmu { + struct iommufd_viommu core; + struct arm_smmu_device *smmu; + struct arm_smmu_domain *s2_parent; + u16 vmid; +}; + #if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD) void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); -struct iommu_domain * -arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags, - struct iommu_domain *parent, - const struct iommu_user_data *user_data); +struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, + struct iommu_domain *parent, + struct iommufd_ctx *ictx, + unsigned int viommu_type); #else #define arm_smmu_hw_info NULL -static inline struct iommu_domain * -arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags, - struct iommu_domain *parent, - const struct iommu_user_data *user_data) -{ - return ERR_PTR(-EOPNOTSUPP); -} +#define arm_vsmmu_alloc NULL #endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */
#endif /* _ARM_SMMU_V3_H */ diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f835ccf4a494..09c1b4ba46d8 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -859,9 +859,11 @@ struct iommu_fault_alloc { /** * enum iommu_viommu_type - Virtual IOMMU Type * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use + * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type */ enum iommu_viommu_type { IOMMU_VIOMMU_TYPE_DEFAULT = 0, + IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1, };
/** diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c index 44e1b9bef850..70ad857a57b8 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -34,7 +34,8 @@ static void arm_smmu_make_nested_cd_table_ste( struct arm_smmu_ste *target, struct arm_smmu_master *master, struct arm_smmu_nested_domain *nested_domain, bool ats_enabled) { - arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent, + arm_smmu_make_s2_domain_ste(target, master, + nested_domain->vsmmu->s2_parent, ats_enabled);
target->data[0] = cpu_to_le64(STRTAB_STE_0_V | @@ -75,7 +76,8 @@ static void arm_smmu_make_nested_domain_ste( break; case STRTAB_STE_0_CFG_BYPASS: arm_smmu_make_s2_domain_ste( - target, master, nested_domain->s2_parent, ats_enabled); + target, master, nested_domain->vsmmu->s2_parent, + ats_enabled); break; case STRTAB_STE_0_CFG_ABORT: default: @@ -100,7 +102,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain, struct arm_smmu_ste ste; int ret;
- if (nested_domain->s2_parent->smmu != master->smmu) + if (nested_domain->vsmmu->smmu != master->smmu) return -EINVAL; if (arm_smmu_ssids_in_use(&master->cd_table)) return -EBUSY; @@ -151,36 +153,15 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg) return 0; }
-struct iommu_domain * -arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags, - struct iommu_domain *parent, +static struct iommu_domain * +arm_vsmmu_domain_alloc_nested(struct iommufd_viommu *viommu, const struct iommu_user_data *user_data) { - struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core); struct arm_smmu_nested_domain *nested_domain; - struct arm_smmu_domain *smmu_parent; struct iommu_hwpt_arm_smmuv3 arg; int ret;
- if (flags || !(master->smmu->features & ARM_SMMU_FEAT_NESTING)) - return ERR_PTR(-EOPNOTSUPP); - - /* - * Must support some way to prevent the VM from bypassing the cache - * because VFIO currently does not do any cache maintenance. - */ - if (!arm_smmu_master_canwbs(master) && - !(master->smmu->features & ARM_SMMU_FEAT_S2FWB)) - return ERR_PTR(-EOPNOTSUPP); - - /* - * The core code checks that parent was created with - * IOMMU_HWPT_ALLOC_NEST_PARENT - */ - smmu_parent = to_smmu_domain(parent); - if (smmu_parent->smmu != master->smmu) - return ERR_PTR(-EINVAL); - ret = iommu_copy_struct_from_user(&arg, user_data, IOMMU_HWPT_DATA_ARM_SMMUV3, ste); if (ret) @@ -196,9 +177,51 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
nested_domain->domain.type = IOMMU_DOMAIN_NESTED; nested_domain->domain.ops = &arm_smmu_nested_ops; - nested_domain->s2_parent = smmu_parent; + nested_domain->vsmmu = vsmmu; nested_domain->ste[0] = arg.ste[0]; nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);
return &nested_domain->domain; } + + +static const struct iommufd_viommu_ops arm_vsmmu_ops = { + .domain_alloc_nested = arm_vsmmu_domain_alloc_nested, +}; + +struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev, + struct iommu_domain *parent, + struct iommufd_ctx *ictx, + unsigned int viommu_type) +{ + struct arm_smmu_device *smmu = + iommu_get_iommu_dev(dev, struct arm_smmu_device, iommu); + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct arm_smmu_domain *s2_parent = to_smmu_domain(parent); + struct arm_vsmmu *vsmmu; + + if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3) + return ERR_PTR(-EOPNOTSUPP); + + if (!(smmu->features & ARM_SMMU_FEAT_NESTING)) + return ERR_PTR(-EOPNOTSUPP); + + /* + * Must support some way to prevent the VM from bypassing the cache + * because VFIO currently does not do any cache maintenance. + */ + if (!arm_smmu_master_canwbs(master) && + !(smmu->features & ARM_SMMU_FEAT_S2FWB)) + return ERR_PTR(-EOPNOTSUPP); + + vsmmu = iommufd_viommu_alloc(ictx, arm_vsmmu, core, &arm_vsmmu_ops); + if (IS_ERR(vsmmu)) + return ERR_CAST(vsmmu); + + vsmmu->smmu = smmu; + vsmmu->s2_parent = s2_parent; + /* FIXME Move VMID allocation from the S2 domain allocation to here */ + vsmmu->vmid = s2_parent->s2_cfg.vmid; + + return &vsmmu->core; +} diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index c7cc98961019..de598d66b5c2 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2661,7 +2661,7 @@ to_smmu_domain_devices(struct iommu_domain *domain) domain->type == IOMMU_DOMAIN_SVA) return to_smmu_domain(domain); if (domain->type == IOMMU_DOMAIN_NESTED) - return to_smmu_nested_domain(domain)->s2_parent; + return to_smmu_nested_domain(domain)->vsmmu->s2_parent; return NULL; }
@@ -3126,13 +3126,9 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags, struct arm_smmu_domain *smmu_domain; int ret;
- if (parent) - return arm_smmu_domain_alloc_nesting(dev, flags, parent, - user_data); - if (flags & ~PAGING_FLAGS) return ERR_PTR(-EOPNOTSUPP); - if (user_data) + if (parent || user_data) return ERR_PTR(-EOPNOTSUPP);
smmu_domain = arm_smmu_domain_alloc(); @@ -3541,6 +3537,7 @@ static struct iommu_ops arm_smmu_ops = { .dev_disable_feat = arm_smmu_dev_disable_feature, .page_response = arm_smmu_page_response, .def_domain_type = arm_smmu_def_domain_type, + .viommu_alloc = arm_vsmmu_alloc, .user_pasid_table = 1, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE,
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:20 AM
Add a new driver-type for ARM SMMUv3 to enum iommu_viommu_type. Implement an arm_vsmmu_alloc() with its viommu op arm_vsmmu_domain_alloc_nested(), to replace arm_smmu_domain_alloc_nesting(). As an initial step, copy the VMID from s2_parent. A later cleanup series is required to move the VMID allocation out of the stage-2 domain allocation routine to this.
After that, replace nested_domain->s2_parent with nested_domain->vsmmu.
Note that the validatting conditions for a nested_domain allocation are moved from arm_vsmmu_domain_alloc_nested to arm_vsmmu_alloc, since there is no point in creating a vIOMMU (vsmmu) from the beginning if it would not support a nested_domain.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
hmm I wonder whether this series should be merged with Jason's nesting series together and directly use vIOMMU to create nesting. Otherwise it looks a bit weird for one series to first enable a uAPI which is immediately replaced by another uAPI from the following series. Even if both are merged in one cycle, logically it doesn't sound clean when looking at the git history.
On Fri, Oct 25, 2024 at 09:18:05AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:20 AM
Add a new driver-type for ARM SMMUv3 to enum iommu_viommu_type. Implement an arm_vsmmu_alloc() with its viommu op arm_vsmmu_domain_alloc_nested(), to replace arm_smmu_domain_alloc_nesting(). As an initial step, copy the VMID from s2_parent. A later cleanup series is required to move the VMID allocation out of the stage-2 domain allocation routine to this.
After that, replace nested_domain->s2_parent with nested_domain->vsmmu.
Note that the validatting conditions for a nested_domain allocation are moved from arm_vsmmu_domain_alloc_nested to arm_vsmmu_alloc, since there is no point in creating a vIOMMU (vsmmu) from the beginning if it would not support a nested_domain.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
hmm I wonder whether this series should be merged with Jason's nesting series together and directly use vIOMMU to create nesting. Otherwise it looks a bit weird for one series to first enable a uAPI which is immediately replaced by another uAPI from the following series.
It has changed from my original expectation, that's for sure. I've wondered the same thing.
For now I've been keeping them separate and was going to review when this is all settled down.
It is troublesome because of all the branches, but if we don't have a conflict we could take the whole lot through iommufd.
Jason
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi- IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a parent HWPT typically hold one stage-2 IO pagetable and tag it with only one ID in the cache entries. When sharing one large stage-2 IO pagetable across physical IOMMU instances, that one ID may not always be available across all the IOMMU instances. In other word, it's ideal for SW to have a different container for the stage-2 IO pagetable so it can hold another ID that's available.
Just holding multiple IDs doesn't require a different container. This is just a side effect when vIOMMU will be required for other said reasons.
If we have to put more words here I'd prefer to adding a bit more for CMDQV which is more compelling. not a big deal though. 😊
For this "different container", add vIOMMU, an additional layer to hold extra virtualization information:
| iommufd (with vIOMMU) | | | | [5] | | _____________ | | | | | | |----------------| vIOMMU | | | | | | | | | | | | | | [1] | | [4] [2] | | | ______ | | _____________ ________ | | | | | | [3] | | | | | | | | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | | | | |______| |_____________| |_____________| |________| | | | | | | | |
|______|________|______________|__________________|_____________ __|_____| | | | | | ______v_____ | ______v_____ ______v_____ ___v__ | struct | | PFN | (paging) | | (nested) | |struct| |iommu_device| |------>|iommu_domain|<----|iommu_domain|<---- |device| |____________| storage|____________| |____________| |______|
nit - [1] ... [5] can be removed.
The vIOMMU object should be seen as a slice of a physical IOMMU instance that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
sorry no idea about 'non-affiliated event'. Can you elaborate?
On a multi-IOMMU system, the vIOMMU object must be instanced to the number of the physical IOMMUs that are passed to (via devices) a guest VM, while
'to the number of the physical IOMMUs that have a slice passed to ..."
being able to hold the shareable parent HWPT. Each vIOMMU then just needs to allocate its own individual ID to tag its own cache: ---------------------------- ---------------- | | paging_hwpt0 | | hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
---------------- | | paging_hwpt0 | | hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we repurposed the vIOMMU object from an earlier RFC, just for a referece: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4 (paring QEMU branch for testing will be provided with the part2 series)
Changelog v4
- Added "Reviewed-by" from Jason
- Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
- Dropped iommufd_object_alloc_elm renamings
- Renamed iommufd's viommu_api.c to driver.c
- Reworked iommufd_viommu_alloc helper
- Added a separate iommufd_hwpt_nested_alloc_for_viommu function for hwpt_nested allocations on a vIOMMU, and added comparison between viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
- Replaced s2_parent with vsmmu in arm_smmu_nested_domain
- Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in viommu_ops
- Replaced wait_queue_head_t with a completion, to delay the unplug of mock_iommu_dev
- Corrected documentation graph that was missing struct iommu_device
- Added an iommufd_verify_unfinalized_object helper to verify driver- allocated vIOMMU/vDEVICE objects
- Added missing test cases for TEST_LENGTH and fail_nth
v3 https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
- Rebased on top of Jason's nesting v3 series https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-
smmuv3_nesting_jgg@nvidia.com/
- Split the series into smaller parts
- Added Jason's Reviewed-by
- Added back viommu->iommu_dev
- Added support for driver-allocated vIOMMU v.s. core-allocated
- Dropped arm_smmu_cache_invalidate_user
- Added an iommufd_test_wait_for_users() in selftest
- Reworked test code to make viommu an individual FIXTURE
- Added missing TEST_LENGTH case for the new ioctl command
v2 https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
- Limited vdev_id to one per idev
- Added a rw_sem to protect the vdev_id list
- Reworked driver-level APIs with proper lockings
- Added a new viommu_api file for IOMMUFD_DRIVER config
- Dropped useless iommu_dev point from the viommu structure
- Added missing index numnbers to new types in the uAPI header
- Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT
one
- Reworked mock_viommu_cache_invalidate() using the new iommu helper
- Reordered details of set/unset_vdev_id handlers for proper lockings
v1 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks! Nicolin
Nicolin Chen (11): iommufd: Move struct iommufd_object to public iommufd header iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct iommufd: Add iommufd_verify_unfinalized_object iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl iommufd: Add domain_alloc_nested op to iommufd_viommu_ops iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC iommufd/selftest: Add refcount to mock_iommu_device iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage Documentation: userspace-api: iommufd: Update vIOMMU iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++--- drivers/iommu/iommufd/iommufd_private.h | 36 ++------ drivers/iommu/iommufd/iommufd_test.h | 2 + include/linux/iommu.h | 14 +++ include/linux/iommufd.h | 89 +++++++++++++++++++ include/uapi/linux/iommufd.h | 56 ++++++++++-- tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++ .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++------ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +- drivers/iommu/iommufd/driver.c | 38 ++++++++ drivers/iommu/iommufd/hw_pagetable.c | 69 +++++++++++++- drivers/iommu/iommufd/main.c | 58 ++++++------ drivers/iommu/iommufd/selftest.c | 73 +++++++++++++-- drivers/iommu/iommufd/viommu.c | 85 ++++++++++++++++++ tools/testing/selftests/iommu/iommufd.c | 78 ++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 11 +++ Documentation/userspace-api/iommufd.rst | 69 +++++++++++++- 18 files changed, 701 insertions(+), 124 deletions(-) create mode 100644 drivers/iommu/iommufd/driver.c create mode 100644 drivers/iommu/iommufd/viommu.c
-- 2.43.0
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
The vIOMMU object should be seen as a slice of a physical IOMMU instance that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
sorry no idea about 'non-affiliated event'. Can you elaborate?
This would be an even that is not a connected to a device
For instance a CMDQ experienced a problem.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, October 25, 2024 11:43 PM
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
The vIOMMU object should be seen as a slice of a physical IOMMU
instance
that is passed to or shared with a VM. That can be some HW/SW
resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache
tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
sorry no idea about 'non-affiliated event'. Can you elaborate?
This would be an even that is not a connected to a device
For instance a CMDQ experienced a problem.
Okay, then 'non-device-affiliated' is probably clearer.
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, October 22, 2024 8:19 AM
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi- IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a parent HWPT typically hold one stage-2 IO pagetable and tag it with only one ID in the cache entries. When sharing one large stage-2 IO pagetable across physical IOMMU instances, that one ID may not always be available across all the IOMMU instances. In other word, it's ideal for SW to have a different container for the stage-2 IO pagetable so it can hold another ID that's available.
Just holding multiple IDs doesn't require a different container. This is just a side effect when vIOMMU will be required for other said reasons.
If we have to put more words here I'd prefer to adding a bit more for CMDQV which is more compelling. not a big deal though. 😊
Ack.
For this "different container", add vIOMMU, an additional layer to hold extra virtualization information:
| iommufd (with vIOMMU) | | | | [5] | | _____________ | | | | | | |----------------| vIOMMU | | | | | | | | | | | | | | [1] | | [4] [2] | | | ______ | | _____________ ________ | | | | | | [3] | | | | | | | | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | | | | |______| |_____________| |_____________| |________| | | | | | | | |
|______|________|______________|__________________|_____________ __|_____| | | | | | ______v_____ | ______v_____ ______v_____ ___v__ | struct | | PFN | (paging) | | (nested) | |struct| |iommu_device| |------>|iommu_domain|<----|iommu_domain|<---- |device| |____________| storage|____________| |____________| |______|
nit - [1] ... [5] can be removed.
They are copied from the Documentation where numbers are needed. I will take all the numbers out in the cover-letters.
The vIOMMU object should be seen as a slice of a physical IOMMU instance that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
sorry no idea about 'non-affiliated event'. Can you elaborate?
I'll put an "e.g.".
On a multi-IOMMU system, the vIOMMU object must be instanced to the number of the physical IOMMUs that are passed to (via devices) a guest VM, while
'to the number of the physical IOMMUs that have a slice passed to ..."
Ack.
Thanks Nicolin
linux-kselftest-mirror@lists.linaro.org