[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When the IOMMU is disabled, the MSI address is programmed to the physical location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI address is programmed to an allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS page:
  IOVA (0xFFFF0000) ===> PA (0x20200000)
When a 2-stage translation is enabled, the IOVA is still used to program the MSI address, though the mapping is done in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, IOMMU_RESV_SW_MSI works well when the kernel is entirely in charge of the IOMMU translation (1-stage translation), since the IOVA for the ITS page is fixed and known by the kernel. However, when a virtual machine enables a nested IOMMU translation (2-stage), the guest kernel directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA such as 0x80900000) into its own IOVA space (e.g. 0xEEEE0000). The host kernel then cannot know the guest-level IOVA that it needs to program as the MSI address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. VMM could insert a few RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel would fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT region per iommu group for a direct mapping. Eventually, the mappings would look like:
     IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
   This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC driver, to program the correct MSI IOVA. Forward the VMM-defined vITS page location (IPA) to the kernel for the stage-2 mapping. Eventually:
     IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
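To make the two mapping chains concrete, here is a minimal user-space sketch that models each stage as a tiny lookup table and walks an address through both. Everything in it (the struct, the function names, the 4 KiB page assumption, the 0x40 offset) is hypothetical and for illustration only; it is not how the IOMMU driver or iommufd represents page tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_OFFSET_MASK 0xfffULL	/* 4 KiB pages, an assumption of this model */

/* One page-granular mapping: input page -> output page. */
struct page_map {
	uint64_t in;
	uint64_t out;
};

/* Translate one address through a single stage; returns 0 on success. */
int stage_translate(const struct page_map *tbl, size_t n, uint64_t in,
		    uint64_t *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].in == (in & ~PG_OFFSET_MASK)) {
			*out = tbl[i].out | (in & PG_OFFSET_MASK);
			return 0;
		}
	}
	return -1;	/* models a translation fault */
}

/* IOVA -> IPA (stage 1, guest-owned) -> PA (stage 2, host-owned). */
int nested_translate(const struct page_map *s1, size_t n1,
		     const struct page_map *s2, size_t n2,
		     uint64_t iova, uint64_t *pa)
{
	uint64_t ipa;

	if (stage_translate(s1, n1, iova, &ipa))
		return -1;
	return stage_translate(s2, n2, ipa, pa);
}

/* Approach (1): identity stage 1 at the fixed MSI window. */
const struct page_map a1_s1[] = { { 0x8000000, 0x8000000 } };
const struct page_map a1_s2[] = { { 0x8000000, 0x20200000 } };

/* Approach (2): guest-chosen IOVA and VMM-chosen vITS IPA. */
const struct page_map a2_s1[] = { { 0xFFFF0000, 0x80900000 } };
const struct page_map a2_s2[] = { { 0x80900000, 0x20200000 } };

/* Both approaches must end up at the physical ITS page. */
void check_both_approaches(void)
{
	uint64_t pa;

	assert(!nested_translate(a1_s1, 1, a1_s2, 1, 0x8000040, &pa));
	assert(pa == 0x20200040);
	assert(!nested_translate(a2_s1, 1, a2_s2, 1, 0xFFFF0040, &pa));
	assert(pa == 0x20200040);
}
```

In this model the only difference between the approaches is who needs to know which table entry: with (1) the host only has to agree with the VMM on the fixed window, while with (2) the host must learn both the guest IOVA and the vITS IPA.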
Worth mentioning that when Eric Auger was working on the same topic with the VFIO iommu uAPI, he had the approach (2) first, and then switched to the approach (1), suggested by Jean-Philippe for reduction of complexity.
Approach (1) basically resembles the existing VFIO passthrough, which has a 1-stage mapping for the unmanaged domain, only shifting the MSI mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu case). So, it can reuse the existing IOMMU_RESV_SW_MSI piece, sharing the same idea of "VMM leaving everything to the kernel".
Approach (2) is the ideal solution, yet it requires additional effort for the kernel to be aware of the stage-1 gIOVA(s) and stage-2 IPA(s) of the vITS page(s), which demands close cooperation from the VMM.
 * It also brings some complicated use cases to the table, where the host and/or guest system(s) have multiple ITS pages.
[ Execution ]
The iommu core rework (part-1) for iommufd_sw_msi is merged, so IOMMU_RESV_SW_MSI can now be used as an ABI. The VMM can take this hard-coded MSI window and create a direct stage-1 mapping using an RMR in the guest's IORT. However, a proper uAPI must be defined for the kernel and the VMM to agree on this virtual MSI window.
Moreover, some use cases might want to map the IOVAs in IOMMU_RESV_SW_MSI for something else. This requires kernel to provide an interface to shift the software MSI window to a different region: https://lore.kernel.org/all/20250909154600.910110-1-shyamsaini@linux.microso...
This series, as a follow-up series, introduces a pair of iommufd options for user space to configure the software MSI window.
[ Future Plan ]
Part-3 and beyond will continue the effort of supporting approach (2) for a complete vITS-to-pITS mapping:
 1) Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
 2) Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)
    (Note that the set_option uAPI in this series might not fit, since this requires an array of MSI IOVAs.)
This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr
Changelog
v2
 * Rebase on v6.18-rc1
 * Update commit logs and kdocs
 * Add a patch fixing iommufd_device_is_attached()
 * Add sanity check for overflow and cover it in the selftest
v1 (containing part-1 that is now merged)
 https://lore.kernel.org/all/cover.1739005085.git.nicolinc@nvidia.com/
Thanks! Nicolin
Nicolin Chen (7):
  iommufd/device: Move sw_msi_start from igroup to idev
  iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions
  iommufd/device: Make iommufd_device_is_attached non-static
  iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
  iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH
  iommufd/selftest: Add a testing reserved region
  iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
 drivers/iommu/iommufd/iommufd_private.h       |   7 +-
 drivers/iommu/iommufd/iommufd_test.h          |   4 +
 include/uapi/linux/iommufd.h                  |  21 +++-
 drivers/iommu/iommufd/device.c                |  43 +++----
 drivers/iommu/iommufd/driver.c                |   4 +-
 drivers/iommu/iommufd/io_pagetable.c          |  18 ++-
 drivers/iommu/iommufd/ioas.c                  | 113 ++++++++++++++++++
 drivers/iommu/iommufd/main.c                  |   4 +
 drivers/iommu/iommufd/selftest.c              |  35 +++++-
 tools/testing/selftests/iommu/iommufd.c       | 105 ++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  21 ++++
 11 files changed, 339 insertions(+), 36 deletions(-)
At the IOMMU driver layer the IOMMU_RESV_SW_MSI region is per device. So, storing the sw_msi_start per idev makes sense.
And looking at the iommufd_sw_msi design:
 - The global ictx->sw_msi_list allocates an item for each different pair of sw_msi_start and msi_addr. And the allocation is per msi_desc, i.e. per idev.
 - Each allocated list item will be added to the igroup->required_sw_msi and the hwpt_paging->present_sw_msi.bitmap during a device attachment.
This makes it possible to move the sw_msi_start from struct iommufd_group to struct iommufd_device, opening the door to a new per-idev SET_OPTION uAPI that lets user space configure the sw_msi_start for 2-stage mappings of the MSI window.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_private.h |  2 +-
 drivers/iommu/iommufd/device.c          | 31 +++++++++++++------------
 drivers/iommu/iommufd/driver.c          |  4 ++--
 3 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 627f9b78483a0..73e5cddad24e9 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -472,7 +472,6 @@ struct iommufd_group { struct iommu_group *group; struct xarray pasid_attach; struct iommufd_sw_msi_maps required_sw_msi; - phys_addr_t sw_msi_start; };
/* @@ -490,6 +489,7 @@ struct iommufd_device { bool enforce_cache_coherency; struct iommufd_vdevice *vdev; bool destroying; + phys_addr_t sw_msi_start; };
static inline struct iommufd_device * diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 4c842368289f0..ea7ed32bbaede 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -96,7 +96,6 @@ static struct iommufd_group *iommufd_get_group(struct iommufd_ctx *ictx, kref_init(&new_igroup->ref); mutex_init(&new_igroup->lock); xa_init(&new_igroup->pasid_attach); - new_igroup->sw_msi_start = PHYS_ADDR_MAX; /* group reference moves into new_igroup */ new_igroup->group = group;
@@ -272,6 +271,7 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, refcount_inc(&idev->obj.users); /* igroup refcount moves into iommufd_device */ idev->igroup = igroup; + idev->sw_msi_start = PHYS_ADDR_MAX;
/* * If the caller fails after this success it must call @@ -367,13 +367,13 @@ static unsigned int iommufd_group_device_num(struct iommufd_group *igroup, }
#ifdef CONFIG_IRQ_MSI_IOMMU -static int iommufd_group_setup_msi(struct iommufd_group *igroup, - struct iommufd_hwpt_paging *hwpt_paging) +static int iommufd_device_setup_msi(struct iommufd_device *idev, + struct iommufd_hwpt_paging *hwpt_paging) { - struct iommufd_ctx *ictx = igroup->ictx; + struct iommufd_ctx *ictx = idev->ictx; struct iommufd_sw_msi_map *cur;
- if (igroup->sw_msi_start == PHYS_ADDR_MAX) + if (idev->sw_msi_start == PHYS_ADDR_MAX) return 0;
/* @@ -383,8 +383,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup, list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) { int rc;
- if (cur->sw_msi_start != igroup->sw_msi_start || - !test_bit(cur->id, igroup->required_sw_msi.bitmap)) + if (cur->sw_msi_start != idev->sw_msi_start || + !test_bit(cur->id, idev->igroup->required_sw_msi.bitmap)) continue;
rc = iommufd_sw_msi_install(ictx, hwpt_paging, cur); @@ -395,8 +395,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup, } #else static inline int -iommufd_group_setup_msi(struct iommufd_group *igroup, - struct iommufd_hwpt_paging *hwpt_paging) +iommufd_device_setup_msi(struct iommufd_device *idev, + struct iommufd_hwpt_paging *hwpt_paging) { return 0; } @@ -420,12 +420,12 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt, idev->dev, - &igroup->sw_msi_start); + &idev->sw_msi_start); if (rc) return rc;
if (iommufd_group_first_attach(igroup, IOMMU_NO_PASID)) { - rc = iommufd_group_setup_msi(igroup, hwpt_paging); + rc = iommufd_device_setup_msi(idev, hwpt_paging); if (rc) { iopt_remove_reserved_iova(&hwpt_paging->ioas->iopt, idev->dev); @@ -745,9 +745,10 @@ iommufd_group_remove_reserved_iova(struct iommufd_group *igroup, }
static int -iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup, - struct iommufd_hwpt_paging *hwpt_paging) +iommufd_device_do_replace_reserved_iova(struct iommufd_device *idev, + struct iommufd_hwpt_paging *hwpt_paging) { + struct iommufd_group *igroup = idev->igroup; struct iommufd_hwpt_paging *old_hwpt_paging; struct iommufd_attach *attach; struct iommufd_device *cur; @@ -767,7 +768,7 @@ iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup, } }
- rc = iommufd_group_setup_msi(igroup, hwpt_paging); + rc = iommufd_device_setup_msi(idev, hwpt_paging); if (rc) goto err_unresv; return 0; @@ -813,7 +814,7 @@ iommufd_device_do_replace(struct iommufd_device *idev, ioasid_t pasid, }
if (attach_resv) { - rc = iommufd_group_do_replace_reserved_iova(igroup, hwpt_paging); + rc = iommufd_device_do_replace_reserved_iova(idev, hwpt_paging); if (rc) goto err_unlock; } diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 6f1010da221c9..35475937d069b 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -271,7 +271,7 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
handle = to_iommufd_handle(raw_handle); /* No IOMMU_RESV_SW_MSI means no change to the msi_msg */ - if (handle->idev->igroup->sw_msi_start == PHYS_ADDR_MAX) + if (handle->idev->sw_msi_start == PHYS_ADDR_MAX) return 0;
ictx = handle->idev->ictx; @@ -283,7 +283,7 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc, */ msi_map = iommufd_sw_msi_get_map(handle->idev->ictx, msi_addr & PAGE_MASK, - handle->idev->igroup->sw_msi_start); + handle->idev->sw_msi_start); if (IS_ERR(msi_map)) return PTR_ERR(msi_map);
The per-device sw_msi window, defined by sw_msi_start and a size, is from the IOMMU driver where a static IOMMU_RESV_SW_MSI region is defined.
But soon user space will be allowed to configure the sw_msi window, via a new SET_OPTION uAPI.
To honor such a user-defined window, iopt_table_enforce_dev_resv_regions() will need to access the sw_msi_start and sw_msi_size stored in the idev struct.
So, pass in idev pointer instead to prepare for the new uAPI.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_private.h | 2 +-
 drivers/iommu/iommufd/device.c          | 5 ++---
 drivers/iommu/iommufd/io_pagetable.c    | 3 ++-
 3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 73e5cddad24e9..cc758610b9f7c 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -132,7 +132,7 @@ int iopt_table_add_domain(struct io_pagetable *iopt, void iopt_table_remove_domain(struct io_pagetable *iopt, struct iommu_domain *domain); int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, - struct device *dev, + struct iommufd_device *idev, phys_addr_t *sw_msi_start); int iopt_set_allow_iova(struct io_pagetable *iopt, struct rb_root_cached *allowed_iova); diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index ea7ed32bbaede..2a816533dc10e 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -418,8 +418,7 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
lockdep_assert_held(&igroup->lock);
- rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt, - idev->dev, + rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt, idev, &idev->sw_msi_start); if (rc) return rc; @@ -762,7 +761,7 @@ iommufd_device_do_replace_reserved_iova(struct iommufd_device *idev, if (!old_hwpt_paging || hwpt_paging->ioas != old_hwpt_paging->ioas) { xa_for_each(&attach->device_array, index, cur) { rc = iopt_table_enforce_dev_resv_regions( - &hwpt_paging->ioas->iopt, cur->dev, NULL); + &hwpt_paging->ioas->iopt, cur, NULL); if (rc) goto err_unresv; } diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index c0360c450880b..dee0aa3e7cb4a 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1440,9 +1440,10 @@ void iopt_remove_access(struct io_pagetable *iopt,
/* Narrow the valid_iova_itree to include reserved ranges from a device. */ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, - struct device *dev, + struct iommufd_device *idev, phys_addr_t *sw_msi_start) { + struct device *dev = idev->dev; struct iommu_resv_region *resv; LIST_HEAD(resv_regions); unsigned int num_hw_msi = 0;
A new SET_OPTION will reuse this helper for a sanity check before setting a per-idev property.
Given that the attach handle can be NULL if the device is not attached, add a pointer check after the xa_load() of the attach handle and before dereferencing it. This is not a problem currently, as the only caller iommufd_device_do_replace() verifies the attach handle.
Also, add lockdep_assert_held on igroup's mutex.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_private.h | 1 +
 drivers/iommu/iommufd/device.c          | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index cc758610b9f7c..c458ab16736b6 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -502,6 +502,7 @@ iommufd_get_device(struct iommufd_ucmd *ucmd, u32 id)
void iommufd_device_pre_destroy(struct iommufd_object *obj); void iommufd_device_destroy(struct iommufd_object *obj); +bool iommufd_device_is_attached(struct iommufd_device *idev, ioasid_t pasid); int iommufd_get_hw_info(struct iommufd_ucmd *ucmd);
struct iommufd_access { diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 2a816533dc10e..45a1d1603c009 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -436,13 +436,14 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
/* The device attach/detach/replace helpers for attach_handle */
-static bool iommufd_device_is_attached(struct iommufd_device *idev, - ioasid_t pasid) +bool iommufd_device_is_attached(struct iommufd_device *idev, ioasid_t pasid) { struct iommufd_attach *attach;
+ lockdep_assert_held(&idev->igroup->lock); + attach = xa_load(&idev->igroup->pasid_attach, pasid); - return xa_load(&attach->device_array, idev->obj.id); + return attach && xa_load(&attach->device_array, idev->obj.id); }
static int iommufd_hwpt_pasid_compat(struct iommufd_hw_pagetable *hwpt,
For systems that require MSI pages to be mapped into the IOMMU translation, the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default recommended IOVA window to place these mappings. However, there is nothing special about this address. And to support the RMR trick in the VMM for nested translation, the VMM needs to know which sw_msi window the kernel is using.
Moreover, there are cases that the default IOMMU_RESV_SW_MSI region cannot be reserved as some platforms reserve this address for other purposes: https://lore.kernel.org/all/20250909154600.910110-1-shyamsaini@linux.microso...
Provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl that the VMM can use to directly specify its desired sw_msi window, which replaces and disables the default IOMMU_RESV_SW_MSI from the driver, to avoid having to build an API to discover the default IOMMU_RESV_SW_MSI.
Since iommufd now has its own sw_msi function, this is easy to implement.
Keep these two options per iommufd_device, so each device can set its own desired MSI window. VMM must set the values before attaching the device to any HWPT/IOAS to have an effect.
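As a rough illustration of how a VMM might drive the two options, the sketch below only builds the two request structures in the order the kdoc asks for (START first, then SIZE in MB). The struct is a local mirror of struct iommu_option from include/uapi/linux/iommufd.h so the example compiles without kernel headers; the option values 2 and 3 come from this patch, while the helper name and the OPT_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SZ_1M 0x100000ULL

/* Local mirror of struct iommu_option (uapi), for illustration only. */
struct iommu_option_req {
	uint32_t size;
	uint32_t option_id;
	uint16_t op;
	uint16_t reserved;
	uint32_t object_id;
	uint64_t val64 __attribute__((aligned(8)));
};
static_assert(sizeof(struct iommu_option_req) == 24, "unexpected layout");

enum { OPT_OP_SET = 0, OPT_OP_GET = 1 };
/* Values proposed by this patch for IOMMU_OPTION_SW_MSI_START/SIZE. */
enum { OPT_SW_MSI_START = 2, OPT_SW_MSI_SIZE = 3 };

/*
 * Fill req[0] (START) and req[1] (SIZE) for one device. Both start and
 * size_bytes must be 1MB aligned; the SIZE option is expressed in MB.
 * Returns 0 on success, -1 on a misaligned or empty window.
 */
int build_sw_msi_requests(uint32_t dev_id, uint64_t start,
			  uint64_t size_bytes,
			  struct iommu_option_req req[2])
{
	if ((start & (SZ_1M - 1)) || !size_bytes ||
	    (size_bytes & (SZ_1M - 1)))
		return -1;

	memset(req, 0, 2 * sizeof(*req));

	req[0].size = sizeof(req[0]);
	req[0].option_id = OPT_SW_MSI_START;
	req[0].op = OPT_OP_SET;
	req[0].object_id = dev_id;	/* per-device: object_id is the device ID */
	req[0].val64 = start;

	req[1].size = sizeof(req[1]);
	req[1].option_id = OPT_SW_MSI_SIZE;
	req[1].op = OPT_OP_SET;
	req[1].object_id = dev_id;
	req[1].val64 = size_bytes / SZ_1M;	/* bytes -> MB */

	return 0;
}
```

A real caller would presumably issue each request with ioctl(fd, IOMMU_OPTION, &req[i]) on the /dev/iommu file descriptor, after binding the device and before attaching it to any HWPT/IOAS, since the patch returns -EBUSY once the device is attached.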
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_private.h |   2 +
 include/uapi/linux/iommufd.h            |  21 ++++-
 drivers/iommu/iommufd/io_pagetable.c    |  15 +++-
 drivers/iommu/iommufd/ioas.c            | 113 ++++++++++++++++++++++++
 drivers/iommu/iommufd/main.c            |   4 +
 5 files changed, 151 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index c458ab16736b6..1defd416813c8 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -346,6 +346,7 @@ int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd); int iommufd_ioas_copy(struct iommufd_ucmd *ucmd); int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd); int iommufd_ioas_option(struct iommufd_ucmd *ucmd); +int iommufd_option_sw_msi(struct iommufd_ucmd *ucmd); int iommufd_option_rlimit_mode(struct iommu_option *cmd, struct iommufd_ctx *ictx);
@@ -490,6 +491,7 @@ struct iommufd_device { struct iommufd_vdevice *vdev; bool destroying; phys_addr_t sw_msi_start; + size_t sw_msi_size; };
static inline struct iommufd_device * diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index c218c89e0e2eb..5e5277f77a97b 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -296,7 +296,9 @@ struct iommu_ioas_unmap {
/** * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and - * ioctl(IOMMU_OPTION_HUGE_PAGES) + * ioctl(IOMMU_OPTION_HUGE_PAGES) and + * ioctl(IOMMU_OPTION_SW_MSI_START) and + * ioctl(IOMMU_OPTION_SW_MSI_SIZE) * @IOMMU_OPTION_RLIMIT_MODE: * Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege * to invoke this. Value 0 (default) is user based accounting, 1 uses process @@ -306,10 +308,27 @@ struct iommu_ioas_unmap { * iommu mappings. Value 0 disables combining, everything is mapped to * PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS * option, the object_id must be the IOAS ID. + * @IOMMU_OPTION_SW_MSI_START: + * Change the base address of the IOMMU mapping region for MSI doorbell(s). + * This option being unset or @IOMMU_OPTION_SW_MSI_SIZE being value 0 tells + * the kernel to pick its default MSI doorbell window, ignoring these two + * options. To set this option, userspace must do before attaching a device + * to an IOAS/HWPT. Otherwise, kernel will return error (-EBUSY). An address + * must be 1MB aligned. This option is per-device, the object_id must be the + * device ID. + * @IOMMU_OPTION_SW_MSI_SIZE: + * Change the size (in MB) of the IOMMU mapping region for MSI doorbell(s). + * The minimum value is 1 MB. A value 0 (default) tells the kernel to ignore + * the base address value set to @IOMMU_OPTION_SW_MSI_START, and to pick its + * default MSI doorbell window. Same requirements are applied to this option + * too, so check @IOMMU_OPTION_SW_MSI_START for details. User space must set + * IOMMU_OPTION_SW_MSI_START first before setting IOMMU_OPTION_SW_MSI_SIZE. */ enum iommufd_option { IOMMU_OPTION_RLIMIT_MODE = 0, IOMMU_OPTION_HUGE_PAGES = 1, + IOMMU_OPTION_SW_MSI_START = 2, + IOMMU_OPTION_SW_MSI_SIZE = 3, };
/** diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index dee0aa3e7cb4a..7a1016d6dcfe0 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1458,18 +1458,27 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, iommu_get_resv_regions(dev, &resv_regions);
list_for_each_entry(resv, &resv_regions, list) { + unsigned long start = PHYS_ADDR_MAX, last = 0; + if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE) continue;
if (sw_msi_start && resv->type == IOMMU_RESV_MSI) num_hw_msi++; if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) { - *sw_msi_start = resv->start; + if (idev->sw_msi_size) { + start = *sw_msi_start; + last = idev->sw_msi_size - 1 + start; + } num_sw_msi++; }
- rc = iopt_reserve_iova(iopt, resv->start, - resv->length - 1 + resv->start, dev); + if (start == PHYS_ADDR_MAX) { + start = resv->start; + last = resv->length - 1 + start; + } + + rc = iopt_reserve_iova(iopt, start, last, dev); if (rc) goto out_reserved; } diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c index 1542c5fd10a85..f2a4ab98f1665 100644 --- a/drivers/iommu/iommufd/ioas.c +++ b/drivers/iommu/iommufd/ioas.c @@ -620,6 +620,119 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd, return -EOPNOTSUPP; }
+static inline int iommufd_option_sw_msi_test(struct iommufd_device *idev, + phys_addr_t start, size_t size) +{ + const phys_addr_t alignment = SZ_1M - 1; + struct iommu_resv_region *resv; + LIST_HEAD(resv_regions); + phys_addr_t last; + int rc = 0; + + if (start & alignment || size & alignment) + return -EINVAL; + + size = max_t(size_t, size, SZ_1M); + + if (check_add_overflow(start, size - 1, &last)) + return -EOVERFLOW; + + /* Test if the new sw_msi range overlaps with other reserved regions */ + iommu_get_resv_regions(idev->dev, &resv_regions); + list_for_each_entry(resv, &resv_regions, list) { + phys_addr_t resv_last = resv->length - 1 + resv->start; + + /* start/size replaces the driver-defined IOMMU_RESV_SW_MSI */ + if (resv->type == IOMMU_RESV_SW_MSI) + continue; + /* IOMMU_RESV_DIRECT_RELAXABLE does not get enforced to iopt */ + if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE) + continue; + + if (resv->start <= last && resv_last >= start) { + rc = -EADDRINUSE; + break; + } + } + iommu_put_resv_regions(idev->dev, &resv_regions); + return rc; +} + +int iommufd_option_sw_msi(struct iommufd_ucmd *ucmd) +{ + struct iommu_option *cmd = ucmd->cmd; + struct iommufd_device *idev; + int rc = 0; + + idev = iommufd_get_device(ucmd, cmd->object_id); + if (IS_ERR(idev)) + return PTR_ERR(idev); + + mutex_lock(&idev->igroup->lock); + + /* Device cannot enforce the sw_msi window if already attached */ + if (iommufd_device_is_attached(idev, IOMMU_NO_PASID)) { + rc = -EBUSY; + goto out_unlock; + } + + if (cmd->op == IOMMU_OPTION_OP_GET) { + switch (cmd->option_id) { + case IOMMU_OPTION_SW_MSI_START: + cmd->val64 = (u64)idev->sw_msi_start; + break; + case IOMMU_OPTION_SW_MSI_SIZE: + cmd->val64 = (u64)idev->sw_msi_size / SZ_1M; + break; + default: + rc = -EOPNOTSUPP; + break; + } + } + + if (cmd->op == IOMMU_OPTION_OP_SET) { + phys_addr_t start = idev->sw_msi_start; + size_t size = idev->sw_msi_size; + + switch (cmd->option_id) { + case IOMMU_OPTION_SW_MSI_START: + if 
(cmd->val64 > PHYS_ADDR_MAX) { + rc = -EINVAL; + break; + } + start = (phys_addr_t)cmd->val64; + rc = iommufd_option_sw_msi_test(idev, start, size); + if (rc) + break; + idev->sw_msi_start = start; + break; + case IOMMU_OPTION_SW_MSI_SIZE: + /* The input unit is MB */ + if (cmd->val64 > SIZE_MAX >> 20) { + rc = -EINVAL; + break; + } + size = (size_t)cmd->val64 * SZ_1M; + if (size) { + rc = iommufd_option_sw_msi_test(idev, start, + size); + if (rc) + break; + } + idev->sw_msi_size = size; + break; + default: + rc = -EOPNOTSUPP; + break; + } + } + +out_unlock: + mutex_unlock(&idev->igroup->lock); + iommufd_put_object(ucmd->ictx, &idev->obj); + return rc; +} + static int iommufd_ioas_option_huge_pages(struct iommu_option *cmd, struct iommufd_ioas *ioas) { diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index ce775fbbae94e..9a8ab58d694d4 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -398,6 +398,10 @@ static int iommufd_option(struct iommufd_ucmd *ucmd) case IOMMU_OPTION_RLIMIT_MODE: rc = iommufd_option_rlimit_mode(cmd, ucmd->ictx); break; + case IOMMU_OPTION_SW_MSI_START: + case IOMMU_OPTION_SW_MSI_SIZE: + rc = iommufd_option_sw_msi(ucmd); + break; case IOMMU_OPTION_HUGE_PAGES: rc = iommufd_ioas_option(ucmd); break;
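The sanity check in iommufd_option_sw_msi_test() boils down to three steps: 1MB alignment, overflow of the last address, and overlap against the device's other reserved regions. The following is a stand-alone user-space model of those steps, assuming a plain array in place of the iommu_resv_region list (the caller is expected to have already skipped IOMMU_RESV_SW_MSI and IOMMU_RESV_DIRECT_RELAXABLE entries). The struct and function names are hypothetical, and __builtin_add_overflow is the GCC/Clang builtin standing in for the kernel's check_add_overflow().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_1M 0x100000ULL

/* Simplified stand-in for one enforced reserved region. */
struct resv_range {
	uint64_t start;
	uint64_t length;
};

/* Returns 0 if [start, start + size - 1] is usable, or a negative errno. */
int sw_msi_window_check(uint64_t start, uint64_t size,
			const struct resv_range *resv, size_t n)
{
	const uint64_t align_mask = SZ_1M - 1;
	uint64_t last;
	size_t i;

	/* Both the base and the size must be 1MB aligned. */
	if ((start & align_mask) || (size & align_mask))
		return -EINVAL;

	/* A size of 0 (not set yet) is tested as the 1MB minimum. */
	if (size < SZ_1M)
		size = SZ_1M;

	if (__builtin_add_overflow(start, size - 1, &last))
		return -EOVERFLOW;

	/* Closed-interval overlap test against each reserved region. */
	for (i = 0; i < n; i++) {
		uint64_t resv_last = resv[i].length - 1 + resv[i].start;

		if (resv[i].start <= last && resv_last >= start)
			return -EADDRINUSE;
	}
	return 0;
}

/* The selftest region added later in this series. */
const struct resv_range selftest_resv[] = { { 0x80000000, 0x10000000 } };

void check_selftest_like_cases(void)
{
	assert(sw_msi_window_check(0x70000000, 0, selftest_resv, 1) == 0);
	assert(sw_msi_window_check(0x7fff0000, 0, selftest_resv, 1) == -EINVAL);
	assert(sw_msi_window_check(0x80400000, 0, selftest_resv, 1) == -EADDRINUSE);
}
```

Note that a window starting just below a reserved region can still collide with it once the size is applied, which is why the patch re-runs the test when IOMMU_OPTION_SW_MSI_SIZE is set.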
Hi Nicolin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on shuah-kselftest/next] [also build test WARNING on shuah-kselftest/fixes linus/master v6.18-rc1 next-20251014] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommufd-device-M...
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/r/6c36de14b00a3f06df3a602f18baf6b51fde429f.176048786...
patch subject: [PATCH v2 4/7] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
config: i386-buildonly-randconfig-002-20251015 (https://download.01.org/0day-ci/archive/20251015/202510151909.E0Zb31Ah-lkp@i...)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251015/202510151909.E0Zb31Ah-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510151909.E0Zb31Ah-lkp@intel.com/
All warnings (new ones prefixed by >>):
   In file included from include/linux/overflow.h:6,
                    from include/linux/string.h:13,
                    from include/linux/scatterlist.h:5,
                    from include/linux/iommu.h:10,
                    from drivers/iommu/iommufd/io_pagetable.c:13:
   drivers/iommu/iommufd/io_pagetable.c: In function 'iopt_table_enforce_dev_resv_regions':
   include/linux/limits.h:11:25: warning: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '18446744073709551615' to '4294967295' [-Woverflow]
      11 | #define PHYS_ADDR_MAX   (~(phys_addr_t)0)
         |                         ^
   drivers/iommu/iommufd/io_pagetable.c:1461:39: note: in expansion of macro 'PHYS_ADDR_MAX'
    1461 |                 unsigned long start = PHYS_ADDR_MAX, last = 0;
         |                                       ^~~~~~~~~~~~~
vim +11 include/linux/limits.h
54d50897d544c8 Masahiro Yamada 2019-03-07   8
54d50897d544c8 Masahiro Yamada 2019-03-07   9  #define SIZE_MAX	(~(size_t)0)
dabba87229411a Pasha Tatashin  2022-05-27  10  #define SSIZE_MAX	((ssize_t)(SIZE_MAX >> 1))
54d50897d544c8 Masahiro Yamada 2019-03-07 @11  #define PHYS_ADDR_MAX	(~(phys_addr_t)0)
54d50897d544c8 Masahiro Yamada 2019-03-07  12
On Tue, Oct 14, 2025 at 05:29:36PM -0700, Nicolin Chen wrote:
> @@ -1458,18 +1458,27 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
>  	iommu_get_resv_regions(dev, &resv_regions);
>  	list_for_each_entry(resv, &resv_regions, list) {
> +		unsigned long start = PHYS_ADDR_MAX, last = 0;

kernel test robot complained. It should be:
	phys_addr_t start = PHYS_ADDR_MAX, last = 0;
Will fix in next version.
Thanks Nicolin
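For readers wondering why the warning matters beyond noise: on a configuration like the reported i386 one, where phys_addr_t is 64-bit but unsigned long is 32-bit, the PHYS_ADDR_MAX sentinel is truncated on assignment, so the later `start == PHYS_ADDR_MAX` comparison in the patch can no longer match. The sketch below reproduces that effect with fixed-width types; the typedef and function names are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the types on the reported config (assumptions). */
typedef uint64_t model_phys_addr_t;	/* 64-bit phys_addr_t */
typedef uint32_t model_ulong_t;		/* 32-bit unsigned long */

#define MODEL_PHYS_ADDR_MAX (~(model_phys_addr_t)0)

/* Sentinel stored in the narrower type: truncated to 0xffffffff. */
bool sentinel_survives_narrow(void)
{
	model_ulong_t start = (model_ulong_t)MODEL_PHYS_ADDR_MAX;

	/* start is widened back to 64 bits, but the high bits are gone */
	return start == MODEL_PHYS_ADDR_MAX;
}

/* Sentinel stored in the matching type: comparison works as intended. */
bool sentinel_survives_wide(void)
{
	model_phys_addr_t start = MODEL_PHYS_ADDR_MAX;

	return start == MODEL_PHYS_ADDR_MAX;
}

void check_sentinel_behavior(void)
{
	assert(!sentinel_survives_narrow());
	assert(sentinel_survives_wide());
}
```

With the narrow type, the "not overridden by user space" branch would never be taken in this model, which is consistent with switching the local variables to phys_addr_t as proposed above.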
Add a new MOCK_FLAGS_DEVICE_NO_ATTACH flag to allow the mock_domain cmd to bypass the attach step, as IOMMU_OPTION_SW_MSI_START/SIZE can only be set prior to an IOAS/HWPT attachment.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_test.h |  1 +
 drivers/iommu/iommufd/selftest.c     | 17 ++++++++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 8fc618b2bcf96..7f7ffe5d670bb 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -54,6 +54,7 @@ enum { MOCK_FLAGS_DEVICE_NO_DIRTY = 1 << 0, MOCK_FLAGS_DEVICE_HUGE_IOVA = 1 << 1, MOCK_FLAGS_DEVICE_PASID = 1 << 2, + MOCK_FLAGS_DEVICE_NO_ATTACH = 1 << 3, };
enum { diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index de178827a078a..ee5671d7e55d8 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -1088,6 +1088,7 @@ static struct mock_dev *mock_dev_create(unsigned long dev_flags) {}, }; const u32 valid_flags = MOCK_FLAGS_DEVICE_NO_DIRTY | + MOCK_FLAGS_DEVICE_NO_ATTACH | MOCK_FLAGS_DEVICE_HUGE_IOVA | MOCK_FLAGS_DEVICE_PASID; struct mock_dev *mdev; @@ -1181,9 +1182,13 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd, } sobj->idev.idev = idev;
- rc = iommufd_device_attach(idev, IOMMU_NO_PASID, &pt_id); - if (rc) - goto out_unbind; + if (dev_flags & MOCK_FLAGS_DEVICE_NO_ATTACH) { + pt_id = 0; + } else { + rc = iommufd_device_attach(idev, IOMMU_NO_PASID, &pt_id); + if (rc) + goto out_unbind; + }
/* Userspace must destroy the device_id to destroy the object */ cmd->mock_domain.out_hwpt_id = pt_id; @@ -1196,7 +1201,8 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd, return 0;
out_detach: - iommufd_device_detach(idev, IOMMU_NO_PASID); + if (!(dev_flags & MOCK_FLAGS_DEVICE_NO_ATTACH)) + iommufd_device_detach(idev, IOMMU_NO_PASID); out_unbind: iommufd_device_unbind(idev); out_mdev: @@ -2024,7 +2030,8 @@ void iommufd_selftest_destroy(struct iommufd_object *obj)
switch (sobj->type) { case TYPE_IDEV: - iommufd_device_detach(sobj->idev.idev, IOMMU_NO_PASID); + if (!(sobj->idev.mock_dev->flags & MOCK_FLAGS_DEVICE_NO_ATTACH)) + iommufd_device_detach(sobj->idev.idev, IOMMU_NO_PASID); iommufd_device_unbind(sobj->idev.idev); mock_dev_destroy(sobj->idev.mock_dev); break;
The new IOMMU_OPTION_SW_MSI_START/SIZE must not overlap with any existing device reserved region, so add a testing region [0x80000000, 0x8fffffff], on top of the normal IOVA aperture for selftest program to run an overlap test.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_test.h |  3 +++
 drivers/iommu/iommufd/selftest.c     | 18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 7f7ffe5d670bb..34fb12a36621a 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -273,4 +273,7 @@ struct iommu_viommu_event_selftest {
 #define IOMMU_HW_QUEUE_TYPE_SELFTEST 0xdeadbeef
 #define IOMMU_TEST_HW_QUEUE_MAX 2

+#define IOMMU_TEST_RESV_BASE 0x80000000UL
+#define IOMMU_TEST_RESV_LENGTH 0x10000000UL
+
 #endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index ee5671d7e55d8..2c660c021ed27 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -480,7 +480,8 @@ mock_domain_alloc_paging_flags(struct device *dev, u32 flags,
 	if (!mock)
 		return ERR_PTR(-ENOMEM);
 	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
-	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.geometry.aperture_end =
+		MOCK_APERTURE_LAST + IOMMU_TEST_RESV_LENGTH;
 	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
 	if (dev && mdev->flags & MOCK_FLAGS_DEVICE_HUGE_IOVA)
 		mock->domain.pgsize_bitmap |= MOCK_HUGE_PAGE_SIZE;
@@ -688,6 +689,20 @@ static void mock_dev_disable_iopf(struct device *dev, struct iommu_domain *domai
 	iopf_queue_remove_device(mock_iommu_iopf_queue, dev);
 }

+static void mock_dev_get_resv_regions(struct device *dev,
+				      struct list_head *head)
+{
+	const int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+	struct iommu_resv_region *region;
+
+	region = iommu_alloc_resv_region(IOMMU_TEST_RESV_BASE,
+					 IOMMU_TEST_RESV_LENGTH, prot,
+					 IOMMU_RESV_RESERVED, GFP_KERNEL);
+	if (!region)
+		return;
+	list_add_tail(&region->list, head);
+}
+
 static void mock_viommu_destroy(struct iommufd_viommu *viommu)
 {
 	struct mock_iommu_device *mock_iommu = container_of(
@@ -952,6 +967,7 @@ static const struct iommu_ops mock_ops = {
 	.device_group = generic_device_group,
 	.probe_device = mock_probe_device,
 	.page_response = mock_domain_page_response,
+	.get_resv_regions = mock_dev_get_resv_regions,
 	.user_pasid_table = true,
 	.get_viommu_size = mock_get_viommu_size,
 	.viommu_init = mock_viommu_init,
Also add fail_nth coverage.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 tools/testing/selftests/iommu/iommufd.c       | 105 ++++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  21 ++++
 2 files changed, 126 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 3eebf5e3b974f..d67b1ac3e60a6 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -336,6 +336,111 @@ TEST_F(change_process, basic)
 	ASSERT_EQ(child, waitpid(child, NULL, 0));
 }

+FIXTURE(iommufd_sw_msi)
+{
+	int fd;
+	uint32_t ioas_id;
+	uint32_t idev_id[2];
+};
+
+FIXTURE_SETUP(iommufd_sw_msi)
+{
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	test_ioctl_ioas_alloc(&self->ioas_id);
+	test_cmd_mock_domain(self->ioas_id, NULL, NULL, &self->idev_id[0]);
+	test_cmd_mock_domain_flags(self->ioas_id, MOCK_FLAGS_DEVICE_NO_ATTACH,
+				   NULL, NULL, &self->idev_id[1]);
+}
+
+FIXTURE_TEARDOWN(iommufd_sw_msi)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_sw_msi, basic)
+{
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_OPTION_OP_SET,
+	};
+
+	/* Negative case: object_id must be a device id */
+	cmd.object_id = self->ioas_id;
+	cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+	cmd.val64 = 0x70000000;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	cmd.object_id = 0;
+	cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+	cmd.val64 = 2;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	/* Negative case: device must not be attached already */
+	if (self->idev_id[0]) {
+		cmd.object_id = self->idev_id[0];
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		cmd.val64 = 0x70000000;
+		EXPECT_ERRNO(EBUSY, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	}
+
+	/* Device isn't attached yet */
+	if (self->idev_id[1]) {
+		/* Negative case: alignment failures */
+		cmd.object_id = self->idev_id[1];
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		cmd.val64 = 0x7fffffff;
+		EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.val64 = 0x7fffff00;
+		EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.val64 = 0x7fff0000;
+		EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+		/* Negative case: overlap against [0x80000000, 0x8fffffff] */
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		cmd.val64 = 0x80000000;
+		EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.val64 = 0x80400000;
+		EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.val64 = 0x80800000;
+		EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.val64 = 0x80c00000;
+		EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		/* Though an address that starts 1MB below will be okay ... */
+		cmd.val64 = 0x7ff00000;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		/* ... but not with a 2MB size */
+		cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+		cmd.val64 = 2;
+		EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+		/* Negative case: overflows with the 2MB size */
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		cmd.val64 = UINT64_MAX - 1 * 1024 * 1024 + 1;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+		cmd.val64 = 2;
+		EXPECT_ERRNO(EOVERFLOW, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+		/* Set a safe 2MB window */
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		cmd.val64 = 0x70000000;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+		cmd.val64 = 2;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+		/* Read them back to verify */
+		cmd.op = IOMMU_OPTION_OP_GET;
+		cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		ASSERT_EQ(cmd.val64, 0x70000000);
+		cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+		ASSERT_EQ(cmd.val64, 2);
+	}
+}
+
 FIXTURE(iommufd_ioas)
 {
 	int fd;
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 45c14323a6183..19f23519b7914 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -621,6 +621,10 @@ TEST_FAIL_NTH(basic_fail_nth, access_pin_domain)
 /* device.c */
 TEST_FAIL_NTH(basic_fail_nth, device)
 {
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_OPTION_OP_SET,
+	};
 	struct iommu_hwpt_selftest data = {
 		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
 	};
@@ -632,6 +636,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
 	uint32_t ioas_id;
 	uint32_t ioas_id2;
 	uint32_t idev_id;
+	uint32_t idev_id2;
 	uint32_t hwpt_id;
 	uint32_t viommu_id;
 	uint32_t hw_queue_id;
@@ -742,6 +747,22 @@ TEST_FAIL_NTH(basic_fail_nth, device)
 	self->pasid = 0;

+	if (_test_cmd_mock_domain_flags(self->fd, ioas_id,
+					MOCK_FLAGS_DEVICE_NO_ATTACH, NULL, NULL,
+					&idev_id2))
+		return -1;
+
+	cmd.object_id = idev_id2;
+	cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+	cmd.val64 = 0x70000000;
+	if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+		return -1;
+
+	cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+	cmd.val64 = 2;
+	if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+		return -1;
+
 	return 0;
 }
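The alignment and overflow cases exercised by the iommufd_sw_msi selftest can be modelled in user space with a short C sketch. Everything here is an assumption inferred from the errno values the test expects, not the kernel code: `sw_msi_window_last()` is a hypothetical name, the 1MB alignment is deduced from 0x7fff0000 being rejected while 0x7ff00000 is accepted, the size is assumed to be in megabytes, and rejecting a zero size with -EINVAL is a guess that the test does not cover.

```c
#include <errno.h>
#include <stdint.h>

#define SZ_1M (1024ULL * 1024ULL)

/*
 * Hypothetical model: compute the inclusive last address of a window of
 * size_mb megabytes at start. Returns 0 on success, -EINVAL for a start
 * that is not 1MB aligned (or a zero size, which is an assumption), and
 * -EOVERFLOW when the window would run past UINT64_MAX.
 */
static int sw_msi_window_last(uint64_t start, uint64_t size_mb,
			      uint64_t *last)
{
	uint64_t bytes;

	if (start % SZ_1M)
		return -EINVAL;
	if (size_mb == 0)
		return -EINVAL; /* assumption, not covered by the selftest */
	if (size_mb > UINT64_MAX / SZ_1M)
		return -EOVERFLOW;

	bytes = size_mb * SZ_1M;
	/* Compare against the remaining space instead of adding first. */
	if (bytes - 1 > UINT64_MAX - start)
		return -EOVERFLOW;

	*last = start + bytes - 1;
	return 0;
}
```

With this model, a start of `UINT64_MAX - 1 * 1024 * 1024 + 1` is accepted with a 1MB size, since the window ends exactly at UINT64_MAX, but a 2MB size at the same start overflows. This matches the order in which the test sets the start first and then expects EOVERFLOW from the size.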