Nested translation uses two stages of address translation to produce the final physical address. Taking Intel VT-d as an example, the first-stage translation structure is the I/O page table. As the diagram below shows, the guest I/O page table pointer (a GPA, guest physical address) is passed to the host to set up the first-stage translation. Along with it, guest modifications to present mappings in the first-stage page table must be followed by an IOTLB invalidation to sync the host IOTLB.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  | FS for GIOVA->GPA      |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------.----------------------------------.
    |             |  | SS for GPA->HPA, unmanaged domain|
    |             |  '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables

<Intel VT-d Nested translation>
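For intuition, the composition of the two stages can be sketched as a toy model (illustrative only, not code from this series; real VT-d tables are multi-level radix trees walked by hardware):

	/* Toy single-level model of the two-stage composition. */
	#include <stdint.h>

	#define PG 4096u

	static uint64_t fs[16];	/* stage 1: GIOVA page -> GPA page (guest-owned) */
	static uint64_t ss[16];	/* stage 2: GPA page   -> HPA page (host-owned)  */

	static uint64_t nested_translate(uint64_t giova)
	{
		uint64_t gpa = fs[giova / PG] * PG + giova % PG;

		return ss[gpa / PG] * PG + gpa % PG;
	}

Whenever the guest changes a present fs[] entry, hardware may still hold the old GIOVA->HPA result in its IOTLB, which is why the invalidation described above is mandatory.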
Different platform vendors have different first-stage translation formats, so userspace should query the underlying IOMMU capability before setting up first-stage translation structures in the host. [1]
In the iommufd subsystem, I/O page tables are tracked by hw_pagetable objects. The first-stage page table is owned by userspace (the guest), while the second-stage page table is owned by the kernel for security. Hence, first-stage page tables are tracked by user-managed hw_pagetables, and second-stage page tables are tracked by kernel-managed hw_pagetables.
This series first introduces a new iommu op for allocating domains through iommufd and an op for syncing the IOTLB after first-stage page table modifications, then implements the new ops in the intel-iommu driver. After this preparation, it adds kernel-managed and user-managed hw_pagetable allocation for userspace. Last, it adds self-tests for the new ioctls.
This series is based on "[PATCH 0/6] iommufd: Add iommu capability reporting" [1] and Nicolin's "[PATCH v2 00/10] Add IO page table replacement support" [2]. Complete code can be found in [3]. Draft QEMU code can be found in [4].
Basic testing was done with a DSA device on VT-d, where the guest has a vIOMMU built with nested translation.
[1] https://lore.kernel.org/linux-iommu/20230209041642.9346-1-yi.l.liu@intel.com... [2] https://lore.kernel.org/linux-iommu/cover.1675802050.git.nicolinc@nvidia.com... [3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting_vtd_v1 [4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv3%2Bnesting
Regards, Yi Liu
Lu Baolu (5):
  iommu: Add new iommu op to create domains owned by userspace
  iommu: Add nested domain support
  iommu/vt-d: Extend dmar_domain to support nested domain
  iommu/vt-d: Add helper to setup pasid nested translation
  iommu/vt-d: Add nested domain support
Nicolin Chen (6):
  iommufd: Add/del hwpt to IOAS at alloc/destroy()
  iommufd/device: Move IOAS attaching and detaching operations into helpers
  iommufd/selftest: Add IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC ioctl
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (6):
  iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  iommufd: Split iommufd_hw_pagetable_alloc()
  iommufd: Add kernel-managed hw_pagetable allocation for userspace
  iommufd: Add infrastructure for user-managed hw_pagetable allocation
  iommufd: Add user-managed hw_pagetable allocation
  iommufd/device: Report supported stage-1 page table types
 drivers/iommu/intel/Makefile                  |   2 +-
 drivers/iommu/intel/iommu.c                   |  38 ++-
 drivers/iommu/intel/iommu.h                   |  50 +++-
 drivers/iommu/intel/nested.c                  | 143 +++++++++
 drivers/iommu/intel/pasid.c                   | 142 +++++++++
 drivers/iommu/intel/pasid.h                   |   2 +
 drivers/iommu/iommufd/device.c                | 117 ++++----
 drivers/iommu/iommufd/hw_pagetable.c          | 280 +++++++++++++++++-
 drivers/iommu/iommufd/iommufd_private.h       |  23 +-
 drivers/iommu/iommufd/iommufd_test.h          |  35 +++
 drivers/iommu/iommufd/main.c                  |  11 +
 drivers/iommu/iommufd/selftest.c              | 149 +++++++++-
 include/linux/iommu.h                         |  11 +
 include/uapi/linux/iommufd.h                  | 196 ++++++++++++
 tools/testing/selftests/iommu/iommufd.c       | 124 +++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 106 +++++++
 16 files changed, 1329 insertions(+), 100 deletions(-)
 create mode 100644 drivers/iommu/intel/nested.c
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new iommu_domain op to create domains owned by userspace, e.g. through iommufd. These domains have a few different properties compared to kernel-owned domains:
- They may be MANAGED domains, but created with special parameters. For instance aperture size changes/number of levels, different IOPTE formats, or other things necessary to make a vIOMMU work
- We have to track all the memory allocations with GFP_KERNEL_ACCOUNT to make the cgroup sandbox stronger
- Device-specialty domains, such as NESTED domains, can be created by iommufd.
The new op clearly says the domain is being created by IOMMUFD, that the domain is intended for userspace use, and it provides a way to pass a driver specific uAPI structure to customize the created domain to exactly what the vIOMMU userspace driver requires.
iommu drivers that cannot support VFIO/IOMMUFD should not support this op. This includes any driver that cannot provide a fully functional UNMANAGED domain.
This op chooses to make the special parameters opaque to the core. This suits the current usage model where accessing any of the IOMMU device special parameters does require a userspace driver that matches the kernel driver. If a need for common parameters, implemented similarly by several drivers, arises then there is room in the design to grow a generic parameter set as well.
This new op for now is only supposed to be used by iommufd, hence no wrapper for it. iommufd would call the callback directly. As for domain free, iommufd would use iommu_domain_free().
Suggested-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/linux/iommu.h | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 97b398d19fd2..214e3eb9bc86 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -230,6 +230,7 @@ struct iommu_iotlb_gather {
  *            after use. Return the data buffer if success, or ERR_PTR on
  *            failure.
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_user: allocate user iommu domain
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -262,6 +263,9 @@ struct iommu_ops {
 	/* Domain allocation and freeing by the iommu driver */
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+	struct iommu_domain *(*domain_alloc_user)(struct device *dev,
+						   struct iommu_domain *parent,
+						   const void *user_data);
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
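For reference, the caller side in iommufd is expected to look roughly like the sketch below (based on the hw_pagetable change later in this series; there is intentionally no iommu_domain_alloc_user() wrapper):

	/* Sketch: iommufd invokes the new op directly. */
	const struct iommu_ops *ops = dev_iommu_ops(dev);
	struct iommu_domain *domain;

	if (!ops || !ops->domain_alloc_user)
		return ERR_PTR(-EOPNOTSUPP);

	domain = ops->domain_alloc_user(dev, NULL /* parent */, NULL /* user_data */);
	if (!domain)
		return ERR_PTR(-ENOMEM);

	/* ... teardown reuses the existing free path: */
	iommu_domain_free(domain);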
From: Liu, Yi L yi.l.liu@intel.com
Sent: Thursday, February 9, 2023 12:32 PM
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new iommu_domain op to create domains owned by userspace, e.g. through iommufd. These domains have a few different properties compared to kernel-owned domains:
- They may be MANAGED domains, but created with special parameters. For instance aperture size changes/number of levels, different IOPTE formats, or other things necessary to make a vIOMMU work
what is *MANAGED* domain?
today we only have the separation between DMA domain and UNMANAGED domain.
soon with Jason's cleanup all domains are called UNMANAGED.
On 2023/2/10 16:24, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com
Sent: Thursday, February 9, 2023 12:32 PM
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new iommu_domain op to create domains owned by userspace, e.g. through iommufd. These domains have a few different properties compared to kernel-owned domains:
- They may be MANAGED domains, but created with special parameters. For instance aperture size changes/number of levels, different IOPTE formats, or other things necessary to make a vIOMMU work
what is *MANAGED* domain?
today we only have the separation between DMA domain and UNMANAGED domain.
soon with Jason's cleanup all domains are called UNMANAGED.
It should be UNMANAGED domain.
Best regards, baolu
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new domain type for a user space I/O address space, which is nested on top of another user space address space represented by an UNMANAGED domain. The mappings of a nested domain are managed by user space software, therefore it's unnecessary to have map/unmap callbacks. But the updates of the PTEs in the nested domain page table must be propagated to the caches on both the IOMMU (IOTLB) and the devices (DevTLB).
The nested domain is allocated by the domain_alloc_user op, and attached to the device through the existing iommu_attach_device/group() interfaces.
A new domain op, named iotlb_sync_user, is added for userspace to flush the hardware caches for a nested domain through iommufd. There is no wrapper for it, as it's only supposed to be used by iommufd.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/linux/iommu.h | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 214e3eb9bc86..f6db50f85a20 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -67,6 +67,9 @@ struct iommu_domain_geometry {
 #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */

+#define __IOMMU_DOMAIN_NESTED	(1U << 5)  /* User-managed IOVA nested on
+					      a stage-2 translation */
+
 /*
  * This are the possible domain-types
  *
@@ -92,6 +95,7 @@ struct iommu_domain_geometry {
 					 __IOMMU_DOMAIN_DMA_API |	\
 					 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA	(__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_NESTED	(__IOMMU_DOMAIN_NESTED)
 struct iommu_domain {
 	unsigned type;
@@ -321,6 +325,7 @@ struct iommu_ops {
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
  * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *              queue
+ * @iotlb_sync_user: Flush hardware TLBs caching user space IO mappings
  * @iova_to_phys: translate iova to physical address
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
@@ -350,6 +355,8 @@ struct iommu_domain_ops {
 			       size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain,
 			   struct iommu_iotlb_gather *iotlb_gather);
+	void (*iotlb_sync_user)(struct iommu_domain *domain,
+				void *user_data);
phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
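For reference, the expected invocation path from iommufd for a userspace invalidation request would be roughly as below (a sketch; the IOMMU_HWPT_INVALIDATE handler is not part of this excerpt, and hwpt_invalidate() is a hypothetical helper name):

	/* Sketch: propagate a userspace cache-invalidation request
	 * to the driver through the new op. */
	static void hwpt_invalidate(struct iommufd_hw_pagetable *hwpt,
				    void *user_data)
	{
		struct iommu_domain *domain = hwpt->domain;

		if (domain->ops->iotlb_sync_user)
			domain->ops->iotlb_sync_user(domain, user_data);
	}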
On Wed, Feb 08, 2023 at 08:31:38PM -0800, Yi Liu wrote:
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new domain type for a user space I/O address space, which is nested on top of another user space address space represented by an UNMANAGED domain. The mappings of a nested domain are managed by user space software, therefore it's unnecessary to have map/unmap callbacks. But the updates of the PTEs in the nested domain page table must be propagated to the caches on both the IOMMU (IOTLB) and the devices (DevTLB).
The nested domain is allocated by the domain_alloc_user op, and attached to the device through the existing iommu_attach_device/group() interfaces.
A new domain op, named iotlb_sync_user, is added for userspace to flush the hardware caches for a nested domain through iommufd. There is no wrapper for it, as it's only supposed to be used by iommufd.
Following the remarks from Jason and Robin in their first looks at the nested SMMU changes, perhaps we should rename this op "iotlb_sync_user" back to "cache_invalidate_user" or so, since the caches on VT-d aren't confined to the IOTLB either.
Thanks Nic
From: Lu Baolu baolu.lu@linux.intel.com
The nested domain fields are exclusive to those used for a DMA remapping domain. Use a union to avoid memory waste.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.h  | 35 ++++++++++++++++++++----
 include/uapi/linux/iommufd.h | 53 ++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2e70265d4ceb..4fe74f217696 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -541,15 +541,38 @@ struct dmar_domain {
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-	struct dma_pte	*pgd;		/* virtual address */
-	int		gaw;		/* max guest address width */
-
-	/* adjusted guest address width, 0 is level 2 30-bit */
-	int		agaw;
 	int		iommu_superpage;/* Level of superpages supported:
 					   0 == 4KiB (no superpages), 1 == 2MiB,
 					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
-	u64		max_addr;	/* maximum mapped address */
+	union {
+		/* DMA remapping domain */
+		struct {
+			/* virtual address */
+			struct dma_pte	*pgd;
+			/* max guest address width */
+			int		gaw;
+			/*
+			 * adjusted guest address width:
+			 *   0: level 2 30-bit
+			 *   1: level 3 39-bit
+			 *   2: level 4 48-bit
+			 *   3: level 5 57-bit
+			 */
+			int		agaw;
+			/* maximum mapped address */
+			u64		max_addr;
+		};
+
+		/* Nested user domain */
+		struct {
+			/* 2-level page table the user domain nested */
+			struct dmar_domain *s2_domain;
+			/* user page table pointer (in GPA) */
+			unsigned long s1_pgtbl;
+			/* page table attributes */
+			struct iommu_hwpt_intel_vtd s1_cfg;
+		};
+	};
 	struct iommu_domain domain;	/* generic domain data structure for
 					   iommu core */
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 6cfe102f26f3..0867900494c9 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -407,4 +407,57 @@ struct iommu_device_info {
 	__u32 __reserved;
 };
 #define IOMMU_DEVICE_GET_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_GET_INFO)
+
+/**
+ * enum iommu_hwpt_intel_vtd_flags - Intel VT-d stage-1 page
+ *                                   table entry attributes
+ * @IOMMU_VTD_PGTBL_SRE: Supervisor request
+ * @IOMMU_VTD_PGTBL_EAFE: Extended access enable
+ * @IOMMU_VTD_PGTBL_PCD: Page-level cache disable
+ * @IOMMU_VTD_PGTBL_PWT: Page-level write through
+ * @IOMMU_VTD_PGTBL_EMTE: Extended mem type enable
+ * @IOMMU_VTD_PGTBL_CD: PASID-level cache disable
+ * @IOMMU_VTD_PGTBL_WPE: Write protect enable
+ */
+enum iommu_hwpt_intel_vtd_flags {
+	IOMMU_VTD_PGTBL_SRE = 1 << 0,
+	IOMMU_VTD_PGTBL_EAFE = 1 << 1,
+	IOMMU_VTD_PGTBL_PCD = 1 << 2,
+	IOMMU_VTD_PGTBL_PWT = 1 << 3,
+	IOMMU_VTD_PGTBL_EMTE = 1 << 4,
+	IOMMU_VTD_PGTBL_CD = 1 << 5,
+	IOMMU_VTD_PGTBL_WPE = 1 << 6,
+	IOMMU_VTD_PGTBL_LAST = 1 << 7,
+};
+
+/**
+ * struct iommu_hwpt_intel_vtd - Intel VT-d specific user-managed
+ *                               stage-1 page table info
+ * @flags: Combination of enum iommu_hwpt_intel_vtd_flags
+ * @pgtbl_addr: The base address of the user-managed stage-1 page table.
+ * @pat: Page attribute table data to compute effective memory type
+ * @emt: Extended memory type
+ * @addr_width: The address width of the untranslated addresses that are
+ *              subjected to the user-managed stage-1 page table.
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific data for creating a hw_pagetable to represent
+ * the user-managed stage-1 page table that is used in nested translation.
+ *
+ * In nested translation, the stage-1 page table is located in the address
+ * space defined by the corresponding stage-2 page table. Hence the
+ * stage-1 page table base address value should not be higher than the
+ * maximum untranslated address of the stage-2 page table.
+ *
+ * The paging level of the stage-1 page table should be compatible with
+ * the hardware iommu. Otherwise, the allocation will fail.
+ */
+struct iommu_hwpt_intel_vtd {
+	__u64 flags;
+	__u64 pgtbl_addr;
+	__u32 pat;
+	__u32 emt;
+	__u32 addr_width;
+	__u32 __reserved;
+};
 #endif
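To illustrate the intended use, a VMM could derive this structure from a guest PASID table entry along the following lines (a hypothetical sketch; guest_flptr and the guest_pasid_entry_*() accessors are illustrative names, not real helpers):

	struct iommu_hwpt_intel_vtd vtd = {};

	vtd.pgtbl_addr = guest_flptr;	/* stage-1 table base, in GPA */
	vtd.addr_width = 48;		/* 4-level stage-1 page table */
	if (guest_pasid_entry_sre(pe))
		vtd.flags |= IOMMU_VTD_PGTBL_SRE;
	if (guest_pasid_entry_wpe(pe))
		vtd.flags |= IOMMU_VTD_PGTBL_WPE;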
From: Lu Baolu baolu.lu@linux.intel.com
The configurations are passed in from the user when the user domain is allocated. This helper interprets these configurations according to the data structure defined in uapi/linux/iommufd.h. An EINVAL error will be returned if any of the configurations are not compatible with the hardware capabilities. The caller can retry with another compatible user domain. The encoding of the fields of each PASID entry is defined in section 9.6 of the VT-d spec.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/pasid.c | 142 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   2 +
 2 files changed, 144 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index fb3c7020028d..1ca19262f649 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -21,6 +21,11 @@
 #include "iommu.h"
 #include "pasid.h"
+#define IOMMU_VTD_PGTBL_MTS_MASK	(IOMMU_VTD_PGTBL_CD | \
+					 IOMMU_VTD_PGTBL_EMTE | \
+					 IOMMU_VTD_PGTBL_PCD |  \
+					 IOMMU_VTD_PGTBL_PWT)
+
 /*
  * Intel IOMMU system wide PASID name space:
  */
@@ -394,6 +399,15 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
 	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
 }
+/*
+ * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
+ * of a scalable mode PASID entry.
+ */
+static inline void pasid_set_eafe(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+}
+
 static void
 pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 				    u16 did, u32 pasid)
@@ -738,3 +752,131 @@ void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
 	if (!cap_caching_mode(iommu->cap))
 		devtlb_invalidation_with_pasid(iommu, dev, pasid);
 }
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
+ * This could be used for guest shared virtual address. In this case, the
+ * first level page tables are used for GVA-GPA translation in the guest,
+ * second level page tables are used for GPA-HPA translation.
+ *
+ * @iommu:  IOMMU which the device belong to
+ * @dev:    Device to be set up for translation
+ * @pasid:  PASID to be programmed in the device PASID table
+ * @domain: User domain nested on a s2 domain
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
+			     u32 pasid, struct dmar_domain *domain)
+{
+	struct iommu_hwpt_intel_vtd *s1_cfg = &domain->s1_cfg;
+	pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
+	struct dmar_domain *s2_domain = domain->s2_domain;
+	u16 did = domain_id_iommu(domain, iommu);
+	struct dma_pte *pgd = s2_domain->pgd;
+	struct pasid_entry *pte;
+	int agaw;
+
+	if (!ecap_nest(iommu->ecap)) {
+		pr_err_ratelimited("%s: No nested translation support\n",
+				   iommu->name);
+		return -ENODEV;
+	}
+
+	/*
+	 * Sanity checking performed by caller to make sure address width
+	 * matching in two dimensions: CPU vs. IOMMU, guest vs. host.
+	 */
+	switch (s1_cfg->addr_width) {
+	case ADDR_WIDTH_4LEVEL:
+		break;
+#ifdef CONFIG_X86
+	case ADDR_WIDTH_5LEVEL:
+		if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
+		    !cap_fl5lp_support(iommu->cap)) {
+			dev_err_ratelimited(dev,
+					    "5-level paging not supported\n");
+			return -EINVAL;
+		}
+		break;
+#endif
+	default:
+		dev_err_ratelimited(dev, "Invalid guest address width %d\n",
+				    s1_cfg->addr_width);
+		return -EINVAL;
+	}
+
+	if ((s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) && !ecap_srs(iommu->ecap)) {
+		pr_err_ratelimited("No supervisor request support on %s\n",
+				   iommu->name);
+		return -EINVAL;
+	}
+
+	if ((s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE) && !ecap_eafs(iommu->ecap)) {
+		pr_err_ratelimited("No extended access flag support on %s\n",
+				   iommu->name);
+		return -EINVAL;
+	}
+
+	/*
+	 * Memory type is only applicable to devices inside processor coherent
+	 * domain. Will add MTS support once coherent devices are available.
+	 */
+	if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) {
+		pr_warn_ratelimited("No memory type support %s\n",
+				    iommu->name);
+		return -EINVAL;
+	}
+
+	agaw = iommu_skip_agaw(s2_domain, iommu, &pgd);
+	if (agaw < 0) {
+		dev_err_ratelimited(dev, "Invalid domain page table\n");
+		return -EINVAL;
+	}
+
+	/* First level PGD (in GPA) must be supported by the second level. */
+	if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) {
+		dev_err_ratelimited(dev,
+				    "Guest PGD %lx not supported, max %llx\n",
+				    (uintptr_t)s1_gpgd, s2_domain->max_addr);
+		return -EINVAL;
+	}
+
+	spin_lock(&iommu->lock);
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		spin_unlock(&iommu->lock);
+		return -ENODEV;
+	}
+	if (pasid_pte_is_present(pte)) {
+		spin_unlock(&iommu->lock);
+		return -EBUSY;
+	}
+
+	pasid_clear_entry(pte);
+
+	if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
+		pasid_set_flpm(pte, 1);
+
+	pasid_set_flptr(pte, (uintptr_t)s1_gpgd);
+
+	if (s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) {
+		pasid_set_sre(pte);
+		if (s1_cfg->flags & IOMMU_VTD_PGTBL_WPE)
+			pasid_set_wpe(pte);
+	}
+
+	if (s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE)
+		pasid_set_eafe(pte);
+
+	pasid_set_slptr(pte, virt_to_phys(pgd));
+	pasid_set_fault_enable(pte);
+	pasid_set_domain_id(pte, did);
+	pasid_set_address_width(pte, agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
+
+	pasid_flush_caches(iommu, pte, pasid, did);
+
+	return 0;
+}
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 20c54e50f533..2a72bbc79532 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -118,6 +118,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
+			     u32 pasid, struct dmar_domain *domain);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, u32 pasid,
 				 bool fault_ignore);
From: Lu Baolu baolu.lu@linux.intel.com
This adds nested domain support to the Intel IOMMU driver. It allows allocating and freeing a nested domain, attaching the nested domain to a device, and synchronizing the caches when the userspace-managed page tables are updated.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/Makefile |   2 +-
 drivers/iommu/intel/iommu.c  |  38 ++++++----
 drivers/iommu/intel/iommu.h  |  15 ++++
 drivers/iommu/intel/nested.c | 143 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/iommufd.h |  54 +++++++++++++
 5 files changed, 236 insertions(+), 16 deletions(-)
 create mode 100644 drivers/iommu/intel/nested.c
diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index fa0dae16441c..8b324db03692 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DMAR_TABLE) += dmar.o
-obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
+obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o nested.o
 obj-$(CONFIG_DMAR_TABLE) += trace.o cap_audit.o
 obj-$(CONFIG_DMAR_PERF) += perf.o
 obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 929f600cc350..ea42d169afa2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -277,7 +276,6 @@ static LIST_HEAD(dmar_satc_units);
 #define for_each_rmrr_units(rmrr)		\
 	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
-static void device_block_translation(struct device *dev);
 static void intel_iommu_domain_free(struct iommu_domain *domain);
 int dmar_disabled = !IS_ENABLED(CONFIG_INTEL_IOMMU_DEFAULT_ON);
@@ -555,7 +554,7 @@ static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain)
 }
 /* Some capabilities may be different across iommus */
-static void domain_update_iommu_cap(struct dmar_domain *domain)
+void domain_update_iommu_cap(struct dmar_domain *domain)
 {
 	domain_update_iommu_coherency(domain);
 	domain->iommu_superpage = domain_update_iommu_superpage(domain, NULL);
@@ -1497,10 +1496,10 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 	spin_unlock_irqrestore(&domain->lock, flags);
 }
-static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
-				  struct dmar_domain *domain,
-				  unsigned long pfn, unsigned int pages,
-				  int ih, int map)
+void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
+			   struct dmar_domain *domain,
+			   unsigned long pfn, unsigned int pages,
+			   int ih, int map)
 {
 	unsigned int aligned_pages = __roundup_pow_of_two(pages);
 	unsigned int mask = ilog2(aligned_pages);
@@ -1572,7 +1571,7 @@ static inline void __mapping_notify_one(struct intel_iommu *iommu,
 	iommu_flush_write_buffer(iommu);
 }
-static void intel_flush_iotlb_all(struct iommu_domain *domain)
+void intel_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct iommu_domain_info *info;
@@ -1764,8 +1763,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
 	return domain;
 }
-static int domain_attach_iommu(struct dmar_domain *domain,
-			       struct intel_iommu *iommu)
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 {
 	struct iommu_domain_info *info, *curr;
 	unsigned long ndomains;
@@ -1814,8 +1812,7 @@ static int domain_attach_iommu(struct dmar_domain *domain,
 	return ret;
 }
-static void domain_detach_iommu(struct dmar_domain *domain,
-				struct intel_iommu *iommu)
+void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 {
 	struct iommu_domain_info *info;
@@ -4096,7 +4093,7 @@ static void dmar_remove_one_dev_info(struct device *dev)
  * all DMA requests without PASID from the device are blocked. If the page
  * table has been set, clean up the data structures.
  */
-static void device_block_translation(struct device *dev)
+void device_block_translation(struct device *dev)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct intel_iommu *iommu = info->iommu;
@@ -4197,14 +4194,24 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	return NULL;
 }
+static struct iommu_domain *
+intel_iommu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
+			      const void *user_data)
+{
+	if (parent)
+		return intel_nested_domain_alloc(parent, user_data);
+	else
+		return iommu_domain_alloc(dev->bus);
+}
+
 static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
 	if (domain != &si_domain->domain && domain != &blocking_domain)
 		domain_exit(to_dmar_domain(domain));
 }
-static int prepare_domain_attach_device(struct iommu_domain *domain,
-					struct device *dev)
+int prepare_domain_attach_device(struct iommu_domain *domain,
+				 struct device *dev)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct intel_iommu *iommu;
@@ -4436,7 +4443,7 @@ static void domain_set_force_snooping(struct dmar_domain *domain)
 					     PASID_RID2PASID);
 }
-static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
+bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	unsigned long flags;
@@ -4781,6 +4788,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.capable = intel_iommu_capable,
 	.hw_info = intel_iommu_hw_info,
 	.domain_alloc = intel_iommu_domain_alloc,
+	.domain_alloc_user = intel_iommu_domain_alloc_user,
 	.probe_device = intel_iommu_probe_device,
 	.probe_finalize = intel_iommu_probe_finalize,
 	.release_device = intel_iommu_release_device,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 4fe74f217696..25d1c99fce40 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -753,6 +753,19 @@ void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu,
 int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
 		   unsigned int count, unsigned long options);
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
+void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu);
+void device_block_translation(struct device *dev);
+int prepare_domain_attach_device(struct iommu_domain *domain,
+				 struct device *dev);
+bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain);
+void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
+			   struct dmar_domain *domain,
+			   unsigned long pfn, unsigned int pages,
+			   int ih, int map);
+void intel_flush_iotlb_all(struct iommu_domain *domain);
+void domain_update_iommu_cap(struct dmar_domain *domain);
+
 /*
  * Options used in qi_submit_sync:
  * QI_OPT_WAIT_DRAIN - Wait for PRQ drain completion, spec 6.5.2.8.
@@ -765,6 +778,8 @@ void *alloc_pgtable_page(int node);
 void free_pgtable_page(void *vaddr);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
+struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
+					       const void *data);
 #ifdef CONFIG_INTEL_IOMMU_SVM
 extern void intel_svm_check(struct intel_iommu *iommu);
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
new file mode 100644
index 000000000000..7624a3dd4603
--- /dev/null
+++ b/drivers/iommu/intel/nested.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * nested.c - nested mode translation support
+ *
+ * Copyright (C) 2023 Intel Corporation
+ *
+ * Author: Lu Baolu baolu.lu@linux.intel.com
+ *         Jacob Pan jacob.jun.pan@linux.intel.com
+ */
+
+#define pr_fmt(fmt)	"DMAR: " fmt
+
+#include <linux/iommu.h>
+#include <linux/pci.h>
+#include <linux/pci-ats.h>
+
+#include "iommu.h"
+#include "pasid.h"
+
+static int intel_nested_attach_dev(struct iommu_domain *domain,
+				   struct device *dev)
+{
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct intel_iommu *iommu = info->iommu;
+	unsigned long flags;
+	int ret = 0;
+
+	if (info->domain)
+		device_block_translation(dev);
+
+	/* Is s2_domain compatible with this IOMMU? */
+	ret = prepare_domain_attach_device(&dmar_domain->s2_domain->domain, dev);
+	if (ret) {
+		dev_err_ratelimited(dev, "s2 domain is not compatible\n");
+		return ret;
+	}
+
+	ret = domain_attach_iommu(dmar_domain, iommu);
+	if (ret) {
+		dev_err_ratelimited(dev, "Failed to attach domain to iommu\n");
+		return ret;
+	}
+
+	ret = intel_pasid_setup_nested(iommu, dev,
+				       PASID_RID2PASID, dmar_domain);
+	if (ret) {
+		domain_detach_iommu(dmar_domain, iommu);
+		dev_err_ratelimited(dev, "Failed to setup pasid entry\n");
+		return ret;
+	}
+
+	info->domain = dmar_domain;
+	spin_lock_irqsave(&dmar_domain->lock, flags);
+	list_add(&info->link, &dmar_domain->devices);
+	spin_unlock_irqrestore(&dmar_domain->lock, flags);
+	domain_update_iommu_cap(dmar_domain);
+
+	return 0;
+}
+
+static void intel_nested_domain_free(struct iommu_domain *domain)
+{
+	kfree(to_dmar_domain(domain));
+}
+
+static void intel_nested_invalidate(struct device *dev,
+				    struct dmar_domain *domain,
+				    void *user_data)
+{
+	struct iommu_hwpt_invalidate_intel_vtd *inv_info = user_data;
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct intel_iommu *iommu = info->iommu;
+
+	if (WARN_ON(!user_data))
+		return;
+
+	switch (inv_info->granularity) {
+	case IOMMU_VTD_QI_GRAN_ADDR:
+		if (inv_info->granule_size != VTD_PAGE_SIZE ||
+		    !IS_ALIGNED(inv_info->addr, VTD_PAGE_SIZE)) {
+			dev_err_ratelimited(dev, "Invalid invalidation address 0x%llx\n",
+					    inv_info->addr);
+			return;
+		}
+
+		iommu_flush_iotlb_psi(iommu, domain,
+				      inv_info->addr >> VTD_PAGE_SHIFT,
+				      inv_info->nb_granules, 1, 0);
+		break;
+	case IOMMU_VTD_QI_GRAN_DOMAIN:
+		intel_flush_iotlb_all(&domain->domain);
+		break;
+	default:
+		dev_err_ratelimited(dev, "Unsupported IOMMU invalidation type %d\n",
+				    inv_info->granularity);
+		break;
+	}
+}
+
+static void intel_nested_iotlb_sync_user(struct iommu_domain *domain,
+					 void *user_data)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dmar_domain->lock, flags);
+	list_for_each_entry(info, &dmar_domain->devices, link)
+		intel_nested_invalidate(info->dev, dmar_domain,
+					user_data);
+	spin_unlock_irqrestore(&dmar_domain->lock, flags);
+}
+
+static const struct iommu_domain_ops intel_nested_domain_ops = {
+	.attach_dev		= intel_nested_attach_dev,
+	.iotlb_sync_user	= intel_nested_iotlb_sync_user,
+	.free			= intel_nested_domain_free,
+	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
+};
+
+struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
+					       const void *user_data)
+{
+	const struct iommu_hwpt_intel_vtd *vtd = user_data;
+	struct dmar_domain *domain;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT);
+	if (!domain)
+		return NULL;
+
+	domain->use_first_level = true;
+	domain->s2_domain = to_dmar_domain(s2_domain);
+	domain->s1_pgtbl = vtd->pgtbl_addr;
+	domain->s1_cfg = *vtd;
+	domain->domain.ops = &intel_nested_domain_ops;
+	domain->domain.type = IOMMU_DOMAIN_NESTED;
+	INIT_LIST_HEAD(&domain->devices);
+	spin_lock_init(&domain->lock);
+	xa_init(&domain->iommu_array);
+
+	return &domain->domain;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0867900494c9..bbffb63d2513 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -460,4 +460,58 @@ struct iommu_hwpt_intel_vtd {
 	__u32 addr_width;
 	__u32 __reserved;
 };
+
+/**
+ * enum iommu_vtd_qi_granularity - Intel VT-d specific granularity of
+ *                                 queued invalidation
+ * @IOMMU_VTD_QI_GRAN_DOMAIN: domain-selective invalidation
+ * @IOMMU_VTD_QI_GRAN_ADDR: page-selective invalidation
+ */
+enum iommu_vtd_qi_granularity {
+	IOMMU_VTD_QI_GRAN_DOMAIN,
+	IOMMU_VTD_QI_GRAN_ADDR,
+};
+
+/**
+ * enum iommu_hwpt_intel_vtd_invalidate_flags - Flags for Intel VT-d
+ *                                              stage-1 page table cache
+ *                                              invalidation
+ * @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
+ *                           leaf PTE caching needs to be invalidated
+ *                           and other paging structure caches can be
+ *                           preserved.
+ */
+enum iommu_hwpt_intel_vtd_invalidate_flags {
+	IOMMU_VTD_QI_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_invalidate_intel_vtd - Intel VT-d cache invalidation info
+ * @granularity: One of enum iommu_vtd_qi_granularity.
+ * @flags: Combination of enum iommu_hwpt_intel_vtd_invalidate_flags
+ * @__reserved: Must be 0
+ * @addr: The start address of the addresses to be invalidated.
+ * @granule_size: Page/block size of the mapping in bytes. It is used to
+ *                compute the invalidation range together with @nb_granules.
+ * @nb_granules: Number of contiguous granules to be invalidated.
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation under nested translation. Userspace uses this structure to
+ * tell the host about the impacted caches after modifying the stage-1 page
+ * table.
+ *
+ * @addr, @granule_size and @nb_granules are meaningful when
+ * @granularity==IOMMU_VTD_QI_GRAN_ADDR. Intel VT-d currently only supports
+ * 4kB page size, so @granule_size should be 4KB. @addr should be aligned
+ * with @granule_size * @nb_granules, otherwise invalidation won't take
+ * effect.
+ */
+struct iommu_hwpt_invalidate_intel_vtd {
+	__u8 granularity;
+	__u8 padding[7];
+	__u32 flags;
+	__u32 __reserved;
+	__u64 addr;
+	__u64 granule_size;
+	__u64 nb_granules;
+};
 #endif
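As an illustration, after zapping a single 4KiB stage-1 mapping at iova, userspace would describe the affected range like this (a sketch using only the fields above; iova is illustrative):

	struct iommu_hwpt_invalidate_intel_vtd inv = {
		.granularity	= IOMMU_VTD_QI_GRAN_ADDR,
		.flags		= IOMMU_VTD_QI_FLAGS_LEAF, /* only leaf PTEs changed */
		.addr		= iova,	/* aligned to granule_size * nb_granules */
		.granule_size	= 4096,	/* the only granule size VT-d accepts here */
		.nb_granules	= 1,
	};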
This converts iommufd to use the domain_alloc_user op for iommu_domain creation.
Suggested-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/hw_pagetable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 43d473989a06..08d963ee38c7 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -30,6 +30,7 @@ struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct device *dev)
 {
+	const struct iommu_ops *ops;
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
@@ -37,7 +38,13 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	if (IS_ERR(hwpt))
 		return hwpt;
-	hwpt->domain = iommu_domain_alloc(dev->bus);
+	ops = dev_iommu_ops(dev);
+	if (!ops || !ops->domain_alloc_user) {
+		rc = -EOPNOTSUPP;
+		goto out_abort;
+	}
+
+	hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL);
 	if (!hwpt->domain) {
 		rc = -ENOMEM;
 		goto out_abort;
On 2/8/23 11:31 PM, Yi Liu wrote:
This converts iommufd to use the domain_alloc_user op for iommu_domain creation.
Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/hw_pagetable.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 43d473989a06..08d963ee38c7 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -30,6 +30,7 @@ struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct device *dev)
 {
+	const struct iommu_ops *ops;
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
@@ -37,7 +38,13 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	if (IS_ERR(hwpt))
 		return hwpt;
-	hwpt->domain = iommu_domain_alloc(dev->bus);
+	ops = dev_iommu_ops(dev);
+	if (!ops || !ops->domain_alloc_user) {
+		rc = -EOPNOTSUPP;
+		goto out_abort;
+	}
Hi Yi,
This seems to break the iommufd vfio container support for any iommu that hasn't implemented domain_alloc_user yet.
I noticed it using vfio-pci on s390 with
CONFIG_IOMMUFD=m
CONFIG_IOMMUFD_VFIO_CONTAINER=y
CONFIG_VFIO_GROUP=y
Not sure if the intent is to make domain_alloc_user a hard requirement for using iommufd (if so then the commit description really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Thanks, Matt
On Thu, Feb 09, 2023 at 12:59:58PM -0500, Matthew Rosato wrote:
really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Yes, that is right. If the input user data is 0 length or full of 0s then we should call the normal driver function
Jason
On Thu, Feb 09, 2023 at 02:36:49PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 12:59:58PM -0500, Matthew Rosato wrote:
really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Yes, that is right. If the input user data is 0 length or full of 0s then we should call the normal driver function
Maybe I am wrong, yet I recall that doing ops->domain_alloc_user without a fallback was intentional to rule out drivers that don't support IOMMUFD?
To be backward-compatible and concern about SMMU, we can opt in ops->domain_alloc_user upon availability and then add a fallback:
	if ((!ops || !ops->domain_alloc_user) && user_data) {
		rc = -EOPNOTSUPP;
		goto out_abort;
	}

	if (ops->domain_alloc_user)
		hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL);
	else
		hwpt->domain = iommu_domain_alloc(dev->bus);
	if (!hwpt->domain) {
		rc = -ENOMEM;
		goto out_abort;
	}
Yet, even by doing so, this series having the PATCH 07/17 that moves iopt_table_add_domain() would temporarily break IOMMUFD on ARM platforms, until we add the ops->domain_alloc_user to SMMU drivers.
Thanks Nic
On Thu, Feb 09, 2023 at 11:51:31AM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 02:36:49PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 12:59:58PM -0500, Matthew Rosato wrote:
really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Yes, that is right. If the input user data is 0 length or full of 0s then we should call the normal driver function
Maybe I am wrong, yet I recall that doing ops->domain_alloc_user without a fallback was intentional to rule out drivers that don't support IOMMUFD?
I think we moved away from that to the idea of using the dma_domain patch I suggested.
To be backward-compatible and concern about SMMU, we can opt in ops->domain_alloc_user upon availability and then add a fallback:
	if ((!ops || !ops->domain_alloc_user) && user_data) {
		rc = -EOPNOTSUPP;
		goto out_abort;
	}

	if (ops->domain_alloc_user)
		hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL);
	else
		hwpt->domain = iommu_domain_alloc(dev->bus);
	if (!hwpt->domain) {
		rc = -ENOMEM;
		goto out_abort;
	}
Yet, even by doing so, this series having the PATCH 07/17 that moves iopt_table_add_domain() would temporarily break IOMMUFD on ARM platforms, until we add the ops->domain_alloc_user to SMMU drivers.
Drop patch 7 and 8
Change patch 12 so it has a unique flow to allocate and IOAS map a HWPT that does not try to share so much code with the existing flow.
The HWPT flow should always just call allocate and then map with no effort to attach first. This will fail on ARM SMMU at this point, and that is fine.
All the existing code should work exactly as it is now and not have any need to be changed.
Where things went wrong was trying to share "__iommufd_hw_pagetable_alloc", I think.
Don't try to consolidate at this point. Once all the drivers are updated then we could try to consolidate things.
Jason
On Thu, Feb 09, 2023 at 04:39:37PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 11:51:31AM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 02:36:49PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 12:59:58PM -0500, Matthew Rosato wrote:
really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Yes, that is right. If the input user data is 0 length or full of 0s then we should call the normal driver function
Maybe I am wrong, yet I recall that doing ops->domain_alloc_user without a fallback was intentional to rule out drivers that don't support IOMMUFD?
I think we moved away from that to the idea of using the dma_domain patch I suggested.
To be backward-compatible and concern about SMMU, we can opt in ops->domain_alloc_user upon availability and then add a fallback:
	if ((!ops || !ops->domain_alloc_user) && user_data) {
		rc = -EOPNOTSUPP;
		goto out_abort;
	}

	if (ops->domain_alloc_user)
		hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL);
	else
		hwpt->domain = iommu_domain_alloc(dev->bus);
	if (!hwpt->domain) {
		rc = -ENOMEM;
		goto out_abort;
	}
Yet, even by doing so, this series having the PATCH 07/17 that moves iopt_table_add_domain() would temporarily break IOMMUFD on ARM platforms, until we add the ops->domain_alloc_user to SMMU drivers.
Drop patch 7 and 8
Change patch 12 so it has a unique flow to allocate and IOAS map a HWPT that does not try to share so much code with the existing flow.
The HWPT flow should always just call allocate and then map with no effort to attach first. This will fail on ARM SMMU at this point, and that is fine.
All the existing code should work exactly as it is now and not have any need to be changed.
Where things went wrong was trying to share "__iommufd_hw_pagetable_alloc", I think.
Don't try to consolidate at this point. Once all the drivers are updated then we could try to consolidate things.
Yea, I think that's the only way out for now. Though I am not sure about other drivers yet, hopefully the SMMU driver(s) is the last one that we need to update...
Thanks Nic
On Thu, Feb 09, 2023 at 02:22:26PM -0800, Nicolin Chen wrote:
Yea, I think that's the only way out for now. Though I am not sure about other drivers yet, hopefully the SMMU driver(s) is the last one that we need to update...
I noticed apple dart had copy and pasted this from smmu, but I see that it needed it.
Jason
From: Nicolin Chen nicolinc@nvidia.com
Sent: Friday, February 10, 2023 6:22 AM
On Thu, Feb 09, 2023 at 04:39:37PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 11:51:31AM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 02:36:49PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 12:59:58PM -0500, Matthew Rosato wrote:
really should highlight that). Otherwise, conditionally calling iommu_domain_alloc(dev->bus) when !ops->domain_alloc_user (instead of returning -EOPNOTSUPP) seems to restore the prior functionality for me.
Yes, that is right. If the input user data is 0 length or full of 0s then we should call the normal driver function
Maybe I am wrong, yet I recall that doing ops->domain_alloc_user without a fallback was intentional to rule out drivers that don't support IOMMUFD?
I think we moved away from that to the idea of using the dma_domain patch I suggested..
To be backward-compatible and concern about SMMU, we can opt in ops->domain_alloc_user upon availability and then add a fallback:
	if ((!ops || !ops->domain_alloc_user) && user_data) {
		rc = -EOPNOTSUPP;
		goto out_abort;
	}

	if (ops->domain_alloc_user)
		hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL);
	else
		hwpt->domain = iommu_domain_alloc(dev->bus);
	if (!hwpt->domain) {
		rc = -ENOMEM;
		goto out_abort;
	}
Yet, even by doing so, this series having the PATCH 07/17 that moves iopt_table_add_domain() would temporarily break IOMMUFD on ARM platforms, until we add the ops->domain_alloc_user to SMMU drivers.
Drop patch 7 and 8
Change patch 12 so it has a unique flow to allocate and IOAS map a HWPT that does not try to share so much code with the existing flow.
The HWPT flow should always just call allocate and then map with no effort to attach first. This will fail on ARM SMMU at this point, and that is fine.
Ok. This sounds like iopt_table_add_domain() should still be called right after the user hwpt is allocated, instead of at the first device attach.
All the existing code should work exactly as it is now and not have any need to be changed.
Where things when wrong was trying to share "__iommufd_hw_pagetable_alloc", I think.
Sure.
Don't try to consolidate at this point. Once all the drivers are updated then we could try to consolidate things.
Yea, I think that's the only way out for now. Though I am not sure about other drivers yet, hopefully the SMMU driver(s) is the last one that we need to update...
😊 hope so.
Regards, Yi Liu
From: Nicolin Chen nicolinc@nvidia.com
A hw_pagetable is allocated with an IOAS id, so it was supposed to link to the IOAS upon its allocation. But previously, with the ARM SMMUv3 driver, IOMMUFD failed to add a newly allocated hwpt to the IOAS, because the SMMUv3 driver "finalises" an iommu_domain (hwpt->domain) only after it attaches to a device. This was because the existing domain_alloc op doesn't pass in a dev pointer, so the driver could not know which SMMU device to look for. Now IOMMUFD allocates the hwpt->domain using the domain_alloc_user op, which passes in the dev pointer. So there's no need to wait for a device attachment anymore.
Move the iopt_table_add_domain() call, along with the list_add_tail() call on the hwpt_item, into iommufd_hw_pagetable_alloc(), right after the domain allocation. Accordingly, move iopt_table_remove_domain() and list_del() to the iommufd_hw_pagetable_destroy() routine.
This simplifies the logic in the do_attach/detach(), by reducing the dependency on the device list and potential locking conundrum with the coming nesting feature.
Similarly, drop the iopt_table_add/remove_domain() calls, for selftest, from the iommufd_device_selftest_attach/detach(). Also, allocate hwpts outside the iommufd_device_selftest_attach() to make it symmetric with the iommufd_device_selftest_detach().
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/device.c          | 54 ++++---------------------
 drivers/iommu/iommufd/hw_pagetable.c    | 13 ++++++
 drivers/iommu/iommufd/iommufd_private.h |  6 +--
 drivers/iommu/iommufd/selftest.c        | 16 ++++++--
 4 files changed, 35 insertions(+), 54 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 0e5d2bde7b3c..71a8c4f1c4a9 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -311,18 +311,11 @@ static void __iommmufd_device_detach(struct iommufd_device *idev,
 	mutex_lock(&hwpt->devices_lock);
 	list_del(&idev->devices_item);
-	if (hwpt->ioas != new_ioas)
-		mutex_lock(&hwpt->ioas->mutex);
-	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		if (list_empty(&hwpt->devices)) {
-			iopt_table_remove_domain(&hwpt->ioas->iopt,
-						 hwpt->domain);
-			list_del(&hwpt->hwpt_item);
-		}
-		if (detach_group)
-			iommu_detach_group(hwpt->domain, idev->group);
-	}
+	if (detach_group && !iommufd_hw_pagetable_has_group(hwpt, idev->group))
+		iommu_detach_group(hwpt->domain, idev->group);
+	if (hwpt->ioas != new_ioas) {
+		mutex_lock(&hwpt->ioas->mutex);
 		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
 		mutex_unlock(&hwpt->ioas->mutex);
 	}
@@ -384,14 +377,6 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	rc = iommu_group_replace_domain(idev->group, hwpt->domain);
 	if (rc)
 		goto out_iova;
-
-	if (list_empty(&hwpt->devices)) {
-		rc = iopt_table_add_domain(&hwpt->ioas->iopt,
-					   hwpt->domain);
-		if (rc)
-			goto out_detach;
-		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
-	}
 }
 	/* Replace the cur_hwpt without iommu_detach_group() */
@@ -404,11 +389,6 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	mutex_unlock(&hwpt->devices_lock);
 	return 0;
-out_detach:
-	if (cur_hwpt)
-		iommu_group_replace_domain(idev->group, cur_hwpt->domain);
-	else
-		iommu_detach_group(hwpt->domain, idev->group);
 out_iova:
 	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
 out_unlock:
@@ -940,35 +920,17 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
  * Creating a real iommufd_device is too hard, bypass creating a iommufd_device
  * and go directly to attaching a domain.
  */
-struct iommufd_hw_pagetable *
-iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
-			       struct iommufd_ioas *ioas,
-			       struct device *mock_dev)
-{
-	struct iommufd_hw_pagetable *hwpt;
-	int rc;
-
-	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, mock_dev);
-	if (IS_ERR(hwpt))
-		return hwpt;
-
-	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
-	if (rc)
-		goto out_hwpt;
+int iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+				   struct iommufd_hw_pagetable *hwpt)
+{
 	refcount_inc(&hwpt->obj.users);
-	iommufd_object_finalize(ictx, &hwpt->obj);
-	return hwpt;
-
-out_hwpt:
-	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
-	return ERR_PTR(rc);
+	return 0;
 }
 void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
 				    struct iommufd_hw_pagetable *hwpt)
 {
-	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
 	refcount_dec(&hwpt->obj.users);
 }
 #endif
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 08d963ee38c7..bda21ec737cf 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -11,8 +11,12 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 	struct iommufd_hw_pagetable *hwpt =
 		container_of(obj, struct iommufd_hw_pagetable, obj);
+	lockdep_assert_held(&hwpt->ioas->mutex);
+
 	WARN_ON(!list_empty(&hwpt->devices));
+	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	list_del(&hwpt->hwpt_item);
 	iommu_domain_free(hwpt->domain);
 	refcount_dec(&hwpt->ioas->obj.users);
 	mutex_destroy(&hwpt->devices_lock);
@@ -34,6 +38,8 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
+	lockdep_assert_held(&ioas->mutex);
+
 	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
 	if (IS_ERR(hwpt))
 		return hwpt;
@@ -53,11 +59,18 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	INIT_LIST_HEAD(&hwpt->devices);
 	INIT_LIST_HEAD(&hwpt->hwpt_item);
 	mutex_init(&hwpt->devices_lock);
+	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_free_domain;
+	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
+
 	/* Pairs with iommufd_hw_pagetable_destroy() */
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
 	return hwpt;
+out_free_domain:
+	iommu_domain_free(hwpt->domain);
 out_abort:
 	iommufd_object_abort(ictx, &hwpt->obj);
 	return ERR_PTR(rc);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 7748117e36f9..604ad29f87b8 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -278,10 +278,8 @@ void iopt_remove_access(struct io_pagetable *iopt,
 void iommufd_access_destroy_object(struct iommufd_object *obj);
 #ifdef CONFIG_IOMMUFD_TEST
-struct iommufd_hw_pagetable *
-iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
-			       struct iommufd_ioas *ioas,
-			       struct device *mock_dev);
+int iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+				   struct iommufd_hw_pagetable *hwpt);
 void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
 				    struct iommufd_hw_pagetable *hwpt);
 struct device *iommufd_selftest_obj_to_dev(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 5013c8757f4b..5f841d1d9e96 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -311,22 +311,30 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
 	sobj->idev.mock_dev.bus = &mock_bus;
 	sobj->idev.mock_dev.iommu = &iommu;
-	hwpt = iommufd_device_selftest_attach(ucmd->ictx, ioas,
-					      &sobj->idev.mock_dev);
+	hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas,
+					  &sobj->idev.mock_dev);
 	if (IS_ERR(hwpt)) {
 		rc = PTR_ERR(hwpt);
-		goto out_sobj;
+		goto out_unlock;
 	}
 	sobj->idev.hwpt = hwpt;
+	rc = iommufd_device_selftest_attach(ucmd->ictx, hwpt);
+	if (rc)
+		goto out_free_hwpt;
+
 	/* Userspace must destroy both of these IDs to destroy the object */
 	cmd->mock_domain.out_hwpt_id = hwpt->obj.id;
 	cmd->mock_domain.out_device_id = sobj->obj.id;
 	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_object_finalize(ucmd->ictx, &hwpt->obj);
 	iommufd_put_object(&ioas->obj);
 	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
-out_sobj:
+out_free_hwpt:
+	iommufd_object_abort_and_destroy(ucmd->ictx, &hwpt->obj);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
 	iommufd_object_abort(ucmd->ictx, &sobj->obj);
 out_ioas:
 	iommufd_put_object(&ioas->obj);
Split iommufd_hw_pagetable_alloc(), moving the bulk of the work to a helper which accepts the parent hw_pagetable pointer and a user_data pointer. This is a preparation for supporting userspace hw_pagetable allocation, as the caller needs to pass the two parameters to the domain_alloc_user op.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/hw_pagetable.c | 41 ++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index bda21ec737cf..ee97d2f3cf43 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -22,24 +22,23 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 	mutex_destroy(&hwpt->devices_lock);
 }
-/**
- * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device
- * @ictx: iommufd context
- * @ioas: IOAS to associate the domain with
- * @dev: Device to get an iommu_domain for
- *
- * Allocate a new iommu_domain and return it as a hw_pagetable.
- */
-struct iommufd_hw_pagetable *
-iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
-			   struct device *dev)
+static struct iommufd_hw_pagetable *
+__iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
+			     struct iommufd_ioas *ioas,
+			     struct device *dev,
+			     struct iommufd_hw_pagetable *parent,
+			     void *user_data)
 {
 	const struct iommu_ops *ops;
+	struct iommu_domain *parent_domain = NULL;
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
lockdep_assert_held(&ioas->mutex);
+	if (WARN_ON(!ioas && !parent))
+		return ERR_PTR(-EINVAL);
+
 	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
 	if (IS_ERR(hwpt))
 		return hwpt;
@@ -50,7 +49,10 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 		goto out_abort;
 	}
- hwpt->domain = ops->domain_alloc_user(dev, NULL, NULL); + if (parent) + parent_domain = parent->domain; + + hwpt->domain = ops->domain_alloc_user(dev, parent_domain, user_data); if (!hwpt->domain) { rc = -ENOMEM; goto out_abort; @@ -75,3 +77,18 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, iommufd_object_abort(ictx, &hwpt->obj); return ERR_PTR(rc); } + +/** + * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device + * @ictx: iommufd context + * @ioas: IOAS to associate the domain with + * @dev: Device to get an iommu_domain for + * + * Allocate a new iommu_domain and return it as a hw_pagetable. + */ +struct iommufd_hw_pagetable * +iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, + struct device *dev) +{ + return __iommufd_hw_pagetable_alloc(ictx, ioas, dev, NULL, NULL); +}
Today iommufd allocates the kernel-managed hw_pagetable implicitly when a device is attached to an IOAS. This links the hw_pagetable to the IOPT within the IOAS.
However, this is not ideal. It makes more sense to let userspace explicitly request hw_pagetable allocation via iommufd: even though the hw_pagetable is kernel-managed, the mappings are fed by userspace. Explicit allocation also makes the lifecycle of a kernel-managed hw_pagetable clearer. This matters in the nested translation usage, where the kernel-managed hw_pagetable is used as the stage-2 hw_pagetable. In that case, both the stage-1 and stage-2 hw_pagetables should be allocated by userspace so their lifecycles can be managed together.
This adds an IOMMU_HWPT_ALLOC ioctl for hw_pagetable allocation. For a kernel-managed hw_pagetable, userspace should provide an IOAS ID in the allocation request.
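For illustration, a minimal userspace sketch of the kernel-managed case, assuming an open iommufd file descriptor plus already-created device and IOAS IDs (the helper name is made up for this example, not part of the series):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/*
 * Sketch: allocate a kernel-managed hw_pagetable backed by an IOAS.
 * For kernel-managed page tables, @data_type is IOMMU_PGTBL_DATA_NONE
 * and @data_len/@data_uptr are ignored.
 */
static int hwpt_alloc_kernel_managed(int iommufd, uint32_t dev_id,
				     uint32_t ioas_id, uint32_t *out_hwpt_id)
{
	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.dev_id = dev_id,
		.pt_id = ioas_id,	/* an IOAS ID in the kernel-managed case */
		.data_type = IOMMU_PGTBL_DATA_NONE,
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
		return -1;
	*out_hwpt_id = cmd.out_hwpt_id;
	return 0;
}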
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 11 ++- drivers/iommu/iommufd/hw_pagetable.c | 116 ++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 15 +++ drivers/iommu/iommufd/main.c | 3 + include/uapi/linux/iommufd.h | 48 ++++++++++ 5 files changed, 191 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 71a8c4f1c4a9..dd7943ff02e4 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -146,8 +146,7 @@ static int iommufd_zero_fill_user(u64 ptr, int bytes) return 0; }
-static struct device * -iommufd_obj_dev(struct iommufd_object *obj) +struct device *iommufd_obj_dev(struct iommufd_object *obj) { struct device *dev = NULL;
@@ -160,6 +159,14 @@ iommufd_obj_dev(struct iommufd_object *obj) return dev; }
+/* + * bitmaps of supported page table data types of hardware iommu, + * indexed by the members defined in enum iommu_device_data_type. + */ +const u64 iommufd_supported_pgtbl_types[] = { + [IOMMU_DEVICE_DATA_INTEL_VTD] = BIT_ULL(IOMMU_PGTBL_DATA_NONE), +}; + int iommufd_device_get_info(struct iommufd_ucmd *ucmd) { struct iommu_device_info *cmd = ucmd->cmd; diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index ee97d2f3cf43..998d01490a74 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -92,3 +92,119 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, { return __iommufd_hw_pagetable_alloc(ictx, ioas, dev, NULL, NULL); } + +/* + * size of page table type specific data, indexed by + * enum iommu_pgtbl_data_type. + */ +static const size_t iommufd_hwpt_info_size[] = { + [IOMMU_PGTBL_DATA_NONE] = 0, +}; + +int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) +{ + struct iommufd_hw_pagetable *hwpt; + struct iommu_hwpt_alloc *cmd = ucmd->cmd; + struct iommufd_ctx *ictx = ucmd->ictx; + struct iommufd_object *pt_obj = NULL; + struct iommufd_ioas *ioas = NULL; + struct iommufd_object *dev_obj; + struct device *dev; + const struct iommu_ops *ops; + void *data = NULL; + u32 driver_type, klen; + int rc; + + if (cmd->__reserved || cmd->flags) + return -EOPNOTSUPP; + + dev_obj = iommufd_get_object(ucmd->ictx, cmd->dev_id, + IOMMUFD_OBJ_ANY); + if (IS_ERR(dev_obj)) + return PTR_ERR(dev_obj); + + dev = iommufd_obj_dev(dev_obj); + if (!dev) { + rc = -EINVAL; + goto out_put_dev; + } + + ops = dev_iommu_ops(dev); + if (!ops) { + rc = -EOPNOTSUPP; + goto out_put_dev; + } + + driver_type = ops->driver_type; + + /* data_type should be a supported type by the hardware */ + if (!((1 << cmd->data_type) & + iommufd_supported_pgtbl_types[driver_type])) { + rc = -EINVAL; + goto out_put_dev; + } + + pt_obj = iommufd_get_object(ictx, cmd->pt_id, IOMMUFD_OBJ_ANY); + if (IS_ERR(pt_obj)) { + rc = -EINVAL; + goto out_put_dev; + } + + switch (pt_obj->type) { + case IOMMUFD_OBJ_IOAS: + ioas = container_of(pt_obj, struct iommufd_ioas, obj); + break; + default: + rc = -EINVAL; + goto out_put_pt; + } + + klen = iommufd_hwpt_info_size[cmd->data_type]; + if (klen) { + if (!cmd->data_len) { + rc = -EINVAL; + goto out_put_pt; + } + + data = kzalloc(klen, GFP_KERNEL); + if (!data) { + rc = -ENOMEM; + goto out_put_pt; + } + + rc = copy_struct_from_user(data, klen, + u64_to_user_ptr(cmd->data_uptr), + cmd->data_len); + if (rc) + goto out_free_data; + } + + mutex_lock(&ioas->mutex); + hwpt = __iommufd_hw_pagetable_alloc(ictx, ioas, dev, NULL, data); + mutex_unlock(&ioas->mutex); + if (IS_ERR(hwpt)) { + rc = PTR_ERR(hwpt); + goto out_free_data; + } + + cmd->out_hwpt_id = hwpt->obj.id; + + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); + if (rc) + goto out_destroy_hwpt; + + kfree(data); + iommufd_object_finalize(ucmd->ictx, &hwpt->obj); + iommufd_put_object(pt_obj); + iommufd_put_object(dev_obj); + return 0; +out_destroy_hwpt: + iommufd_object_abort_and_destroy(ucmd->ictx, &hwpt->obj); +out_free_data: + kfree(data); +out_put_pt: + iommufd_put_object(pt_obj); +out_put_dev: + iommufd_put_object(dev_obj); + return rc; +} diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 604ad29f87b8..ee5344baf135 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -257,7 +257,22 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct 
iommufd_ioas *ioas, struct device *dev); void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
+static inline struct iommufd_hw_pagetable * +iommufd_get_hwpt(struct iommufd_ucmd *ucmd, u32 id) +{ + return container_of(iommufd_get_object(ucmd->ictx, id, + IOMMUFD_OBJ_HW_PAGETABLE), + struct iommufd_hw_pagetable, obj); +} + +int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd); + +struct device *iommufd_obj_dev(struct iommufd_object *obj); + void iommufd_device_destroy(struct iommufd_object *obj); + +extern const u64 iommufd_supported_pgtbl_types[]; + int iommufd_device_get_info(struct iommufd_ucmd *ucmd);
struct iommufd_access { diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 59aa30ad1090..831303d64abe 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -251,6 +251,7 @@ static int iommufd_option(struct iommufd_ucmd *ucmd) union ucmd_buffer { struct iommu_destroy destroy; struct iommu_device_info info; + struct iommu_hwpt_alloc hwpt; struct iommu_ioas_alloc alloc; struct iommu_ioas_allow_iovas allow_iovas; struct iommu_ioas_copy ioas_copy; @@ -284,6 +285,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id), IOCTL_OP(IOMMU_DEVICE_GET_INFO, iommufd_device_get_info, struct iommu_device_info, __reserved), + IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, + __reserved), IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl, struct iommu_ioas_alloc, out_ioas_id), IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index bbffb63d2513..f501add5ffe9 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -46,6 +46,7 @@ enum { IOMMUFD_CMD_OPTION, IOMMUFD_CMD_VFIO_IOAS, IOMMUFD_CMD_DEVICE_GET_INFO, + IOMMUFD_CMD_HWPT_ALLOC, };
/** @@ -373,6 +374,14 @@ struct iommu_device_info_vtd { __aligned_u64 ecap_reg; };
+/** + * enum iommu_pgtbl_data_type - IOMMU Page Table User Data type + * @IOMMU_PGTBL_DATA_NONE: no user data + */ +enum iommu_pgtbl_data_type { + IOMMU_PGTBL_DATA_NONE, +}; + /** * struct iommu_device_info - ioctl(IOMMU_DEVICE_GET_INFO) * @size: sizeof(struct iommu_device_info) @@ -461,6 +470,45 @@ struct iommu_hwpt_intel_vtd { __u32 __reserved; };
+/**
+ * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
+ * @size: sizeof(struct iommu_hwpt_alloc)
+ * @flags: Must be 0
+ * @dev_id: The device to allocate this HWPT for
+ * @pt_id: The parent of this HWPT (IOAS or HWPT)
+ * @data_type: One of enum iommu_pgtbl_data_type
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
+ * @out_hwpt_id: Output HWPT ID for the allocated object
+ * @__reserved: Must be 0
+ *
+ * Allocate a hw_pagetable for managing page tables in userspace. Such page
+ * tables can be user-managed or kernel-managed. @pt_id is needed in either
+ * case, while @data_type, @data_len and @data_uptr are optional. For
+ * user-managed page tables, userspace should provide the @data_type, the
+ * @data_len and the type specific data. For kernel-managed page tables,
+ * use the IOMMU_PGTBL_DATA_NONE data_type; @data_len and @data_uptr will
+ * be ignored.
+ *
+ * +==============================+=====================================+
+ * | @data_type                   | Data structure in @data_uptr        |
+ * +------------------------------+-------------------------------------+
+ * | IOMMU_PGTBL_DATA_NONE        | N/A                                 |
+ * +------------------------------+-------------------------------------+
+ */
+struct iommu_hwpt_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 dev_id;
+	__u32 pt_id;
+	__u32 data_type;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
+	__u32 out_hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
+
 /**
  * enum iommu_vtd_qi_granularity - Intel VT-d specific granularity of
  * queued invalidation
On Wed, Feb 08, 2023 at 08:31:45PM -0800, Yi Liu wrote:
Today iommufd allocates the kernel-managed hw_pagetable implicitly when a device is attached to an IOAS. This links the hw_pagetable to the IOPT within the IOAS.
However, this is not ideal. It makes more sense to let userspace explicitly request hw_pagetable allocation via iommufd: even though the hw_pagetable is kernel-managed, the mappings are fed by userspace. Explicit allocation also makes the lifecycle of a kernel-managed hw_pagetable clearer. This matters in the nested translation usage, where the kernel-managed hw_pagetable is used as the stage-2 hw_pagetable. In that case, both the stage-1 and stage-2 hw_pagetables should be allocated by userspace so their lifecycles can be managed together.
This adds an IOMMU_HWPT_ALLOC ioctl for hw_pagetable allocation. For a kernel-managed hw_pagetable, userspace should provide an IOAS ID in the allocation request.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/device.c | 11 ++- drivers/iommu/iommufd/hw_pagetable.c | 116 ++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 15 +++ drivers/iommu/iommufd/main.c | 3 + include/uapi/linux/iommufd.h | 48 ++++++++++ 5 files changed, 191 insertions(+), 2 deletions(-)
This patch and its requirements should all be first in the series. A mini series whose only job is to add IOMMU_HWPT_ALLOC
Then patches to add IOMMU_HWPT_INVALIDATE
Then the vt-d implementation of all this, including the vt-d specific changes to the uapi/etc.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, February 10, 2023 4:45 AM
On Wed, Feb 08, 2023 at 08:31:45PM -0800, Yi Liu wrote:
Today iommufd allocates the kernel-managed hw_pagetable implicitly when a device is attached to an IOAS. This links the hw_pagetable to the IOPT within the IOAS.

However, this is not ideal. It makes more sense to let userspace explicitly request hw_pagetable allocation via iommufd: even though the hw_pagetable is kernel-managed, the mappings are fed by userspace. Explicit allocation also makes the lifecycle of a kernel-managed hw_pagetable clearer. This matters in the nested translation usage, where the kernel-managed hw_pagetable is used as the stage-2 hw_pagetable. In that case, both the stage-1 and stage-2 hw_pagetables should be allocated by userspace so their lifecycles can be managed together.

This adds an IOMMU_HWPT_ALLOC ioctl for hw_pagetable allocation. For a kernel-managed hw_pagetable, userspace should provide an IOAS ID in the allocation request.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/device.c          | 11 ++-
drivers/iommu/iommufd/hw_pagetable.c    | 116 ++++++++++++++++++++++++
drivers/iommu/iommufd/iommufd_private.h | 15 +++
drivers/iommu/iommufd/main.c            | 3 +
include/uapi/linux/iommufd.h            | 48 ++++++++++
5 files changed, 191 insertions(+), 2 deletions(-)
This patch and its requirements should all be first in the series. A mini series whose only job is to add IOMMU_HWPT_ALLOC
Then patches to add IOMMU_HWPT_INVALIDATE
Then the vt-d implementation of all this, including the vt-d specific changes to the uapi/etc.
Ok, will reorder in the next version.
Regards, Yi Liu
From: Nicolin Chen nicolinc@nvidia.com
When using a 2-stage nested hw_pagetable (hwpt) setup, the IOVA mappings and SW MSI are in the stage-2 hwpt, while the device is attached to a stage-1 hwpt. The current do_attach() flow then fails to set up MSI, since it always uses the hwpt being attached (i.e. the stage-1 hwpt in this case).
As a preparatory change for the following nesting support patch, move all related operations into a pair of separate helper functions, so that only the hwpt pointer needs to be redirected to its stage-2 hwpt when using a nested setup.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 40 +++++++++++++++++++++++++++------- 1 file changed, 32 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index dd7943ff02e4..cdc4ab36f52d 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -294,6 +294,36 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, return false; }
+static int iommufd_device_attach_ioas(struct iommufd_device *idev, + struct iommufd_hw_pagetable *hwpt) +{ + phys_addr_t sw_msi_start = PHYS_ADDR_MAX; + struct io_pagetable *iopt; + int rc; + + iopt = &hwpt->ioas->iopt; + + rc = iopt_table_enforce_group_resv_regions(iopt, idev->dev, + idev->group, &sw_msi_start); + if (rc) + return rc; + + rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start); + if (rc) + goto out_iova; + + return 0; +out_iova: + iopt_remove_reserved_iova(iopt, idev->group); + return rc; +} + +static void iommufd_device_detach_ioas(struct iommufd_device *idev, + struct iommufd_hw_pagetable *hwpt) +{ + iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); +} + /** * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt * @idev: device to detach @@ -323,7 +353,7 @@ static void __iommmufd_device_detach(struct iommufd_device *idev,
if (hwpt->ioas != new_ioas) { mutex_lock(&hwpt->ioas->mutex); - iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); + iommufd_device_detach_ioas(idev, hwpt); mutex_unlock(&hwpt->ioas->mutex); } mutex_unlock(&hwpt->devices_lock); @@ -342,7 +372,6 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt) { struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt; - phys_addr_t sw_msi_start = PHYS_ADDR_MAX; int rc;
lockdep_assert_held(&hwpt->ioas->mutex); @@ -367,15 +396,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, } }
- rc = iopt_table_enforce_group_resv_regions(&hwpt->ioas->iopt, idev->dev, - idev->group, &sw_msi_start); + rc = iommufd_device_attach_ioas(idev, hwpt); if (rc) goto out_unlock;
- rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start); - if (rc) - goto out_iova; - /* * FIXME: Hack around missing a device-centric iommu api, only attach to * the group once for the first device that is in the group.
Nested translation has stage-1 and stage-2 hw_pagetables, and both need to be allocated by userspace. A stage-1 hw_pagetable needs to work with a stage-2 hw_pagetable. This adds a parent pointer in struct iommufd_hw_pagetable to link a stage-1 hw_pagetable with a stage-2 hw_pagetable, so that the iommufd core can accept user-managed hw_pagetable allocation requests.
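As a rough sketch of the flow this enables from userspace, assuming the VT-d stage-1 data type and struct iommu_hwpt_intel_vtd payload that are only added later in this series (the helper name and error handling are illustrative only):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/*
 * Sketch: allocate a kernel-managed stage-2 hwpt from an IOAS, then a
 * user-managed stage-1 hwpt whose @pt_id points at the stage-2 hwpt.
 * In the kernel, the stage-2 hwpt becomes hwpt->parent of the stage-1.
 */
static int hwpt_alloc_nested(int iommufd, uint32_t dev_id, uint32_t ioas_id,
			     const struct iommu_hwpt_intel_vtd *s1_cfg,
			     uint32_t *s2_id, uint32_t *s1_id)
{
	struct iommu_hwpt_alloc s2 = {
		.size = sizeof(s2),
		.dev_id = dev_id,
		.pt_id = ioas_id,	/* stage-2 is backed by the IOAS */
		.data_type = IOMMU_PGTBL_DATA_NONE,
	};
	struct iommu_hwpt_alloc s1 = {
		.size = sizeof(s1),
		.dev_id = dev_id,
		.data_type = IOMMU_PGTBL_DATA_VTD_S1,
		.data_len = sizeof(*s1_cfg),
		.data_uptr = (uintptr_t)s1_cfg,
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2))
		return -1;
	*s2_id = s2.out_hwpt_id;

	s1.pt_id = *s2_id;	/* parent is the stage-2 hwpt */
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1))
		return -1;	/* caller destroys *s2_id on failure */
	*s1_id = s1.out_hwpt_id;
	return 0;
}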
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 5 +++++ drivers/iommu/iommufd/hw_pagetable.c | 21 +++++++++++++++------ drivers/iommu/iommufd/iommufd_private.h | 1 + 3 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index cdc4ab36f52d..6d948fa418d5 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -301,6 +301,9 @@ static int iommufd_device_attach_ioas(struct iommufd_device *idev, struct io_pagetable *iopt; int rc;
+ if (hwpt->parent) + hwpt = hwpt->parent; + iopt = &hwpt->ioas->iopt;
rc = iopt_table_enforce_group_resv_regions(iopt, idev->dev, @@ -321,6 +324,8 @@ static int iommufd_device_attach_ioas(struct iommufd_device *idev, static void iommufd_device_detach_ioas(struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt) { + if (hwpt->parent) + hwpt = hwpt->parent; iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); }
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 998d01490a74..02dee8e8d958 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -15,8 +15,12 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
WARN_ON(!list_empty(&hwpt->devices));
- iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain); - list_del(&hwpt->hwpt_item); + if (!hwpt->parent) { + iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain); + list_del(&hwpt->hwpt_item); + } else { + refcount_dec(&hwpt->parent->obj.users); + } iommu_domain_free(hwpt->domain); refcount_dec(&hwpt->ioas->obj.users); mutex_destroy(&hwpt->devices_lock); @@ -58,13 +62,18 @@ __iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, goto out_abort; }
+ hwpt->parent = parent; INIT_LIST_HEAD(&hwpt->devices); INIT_LIST_HEAD(&hwpt->hwpt_item); mutex_init(&hwpt->devices_lock); - rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain); - if (rc) - goto out_free_domain; - list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list); + if (!parent) { + rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain); + if (rc) + goto out_free_domain; + list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list); + } else { + refcount_inc(&parent->obj.users); + }
/* Pairs with iommufd_hw_pagetable_destroy() */ refcount_inc(&ioas->obj.users); diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index ee5344baf135..5ef034451f4b 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -241,6 +241,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd); */ struct iommufd_hw_pagetable { struct iommufd_object obj; + struct iommufd_hw_pagetable *parent; struct iommufd_ioas *ioas; struct iommu_domain *domain; bool auto_domain : 1;
With the introduction of nested translation, there are page tables managed by userspace. A hw_pagetable can be a stage-1 pagetable, a stage-2 pagetable, or a standalone pagetable.
Stage-2 and standalone pagetables are kernel-managed for security; iommufd already supports them.
A stage-1 pagetable is user-managed and needs to work with a stage-2 pagetable. Hence, userspace should provide a hw_pagetable ID that points to a stage-2 hw_pagetable. Since stage-1 is user-managed, an ioctl is also added to sync the IOTLB when the stage-1 page table is modified.
The first available user-managed hw_pagetable type is the Intel VT-d stage-1 pagetable for nested translation.
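A hedged userspace sketch of that sync path; the layout of struct iommu_hwpt_invalidate_intel_vtd comes from the VT-d uapi additions in this series, so the helper below (an illustrative name, not part of the series) just forwards a caller-prepared buffer:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/*
 * Sketch: after modifying a stage-1 (user-managed) page table, sync the
 * host IOTLB. Only user-managed hwpts accept this ioctl; the
 * IOMMU_PGTBL_DATA_NONE data_type is rejected.
 */
static int hwpt_invalidate_s1(int iommufd, uint32_t hwpt_id,
			      void *inv_data, uint32_t inv_len)
{
	struct iommu_hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = hwpt_id,
		.data_type = IOMMU_PGTBL_DATA_VTD_S1,
		.data_len = inv_len,
		.data_uptr = (uintptr_t)inv_data,
	};

	return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
}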
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 3 +- drivers/iommu/iommufd/hw_pagetable.c | 71 ++++++++++++++++++++++++- drivers/iommu/iommufd/iommufd_private.h | 1 + drivers/iommu/iommufd/main.c | 8 +++ include/uapi/linux/iommufd.h | 34 ++++++++++++ 5 files changed, 114 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 6d948fa418d5..c19e2f54a44f 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -164,7 +164,8 @@ struct device *iommufd_obj_dev(struct iommufd_object *obj) * indexed by the members defined in enum iommu_device_data_type. */ const u64 iommufd_supported_pgtbl_types[] = { - [IOMMU_DEVICE_DATA_INTEL_VTD] = BIT_ULL(IOMMU_PGTBL_DATA_NONE), + [IOMMU_DEVICE_DATA_INTEL_VTD] = BIT_ULL(IOMMU_PGTBL_DATA_NONE) | + BIT_ULL(IOMMU_PGTBL_DATA_VTD_S1), };
int iommufd_device_get_info(struct iommufd_ucmd *ucmd) diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 02dee8e8d958..44a75ccc8e08 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -108,11 +108,12 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, */ static const size_t iommufd_hwpt_info_size[] = { [IOMMU_PGTBL_DATA_NONE] = 0, + [IOMMU_PGTBL_DATA_VTD_S1] = sizeof(struct iommu_hwpt_intel_vtd), };
int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) { - struct iommufd_hw_pagetable *hwpt; + struct iommufd_hw_pagetable *hwpt, *parent = NULL; struct iommu_hwpt_alloc *cmd = ucmd->cmd; struct iommufd_ctx *ictx = ucmd->ictx; struct iommufd_object *pt_obj = NULL; @@ -160,6 +161,19 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) }
switch (pt_obj->type) { + case IOMMUFD_OBJ_HW_PAGETABLE: + parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj); + /* + * Cannot allocate user-managed hwpt linking to auto_created + * hwpt. If the parent hwpt is already a user-managed hwpt, + * don't allocate another user-managed hwpt linking to it. + */ + if (parent->auto_domain || parent->parent) { + rc = -EINVAL; + goto out_put_pt; + } + ioas = parent->ioas; + break; case IOMMUFD_OBJ_IOAS: ioas = container_of(pt_obj, struct iommufd_ioas, obj); break; @@ -189,7 +203,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) }
mutex_lock(&ioas->mutex); - hwpt = __iommufd_hw_pagetable_alloc(ictx, ioas, dev, NULL, data); + hwpt = __iommufd_hw_pagetable_alloc(ictx, ioas, dev, parent, data); mutex_unlock(&ioas->mutex); if (IS_ERR(hwpt)) { rc = PTR_ERR(hwpt); @@ -217,3 +231,56 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) iommufd_put_object(dev_obj); return rc; } + +static u32 iommufd_hwpt_invalidate_info_size[] = { + [IOMMU_PGTBL_DATA_VTD_S1] = sizeof(struct iommu_hwpt_invalidate_intel_vtd), +}; + +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{ + struct iommu_hwpt_invalidate *cmd = ucmd->cmd; + struct iommufd_hw_pagetable *hwpt; + u64 user_ptr; + u32 user_data_len, klen; + int rc = 0; + + /* + * No invalidation needed for type==IOMMU_PGTBL_DATA_NONE. + * data_len should not exceed the size of iommufd_invalidate_buffer. + */ + if (cmd->data_type == IOMMU_PGTBL_DATA_NONE || !cmd->data_len) + return -EOPNOTSUPP; + + hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id); + if (IS_ERR(hwpt)) + return PTR_ERR(hwpt); + + /* Do not allow any kernel-managed hw_pagetable */ + if (!hwpt->parent) { + rc = -EINVAL; + goto out_put_hwpt; + } + + klen = iommufd_hwpt_invalidate_info_size[cmd->data_type]; + if (!klen) { + rc = -EINVAL; + goto out_put_hwpt; + } + + /* + * copy the needed fields before reusing the ucmd buffer, this + * avoids memory allocation in this path. + */ + user_ptr = cmd->data_uptr; + user_data_len = cmd->data_len; + + rc = copy_struct_from_user(cmd, klen, + u64_to_user_ptr(user_ptr), user_data_len); + if (rc) + goto out_put_hwpt; + + hwpt->domain->ops->iotlb_sync_user(hwpt->domain, cmd); +out_put_hwpt: + iommufd_put_object(&hwpt->obj); + return rc; +} diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 5ef034451f4b..bb341e633c18 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -267,6 +267,7 @@ iommufd_get_hwpt(struct iommufd_ucmd *ucmd, u32 id) }
int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd); +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
struct device *iommufd_obj_dev(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 831303d64abe..6e2d8805daf3 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -252,6 +252,12 @@ union ucmd_buffer { struct iommu_destroy destroy; struct iommu_device_info info; struct iommu_hwpt_alloc hwpt; + struct iommu_hwpt_invalidate cache; + /* + * data_type specific structure used in the cache invalidation + * path. + */ + struct iommu_hwpt_invalidate_intel_vtd vtd; struct iommu_ioas_alloc alloc; struct iommu_ioas_allow_iovas allow_iovas; struct iommu_ioas_copy ioas_copy; @@ -287,6 +293,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { __reserved), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, __reserved), + IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate, + struct iommu_hwpt_invalidate, data_uptr), IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl, struct iommu_ioas_alloc, out_ioas_id), IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f501add5ffe9..cb6a9ee215f4 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -47,6 +47,7 @@ enum { IOMMUFD_CMD_VFIO_IOAS, IOMMUFD_CMD_DEVICE_GET_INFO, IOMMUFD_CMD_HWPT_ALLOC, + IOMMUFD_CMD_HWPT_INVALIDATE, };
/** @@ -377,9 +378,11 @@ struct iommu_device_info_vtd { /** * enum iommu_pgtbl_data_type - IOMMU Page Table User Data type * @IOMMU_PGTBL_DATA_NONE: no user data + * @IOMMU_PGTBL_DATA_VTD_S1: Data for Intel VT-d stage-1 page table */ enum iommu_pgtbl_data_type { IOMMU_PGTBL_DATA_NONE, + IOMMU_PGTBL_DATA_VTD_S1, };
/** @@ -495,6 +498,8 @@ struct iommu_hwpt_intel_vtd { * +------------------------------+-------------------------------------+ * | IOMMU_PGTBL_DATA_NONE | N/A | * +------------------------------+-------------------------------------+ + * | IOMMU_PGTBL_DATA_VTD_S1 | struct iommu_hwpt_intel_vtd | + * +------------------------------+-------------------------------------+ */ struct iommu_hwpt_alloc { __u32 size; @@ -562,4 +567,33 @@ struct iommu_hwpt_invalidate_intel_vtd { __u64 granule_size; __u64 nb_granules; }; + +/** + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE) + * @size: sizeof(struct iommu_hwpt_invalidate) + * @hwpt_id: HWPT ID of target hardware page table for the invalidation + * @data_type: One of enum iommu_pgtbl_data_type + * @data_len: Length of the type specific data + * @data_uptr: User pointer to the type specific data + * + * Invalidate the iommu cache for user-managed page table. Modifications + * on user-managed page table should be followed with this operation to + * sync the userspace with the kernel and underlying hardware. This operation + * is only needed by user-managed hw_pagetables, so the @data_type should + * never be IOMMU_PGTBL_DATA_NONE. + * + * +==============================+========================================+ + * | @data_type | Data structure in @data_uptr | + * +------------------------------+----------------------------------------+ + * | IOMMU_PGTBL_DATA_VTD_S1 | struct iommu_hwpt_invalidate_intel_vtd | + * +------------------------------+----------------------------------------+ + */ +struct iommu_hwpt_invalidate { + __u32 size; + __u32 hwpt_id; + __u32 data_type; + __u32 data_len; + __aligned_u64 data_uptr; +}; +#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE) #endif
This provides a way for userspace to probe the stage-1 page table types supported by the hardware IOMMU. This is helpful when a user-managed page table is to be used by the hardware IOMMU, e.g. for nested translation.
This commit reports the supported types for IOMMU_DEVICE_DATA_INTEL_VTD.
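A small sketch of the intended probing flow; it assumes the caller has already issued IOMMU_DEVICE_GET_INFO successfully, so only the new bitmap check is shown (the helper name is illustrative):

#include <stdbool.h>
#include <linux/iommufd.h>

/*
 * Sketch: check whether the device's IOMMU supports VT-d stage-1 page
 * tables before attempting IOMMU_HWPT_ALLOC with IOMMU_PGTBL_DATA_VTD_S1.
 */
static bool dev_supports_vtd_s1(const struct iommu_device_info *info)
{
	return info->out_pgtbl_type_bitmap &
	       (1ULL << IOMMU_PGTBL_DATA_VTD_S1);
}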
Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 7 +++++++ drivers/iommu/iommufd/main.c | 2 +- include/uapi/linux/iommufd.h | 7 +++++++ 3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index c19e2f54a44f..826441c6005d 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -224,6 +224,13 @@ int iommufd_device_get_info(struct iommufd_ucmd *ucmd) cmd->out_device_type = ops->driver_type; cmd->data_len = data_len;
+ if (ops->driver_type != IOMMU_DEVICE_DATA_SELFTEST) + cmd->out_pgtbl_type_bitmap = iommufd_supported_pgtbl_types[ops->driver_type]; +#ifdef CONFIG_IOMMUFD_TEST + else + cmd->out_pgtbl_type_bitmap = 0; +#endif + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
out_free_data: diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 6e2d8805daf3..d8bac1303b33 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -290,7 +290,7 @@ struct iommufd_ioctl_op { static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id), IOCTL_OP(IOMMU_DEVICE_GET_INFO, iommufd_device_get_info, struct iommu_device_info, - __reserved), + out_pgtbl_type_bitmap), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, __reserved), IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index cb6a9ee215f4..2c7533d843bc 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -396,6 +396,8 @@ enum iommu_pgtbl_data_type { * @out_device_type: Output the underlying iommu hardware type, it is * one of enum iommu_device_data_type. * @__reserved: Must be 0 + * @out_pgtbl_type_bitmap: Output the supported page table type. Each + * bit is defined in enum iommu_pgtbl_data_type. * * Query the hardware iommu capability for given device which has been * bound to iommufd. @data_len is set to be the size of the buffer to @@ -408,6 +410,10 @@ enum iommu_pgtbl_data_type { * * The @out_device_type will be filled if the ioctl succeeds. It would * be used to decode the data filled in the buffer pointed by @data_ptr. + * + * @out_pgtbl_type_bitmap tells the userspace the supported page tables. + * This differs per @out_device_type. Userspace should check it before + * allocating hw_pagetable in userspace. */ struct iommu_device_info { __u32 size; @@ -417,6 +423,7 @@ struct iommu_device_info { __aligned_u64 data_ptr; __u32 out_device_type; __u32 __reserved; + __aligned_u64 out_pgtbl_type_bitmap; }; #define IOMMU_DEVICE_GET_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_GET_INFO)
From: Nicolin Chen nicolinc@nvidia.com
This allows detaching/attaching the mock_dev to a specified domain/hwpt.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 1 - drivers/iommu/iommufd/iommufd_test.h | 5 +++ drivers/iommu/iommufd/selftest.c | 54 ++++++++++++++++++++++++++++ 3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 826441c6005d..bf803975062c 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -964,7 +964,6 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD); * Creating a real iommufd_device is too hard, bypass creating a iommufd_device * and go directly to attaching a domain. */ - int iommufd_device_selftest_attach(struct iommufd_ctx *ictx, struct iommufd_hw_pagetable *hwpt) { diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 1605ff2b1a90..a73ab51afe79 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -10,6 +10,7 @@ enum { IOMMU_TEST_OP_ADD_RESERVED = 1, IOMMU_TEST_OP_MOCK_DOMAIN, + IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE, IOMMU_TEST_OP_MD_CHECK_MAP, IOMMU_TEST_OP_MD_CHECK_REFS, IOMMU_TEST_OP_CREATE_ACCESS, @@ -53,6 +54,10 @@ struct iommu_test_cmd { __u32 out_device_id; __u32 out_hwpt_id; } mock_domain; + struct { + __u32 device_id; + __u32 hwpt_id; + } mock_domain_replace; struct { __aligned_u64 iova; __aligned_u64 length; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 5f841d1d9e96..3623a42648d5 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -98,6 +98,7 @@ struct selftest_obj { union { struct { struct iommufd_hw_pagetable *hwpt; + struct iommufd_ioas *ioas; struct iommufd_ctx *ictx; struct device mock_dev; } idev; @@ -308,6 +309,7 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd, } sobj->idev.ictx = ucmd->ictx; sobj->type = TYPE_IDEV; + sobj->idev.ioas = ioas; sobj->idev.mock_dev.bus = &mock_bus; sobj->idev.mock_dev.iommu = &iommu;
@@ -341,6 +343,56 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd, return rc; }
+/* Replace the mock domain with a manually allocated hw_pagetable */ +static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd, + struct iommu_test_cmd *cmd) +{ + struct iommufd_object *dev_obj, *hwpt_obj; + struct iommufd_hw_pagetable *hwpt; + struct selftest_obj *sobj; + int rc; + + hwpt_obj = iommufd_get_object(ucmd->ictx, + cmd->mock_domain_replace.hwpt_id, + IOMMUFD_OBJ_HW_PAGETABLE); + if (IS_ERR(hwpt_obj)) + return PTR_ERR(hwpt_obj); + + hwpt = container_of(hwpt_obj, struct iommufd_hw_pagetable, obj); + + dev_obj = iommufd_get_object(ucmd->ictx, + cmd->mock_domain_replace.device_id, + IOMMUFD_OBJ_SELFTEST); + if (IS_ERR(dev_obj)) { + rc = PTR_ERR(dev_obj); + goto out_hwpt_obj; + } + + sobj = container_of(dev_obj, struct selftest_obj, obj); + if (sobj->type != TYPE_IDEV) { + rc = -EINVAL; + goto out_dev_obj; + } + + iommufd_device_selftest_detach(sobj->idev.ictx, sobj->idev.hwpt); + + rc = iommufd_device_selftest_attach(ucmd->ictx, hwpt); + if (rc) + goto out_reattach; + sobj->idev.hwpt = hwpt; + + rc = 0; + goto out_dev_obj; + +out_reattach: + iommufd_device_selftest_attach(ucmd->ictx, sobj->idev.hwpt); +out_dev_obj: + iommufd_put_object(dev_obj); +out_hwpt_obj: + iommufd_put_object(hwpt_obj); + return rc; +} + /* Add an additional reserved IOVA to the IOAS */ static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd, unsigned int mockpt_id, @@ -847,6 +899,8 @@ int iommufd_test(struct iommufd_ucmd *ucmd) cmd->add_reserved.length); case IOMMU_TEST_OP_MOCK_DOMAIN: return iommufd_test_mock_domain(ucmd, cmd); + case IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE: + return iommufd_test_mock_domain_replace(ucmd, cmd); case IOMMU_TEST_OP_MD_CHECK_MAP: return iommufd_test_md_check_pa( ucmd, cmd->id, cmd->check_map.iova,
On Wed, Feb 08, 2023 at 08:31:50PM -0800, Yi Liu wrote:
From: Nicolin Chen nicolinc@nvidia.com
This allows to detach/attach the mock_dev to a specified domain/hwpt.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/device.c | 1 - drivers/iommu/iommufd/iommufd_test.h | 5 +++ drivers/iommu/iommufd/selftest.c | 54 ++++++++++++++++++++++++++++ 3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 826441c6005d..bf803975062c 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -964,7 +964,6 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
 * Creating a real iommufd_device is too hard, bypass creating a iommufd_device
 * and go directly to attaching a domain.
 */
int iommufd_device_selftest_attach(struct iommufd_ctx *ictx, struct iommufd_hw_pagetable *hwpt) {
Just found last night that I forgot to drop this line removal.
Let's fix it in v2 :)
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com
This allows allocating a hwpt for the mock device, to exercise IOMMU_HWPT_ALLOC in the selftest.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 2 +- drivers/iommu/iommufd/hw_pagetable.c | 28 ++++- drivers/iommu/iommufd/iommufd_test.h | 16 +++ drivers/iommu/iommufd/selftest.c | 42 ++++++- tools/testing/selftests/iommu/iommufd.c | 116 +++++++++++++++++- tools/testing/selftests/iommu/iommufd_utils.h | 68 ++++++++++ 6 files changed, 266 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index bf803975062c..ea5691468740 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -228,7 +228,7 @@ int iommufd_device_get_info(struct iommufd_ucmd *ucmd)
 		cmd->out_pgtbl_type_bitmap = iommufd_supported_pgtbl_types[ops->driver_type];
 #ifdef CONFIG_IOMMUFD_TEST
 	else
-		cmd->out_pgtbl_type_bitmap = 0;
+		cmd->out_pgtbl_type_bitmap = U64_MAX; /* Pretend to support all types */
 #endif
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 44a75ccc8e08..1f1940969b2b 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -5,6 +5,7 @@ #include <linux/iommu.h>
#include "iommufd_private.h" +#include "iommufd_test.h"
void iommufd_hw_pagetable_destroy(struct iommufd_object *obj) { @@ -111,6 +112,15 @@ static const size_t iommufd_hwpt_info_size[] = { [IOMMU_PGTBL_DATA_VTD_S1] = sizeof(struct iommu_hwpt_intel_vtd), };
+/* Return true if type is supported, otherwise false */ +static inline bool +iomufd_hwpt_type_check(enum iommu_device_data_type driver_type, + enum iommu_pgtbl_data_type type) +{ + return ((1 << type) & + iommufd_supported_pgtbl_types[driver_type]); +} + int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) { struct iommufd_hw_pagetable *hwpt, *parent = NULL; @@ -124,6 +134,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) void *data = NULL; u32 driver_type, klen; int rc; + bool support;
if (cmd->__reserved || cmd->flags) return -EOPNOTSUPP; @@ -148,8 +159,14 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) driver_type = ops->driver_type;
/* data_type should be a supported type by the hardware */ - if (!((1 << cmd->data_type) & - iommufd_supported_pgtbl_types[driver_type])) { + if (cmd->data_type != IOMMU_PGTBL_DATA_SELFTTEST) + support = iomufd_hwpt_type_check(driver_type, + cmd->data_type); +#ifdef CONFIG_IOMMUFD_TEST + else + support = true; /* selftest pretend to support all types */ +#endif + if (!support) { rc = -EINVAL; goto out_put_dev; } @@ -182,7 +199,12 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) goto out_put_pt; }
- klen = iommufd_hwpt_info_size[cmd->data_type]; + if (cmd->data_type != IOMMU_PGTBL_DATA_SELFTTEST) + klen = iommufd_hwpt_info_size[cmd->data_type]; +#ifdef CONFIG_IOMMUFD_TEST + else + klen = sizeof(struct iommu_hwpt_selftest); +#endif if (klen) { if (!cmd->data_len) { rc = -EINVAL; diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index a73ab51afe79..842b81397a2a 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -114,4 +114,20 @@ struct iommu_device_info_selftest { __u32 test_reg; };
+/* Should not be equal to any defined value in enum iommu_pgtbl_data_type */ +#define IOMMU_PGTBL_DATA_SELFTTEST 0xbadbeef + +/** + * struct iommu_hwpt_selftest + * + * @flags: page table entry attributes + * @test_config: default iotlb setup (value IOMMU_TEST_IOTLB_DEFAULT) + */ +struct iommu_hwpt_selftest { +#define IOMMU_TEST_FLAG_NESTED (1 << 0) + __u64 flags; +#define IOMMU_TEST_IOTLB_DEFAULT 0xbadbeef + __u64 test_config; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 3623a42648d5..354a0054afad 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -84,7 +84,9 @@ void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
struct mock_iommu_domain { struct iommu_domain domain; + struct mock_iommu_domain *parent; struct xarray pfns; + u32 iotlb; };
enum selftest_obj_type { @@ -119,6 +121,8 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length) return info; }
+static const struct iommu_ops mock_ops; + static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type) { struct mock_iommu_domain *mock; @@ -132,10 +136,38 @@ static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type) mock->domain.geometry.aperture_start = MOCK_APERTURE_START; mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST; mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE; + mock->domain.ops = mock_ops.default_domain_ops; + mock->domain.type = IOMMU_DOMAIN_UNMANAGED; xa_init(&mock->pfns); return &mock->domain; }
+static struct iommu_domain_ops domain_nested_ops; + +static struct iommu_domain *mock_domain_alloc_user(struct device *dev, + struct iommu_domain *parent, + const void *user_data) +{ + const struct iommu_hwpt_selftest *alloc = user_data; + struct mock_iommu_domain *mock_parent; + struct mock_iommu_domain *mock; + + if (!parent || !alloc || !(alloc->flags & IOMMU_TEST_FLAG_NESTED)) + return mock_domain_alloc(IOMMU_DOMAIN_UNMANAGED); + + mock_parent = container_of(parent, struct mock_iommu_domain, domain); + + mock = kzalloc(sizeof(*mock), GFP_KERNEL); + if (!mock) + return NULL; + mock->parent = mock_parent; + mock->iotlb = alloc->test_config; + mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE; + mock->domain.ops = &domain_nested_ops; + mock->domain.type = IOMMU_DOMAIN_NESTED; + return &mock->domain; +} + static void mock_domain_free(struct iommu_domain *domain) { struct mock_iommu_domain *mock = @@ -257,6 +289,7 @@ static const struct iommu_ops mock_ops = { .driver_type = IOMMU_DEVICE_DATA_SELFTEST, .hw_info = mock_domain_hw_info, .domain_alloc = mock_domain_alloc, + .domain_alloc_user = mock_domain_alloc_user, .default_domain_ops = &(struct iommu_domain_ops){ .free = mock_domain_free, @@ -266,6 +299,10 @@ static const struct iommu_ops mock_ops = { }, };
+static struct iommu_domain_ops domain_nested_ops = { + .free = mock_domain_free, +}; + static inline struct iommufd_hw_pagetable * get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id, struct mock_iommu_domain **mock) @@ -278,7 +315,10 @@ get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id, if (IS_ERR(obj)) return ERR_CAST(obj); hwpt = container_of(obj, struct iommufd_hw_pagetable, obj); - if (hwpt->domain->ops != mock_ops.default_domain_ops) { + if ((hwpt->domain->type == IOMMU_DOMAIN_UNMANAGED && + hwpt->domain->ops != mock_ops.default_domain_ops) || + (hwpt->domain->type == IOMMU_DOMAIN_NESTED && + hwpt->domain->ops != &domain_nested_ops)) { iommufd_put_object(&hwpt->obj); return ERR_PTR(-EINVAL); } diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 8e1369451464..ddd7c898fc50 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -125,6 +125,7 @@ TEST_F(iommufd, cmd_length) TEST_LENGTH(iommu_option, IOMMU_OPTION); TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS); TEST_LENGTH(iommu_device_info, IOMMU_DEVICE_GET_INFO); + TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC); #undef TEST_LENGTH }
@@ -196,6 +197,7 @@ FIXTURE_VARIANT(iommufd_ioas) { unsigned int mock_domains; unsigned int memory_limit; + bool new_hwpt; };
FIXTURE_SETUP(iommufd_ioas) @@ -235,6 +237,12 @@ FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) .mock_domains = 1, };
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_hwpt) +{ + .mock_domains = 1, + .new_hwpt = true, +}; + FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) { .mock_domains = 2, @@ -250,6 +258,96 @@ TEST_F(iommufd_ioas, ioas_auto_destroy) { }
+TEST_F(iommufd_ioas, ioas_new_hwpt) +{ + uint32_t new_hwpt_id = 0; + bool nested = true; + + if (self->dev_id) { + test_cmd_mock_domain_alloc_and_replace(self->ioas_id, + self->dev_id, + &new_hwpt_id, !nested); + /* hw_pagetable cannot be freed if a device is attached to it */ + EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, new_hwpt_id)); + + /* Detach from the new hw_pagetable and try again */ + test_cmd_mock_domain_replace(self->ioas_id, self->dev_id, + self->domain_id); + test_ioctl_destroy(new_hwpt_id); + } else { + test_err_ioctl_hwpt_alloc(ENOENT, self->ioas_id, self->dev_id, + &new_hwpt_id, !nested); + test_err_mock_domain_replace(ENOENT, self->ioas_id, + self->dev_id, new_hwpt_id); + } +} + +TEST_F(iommufd_ioas, ioas_nested_hwpt) +{ + uint32_t nested_hwpt_id[2] = {}; + uint32_t parent_hwpt_id = 0; + uint32_t test_hwpt_id = 0; + bool nested = true; + + if (self->dev_id) { + /* Negative tests */ + test_err_ioctl_hwpt_alloc(EINVAL, self->dev_id, + self->ioas_id, &test_hwpt_id, !nested); + test_err_ioctl_hwpt_alloc(EINVAL, self->dev_id, + self->dev_id, &test_hwpt_id, !nested); + + /* Allocate two nested hwpts sharing one common parent hwpt */ + test_ioctl_hwpt_alloc(self->ioas_id, self->dev_id, + &parent_hwpt_id, !nested); + + test_ioctl_hwpt_alloc(parent_hwpt_id, self->dev_id, + &nested_hwpt_id[0], nested); + test_ioctl_hwpt_alloc(parent_hwpt_id, self->dev_id, + &nested_hwpt_id[1], nested); + + /* Negative test: a nested hwpt on top of a nested hwpt */ + test_err_ioctl_hwpt_alloc(EINVAL, nested_hwpt_id[0], + self->dev_id, &test_hwpt_id, nested); + /* Negative test: parent hwpt now cannot be freed */ + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, parent_hwpt_id)); + + /* Attach device to nested_hwpt_id[0] that then will be busy */ + test_cmd_mock_domain_replace(self->ioas_id, self->dev_id, + nested_hwpt_id[0]); + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, nested_hwpt_id[0])); + + /* Switch from nested_hwpt_id[0] to nested_hwpt_id[1] */ + test_cmd_mock_domain_replace(self->ioas_id, self->dev_id, + nested_hwpt_id[1]); + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, nested_hwpt_id[1])); + test_ioctl_destroy(nested_hwpt_id[0]); + + /* Detach from nested_hwpt_id[1] and destroy it */ + test_cmd_mock_domain_replace(self->ioas_id, self->dev_id, + parent_hwpt_id); + test_ioctl_destroy(nested_hwpt_id[1]); + + /* Detach from the parent hw_pagetable and destroy it */ + test_cmd_mock_domain_replace(self->ioas_id, self->dev_id, + self->domain_id); + test_ioctl_destroy(parent_hwpt_id); + } else { + test_err_ioctl_hwpt_alloc(ENOENT, self->ioas_id, self->dev_id, + &parent_hwpt_id, false); + test_err_ioctl_hwpt_alloc(ENOENT, parent_hwpt_id, self->dev_id, + &nested_hwpt_id[0], true); + test_err_ioctl_hwpt_alloc(ENOENT, parent_hwpt_id, self->dev_id, + &nested_hwpt_id[1], true); + test_err_mock_domain_replace(ENOENT, self->ioas_id, + self->dev_id, nested_hwpt_id[0]); + test_err_mock_domain_replace(ENOENT, self->ioas_id, + self->dev_id, nested_hwpt_id[1]); + } +} + TEST_F(iommufd_ioas, ioas_destroy) { if (self->domain_id) { @@ -621,6 +719,7 @@ TEST_F(iommufd_ioas, access_pin) MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
for (npages = 1; npages < BUFFER_SIZE / PAGE_SIZE; npages++) { + uint32_t new_hwpt_id = 0; uint32_t mock_device_id; uint32_t mock_hwpt_id;
@@ -655,11 +754,26 @@ TEST_F(iommufd_ioas, access_pin) &access_cmd)); test_cmd_mock_domain(self->ioas_id, &mock_device_id, &mock_hwpt_id); - check_map_cmd.id = mock_hwpt_id; + if (variant->new_hwpt) { + test_cmd_mock_domain_alloc_and_replace(self->ioas_id, + mock_device_id, + &new_hwpt_id, + false); + check_map_cmd.id = new_hwpt_id; + } else { + check_map_cmd.id = mock_hwpt_id; + } ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP), &check_map_cmd));
+ if (variant->new_hwpt) { + /* Detach from the new hwpt for its destroy() */ + test_cmd_mock_domain_replace(self->ioas_id, + mock_device_id, + mock_hwpt_id); + test_ioctl_destroy(new_hwpt_id); + } test_ioctl_destroy(mock_device_id); test_ioctl_destroy(mock_hwpt_id); test_cmd_destroy_access_pages( diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 1807d29c05f8..cb8a5e3beca6 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -66,6 +66,36 @@ static int _test_cmd_mock_domain(int fd, unsigned int ioas_id, __u32 *device_id, EXPECT_ERRNO(_errno, _test_cmd_mock_domain(self->fd, ioas_id, \ device_id, hwpt_id))
+static int _test_cmd_mock_domain_replace(int fd, unsigned int ioas_id, + __u32 device_id, __u32 hwpt_id) +{ + struct iommu_test_cmd cmd = { + .size = sizeof(cmd), + .op = IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE, + .id = ioas_id, + .mock_domain_replace = { + .device_id = device_id, + .hwpt_id = hwpt_id, + }, + }; + + return ioctl(fd, IOMMU_TEST_CMD, &cmd); +} + +#define test_cmd_mock_domain_replace(ioas_id, device_id, hwpt_id) \ + ASSERT_EQ(0, _test_cmd_mock_domain_replace(self->fd, ioas_id, device_id, \ + hwpt_id)) +#define test_err_mock_domain_replace(_errno, ioas_id, device_id, hwpt_id) \ + EXPECT_ERRNO(_errno, _test_cmd_mock_domain_replace(self->fd, ioas_id, \ + device_id, hwpt_id)) + +#define test_cmd_mock_domain_alloc_and_replace(ioas_id, device_id, \ + hwpt_id, nested) \ + ({ \ + test_ioctl_hwpt_alloc(ioas_id, device_id, hwpt_id, nested); \ + test_cmd_mock_domain_replace(ioas_id, device_id, *(hwpt_id)); \ + }) + static int _test_cmd_access_set_ioas(int fd, __u32 access_id, unsigned int ioas_id) { @@ -159,6 +189,44 @@ static int _test_ioctl_ioas_alloc(int fd, __u32 *id) ASSERT_NE(0, *(id)); \ })
+static int _test_ioctl_hwpt_alloc(int fd, __u32 pt_id, __u32 dev_id, + __u32 *out_hwpt_id, bool nested) +{ + struct iommu_hwpt_selftest data = { + .flags = nested ? IOMMU_TEST_FLAG_NESTED : 0, + .test_config = nested ? IOMMU_TEST_IOTLB_DEFAULT : 0, + }; + struct iommu_hwpt_alloc cmd = { + .size = sizeof(cmd), + .dev_id = dev_id, + .pt_id = pt_id, + .data_type = IOMMU_PGTBL_DATA_SELFTTEST, + .data_len = sizeof(data), + .data_uptr = (uint64_t)&data, + }; + int ret; + + ret = ioctl(fd, IOMMU_HWPT_ALLOC, &cmd); + if (ret) + return ret; + *out_hwpt_id = cmd.out_hwpt_id; + return 0; +} + +#define test_ioctl_hwpt_alloc(pt_id, dev_id, out_hwpt_id, nested) \ + ({ \ + ASSERT_EQ(0, _test_ioctl_hwpt_alloc(self->fd, pt_id, dev_id, \ + out_hwpt_id, nested)); \ + ASSERT_NE(0, *(out_hwpt_id)); \ + }) +#define test_err_ioctl_hwpt_alloc(_errno, pt_id, dev_id, out_hwpt_id, nested) \ + ({ \ + EXPECT_ERRNO(_errno, \ + _test_ioctl_hwpt_alloc(self->fd, pt_id, dev_id, \ + out_hwpt_id, nested)); \ + ASSERT_EQ(0, *(out_hwpt_id)); \ + }) + static int _test_ioctl_ioas_map(int fd, unsigned int ioas_id, void *buffer, size_t length, __u64 *iova, unsigned int flags) {
From: Nicolin Chen nicolinc@nvidia.com
This allows testing whether the IOTLB has been invalidated.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/iommufd_test.h | 4 +++ drivers/iommu/iommufd/selftest.c | 22 +++++++++++++++ tools/testing/selftests/iommu/iommufd_utils.h | 27 +++++++++++++++---- 3 files changed, 48 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 842b81397a2a..5998c63a89f2 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -13,6 +13,7 @@ enum { IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE, IOMMU_TEST_OP_MD_CHECK_MAP, IOMMU_TEST_OP_MD_CHECK_REFS, + IOMMU_TEST_OP_MD_CHECK_IOTLB, IOMMU_TEST_OP_CREATE_ACCESS, IOMMU_TEST_OP_ACCESS_SET_IOAS, IOMMU_TEST_OP_DESTROY_ACCESS_PAGES, @@ -68,6 +69,9 @@ struct iommu_test_cmd { __aligned_u64 uptr; __u32 refs; } check_refs; + struct { + __u32 iotlb; + } check_iotlb; struct { __u32 out_access_fd; __u32 flags; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 354a0054afad..21134000bc78 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -536,6 +536,25 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd, return 0; }
+static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd, + u32 mockpt_id, u32 iotlb) +{ + struct iommufd_hw_pagetable *hwpt; + struct mock_iommu_domain *mock; + int rc = 0; + + hwpt = get_md_pagetable(ucmd, mockpt_id, &mock); + if (IS_ERR(hwpt)) + return PTR_ERR(hwpt); + + mock = container_of(hwpt->domain, struct mock_iommu_domain, domain); + + if (iotlb != mock->iotlb) + rc = -EINVAL; + iommufd_put_object(&hwpt->obj); + return rc; +} + struct selftest_access { struct iommufd_access *access; struct file *file; @@ -950,6 +969,9 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_md_check_refs( ucmd, u64_to_user_ptr(cmd->check_refs.uptr), cmd->check_refs.length, cmd->check_refs.refs); + case IOMMU_TEST_OP_MD_CHECK_IOTLB: + return iommufd_test_md_check_iotlb(ucmd, cmd->id, + cmd->check_iotlb.iotlb); case IOMMU_TEST_OP_CREATE_ACCESS: return iommufd_test_create_access(ucmd, cmd->create_access.flags); diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index cb8a5e3beca6..ea61b81fbc52 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -189,6 +189,20 @@ static int _test_ioctl_ioas_alloc(int fd, __u32 *id) ASSERT_NE(0, *(id)); \ })
+#define test_ioctl_hwpt_check_iotlb(hwpt_id, expected) \ + ({ \ + struct iommu_test_cmd test_cmd = { \ + .size = sizeof(test_cmd), \ + .op = IOMMU_TEST_OP_MD_CHECK_IOTLB, \ + .id = hwpt_id, \ + .check_iotlb = { .iotlb = expected }, \ + }; \ + ASSERT_EQ(0, \ + ioctl(self->fd, \ + _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_IOTLB), \ + &test_cmd)); \ + }) + static int _test_ioctl_hwpt_alloc(int fd, __u32 pt_id, __u32 dev_id, __u32 *out_hwpt_id, bool nested) { @@ -213,11 +227,14 @@ static int _test_ioctl_hwpt_alloc(int fd, __u32 pt_id, __u32 dev_id, return 0; }
-#define test_ioctl_hwpt_alloc(pt_id, dev_id, out_hwpt_id, nested) \ - ({ \ - ASSERT_EQ(0, _test_ioctl_hwpt_alloc(self->fd, pt_id, dev_id, \ - out_hwpt_id, nested)); \ - ASSERT_NE(0, *(out_hwpt_id)); \ +#define test_ioctl_hwpt_alloc(pt_id, dev_id, out_hwpt_id, nested) \ + ({ \ + ASSERT_EQ(0, _test_ioctl_hwpt_alloc(self->fd, pt_id, dev_id, \ + out_hwpt_id, nested)); \ + ASSERT_NE(0, *(out_hwpt_id)); \ + if (nested) \ + test_ioctl_hwpt_check_iotlb(*(out_hwpt_id), \ + IOMMU_TEST_IOTLB_DEFAULT); \ }) #define test_err_ioctl_hwpt_alloc(_errno, pt_id, dev_id, out_hwpt_id, nested) \ ({ \
From: Nicolin Chen nicolinc@nvidia.com
This only adds limited sanity checking.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/hw_pagetable.c | 7 ++++++- drivers/iommu/iommufd/iommufd_test.h | 10 +++++++++ drivers/iommu/iommufd/selftest.c | 15 +++++++++++++ tools/testing/selftests/iommu/iommufd.c | 8 +++++++ tools/testing/selftests/iommu/iommufd_utils.h | 21 +++++++++++++++++++ 5 files changed, 60 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 1f1940969b2b..86864a3e170e 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -283,7 +283,12 @@ int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) goto out_put_hwpt; }
- klen = iommufd_hwpt_invalidate_info_size[cmd->data_type]; + if (cmd->data_type != IOMMU_PGTBL_DATA_SELFTTEST) + klen = iommufd_hwpt_invalidate_info_size[cmd->data_type]; +#ifdef CONFIG_IOMMUFD_TEST + else + klen = sizeof(struct iommu_hwpt_invalidate_selftest); +#endif if (!klen) { rc = -EINVAL; goto out_put_hwpt; diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 5998c63a89f2..4913884a8b24 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -134,4 +134,14 @@ struct iommu_hwpt_selftest { __u64 test_config; };
+/** + * struct iommu_hwpt_invalidate_selftest + * + * @flags: invalidate flags + */ +struct iommu_hwpt_invalidate_selftest { +#define IOMMU_TEST_INVALIDATE_ALL (1 << 0) + __u64 flags; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 21134000bc78..06b5e8829488 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -283,6 +283,20 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain, return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE; }
+static void mock_domain_iotlb_sync_user(struct iommu_domain *domain, + void *user_data) +{ + struct iommu_hwpt_invalidate_selftest *inv_info = user_data; + struct mock_iommu_domain *mock = + container_of(domain, struct mock_iommu_domain, domain); + + if (domain->type != IOMMU_DOMAIN_NESTED || !mock->parent) + return; + + if (inv_info->flags & IOMMU_TEST_INVALIDATE_ALL) + mock->iotlb = 0; +} + static const struct iommu_ops mock_ops = { .owner = THIS_MODULE, .pgsize_bitmap = MOCK_IO_PAGE_SIZE, @@ -301,6 +315,7 @@ static const struct iommu_ops mock_ops = {
 static struct iommu_domain_ops domain_nested_ops = {
 	.free = mock_domain_free,
+	.iotlb_sync_user = mock_domain_iotlb_sync_user,
 };
 static inline struct iommufd_hw_pagetable *
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index ddd7c898fc50..3fc7a1cd8fef 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -126,6 +126,7 @@ TEST_F(iommufd, cmd_length)
 	TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS);
 	TEST_LENGTH(iommu_device_info, IOMMU_DEVICE_GET_INFO);
 	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC);
+	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE);
 #undef TEST_LENGTH
 }
@@ -312,6 +313,13 @@ TEST_F(iommufd_ioas, ioas_nested_hwpt)
 	EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, parent_hwpt_id));
+	/* hwpt_invalidate only supports a user-managed hwpt (nested) */
+	test_err_ioctl_hwpt_invalidate(EINVAL, parent_hwpt_id);
+	test_ioctl_hwpt_invalidate(nested_hwpt_id[0]);
+	test_ioctl_hwpt_check_iotlb(nested_hwpt_id[0], 0);
+	test_ioctl_hwpt_invalidate(nested_hwpt_id[1]);
+	test_ioctl_hwpt_check_iotlb(nested_hwpt_id[1], 0);
+
 	/* Attach device to nested_hwpt_id[0] that then will be busy */
 	test_cmd_mock_domain_replace(self->ioas_id, self->dev_id,
 				     nested_hwpt_id[0]);
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index ea61b81fbc52..73ba0bacbb9a 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -244,6 +244,27 @@ static int _test_ioctl_hwpt_alloc(int fd, __u32 pt_id, __u32 dev_id,
 		ASSERT_EQ(0, *(out_hwpt_id));                                \
 	})
+static int _test_ioctl_hwpt_invalidate(int fd, __u32 hwpt_id)
+{
+	struct iommu_hwpt_invalidate_selftest data = {
+		.flags = IOMMU_TEST_INVALIDATE_ALL,
+	};
+	struct iommu_hwpt_invalidate cmd = {
+		.size = sizeof(cmd),
+		.hwpt_id = hwpt_id,
+		.data_type = IOMMU_PGTBL_DATA_SELFTEST,
+		.data_len = sizeof(data),
+		.data_uptr = (uint64_t)&data,
+	};
+
+	return ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd);
+}
+
+#define test_ioctl_hwpt_invalidate(hwpt_id)                                  \
+	ASSERT_EQ(0, _test_ioctl_hwpt_invalidate(self->fd, hwpt_id))
+#define test_err_ioctl_hwpt_invalidate(_errno, hwpt_id)                      \
+	EXPECT_ERRNO(_errno, _test_ioctl_hwpt_invalidate(self->fd, hwpt_id))
+
 static int _test_ioctl_ioas_map(int fd, unsigned int ioas_id, void *buffer,
 				size_t length, __u64 *iova, unsigned int flags)
 {
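To tie the pieces together: once klen passes the check added in hw_pagetable.c above, the expectation -- a sketch based on how this series wires things up, not the literal kernel code -- is that the core copies the user buffer into a kernel buffer and hands it to the driver through the new iotlb_sync_user op:

	/* Sketch of the dispatch at the end of iommufd_hwpt_invalidate(),
	 * assuming the user data at cmd->data_uptr was already copied
	 * into a kernel buffer "data" of klen bytes:
	 */
	hwpt->domain->ops->iotlb_sync_user(hwpt->domain, data);

For the mock domain this lands in mock_domain_iotlb_sync_user(), which clears mock->iotlb back to 0 when IOMMU_TEST_INVALIDATE_ALL is set -- exactly what the test_ioctl_hwpt_check_iotlb(hwpt_id, 0) calls in the selftest verify after each invalidation.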
-----Original Message-----
From: Yi Liu [mailto:yi.l.liu@intel.com]
Sent: 09 February 2023 04:32
Subject: [PATCH 00/17] Add Intel VT-d nested translation
Hi Yi Liu,
Thanks for sending this out. I will go through this one. As I mentioned before, we keep an internal branch based on your work and rebase a few patches on top to get ARM SMMUv3 nesting support. The most recent one is based on your "iommufd-v6.2-rc4-nesting" branch and is here:
https://github.com/hisilicon/kernel-dev/commits/iommufd-v6.2-rc4-nesting-arm
Just wondering, is there any chance the latest "Add SMMUv3 nesting support" series will be sent out soon? Please let me know if you need any help with that.
Thanks, Shameer
Hi Shameer,
On Thu, Feb 09, 2023 at 10:11:42AM +0000, Shameerali Kolothum Thodi wrote:
Thanks for sending this out. I will go through this one. As I mentioned before, we keep an internal branch based on your work and rebase a few patches on top to get ARM SMMUv3 nesting support. The most recent one is based on your "iommufd-v6.2-rc4-nesting" branch and is here:
https://github.com/hisilicon/kernel-dev/commits/iommufd-v6.2-rc4-nesting-arm
Just wondering, is there any chance the latest "Add SMMUv3 nesting support" series will be sent out soon? Please let me know if you need any help with that.
I had an initial discussion with Robin and Jason regarding the SMMUv3 nesting series, and I received quite a few inputs, so I need to finish reworking it before sending it out -- hopefully it will hit the mailing list in the coming weeks.
Thanks Nic
-----Original Message-----
From: Nicolin Chen [mailto:nicolinc@nvidia.com]
Sent: 09 February 2023 16:11
Subject: Re: [PATCH 00/17] Add Intel VT-d nested translation
I had an initial discussion with Robin and Jason regarding the SMMUv3 nesting series, and I received quite a few inputs, so I need to finish reworking it before sending it out -- hopefully it will hit the mailing list in the coming weeks.
Thanks for that update. Sure, looking forward to it.
Shameer
Hi Yi,
On Wed, Feb 08, 2023 at 08:31:36PM -0800, Yi Liu wrote:
Nicolin Chen (6): iommufd: Add/del hwpt to IOAS at alloc/destroy() iommufd/device: Move IOAS attaching and detaching operations into helpers iommufd/selftest: Add IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE test op iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC ioctl iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (6): iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation iommufd: Split iommufd_hw_pagetable_alloc() iommufd: Add kernel-managed hw_pagetable allocation for userspace iommufd: Add infrastructure for user-managed hw_pagetable allocation iommufd: Add user-managed hw_pagetable allocation
Just wanted to let you know that Jason has made some major changes to device/hw_pagetable and the selftest (mock_domain): https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
So most of the changes above need to be redone. I am now rebasing the replace and nesting series, and will probably finish by early next week.
If you are reworking this series, perhaps you can hold off for a few days and take these changes into account.
Thanks Nic