Hi Nicolin,
-----Original Message-----
From: Nicolin Chen nicolinc@nvidia.com
Sent: Saturday, January 11, 2025 3:32 AM
To: will@kernel.org; robin.murphy@arm.com; jgg@nvidia.com; kevin.tian@intel.com; tglx@linutronix.de; maz@kernel.org; alex.williamson@redhat.com
Cc: joro@8bytes.org; shuah@kernel.org; reinette.chatre@intel.com; eric.auger@redhat.com; yebin (H) yebin10@huawei.com; apatel@ventanamicro.com; shivamurthy.shastri@linutronix.de; bhelgaas@google.com; anna-maria@linutronix.de; yury.norov@gmail.com; nipun.gupta@amd.com; iommu@lists.linux.dev; linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org; kvm@vger.kernel.org; linux-kselftest@vger.kernel.org; patches@lists.linux.dev; jean-philippe@linaro.org; mdf@kernel.org; mshavit@google.com; Shameerali Kolothum Thodi shameerali.kolothum.thodi@huawei.com; smostafa@google.com; ddutile@redhat.com
Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When the IOMMU is disabled, the MSI address is programmed to the physical location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI address is programmed to an allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage translation is enabled, an IOVA is still used to program the MSI address, though the mapping is done in two stages: IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in charge of the IOMMU translation (1-stage translation) and the IOVA for the ITS page is fixed and known by the kernel. However, with a virtual machine enabling nested IOMMU translation (2-stage), a guest kernel directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) into its own IOVA space (e.g. 0xEEEE0000). The host kernel then cannot know that guest-level IOVA to program the MSI address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few RMRs (Reserved Memory Regions) into the guest's IORT. The guest kernel would then fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT region per iommu group for a direct mapping. Eventually, the mappings would look like:
   IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
   This requires an IOMMUFD ioctl for the kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by the VMM to the host-level GIC driver, to program the correct MSI IOVA. Also forward the VMM-defined vITS page location (IPA) to the kernel for the stage-2 mapping. Eventually:
   IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl (for the IPA).
Worth mentioning that when Eric Auger was working on the same topic with the VFIO iommu uAPI, he implemented approach (2) first, and then switched to approach (1), as suggested by Jean-Philippe, to reduce complexity.
Approach (1) basically feels like the existing VFIO passthrough that has a 1-stage mapping for the unmanaged domain, only shifting the MSI mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu case). So it could reuse the existing IOMMU_RESV_SW_MSI piece, sharing the same idea of "VMM leaving everything to the kernel".
Approach (2) is an ideal solution, yet it requires additional effort for the kernel to be aware of the stage-1 gIOVA(s) and stage-2 IPAs for the vITS page(s), which demands that the VMM cooperate closely.
- It also brings some complicated use cases to the table, where the host and/or guest system(s) have multiple ITS pages.
I have done some basic sanity tests with this series and the Qemu branches you provided on HiSilicon hardware. Basic device assignment works fine. I will rebase my Qemu smmuv3-accel branch on top of this and do some more tests.
One confusion I have about the above text: do we still plan to support approach (1) (using RMR in Qemu), or are you just mentioning it here because it is still possible to make use of it? I think from previous discussions the argument was to adopt a more dedicated MSI pass-through model, which I believe is approach (2) here. Could you please confirm?
Thanks, Shameer