On Wed, Apr 30, 2025 at 09:59:13PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:16PM -0700, Nicolin Chen wrote:
The CMDQV HW supports a user-space use for virtualization cases. It allows the VM to issue guest-level TLBI or ATC_INV commands directly to the queue and executes them without a VMEXIT, as the HW will replace the VMID field in a TLBI command and the SID field in an ATC_INV command with the preset VMID and SID.
This is built upon the vIOMMU infrastructure by allowing the VMM to allocate a VINTF (as a vIOMMU object) and assign VCMDQs (vCMDQ objects) to the VINTF.
So, firstly, replace the standard vSMMU model with the VINTF implementation, but reuse the standard cache_invalidate op (for unsupported commands) and the standard alloc_domain_nested op (for standard nested STEs).
Each VINTF has two 64KB MMIO pages (128B per logical vCMDQ):
- Page0 (directly accessed by guest) has all the control and status bits.
- Page1 (trapped by VMM) has guest-owned queue memory location/size info.
The VMM should trap the guest VM's writes to the emulated VINTF0's page1 for the guest-level VCMDQ location/size info, and forward that to the kernel, which translates it to a physical memory location to program the VCMDQ HW during an allocation call. Then, it should mmap the assigned VINTF's page0 to the guest VM's VINTF0 page0. This allows the guest OS to read and write the guest-owned VINTF's page0 for direct control of the VCMDQ HW.
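For illustration, the layout described above amounts to something like this (the macro and helper names here are made up for the sketch, not the driver's):

	#define VINTF_PAGE_SIZE		SZ_64K	/* each VINTF has two 64KB pages */
	#define VCMDQ_REG_STRIDE	128	/* 128B of registers per logical vCMDQ */

	/* Control/status registers of logical vCMDQ @lidx, in page0 (guest-mmap'd) */
	static inline unsigned long vcmdq_page0_offset(unsigned int lidx)
	{
		return lidx * VCMDQ_REG_STRIDE;
	}

	/* Queue base/size registers of logical vCMDQ @lidx, in page1 (VMM-trapped) */
	static inline unsigned long vcmdq_page1_offset(unsigned int lidx)
	{
		return VINTF_PAGE_SIZE + lidx * VCMDQ_REG_STRIDE;
	}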
ATC invalidation commands hold an SID, so this requires all devices to register their virtual SIDs in the SID_MATCH registers and their physical SIDs in the paired SID_REPLACE registers, so that HW can use them as a lookup table to replace the virtual SIDs with the correct physical SIDs. Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid structure to allocate a SID_REPLACE slot and to program the SIDs accordingly.
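A minimal sketch of that pairing, reusing the driver's register accessors but simplifying away any valid/enable bit encoding:

	static void vintf_program_sid_slot(struct tegra241_vintf *vintf,
					   unsigned int slot, u32 vsid, u32 psid)
	{
		/* HW matches the SID field of an ATC_INV command against SID_MATCH... */
		writel(vsid, REG_VINTF(vintf, SID_MATCH(slot)));
		/* ...and substitutes the paired physical SID from SID_REPLACE */
		writel(psid, REG_VINTF(vintf, SID_REPLACE(slot)));
	}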
This enables the HW-accelerated feature on the NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in nested translation mode, trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions in invalidation time were measured by various DMA unmap tests running in a guest OS.
The write-up is super helpful for understanding how the HW works at a high level. Thanks for explaining this well! :)
I'm curious which DMA unmap tests were run for the perf numbers?
tools/testing/selftests/dma/dma_map_benchmark.c
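(Roughly: bind the device under test to the dma_map_benchmark driver via driver_override, then run something like "./dma_map_benchmark -t 8 -s 20" -- the threads/seconds flag names here are from memory, so check the selftest's usage().)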
 /**
  * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
  *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- * @flags: Must be set to 0
- * @impl: Must be 0
+ * @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
+ * @impl: Implementation-defined bits when the following flags are set:
+ *        - IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
+ *          Bits[15:12] - Log2 of the total number of SID replacements
+ *          Bits[07:04] - Log2 of the total number of vCMDQs per vIOMMU
+ *          Bits[03:00] - Version number for the CMDQ-V HW
Nit: It seems that we deliberately chose not to reveal `NUM_VINTF_LOG2` to user space. If so, maybe we should mark those bitfields as unused or reserved for clarity? i.e. Bits[11:08] - Reserved / Unused (and even 31:16).
I think it should have been there, but the kernel should just report 0: Bits[11:08] - Log2 of the total number of virtual interfaces
  * @idr: Implemented features for ARM SMMU Non-secure programming interface
  * @iidr: Information about the implementation and implementer of ARM SMMU,
  *        and architecture version supported
@@ -952,10 +965,28 @@ struct iommu_fault_alloc {
 /**
  * enum iommu_viommu_type - Virtual IOMMU Type
  * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
  * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
  */
 enum iommu_viommu_type {
 	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
 	IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+	IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
 };
This is a little confusing... I understand that we need a new viommu type to copy the new struct iommu_viommu_tegra241_cmdqv between user space and the kernel.
But in a previous patch ("Add vsmmu_alloc impl op"), we added a check that falls back to the standard SMMUv3 type if impl_ops->vsmmu_alloc returns -EOPNOTSUPP:
	if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc)
		vsmmu = master->smmu->impl_ops->vsmmu_alloc(
				master->smmu, s2_parent, ictx, viommu_type,
				user_data);
	if (PTR_ERR(vsmmu) == -EOPNOTSUPP) {
		if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
			return ERR_PTR(-EOPNOTSUPP);
		/* Fallback to standard SMMUv3 type if viommu_type matches */
		vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core,
					     &arm_vsmmu_ops);
Now, if we ALWAYS try the impl-specified vsmmu_alloc first, even when viommu_type == IOMMU_VIOMMU_TYPE_ARM_SMMUV3, we will just come back from impl_ops->vsmmu_alloc with -EOPNOTSUPP.
That's not necessarily true. An impl_ops->vsmmu_alloc can support IOMMU_VIOMMU_TYPE_ARM_SMMUV3 potentially, e.g. an impl could just toggle a few special bits in a register and return a valid vsmmu pointer.
It doesn't work like this with VCMDQ as it supports its own type, but for the long run I think we should pass in the standard type to impl_ops->vsmmu_alloc too.
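E.g. a hypothetical impl could do (sketch only; impl_vsmmu_alloc is a made-up name):

	static struct arm_vsmmu *
	impl_vsmmu_alloc(struct arm_smmu_device *smmu,
			 struct arm_smmu_domain *s2_parent,
			 struct iommufd_ctx *ictx, unsigned int viommu_type,
			 const struct iommu_user_data *user_data)
	{
		struct arm_vsmmu *vsmmu;

		if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
			return ERR_PTR(-EOPNOTSUPP);

		vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core,
					     &arm_vsmmu_ops);
		if (IS_ERR(vsmmu))
			return vsmmu;

		/* toggle whatever impl-specific bits enable the acceleration */
		return vsmmu;
	}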
Then we'll check whether the retval was -EOPNOTSUPP and re-check the requested viommu_type again... which seems a little counterintuitive.
It's just prioritizing the impl_ops->vsmmu_alloc. Similar to the probe, if VCMDQ is missing or encountering some initialization problem, give it a chance to fallback to the standard SMMU.
+	/*
+	 * @length must be a power of 2, in range of
+	 * [ 32, 1 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
+	 */
Nit: 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) to match the comment in uapi
Alok pointed it out too. Fixed.
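FWIW, the check that the comment describes would be something like this (with "cmdqs" standing in for idr[1].CMDQS):

	if (!is_power_of_2(length) || length < 32 ||
	    length > (1UL << (cmdqs + CMDQ_ENT_SZ_SHIFT)))
		return -EINVAL;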
+	vcmdq = iommufd_vcmdq_alloc(viommu, struct tegra241_vcmdq, core);
+	if (!vcmdq)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * HW requires to unmap LVCMDQs in descending order, so destroy() must
+	 * follow this rule. Set a dependency on its previous LVCMDQ so iommufd
+	 * core will help enforce it.
+	 */
+	if (prev) {
+		ret = iommufd_vcmdq_depend(vcmdq, prev, core);
+		if (ret)
+			goto free_vcmdq;
+	}
+	vcmdq->prev = prev;
+
+	ret = tegra241_vintf_init_lvcmdq(vintf, index, vcmdq);
+	if (ret)
+		goto free_vcmdq;
+
+	dev_dbg(cmdqv->dev, "%sallocated\n",
+		lvcmdq_error_header(vcmdq, header, 64));
+
+	tegra241_vcmdq_map_lvcmdq(vcmdq);
+
+	vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
+	vcmdq->cmdq.q.q_base |= log2size;
+
+	ret = tegra241_vcmdq_hw_init_user(vcmdq);
+	if (ret)
+		goto free_vcmdq;
+	vintf->lvcmdqs[index] = vcmdq;
+
+	return &vcmdq->core;
+
+free_vcmdq:
+	iommufd_struct_destroy(viommu->ictx, vcmdq, core);
+	return ERR_PTR(ret);
Are we missing an undepend here?
Right. iommufd_struct_destroy() doesn't invoke obj->ops.abort().
The whole revert flow is wonky; it's missing all the unmap/deinit steps.
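Something like the usual unwind-label tail, reverting in reverse order of setup (sketch only; this assumes unmap/deinit/undepend counterparts of the calls above, with each earlier goto retargeted to the right label):

unmap_lvcmdq:
	tegra241_vcmdq_unmap_lvcmdq(vcmdq);
	tegra241_vintf_deinit_lvcmdq(vintf, index);
undepend_vcmdq:
	if (prev)
		iommufd_vcmdq_undepend(vcmdq, prev, core);
free_vcmdq:
	iommufd_struct_destroy(viommu->ictx, vcmdq, core);
	return ERR_PTR(ret);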
+static void tegra241_vintf_destroy_vdevice(struct iommufd_vdevice *vdev)
+{
+	struct tegra241_vintf_sid *vsid =
+		container_of(vdev, struct tegra241_vintf_sid, core);
+	struct tegra241_vintf *vintf = vsid->vintf;
+
+	writel_relaxed(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
+	writel_relaxed(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));
Just a thought: should these be writel() to avoid races? Although I believe all user queues would have been freed by this point?
Yea. They should be. I will change them.
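i.e.:

	writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
	writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));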
Thanks
Nicolin