On Wed, Apr 30, 2025 at 09:59:13PM +0000, Pranjal Shrivastava wrote:
On Fri, Apr 25, 2025 at 10:58:16PM -0700, Nicolin Chen wrote:
The CMDQV HW supports a user-space use for virtualization cases. It allows the VM to issue guest-level TLBI or ATC_INV commands directly to the queue and executes them without a VMEXIT, as the HW will replace the VMID field in a TLBI command and the SID field in an ATC_INV command with the preset VMID and SID.
This is built upon the vIOMMU infrastructure by allowing the VMM to allocate a VINTF (as a vIOMMU object) and assign VCMDQs (vCMDQ objects) to the VINTF.
So, firstly, replace the standard vSMMU model with the VINTF implementation, but reuse the standard cache_invalidate op (for unsupported commands) and the standard alloc_domain_nested op (for standard nested STEs).
Each VINTF has two 64KB MMIO pages (128B per logical vCMDQ):
- Page0 (directly accessed by guest) has all the control and status bits.
- Page1 (trapped by VMM) has guest-owned queue memory location/size info.
The VMM should trap the guest VM's writes to the emulated VINTF0's page1 for the guest-level VCMDQ location/size info, and forward that to the kernel, which translates it to a physical memory location to program the VCMDQ HW during an allocation call. Then, it should mmap the assigned VINTF's page0 to the guest VM's VINTF0 page0. This allows the guest OS to read and write the guest-owned VINTF's page0 for direct control of the VCMDQ HW.
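For illustration, the layout described above amounts to something like this (the macro and helper names here are made up for the sketch, not the driver's):

	#define VINTF_PAGE_SIZE		SZ_64K	/* each VINTF has two 64KB pages */
	#define VCMDQ_REG_STRIDE	128	/* 128B of registers per logical vCMDQ */

	/* Control/status registers of logical vCMDQ @lidx, in page0 (guest-mmap'd) */
	static inline unsigned long vcmdq_page0_offset(unsigned int lidx)
	{
		return lidx * VCMDQ_REG_STRIDE;
	}

	/* Queue base/size registers of logical vCMDQ @lidx, in page1 (VMM-trapped) */
	static inline unsigned long vcmdq_page1_offset(unsigned int lidx)
	{
		return VINTF_PAGE_SIZE + lidx * VCMDQ_REG_STRIDE;
	}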
ATC invalidation commands hold an SID, so this requires all devices to register their virtual SIDs in the SID_MATCH registers and their physical SIDs in the paired SID_REPLACE registers, so that HW can use them as a lookup table to replace the virtual SIDs with the correct physical SIDs. Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid structure to allocate a SID_REPLACE slot and to program the SIDs accordingly.
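A minimal sketch of that pairing, reusing the driver's register accessors but simplifying away any valid/enable bit encoding:

	static void vintf_program_sid_slot(struct tegra241_vintf *vintf,
					   unsigned int slot, u32 vsid, u32 psid)
	{
		/* HW matches the SID field of an ATC_INV command against SID_MATCH... */
		writel(vsid, REG_VINTF(vintf, SID_MATCH(slot)));
		/* ...and substitutes the paired physical SID from SID_REPLACE */
		writel(psid, REG_VINTF(vintf, SID_REPLACE(slot)));
	}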
This enables the HW-accelerated feature on the NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in nested translation mode, trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions in invalidation time were measured by various DMA unmap tests running in a guest OS.
The write-up is super helpful for understanding how the HW works at a high level. Thanks for explaining this well! :)
I'm curious which DMA unmap tests were run for the perf numbers?
tools/testing/selftests/dma/dma_map_benchmark.c
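(Roughly: bind the device under test to the dma_map_benchmark driver via driver_override, then run something like "./dma_map_benchmark -t 8 -s 20" -- the threads/seconds flag names here are from memory, so check the selftest's usage().)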
 /**
  * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
  *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- * @flags: Must be set to 0
- * @impl: Must be 0
+ * @flags: Combination of enum iommu_hw_info_arm_smmuv3_flags
+ * @impl: Implementation-defined bits when the following flags are set:
+ *        - IOMMU_HW_INFO_ARM_SMMUV3_HAS_TEGRA241_CMDQV
+ *          Bits[15:12] - Log2 of the total number of SID replacements
+ *          Bits[07:04] - Log2 of the total number of vCMDQs per vIOMMU
+ *          Bits[03:00] - Version number for the CMDQ-V HW
Nit: It seems that we deliberately chose not to reveal `NUM_VINTF_LOG2` to user space. If so, maybe we should mark those bitfields as unused or reserved for clarity? i.e. Bits[11:08] - Reserved / Unused (and even 31:16).
I think it should have been there, but the kernel should just report 0: Bits[11:08] - Log2 of the total number of virtual interfaces
  * @idr: Implemented features for ARM SMMU Non-secure programming interface
  * @iidr: Information about the implementation and implementer of ARM SMMU,
  *        and architecture version supported
@@ -952,10 +965,28 @@ struct iommu_fault_alloc {
 /**
  * enum iommu_viommu_type - Virtual IOMMU Type
  * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
  * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
  */
 enum iommu_viommu_type {
 	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
 	IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+	IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
 };
This is a little confusing... I understand that we need a new viommu type to copy the new struct iommu_viommu_tegra241_cmdqv between user space and the kernel.
But in a previous patch ("Add vsmmu_alloc impl op"), we added a check that falls back to the standard SMMUv3 type if impl_ops->vsmmu_alloc returns -EOPNOTSUPP:
	if (master->smmu->impl_ops && master->smmu->impl_ops->vsmmu_alloc)
		vsmmu = master->smmu->impl_ops->vsmmu_alloc(
				master->smmu, s2_parent, ictx, viommu_type,
				user_data);
	if (PTR_ERR(vsmmu) == -EOPNOTSUPP) {
		if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
			return ERR_PTR(-EOPNOTSUPP);
		/* Fallback to standard SMMUv3 type if viommu_type matches */
		vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core,
					     &arm_vsmmu_ops);
Now, if we ALWAYS try the impl-specified vsmmu_alloc first, even when viommu_type == IOMMU_VIOMMU_TYPE_ARM_SMMUV3, we will just come back from impl_ops->vsmmu_alloc with -EOPNOTSUPP.
That's not necessarily true. An impl_ops->vsmmu_alloc can support IOMMU_VIOMMU_TYPE_ARM_SMMUV3 potentially, e.g. an impl could just toggle a few special bits in a register and return a valid vsmmu pointer.
It doesn't work like this with VCMDQ as it supports its own type, but for the long run I think we should pass in the standard type to impl_ops->vsmmu_alloc too.
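E.g. a hypothetical impl could do (sketch only; impl_vsmmu_alloc is a made-up name):

	static struct arm_vsmmu *
	impl_vsmmu_alloc(struct arm_smmu_device *smmu,
			 struct arm_smmu_domain *s2_parent,
			 struct iommufd_ctx *ictx, unsigned int viommu_type,
			 const struct iommu_user_data *user_data)
	{
		struct arm_vsmmu *vsmmu;

		if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
			return ERR_PTR(-EOPNOTSUPP);

		vsmmu = iommufd_viommu_alloc(ictx, struct arm_vsmmu, core,
					     &arm_vsmmu_ops);
		if (IS_ERR(vsmmu))
			return vsmmu;

		/* toggle whatever impl-specific bits enable the acceleration */
		return vsmmu;
	}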
Then we'll check whether the retval was -EOPNOTSUPP and re-check the requested viommu_type again... which seems a little counterintuitive.
It's just prioritizing the impl_ops->vsmmu_alloc. Similar to the probe, if VCMDQ is missing or encountering some initialization problem, give it a chance to fallback to the standard SMMU.
+	/*
+	 * @length must be a power of 2, in range of
+	 * [ 32, 1 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
+	 */
Nit: 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) to match the comment in uapi
Alok pointed it out too. Fixed.
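FWIW, the check that the comment describes would be something like this (with "cmdqs" standing in for idr[1].CMDQS):

	if (!is_power_of_2(length) || length < 32 ||
	    length > (1UL << (cmdqs + CMDQ_ENT_SZ_SHIFT)))
		return -EINVAL;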
+	vcmdq = iommufd_vcmdq_alloc(viommu, struct tegra241_vcmdq, core);
+	if (!vcmdq)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * HW requires to unmap LVCMDQs in descending order, so destroy() must
+	 * follow this rule. Set a dependency on its previous LVCMDQ so iommufd
+	 * core will help enforce it.
+	 */
+	if (prev) {
+		ret = iommufd_vcmdq_depend(vcmdq, prev, core);
+		if (ret)
+			goto free_vcmdq;
+	}
+	vcmdq->prev = prev;
+
+	ret = tegra241_vintf_init_lvcmdq(vintf, index, vcmdq);
+	if (ret)
+		goto free_vcmdq;
+
+	dev_dbg(cmdqv->dev, "%sallocated\n",
+		lvcmdq_error_header(vcmdq, header, 64));
+
+	tegra241_vcmdq_map_lvcmdq(vcmdq);
+
+	vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
+	vcmdq->cmdq.q.q_base |= log2size;
+
+	ret = tegra241_vcmdq_hw_init_user(vcmdq);
+	if (ret)
+		goto free_vcmdq;
+	vintf->lvcmdqs[index] = vcmdq;
+
+	return &vcmdq->core;
+
+free_vcmdq:
+	iommufd_struct_destroy(viommu->ictx, vcmdq, core);
+	return ERR_PTR(ret);
Are we missing an undepend here?
Right. iommufd_struct_destroy() doesn't invoke obj->ops.abort().
The whole revert flow is wonky; it's missing all the unmap/deinit steps.
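Something like the usual unwind-label tail, reverting in reverse order of setup (sketch only; this assumes unmap/deinit/undepend counterparts of the calls above, with each earlier goto retargeted to the right label):

unmap_lvcmdq:
	tegra241_vcmdq_unmap_lvcmdq(vcmdq);
	tegra241_vintf_deinit_lvcmdq(vintf, index);
undepend_vcmdq:
	if (prev)
		iommufd_vcmdq_undepend(vcmdq, prev, core);
free_vcmdq:
	iommufd_struct_destroy(viommu->ictx, vcmdq, core);
	return ERR_PTR(ret);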
+static void tegra241_vintf_destroy_vdevice(struct iommufd_vdevice *vdev)
+{
+	struct tegra241_vintf_sid *vsid =
+		container_of(vdev, struct tegra241_vintf_sid, core);
+	struct tegra241_vintf *vintf = vsid->vintf;
+
+	writel_relaxed(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
+	writel_relaxed(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));
Just a thought: should these be writel() to avoid races? Although I believe all user queues would have been freed by this point?
Yea. They should be. I will change them.
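i.e.:

	writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
	writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));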
Thanks
Nicolin