Changelog
v1->v2:
 * Rebased on top of the vfio_device cdev v2 series.
 * Updated the kdoc and commit message of iommu_group_replace_domain().
 * Dropped the revert-to-core-domain part in iommu_group_replace_domain().
 * Dropped the !ops->dma_unmap check in vfio_iommufd_emulated_attach_ioas().
 * Added the missing rc value in vfio_iommufd_emulated_attach_ioas() from the
   iommufd_access_set_ioas() call.
 * Added a new patch in vfio_main to deny vfio_pin/unpin_pages() calls if
   vdev->ops->dma_unmap is not implemented.
 * Added a __iommufd_device_detach helper and let the replace routine do a
   partial detach().
 * Added a restriction on auto_domains to use the replace feature.
 * Added the patch "iommufd/device: Make hwpt_list list_add/del symmetric"
   from the has_group removal series.
Hi all,
The existing IOMMU APIs provide a pair of functions: iommu_attach_group() for callers to attach a device from the default_domain (NULL if not supported) to a given iommu domain, and iommu_detach_group() for callers to detach a device from a given domain back to the default_domain. Internally, the detach_dev op is deprecated for newer drivers with a default_domain. This means that those drivers can likely switch an attached domain to another one without staging the device at a blocking or default domain, for use cases such as:

1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
2) Nesting mode, when switching the attached device from an S2 domain to an S1 domain, or when switching between relevant S1 domains
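In caller terms, the difference looks like this (an illustrative sketch only; iommu_attach_group() and iommu_detach_group() are the existing core APIs):

	/*
	 * Existing flow: the device transits through the blocking/default
	 * domain between the two calls.
	 */
	iommu_detach_group(old_domain, group);
	rc = iommu_attach_group(new_domain, group);

	/* New flow: one call, no intermediate domain transition */
	rc = iommu_group_replace_domain(group, new_domain);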
This series introduces a new iommu_group_replace_domain() API for that, and adds corresponding support throughout the uAPI, so user space can issue such a REPLACE by reusing the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl. This means that user space needs to be aware whether the device is attached or not: an unattached device calling VFIO_DEVICE_ATTACH_IOMMUFD_PT means a regular ATTACH; an attached device calling VFIO_DEVICE_ATTACH_IOMMUFD_PT, on the other hand, means a REPLACE.
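From user space, that looks roughly like the following (a sketch only, assuming the cdev uAPI from the prerequisite series; the helper name is made up and error handling is omitted):

	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Works for both the initial ATTACH and any later REPLACE */
	static int attach_or_replace(int device_fd, __u32 pt_id)
	{
		struct vfio_device_attach_iommufd_pt attach = {
			.argsz = sizeof(attach),
			.pt_id = pt_id,
		};

		/*
		 * On an unattached device this is a regular ATTACH; calling
		 * it again with a different pt_id acts as a REPLACE, with no
		 * blocking-domain transition in between.
		 */
		return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
	}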
QEMU with this feature should have the vIOMMU maintain a cache of the guest io page table addresses and assign a unique IOAS to each unique guest page table.
As the guest writes the page table address to the HW registers, qemu should then use the 'replace domain' operation on VFIO to assign the VFIO device to the correct de-duplicated page table.
The algorithm where QEMU uses one VFIO container per-device and removes all the mappings to change the assignment should ideally not be used with iommufd.
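As a concrete illustration of that de-duplication (the viommu_* helpers below are hypothetical, not an existing QEMU API; they only sketch the one-IOAS-per-guest-page-table bookkeeping):

	/* Hypothetical lookup: one IOAS per unique guest io page table */
	static __u32 viommu_ioas_for_gpt(struct viommu *v, __u64 gpt_addr)
	{
		__u32 ioas_id;

		if (viommu_cache_lookup(v, gpt_addr, &ioas_id))
			return ioas_id;		/* de-duplicated hit */

		ioas_id = viommu_ioas_alloc(v);	/* hypothetical allocator */
		viommu_cache_insert(v, gpt_addr, ioas_id);
		return ioas_id;
	}

	/* Guest wrote a page table address: replace, don't unmap+remap */
	static int viommu_set_gpt(struct viommu *v, int device_fd,
				  __u64 gpt_addr)
	{
		return attach_or_replace(device_fd,
					 viommu_ioas_for_gpt(v, gpt_addr));
	}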
To apply this series, please rebase on top of the following patches:
1) [PATCH v2 00/14] Add vfio_device cdev for iommufd support
   https://lore.kernel.org/kvm/20230206090532.95598-1-yi.l.liu@intel.com/
Or you can also find this series on Github: https://github.com/nicolinc/iommufd/commits/iommu_group_replace_domain-v2
Thank you Nicolin Chen
Nicolin Chen (9):
  iommu: Introduce a new iommu_group_replace_domain() API
  iommufd: Create access in vfio_iommufd_emulated_bind()
  iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
  iommufd: Add replace support in iommufd_access_set_ioas()
  iommufd/selftest: Add coverage for access->ioas replacement
  iommufd/device: Make hwpt_list list_add/del symmetric
  iommufd/device: Use iommu_group_replace_domain()
  vfio: Support IO page table replacement
  vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()
Yi Liu (1):
  iommu: Move dev_iommu_ops() to private header
 drivers/iommu/iommu-priv.h                    |  22 ++
 drivers/iommu/iommu.c                         |  30 +++
 drivers/iommu/iommufd/device.c                | 221 +++++++++++++-----
 drivers/iommu/iommufd/iommufd_private.h       |   4 +
 drivers/iommu/iommufd/iommufd_test.h          |   4 +
 drivers/iommu/iommufd/selftest.c              |  25 +-
 drivers/vfio/iommufd.c                        |  30 ++-
 drivers/vfio/vfio_main.c                      |   4 +
 include/linux/iommu.h                         |  11 -
 include/linux/iommufd.h                       |   3 +-
 include/uapi/linux/vfio.h                     |   6 +
 tools/testing/selftests/iommu/iommufd.c       |  29 ++-
 tools/testing/selftests/iommu/iommufd_utils.h |  22 +-
 13 files changed, 321 insertions(+), 90 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h
From: Yi Liu yi.l.liu@intel.com
dev_iommu_ops() is essentially only used in the iommu subsystem, so move it to a private header to avoid it being abused by other drivers.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu-priv.h | 18 ++++++++++++++++++
 drivers/iommu/iommu.c      |  2 ++
 include/linux/iommu.h      | 11 -----------
 3 files changed, 20 insertions(+), 11 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h
diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
new file mode 100644
index 000000000000..9e1497027cff
--- /dev/null
+++ b/drivers/iommu/iommu-priv.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __LINUX_IOMMU_PRIV_H
+#define __LINUX_IOMMU_PRIV_H
+
+#include <linux/iommu.h>
+
+static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
+{
+	/*
+	 * Assume that valid ops must be installed if iommu_probe_device()
+	 * has succeeded. The device ops are essentially for internal use
+	 * within the IOMMU subsystem itself, so we should be able to trust
+	 * ourselves not to misuse the helper.
+	 */
+	return dev->iommu->iommu_dev->ops;
+}
+#endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4b5a21ac5e88..a18b7f1a4e6e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -35,6 +35,8 @@

 #include "iommu-sva.h"

+#include "iommu-priv.h"
+
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a8063f26ff69..cb586d054c57 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -445,17 +445,6 @@ static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
 	};
 }

-static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
-{
-	/*
-	 * Assume that valid ops must be installed if iommu_probe_device()
-	 * has succeeded. The device ops are essentially for internal use
-	 * within the IOMMU subsystem itself, so we should be able to trust
-	 * ourselves not to misuse the helper.
-	 */
-	return dev->iommu->iommu_dev->ops;
-}
-
 extern int bus_iommu_probe(struct bus_type *bus);
 extern bool iommu_present(struct bus_type *bus);
 extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
From: Yi Liu yi.l.liu@intel.com
dev_iommu_ops() is essentially only used in the iommu subsystem, so move it to a private header to avoid it being abused by other drivers.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian kevin.tian@intel.com
qemu has a need to replace the translations associated with a domain when the guest does large-scale operations like switching between an IDENTITY domain and, say, dma-iommu.c.
Currently, it does this by replacing all the mappings in a single domain, but this is very inefficient and means that domains have to be per-device rather than per-translation.
Provide a high-level API to allow replacements of one domain with another. This is similar to a detach/attach cycle except it doesn't force the group to go to the blocking domain in-between.
By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows.
It could be possible to address this by simply removing the protection from iommu_attach_group(), but it is not so clear if that is safe for the few users. Thus, add a new API to serve this new purpose.
Atomic replacement allows the qemu emulation of the viommu to be more complete, as real hardware has this ability.
All drivers are already required to support changing between active UNMANAGED domains when using their attach_dev ops.
This API is expected to be used by IOMMUFD, so add to the iommu-priv header and mark it as IOMMUFD_INTERNAL.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu-priv.h |  4 ++++
 drivers/iommu/iommu.c      | 28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 9e1497027cff..b546795a7e49 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -15,4 +15,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	 */
 	return dev->iommu->iommu_dev->ops;
 }
+
+extern int iommu_group_replace_domain(struct iommu_group *group,
+				      struct iommu_domain *new_domain);
+
 #endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a18b7f1a4e6e..15e07d39cd8d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2151,6 +2151,34 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_group);

+/**
+ * iommu_group_replace_domain - replace the domain that a group is attached to
+ * @group: IOMMU group that will be attached to the new domain
+ * @new_domain: new IOMMU domain to replace with
+ *
+ * This API allows the group to switch domains without being forced to go to
+ * the blocking domain in-between.
+ *
+ * If the currently attached domain is a core domain (e.g. a default_domain),
+ * it will act just like the iommu_attach_group().
+ */
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
+
 static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
Is there value in allowing a NULL new domain, so this plays like iommu_detach_group() and iommufd only needs to call one function in both the attach and detach paths?
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
+
 static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);
--
2.39.1
On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
Is there value in allowing a NULL new domain, so this plays like iommu_detach_group() and iommufd only needs to call one function in both the attach and detach paths?
We've used NULL to mean the 'platform domain' in the iommu core code in a few places; I'd prefer to avoid overloading NULL.
IMHO it doesn't help iommufd to substitute detach with replace.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, February 9, 2023 9:23 PM
On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
Is there value in allowing a NULL new domain, so this plays like iommu_detach_group() and iommufd only needs to call one function in both the attach and detach paths?
We've used NULL to mean the 'platform domain' in the iommu core code in a few places; I'd prefer to avoid overloading NULL.
IMHO it doesn't help iommufd to substitute detach with replace.
OK
On Tue, 7 Feb 2023 13:17:54 -0800 Nicolin Chen nicolinc@nvidia.com wrote:
qemu has a need to replace the translations associated with a domain when the guest does large-scale operations like switching between an IDENTITY domain and, say, dma-iommu.c.
Currently, it does this by replacing all the mappings in a single domain, but this is very inefficient and means that domains have to be per-device rather than per-translation.
Provide a high-level API to allow replacements of one domain with another. This is similar to a detach/attach cycle except it doesn't force the group to go to the blocking domain in-between.
By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows.
It could be possible to address this by simply removing the protection from the iommu_attach_group(), but it is not so clear if that is safe for the few users. Thus, add a new API to serve this new purpose.
Atomic replacement allows the qemu emulation of the viommu to be more complete, as real hardware has this ability.
I was under the impression that we could not atomically switch a device's domain relative to in-flight DMA. IIRC, the discussion was relative to VT-d, and I vaguely recall something about the domain needing to be invalidated before it could be replaced. Am I mis-remembering or has this since been solved? Adding Ashok, who might have been involved in one of those conversations.
Or maybe atomic is the wrong word here since we expect no in-flight DMA during the sort of mode transitions referred to here, and we're really just trying to convey that we can do this via a single operation with reduced latency? Thanks,
Alex
All drivers are already required to support changing between active UNMANAGED domains when using their attach_dev ops.
This API is expected to be used by IOMMUFD, so add to the iommu-priv header and mark it as IOMMUFD_INTERNAL.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
 drivers/iommu/iommu-priv.h |  4 ++++
 drivers/iommu/iommu.c      | 28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 9e1497027cff..b546795a7e49 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -15,4 +15,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	 */
 	return dev->iommu->iommu_dev->ops;
 }
+
+extern int iommu_group_replace_domain(struct iommu_group *group,
+				      struct iommu_domain *new_domain);
+
 #endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a18b7f1a4e6e..15e07d39cd8d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2151,6 +2151,34 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_group);
+/**
+ * iommu_group_replace_domain - replace the domain that a group is attached to
+ * @group: IOMMU group that will be attached to the new domain
+ * @new_domain: new IOMMU domain to replace with
+ *
+ * This API allows the group to switch domains without being forced to go to
+ * the blocking domain in-between.
+ *
+ * If the currently attached domain is a core domain (e.g. a default_domain),
+ * it will act just like the iommu_attach_group().
+ */
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
+
 static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);
On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
On Tue, 7 Feb 2023 13:17:54 -0800 Nicolin Chen nicolinc@nvidia.com wrote:
qemu has a need to replace the translations associated with a domain when the guest does large-scale operations like switching between an IDENTITY domain and, say, dma-iommu.c.
Currently, it does this by replacing all the mappings in a single domain, but this is very inefficient and means that domains have to be per-device rather than per-translation.
Provide a high-level API to allow replacements of one domain with another. This is similar to a detach/attach cycle except it doesn't force the group to go to the blocking domain in-between.
By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows.
It could be possible to address this by simply removing the protection from the iommu_attach_group(), but it is not so clear if that is safe for the few users. Thus, add a new API to serve this new purpose.
Atomic replacement allows the qemu emulation of the viommu to be more complete, as real hardware has this ability.
I was under the impression that we could not atomically switch a device's domain relative to in-flight DMA.
Certainly all the HW can be properly atomic, but not necessarily easily - the usual issue is the SW complication of managing the software-controlled cache tags in a way that doesn't corrupt the cache.
This is because the cache tag and the io page table top are in different 64 bit words so atomic writes don't cover both, and thus the IOMMU HW could tear the two stores and mismatch the cache tag to the table top. This would corrupt the cache.
The easiest way to avoid this is for SW to use the same DID for the new and old tables. This is possible if this translation entry is the only user of the DID. A more complex way would use a temporary DID that can be safely corrupted. But realistically I'd expect VT-d drivers to simply make the PASID invalid for the duration of the update.
However something like AMD has a single cache tag for every entry in the PASID table so you could do an atomic replace trivially. Just update the PASID and invalidate the caches.
ARM has a flexible PASID table and atomic replace would be part of resizing it. eg you can atomically update the top of the PASID table with a single 64 bit store.
So replace lets the drivers implement those special behaviors if it makes sense for them.
Or maybe atomic is the wrong word here since we expect no in-flight DMA during the sort of mode transitions referred to here, and we're really just trying to convey that we can do this via a single operation with reduced latency? Thanks,
atomic means DMA will either translate with the old domain or the new domain but never a blocking domain. Keep in mind that with nesting "domain" can mean a full PASID table in guest memory.
I should reiterate that replace is not an API that is required to be atomic.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, February 11, 2023 8:45 AM
On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
On Tue, 7 Feb 2023 13:17:54 -0800 Nicolin Chen nicolinc@nvidia.com wrote:
qemu has a need to replace the translations associated with a domain when the guest does large-scale operations like switching between an IDENTITY domain and, say, dma-iommu.c.
Currently, it does this by replacing all the mappings in a single domain, but this is very inefficient and means that domains have to be per-device rather than per-translation.
Provide a high-level API to allow replacements of one domain with another. This is similar to a detach/attach cycle except it doesn't force the group to go to the blocking domain in-between.
By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows.
It could be possible to address this by simply removing the protection from the iommu_attach_group(), but it is not so clear if that is safe for the few users. Thus, add a new API to serve this new purpose.
Atomic replacement allows the qemu emulation of the viommu to be more complete, as real hardware has this ability.
I was under the impression that we could not atomically switch a device's domain relative to in-flight DMA.
it's possible as long as the mappings for in-flight DMA don't change in the transition.
Certainly all the HW can be properly atomic, but not necessarily easily - the usual issue is the SW complication of managing the software-controlled cache tags in a way that doesn't corrupt the cache.
This is because the cache tag and the io page table top are in different 64 bit words so atomic writes don't cover both, and thus the IOMMU HW could tear the two stores and mismatch the cache tag to the table top. This would corrupt the cache.
The VT-d spec recommends using a 128-bit cmpxchg instruction to update the page table pointer and DID together.
The easiest way to avoid this is for SW to use the same DID for the new and old tables. This is possible if this translation entry is the only user of the DID. A more complex way would use a temporary DID that can be safely corrupted. But realistically I'd expect VT-d drivers to simply make the PASID invalid for the duration of the update.
However something like AMD has a single cache tag for every entry in the PASID table so you could do an atomic replace trivially. Just update the PASID and invalidate the caches.
ARM has a flexible PASID table and atomic replace would be part of resizing it. eg you can atomically update the top of the PASID table with a single 64 bit store.
So replace lets the drivers implement those special behaviors if it makes sense for them.
Or maybe atomic is the wrong word here since we expect no in-flight DMA during the sort of mode transitions referred to here, and we're really just trying to convey that we can do this via a single operation with reduced latency? Thanks,
atomic means DMA will either translate with the old domain or the new domain but never a blocking domain. Keep in mind that with nesting "domain" can mean a full PASID table in guest memory.
I should reiterate that replace is not an API that is required to be atomic.
yes. Just as explained in the commit message:
" By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows. "
API-level replace only implies removing the transition to/from the blocking domain. It is then the driver's call whether it wants to take this chance to implement truly 'atomic' behavior.
On 2023/2/13 10:24, Tian, Kevin wrote:
From: Jason Gunthorpe <jgg@nvidia.com> Sent: Saturday, February 11, 2023 8:45 AM
On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
On Tue, 7 Feb 2023 13:17:54 -0800 Nicolin Chen <nicolinc@nvidia.com> wrote:
qemu has a need to replace the translations associated with a domain when the guest does large-scale operations like switching between an IDENTITY domain and, say, dma-iommu.c.
Currently, it does this by replacing all the mappings in a single domain, but this is very inefficient and means that domains have to be per-device rather than per-translation.
Provide a high-level API to allow replacements of one domain with another. This is similar to a detach/attach cycle except it doesn't force the group to go to the blocking domain in-between.
By removing this forced blocking domain the iommu driver has the opportunity to implement an atomic replacement of the domains to the greatest extent its hardware allows.
It could be possible to address this by simply removing the protection from the iommu_attach_group(), but it is not so clear if that is safe for the few users. Thus, add a new API to serve this new purpose.
Atomic replacement allows the qemu emulation of the viommu to be more complete, as real hardware has this ability.
I was under the impression that we could not atomically switch a device's domain relative to in-flight DMA.
it's possible as long as the mappings for in-flight DMA don't change in the transition.
It also requires that the mappings in the old and new domains are identical. In other words, any IOVA should translate to the same result no matter whether it goes through the old domain, the new domain, or the hardware caches.
A similar replacement has already been implemented in the code. For example, the Intel IOMMU driver dynamically transforms a large-range (2M or 1G) mapping from discrete 4K pages to a super page to improve paging-cache efficiency.
Best regards, baolu
On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:
This is because the cache tag and the io page table top are in different 64 bit words so atomic writes don't cover both, and thus the IOMMU HW could tear the two stores and mismatch the cache tag to the table top. This would corrupt the cache.
The VT-d spec recommends using a 128-bit cmpxchg instruction to update the page table pointer and DID together.
Oh really? Such a thing exists? That's neat!
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Monday, February 13, 2023 10:46 PM
On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:
This is because the cache tag and the io page table top are in different 64 bit words so atomic writes don't cover both, and thus the IOMMU HW could tear the two stores and mismatch the cache tag to the table top. This would corrupt the cache.
The VT-d spec recommends using a 128-bit cmpxchg instruction to update the page table pointer and DID together.
Oh really? Such a thing exists? That's neat!
yes. see cmpxchg_double().
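For illustration, a minimal sketch of that idea (the ctx_pair layout below is a simplified stand-in, not the real VT-d context/PASID entry format; cmpxchg_double() is the kernel helper current at the time of this thread):

	/* Two adjacent 64-bit words updated as one 128-bit unit */
	struct ctx_pair {
		u64 pgd;	/* io page table top */
		u64 did;	/* word carrying the cache tag (DID) */
	} __aligned(16);

	static bool replace_entry_atomically(struct ctx_pair *e,
					     u64 old_pgd, u64 old_did,
					     u64 new_pgd, u64 new_did)
	{
		/*
		 * Compare-and-swap both words together, so the HW walker can
		 * never observe a new page table top paired with a stale DID
		 * (or vice versa). Returns true on success.
		 */
		return cmpxchg_double(&e->pgd, &e->did,
				      old_pgd, old_did,
				      new_pgd, new_did);
	}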
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
Just realized the error unwind is a nop given below:

__iommu_group_set_domain()
{
	if (group->domain == new_domain)
		return 0;
	...
There was an attempt [1] to fix error unwind in iommu_attach_group(), by temporarily setting group->domain to NULL before calling set_domain().
Jason, I wonder why this recovery cannot be done in __iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					 iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					   iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;
Thanks, Kevin
[1] https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.c...
On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
Just realized the error unwind is a nop given below:

__iommu_group_set_domain()
{
	if (group->domain == new_domain)
		return 0;
	...
There was an attempt [1] to fix error unwind in iommu_attach_group(), by temporarily setting group->domain to NULL before calling set_domain().
Jason, I wonder why this recovery cannot be done in __iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					 iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					   iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;
We talked about this already; sometimes this is not the correct recovery, e.g. if we are going to a blocking domain we need to drop all references to the prior domain, not put them back.

Failures are WARN_ON events, not error recovery.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, February 15, 2023 8:53 PM
On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
Just realized the error unwind is a nop given below:

__iommu_group_set_domain()
{
	if (group->domain == new_domain)
		return 0;
	...
There was an attempt [1] to fix error unwind in iommu_attach_group(), by temporarily setting group->domain to NULL before calling set_domain().
Jason, I wonder why this recovery cannot be done in __iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					 iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					   iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;
We talked about this already; sometimes this is not the correct recovery, e.g. if we are going to a blocking domain we need to drop all references to the prior domain, not put them back.

Failures are WARN_ON events, not error recovery.
OK, I remember that. Then it looks like here we also need to temporarily set group->domain to NULL before calling set_domain() to recover, as [1] does.
[1] https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.c...
On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:
There was an attempt [1] to fix error unwind in iommu_attach_group(), by temporarily setting group->domain to NULL before calling set_domain().
Jason, I wonder why this recovery cannot be done in __iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					 iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					   iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;
We talked about this already; sometimes this is not the correct recovery, e.g. if we are going to a blocking domain we need to drop all references to the prior domain, not put them back.

Failures are WARN_ON events, not error recovery.
OK, I remember that. Then it looks like here we also need to temporarily set group->domain to NULL before calling set_domain() to recover, as [1] does.
Sigh, this is too much.
I made a series to clean up all the domain attach logic so the error handling is all in one place and all the same.
What do you think?
https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, February 24, 2023 8:58 AM
On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:
There was an attempt [1] to fix error unwind in iommu_attach_group(), by temporarily setting group->domain to NULL before calling set_domain().
Jason, I wonder why this recovery cannot be done in __iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					 iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					   iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;
We talked about this already; sometimes this is not the correct recovery, e.g. if we are going to a blocking domain we need to drop all references to the prior domain, not put them back.

Failures are WARN_ON events, not error recovery.
OK, I remember that. Then it looks like here we also need to temporarily set group->domain to NULL before calling set_domain() to recover, as [1] does.
Sigh, this is too much.
I made a series to clean up all the domain attach logic so the error handling is all in one place and all the same.
What do you think?
Yeah, that sounds like the right cleanup at a glance.
To prepare for an access->ioas replacement, move the iommufd_access_create() call into vfio_iommufd_emulated_bind(), making it symmetric with the __vfio_iommufd_access_destroy() call in vfio_iommufd_emulated_unbind(). This means an access is created/destroyed by bind()/unbind(), and vfio_iommufd_emulated_attach_ioas() only updates the access->ioas pointer.
Since there's no longer an ioas_id input for iommufd_access_create(), add a new helper iommufd_access_set_ioas() to set access->ioas. We can later simply add a "replace" feature to the new iommufd_access_set_ioas() too.
Leaving the access->ioas update to vfio_iommufd_emulated_attach_ioas(), however, introduces the potential for a race during a pin_/unpin_pages() call, where access->ioas->iopt is being referenced. So add an ioas_lock to protect it.
Note that the "refcount_dec(&access->ioas->obj.users)" line is also moved from iommufd_access_destroy_object() to the new iommufd_access_set_ioas() for symmetry. Without this change, the old_ioas would lose track of its refcount when the replace support is added.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 100 ++++++++++++++++++------
 drivers/iommu/iommufd/iommufd_private.h |   1 +
 drivers/iommu/iommufd/selftest.c        |   5 +-
 drivers/vfio/iommufd.c                  |  27 ++++---
 include/linux/iommufd.h                 |   3 +-
 5 files changed, 97 insertions(+), 39 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d81f93a321af..f4bd6f532a90 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -418,9 +418,9 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
 	struct iommufd_access *access =
 		container_of(obj, struct iommufd_access, obj);

-	iopt_remove_access(&access->ioas->iopt, access);
+	iommufd_access_set_ioas(access, 0);
 	iommufd_ctx_put(access->ictx);
-	refcount_dec(&access->ioas->obj.users);
+	mutex_destroy(&access->ioas_lock);
 }

 /**
@@ -437,12 +437,10 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
  * The provided ops are required to use iommufd_access_pin_pages().
  */
 struct iommufd_access *
-iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+iommufd_access_create(struct iommufd_ctx *ictx,
 		      const struct iommufd_access_ops *ops, void *data)
 {
 	struct iommufd_access *access;
-	struct iommufd_object *obj;
-	int rc;

 	/*
 	 * There is no uAPI for the access object, but to keep things symmetric
@@ -455,33 +453,18 @@ iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
 	access->data = data;
 	access->ops = ops;

-	obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
-	if (IS_ERR(obj)) {
-		rc = PTR_ERR(obj);
-		goto out_abort;
-	}
-	access->ioas = container_of(obj, struct iommufd_ioas, obj);
-	iommufd_ref_to_users(obj);
-
 	if (ops->needs_pin_pages)
 		access->iova_alignment = PAGE_SIZE;
 	else
 		access->iova_alignment = 1;
-	rc = iopt_add_access(&access->ioas->iopt, access);
-	if (rc)
-		goto out_put_ioas;

 	/* The calling driver is a user until iommufd_access_destroy() */
 	refcount_inc(&access->obj.users);
+	mutex_init(&access->ioas_lock);
 	access->ictx = ictx;
 	iommufd_ctx_get(ictx);
 	iommufd_object_finalize(ictx, &access->obj);
 	return access;
-out_put_ioas:
-	refcount_dec(&access->ioas->obj.users);
-out_abort:
-	iommufd_object_abort(ictx, &access->obj);
-	return ERR_PTR(rc);
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_create, IOMMUFD);

@@ -500,6 +483,50 @@ void iommufd_access_destroy(struct iommufd_access *access)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);

+int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
+{
+	struct iommufd_ioas *new_ioas = NULL, *cur_ioas;
+	struct iommufd_ctx *ictx = access->ictx;
+	struct iommufd_object *obj;
+	int rc = 0;
+
+	if (ioas_id) {
+		obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
+		if (IS_ERR(obj))
+			return PTR_ERR(obj);
+		new_ioas = container_of(obj, struct iommufd_ioas, obj);
+	}
+
+	mutex_lock(&access->ioas_lock);
+	cur_ioas = access->ioas;
+	if (cur_ioas == new_ioas)
+		goto out_unlock;
+
+	if (new_ioas) {
+		rc = iopt_add_access(&new_ioas->iopt, access);
+		if (rc)
+			goto out_unlock;
+		iommufd_ref_to_users(obj);
+	}
+
+	if (cur_ioas) {
+		iopt_remove_access(&cur_ioas->iopt, access);
+		refcount_dec(&cur_ioas->obj.users);
+	}
+
+	access->ioas = new_ioas;
+	mutex_unlock(&access->ioas_lock);
+
+	return 0;
+
+out_unlock:
+	mutex_unlock(&access->ioas_lock);
+	if (new_ioas)
+		iommufd_put_object(obj);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_set_ioas, IOMMUFD);
+
 /**
  * iommufd_access_notify_unmap - Notify users of an iopt to stop using it
  * @iopt: iopt to work on
@@ -550,8 +577,8 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
 void iommufd_access_unpin_pages(struct iommufd_access *access,
 				unsigned long iova, unsigned long length)
 {
-	struct io_pagetable *iopt = &access->ioas->iopt;
 	struct iopt_area_contig_iter iter;
+	struct io_pagetable *iopt;
 	unsigned long last_iova;
 	struct iopt_area *area;

@@ -559,6 +586,13 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 		    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
 		return;

+	mutex_lock(&access->ioas_lock);
+	if (!access->ioas) {
+		mutex_unlock(&access->ioas_lock);
+		return;
+	}
+	iopt = &access->ioas->iopt;
+
 	down_read(&iopt->iova_rwsem);
 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
 		iopt_area_remove_access(
@@ -568,6 +602,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 			min(last_iova, iopt_area_last_iova(area))));
 	up_read(&iopt->iova_rwsem);
 	WARN_ON(!iopt_area_contig_done(&iter));
+	mutex_unlock(&access->ioas_lock);
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);

@@ -613,8 +648,8 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 			     unsigned long length, struct page **out_pages,
 			     unsigned int flags)
 {
-	struct io_pagetable *iopt = &access->ioas->iopt;
 	struct iopt_area_contig_iter iter;
+	struct io_pagetable *iopt;
 	unsigned long last_iova;
 	struct iopt_area *area;
 	int rc;
@@ -629,6 +664,13 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 	if (check_add_overflow(iova, length - 1, &last_iova))
 		return -EOVERFLOW;

+	mutex_lock(&access->ioas_lock);
+	if (!access->ioas) {
+		mutex_unlock(&access->ioas_lock);
+		return -ENOENT;
+	}
+	iopt = &access->ioas->iopt;
+
 	down_read(&iopt->iova_rwsem);
 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
 		unsigned long last = min(last_iova, iopt_area_last_iova(area));
@@ -659,6 +701,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 	}

 	up_read(&iopt->iova_rwsem);
+	mutex_unlock(&access->ioas_lock);
 	return 0;

err_remove:
@@ -673,6 +716,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 					 iopt_area_last_iova(area))));
 	}
 	up_read(&iopt->iova_rwsem);
+	mutex_unlock(&access->ioas_lock);
 	return rc;
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, IOMMUFD);
@@ -692,8 +736,8 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, IOMMUFD);
 int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		      void *data, size_t length, unsigned int flags)
 {
-	struct io_pagetable *iopt = &access->ioas->iopt;
 	struct iopt_area_contig_iter iter;
+	struct io_pagetable *iopt;
 	struct iopt_area *area;
 	unsigned long last_iova;
 	int rc;
@@ -703,6 +747,13 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 	if (check_add_overflow(iova, length - 1, &last_iova))
 		return -EOVERFLOW;

+	mutex_lock(&access->ioas_lock);
+	if (!access->ioas) {
+		mutex_unlock(&access->ioas_lock);
+		return -ENOENT;
+	}
+	iopt = &access->ioas->iopt;
+
 	down_read(&iopt->iova_rwsem);
 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
 		unsigned long last = min(last_iova, iopt_area_last_iova(area));
@@ -729,6 +780,7 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		rc = -ENOENT;
 err_out:
 	up_read(&iopt->iova_rwsem);
+	mutex_unlock(&access->ioas_lock);
 	return rc;
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 222e86591f8a..2f4bb106bac6 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -261,6 +261,7 @@ struct iommufd_access {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_ioas *ioas;
+	struct mutex ioas_lock;
 	const struct iommufd_access_ops *ops;
 	void *data;
 	unsigned long iova_alignment;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index cfb5fe9a5e0e..db4011bdc8a9 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -571,7 +571,7 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
 	}

 	access = iommufd_access_create(
-		ucmd->ictx, ioas_id,
+		ucmd->ictx,
 		(flags & MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES) ?
 			&selftest_access_ops_pin :
 			&selftest_access_ops,
@@ -580,6 +580,9 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
 		rc = PTR_ERR(access);
 		goto out_put_fdno;
 	}
+	rc = iommufd_access_set_ioas(access, ioas_id);
+	if (rc)
+		goto out_destroy;
 	cmd->create_access.out_access_fd = fdno;
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 	if (rc)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 026f81a87dd7..dc9feab73db7 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id)
 {
+	struct iommufd_access *user;
+
 	lockdep_assert_held(&vdev->dev_set->lock);

-	vdev->iommufd_ictx = ictx;
 	iommufd_ctx_get(ictx);
+	user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops, vdev);
+	if (IS_ERR(user)) {
+		iommufd_ctx_put(vdev->iommufd_ictx);
+		return PTR_ERR(user);
+	}
+	iommufd_access_set_ioas(user, 0);
+	vdev->iommufd_access = user;
+	vdev->iommufd_ictx = ictx;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_bind);
@@ -168,22 +177,14 @@ EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_unbind);

 int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 {
-	struct iommufd_access *user;
-
 	lockdep_assert_held(&vdev->dev_set->lock);

 	if (!vdev->iommufd_ictx)
 		return -EINVAL;
+	if (!vdev->iommufd_access)
+		return -ENOENT;

-	if (vdev->iommufd_access)
-		return -EBUSY;
-
-	user = iommufd_access_create(vdev->iommufd_ictx, *pt_id, &vfio_user_ops,
-				     vdev);
-	if (IS_ERR(user))
-		return PTR_ERR(user);
-	vdev->iommufd_access = user;
-	return 0;
+	return iommufd_access_set_ioas(vdev->iommufd_access, *pt_id);
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_attach_ioas);

@@ -194,6 +195,6 @@ void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
 	if (!vdev->iommufd_ictx || !vdev->iommufd_access)
 		return;

-	__vfio_iommufd_access_destroy(vdev);
+	iommufd_access_set_ioas(vdev->iommufd_access, 0);
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_detach_ioas);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 9672cf839687..f9bac6f9db2e 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -46,9 +46,10 @@ enum {
 };

 struct iommufd_access *
-iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+iommufd_access_create(struct iommufd_ctx *ictx,
 		      const struct iommufd_access_ops *ops, void *data);
 void iommufd_access_destroy(struct iommufd_access *access);
+int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id);

 void iommufd_ctx_get(struct iommufd_ctx *ictx);
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id)
 {
+	struct iommufd_access *user;
+
 	lockdep_assert_held(&vdev->dev_set->lock);

-	vdev->iommufd_ictx = ictx;
 	iommufd_ctx_get(ictx);
+	user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops,
+				     vdev);
+	if (IS_ERR(user)) {
+		iommufd_ctx_put(vdev->iommufd_ictx);
+		return PTR_ERR(user);
+	}
+	iommufd_access_set_ioas(user, 0);
this is not required since ioas is already NULL after creation.
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Thu, Feb 09, 2023 at 02:56:39AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id)
 {
+	struct iommufd_access *user;
+
 	lockdep_assert_held(&vdev->dev_set->lock);

-	vdev->iommufd_ictx = ictx;
 	iommufd_ctx_get(ictx);
+	user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops,
+				     vdev);
+	if (IS_ERR(user)) {
+		iommufd_ctx_put(vdev->iommufd_ictx);
+		return PTR_ERR(user);
+	}
+	iommufd_access_set_ioas(user, 0);
this is not required since ioas is already NULL after creation.
Will drop it.
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
And add this too.
Thanks! Nic
On Tue, 2023-02-07 at 13:17 -0800, Nicolin Chen wrote: ...snip...
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 026f81a87dd7..dc9feab73db7 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id)
 {
+	struct iommufd_access *user;
+
 	lockdep_assert_held(&vdev->dev_set->lock);

-	vdev->iommufd_ictx = ictx;
 	iommufd_ctx_get(ictx);
+	user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops, vdev);
+	if (IS_ERR(user)) {
+		iommufd_ctx_put(vdev->iommufd_ictx);
Matthew noticed a vfio-ccw and -ap regression that blames this patch.
Probably both the iommufd_access_create() and iommufd_ctx_put() calls want the ictx variable itself, instead of the (uninitialized) pointer in the vfio_device. (At least that gets -ccw and -ap working again.)
Thanks, Eric
+		return PTR_ERR(user);
+	}
+	iommufd_access_set_ioas(user, 0);
+	vdev->iommufd_access = user;
+	vdev->iommufd_ictx = ictx;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_bind);
On Thu, Feb 09, 2023 at 01:58:47PM -0500, Eric Farman wrote:
On Tue, 2023-02-07 at 13:17 -0800, Nicolin Chen wrote: ...snip...
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 026f81a87dd7..dc9feab73db7 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id)
 {
+	struct iommufd_access *user;
+
 	lockdep_assert_held(&vdev->dev_set->lock);

-	vdev->iommufd_ictx = ictx;
 	iommufd_ctx_get(ictx);
+	user = iommufd_access_create(vdev->iommufd_ictx,
+				     &vfio_user_ops, vdev);
+	if (IS_ERR(user)) {
+		iommufd_ctx_put(vdev->iommufd_ictx);
Matthew noticed a vfio-ccw and -ap regression that blames this patch.
Probably both the iommufd_access_create() and iommufd_ctx_put() calls want the ictx variable itself, instead of the (uninitialized) pointer in the vfio_device. (At least that gets -ccw and -ap working again.)
Oops. Yes, it should be:
	iommufd_ctx_get(ictx);
	user = iommufd_access_create(ictx, &vfio_user_ops, vdev);
	if (IS_ERR(user)) {
		iommufd_ctx_put(ictx);
Will fix in v3.
Thanks! Nic
Add a new IOMMU_TEST_OP_ACCESS_SET_IOAS to allow setting access->ioas individually, corresponding to the iommufd_access_set_ioas() helper.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_test.h          |  4 +++
 drivers/iommu/iommufd/selftest.c              | 26 +++++++++++++++----
 tools/testing/selftests/iommu/iommufd_utils.h | 22 ++++++++++++++--
 3 files changed, 45 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 1d96a8f466fd..f2c61a9500e7 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -13,6 +13,7 @@ enum {
 	IOMMU_TEST_OP_MD_CHECK_MAP,
 	IOMMU_TEST_OP_MD_CHECK_REFS,
 	IOMMU_TEST_OP_CREATE_ACCESS,
+	IOMMU_TEST_OP_ACCESS_SET_IOAS,
 	IOMMU_TEST_OP_DESTROY_ACCESS_PAGES,
 	IOMMU_TEST_OP_ACCESS_PAGES,
 	IOMMU_TEST_OP_ACCESS_RW,
@@ -66,6 +67,9 @@ struct iommu_test_cmd {
 			__u32 out_access_fd;
 			__u32 flags;
 		} create_access;
+		struct {
+			__u32 ioas_id;
+		} access_set_ioas;
 		struct {
 			__u32 access_pages_id;
 		} destroy_access_pages;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index db4011bdc8a9..b94870f93138 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -549,7 +549,7 @@ static struct selftest_access *iommufd_test_alloc_access(void)
 }

 static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
-				      unsigned int ioas_id, unsigned int flags)
+				      unsigned int flags)
 {
 	struct iommu_test_cmd *cmd = ucmd->cmd;
 	struct selftest_access *staccess;
@@ -580,9 +580,6 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
 		rc = PTR_ERR(access);
 		goto out_put_fdno;
 	}
-	rc = iommufd_access_set_ioas(access, ioas_id);
-	if (rc)
-		goto out_destroy;
 	cmd->create_access.out_access_fd = fdno;
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 	if (rc)
@@ -601,6 +598,22 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
 	return rc;
 }

+static int iommufd_test_access_set_ioas(struct iommufd_ucmd *ucmd,
+					unsigned int access_id,
+					unsigned int ioas_id)
+{
+	struct selftest_access *staccess;
+	int rc;
+
+	staccess = iommufd_access_get(access_id);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	rc = iommufd_access_set_ioas(staccess->access, ioas_id);
+	fput(staccess->file);
+	return rc;
+}
+
 /* Check that the pages in a page array match the pages in the user VA */
 static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
 				    size_t npages)
@@ -810,8 +823,11 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
 			cmd->check_refs.length, cmd->check_refs.refs);
 	case IOMMU_TEST_OP_CREATE_ACCESS:
-		return iommufd_test_create_access(ucmd, cmd->id,
+		return iommufd_test_create_access(ucmd,
 						  cmd->create_access.flags);
+	case IOMMU_TEST_OP_ACCESS_SET_IOAS:
+		return iommufd_test_access_set_ioas(
+			ucmd, cmd->id, cmd->access_set_ioas.ioas_id);
 	case IOMMU_TEST_OP_ACCESS_PAGES:
 		return iommufd_test_access_pages(
 			ucmd, cmd->id, cmd->access_pages.iova,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 0d1f46369c2a..67805afc620f 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -66,13 +66,31 @@ static int _test_cmd_mock_domain(int fd, unsigned int ioas_id, __u32 *device_id,
 	EXPECT_ERRNO(_errno, _test_cmd_mock_domain(self->fd, ioas_id, \
 						   device_id, hwpt_id))

+static int _test_cmd_access_set_ioas(int fd, __u32 access_id,
+				     unsigned int ioas_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_ACCESS_SET_IOAS,
+		.id = access_id,
+		.access_set_ioas = { .ioas_id = ioas_id },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	return 0;
+}
+#define test_cmd_access_set_ioas(access_id, ioas_id) \
+	ASSERT_EQ(0, _test_cmd_access_set_ioas(self->fd, access_id, ioas_id))
+
 static int _test_cmd_create_access(int fd, unsigned int ioas_id,
 				   __u32 *access_id, unsigned int flags)
 {
 	struct iommu_test_cmd cmd = {
 		.size = sizeof(cmd),
 		.op = IOMMU_TEST_OP_CREATE_ACCESS,
-		.id = ioas_id,
 		.create_access = { .flags = flags },
 	};
 	int ret;
@@ -81,7 +99,7 @@ static int _test_cmd_create_access(int fd, unsigned int ioas_id,
 	if (ret)
 		return ret;
 	*access_id = cmd.create_access.out_access_fd;
-	return 0;
+	return _test_cmd_access_set_ioas(fd, *access_id, ioas_id);
 }
 #define test_cmd_create_access(ioas_id, access_id, flags)                  \
 	ASSERT_EQ(0, _test_cmd_create_access(self->fd, ioas_id, access_id, \
 					     flags))
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Add a new IOMMU_TEST_OP_ACCESS_SET_IOAS to allow setting access->ioas individually, corresponding to the iommufd_access_set_ioas() helper.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Support an access->ioas replacement in iommufd_access_set_ioas(), which provisionally sets access->ioas to NULL so that any further incoming iommufd_access_pin_pages() call is blocked.
Then, call access->ops->unmap() to clean up the entire iopt. To allow an iommufd_access_unpin_pages() callback to happen via this unmap() call, add an ioas_unpin pointer so the unpin routine won't be affected by the "access->ioas = NULL" trick above.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 16 ++++++++++++++--
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index f4bd6f532a90..10ce47484ffa 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
 		iommufd_ref_to_users(obj);
 	}

+	/*
+	 * Set ioas to NULL to block any further iommufd_access_pin_pages().
+	 * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+	 */
+	access->ioas = NULL;
+
 	if (cur_ioas) {
+		if (new_ioas) {
+			mutex_unlock(&access->ioas_lock);
+			access->ops->unmap(access->data, 0, ULONG_MAX);
+			mutex_lock(&access->ioas_lock);
+		}
 		iopt_remove_access(&cur_ioas->iopt, access);
 		refcount_dec(&cur_ioas->obj.users);
 	}

+	access->ioas_unpin = new_ioas;
 	access->ioas = new_ioas;
 	mutex_unlock(&access->ioas_lock);

@@ -587,11 +599,11 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 		return;

 	mutex_lock(&access->ioas_lock);
-	if (!access->ioas) {
+	if (!access->ioas_unpin) {
 		mutex_unlock(&access->ioas_lock);
 		return;
 	}
-	iopt = &access->ioas->iopt;
+	iopt = &access->ioas_unpin->iopt;

 	down_read(&iopt->iova_rwsem);
 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2f4bb106bac6..593138bb37b8 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -261,6 +261,7 @@ struct iommufd_access {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_ioas *ioas;
+	struct iommufd_ioas *ioas_unpin;
 	struct mutex ioas_lock;
 	const struct iommufd_access_ops *ops;
 	void *data;
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Support an access->ioas replacement in iommufd_access_set_ioas(), which sets the access->ioas to NULL provisionally so that any further incoming iommufd_access_pin_pages() callback can be blocked.
Then, call access->ops->unmap() to clean up the entire iopt. To allow an iommufd_access_unpin_pages() callback to happen via this unmap() call, add an ioas_unpin pointer so the unpin routine won't be affected by the "access->ioas = NULL" trick above.
Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Move patch 10 before this one.
 drivers/iommu/iommufd/device.c          | 16 ++++++++++++++--
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index f4bd6f532a90..10ce47484ffa 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
 		iommufd_ref_to_users(obj);
 	}

+	/*
+	 * Set ioas to NULL to block any further iommufd_access_pin_pages().
+	 * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+	 */
+	access->ioas = NULL;
+
 	if (cur_ioas) {
+		if (new_ioas) {
+			mutex_unlock(&access->ioas_lock);
+			access->ops->unmap(access->data, 0, ULONG_MAX);
+			mutex_lock(&access->ioas_lock);
+		}
Why does the above only apply to a valid new_ioas? This is the cleanup on cur_ioas, so it is required even when new_ioas is NULL.

Instead, the check here should be "if (access->ops->unmap)". With patch 10, this cleanup is required only for drivers that are allowed to pin pages.
On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
 		iommufd_ref_to_users(obj);
 	}

+	/*
+	 * Set ioas to NULL to block any further iommufd_access_pin_pages().
+	 * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+	 */
+	access->ioas = NULL;
+
 	if (cur_ioas) {
+		if (new_ioas) {
+			mutex_unlock(&access->ioas_lock);
+			access->ops->unmap(access->data, 0, ULONG_MAX);
+			mutex_lock(&access->ioas_lock);
+		}
Why does the above only apply to a valid new_ioas? This cleanup on cur_ioas is required even when new_ioas=NULL.
Though it'd make sense to put it in the common path, our current detach routine doesn't call this unmap. If we do so, it'd become something new to the normal detach routine. Or does this mean the detach routine has been missing an unmap call so far?
Thanks Nic
On Thu, Feb 09, 2023 at 12:28:45PM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:
--- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id) iommufd_ref_to_users(obj); }
+	/*
+	 * Set ioas to NULL to block any further iommufd_access_pin_pages().
+	 * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+	 */
+	access->ioas = NULL;
+
 	if (cur_ioas) {
+		if (new_ioas) {
+			mutex_unlock(&access->ioas_lock);
+			access->ops->unmap(access->data, 0, ULONG_MAX);
+			mutex_lock(&access->ioas_lock);
+		}
Why does the above only apply to a valid new_ioas? This cleanup on cur_ioas is required even when new_ioas=NULL.
Though it'd make sense to put it in the common path, our current detach routine doesn't call this unmap. If we do so, it'd become something new to the normal detach routine. Or does this mean the detach routine has been missing an unmap call so far?
By the time vfio_iommufd_emulated_unbind() is called, the driver's close_device() has already returned.
At this point the driver should have removed all active pins.
We should not call back into the driver with unmap after its close_device() returns.
However, this function is not on the close_device path so it should always flush all existing mappings before attempting to change the ioas to anything.
Jason
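Reading Kevin's and Jason's comments together, the cur_ioas cleanup would presumably become unconditional, with the driver callback gated on access->ops->unmap instead of on new_ioas. A rough sketch reconstructed from the comments above (not a posted patch):

	/* Sketch only, reconstructed from the review comments above. */
	access->ioas = NULL;

	if (cur_ioas) {
		/* Always flush the old mappings before changing the ioas */
		if (access->ops->unmap) {
			mutex_unlock(&access->ioas_lock);
			access->ops->unmap(access->data, 0, ULONG_MAX);
			mutex_lock(&access->ioas_lock);
		}
		iopt_remove_access(&cur_ioas->iopt, access);
		refcount_dec(&cur_ioas->obj.users);
	}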
On Thu, Feb 09, 2023 at 04:49:55PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 12:28:45PM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:
--- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id) iommufd_ref_to_users(obj); }
+	/*
+	 * Set ioas to NULL to block any further iommufd_access_pin_pages().
+	 * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+	 */
+	access->ioas = NULL;
+
 	if (cur_ioas) {
+		if (new_ioas) {
+			mutex_unlock(&access->ioas_lock);
+			access->ops->unmap(access->data, 0, ULONG_MAX);
+			mutex_lock(&access->ioas_lock);
+		}
Why does the above only apply to a valid new_ioas? This cleanup on cur_ioas is required even when new_ioas=NULL.
Though it'd make sense to put it in the common path, our current detach routine doesn't call this unmap. If we do so, it'd become something new to the normal detach routine. Or does this mean the detach routine has been missing an unmap call so far?
By the time vfio_iommufd_emulated_unbind() is called, the driver's close_device() has already returned.
At this point the driver should have removed all active pins.
We should not call back into the driver with unmap after its close_device() returns.
I see. Just found that vfio_device_last_close().
However, this function is not on the close_device path so it should always flush all existing mappings before attempting to change the ioas to anything.
OK. That means I also need to fix my change here:
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 8a9834fc129a..ba3fd35b7540 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -465,7 +465,10 @@ void iommufd_access_destroy_object(struct iommufd_object *obj) struct iommufd_access *access = container_of(obj, struct iommufd_access, obj);
- iommufd_access_set_ioas(access, 0); + if (access->ioas) { + iopt_remove_access(&access->ioas->iopt, access); + refcount_dec(&access->ioas->obj.users); + } iommufd_ctx_put(access->ictx); mutex_destroy(&access->ioas_lock); }
Otherwise, the close_device path would invoke this function via the unbind() call.
Thanks Nic
Add replace coverage as a part of user_copy test case. It basically repeats the copy test after replacing the old ioas with a new one.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- tools/testing/selftests/iommu/iommufd.c | 29 +++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index fa08209268c4..1e293950ac88 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -1239,7 +1239,13 @@ TEST_F(iommufd_mock_domain, user_copy) .dst_iova = MOCK_APERTURE_START, .length = BUFFER_SIZE, }; - unsigned int ioas_id; + struct iommu_ioas_unmap unmap_cmd = { + .size = sizeof(unmap_cmd), + .ioas_id = self->ioas_id, + .iova = MOCK_APERTURE_START, + .length = BUFFER_SIZE, + }; + unsigned int new_ioas_id, ioas_id;
/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */ test_ioctl_ioas_alloc(&ioas_id); @@ -1257,11 +1263,30 @@ TEST_F(iommufd_mock_domain, user_copy) ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, ©_cmd)); check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);
+ /* Now replace the ioas with a new one */ + test_ioctl_ioas_alloc(&new_ioas_id); + test_ioctl_ioas_map_id(new_ioas_id, buffer, BUFFER_SIZE, + ©_cmd.src_iova); + test_cmd_access_set_ioas(access_cmd.id, new_ioas_id); + + /* Destroy the old ioas and cleanup copied mapping */ + ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd)); + test_ioctl_destroy(ioas_id); + + /* Then run the same test again with the new ioas */ + access_cmd.access_pages.iova = copy_cmd.src_iova; + ASSERT_EQ(0, + ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES), + &access_cmd)); + copy_cmd.src_ioas_id = new_ioas_id; + ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, ©_cmd)); + check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE); + test_cmd_destroy_access_pages( access_cmd.id, access_cmd.access_pages.out_access_pages_id); test_cmd_destroy_access(access_cmd.id);
- test_ioctl_destroy(ioas_id); + test_ioctl_destroy(new_ioas_id); }
/* VFIO compatibility IOCTLs */
Because list_del() is together with iopt_table_remove_domain(), it makes sense to have list_add_tail() together with iopt_table_add_domain().
Also place the mutex outside the iommufd_device_do_attach() call, similar to what's in the iommufd_device_auto_get_domain() function.
Co-developed-by: Yi Liu yi.l.liu@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/device.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 10ce47484ffa..b8c3e3baccb5 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -200,6 +200,8 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, phys_addr_t sw_msi_start = PHYS_ADDR_MAX; int rc;
+ lockdep_assert_held(&hwpt->ioas->mutex); + mutex_lock(&hwpt->devices_lock);
/* @@ -243,6 +245,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, hwpt->domain); if (rc) goto out_detach; + list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list); } }
@@ -304,7 +307,6 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev, rc = iommufd_device_do_attach(idev, hwpt); if (rc) goto out_abort; - list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
mutex_unlock(&ioas->mutex); iommufd_object_finalize(idev->ictx, &hwpt->obj); @@ -343,13 +345,11 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) struct iommufd_hw_pagetable *hwpt = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+ mutex_lock(&hwpt->ioas->mutex); rc = iommufd_device_do_attach(idev, hwpt); + mutex_unlock(&hwpt->ioas->mutex); if (rc) goto out_put_pt_obj; - - mutex_lock(&hwpt->ioas->mutex); - list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list); - mutex_unlock(&hwpt->ioas->mutex); break; } case IOMMUFD_OBJ_IOAS: {
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Because list_del() is together with iopt_table_remove_domain(), it makes sense to have list_add_tail() together with iopt_table_add_domain().
Also place the mutex outside the iommufd_device_do_attach() call, similar to what's in the iommufd_device_auto_get_domain() function.
Co-developed-by: Yi Liu yi.l.liu@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
shouldn't this be a separate bug fix and backported? double adding a list item would certainly clobber the list...
On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Because list_del() is together with iopt_table_remove_domain(), it makes sense to have list_add_tail() together with iopt_table_add_domain().
Also place the mutex outside the iommufd_device_do_attach() call, similar to what's in the iommufd_device_auto_get_domain() function.
Co-developed-by: Yi Liu yi.l.liu@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
shouldn't this be a separate bug fix and backported? double adding a list item would certainly clobber the list...
AFAIK there is no bug, this is just reorganizing things
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, February 9, 2023 9:24 PM
On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Because list_del() is together with iopt_table_remove_domain(), it makes sense to have list_add_tail() together with iopt_table_add_domain().
Also place the mutex outside the iommufd_device_do_attach() call, similar to what's in the iommufd_device_auto_get_domain() function.
Co-developed-by: Yi Liu yi.l.liu@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
shouldn't this be a separate bug fix and backported? double adding a list item would certainly clobber the list...
AFAIK there is no bug, this is just reorganizing things
there is a semantic change.
here is the current code:
	case IOMMUFD_OBJ_HW_PAGETABLE: {
		struct iommufd_hw_pagetable *hwpt =
			container_of(pt_obj, struct iommufd_hw_pagetable, obj);

		rc = iommufd_device_do_attach(idev, hwpt);
		if (rc)
			goto out_put_pt_obj;

		mutex_lock(&hwpt->ioas->mutex);
		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
		mutex_unlock(&hwpt->ioas->mutex);
		break;
	}
The above means every attach to the hwpt will try to add the hwpt to the list tail. Isn't that a bug?
with this patch the hwpt is added to the list only when it's attached by the first device, i.e. when iopt_table_add_domain() is called.
	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		rc = iommu_attach_group(hwpt->domain, idev->group);
		if (rc)
			goto out_iova;

		if (list_empty(&hwpt->devices)) {
			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
						   hwpt->domain);
			if (rc)
				goto out_detach;
			list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
		}
	}
so it's actually a bug fix.
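For background on why a double add would be bad: with the kernel's linked lists, adding a node that is already on a list scrambles its link pointers (CONFIG_DEBUG_LIST warns about exactly this case). A contrived illustration, not code from the series:

	/* Contrived illustration of a double add, not code from the series */
	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);	/* ok */
	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
	/*
	 * After the second add, hwpt_item->prev points at hwpt_item itself
	 * while the previous tail still points at hwpt_item, so a later
	 * list_del() or traversal breaks the list.
	 */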
On Fri, Feb 10, 2023 at 01:46:18AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, February 9, 2023 9:24 PM
On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Because list_del() is together with iopt_table_remove_domain(), it makes sense to have list_add_tail() together with iopt_table_add_domain().
Also place the mutex outside the iommufd_device_do_attach() call, similar to what's in the iommufd_device_auto_get_domain() function.
Co-developed-by: Yi Liu yi.l.liu@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com
shouldn't this be a separate bug fix and backported? double adding a list item would certainly clobber the list...
AFAIK there is no bug, this is just reorganizing things
there is a semantic change.
here is the current code:
	case IOMMUFD_OBJ_HW_PAGETABLE: {
		struct iommufd_hw_pagetable *hwpt =
			container_of(pt_obj, struct iommufd_hw_pagetable, obj);

		rc = iommufd_device_do_attach(idev, hwpt);
		if (rc)
			goto out_put_pt_obj;

		mutex_lock(&hwpt->ioas->mutex);
		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
		mutex_unlock(&hwpt->ioas->mutex);
		break;
	}
The above means every attach to the hwpt will try to add the hwpt to the list tail. Isn't that a bug?
Yes, that looks like a bug..
But this patch isn't the right way to fix that.
The HWPT should be permanently linked to the IOAS as long as it exists, and the linkage should happen when it is first created.
So attaching a HWPT to another device should never re-link it to the ioas, thus delete these lines here.
However it looks like iommufd_device_detach() is technically wrong too, it should only detach the IOAS and HWPT if it is going to destroy the HWPT. We can't hit those kinds of bugs ATM because we cannot create naked HWPTs that are not autodomains.
Maybe something like this.. I'll look closer next week
Jason
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index d81f93a321afcb..4e87a44533048a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -279,7 +279,7 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev, */ mutex_lock(&ioas->mutex); list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) { - if (!hwpt->auto_domain) + if (!hwpt->auto_domain || iommufd_object_alive(&hwpt->obj)) continue;
rc = iommufd_device_do_attach(idev, hwpt); @@ -304,6 +304,7 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev, rc = iommufd_device_do_attach(idev, hwpt); if (rc) goto out_abort; + list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
mutex_unlock(&ioas->mutex); @@ -346,10 +347,6 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) rc = iommufd_device_do_attach(idev, hwpt); if (rc) goto out_put_pt_obj; - - mutex_lock(&hwpt->ioas->mutex); - list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list); - mutex_unlock(&hwpt->ioas->mutex); break; } case IOMMUFD_OBJ_IOAS: { @@ -390,14 +387,8 @@ void iommufd_device_detach(struct iommufd_device *idev) mutex_lock(&hwpt->ioas->mutex); mutex_lock(&hwpt->devices_lock); list_del(&idev->devices_item); - if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { - if (list_empty(&hwpt->devices)) { - iopt_table_remove_domain(&hwpt->ioas->iopt, - hwpt->domain); - list_del(&hwpt->hwpt_item); - } + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) iommu_detach_group(hwpt->domain, idev->group); - } iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); mutex_unlock(&hwpt->devices_lock); mutex_unlock(&hwpt->ioas->mutex); diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 43d473989a0667..b11738bbdff7ec 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -13,6 +13,11 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
WARN_ON(!list_empty(&hwpt->devices));
+ mutex_lock(&hwpt->ioas->mutex); + iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain); + list_del(&hwpt->hwpt_item); + mutex_unlock(&hwpt->ioas->mutex); + iommu_domain_free(hwpt->domain); refcount_dec(&hwpt->ioas->obj.users); mutex_destroy(&hwpt->devices_lock);
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, February 11, 2023 5:17 AM
there is a semantic change.
here is the current code:
	case IOMMUFD_OBJ_HW_PAGETABLE: {
		struct iommufd_hw_pagetable *hwpt =
			container_of(pt_obj, struct iommufd_hw_pagetable, obj);

		rc = iommufd_device_do_attach(idev, hwpt);
		if (rc)
			goto out_put_pt_obj;

		mutex_lock(&hwpt->ioas->mutex);
		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
		mutex_unlock(&hwpt->ioas->mutex);
		break;
	}
The above means every attach to the hwpt will try to add the hwpt to the list tail. Isn't that a bug?
Yes, that looks like a bug..
But this patch isn't the right way to fix that.
The HWPT should be permanently linked to the IOAS as long as it exists, and the linkage should happen when it is first created.
So attaching a HWPT to another device should never re-link it to the ioas, thus delete these lines here.
yes, this makes more sense
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful for cases such as: 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N) 2) Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains, as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
Also, block replace operations that are from/to auto_domains, i.e. only user-allocated hw_pagetables can be replaced or replaced with.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- drivers/iommu/iommufd/iommufd_private.h | 2 + 2 files changed, 76 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index b8c3e3baccb5..8a9834fc129a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -9,6 +9,8 @@ #include "io_pagetable.h" #include "iommufd_private.h"
+MODULE_IMPORT_NS(IOMMUFD_INTERNAL); + static bool allow_unsafe_interrupts; module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC( @@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, return false; }
+/** + * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt + * @idev: device to detach + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach) + * @detach_group: flag to call iommu_detach_group + * + * This is a cleanup helper shared by the replace and detach routines. Comparing + * to a detach routine, a replace routine only needs a partial detach procedure: + * it does not need the iommu_detach_group(); it will attach the device to a new + * hw_pagetable after a partial detach from the currently attached hw_pagetable, + * so certain steps can be skipped if two hw_pagetables have the same IOAS. + */ +static void __iommmufd_device_detach(struct iommufd_device *idev, + struct iommufd_hw_pagetable *new_hwpt, + bool detach_group) +{ + struct iommufd_hw_pagetable *hwpt = idev->hwpt; + struct iommufd_ioas *new_ioas = NULL; + + if (new_hwpt) + new_ioas = new_hwpt->ioas; + + mutex_lock(&hwpt->devices_lock); + list_del(&idev->devices_item); + if (hwpt->ioas != new_ioas) + mutex_lock(&hwpt->ioas->mutex); + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { + if (list_empty(&hwpt->devices)) { + iopt_table_remove_domain(&hwpt->ioas->iopt, + hwpt->domain); + list_del(&hwpt->hwpt_item); + } + if (detach_group) + iommu_detach_group(hwpt->domain, idev->group); + } + if (hwpt->ioas != new_ioas) { + iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); + mutex_unlock(&hwpt->ioas->mutex); + } + mutex_unlock(&hwpt->devices_lock); + + if (hwpt->auto_domain) + iommufd_object_destroy_user(idev->ictx, &hwpt->obj); + else + refcount_dec(&hwpt->obj.users); + + idev->hwpt = NULL; + + refcount_dec(&idev->obj.users); +} + static int iommufd_device_do_attach(struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt) { + struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt; phys_addr_t sw_msi_start = PHYS_ADDR_MAX; int rc;
@@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, * the group once for the first device that is in the group. */ if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { - rc = iommu_attach_group(hwpt->domain, idev->group); + rc = iommu_group_replace_domain(idev->group, hwpt->domain); if (rc) goto out_iova;
@@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, } }
+ /* Replace the cur_hwpt without iommu_detach_group() */ + if (cur_hwpt) + __iommmufd_device_detach(idev, hwpt, false); + idev->hwpt = hwpt; refcount_inc(&hwpt->obj.users); list_add(&idev->devices_item, &hwpt->devices); @@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev, return 0;
out_detach: - iommu_detach_group(hwpt->domain, idev->group); + if (cur_hwpt) + iommu_group_replace_domain(idev->group, cur_hwpt->domain); + else + iommu_detach_group(hwpt->domain, idev->group); out_iova: iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); out_unlock: @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) struct iommufd_hw_pagetable *hwpt = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+ if (idev->hwpt == hwpt) + goto out_done; + if (idev->hwpt && idev->hwpt->auto_domain) { + rc = -EBUSY; + goto out_put_pt_obj; + } + mutex_lock(&hwpt->ioas->mutex); rc = iommufd_device_do_attach(idev, hwpt); mutex_unlock(&hwpt->ioas->mutex); @@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) struct iommufd_ioas *ioas = container_of(pt_obj, struct iommufd_ioas, obj);
+ if (idev->hwpt) + return -EBUSY; rc = iommufd_device_auto_get_domain(idev, ioas); if (rc) goto out_put_pt_obj; @@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) }
refcount_inc(&idev->obj.users); +out_done: *pt_id = idev->hwpt->obj.id; rc = 0;
@@ -385,31 +456,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD); */ void iommufd_device_detach(struct iommufd_device *idev) { - struct iommufd_hw_pagetable *hwpt = idev->hwpt; - - mutex_lock(&hwpt->ioas->mutex); - mutex_lock(&hwpt->devices_lock); - list_del(&idev->devices_item); - if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { - if (list_empty(&hwpt->devices)) { - iopt_table_remove_domain(&hwpt->ioas->iopt, - hwpt->domain); - list_del(&hwpt->hwpt_item); - } - iommu_detach_group(hwpt->domain, idev->group); - } - iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); - mutex_unlock(&hwpt->devices_lock); - mutex_unlock(&hwpt->ioas->mutex); - - if (hwpt->auto_domain) - iommufd_object_destroy_user(idev->ictx, &hwpt->obj); - else - refcount_dec(&hwpt->obj.users); - - idev->hwpt = NULL; - - refcount_dec(&idev->obj.users); + __iommmufd_device_detach(idev, NULL, true); } EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 593138bb37b8..200c783800ad 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -9,6 +9,8 @@ #include <linux/refcount.h> #include <linux/uaccess.h>
+#include "../iommu-priv.h" + struct iommu_domain; struct iommu_group; struct iommu_option;
Hi Nic,
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
Also, block replace operations that are from/to auto_domains, i.e. only user-allocated hw_pagetables can be replaced or replaced with.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- drivers/iommu/iommufd/iommufd_private.h | 2 + 2 files changed, 76 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index b8c3e3baccb5..8a9834fc129a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -9,6 +9,8 @@ #include "io_pagetable.h" #include "iommufd_private.h"
+MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
static bool allow_unsafe_interrupts; module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC( @@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, return false; }
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
This function doesn't do anything to attach the device to new_hwpt; that is done in iommufd_device_attach_ioas(). new_hwpt here only indicates whether this detach needs to do some extra work, e.g. removing the reserved iova from idev->hwpt->ioas. So it may be better to just say "Detach a device from idev->hwpt" and explain the usage of new_hwpt below.
+ * @idev: device to detach
+ * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach)

Better: "The new hw_pagetable to be attached."
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace routine only needs a partial detach procedure:
+ * it does not need the iommu_detach_group(); it will attach the device to a new
+ * hw_pagetable after a partial detach from the currently attached hw_pagetable,
+ * so certain steps can be skipped if two hw_pagetables have the same IOAS.
+ */
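One possible rewording along the lines Yi suggests (illustrative only, not from a posted patch):

	/**
	 * __iommmufd_device_detach - Detach a device from idev->hwpt
	 * @idev: device to detach
	 * @new_hwpt: new hw_pagetable the device is about to be attached to,
	 *            or NULL for a simple detach
	 * @detach_group: flag to call iommu_detach_group()
	 *
	 * If @new_hwpt shares the IOAS with the currently attached
	 * hw_pagetable, the IOAS-related teardown (e.g. removing the reserved
	 * IOVA ranges) can be skipped, since the attach that follows will
	 * keep using that IOAS.
	 */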
+static void __iommmufd_device_detach(struct iommufd_device *idev,
+				     struct iommufd_hw_pagetable *new_hwpt,
+				     bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+	struct iommufd_ioas *new_ioas = NULL;
+
+	if (new_hwpt)
+		new_ioas = new_hwpt->ioas;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (hwpt->ioas != new_ioas)
+		mutex_lock(&hwpt->ioas->mutex);
The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock; see iommufd_device_auto_get_domain(). If possible, switch the order here. Also, rename hwpt to cur_hwpt; this may help reviewers distinguish it from the hwpt in the caller of this function. It looks like a deadlock at first glance, but not on closer reading.
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
+		if (detach_group)
+			iommu_detach_group(hwpt->domain, idev->group);
+	}
+	if (hwpt->ioas != new_ioas) {
+		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+		mutex_unlock(&hwpt->ioas->mutex);
+	}
+	mutex_unlock(&hwpt->devices_lock);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
 static int iommufd_device_do_attach(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt)
 {
+	struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
 	phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
 	int rc;
@@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	 * the group once for the first device that is in the group.
 	 */
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		rc = iommu_attach_group(hwpt->domain, idev->group);
+		rc = iommu_group_replace_domain(idev->group, hwpt->domain);
 		if (rc)
 			goto out_iova;
@@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 		}
 	}
 
+	/* Replace the cur_hwpt without iommu_detach_group() */
+	if (cur_hwpt)
+		__iommmufd_device_detach(idev, hwpt, false);
+
 	idev->hwpt = hwpt;
 	refcount_inc(&hwpt->obj.users);
 	list_add(&idev->devices_item, &hwpt->devices);
@@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	return 0;
 
 out_detach:
-	iommu_detach_group(hwpt->domain, idev->group);
+	if (cur_hwpt)
+		iommu_group_replace_domain(idev->group, cur_hwpt->domain);
+	else
+		iommu_detach_group(hwpt->domain, idev->group);
 out_iova:
 	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
 out_unlock:
@@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 		struct iommufd_hw_pagetable *hwpt =
 			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
 
+		if (idev->hwpt == hwpt)
+			goto out_done;
+		if (idev->hwpt && idev->hwpt->auto_domain) {
+			rc = -EBUSY;
This means that if the device was attached to an auto-created hwpt, we cannot replace it with a user-allocated hwpt? If so, the replace is not available until user hwpt support lands, which is part of nesting.
+			goto out_put_pt_obj;
+		}
+		mutex_lock(&hwpt->ioas->mutex);
 		rc = iommufd_device_do_attach(idev, hwpt);
+		mutex_unlock(&hwpt->ioas->mutex);
@@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 		struct iommufd_ioas *ioas =
 			container_of(pt_obj, struct iommufd_ioas, obj);
 
+		if (idev->hwpt)
+			return -EBUSY;
So we don't allow IOAS replacement for physical devices, is that right? It looks like emulated devices allow it.
 		rc = iommufd_device_auto_get_domain(idev, ioas);
 		if (rc)
 			goto out_put_pt_obj;
@@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) }
refcount_inc(&idev->obj.users); +out_done: *pt_id = idev->hwpt->obj.id; rc = 0;
@@ -385,31 +456,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD); */ void iommufd_device_detach(struct iommufd_device *idev) {
-	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
-
-	mutex_lock(&hwpt->ioas->mutex);
-	mutex_lock(&hwpt->devices_lock);
-	list_del(&idev->devices_item);
-	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		if (list_empty(&hwpt->devices)) {
-			iopt_table_remove_domain(&hwpt->ioas->iopt,
-						 hwpt->domain);
-			list_del(&hwpt->hwpt_item);
-		}
-		iommu_detach_group(hwpt->domain, idev->group);
-	}
-	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
-	mutex_unlock(&hwpt->devices_lock);
-	mutex_unlock(&hwpt->ioas->mutex);
-
-	if (hwpt->auto_domain)
-		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
-	else
-		refcount_dec(&hwpt->obj.users);
-
-	idev->hwpt = NULL;
-
-	refcount_dec(&idev->obj.users);
+	__iommmufd_device_detach(idev, NULL, true);
} EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 593138bb37b8..200c783800ad 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -9,6 +9,8 @@ #include <linux/refcount.h> #include <linux/uaccess.h>
+#include "../iommu-priv.h"
struct iommu_domain; struct iommu_group; struct iommu_option; -- 2.39.1
Regards, Yi Liu
On Wed, Feb 08, 2023 at 08:08:42AM +0000, Liu, Yi L wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
Also, block replace operations that are from/to auto_domains, i.e. only user-allocated hw_pagetables can be replaced or replaced with.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- drivers/iommu/iommufd/iommufd_private.h | 2 + 2 files changed, 76 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index b8c3e3baccb5..8a9834fc129a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -9,6 +9,8 @@ #include "io_pagetable.h" #include "iommufd_private.h"
+MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
static bool allow_unsafe_interrupts; module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC( @@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, return false; }
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
This function doesn't do anything to attach the device to new_hwpt; that is done in iommufd_device_attach_ioas(). new_hwpt here only indicates whether this detach needs to do some extra work, e.g. removing the reserved iova from idev->hwpt->ioas. So it may be better to just say "Detach a device from idev->hwpt" and explain the usage of new_hwpt below.
Yea. You are right.
+ * @idev: device to detach
+ * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach)

Better: "The new hw_pagetable to be attached."
OK.
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace routine only needs a partial detach procedure:
+ * it does not need the iommu_detach_group(); it will attach the device to a new
+ * hw_pagetable after a partial detach from the currently attached hw_pagetable,
+ * so certain steps can be skipped if two hw_pagetables have the same IOAS.
+ */
+static void __iommmufd_device_detach(struct iommufd_device *idev,
+				     struct iommufd_hw_pagetable *new_hwpt,
+				     bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+	struct iommufd_ioas *new_ioas = NULL;
+
+	if (new_hwpt)
+		new_ioas = new_hwpt->ioas;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (hwpt->ioas != new_ioas)
+		mutex_lock(&hwpt->ioas->mutex);
The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock; see iommufd_device_auto_get_domain(). If possible, switch the order here.
Yea, I know it's a bit strange. Yet...
Our nesting series simplifies this part to:

	if (cur_ioas != new_ioas) {
		mutex_lock(&hwpt->ioas->mutex);
		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
		mutex_unlock(&hwpt->ioas->mutex);
	}
So, the code here is trying to avoid something like:

	if (cur_ioas != new_ioas)
		mutex_lock(&hwpt->ioas->mutex);
	// doing something
	if (cur_ioas != new_ioas)
		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
	// doing something
	if (cur_ioas != new_ioas)
		mutex_unlock(&hwpt->ioas->mutex);
Also, rename hwpt to cur_hwpt; this may help reviewers distinguish it from the hwpt in the caller of this function. It looks like a deadlock at first glance, but not on closer reading.
Sure.
@@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id) struct iommufd_hw_pagetable *hwpt = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+		if (idev->hwpt == hwpt)
+			goto out_done;
+		if (idev->hwpt && idev->hwpt->auto_domain) {
+			rc = -EBUSY;
This means that if the device was attached to an auto-created hwpt, we cannot replace it with a user-allocated hwpt? If so, the replace is not available until user hwpt support lands, which is part of nesting.
After aligning with Jason, this limit here might be wrong, as we should be able to support replacing an IOAS. I'd need to take a closer look and fix it in v3.
+		if (idev->hwpt)
+			return -EBUSY;
So we don't allow IOAS replacement for physical devices, is that right? It looks like emulated devices allow it.
This was to avoid a replace with an auto_domain. Similarly, it's likely wrong, as I replied above.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
Nit: s/_iommmufd/__iommufd
Regards, Yi Liu
On Wed, Feb 08, 2023 at 08:12:55AM +0000, Liu, Yi L wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
Nit: s/_iommmufd/__iommufd
Will fix it. Thanks!
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
then why not move those changes into this series to make it simpler?
Also, block replace operations that are from/to auto_domains, i.e. only user-allocated hw_pagetables can be replaced or replaced with.
where does this restriction come from? iommu_group_replace_domain() can switch between any two UNMANAGED domains. What is the extra problem in iommufd to support from/to auto domains?
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- drivers/iommu/iommufd/iommufd_private.h | 2 + 2 files changed, 76 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index b8c3e3baccb5..8a9834fc129a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -9,6 +9,8 @@ #include "io_pagetable.h" #include "iommufd_private.h"
+MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
static bool allow_unsafe_interrupts; module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC( @@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, return false; }
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
'from ... to ...' implies replace semantics; then this should be called iommufd_device_replace_hwpt().
+ * @idev: device to detach
+ * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach)
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace routine only needs a partial detach procedure:
+ * it does not need the iommu_detach_group(); it will attach the device to a new
+ * hw_pagetable after a partial detach from the currently attached hw_pagetable,
+ * so certain steps can be skipped if two hw_pagetables have the same IOAS.
+ */
+static void __iommmufd_device_detach(struct iommufd_device *idev,
+				     struct iommufd_hw_pagetable *new_hwpt,
+				     bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+	struct iommufd_ioas *new_ioas = NULL;
+
+	if (new_hwpt)
+		new_ioas = new_hwpt->ioas;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (hwpt->ioas != new_ioas)
+		mutex_lock(&hwpt->ioas->mutex);
I think new_ioas->mutex was meant here.
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
I'm not sure this can be fully shared between detach and replace. Some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
then why not move those changes into this series to make it simpler?
The simplification depends on the new ->domain_alloc_user op and its implementation in SMMU driver, which would be introduced by the nesting series of VT-d and SMMU respectively.
At this point, it's hard to decide the best sequence of our three series. We're putting this replace series first simply because it seems closer to getting merged than the other two bigger series.
Also, block replace operations that are from/to auto_domains, i.e. only user-allocated hw_pagetables can be replaced or replaced with.
where does this restriction come from? iommu_group_replace_domain() can switch between any two UNMANAGED domains. What is the extra problem in iommufd to support from/to auto domains?
It was my misunderstanding. We should have supported that. Will fix in v3 and add the corresponding support.
+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
'from ... to ...' implies replace semantics; then this should be called iommufd_device_replace_hwpt().
I had a hard time thinking about the naming; it's indeed a detach-only routine, but it takes a parameter named new_hwpt...
Perhaps I should just follow Yi's suggestion by rephrasing the narrative of this function.
+static void __iommmufd_device_detach(struct iommufd_device *idev,
+				     struct iommufd_hw_pagetable *new_hwpt,
+				     bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+	struct iommufd_ioas *new_ioas = NULL;
+
+	if (new_hwpt)
+		new_ioas = new_hwpt->ioas;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (hwpt->ioas != new_ioas)
+		mutex_lock(&hwpt->ioas->mutex);
I think new_ioas->mutex was meant here.
new_hwpt is an input from the replace routine, which holds the new_ioas->mutex already. Yi's right that the code here is a bit confusing. I will try to change it a bit for readability.
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
I'm not sure this can be fully shared between detach and replace. Some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is actually supposed to be done in the hwpt's destroy(). We couldn't move it because it'd need the new domain_alloc_user op and its implementation in the ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
Thanks Nic
On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
then why not move those changes into this series to make it simpler?
The simplification depends on the new ->domain_alloc_user op and its implementation in SMMU driver, which would be introduced by the nesting series of VT-d and SMMU respectively.
Since we are not fixing all the drivers at this point, this argument doesn't hold water.
It is as I said in the other email, there should be no changes to the normal attach/replace path when adding manual HWPT creation - once those are removed there should be minimal connection between these two series.
Jason
On Thu, Feb 09, 2023 at 08:01:00PM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
iommu_group_replace_domain() is introduced to support use cases where an iommu_group can be attached to a new domain without getting detached from the old one. This replacement feature will be useful, for cases such as:
- vPASID mode, when a guest wants to replace a single pasid (PASID=0) table with a larger table (PASID=N)
- Nesting mode, when switching the attaching device from an S2 domain to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). And add a __iommmufd_device_detach helper to allow the replace routine to do a partial detach on the current hwpt that's being replaced. Though the updated locking logic is overcomplicated, it will be eased, once those iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's allocation/destroy() functions in the coming nesting series, as that'll depend on a new ->domain_alloc_user op in the iommu core.
then why not move those changes into this series to make it simpler?
The simplification depends on the new ->domain_alloc_user op and its implementation in SMMU driver, which would be introduced by the nesting series of VT-d and SMMU respectively.
Since we are not fixing all the drivers at this point, this argument doesn't hold water.
It is as I said in the other email, there should be no changes to the normal attach/replace path when adding manual HWPT creation - once those are removed there should be minimal connection between these two series.
Yes. I replied here earlier than the discussion with you in that thread. I think I should just drop this line from the commit message.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, February 10, 2023 5:13 AM
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
I'm not sure this can be fully shared between detach and replace. Some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is actually supposed to be done in the hwpt's destroy(). We couldn't move it because it'd need the new domain_alloc_user op and its implementation in the ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:
	Remove device from current hwpt;
	if (last_device in hwpt) {
		Remove hwpt domain from current iopt;
		if (last_device in group)
			detach group from hwpt domain;
	}

	if (first device in group) {
		attach group to new hwpt domain;
		if (first_device in hwpt)
			Add hwpt domain to new iopt;
	}
	Add device to new hwpt;
but the replace flow is different on the detach part:
	if (first device in group) {
		replace group's domain from current hwpt to new hwpt;
		if (first_device in hwpt)
			Add hwpt domain to new iopt;
	}

	Remove device from old hwpt;
	if (last_device in old hwpt)
		Remove hwpt domain from old iopt;

	Add device to new hwpt;
I've yet to figure out whether we have sufficient lock protection to prevent other paths from using the old iopt/hwpt to find a device that is already attached to a different domain.
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
I'm not sure this can be fully shared between detach and replace. Some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is actually supposed to be done in the hwpt's destroy(). We couldn't move it because it'd need the new domain_alloc_user op and its implementation in the ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:
	Remove device from current hwpt;
	if (last_device in hwpt) {
		Remove hwpt domain from current iopt;
		if (last_device in group)
			detach group from hwpt domain;
	}

	if (first device in group) {
		attach group to new hwpt domain;
		if (first_device in hwpt)
			Add hwpt domain to new iopt;
	}
	Add device to new hwpt;
but the replace flow is different on the detach part:
	if (first device in group) {
		replace group's domain from current hwpt to new hwpt;
		if (first_device in hwpt)
			Add hwpt domain to new iopt;
	}

	Remove device from old hwpt;
	if (last_device in old hwpt)
		Remove hwpt domain from old iopt;

	Add device to new hwpt;
Oh... thinking about it carefully, I see the flow does look a bit off. Perhaps it's better to have a similar flow for replace.
However, I think something would still be different due to its tricky nature, especially for a multi-device iommu_group.
An iommu_group_detach happens only when a device is the last one in its group to go through the routine via a DETACH ioctl, while an iommu_group_replace_domain() happens only when the device is the first one in its group to go through the routine via another ATTACH ioctl. However, when the first device does a replace, the cleanup routine of the old hwpt is a NOP, since there are still other devices (same group) in the old hwpt. And two implications here:
1) Any other device in the same group has to forcibly switch to the new domain, when the first device does a replace.
2) The actual hwpt cleanup can only happen at the last device's replace call.
This also means that the kernel has to rely on user space to replace all active devices in the group:
For a three-device iommu_group,

[scenario 1]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. ATTACH (REPLACE) dev1 to hwpt2;
   (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
   (no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do
   (do hwpt1 cleanup; no dev3 replace; no hwpt2 init)
[scenario 2]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
   (detach dev3; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
   (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
   (do hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
   (no hwpt1 cleanup; no dev3 replace; no hwpt2 init)
[scenario 3]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
   (detach dev3; no hwpt1 cleanup)
e. DETACH dev2 from hwpt1;
   (detach dev2; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
   (do hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. (optional) ATTACH dev2 to hwpt2; // clean ATTACH, not a REPLACE
   (no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
   (no hwpt1 cleanup; no dev3 replace; no hwpt2 init)
Following the narratives above,
[current detach+attach flow]

// DETACH dev1 from hwpt1;
Log dev1 out of the hwpt1's device list;
NOP; // hwpt1 has its group;
iopt_remove_reserved_iova(hwpt1->iopt, dev1);
idev1->hwpt = NULL; refcount_dec();

// DETACH dev2 from hwpt1;
Log dev2 out of the hwpt1's device list;
if (hwpt1 does not have its group) { // last device to detach
	if (hwpt1's device list is empty)
		iopt_table_remove_domain/list_del(hwpt1);
	iommu_detach_group();
}
iopt_remove_reserved_iova(hwpt1->iopt, dev2);
idev2->hwpt = NULL; refcount_dec();
...
// ATTACH dev1 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to attach
	iommu_attach_group();
	if (hwpt2's device list is empty)
		iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2; refcount_inc();
Log dev1 in the hwpt2's device list;

// ATTACH dev2 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group;
idev2->hwpt = hwpt2; refcount_inc();
Log dev2 into the hwpt2's device list;
[correct (?) replace flow - scenario 1 above]
// 1.d Switch (REPLACE) dev1 from hwpt1 to hwpt2;
partial detach (dev1) {
	Log dev1 out of the hwpt1's device list;
	NOP; // hwpt1 has its group, and hwpt1's device list isn't empty
	iopt_remove_reserved_iova(hwpt1->iopt, dev1);
	refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to replace
	iommu_group_replace_domain();
	if (hwpt2's device list is empty)
		iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2; refcount_inc();
Log dev1 in the hwpt2's device list;

// 1.e Switch (REPLACE) dev2 from hwpt1 to hwpt2;
partial detach (dev2) {
	Log dev2 out of the hwpt1's device list;
	NOP; // hwpt1 has its group, and hwpt1's device list isn't empty
	iopt_remove_reserved_iova(hwpt1->iopt, dev2);
	refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev2->hwpt = hwpt2; refcount_inc();
Log dev2 in the hwpt2's device list;

// 1.f Switch (REPLACE) dev3 from hwpt1 to hwpt2;
partial detach (dev3) {
	Log dev3 out of the hwpt1's device list;
	if (hwpt1 does not have its group) { // last device to detach
		if (hwpt1's device list is empty)
			iopt_table_remove_domain/list_del(hwpt1);
	}
	iopt_remove_reserved_iova(hwpt1->iopt, dev3);
	refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev3);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev3->hwpt = hwpt2; refcount_inc();
Log dev3 in the hwpt2's device list;
And this would complicate the error-out routines too...
I have yet to figure out whether we have sufficient lock protection to prevent other paths from using the old iopt/hwpt to find a device that is already attached to a different domain.
With the correct (?) flow, I think it'd be safer for a one-device group. But it gets trickier for the multi-device case above: dev2 and dev3 are already in the new domain, but all their iommufd objects (idev->hwpt and iopt) are lagging. Do we need a lock across the three IOCTLs?
One way to avoid it is to force-update idev2 and idev3 too when idev1 does a replace -- by iterating all same-group devices out of the old hwpt (see the sketch after the scenario below). This, however, feels like a violation of the device-centric model...
i.e. scenario 1:
 a. ATTACH dev1 to hwpt1;
 b. ATTACH dev2 to hwpt1;
 c. ATTACH dev3 to hwpt1;
 d. ATTACH (REPLACE) dev1 to hwpt2;
    (do hwpt1 cleanup; replace dev2&3 and update idev2&3 too; do hwpt2 init)
 e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do, or be aware of 1.d
    (kernel does dummy)
 f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do, or be aware of 1.d
    (kernel does dummy)
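A rough sketch of that force-update idea, purely hypothetical since no such iteration helper exists in this series:

	/*
	 * Hypothetical: when the first device of a group replaces its
	 * domain, forcibly retarget every same-group idev still logged
	 * in the old hwpt. Sketch only; locking is omitted.
	 */
	static void iommufd_group_force_update(struct iommufd_hw_pagetable *old_hwpt,
					       struct iommufd_hw_pagetable *new_hwpt,
					       struct iommu_group *group)
	{
		struct iommufd_device *cur, *tmp;

		list_for_each_entry_safe(cur, tmp, &old_hwpt->devices,
					 devices_item) {
			if (cur->group != group)
				continue;
			/* Move the logging and references to the new hwpt */
			list_move(&cur->devices_item, &new_hwpt->devices);
			cur->hwpt = new_hwpt;
			refcount_inc(&new_hwpt->obj.users);
			refcount_dec(&old_hwpt->obj.users);
		}
	}

The subsequent e./f. IOCTLs would then find idev->hwpt already pointing at hwpt2 and could return as a dummy.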
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Saturday, February 11, 2023 8:10 AM
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		if (list_empty(&hwpt->devices)) {
			iopt_table_remove_domain(&hwpt->ioas->iopt,
						 hwpt->domain);
			list_del(&hwpt->hwpt_item);
		}
I'm not sure how this can be fully shared between detach and replace. Here some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is supposed to be done in the hwpt's destroy() actually. We couldn't move it because it'd need the new domain_alloc_user op and its implementation in ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:

    Remove device from current hwpt;
    if (last_device in hwpt) {
        Remove hwpt domain from current iopt;
        if (last_device in group)
            detach group from hwpt domain;
    }
    if (first device in group) {
        attach group to new hwpt domain;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Add device to new hwpt;
but the replace flow is different on the detach part:

    if (first device in group) {
        replace group's domain from current hwpt to new hwpt;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Remove device from old hwpt;
    if (last_device in old hwpt)
        Remove hwpt domain from old iopt;
    Add device to new hwpt;
Oh... thinking about it carefully, I see the flow does look a bit off. Perhaps it's better to have a similar flow for replace.

However, I think something would still be different due to its tricky nature, especially for a multi-device iommu_group.

An iommu_detach_group() happens only when a device is the last one in its group to go through the routine via a DETACH ioctl, while an iommu_group_replace_domain() happens only when the device is the first one in its group to go through the routine via another ATTACH ioctl. However, when the first device does a replace, the cleanup routine of the old hwpt is a NOP, since there are still other devices (of the same group) in the old hwpt. There are two implications here:
- Any other device in the same group has to forcibly switch to the new domain, when the first device does a replace.
- The actual hwpt cleanup can only happen at the last device's replace call.
This also means that the kernel has to rely on the integrity of user space: it must replace all active devices in the group.

Jason suggested moving the hwpt cleanup out of the detach path in his reply to patch7. Presumably with that fix the major tricky point about the hwpt in the following scenarios would disappear. Let's see how it will work out then. 😊
On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Saturday, February 11, 2023 8:10 AM
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		if (list_empty(&hwpt->devices)) {
			iopt_table_remove_domain(&hwpt->ioas->iopt,
						 hwpt->domain);
			list_del(&hwpt->hwpt_item);
		}
I'm not sure how this can be fully shared between detach and replace. Here some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is supposed to be done in the hwpt's destroy() actually. We couldn't move it because it'd need the new domain_alloc_user op and its implementation in ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:

    Remove device from current hwpt;
    if (last_device in hwpt) {
        Remove hwpt domain from current iopt;
        if (last_device in group)
            detach group from hwpt domain;
    }
    if (first device in group) {
        attach group to new hwpt domain;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Add device to new hwpt;
but the replace flow is different on the detach part:

    if (first device in group) {
        replace group's domain from current hwpt to new hwpt;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Remove device from old hwpt;
    if (last_device in old hwpt)
        Remove hwpt domain from old iopt;
    Add device to new hwpt;
Oh... thinking about it carefully, I see the flow does look a bit off. Perhaps it's better to have a similar flow for replace.

However, I think something would still be different due to its tricky nature, especially for a multi-device iommu_group.

An iommu_detach_group() happens only when a device is the last one in its group to go through the routine via a DETACH ioctl, while an iommu_group_replace_domain() happens only when the device is the first one in its group to go through the routine via another ATTACH ioctl. However, when the first device does a replace, the cleanup routine of the old hwpt is a NOP, since there are still other devices (of the same group) in the old hwpt. There are two implications here:
- Any other device in the same group has to forcibly switch to the new domain, when the first device does a replace.
- The actual hwpt cleanup can only happen at the last device's replace call.
This also means that the kernel has to rely on the integrity of user space: it must replace all active devices in the group.

Jason suggested moving the hwpt cleanup out of the detach path in his reply to patch7. Presumably with that fix the major tricky point about the hwpt in the following scenarios would disappear. Let's see how it will work out then. 😊
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
Should we only call iommu_group_replace_domain() when the last device in the group gets a replace IOCTL?
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Monday, February 13, 2023 3:49 PM
On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Saturday, February 11, 2023 8:10 AM
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> +		if (list_empty(&hwpt->devices)) {
> +			iopt_table_remove_domain(&hwpt->ioas->iopt,
> +						 hwpt->domain);
> +			list_del(&hwpt->hwpt_item);
> +		}
I'm not sure how this can be fully shared between detach and replace. Here some work, e.g. the above, needs to be done before calling iommu_group_replace_domain(), while other work can be done afterwards.
This iopt_table_remove_domain/list_del is supposed to be done in the hwpt's destroy() actually. We couldn't move it because it'd need the new domain_alloc_user op and its implementation in ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain().
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:

    Remove device from current hwpt;
    if (last_device in hwpt) {
        Remove hwpt domain from current iopt;
        if (last_device in group)
            detach group from hwpt domain;
    }
    if (first device in group) {
        attach group to new hwpt domain;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Add device to new hwpt;
but the replace flow is different on the detach part:

    if (first device in group) {
        replace group's domain from current hwpt to new hwpt;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Remove device from old hwpt;
    if (last_device in old hwpt)
        Remove hwpt domain from old iopt;
    Add device to new hwpt;
Oh... thinking about it carefully, I see the flow does look a bit off. Perhaps it's better to have a similar flow for replace.

However, I think something would still be different due to its tricky nature, especially for a multi-device iommu_group.

An iommu_detach_group() happens only when a device is the last one in its group to go through the routine via a DETACH ioctl, while an iommu_group_replace_domain() happens only when the device is the first one in its group to go through the routine via another ATTACH ioctl. However, when the first device does a replace, the cleanup routine of the old hwpt is a NOP, since there are still other devices (of the same group) in the old hwpt. There are two implications here:
- Any other device in the same group has to forcibly switch to the new domain, when the first device does a replace.
- The actual hwpt cleanup can only happen at the last device's replace call.
This also means that the kernel has to rely on the integrity of user space: it must replace all active devices in the group.

Jason suggested moving the hwpt cleanup out of the detach path in his reply to patch7. Presumably with that fix the major tricky point about the hwpt in the following scenarios would disappear. Let's see how it will work out then. 😊
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
Should we only call iommu_group_replace_domain() when the last device in the group gets a replace IOCTL?
With Jason's proposal, the attach/detach paths now only handle the connection between a device and a hwpt. I don't see a problem here with devices in a group attached to different hwpts in this short transition window.

There is no impact on ioas/iopt operations (though the user is not expected to change them before the group transition is fully completed). Just that the old hwpt cannot be destroyed until the last dev detaches from it.
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Jason
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Yea, I was thinking that we'd need to block any access to a pending device's idev->hwpt before the kernel finishes the "NOP" IOCTL from userspace, maybe with a helper: (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
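For instance, wrapped up as a small helper (hypothetical name; a sketch of the check above, not code from this series):

	/*
	 * Hypothetical helper: detect an idev whose group has already
	 * been moved to a new domain by another device's replace, while
	 * its own iommufd state (idev->hwpt) has not caught up yet.
	 */
	static bool iommufd_device_replace_pending(struct iommufd_device *idev)
	{
		return iommu_get_domain_for_dev(idev->dev) !=
		       idev->hwpt->domain;
	}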
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 6:54 PM
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Yea, I was thinking that we'd need to block any access to a pending device's idev->hwpt before the kernel finishes the "NOP" IOCTL from userspace, maybe with a helper: (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
I didn't see what would be broken w/o such blocking measure.
Can you elaborate?
On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 6:54 PM
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Yea, I was thinking that we'd need to block any access to a pending device's idev->hwpt before the kernel finishes the "NOP" IOCTL from userspace, maybe with a helper: (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
I didn't see what would be broken w/o such blocking measure.
Can you elaborate?
iommu_group { idev1, idev2 }
(1) Attach all devices to domain1 {
	group->domain = domain1;
	idev1->hwpt = hwpt1;	// domain1
	idev2->hwpt = hwpt1;	// domain1
    }
(2) Attach (replace) idev1 only to domain2 {
	group->domain = domain2;
	idev1->hwpt = hwpt2;	// domain2
	idev2->hwpt == domain1	// != iommu_get_domain_for_dev()
    }
Then if user space isn't aware of these and continues to do IOMMU_IOAS_MAP for idev2: although the IOVA mappings may be added onto domain2 correctly, domain2's iopt itree won't reflect that until idev2->hwpt is updated too, right?
And the tricky thing is that, though we advocate a device-centric uAPI, we'd still have to ask user space to align the devices in the same iommu_group, which is also not present in QEMU's RFC v3 series.
The traditional detach+attach flow doesn't seem to have this issue, since there's no re-entry: the flow always detaches all devices first before attaching to another domain.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 15, 2023 9:59 AM
On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 6:54 PM
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Yea, I was thinking that we'd need to block any access to a pending device's idev->hwpt before the kernel finishes the "NOP" IOCTL from userspace, maybe with a helper: (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
I didn't see what would be broken w/o such blocking measure.
Can you elaborate?
iommu_group { idev1, idev2 }
(1) Attach all devices to domain1 {
	group->domain = domain1;
	idev1->hwpt = hwpt1;	// domain1
	idev2->hwpt = hwpt1;	// domain1
    }
(2) Attach (replace) idev1 only to domain2 {
	group->domain = domain2;
	idev1->hwpt = hwpt2;	// domain2
	idev2->hwpt == domain1	// != iommu_get_domain_for_dev()
    }
Then if user space isn't aware of these and continues to do IOMMU_IOAS_MAP for idev2: although the IOVA mappings may be added onto domain2 correctly, domain2's iopt itree won't reflect that until idev2->hwpt is updated too, right?
IOMMU_IOAS_MAP is for ioas instead of for device.
even w/o any device attached we still allow ioas map.
Also note that in case domain1 is still actively attached to idev3 in another group, banning IOAS_MAP due to idev2 being in transition would also be incorrect for idev3.
And the tricky thing is that, though we advocate a device-centric uAPI, we'd still have to ask user space to align the devices in the same iommu_group, which is also not present in QEMU's RFC v3 series.
IMHO this requirement has been stated clearly from the start when designing this device centric interface. QEMU should be improved to take care of it.
On Wed, Feb 15, 2023 at 02:15:59AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 15, 2023 9:59 AM
On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 6:54 PM
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
What about point 1? If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right?
We have a complicated model for multi-device groups...
The first device in the group to change domains must move all the devices in the group
But userspace is still expected to run through and change all the other devices
So replace should be a NOP if the group is already linked to the right domain.
This patch needs to make sure that inconsistency in the view between the iommu_group and the iommufd model doesn't cause a functional problem.
Yea, I was thinking that we'd need to block any access to a pending device's idev->hwpt before the kernel finishes the "NOP" IOCTL from userspace, maybe with a helper: (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
I didn't see what would be broken w/o such blocking measure.
Can you elaborate?
iommu_group { idev1, idev2 }
(1) Attach all devices to domain1 {
	group->domain = domain1;
	idev1->hwpt = hwpt1;	// domain1
	idev2->hwpt = hwpt1;	// domain1
    }
(2) Attach (replace) idev1 only to domain2 {
	group->domain = domain2;
	idev1->hwpt = hwpt2;	// domain2
	idev2->hwpt == domain1	// != iommu_get_domain_for_dev()
    }
Then if user space isn't aware of these and continues to do IOMMU_IOAS_MAP for idev2: although the IOVA mappings may be added onto domain2 correctly, domain2's iopt itree won't reflect that until idev2->hwpt is updated too, right?
IOMMU_IOAS_MAP is for ioas instead of for device.
even w/o any device attached we still allow ioas map.
Also note that in case domain1 is still actively attached to idev3 in another group, banning IOAS_MAP due to idev2 being in transition would also be incorrect for idev3.
OK. That's likely true. It doesn't seem to be correct to block an IOMMU_IOAS_MAP.
But things will be out of control if user space continues mapping something onto domain1's iopt for idev2, even after it is covertly attached to domain2's iopt by the replace routine. I wonder how the kernel should handle this and keep the consistency between IOMMUFD objects and the iommu_group.
And the tricky thing is that, though we advocate a device-centric uAPI, we'd still have to ask user space to align the devices in the same iommu_group, which is also not present in QEMU's RFC v3 series.
IMHO this requirement has been stated clearly from the start when designing this device centric interface. QEMU should be improved to take care of it.
OK. It has to be so...
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 15, 2023 3:15 PM
But things will be out of control if user space continues mapping something onto domain1's iopt for idev2, even after it is covertly attached to domain2's iopt by the replace routine. I wonder how the kernel should handle this and keep the consistency between IOMMUFD objects and the iommu_group.
this is where I don't understand.
Domain mappings just reflect what an address space has.

Take Qemu for example: w/o vIOMMU, domain mappings are added/removed according to changes in the GPA address space.

w/ vIOMMU, they are synced with the guest page table.
why would userspace construct mappings for a specific device?
when device is moved from one domain to another domain, it just bears with whatever the new domain allows it to access.
On Tue, Feb 14, 2023 at 11:15:09PM -0800, Nicolin Chen wrote:
But things will be out of control if user space continues mapping something onto domain1's iopt for idev2, even after it is covertly attached to domain2's iopt by the replace routine. I wonder how the kernel should handle this and keep the consistency between IOMMUFD objects and the iommu_group.
I've been looking at this; the reason the locking is such a PITA is that we are still trying to use the device list shortcut.
We need to have an iommu group object instead; then everything will work smoothly.
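For illustration only, such an object might look something like this (a hypothetical sketch; no struct like it exists in the posted series):

	/*
	 * Hypothetical per-group object owned by iommufd, so attach/
	 * replace can be tracked once per iommu_group instead of by
	 * walking each hwpt's device list. Illustration only.
	 */
	struct iommufd_group {
		struct iommu_group *group;
		struct iommufd_hw_pagetable *hwpt; /* domain the group runs on */
		unsigned int device_count;	   /* bound idevs in this group */
		struct mutex lock;
	};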
Jason
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:

    Remove device from current hwpt;
    if (last_device in hwpt) {
        Remove hwpt domain from current iopt;
        if (last_device in group)
            detach group from hwpt domain;
    }
    if (first device in group) {
        attach group to new hwpt domain;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Add device to new hwpt;
but the replace flow is different on the detach part:

    if (first device in group) {
        replace group's domain from current hwpt to new hwpt;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Remove device from old hwpt;
    if (last_device in old hwpt)
        Remove hwpt domain from old iopt;
    Add device to new hwpt;
I have yet to figure out whether we have sufficient lock protection to prevent other paths from using the old iopt/hwpt to find the device which is already attached to a different domain.
With Jason's new series, the detach() routine is lighter now.
I wonder if it'd be safer now to do the detach() call after iommu_group_replace_domain()?
Thanks Nic
@@ -196,10 +198,41 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
 	return false;
 }
 
+/**
+ * __iommufd_device_detach - Detach a device from idev->hwpt
+ * @idev: device to detach
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace call does not need the iommu_detach_group().
+ */
+static void __iommufd_device_detach(struct iommufd_device *idev,
+				    bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (detach_group && !iommufd_hw_pagetable_has_group(hwpt, idev->group))
+		iommu_detach_group(hwpt->domain, idev->group);
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+	mutex_unlock(&hwpt->devices_lock);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+
 /* On success this consumes a hwpt reference from the caller */
 static int iommufd_device_do_attach(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt)
 {
+	struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
 	phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
 	int rc;
@@ -237,7 +270,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	 * the group once for the first device that is in the group.
 	 */
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		rc = iommu_attach_group(hwpt->domain, idev->group);
+		rc = iommu_group_replace_domain(idev->group, hwpt->domain);
 		if (rc)
 			goto out_iova;
@@ -249,6 +282,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 		}
 	}
 
+	/* Detach from the cur_hwpt without iommu_detach_group() */
+	if (cur_hwpt)
+		__iommufd_device_detach(idev, false);
+
 	idev->hwpt = hwpt;
 	/* The HWPT reference from the caller is moved to this list */
 	list_add(&idev->devices_item, &hwpt->devices);
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 7:00 PM
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have following flow:
Remove device from current hwpt; if (last_device in hwpt) { Remove hwpt domain from current iopt; if (last_device in group) detach group from hwpt domain; } if (first device in group) { attach group to new hwpt domain; if (first_device in hwpt) Add hwpt domain to new iopt; Add device to new hwpt;
but replace flow is different on the detach part:
if (first device in group) { replace group's domain from current hwpt to new hwpt; if (first_device in hwpt) Add hwpt domain to new iopt; } Remove device from old hwpt; if (last_device in old hwpt) Remove hwpt domain from old iopt; Add device to new hwpt;
I have yet to figure out whether we have sufficient lock protection to prevent other paths from using the old iopt/hwpt to find the device which is already attached to a different domain.
With Jason's new series, the detach() routine is lighter now.
I wonder if it'd be safer now to do the detach() call after iommu_group_replace_domain()?
yes, looks so.
On Wed, Feb 15, 2023 at 01:38:32AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Tuesday, February 14, 2023 7:00 PM
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
My confusion is that we have different flows between detach/attach and replace.
today with separate detach+attach we have the following flow:

    Remove device from current hwpt;
    if (last_device in hwpt) {
        Remove hwpt domain from current iopt;
        if (last_device in group)
            detach group from hwpt domain;
    }
    if (first device in group) {
        attach group to new hwpt domain;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Add device to new hwpt;
but the replace flow is different on the detach part:

    if (first device in group) {
        replace group's domain from current hwpt to new hwpt;
        if (first_device in hwpt)
            Add hwpt domain to new iopt;
    }
    Remove device from old hwpt;
    if (last_device in old hwpt)
        Remove hwpt domain from old iopt;
    Add device to new hwpt;
I have yet to figure out whether we have sufficient lock protection to prevent other paths from using the old iopt/hwpt to find the device which is already attached to a different domain.
With Jason's new series, the detach() routine is lighter now.
I wonder if it'd be safer now to do the detach() call after iommu_group_replace_domain()?
yes, looks so.
Will rebase a v3 once Jason updates his series. Thanks!
Remove the vdev->iommufd_attached check, since the kernel can internally handle a replacement of the IO page table now.
Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI header.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/vfio/iommufd.c    | 3 ---
 include/uapi/linux/vfio.h | 6 ++++++
 2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index dc9feab73db7..8b719d9424b8 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -97,9 +97,6 @@ int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 	if (!vdev->iommufd_device)
 		return -EINVAL;
 
-	if (vdev->iommufd_attached)
-		return -EBUSY;
-
 	rc = iommufd_device_attach(vdev->iommufd_device, pt_id);
 	if (rc)
 		return rc;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c86cfe442884..69f3ceb18d7d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -236,6 +236,12 @@ struct vfio_device_bind_iommufd {
  *
  * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.
  *
+ * If a vfio device is currently attached to a valid hw_pagetable, without doing
+ * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl
+ * passing in another hw_pagetable (hwpt) id is allowed. This action, also known
+ * as a hw_pagetable replacement, will replace the device's currently attached
+ * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
+ *
  * @argsz: user filled size of this data.
  * @flags: must be 0.
  * @pt_id: Input the target id which can represent an ioas or a hwpt
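With this change, a replacement from user space is just a second ATTACH on an already-attached device. A minimal usage sketch (struct and ioctl names per the vfio uAPI header; assumes a vfio device cdev fd already bound to an iommufd, with error handling omitted):

	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int attach_pt(int device_fd, __u32 pt_id)
	{
		struct vfio_device_attach_iommufd_pt attach = {
			.argsz = sizeof(attach),
			.flags = 0,
			.pt_id = pt_id,
		};

		return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
	}

	void replace_example(int device_fd, __u32 hwpt1_id, __u32 hwpt2_id)
	{
		attach_pt(device_fd, hwpt1_id);	/* regular ATTACH */
		attach_pt(device_fd, hwpt2_id);	/* second call: REPLACE */
	}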
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
Remove the vdev->iommufd_attached check, since the kernel can internally handle a replacement of the IO page table now.
Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI header.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
Suggested-by: Kevin Tian kevin.tian@intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/vfio/vfio_main.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8559c3dfb335..c7f3251ad6e5 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 
 	if (iova > ULONG_MAX)
 		return -EINVAL;
+	if (!device->ops->dma_unmap)
+		return -EINVAL;
 	/*
 	 * VFIO ignores the sub page offset, npages is from the start of
 	 * a PAGE_SIZE chunk of IOVA. The caller is expected to recover
@@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
 	if (device->iommufd_access) {
 		if (WARN_ON(iova > ULONG_MAX))
 			return;
+		if (!device->ops->dma_unmap)
+			return;
 		iommufd_access_unpin_pages(device->iommufd_access,
 					   ALIGN_DOWN(iova, PAGE_SIZE),
 					   npage * PAGE_SIZE);
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
Suggested-by: Kevin Tian kevin.tian@intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/vfio/vfio_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8559c3dfb335..c7f3251ad6e5 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 
 	if (iova > ULONG_MAX)
 		return -EINVAL;
+	if (!device->ops->dma_unmap)
+		return -EINVAL;
 	/*
 	 * VFIO ignores the sub page offset, npages is from the start of
 	 * a PAGE_SIZE chunk of IOVA. The caller is expected to recover
@@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
 	if (device->iommufd_access) {
 		if (WARN_ON(iova > ULONG_MAX))
 			return;
+		if (!device->ops->dma_unmap)
+			return;
IMHO this restriction applies to both iommufd and legacy container.
On Thu, Feb 09, 2023 at 04:10:04AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
Suggested-by: Kevin Tian kevin.tian@intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/vfio/vfio_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8559c3dfb335..c7f3251ad6e5 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 
 	if (iova > ULONG_MAX)
 		return -EINVAL;
+	if (!device->ops->dma_unmap)
+		return -EINVAL;
 	/*
 	 * VFIO ignores the sub page offset, npages is from the start of
 	 * a PAGE_SIZE chunk of IOVA. The caller is expected to recover
@@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
 	if (device->iommufd_access) {
 		if (WARN_ON(iova > ULONG_MAX))
 			return;
+		if (!device->ops->dma_unmap)
+			return;
IMHO this restriction applies to both iommufd and legacy container.
Yeah that makes sense
Jason
On Thu, Feb 09, 2023 at 09:26:15AM -0400, Jason Gunthorpe wrote:
On Thu, Feb 09, 2023 at 04:10:04AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
Suggested-by: Kevin Tian kevin.tian@intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/vfio/vfio_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8559c3dfb335..c7f3251ad6e5 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 
 	if (iova > ULONG_MAX)
 		return -EINVAL;
+	if (!device->ops->dma_unmap)
+		return -EINVAL;
 	/*
 	 * VFIO ignores the sub page offset, npages is from the start of
 	 * a PAGE_SIZE chunk of IOVA. The caller is expected to recover
@@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
 	if (device->iommufd_access) {
 		if (WARN_ON(iova > ULONG_MAX))
 			return;
+		if (!device->ops->dma_unmap)
+			return;
IMHO this restriction applies to both iommufd and legacy container.
Yeah that makes sense
Will move them to the beginning of vfio_pin_pages/vfio_unpin_pages.
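i.e. something along these lines (a sketch of the v3 direction only, not the final patch; the tail of the function is elided):

	int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
			   int npage, int prot, struct page **pages)
	{
		/* Reject up front, for both the legacy container and
		 * iommufd backends: a driver that cannot be notified to
		 * unpin must never be allowed to pin.
		 */
		if (!device->ops->dma_unmap)
			return -EINVAL;

		/* ... existing container/iommufd pin logic unchanged ... */
	}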
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
QEMU with this feature should have the vIOMMU maintain a cache of the guest io page table addresses and assign a unique IOAS to each unique guest page table.
I still didn't see a clear reply on why the above is required. Please elaborate.
On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
QEMU with this feature should have the vIOMMU maintain a cache of the guest io page table addresses and assign a unique IOAS to each unique guest page table.
I still didn't see a clear reply on why the above is required. Please elaborate.
Hmm.. this piece was directly copied from Jason's input before I sent v1, and I thought he had explained it in the previous version. I can drop it in v3 if that makes you happy.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, February 10, 2023 12:13 AM
On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, February 8, 2023 5:18 AM
QEMU with this feature should have the vIOMMU maintain a cache of the guest io page table addresses and assign a unique IOAS to each unique guest page table.
I still didn't see a clear reply on why the above is required. Please elaborate.
Hmm.. this piece was directly copied from Jason's input before I sent v1, and I thought he had explained it in the previous version. I can drop it in v3 if that makes you happy.
Jason's reply was:
I think the guidance is about the change to the VFIO uAPI, where it is now possible to change the attached domain; previously that was not possible.
I cannot connect it to the above statement. Anyway, if you think that guidance to userspace is necessary, please connect it to the replace semantics to explain the motivation.