On Wed, Oct 29, 2025 at 11:25:34AM +0200, Leon Romanovsky wrote:
> On Tue, Oct 28, 2025 at 09:27:26PM -0300, Jason Gunthorpe wrote:
> > On Sun, Oct 26, 2025 at 09:44:12PM -0700, Vivek Kasireddy wrote:
> > > In a typical dma-buf use case, a dmabuf exporter makes its buffer
> > > available to an importer by mapping it using DMA APIs
> > > such as dma_map_sgtable() or dma_map_resource(). However, this
> > > is not desirable in some cases where the exporter and importer
> > > are directly connected via a physical or virtual link (or
> > > interconnect) and the importer can access the buffer without
> > > having it DMA mapped.
> >
> > I think my explanation was not so clear, I spent a few hours and typed
> > in what I was thinking about here:
> >
> > https://github.com/jgunthorpe/linux/commits/dmabuf_map_type
> >
> > I didn't type in the last patch for iommufd side, hopefully it is
> > clear enough. Adding iov should follow the pattern of the "physical
> > address list" patch.
> >
> > I think the use of EXPORT_SYMBOL_FOR_MODULES() to lock down the
> > physical address list mapping type to iommufd is clever and I'm hoping
> > it addresses Christian's concerns about abuse.
> >
> > Single GPU drivers can easily declare their own mapping type for
> > their own private interconnect without needing to change the core
> > code.
> >
> > This seems to be fairly straightforward and reasonably type safe..
>
> It makes me wonder what I am supposed to do with my series now [1].
> How do you see the submission plan now?
>
> [1] https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/
IMHO that series needs the small tweaks and should go in this merge
window, ideally along with the iommufd half.
I think this thread is a topic for the next cycle, I expect it will
take some time to converge on the dmabuf core changes, and adapting
your series is quite simple.
Jason
On Sun, Oct 26, 2025 at 09:44:12PM -0700, Vivek Kasireddy wrote:
> In a typical dma-buf use case, a dmabuf exporter makes its buffer
> available to an importer by mapping it using DMA APIs
> such as dma_map_sgtable() or dma_map_resource(). However, this
> is not desirable in some cases where the exporter and importer
> are directly connected via a physical or virtual link (or
> interconnect) and the importer can access the buffer without
> having it DMA mapped.
I think my explanation was not so clear, I spent a few hours and typed
in what I was thinking about here:
https://github.com/jgunthorpe/linux/commits/dmabuf_map_type
I didn't type in the last patch for iommufd side, hopefully it is
clear enough. Adding iov should follow the pattern of the "physical
address list" patch.
I think the use of EXPORT_SYMBOL_FOR_MODULES() to lock down the
physical address list mapping type to iommufd is clever and I'm hoping
it addresses Christian's concerns about abuse.
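As a concrete sketch of that idea (dma_buf_map_physical() is a made-up
name here, standing in for whatever helper the physical address list
patch actually exports):

/*
 * Hypothetical exporter-facing helper for the physical address list
 * mapping type; restricting the export means arbitrary drivers cannot
 * call it to fish out raw physical addresses.
 */
struct phys_vec *dma_buf_map_physical(struct dma_buf_attachment *attach,
				      size_t *nelms);

/* next to the definition in the dma-buf core: */
EXPORT_SYMBOL_FOR_MODULES(dma_buf_map_physical, "iommufd");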
Single GPU drivers can easily declare their own mapping type for
their own private interconnect without needing to change the core
code.
This seems to be fairly straightforward and reasonably type safe..
What do you think?
Jason
On Tue, Oct 28, 2025 at 05:39:39AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
>
> > Subject: Re: [RFC v2 1/8] dma-buf: Add support for map/unmap APIs for
> > interconnects
> >
> > On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote:
> > > For the map operation, the dma-buf core will create an xarray but
> > > the exporter needs to populate it with the interconnect specific
> > > addresses. And, similarly for unmap, the exporter is expected to
> > > cleanup the individual entries of the xarray.
> >
> > I don't think we should limit this to xarrays, nor do I think it is a
> > great datastructure for what is usually needed here..
> One of the goals (as suggested by Christian) is to have a container that
> can be used with an iterator.
I thought Christian was suggesting to avoid the container and have
some kind of iterator?
> So, instead of creating a new data structure,
> I figured using an xarray would make sense here. And, since the entries
> of an xarray can be of any type, I think another advantage is that the
> dma-buf core only needs to be aware of the xarray but the exporter can
> use an interconnect specific type to populate the entries that the importer
> would be aware of.
It is excessively memory wasteful.
> > I just posted the patches showing what iommufd needs, and it wants
> > something like
> >
> > struct mapping {
> > 	struct p2p_provider *provider;
> > 	size_t nelms;
> > 	struct phys_vec *phys;
> > };
> >
> > Which is not something that makes sense as an xarray.
> If we do not want to use an xarray, I guess we can try to generalize the
> struct that holds the addresses and any additional info (such as provider).
> Would any of the following look OK to you:
I think just don't try to have a general struct; it is not required
once we have interconnects. Each interconnect can define what makes
sense for it.
> struct dma_buf_ranges {
> 	struct range *ranges;
> 	unsigned int nranges;
> 	void *ranges_data;
> };
Like this is just pointless, it destroys type safety for no benefit.
> > struct dma_buf_iov_interconnect_ops {
> > struct dma_buf_interconnect_ops ic_ops;
> > struct xx *(*map)(struct dma_buf_attachment *attach,
> Do we want each specific interconnect to have its own return type for map?
I think yes, then you have type safety and so on. The types should all
be different. We need to get away from using dma_addr_t or phys_addr_t
for something that is not in those address spaces.
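As a rough illustration of what that could look like (iov_addr_t and
struct dma_buf_iov_range are made-up names, not from any posted patch):

/*
 * Wrapper type for addresses in the interconnect's own address space,
 * in the same spirit as pfn_t, so they cannot be silently mixed up
 * with dma_addr_t or phys_addr_t.
 */
typedef struct {
	u64 val;
} iov_addr_t;

struct dma_buf_iov_range {
	iov_addr_t addr;	/* e.g. an offset within the exporter's BAR */
	size_t len;
};

Passing such an address where a phys_addr_t is expected then becomes a
compile error instead of a silent bug.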
Jason
On Mon, Oct 27, 2025 at 04:13:05PM -0700, David Matlack wrote:
> On Mon, Oct 13, 2025 at 8:44 AM Leon Romanovsky <leon(a)kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > Add support for exporting PCI device MMIO regions through dma-buf,
> > enabling safe sharing of non-struct page memory with controlled
> > lifetime management. This allows RDMA and other subsystems to import
> > dma-buf FDs and build them into memory regions for PCI P2P operations.
>
> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > + * regions selected.
> > + *
> > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
> > + *
> > + * Return: The fd number on success, -1 and errno is set on failure.
> > + */
> > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > +
> > +struct vfio_region_dma_range {
> > +	__u64 offset;
> > +	__u64 length;
> > +};
> > +
> > +struct vfio_device_feature_dma_buf {
> > +	__u32 region_index;
> > +	__u32 open_flags;
> > +	__u32 flags;
> > +	__u32 nr_ranges;
> > +	struct vfio_region_dma_range dma_ranges[];
> > +};
>
> This uAPI would be a good candidate for a VFIO selftest. You can test
> that it returns an error when it's supposed to, and a valid fd when
> it's supposed to. And once the iommufd importer side is ready, we can
> extend the test and verify that the fd can be mapped into iommufd.
No problem, I'll add such a test, but let's focus on making sure that this
series is accepted first.
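As a rough sketch, such a test could issue something like the below,
following the usual VFIO_DEVICE_FEATURE calling convention; the
VFIO_DEVICE_FEATURE_DMA_BUF define and the two structs come from this
patch, and the helper itself is only illustrative:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_dmabuf_get(int device_fd, __u32 region_index,
			   __u64 offset, __u64 length)
{
	union {
		struct vfio_device_feature feat;
		__u64 align;	/* keep the __u64 range fields aligned */
		__u8 buf[sizeof(struct vfio_device_feature) +
			 sizeof(struct vfio_device_feature_dma_buf) +
			 sizeof(struct vfio_region_dma_range)];
	} arg = {};
	struct vfio_device_feature_dma_buf *get = (void *)arg.feat.data;

	arg.feat.argsz = sizeof(arg);
	arg.feat.flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	get->region_index = region_index;
	get->open_flags = O_RDWR | O_CLOEXEC;
	get->nr_ranges = 1;
	get->dma_ranges[0].offset = offset;
	get->dma_ranges[0].length = length;

	/* Returns the new dmabuf fd, or -1 with errno set */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, &arg.feat);
}

A negative test would pass an unaligned or out-of-range slice and expect
an error, and a positive test would check that a valid dmabuf fd comes
back.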
Thanks
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF FDs, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.
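From userspace this is just the existing IOMMU_IOAS_MAP_FILE call with a
dmabuf fd in place of a memfd; a rough sketch (the helper and its
arguments are illustrative, not part of the series):

#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int map_dmabuf_slice(int iommufd, __u32 ioas_id, int dmabuf_fd,
			    __u64 start, __u64 length, __u64 iova)
{
	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = dmabuf_fd,	/* a dmabuf fd instead of a memfd */
		.start = start,		/* offset of the slice within the dmabuf */
		.length = length,
		.iova = iova,
	};

	return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}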
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature of the VFIO
type 1 container that iommufd couldn't provide.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue an FLR it should
do the map/unmap to iommufd.
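For illustration, the userspace sequence around a revoking operation might
look like this (error handling omitted; VFIO_DEVICE_RESET stands in for
whatever triggers the revoke, and the mapping itself uses the
IOMMU_IOAS_MAP_FILE/IOMMU_IOAS_UNMAP uAPIs):

#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

static void remap_around_reset(int iommufd, int device_fd, __u32 ioas_id,
			       __u64 bar_iova, __u64 bar_length)
{
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = bar_iova,
		.length = bar_length,
	};

	/* 1. Drop the MMIO mapping before the revoking operation */
	ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);

	/* 2. The operation that revokes the dmabuf, e.g. an FLR */
	ioctl(device_fd, VFIO_DEVICE_RESET);

	/* 3. Re-establish the mapping with IOMMU_IOAS_MAP_FILE afterwards;
	 *    the kernel never remaps automatically.
	 */
}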
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
The branch has various modifications to Leon's series I've suggested.
Jason Gunthorpe (8):
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 74 +++-
drivers/iommu/iommufd/io_pagetable.h | 53 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 13 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 142 ++++++
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
10 files changed, 741 insertions(+), 63 deletions(-)
base-commit: fc882154e421f82677925d33577226e776bb07a4
--
2.43.0
On Sun, Oct 26, 2025 at 09:44:14PM -0700, Vivek Kasireddy wrote:
> +/**
> + * dma_buf_match_interconnects - determine if there is a specific interconnect
> + * that is supported by both exporter and importer.
> + * @attach: [in] attachment to populate ic_match field
> + * @exp: [in] array of interconnects supported by exporter
> + * @exp_ics: [in] number of interconnects supported by exporter
> + * @imp: [in] array of interconnects supported by importer
> + * @imp_ics: [in] number of interconnects supported by importer
> + *
> + * This helper function iterates through the lists of interconnects supported by
> + * both exporter and importer to find a match. A successful match means that
> + * a common interconnect type is supported by both parties and the exporter's
> + * match_interconnect() callback also confirms that the importer is compatible
> + * with the exporter for that interconnect type.
Document which of the exporter/importer is supposed to call this
> + *
> + * If a match is found, the attach->ic_match field is populated with a copy
> + * of the exporter's match data.
> + * Return: true if a match is found, false otherwise.
> + */
> +bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
> + const struct dma_buf_interconnect_match *exp,
> + unsigned int exp_ics,
> + const struct dma_buf_interconnect_match *imp,
> + unsigned int imp_ics)
> +{
> + const struct dma_buf_interconnect_ops *ic_ops;
> + struct dma_buf_interconnect_match *ic_match;
> + struct dma_buf *dmabuf = attach->dmabuf;
> + unsigned int i, j;
> +
> + if (!exp || !imp)
> + return false;
> +
> + if (!attach->allow_ic)
> + return false;
Seems redundant with this check for ic_ops == NULL:
> + ic_ops = dmabuf->ops->interconnect_ops;
> + if (!ic_ops || !ic_ops->match_interconnect)
> + return false;
This seems like too much of a maze to me..
I think you should structure it like this. First declare an interconnect:
struct dma_buf_interconnect iov_interconnect = {
	.name = "IOV interconnect",
	.match = ..
};
Then the exporters "subclass"
struct dma_buf_interconnect_ops vfio_iov_interconnect = {
	.interconnect = &iov_interconnect,
	.map = vfio_map,
};
I guess no container_of technique..
Then in VFIO's attach trigger the new code:
const struct dma_buf_interconnect_match vfio_exp_ics[] = {
	{ &vfio_iov_interconnect },
};
dma_buf_match_interconnects(attach, vfio_exp_ics, ARRAY_SIZE(vfio_exp_ics));
Which will call back to the importer:
static const struct dma_buf_attach_ops xe_dma_buf_attach_ops = {
	.get_importer_interconnects = ..
};
dma_buf_match_interconnects() would call
aops->get_importer_interconnects
and match first on .interconnect, then call the interconnect->match
function with the exp/imp match structs if not NULL.
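Something like this as a rough sketch of the flow (names follow the
structures being proposed in this thread, nothing here is an existing
API, and the get_importer_interconnects()/match() signatures are
guesses):

bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
				 const struct dma_buf_interconnect_match *exp,
				 unsigned int exp_ics)
{
	const struct dma_buf_interconnect_match *imp;
	unsigned int imp_ics, i, j;

	imp = attach->importer_ops->get_importer_interconnects(attach,
							       &imp_ics);
	if (!imp)
		return false;

	for (i = 0; i != exp_ics; i++) {
		for (j = 0; j != imp_ics; j++) {
			if (exp[i].ic != imp[j].ic)
				continue;
			/* Optional per-interconnect compatibility check */
			if (exp[i].ic->match &&
			    !exp[i].ic->match(attach, &exp[i], &imp[j]))
				continue;
			/* Record the winning exporter entry on the attachment */
			attach->ic_match = &exp[i];
			return true;
		}
	}
	return false;
}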
> +struct dma_buf_interconnect_match {
> +	const struct dma_buf_interconnect *type;
> +	struct device *dev;
> +	unsigned int bar;
> +};
This should be more general, dev and bar are unique to the iov
importer. Maybe just simple:
struct dma_buf_interconnect_match {
	struct dma_buf_interconnect *ic; // no need for type
	const struct dma_buf_interconnect_ops *exporter_ic_ops;
	u64 match_data[2]; // dev and bar are IOV specific, generalize
};
Then some helper
const struct dma_buf_interconnect_match supports_ics[] = {
	IOV_INTERCONNECT(&vfio_iov_interconnect, dev, bar),
};
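where IOV_INTERCONNECT() could be a trivial initializer macro along these
lines (only a sketch against the generalized match struct above; the
pointer-to-u64 cast is purely illustrative):

#define IOV_INTERCONNECT(ops, _dev, _bar)				\
	{								\
		.ic = &iov_interconnect,				\
		.exporter_ic_ops = (ops),				\
		.match_data = { (u64)(uintptr_t)(_dev), (_bar) },	\
	}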
And it would be nice if interconnect-aware drivers could more easily
interwork with non-interconnect importers.
So I'd add an exporter type of 'p2p dma mapped scatterlist' that just
matches the legacy importer.
Jason
On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote:
> For the map operation, the dma-buf core will create an xarray but
> the exporter needs to populate it with the interconnect specific
> addresses. And, similarly for unmap, the exporter is expected to
> cleanup the individual entries of the xarray.
I don't think we should limit this to xarrays, nor do I think it is a
great datastructure for what is usually needed here..
I just posted the patches showing what iommufd needs, and it wants
something like
struct mapping {
	struct p2p_provider *provider;
	size_t nelms;
	struct phys_vec *phys;
};
Which is not something that makes sense as an xarray.
I think the interconnect should have its own functions for map/unmap,
ie instead of trying to have them as a common
dma_buf_interconnect_ops do something like
struct dma_buf_interconnect_ops {
	const char *name;
	bool (*supports_interconnects)(struct dma_buf_attachment *attach,
				       const struct dma_buf_interconnect_match *,
				       unsigned int num_ics);
};
struct dma_buf_iov_interconnect_ops {
	struct dma_buf_interconnect_ops ic_ops;
	struct xx *(*map)(struct dma_buf_attachment *attach,
			  unsigned int *bar_number,
			  size_t *nelms);
	// No unmap for iov
};
static inline struct xx *dma_buf_iov_map(struct dma_buf_attachment *attach,
					 unsigned int *bar_number,
					 size_t *nelms)
{
	return container_of(attach->ic_ops,
			    struct dma_buf_iov_interconnect_ops,
			    ic_ops)->map(attach, bar_number, nelms);
}
> +/**
> + * dma_buf_attachment_is_dynamic - check if the importer can handle move_notify.
> + * @attach: the attachment to check
> + *
> + * Returns true if a DMA-buf importer has indicated that it can handle dmabuf
> + * location changes through the move_notify callback.
> + */
> +static inline bool
> +dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> +{
> + return !!attach->importer_ops;
> +}
Why is this in this patch?
I also think this patch should be second in the series; it makes more
sense to figure out how to attach with an interconnect and then show how
to map/unmap with that interconnect.
Like I'm not sure why this introduces allow_ic?
Jason
On Sun, Oct 26, 2025 at 03:55:04PM +0800, Shuai Xue wrote:
>
>
> On 2025/10/22 20:50, Jason Gunthorpe wrote:
> > On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> >
> > Let's enhance this:
> >
> > Currently VFIO can take MMIO regions from the device's BAR and map
> > them into a PFNMAP VMA with special PTEs. This mapping type ensures
> > the memory cannot be used with things like pin_user_pages(), hmm, and
> > so on. In practice only the user process CPU and KVM can safely make
> > use of these VMA. When VFIO shuts down these VMAs are cleaned by
> > unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
> > unbind.
> >
> > However, VFIO type 1 has an insecure behavior where it uses
> > follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
> > into the IOMMU. This has a long history of enabling P2P DMA inside
> > VMs, but has serious lifetime problems by allowing a UAF of the MMIO
> > after the VFIO driver has been unbound.
>
> Hi, Jason,
>
> Can you elaborate on this more?
>
> From my understanding of the VFIO type 1 implementation:
>
> - When a device is opened through VFIO type 1, it increments the
> device->refcount
> - During unbind, the driver waits for this refcount to drop to zero via
> wait_for_completion(&device->comp)
> - This should prevent the unbind() from completing while the device is
> still in use
>
> Given this refcount mechanism, I cannot figure out how the UAF can
> occur.
A second vfio device can be opened and its type1 container can then use
follow_pfnmap_*() to read the first vfio device's PTEs. There is no
relationship between
the first and second VFIO devices, so once the first is unbound it
sails through the device->comp while the second device retains the PFN
in its type1 iommu_domain.
Jason