On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> Bear with me if it's an ignorant question.
>
> The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete
eventually under kernel control, because these sleeps are
sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something
like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at
least this will work "correctly" even if it takes a huge amount of
time.
I wouldn't mind seeing a shorter timeout and print on the fence too
just in case.
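For illustration, a bounded wait plus a diagnostic along those lines could look
roughly like the sketch below; the 30 second value, the warning text and the
surrounding VFIO context (priv, vdev) are only assumptions for the sketch, not
something from the posted series:

	/* Hypothetical sketch: bound the fence wait and warn instead of
	 * sleeping forever. dma_resv_wait_timeout() returns 0 on timeout
	 * and the remaining jiffies on success.
	 */
	long left = dma_resv_wait_timeout(priv->dmabuf->resv,
					  DMA_RESV_USAGE_BOOKKEEP, false,
					  secs_to_jiffies(30));

	if (left <= 0)
		dev_warn(&vdev->pdev->dev,
			 "DMABUF fences did not signal within 30s (%ld)\n",
			 left);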
Jason
On Thu, Jan 29, 2026 at 08:13:18AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon(a)kernel.org>
> > Sent: Thursday, January 29, 2026 3:34 PM
> >
> > On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg(a)ziepe.ca>
> > > > Sent: Wednesday, January 28, 2026 12:28 AM
> > > >
> > > > On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > > > > > > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
> > > > vfio_pci_core_device *vdev, bool revoked)
> > > > > > > dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > > > > priv->revoked = revoked;
> > > > > > > dma_buf_invalidate_mappings(priv-
> > >dmabuf);
> > > > > > > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > > > > > > +
> > DMA_RESV_USAGE_BOOKKEEP,
> > > > false,
> > > > > > > +
> > MAX_SCHEDULE_TIMEOUT);
> > > > > > > dma_resv_unlock(priv->dmabuf->resv);
> > > > > > > + if (revoked) {
> > > > > > > + kref_put(&priv->kref,
> > > > vfio_pci_dma_buf_done);
> > > > > > > + /* Let's wait till all DMA unmap are
> > > > completed. */
> > > > > > > + wait = wait_for_completion_timeout(
> > > > > > > + &priv->comp,
> > secs_to_jiffies(1));
> > > > > >
> > > > > > Is the 1-second constant sufficient for all hardware, or should the
> > > > > > invalidate_mappings() contract require the callback to block until
> > > > > > speculative reads are strictly fenced? I'm wondering about a case
> > where
> > > > > > a device's firmware has a high response latency, perhaps due to
> > internal
> > > > > > management tasks like error recovery or thermal and it exceeds the
> > 1s
> > > > > > timeout.
> > > > > >
> > > > > > If the device is in the middle of a large DMA burst and the firmware is
> > > > > > slow to flush the internal pipelines to a fully "quiesced"
> > > > > > read-and-discard state, reclaiming the memory at exactly 1.001
> > seconds
> > > > > > risks triggering platform-level faults..
> > > > > >
> > > > > > Since we explicitly permit these speculative reads until unmap is
> > > > > > complete, relying on a hardcoded timeout in the exporter seems to
> > > > > > introduce a hardware-dependent race condition that could
> > compromise
> > > > > > system stability via IOMMU errors or AER faults.
> > > > > >
> > > > > > Should the importer instead be required to guarantee that all
> > > > > > speculative access has ceased before the invalidation call returns?
> > > > >
> > > > > It is guaranteed by the dma_resv_wait_timeout() call above. That call
> > > > ensures
> > > > > that the hardware has completed all pending operations. The 1‑second
> > > > delay is
> > > > > meant to catch cases where an in-kernel DMA unmap call is missing,
> > which
> > > > should
> > > > > not trigger any DMA activity at that point.
> > > >
> > > > Christian may know actual examples, but my general feeling is he was
> > > > worrying about drivers that have pushed the DMABUF to visibility on
> > > > the GPU and the move notify & fences only shoot down some access. So
> > > > it has to wait until the DMABUF is finally unmapped.
> > > >
> > > > Pranjal's example should be covered by the driver adding a fence and
> > > > then the unbounded fence wait will complete it.
> > > >
> > >
> > > Bear with me if it's an ignorant question.
> > >
> > > The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> > > wait, which is the reason behind the 2nd timeout wait here.
> >
> > It is not accurate. A second timeout is present both in the
> > description of patch 6 and in the VFIO implementation. The difference is
> > that the timeout is enforced within VFIO.
> >
> > >
> > > Then why is "the unbounded fence wait" not a problem in the same
> > > code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case
> > > timeout in hundreds of years...
> >
> > "An unbounded fence wait" is a different class of wait. It indicates broken
> > hardware that continues to issue DMA transactions even after it has been
> > told to
> > stop.
> >
> > The second wait exists to catch software bugs or misuse, where the dma-buf
> > importer has misrepresented its capabilities.
> >
>
> Okay I see.
>
> > >
> > > and it'd be helpful to put some words in the code based on what's
> > > discussed here.
> >
> > We've documented as much as we can in dma_buf_attach_revocable() and
> > dma_buf_invalidate_mappings(). Do you have any suggestions on what else
> > should be added here?
> >
>
> the selection of 1s?
It is indirectly written in the description of the WARN_ON(), but let's add
more. What about the following?
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 93795ad2e025..948ba75288c6 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -357,7 +357,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
dma_resv_unlock(priv->dmabuf->resv);
if (revoked) {
kref_put(&priv->kref, vfio_pci_dma_buf_done);
- /* Let's wait till all DMA unmap are completed. */
+			/*
+			 * Wait up to 1 second until all DMA unmaps are
+			 * completed. This is supposed to catch dma-buf
+			 * importers which lied about their support
+			 * of dmabuf revoke. See dma_buf_invalidate_mappings()
+			 * for the expected behaviour.
+			 */
wait = wait_for_completion_timeout(
&priv->comp, secs_to_jiffies(1));
/*
>
> then,
>
> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Thanks
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)ziepe.ca>
> > Sent: Wednesday, January 28, 2026 12:28 AM
> >
> > On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > > > > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
> > vfio_pci_core_device *vdev, bool revoked)
> > > > > dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > > priv->revoked = revoked;
> > > > > dma_buf_invalidate_mappings(priv->dmabuf);
> > > > > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > > > > + DMA_RESV_USAGE_BOOKKEEP,
> > false,
> > > > > + MAX_SCHEDULE_TIMEOUT);
> > > > > dma_resv_unlock(priv->dmabuf->resv);
> > > > > + if (revoked) {
> > > > > + kref_put(&priv->kref,
> > vfio_pci_dma_buf_done);
> > > > > + /* Let's wait till all DMA unmap are
> > completed. */
> > > > > + wait = wait_for_completion_timeout(
> > > > > + &priv->comp, secs_to_jiffies(1));
> > > >
> > > > Is the 1-second constant sufficient for all hardware, or should the
> > > > invalidate_mappings() contract require the callback to block until
> > > > speculative reads are strictly fenced? I'm wondering about a case where
> > > > a device's firmware has a high response latency, perhaps due to internal
> > > > management tasks like error recovery or thermal and it exceeds the 1s
> > > > timeout.
> > > >
> > > > If the device is in the middle of a large DMA burst and the firmware is
> > > > slow to flush the internal pipelines to a fully "quiesced"
> > > > read-and-discard state, reclaiming the memory at exactly 1.001 seconds
> > > > risks triggering platform-level faults..
> > > >
> > > > Since we explicitly permit these speculative reads until unmap is
> > > > complete, relying on a hardcoded timeout in the exporter seems to
> > > > introduce a hardware-dependent race condition that could compromise
> > > > system stability via IOMMU errors or AER faults.
> > > >
> > > > Should the importer instead be required to guarantee that all
> > > > speculative access has ceased before the invalidation call returns?
> > >
> > > It is guaranteed by the dma_resv_wait_timeout() call above. That call
> > ensures
> > > that the hardware has completed all pending operations. The 1‑second
> > delay is
> > > meant to catch cases where an in-kernel DMA unmap call is missing, which
> > should
> > > not trigger any DMA activity at that point.
> >
> > Christian may know actual examples, but my general feeling is he was
> > worrying about drivers that have pushed the DMABUF to visibility on
> > the GPU and the move notify & fences only shoot down some access. So
> > it has to wait until the DMABUF is finally unmapped.
> >
> > Pranjal's example should be covered by the driver adding a fence and
> > then the unbounded fence wait will complete it.
> >
>
> Bear with me if it's an ignorant question.
>
> The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the
description of patch 6 and in the VFIO implementation. The difference is
that the timeout is enforced within VFIO.
>
> Then why is "the unbounded fence wait" not a problem in the same
> code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case
> timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken
hardware that continues to issue DMA transactions even after it has been told to
stop.
The second wait exists to catch software bugs or misuse, where the dma-buf
importer has misrepresented its capabilities.
>
> and it'd be helpful to put some words in the code based on what's
> discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and
dma_buf_invalidate_mappings(). Do you have any suggestions on what else
should be added here?
Thanks
On Mon, Jan 26, 2026 at 08:53:57PM +0000, Pranjal Shrivastava wrote:
> On Sat, Jan 24, 2026 at 09:14:16PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > dma-buf invalidation is handled asynchronously by the hardware, so VFIO
> > must wait until all affected objects have been fully invalidated.
> >
> > In addition, the dma-buf exporter is expecting that all importers unmap any
> > buffers they previously mapped.
> >
> > Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions")
> > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > ---
> > drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 68 insertions(+), 3 deletions(-)
<...>
> > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> > dma_resv_lock(priv->dmabuf->resv, NULL);
> > priv->revoked = revoked;
> > dma_buf_invalidate_mappings(priv->dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > + DMA_RESV_USAGE_BOOKKEEP, false,
> > + MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv);
> > + if (revoked) {
> > + kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > + /* Let's wait till all DMA unmap are completed. */
> > + wait = wait_for_completion_timeout(
> > + &priv->comp, secs_to_jiffies(1));
>
> Is the 1-second constant sufficient for all hardware, or should the
> invalidate_mappings() contract require the callback to block until
> speculative reads are strictly fenced? I'm wondering about a case where
> a device's firmware has a high response latency, perhaps due to internal
> management tasks like error recovery or thermal and it exceeds the 1s
> timeout.
>
> If the device is in the middle of a large DMA burst and the firmware is
> slow to flush the internal pipelines to a fully "quiesced"
> read-and-discard state, reclaiming the memory at exactly 1.001 seconds
> risks triggering platform-level faults..
>
> Since we explicitly permit these speculative reads until unmap is
> complete, relying on a hardcoded timeout in the exporter seems to
> introduce a hardware-dependent race condition that could compromise
> system stability via IOMMU errors or AER faults.
>
> Should the importer instead be required to guarantee that all
> speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call ensures
that the hardware has completed all pending operations. The 1‑second delay is
meant to catch cases where an in-kernel DMA unmap call is missing, which should
not trigger any DMA activity at that point.
So yes, one second is more than sufficient.
Thanks
>
> Thanks
> Praan
>
> > + /*
> > + * If you see this WARN_ON, it means that
> > + * importer didn't call unmap in response to
> > + * dma_buf_invalidate_mappings() which is not
> > + * allowed.
> > + */
> > + WARN(!wait,
> > + "Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
> > + } else {
> > + /*
> > +			 * Kref is initialized again, because when revoke
> > + * was performed the reference counter was decreased
> > + * to zero to trigger completion.
> > + */
> > + kref_init(&priv->kref);
> > + /*
> > + * There is no need to wait as no mapping was
> > + * performed when the previous status was
> > + * priv->revoked == true.
> > + */
> > + reinit_completion(&priv->comp);
> > + }
> > }
> > fput(priv->dmabuf->file);
> > }
> > @@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> >
> > down_write(&vdev->memory_lock);
> > list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > + unsigned long wait;
> > +
> > if (!get_file_active(&priv->dmabuf->file))
> > continue;
> >
> > @@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> > priv->vdev = NULL;
> > priv->revoked = true;
> > dma_buf_invalidate_mappings(priv->dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > + DMA_RESV_USAGE_BOOKKEEP, false,
> > + MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv);
> > + kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > + wait = wait_for_completion_timeout(&priv->comp,
> > + secs_to_jiffies(1));
> > + WARN_ON(!wait);
> > vfio_device_put_registration(&vdev->vdev);
> > fput(priv->dmabuf->file);
> > }
> >
> > --
> > 2.52.0
> >
> >
>
Hi everyone,
dma_fences have always lived under the tyranny dictated by the module
lifetime of their issuer, leading to crashes should anybody still hold
a reference to a dma_fence when the module of the issuer was unloaded.
The basic problem is that when buffers are shared between drivers,
dma_fence objects can leak into external drivers and stay there even
after they are signaled. The dma_resv object, for example, only lazily
releases dma_fences.
So what happens is that when the module which originally created the dma_fence
is unloaded, the dma_fence_ops function table becomes unavailable as well, and
any attempt to release the fence crashes the system.
Various approaches have previously been discussed, including changing the
locking semantics of the dma_fence callbacks (by me) as well as using the
drm scheduler as an intermediate layer (by Sima) to disconnect dma_fences
from their actual users, but none of them actually solves all of the problems.
Tvrtko did some really nice prerequisite work by protecting the strings
returned by the dma_fence_ops with RCU. This way dma_fence creators were
able to simply wait for an RCU grace period after fence signaling before
it was safe to free those data structures.
Now this patch set goes a step further and protects the whole
dma_fence_ops structure with RCU, so that after the fence signals the
pointer to the dma_fence_ops is set to NULL when neither a wait nor a
release callback is given. All functionality which uses the dma_fence_ops
reference is put inside an RCU critical section, except for the
deprecated issuer-specific wait and, of course, the optional release
callback.
In addition to the RCU changes, the lock protecting the dma_fence state
previously had to be allocated externally. This set now changes the
functionality to make that external lock optional and allows dma_fences
to use an inline lock and be self-contained.
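As a rough illustration of the access pattern this implies (purely a sketch
under the assumptions of this set: the ops pointer is RCU-annotated and may be
NULL after signaling; example_fence_driver_name() is a made-up helper, not
part of the series):

static const char *example_fence_driver_name(struct dma_fence *fence)
{
	const struct dma_fence_ops *ops;
	const char *name = "signaled";

	rcu_read_lock();
	/* After the fence signaled, ops may already have been set to NULL. */
	ops = rcu_dereference(fence->ops);
	if (ops)
		name = ops->get_driver_name(fence);
	rcu_read_unlock();

	return name;
}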
v4:
Rebases the whole set on upstream changes, especially the cleanup
from Philip in patch "drm/amdgpu: independence for the amdkfd_fence!".
Added two patches which bring the DMA-fence self tests up to date.
The first selftest change removes the mock_wait and so actually starts
testing the default behavior instead of some hacky implementation in the
test. This one got upstreamed independently of this set.
The second drops the mock_fence as well and tests the new RCU and inline
spinlock functionality.
v5:
Rebase on top of drm-misc-next instead of drm-tip, leave out all driver
changes for now since those should go through the driver specific paths
anyway.
Address a few more review comments, especially some rebase mess and
typos. And finally fix one more bug found by AMD's CI system.
v6:
Minor style changes, re-ordered patch #1, dropped the scheduler fence
change for now
Please review and comment,
Christian.
On Mon, Jan 26, 2026 at 08:38:44PM +0000, Pranjal Shrivastava wrote:
> I noticed that Patch 5 removes the invalidate_mappings stub from
> umem_dmabuf.c, effectively making the callback NULL for an RDMA
> importer. Consequently, dma_buf_attach_revocable() (introduced here)
> will return false for these importers.
Yes, that is the intention.
> Since the cover letter mentions that VFIO will use
> dma_buf_attach_revocable() to prevent unbounded waits, this appears to
> effectively block paths like the VFIO-export -> RDMA-import path..
It remains usable with the ODP path and people are using that right
now.
> Given that RDMA is a significant consumer of dma-bufs, are there plans
> to implement proper revocation support in the IB/RDMA core (umem_dmabuf)?
This depends on each HW; they need a way to implement the revoke
semantic. I can't guess what is possible, but I would hope that most
HW could at least do a revoke on a real MR.
Eg an MR rereg operation to a kernel-owned empty PD is an effective
"revoke", and MR rereg is at least defined by standards, so HW should
implement it.
> It would be good to know if there's a plan for bringing such importers
> into compliance with the new revocation semantics so they can interop
> with VFIO OR are we completely ruling out users like RDMA / IB importing
> any DMABUFs exported by VFIO?
It will be driver dependent, there is no one shot update here.
Jason
On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
> From: Yishai Hadas <yishaih(a)nvidia.com>
>
> Expose DMABUF functionality to userspace through the uverbs interface,
> enabling InfiniBand/RDMA devices to export PCI based memory regions
> (e.g. device memory) as DMABUF file descriptors. This allows
> zero-copy sharing of RDMA memory with other subsystems that support the
> dma-buf framework.
>
> A new UVERBS_OBJECT_DMABUF object type and allocation method were
> introduced.
>
> During allocation, uverbs invokes the driver to supply the
> rdma_user_mmap_entry associated with the given page offset (pgoff).
>
> Based on the returned rdma_user_mmap_entry, uverbs requests the driver
> to provide the corresponding physical-memory details as well as the
> driver’s PCI provider information.
>
> Using this information, dma_buf_export() is called; if it succeeds,
> uobj->object is set to the underlying file pointer returned by the
> dma-buf framework.
>
> The file descriptor number follows the standard uverbs allocation flow,
> but the file pointer comes from the dma-buf subsystem, including its own
> fops and private data.
>
> Because of this, alloc_begin_fd_uobject() must handle cases where
> fd_type->fops is NULL, and both alloc_commit_fd_uobject() and
> alloc_abort_fd_uobject() must account for whether filp->private_data
> exists, since it is only populated after a successful dma_buf_export().
>
> When an mmap entry is removed, uverbs iterates over its associated
> DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that
> their importers are notified.
>
> The same procedure applies during the disassociate flow; final cleanup
> occurs when the application closes the file.
>
> Signed-off-by: Yishai Hadas <yishaih(a)nvidia.com>
> Signed-off-by: Edward Srouji <edwards(a)nvidia.com>
> ---
> drivers/infiniband/core/Makefile | 1 +
> drivers/infiniband/core/device.c | 2 +
> drivers/infiniband/core/ib_core_uverbs.c | 19 +++
> drivers/infiniband/core/rdma_core.c | 63 ++++----
> drivers/infiniband/core/rdma_core.h | 1 +
> drivers/infiniband/core/uverbs.h | 10 ++
> drivers/infiniband/core/uverbs_std_types_dmabuf.c | 172 ++++++++++++++++++++++
> drivers/infiniband/core/uverbs_uapi.c | 1 +
> include/rdma/ib_verbs.h | 9 ++
> include/rdma/uverbs_types.h | 1 +
> include/uapi/rdma/ib_user_ioctl_cmds.h | 10 ++
> 11 files changed, 263 insertions(+), 26 deletions(-)
<...>
> +static struct sg_table *
> +uverbs_dmabuf_map(struct dma_buf_attachment *attachment,
> + enum dma_data_direction dir)
> +{
> + struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv;
> +
> + dma_resv_assert_held(priv->dmabuf->resv);
> +
> + if (priv->revoked)
> + return ERR_PTR(-ENODEV);
> +
> + return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
> + &priv->phys_vec, 1, priv->phys_vec.len,
> + dir);
> +}
> +
> +static void uverbs_dmabuf_unmap(struct dma_buf_attachment *attachment,
> + struct sg_table *sgt,
> + enum dma_data_direction dir)
> +{
> + dma_buf_free_sgt(attachment, sgt, dir);
> +}
Unfortunately, it is not enough. Exporters should count their
map<->unmap calls and make sure that they are equal.
See this VFIO change https://lore.kernel.org/kvm/20260124-dmabuf-revoke-v5-4-f98fca917e96@nvidia…
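For reference, a sketch of the kind of accounting meant here, modelled on the
VFIO change linked above; the kref/completion fields and uverbs_dmabuf_done()
are illustrative assumptions for this snippet, not the actual uverbs code:

/* Illustrative release callback: signals that the last mapping is gone. */
static void uverbs_dmabuf_done(struct kref *kref)
{
	struct ib_uverbs_dmabuf_file *priv =
		container_of(kref, struct ib_uverbs_dmabuf_file, kref);

	complete_all(&priv->comp);
}

static struct sg_table *
uverbs_dmabuf_map(struct dma_buf_attachment *attachment,
		  enum dma_data_direction dir)
{
	struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv;
	struct sg_table *sgt;

	dma_resv_assert_held(priv->dmabuf->resv);

	if (priv->revoked)
		return ERR_PTR(-ENODEV);

	sgt = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
				      &priv->phys_vec, 1, priv->phys_vec.len,
				      dir);
	if (!IS_ERR(sgt))
		kref_get(&priv->kref);	/* one reference per live mapping */

	return sgt;
}

static void uverbs_dmabuf_unmap(struct dma_buf_attachment *attachment,
				struct sg_table *sgt,
				enum dma_data_direction dir)
{
	struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv;

	dma_buf_free_sgt(attachment, sgt, dir);
	/* Drops the per-mapping reference taken in uverbs_dmabuf_map(). */
	kref_put(&priv->kref, uverbs_dmabuf_done);
}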
Thanks
On Fri, Oct 31, 2025 at 05:15:32AM +0000, Kasireddy, Vivek wrote:
> > So the next steps would be to make all the exporters directly declare
> > a SGT and then remove the SGT related ops from dma_ops itself and
> > remove the compat sgt in the attach logic. This is not hard, it is all
> > simple mechanical work.
> IMO, this SGT compatibility stuff should ideally be a separate follow-on
> effort (and patch series) that would also probably include updates to
> various drivers to add the SGT mapping type.
I've been working on this idea and have updated my github here:
https://github.com/jgunthorpe/linux/commits/dmabuf_map_type/
I still need to run it through what testing I can do here, but it goes
all the way and converts everything into SGT mapping type, all
drivers. I think this shows the idea works.
I'm hoping to post it next week if the revoke thing settles down and I
can complete some more checking.
We can discuss how to break it up, along with getting feedback on whether
people are happy with the idea.
It looks like it turns out fairly well, I didn't find anything
surprising along the way at least.
Thanks,
Jason
Changelog:
v3:
* Used Jason's wordings for commits and cover letter.
* Removed IOMMUFD patch.
* Renamed dma_buf_attachment_is_revoke() to be dma_buf_attach_revocable().
* Added patch to remove CONFIG_DMABUF_MOVE_NOTIFY.
* Added Reviewed-by tags.
* Called to dma_resv_wait_timeout() after dma_buf_move_notify() in VFIO.
* Added dma_buf_attach_revocable() check to VFIO DMABUF attach function.
* Slightly changed commit messages.
v2: https://patch.msgid.link/20260118-dmabuf-revoke-v2-0-a03bb27c0875@nvidia.com
* Changed series to document the revoke semantics instead of
implementing it.
v1: https://patch.msgid.link/20260111-dmabuf-revoke-v1-0-fb4bcc8c259b@nvidia.com
-------------------------------------------------------------------------
This series documents a dma-buf “revoke” mechanism: it allows a dma-buf
exporter to explicitly invalidate (“kill”) a shared buffer after it has
been distributed to importers, so that further CPU and device access is
prevented and importers reliably observe failure.
The change in this series is to properly document and use the existing core
“revoked” state on the dma-buf object and a corresponding exporter-triggered
revoke operation.
dma-buf has quietly allowed calling move_notify on pinned dma-bufs, even
though legacy importers using dma_buf_attach() would simply ignore
these calls.
RDMA saw this and needed to use allow_peer2peer=true, so it implemented a
new-style pinned importer with an explicitly non-working move_notify()
callback.
This has been tolerable because the existing exporters are thought to
only call move_notify() on a pinned DMABUF under RAS events and we
have been willing to tolerate the UAF that results by allowing the
importer to continue to use the mapping in this rare case.
VFIO wants to implement a pin-supporting exporter that will issue a
revoking move_notify() around FLRs and a few other user-triggerable
operations. Since this is much more common we are not willing to
tolerate the security UAF caused by interworking with non-move_notify()
supporting drivers. Thus until now VFIO has required dynamic importers,
even though it never actually moves the buffer location.
To allow VFIO to work with pinned importers, according to how dma-buf
was intended, we need to allow VFIO to detect if an importer is legacy
or RDMA and does not actually implement move_notify().
Introduce a new function that exporters can call to detect these less
capable importers. VFIO can then refuse to accept them during attach.
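A minimal sketch of what that check could look like on the exporter side,
assuming dma_buf_attach_revocable() takes the attachment (as the renamed
dma_buf_attachment_is_revoke() suggests) and with an illustrative callback
name:

static int example_exporter_attach(struct dma_buf *dmabuf,
				   struct dma_buf_attachment *attach)
{
	/* Refuse importers that cannot honour invalidate_mappings(). */
	if (!dma_buf_attach_revocable(attach))
		return -EOPNOTSUPP;

	return 0;
}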
In theory all exporters that call move_notify() on pinned dma-bufs
should call this function; however, that would break a number of widely
used NIC/GPU flows. Thus for now do not spread this further than VFIO
until we can understand how much of RDMA can implement the full
semantic.
In the process clarify how move_notify is intended to be used with
pinned dma-bufs.
Thanks
Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
---
Leon Romanovsky (7):
dma-buf: Rename .move_notify() callback to a clearer identifier
dma-buf: Always build with DMABUF_MOVE_NOTIFY
dma-buf: Document RDMA non-ODP invalidate_mapping() special case
dma-buf: Add check function for revoke semantics
iommufd: Pin dma-buf importer for revoke semantics
vfio: Wait for dma-buf invalidation to complete
vfio: Validate dma-buf revocation semantics
drivers/dma-buf/Kconfig | 12 -----
drivers/dma-buf/dma-buf.c | 69 +++++++++++++++++++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 +++---
drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
drivers/gpu/drm/virtio/virtgpu_prime.c | 2 +-
drivers/gpu/drm/xe/tests/xe_dma_buf.c | 7 ++-
drivers/gpu/drm/xe/xe_dma_buf.c | 14 +++---
drivers/infiniband/core/umem_dmabuf.c | 13 +-----
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/iommu/iommufd/pages.c | 11 ++++-
drivers/vfio/pci/vfio_pci_dmabuf.c | 8 ++++
include/linux/dma-buf.h | 9 ++--
12 files changed, 96 insertions(+), 67 deletions(-)
---
base-commit: 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb
change-id: 20251221-dmabuf-revoke-b90ef16e4236
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>
Hi Thierry,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20260122]
[cannot apply to drm-misc/drm-misc-next robh/for-next linus/master v6.19-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Thierry-Reding/dt-bindings-r…
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260122161009.3865888-7-thierry.reding%40kernel.…
patch subject: [PATCH v2 06/10] dma-buf: heaps: Add support for Tegra VPR
config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20260123/202601231123.4V5wVUur-lkp@…)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260123/202601231123.4V5wVUur-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601231123.4V5wVUur-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_protect_pages':
drivers/dma-buf/heaps/tegra-vpr.c:205:21: error: implicit declaration of function '__ptep_get'; did you mean 'ptep_get'? [-Wimplicit-function-declaration]
205 | pte_t pte = __ptep_get(ptep);
| ^~~~~~~~~~
| ptep_get
drivers/dma-buf/heaps/tegra-vpr.c:205:21: error: invalid initializer
drivers/dma-buf/heaps/tegra-vpr.c:207:15: error: implicit declaration of function 'clear_pte_bit'; did you mean 'clear_ptes'? [-Wimplicit-function-declaration]
207 | pte = clear_pte_bit(pte, __pgprot(PROT_NORMAL));
| ^~~~~~~~~~~~~
| clear_ptes
In file included from arch/x86/include/asm/paravirt_types.h:11,
from arch/x86/include/asm/ptrace.h:175,
from arch/x86/include/asm/math_emu.h:5,
from arch/x86/include/asm/processor.h:13,
from arch/x86/include/asm/timex.h:5,
from include/linux/timex.h:67,
from include/linux/time32.h:13,
from include/linux/time.h:60,
from include/linux/stat.h:19,
from include/linux/fs_dirent.h:5,
from include/linux/fs/super_types.h:5,
from include/linux/fs/super.h:5,
from include/linux/fs.h:5,
from include/linux/debugfs.h:15,
from drivers/dma-buf/heaps/tegra-vpr.c:12:
drivers/dma-buf/heaps/tegra-vpr.c:207:43: error: 'PROT_NORMAL' undeclared (first use in this function)
207 | pte = clear_pte_bit(pte, __pgprot(PROT_NORMAL));
| ^~~~~~~~~~~
arch/x86/include/asm/pgtable_types.h:202:48: note: in definition of macro '__pgprot'
202 | #define __pgprot(x) ((pgprot_t) { (x) } )
| ^
drivers/dma-buf/heaps/tegra-vpr.c:207:43: note: each undeclared identifier is reported only once for each function it appears in
207 | pte = clear_pte_bit(pte, __pgprot(PROT_NORMAL));
| ^~~~~~~~~~~
arch/x86/include/asm/pgtable_types.h:202:48: note: in definition of macro '__pgprot'
202 | #define __pgprot(x) ((pgprot_t) { (x) } )
| ^
drivers/dma-buf/heaps/tegra-vpr.c:208:15: error: implicit declaration of function 'set_pte_bit'; did you mean 'set_pte_at'? [-Wimplicit-function-declaration]
208 | pte = set_pte_bit(pte, __pgprot(PROT_DEVICE_nGnRnE));
| ^~~~~~~~~~~
| set_pte_at
drivers/dma-buf/heaps/tegra-vpr.c:208:41: error: 'PROT_DEVICE_nGnRnE' undeclared (first use in this function)
208 | pte = set_pte_bit(pte, __pgprot(PROT_DEVICE_nGnRnE));
| ^~~~~~~~~~~~~~~~~~
arch/x86/include/asm/pgtable_types.h:202:48: note: in definition of macro '__pgprot'
202 | #define __pgprot(x) ((pgprot_t) { (x) } )
| ^
drivers/dma-buf/heaps/tegra-vpr.c:210:9: error: implicit declaration of function '__set_pte'; did you mean 'set_pte'? [-Wimplicit-function-declaration]
210 | __set_pte(ptep, pte);
| ^~~~~~~~~
| set_pte
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_unprotect_pages':
drivers/dma-buf/heaps/tegra-vpr.c:218:21: error: invalid initializer
218 | pte_t pte = __ptep_get(ptep);
| ^~~~~~~~~~
drivers/dma-buf/heaps/tegra-vpr.c:220:43: error: 'PROT_DEVICE_nGnRnE' undeclared (first use in this function)
220 | pte = clear_pte_bit(pte, __pgprot(PROT_DEVICE_nGnRnE));
| ^~~~~~~~~~~~~~~~~~
arch/x86/include/asm/pgtable_types.h:202:48: note: in definition of macro '__pgprot'
202 | #define __pgprot(x) ((pgprot_t) { (x) } )
| ^
drivers/dma-buf/heaps/tegra-vpr.c:221:41: error: 'PROT_NORMAL' undeclared (first use in this function)
221 | pte = set_pte_bit(pte, __pgprot(PROT_NORMAL));
| ^~~~~~~~~~~
arch/x86/include/asm/pgtable_types.h:202:48: note: in definition of macro '__pgprot'
202 | #define __pgprot(x) ((pgprot_t) { (x) } )
| ^
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_buffer_allocate':
>> drivers/dma-buf/heaps/tegra-vpr.c:612:30: warning: variable 'last' set but not used [-Wunused-but-set-variable]
612 | unsigned long first, last;
| ^~~~
>> drivers/dma-buf/heaps/tegra-vpr.c:612:23: warning: variable 'first' set but not used [-Wunused-but-set-variable]
612 | unsigned long first, last;
| ^~~~~
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_buffer_release':
drivers/dma-buf/heaps/tegra-vpr.c:695:30: warning: variable 'last' set but not used [-Wunused-but-set-variable]
695 | unsigned long first, last;
| ^~~~
drivers/dma-buf/heaps/tegra-vpr.c:695:23: warning: variable 'first' set but not used [-Wunused-but-set-variable]
695 | unsigned long first, last;
| ^~~~~
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_setup_chunks':
>> drivers/dma-buf/heaps/tegra-vpr.c:8:21: warning: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
8 | #define pr_fmt(fmt) "tegra-vpr: " fmt
| ^~~~~~~~~~~~~
include/linux/dynamic_debug.h:231:29: note: in expansion of macro 'pr_fmt'
231 | func(&id, ##__VA_ARGS__); \
| ^~~~~~~~~~~
include/linux/dynamic_debug.h:259:9: note: in expansion of macro '__dynamic_func_call_cls'
259 | __dynamic_func_call_cls(__UNIQUE_ID(ddebug), cls, fmt, func, ##__VA_ARGS__)
| ^~~~~~~~~~~~~~~~~~~~~~~
include/linux/dynamic_debug.h:261:9: note: in expansion of macro '_dynamic_func_call_cls'
261 | _dynamic_func_call_cls(_DPRINTK_CLASS_DFLT, fmt, func, ##__VA_ARGS__)
| ^~~~~~~~~~~~~~~~~~~~~~
include/linux/dynamic_debug.h:280:9: note: in expansion of macro '_dynamic_func_call'
280 | _dynamic_func_call(fmt, __dynamic_pr_debug, \
| ^~~~~~~~~~~~~~~~~~
include/linux/printk.h:636:9: note: in expansion of macro 'dynamic_pr_debug'
636 | dynamic_pr_debug(fmt, ##__VA_ARGS__)
| ^~~~~~~~~~~~~~~~
drivers/dma-buf/heaps/tegra-vpr.c:1075:17: note: in expansion of macro 'pr_debug'
1075 | pr_debug(" %2u: %pap-%pap (%lu MiB)\n", i, &start, &end,
| ^~~~~~~~
drivers/dma-buf/heaps/tegra-vpr.c: In function 'tegra_vpr_add_heap':
drivers/dma-buf/heaps/tegra-vpr.c:1120:30: warning: variable 'last' set but not used [-Wunused-but-set-variable]
1120 | unsigned long first, last;
| ^~~~
drivers/dma-buf/heaps/tegra-vpr.c:1120:23: warning: variable 'first' set but not used [-Wunused-but-set-variable]
1120 | unsigned long first, last;
| ^~~~~
vim +/last +612 drivers/dma-buf/heaps/tegra-vpr.c
605
606 static struct tegra_vpr_buffer *
607 tegra_vpr_buffer_allocate(struct tegra_vpr *vpr, size_t size)
608 {
609 unsigned int num_pages = size >> PAGE_SHIFT;
610 unsigned int order = get_order(size);
611 struct tegra_vpr_buffer *buffer;
> 612 unsigned long first, last;
613 int pageno, err;
614 pgoff_t i;
615
616 /*
617 * "order" defines the alignment and size, so this may result in
618 * fragmented memory depending on the allocation patterns. However,
619 * since this is used primarily for video frames, it is expected that
620 * a number of buffers of the same size will be allocated, so
621 * fragmentation should be negligible.
622 */
623 pageno = tegra_vpr_allocate_region(vpr, num_pages, 1);
624 if (pageno < 0)
625 return ERR_PTR(pageno);
626
627 first = find_first_bit(vpr->bitmap, vpr->num_pages);
628 last = find_last_bit(vpr->bitmap, vpr->num_pages);
629
630 buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
631 if (!buffer) {
632 err = -ENOMEM;
633 goto release;
634 }
635
636 INIT_LIST_HEAD(&buffer->attachments);
637 INIT_LIST_HEAD(&buffer->list);
638 mutex_init(&buffer->lock);
639 buffer->start = vpr->base + (pageno << PAGE_SHIFT);
640 buffer->limit = buffer->start + size;
641 buffer->size = size;
642 buffer->num_pages = num_pages;
643 buffer->pageno = pageno;
644 buffer->order = order;
645
646 buffer->pages = kmalloc_array(buffer->num_pages,
647 sizeof(*buffer->pages),
648 GFP_KERNEL);
649 if (!buffer->pages) {
650 err = -ENOMEM;
651 goto free;
652 }
653
654 /* track which chunks this buffer overlaps */
655 if (vpr->num_chunks > 0) {
656 unsigned int limit = buffer->pageno + buffer->num_pages, i;
657
658 for (i = 0; i < vpr->num_chunks; i++) {
659 struct tegra_vpr_chunk *chunk = &vpr->chunks[i];
660
661 if (tegra_vpr_chunk_overlaps(chunk, pageno, limit))
662 set_bit(i, buffer->chunks);
663 }
664
665 /* activate chunks if necessary */
666 err = tegra_vpr_activate_chunks(vpr, buffer);
667 if (err < 0)
668 goto free;
669
670 /* track first and last allocated pages */
671 if (buffer->pageno < vpr->first)
672 vpr->first = buffer->pageno;
673
674 if (limit - 1 > vpr->last)
675 vpr->last = limit - 1;
676 }
677
678 for (i = 0; i < buffer->num_pages; i++)
679 buffer->pages[i] = &vpr->start_page[pageno + i];
680
681 return buffer;
682
683 free:
684 kfree(buffer->pages);
685 kfree(buffer);
686 release:
687 bitmap_release_region(vpr->bitmap, pageno, order);
688 return ERR_PTR(err);
689 }
690
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki