On 2025-11-26 08:19, Philipp Stanner wrote:
> The return code of dma_fence_signal() is not really useful as there is
> nothing reasonable to do if a fence was already signaled. That return
> code shall be removed from the kernel.
>
> Ignore dma_fence_signal()'s return code.
I think this is not correct. Looking at the comment in
evict_process_worker, we use the return value to decide a race
condition where multiple threads are trying to signal the eviction
fence. Only one of them should schedule the restore work. And the other
ones need to increment the reference count to keep evictions balanced.
Regards,
Felix
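For illustration, a minimal sketch of the pattern being described (names are approximations, not the exact evict_process_worker code):

	/*
	 * Hedged sketch: only the caller whose dma_fence_signal() returned 0
	 * actually signaled the fence and therefore schedules the restore
	 * work; the losers of the race take an extra reference so that
	 * eviction/restore stays balanced.
	 */
	if (signal_eviction_fence(p) == 0)
		queue_delayed_work(kfd_restore_wq, &p->restore_work,
				   msecs_to_jiffies(PROCESS_RESTORE_TIME_MS));
	else
		kref_get(&p->ref);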
>
> Suggested-by: Christian König <christian.koenig(a)amd.com>
> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index ddfe30c13e9d..950fafa4b3c3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1986,7 +1986,6 @@ kfd_process_gpuid_from_node(struct kfd_process *p, struct kfd_node *node,
> static int signal_eviction_fence(struct kfd_process *p)
> {
> struct dma_fence *ef;
> - int ret;
>
> rcu_read_lock();
> ef = dma_fence_get_rcu_safe(&p->ef);
> @@ -1994,10 +1993,10 @@ static int signal_eviction_fence(struct kfd_process *p)
> if (!ef)
> return -EINVAL;
>
> - ret = dma_fence_signal(ef);
> + dma_fence_signal(ef);
> dma_fence_put(ef);
>
> - return ret;
> + return 0;
> }
>
> static void evict_process_worker(struct work_struct *work)
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.c…
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and if the underlying alignment
requirements are met it will be placed in the iommu_domain.
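As a rough illustration, a userspace mapping of a DMABUF slice could look about like this (hedged sketch; iommufd, ioas_id and dmabuf_fd are assumed to already exist, and the slice offset, length and IOVA are made-up example values):

	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = dmabuf_fd,		/* a DMABUF fd instead of a memfd */
		.start = 1024 * 1024,		/* slice offset inside the DMABUF */
		.length = 2 * 1024 * 1024,	/* slice length */
		.iova = 0x100000000ULL,
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map))
		err(1, "IOMMU_IOAS_MAP_FILE");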
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue FLR it should
do the map/unmap to iommufd.
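Roughly, the expected ordering from the VMM side is something like the following (hedged sketch; ioas_unmap()/ioas_map_file() are made-up wrappers around the corresponding ioctls, and FLR via VFIO_DEVICE_RESET is only one example of a revoking operation):

	/* 1. Drop the BAR mapping from the IOAS before triggering the revoke. */
	ioas_unmap(iommufd, ioas_id, bar_iova, bar_len);

	/* 2. The operation that revokes the DMABUF, e.g. an FLR. */
	ioctl(vfio_dev_fd, VFIO_DEVICE_RESET);

	/* 3. Re-map explicitly; there is no automatic remap after a revoke. */
	ioas_map_file(iommufd, ioas_id, dmabuf_fd, 0, bar_len, bar_iova);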
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
v2:
- Rebase on Leon's v9
- Fix mislocking in an iopt_fill_domain() error path
- Revise the comments around how the sub page offset works
- Remove a useless WARN_ON in iopt_pages_rw_access()
- Fixed missed memory free in the selftest
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…
Jason Gunthorpe (9):
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 78 +++-
drivers/iommu/iommufd/io_pagetable.h | 54 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 14 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 414 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 143 ++++++
drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++
include/linux/vfio_pci_core.h | 4 +
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
12 files changed, 786 insertions(+), 70 deletions(-)
base-commit: f836737ed56db9e2d5b047c56a31e05af0f3f116
--
2.43.0
On Tue, Nov 25, 2025 at 05:11:18PM -0800, Alex Mastro wrote:
> fill_sg_entry() splits large DMA buffers into multiple scatter-gather
> entries, each holding up to UINT_MAX bytes. When calculating the DMA
> address for entries beyond the second one, the expression (i * UINT_MAX)
> causes integer overflow due to 32-bit arithmetic.
>
> This manifests when the input arg length >= 8 GiB, which results in looping
> with i >= 2.
>
> Fix by casting i to dma_addr_t before multiplication.
>
> Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> Signed-off-by: Alex Mastro <amastro(a)fb.com>
> ---
> More color about how I discovered this in [1] for the commit at [2]:
>
> [1] https://lore.kernel.org/all/aSZHO6otK0Heh+Qj@devgpu015.cco6.facebook.com
> [2] https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.c…
> ---
> drivers/dma-buf/dma-buf-mapping.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
AlexW, can you pick this up?
Thanks,
Jason
On Wed, Nov 26, 2025 at 08:08:24AM -0800, Alex Mastro wrote:
> On Wed, Nov 26, 2025 at 01:12:40PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Nov 25, 2025 at 04:18:03PM -0800, Alex Mastro wrote:
> > > On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote:
> > > > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > > > + dma_addr_t addr)
> > > > +{
> > > > + unsigned int len, nents;
> > > > + int i;
> > > > +
> > > > + nents = DIV_ROUND_UP(length, UINT_MAX);
> > > > + for (i = 0; i < nents; i++) {
> > > > + len = min_t(size_t, length, UINT_MAX);
> > > > + length -= len;
> > > > + /*
> > > > + * DMABUF abuses scatterlist to create a scatterlist
> > > > + * that does not have any CPU list, only the DMA list.
> > > > + * Always set the page related values to NULL to ensure
> > > > + * importers can't use it. The phys_addr based DMA API
> > > > + * does not require the CPU list for mapping or unmapping.
> > > > + */
> > > > + sg_set_page(sgl, NULL, 0, 0);
> > > > + sg_dma_address(sgl) = addr + i * UINT_MAX;
> > >
> > > (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for
> > > addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast:
> > >
> > > sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
Yeah, and i should not be signed.
> > > Discovered this while debugging why dma-buf import was failing for
> > > an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by
> > > ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles
> > > up as an EINVAL.
> > >
> >
> > Thanks a lot for testing & reporting this!
> >
> > However, I believe the casting approach is a little fragile (and
> > potentially prone to issues depending on how dma_addr_t is sized on
> > different platforms). Thus, approaching this with accumulation seems
> > better as it avoids the multiplication logic entirely, maybe something
> > like the following (untested) diff ?
>
> If the function input range is well-formed, then all values in
> [addr..addr+length) must be expressible by dma_addr_t, so I don't think overflow
> after casting is possible as long as nents is valid.
It is probably not perfect, but validate_dmabuf_input() limits length
to a valid size_t.
The signature is:
bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
phys_addr_t phys, size_t size)
And that function should fail if size is too large. I think it mostly
does, but it looks like there are a few little misses:
iova_align(iovad, size + iova_off),
return ALIGN(size, iovad->granule);
etc are all unchecked math that could overflow.
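A hedged sketch of the kind of guard that seems to be missing (an illustrative fragment only, reusing the iovad/iova_off names from that function's context):

	size_t padded;

	/* Refuse sizes where the alignment padding itself would wrap. */
	if (check_add_overflow(size, iova_off, &padded) ||
	    padded > ALIGN_DOWN(SIZE_MAX, iovad->granule))
		return false;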
> That said, `nents = DIV_ROUND_UP(length, UINT_MAX)` is simply broken on any
> system where size_t is 32b. I don't know if that's a practical consideration for
> these code paths though.
Yeah, that's a good point.
Casting to u64 will trigger 64 bit device errors on 32 bit too.
// DIV_ROUND_UP that is safe at the type limits
nents = size / UINT_MAX;
if (size % UINT_MAX)
nents++;
Compiler should turn the % into bit math.
Jason
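Putting the pieces together, a hedged sketch of fill_sg_entry() with the cast, an unsigned loop counter, and the overflow-safe nents calculation (the tail of the loop is reconstructed from context, so treat this as illustrative rather than the final patch):

static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
					 dma_addr_t addr)
{
	unsigned int len, nents, i;

	/* Open-coded because DIV_ROUND_UP(length, UINT_MAX) can overflow. */
	nents = length / UINT_MAX;
	if (length % UINT_MAX)
		nents++;

	for (i = 0; i < nents; i++) {
		len = min_t(size_t, length, UINT_MAX);
		length -= len;
		/* DMA-only scatterlist, no CPU list (see the original comment). */
		sg_set_page(sgl, NULL, 0, 0);
		/* Promote i before multiplying so the offset can't wrap in 32 bit. */
		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
		sg_dma_len(sgl) = len;
		sgl = sg_next(sgl);
	}
	return sgl;
}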
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
>
> > >
[...]
> > > My hope would be that in the mid-term future we'd get firmware
> > > rings
> > > that can be preempted through a firmware call for all major
> > > hardware.
> > > Then a huge share of our problems would disappear.
> >
> > At least on AMD HW pre-emption is actually horribly unreliable as
> > well.
>
> Do you mean new GPUs with firmware scheduling, or what is "HW pre-
> emption"?
>
> With firmware interfaces, my hope would be that you could simply tell
>
> stop_running_ring(nr_of_ring)
> // time slice for someone else
> start_running_ring(nr_of_ring)
>
> Thereby getting real scheduling and all that. And eliminating many
> other problems we know well from drm/sched.
It doesn't really matter whether you have firmware scheduling or not;
preemption is a hard problem on GPUs either way. CPUs have limited software
visible state that needs to be saved/restored on a context switch and
even there people start complaining now that they need to context
switch the AVX512 register set.
GPUs have megabytes of software visible state, which needs to be
saved/restored on the context switch if you want fine grained
preemption with low preemption latency. There might be points in the
command execution where you can ignore most of that state, but reaching
those points can have basically unbounded latency. So either you can
reliably save/restore lots of state or you are limited to very coarse
grained preemption with all the usual issues of timeouts and DoS
vectors.
I'm not totally up to speed with the current state across all relevant
GPUs, but until recently NVidia was the only vendor to have really
reliable fine-grained preemption.
Regards,
Lucas
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
> >
> >
> > On 11/26/25 13:37, Philipp Stanner wrote:
> > > On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
> > > >
>
> […]
>
> > > > Well the question is how do you detect *reliably* that there is
> > > > still forward progress?
> > >
> > > My understanding is that that's impossible since the internals of
> > > command submissions are only really understood by userspace, who
> > > submits them.
> >
> > Right, but we can still try to do our best in the kernel to mitigate
> > the situation.
> >
> > I think for now amdgpu will implement something like checking if the
> > HW still makes progress after a timeout but only a limited number of
> > re-tries until we say that's it and reset anyway.
>
> Oh oh, isn't that our dear hang_limit? :)
Not really. The hang limit is the limit on how many times a hanging
submit might be retried.
Limiting the number of timeout extensions is more of a safety net
against workloads which might appear to make progress to the kernel
driver but in reality are stuck. After all, the kernel driver can only
have limited knowledge of the GPU state and any progress check will
have limited precision with false positives/negatives being a part of
reality we have to deal with.
>
> We agree that you can never really know whether userspace just submitted
> a while(true) job, don't we? Even if some GPU register still indicates
> "progress".
Yea, this is really hardware dependent on what you can read at
runtime.
For etnaviv we define "progress" as the command frontend moving towards
the end of the command buffer. As a single draw call in valid workloads
can blow through our timeout we also use debug registers to look at the
current primitive ID within a draw call.
If userspace submits a workload that requires more than 500ms per
primitive to finish we consider this an invalid workload and go through
the reset/recovery motions.
Regards,
Lucas
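Schematically, that progress check can look something like the following (a hedged sketch with made-up register and field names; the real etnaviv code differs in the details):

struct my_gpu {
	void __iomem *mmio;
	u32 hangcheck_fe_addr;	/* frontend position at the last timeout */
	u32 hangcheck_prim_id;	/* primitive ID at the last timeout */
};

/* Returns true if the timeout should be extended, false to go reset. */
static bool gpu_made_progress(struct my_gpu *gpu)
{
	u32 fe_addr = readl(gpu->mmio + REG_FE_CURRENT_ADDR);
	u32 prim_id = readl(gpu->mmio + REG_DEBUG_PRIMITIVE_ID);

	if (fe_addr == gpu->hangcheck_fe_addr &&
	    prim_id == gpu->hangcheck_prim_id)
		return false;	/* no movement within the timeout window */

	gpu->hangcheck_fe_addr = fe_addr;
	gpu->hangcheck_prim_id = prim_id;
	return true;		/* forward progress, extend the timeout */
}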
On 11/26/25 13:37, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
>> On 11/25/25 18:02, Lucas Stach wrote:
>>>>> I agree that distinguishing the use case that way is not ideal.
>>>>> However, who has the knowledge of how the hardware is being used by
>>>>> customers / users, if not the driver?
>>>>
>>>> Well the end user.
>>>>
>>>> Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
>>>>
>>>> Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
>>>>
>>>> And 10 seconds maximum with only a warning that a non-default timeout is used and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
>>>
>>> The question really is what you want to do after you hit the (lowered)
>>> timeout? Users get grumpy if you block things for 10 seconds, but they
>>> get equally if not more grumpy when you kick out a valid workload that
>>> just happens to need a lot of GPU time.
>>
>> Yeah, exactly that summarizes the problem pretty well.
>>
>>> Fences are only defined to signal eventually, with no real concept of a
>>> timeout. IMO all timeouts waiting for fences should be long enough to
>>> only be considered last resort. You may want to give the user some
>>> indication of a failed fence wait instead of stalling indefinitely, but
>>> you really only want to do this after a quite long timeout, not in a
>>> sense of "Sorry, I ran out of patience after 2 seconds".
>>>
>>> Sure memory management depends on fences making forward progress, but
>>> mm also depends on scheduled writeback making forward progress. You
>>> don't kick out writeback requests after an arbitrary timeout just
>>> because the backing storage happens to be loaded heavily.
>>>
>>> This BTW is also why etnaviv has always had a quite short timeout of
>>> 500ms, with the option to extend the timeout when the GPU is still
>>> making progress. We don't ever want to shoot down valid workloads (we
>>> have some that need a few seconds to upload textures, etc on our wimpy
>>> GPU), but you also don't want to wait multiple seconds until you detect
>>> a real GPU hang.
>>
>> That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
>>
>> What we could also do is set a flag on the fence when a process is killed and then wait for that fence to signal so that it can clean up. Going to prototype that.
>>
>>> So we use the short scheduler timeout to check in on the GPU and see if
>>> it is still making progress (for graphics workloads by looking at the
>>> frontend position within the command buffer and current primitive ID).
>>> If we can deduce that the GPU is stuck we do the usual reset/recovery
>>> dance within a reasonable reaction time, acceptable to users hitting a
>>> real GPU hang. But if the GPU is making progress we will give an
>>> infinite number of timeout extensions with no global timeout at all,
>>> only fulfilling the eventual signaling guarantee of the fence.
>>
>> Well the question is how do you detect *reliably* that there is still forward progress?
>
> My understanding is that that's impossible since the internals of
> command submissions are only really understood by userspace, who
> submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking if the HW still makes progress after a timeout, but only with a limited number of re-tries until we say that's it and reset anyway.
> I think the long-term solution can only be fully fledged GPU scheduling
> with preemption. That's why we don't need such a timeout mechanism for
> userspace processes: the scheduler simply interrupts and lets someone
> else run.
Yeah absolutely.
>
> My hope would be that in the mid-term future we'd get firmware rings
> that can be preempted through a firmware call for all major hardware.
> Then a huge share of our problems would disappear.
At least on AMD HW pre-emption is actually horribly unreliable as well.
Userspace basically needs to co-operate and provide a buffer into which the state is saved on a pre-emption.
> With the current situation, IDK either. My impression so far is that
> letting the drivers and driver programmers decide is the least bad
> choice.
Yeah, agree. It's the least evil thing we can do.
But I now have a plan how to proceed :)
Thanks for the input,
Christian.
>
>
> P.
On 11/26/25 14:19, Philipp Stanner wrote:
> Barely anyone uses dma_fence_signal()'s (and similar functions') return
> code. Checking it is pretty much useless anyways, because what are you
> going to do if a fence was already signal it? Unsignal it and signal it
> again? ;p
Reviewed-by: Christian König <christian.koenig(a)amd.com> for the entire series.
Please push to drm-misc-next or leave me a note when I should pick it up.
> Removing the return code simplifies the API and makes it easier for me
> to sit on top with Rust DmaFence.
BTW, I have an rb for embedding the lock and I'm now writing test cases.
When that is done you should be able to base the Rust DmaFence abstraction on that as well.
Regards,
Christian.
>
> Philipp Stanner (6):
> dma-buf/dma-fence: Add dma_fence_test_signaled_flag()
> amd/amdkfd: Ignore return code of dma_fence_signal()
> drm/gpu/xe: Ignore dma_fence_signal() return code
> dma-buf: Don't misuse dma_fence_signal()
> drm/ttm: Remove return check of dma_fence_signal()
> dma-buf/dma-fence: Remove return code of signaling-functions
>
> drivers/dma-buf/dma-fence.c | 59 ++++++-------------
> drivers/dma-buf/st-dma-fence.c | 7 +--
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +-
> .../gpu/drm/ttm/tests/ttm_bo_validate_test.c | 3 +-
> drivers/gpu/drm/xe/xe_hw_fence.c | 5 +-
> include/linux/dma-fence.h | 33 ++++++++---
> 6 files changed, 53 insertions(+), 59 deletions(-)
>
On 11/25/25 11:56, Philipp Stanner wrote:
>>>>
>>>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
>>>>
>>>> The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
>>>
>>> That's a question only the drivers can answer, which is why I like to
>>> think that setting global constants constraining all parties is not the
>>> right thing to do.
>>
>> Exactly, that's the reason why I bring that up. I think that having drivers be in charge of timeouts is the wrong approach.
>>
>> See, the reason why we have the timeout (and documented that it is a must have) is because we have both core memory management as well as desktop responsiveness depending on it.
>
> Good and well, but then patch 4 becomes even more problematic:
>
> So we'd just have drivers fire warnings, and then they would still have
> the freedom to set timeouts for drm/sched, as long as those timeouts
> are smaller than your new global constant.
>
> Why then not remove drm/sched's timeout parameter API completely and
> always use your maximum value internally in drm/sched? Or maybe
> truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
> "Maximum timeout parameter exceeded, truncating to %ld.\n"
>
> I suppose some drivers want even higher responsiveness than those 2
> seconds.
As far as I know some medical use cases for example have timeouts like 100-200ms. But again that is the use case and not the driver.
> I do believe that more of the driver folks should be made aware of this
> intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
>>
>>> What is even your motivation? What problem does this solve? Is the OOM
>>> killer currently hanging for anyone? Can you link a bug report?
>>
>> I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
>>
>> So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
>>
>> While working on that I realized that the timeout is actually not driver dependent at all.
>>
>> What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
>>
>> I mean, what is actually HW dependent about the requirement that I need a responsive desktop system?
>
> I suppose some drivers are indeed only used for server hardware. And
> for compute you might not care about responsiveness as long as your
> result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
> I agree that distinguishing the use case that way is not ideal.
> However, who has the knowledge of how the hardware is being used by
> customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And 10 seconds maximum with only a warning that a non-default timeout is used and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
Thoughts?
Regards,
Christian.
>
>
> P.
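For what it's worth, a hedged sketch of what such a subsystem-level policy could look like (made-up names, not an existing drm/sched API; it only illustrates the default/override/taint idea above):

#define SCHED_TIMEOUT_DEFAULT	(2 * HZ)	/* ~2 seconds */
#define SCHED_TIMEOUT_MAX	(10 * HZ)	/* hard upper bound */

static long sched_pick_timeout(long requested)
{
	if (requested <= 0)
		return SCHED_TIMEOUT_DEFAULT;

	if (requested != SCHED_TIMEOUT_DEFAULT)
		pr_warn("using non-default job timeout of %ld jiffies\n",
			requested);

	if (requested > SCHED_TIMEOUT_MAX) {
		/* Only for testing/debugging, so make it visible in reports. */
		pr_warn("job timeout above %d jiffies, tainting kernel\n",
			SCHED_TIMEOUT_MAX);
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}

	return requested;
}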