On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
> >
> >
> > On 11/26/25 13:37, Philipp Stanner wrote:
> > > On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
> > > >
>
> […]
>
> > > > Well the question is how do you detect *reliably* that there is
> > > > still forward progress?
> > >
> > > My understanding is that that's impossible since the internals of
> > > command submissions are only really understood by userspace, who
> > > submits them.
> >
> > Right, but we can still try to do our best in the kernel to mitigate
> > the situation.
> >
> > I think for now amdgpu will implement something like checking
> > whether the HW still makes progress after a timeout, but with only a
> > limited number of retries until we say that's it and reset anyway.
>
> Oh oh, isn't that our dear hang_limit? :)
Not really. The hang limit is the limit on how many times a hanging
submit might be retried.
Limiting the number of timeout extensions is more of a safety net
against workloads that might appear to make progress to the kernel
driver but in reality are stuck. After all, the kernel driver can only
have limited knowledge of the GPU state, and any progress check will
have limited precision, with false positives/negatives being a part of
reality we have to deal with.
>
> We agree that you can never really know whether userspace just submitted
> a while(true) job, don't we? Even if some GPU register still indicates
> "progress".
Yeah, this really depends on the hardware and on what you can read at
runtime.
For etnaviv we define "progress" as the command frontend moving towards
the end of the command buffer. As a single draw call in valid workloads
can blow through our timeout, we also use debug registers to look at the
current primitive ID within a draw call.
If userspace submits a workload that requires more than 500ms per
primitive to finish, we consider this an invalid workload and go through
the reset/recovery motions.
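As a rough sketch (untested, register names and helpers made up for
illustration; the scheduler status names may differ in your tree):

/* Progress check: did the frontend address or the current primitive
 * ID move since the last timeout? */
static bool gpu_made_progress(struct gpu *gpu)
{
	u32 fe_addr = gpu_read(gpu, REG_FE_CURRENT_ADDR);  /* frontend position */
	u32 prim_id = gpu_read(gpu, REG_DEBUG_PRIM_ID);    /* primitive in draw */
	bool moved = fe_addr != gpu->last_fe_addr ||
		     prim_id != gpu->last_prim_id;

	gpu->last_fe_addr = fe_addr;
	gpu->last_prim_id = prim_id;
	return moved;
}

/* Called from the 500ms scheduler timeout: extend instead of reset
 * while the GPU still moves. */
static enum drm_gpu_sched_stat gpu_timedout_job(struct drm_sched_job *job)
{
	struct gpu *gpu = to_gpu(job->sched);

	if (gpu_made_progress(gpu))
		return DRM_GPU_SCHED_STAT_NO_HANG;  /* extend the timeout */

	gpu_recover(gpu);  /* the usual reset/recovery dance */
	return DRM_GPU_SCHED_STAT_RESET;
}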
Regards,
Lucas
On 11/26/25 13:37, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
>> On 11/25/25 18:02, Lucas Stach wrote:
>>>>> I agree that distinguishing the use case that way is not ideal.
>>>>> However, who has the knowledge of how the hardware is being used by
>>>>> customers / users, if not the driver?
>>>>
>>>> Well the end user.
>>>>
>>>> Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
>>>>
>>>> Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
>>>>
>>>> And a 10 second maximum with only a warning that a non-default timeout is used, and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
>>>
>>> The question really is what you want to do after you hit the (lowered)
>>> timeout? Users get grumpy if you block things for 10 seconds, but they
>>> get equally if not more grumpy when you kick out a valid workload that
>>> just happens to need a lot of GPU time.
>>
>> Yeah, exactly that summarizes the problem pretty well.
>>
>>> Fences are only defined to signal eventually, with no real concept of a
>>> timeout. IMO all timeouts waiting for fences should be long enough to
>>> only be considered last resort. You may want to give the user some
>>> indication of a failed fence wait instead of stalling indefinitely, but
>>> you really only want to do this after a quite long timeout, not in a
>>> sense of "Sorry, I ran out of patience after 2 seconds".
>>>
>>> Sure memory management depends on fences making forward progress, but
>>> mm also depends on scheduled writeback making forward progress. You
>>> don't kick out writeback requests after an arbitrary timeout just
>>> because the backing storage happens to be loaded heavily.
>>>
>>> This BTW is also why etnaviv has always had a quite short timeout of
>>> 500ms, with the option to extend the timeout when the GPU is still
>>> making progress. We don't ever want to shoot down valid workloads (we
>>> have some that need a few seconds to upload textures, etc on our wimpy
>>> GPU), but you also don't want to wait multiple seconds until you detect
>>> a real GPU hang.
>>
>> That is a really good point. We considered that as well, but then abandoned the idea, see below for the background.
>>
>> What we could also do is set a flag on the fence when a process is killed and then wait for that fence to signal so that it can clean up. Going to prototype that.
>>
>>> So we use the short scheduler timeout to check in on the GPU and see if
>>> it is still making progress (for graphics workloads by looking at the
>>> frontend position within the command buffer and current primitive ID).
>>> If we can deduce that the GPU is stuck we do the usual reset/recovery
>>> dance within a reasonable reaction time, acceptable to users hitting a
>>> real GPU hang. But if the GPU is making progress we will give an
>>> infinite number of timeout extensions with no global timeout at all,
>>> only fulfilling the eventual signaling guarantee of the fence.
>>
>> Well the question is how do you detect *reliably* that there is still forward progress?
>
> My understanding is that that's impossible since the internals of
> command submissions are only really understood by userspace, who
> submits them.
Right, but we can still try to do our best in the kernel to mitigate the situation.
I think for now amdgpu will implement something like checking whether the HW still makes progress after a timeout, but with only a limited number of retries until we say that's it and reset anyway.
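Hand-waving sketch of that idea (not actual amdgpu code, all helpers
made up; scheduler status names as in recent kernels):

#define MAX_PROGRESS_RETRIES	3

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *job)
{
	struct my_ring *ring = to_my_ring(job->sched);
	u64 pos = my_ring_read_position(ring);	/* hypothetical HW read */

	/* Still advancing and retry budget left? Give it more time. */
	if (pos != ring->last_pos && ring->retries++ < MAX_PROGRESS_RETRIES) {
		ring->last_pos = pos;
		return DRM_GPU_SCHED_STAT_NO_HANG;
	}

	/* That's it, reset anyway. */
	ring->retries = 0;
	my_ring_reset(ring);
	return DRM_GPU_SCHED_STAT_RESET;
}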
> I think the long-term solution can only be fully fledged GPU scheduling
> with preemption. That's why we don't need such a timeout mechanism for
> userspace processes: the scheduler simply interrupts and lets someone
> else run.
Yeah absolutely.
>
> My hope would be that in the mid-term future we'd get firmware rings
> that can be preempted through a firmware call for all major hardware.
> Then a huge share of our problems would disappear.
At least on AMD HW preemption is actually horribly unreliable as well.
Userspace basically needs to cooperate and provide a buffer into which the state is saved on preemption.
> With the current situation, IDK either. My impression so far is that
> letting the drivers and driver programmers decide is the least bad
> choice.
Yeah, agree. It's the least evil thing we can do.
But I now have a plan for how to proceed :)
Thanks for the input,
Christian.
>
>
> P.
On 11/26/25 14:19, Philipp Stanner wrote:
> Barely anyone uses dma_fence_signal()'s (and similar functions') return
> code. Checking it is pretty much useless anyway, because what are you
> going to do if a fence was already signaled? Unsignal it and signal it
> again? ;p
Reviewed-by: Christian König <christian.koenig(a)amd.com> for the entire series.
Please push to drm-misc-next or leave me a note when I should pick it up.
> Removing the return code simplifies the API and makes it easier for me
> to sit on top with Rust DmaFence.
BTW, I have an rb for embedding the lock and I'm now writing test cases.
When that is done you should be able to base the Rust DmaFence abstraction on that as well.
Regards,
Christian.
>
> Philipp Stanner (6):
> dma-buf/dma-fence: Add dma_fence_test_signaled_flag()
> amd/amdkfd: Ignore return code of dma_fence_signal()
> drm/gpu/xe: Ignore dma_fenc_signal() return code
> dma-buf: Don't misuse dma_fence_signal()
> drm/ttm: Remove return check of dma_fence_signal()
> dma-buf/dma-fence: Remove return code of signaling-functions
>
> drivers/dma-buf/dma-fence.c | 59 ++++++-------------
> drivers/dma-buf/st-dma-fence.c | 7 +--
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +-
> .../gpu/drm/ttm/tests/ttm_bo_validate_test.c | 3 +-
> drivers/gpu/drm/xe/xe_hw_fence.c | 5 +-
> include/linux/dma-fence.h | 33 ++++++++---
> 6 files changed, 53 insertions(+), 59 deletions(-)
>
On 11/25/25 11:56, Philipp Stanner wrote:
>>>>
>>>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
>>>>
>>>> The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
>>>
>>> That's a question only the drivers can answer, which is why I like to
>>> think that setting global constants constraining all parties is not the
>>> right thing to do.
>>
>> Exactly, that's the reason why I bring that up. I think that letting drivers be in charge of timeouts is the wrong approach.
>>
>> See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management and desktop responsiveness depend on it.
>
> Good and well, but then patch 4 becomes even more problematic:
>
> So we'd just have drivers fire warnings, and then they would still have
> the freedom to set timeouts for drm/sched, as long as those timeouts
> are smaller than your new global constant.
>
> Why then not remove drm/sched's timeout parameter API completely and
> always use your maximum value internally in drm/sched? Or maybe
> truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
> "Maximum timeout parameter exceeded, truncating to %ld.\n"
>
> I suppose some drivers want even higher responsiveness than those 2
> seconds.
As far as I know, some medical use cases for example have timeouts like 100-200ms. But again, that is the use case and not the driver.
> I do believe that more of the driver folks should be made aware of this
> intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
>>
>>> What is even your motivation? What problem does this solve? Is the OOM
>>> killer currently hanging for anyone? Can you link a bug report?
>>
>> I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
>>
>> So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
>>
>> While working on that I realized that the timeout is actually not driver dependent at all.
>>
>> What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
>>
>> I mean, what is actually HW-dependent about the requirement that I need a responsive desktop system?
>
> I suppose some drivers are indeed only used for server hardware. And
> for compute you might not care about responsiveness as long as your
> result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
> I agree that distinguishing the use case that way is not ideal.
> However, who has the knowledge of how the hardware is being used by
> customers / users, if not the driver?
Well the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like 2 seconds default (which BTW is the default on Windows as well), which can be overridden on a global, per device, per queue name basis.
And a 10 second maximum with only a warning that a non-default timeout is used, and everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
Thoughts?
Regards,
Christian.
>
>
> P.
On Tue, Nov 25, 2025 at 05:11:18PM -0800, Alex Mastro wrote:
> fill_sg_entry() splits large DMA buffers into multiple scatter-gather
> entries, each holding up to UINT_MAX bytes. When calculating the DMA
> address for entries beyond the second one, the expression (i * UINT_MAX)
> causes integer overflow due to 32-bit arithmetic.
>
> This manifests when the input arg length >= 8 GiB, which results in
> looping with i >= 2.
>
> Fix by casting i to dma_addr_t before multiplication.
>
> Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> Signed-off-by: Alex Mastro <amastro(a)fb.com>
> ---
> More color about how I discovered this in [1] for the commit at [2]:
>
> [1] https://lore.kernel.org/all/aSZHO6otK0Heh+Qj@devgpu015.cco6.facebook.com
> [2] https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.c…
> ---
> drivers/dma-buf/dma-buf-mapping.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leon(a)kernel.org>
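For readers skimming by, the overflow pattern being fixed is roughly
this (a hypothetical simplification, not the actual fill_sg_entry()
code):

/* Each scatter-gather entry covers up to UINT_MAX bytes. With 32-bit
 * arithmetic, i * UINT_MAX wraps for i >= 2 (2 * 0xffffffff ==
 * 0xfffffffe mod 2^32), corrupting the DMA address. */
static dma_addr_t entry_dma_addr(dma_addr_t base, unsigned int i)
{
	/* Broken: evaluated in 32 bits, wraps for i >= 2. */
	/* return base + i * UINT_MAX; */

	/* Fixed: promote i to dma_addr_t before multiplying. */
	return base + (dma_addr_t)i * UINT_MAX;
}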
On 11/25/25 14:52, Pavel Begunkov wrote:
> On 11/24/25 14:17, Christian König wrote:
>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>> On 11/24/25 10:33, Christian König wrote:
>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>
>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>
>>>> Or am I mixing something up here?
>>>
>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>> and avoid passing dma address arrays directly.
>>>
>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>
>>> I'm lost on what part is not doable. Can you elaborate on the core
>>> dma-fence dma-buf rules?
>>
>> I most likely mixed that up, in other words that was a different discussion.
>>
>> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in-flight events, have the fence creation in the right order etc...
>
> I'm curious, what can happen if there is new IO using a
> move_notify()ed mapping, but let's say it's guaranteed to complete
> strictly before dma_buf_unmap_attachment() and the fence is signaled?
> Is there some loss of data or corruption that can happen?
The problem is that you can't guarantee that, because you run into deadlocks.
As soon as a dma_fence is created and published by calling add_fence, it can be that memory management loops back and depends on that fence.
So you actually can't issue any new IO which might block the unmap operation.
>
> sg_table = map_attach() |
> move_notify() |
> -> add_fence(fence) |
> | issue_IO(sg_table)
> | // IO completed
> unmap_attachment(sg_table) |
> signal_fence(fence) |
>
>> For example once the fence is created you can't make any memory allocations any more; that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>
> Looks like I have some terminology gap here. By "memory allocations" you
> don't mean kmalloc, right? I assume it's about new users of the
> mapping.
kmalloc() as well as get_free_page() is exactly what is meant here.
You can't make any memory allocation any more after creating/publishing a dma_fence.
The usual flow is the following:
1. Lock dma_resv object
2. Prepare I/O operation, make all memory allocations etc...
3. Allocate dma_fence object
4. Push I/O operation to the HW, making sure that you don't allocate memory any more.
5. Call dma_resv_add_fence() (with the fence allocated in #3).
6. Unlock dma_resv object
If you stray from that you most likely end up in a deadlock sooner or later. In code the flow looks roughly like the sketch below.
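Most error handling elided, and the my_* helpers are made up:

#include <linux/dma-resv.h>

int my_submit(struct my_dev *dev, struct dma_resv *resv)
{
	struct dma_fence *fence;
	int ret;

	ret = dma_resv_lock(resv, NULL);		/* 1. lock */
	if (ret)
		return ret;

	ret = my_prepare_job(dev);			/* 2. all allocations here, */
	if (!ret)
		ret = dma_resv_reserve_fences(resv, 1);	/*    incl. the fence slot */
	if (ret)
		goto out_unlock;

	fence = my_alloc_hw_fence(dev);			/* 3. allocate the fence */
	my_push_to_hw(dev, fence);			/* 4. no allocations past here */
	dma_resv_add_fence(resv, fence,			/* 5. publish */
			   DMA_RESV_USAGE_WRITE);
out_unlock:
	dma_resv_unlock(resv);				/* 6. unlock */
	return ret;
}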
Regards,
Christian.
>>>> Since I don't see any dma_fence implementation at all that might actually be the case.
>>>
>>> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
>>> callback and is signaled when all inflight IO using the current
>>> mapping are complete. All new IO requests will try to recreate the
>>> mapping, and hence potentially wait with dma_resv_wait_timeout().
>>
>> Without looking at the code that approach sounds more or less correct to me.
>>
>>>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>>>
>>> Have any reference?
>>
>> There is a WIP feature in AMDs GPU driver package for ROCm.
>>
>> But that can't be used as a general-purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.
>
> Got it
>
>> BTW, when you use DMA addresses from DMA-buf, always keep in mind that this memory can be written to by others at the same time; e.g. you can't do things like compute a CRC first, then write to the backing store and finally compare the CRC.
>
> Right. The direct IO path also works with user pages, so the
> constraints are similar in this regard.
>
On 11/25/25 09:13, Philipp Stanner wrote:
> On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
>> On 11/25/25 08:55, Philipp Stanner wrote:
>>>>
>>>> +/**
>>>> + * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
>>>> + *
>>>> + * The dma_fence object has a deep interdependency with core memory
>>>> + * management, for a detailed explanation see section DMA Fences under
>>>> + * Documentation/driver-api/dma-buf.rst.
>>>> + *
>>>> + * Because of this all dma_fence implementations must guarantee that each fence
>>>> + * completes in a finite time. This define here now gives a reasonable value for
>>>> + * the timeout to use. It is possible to use a longer timeout in an
>>>> + * implementation but that should taint the kernel.
>>>> + */
>>>> +#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
>>>
>>> HZ can change depending on the config. Is that really a good choice? I
>>> could see racy situations arising in some configs vs others
>>
>> 2*HZ is always two seconds expressed in number of jiffies; I can use msecs_to_jiffies(2000) to make that more obvious.
>
> On AMD64 maybe. What about the other architectures?
HZ is defined as jiffies per second, so even if it changes to 10, 100, or 1000 depending on the architecture, 2*HZ is always two seconds expressed in jiffies.
The HZ define is actually there to make this architecture-independent.
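I.e. (illustration only):

#include <linux/jiffies.h>

/* Both are two seconds worth of jiffies, independent of CONFIG_HZ: */
unsigned long t1 = 2 * HZ;
unsigned long t2 = msecs_to_jiffies(2000);

/* HZ = 100  -> 200 jiffies
 * HZ = 250  -> 500 jiffies
 * HZ = 1000 -> 2000 jiffies
 */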
>>
>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
>>
>> The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
>
> That's a question only the drivers can answer, which is why I like to
> think that setting global constants constraining all parties is not the
> right thing to do.
Exactly, that's the reason why I bring that up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management and desktop responsiveness depend on it.
> What is even your motivation? What problem does this solve? Is the OOM
> killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW-dependent about the requirement that I need a responsive desktop system?
>>
>> Some AMD internal team is pushing for 10 seconds, but that also means that for example we wait 10 seconds for the OOM killer to do something. That sounds like way too long.
>>
>
> Nouveau has timeout = 10 seconds. AFAIK we've never seen bugs because
> of that. Have you seen some?
Thanks for that info. And to answer the question, yes certainly.
Regards,
Christian.
>
>
> P.
On 11/25/25 08:55, Philipp Stanner wrote:
> On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
>> Add a define implementations can use as reasonable maximum signaling
>> timeout. Document that if this timeout is exceeded by config options
>> implementations should taint the kernel.
>>
>> Tainting the kernel is important for bug reports to detect that end
>> users might be using a problematic configuration.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> include/linux/dma-fence.h | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>> index 64639e104110..b31dfa501c84 100644
>> --- a/include/linux/dma-fence.h
>> +++ b/include/linux/dma-fence.h
>> @@ -28,6 +28,20 @@ struct dma_fence_ops;
>> struct dma_fence_cb;
>> struct seq_file;
>>
>> +/**
>> + * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
>> + *
>> + * The dma_fence object has a deep interdependency with core memory
>> + * management, for a detailed explanation see section DMA Fences under
>> + * Documentation/driver-api/dma-buf.rst.
>> + *
>> + * Because of this all dma_fence implementations must guarantee that each fence
>> + * completes in a finite time. This define here now gives a reasonable value for
>> + * the timeout to use. It is possible to use a longer timeout in an
>> + * implementation but that should taint the kernel.
>> + */
>> +#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
>
> HZ can change depending on the config. Is that really a good choice? I
> could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies; I can use msecs_to_jiffies(2000) to make that more obvious.
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
Some AMD internal team is pushing for 10 seconds, but that also means that for example we wait 10 seconds for the OOM killer to do something. That sounds like way too long.
Regards,
Christian.
>
> P.
Hi everybody,
we have documented here https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#dma-fence-cr… that dma_fence objects must signal in a reasonable amount of time, but at the same time note that drivers might have a different idea of what reasonable means.
Recently I realized that this is actually not a good idea. The background is that the wall-clock timeout means that, for example, the OOM killer might actually have to wait for this timeout before it can terminate a process and reclaim the memory used. And this is just one example of how general kernel features might depend on it.
Some drivers and fence implementations used 10 seconds, and that raised complaints from end users. So at least amdgpu recently switched to 2 seconds, which triggered an internal discussion about it.
This patch set here now adds a define to the dma_fence header which gives 2 seconds as a reasonable amount of time. SW-sync is modified to always taint the kernel (since it doesn't have a timeout), VGEM is switched over to the new define, and the scheduler gets a warning and taints the kernel if a driver uses a timeout longer than that. On the scheduler side that check boils down to roughly the sketch below.
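Sketch of the scheduler-side check (not necessarily the exact patch;
the taint flag choice is illustrative):

	/* sched->timeout and sched->name are existing drm_gpu_scheduler
	 * fields. */
	if (sched->timeout > DMA_FENCE_MAX_REASONABLE_TIMEOUT) {
		pr_warn("%s: timeout of %ld jiffies exceeds DMA_FENCE_MAX_REASONABLE_TIMEOUT\n",
			sched->name, sched->timeout);
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}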
I don't have much intention of actually committing these patches (except maybe the SW-sync one), but the question is whether 2 seconds is reasonable.
Regards,
Christian.
On 11/24/25 12:30, Pavel Begunkov wrote:
> On 11/24/25 10:33, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Picking up the work on supporting dmabuf in the read/write path.
>>
>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>
>> Or am I mixing something up here?
>
> The time gap is purely due to me being busy. I wasn't CC'ed to those private
> discussions you mentioned, but the v1 feedback was to use dynamic attachments
> and avoid passing dma address arrays directly.
>
> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>
> I'm lost on what part is not doable. Can you elaborate on the core
> dma-fence dma-buf rules?
I most likely mixed that up, in other words that was a different discussion.
When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in-flight events, have the fence creation in the right order etc...
For example once the fence is created you can't make any memory allocations any more; that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>> Since I don't see any dma_fence implementation at all that might actually be the case.
>
> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
> callback and is signaled when all inflight IO using the current
> mapping are complete. All new IO requests will try to recreate the
> mapping, and hence potentially wait with dma_resv_wait_timeout().
Without looking at the code that approach sounds more or less correct to me.
>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>
> Have any reference?
There is a WIP feature in AMDs GPU driver package for ROCm.
But that can't be used as a general-purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.
BTW, when you use DMA addresses from DMA-buf, always keep in mind that this memory can be written to by others at the same time; e.g. you can't do things like compute a CRC first, then write to the backing store and finally compare the CRC.
Regards,
Christian.
On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.
IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
Or am I mixing something up here? Since I don't see any dma_fence implementation at all that might actually be the case.
On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
Regards,
Christian.
> There
> are two main changes. First, it doesn't pass a dma address directly but
> rather wraps it into an opaque structure, which is extended and
> understood by the target driver.
>
> The second big change is support for dynamic attachments, which added a
> good part of complexity (see Patch 5). I kept the main machinery in nvme
> at first, but move_notify can ask to kill the dma mapping asynchronously,
> and any new IO would need to wait during submission, thus it was moved
> to blk-mq. That also introduced an extra callback layer b/w driver and
> blk-mq.
>
> There are some rough corners, and I'm not perfectly happy about the
> complexity and layering. For v3 I'll try to move the waiting up in the
> stack to io_uring wrapped into library helpers.
>
> For now, I'm interested what is the best way to test move_notify? And
> how dma_resv_reserve_fences() errors should be handled in move_notify?
>
> The uapi didn't change, after registration it looks like a normal
> io_uring registered buffer and can be used as such. Only non-vectored
> fixed reads/writes are allowed. Pseudo code:
>
> // registration
> reg_buf_idx = 0;
> io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });
>
> // request creation
> io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
> buffer_size, file_offset, reg_buf_idx);
>
> And as previously, a good bunch of code was taken from Keith's series [1].
>
> liburing based example:
>
> git: https://github.com/isilence/liburing.git dmabuf-rw
> link: https://github.com/isilence/liburing/tree/dmabuf-rw
>
> [1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
>
> Pavel Begunkov (11):
> file: add callback for pre-mapping dmabuf
> iov_iter: introduce iter type for pre-registered dma
> block: move around bio flagging helpers
> block: introduce dma token backed bio type
> block: add infra to handle dmabuf tokens
> nvme-pci: add support for dmabuf reggistration
> nvme-pci: implement dma_token backed requests
> io_uring/rsrc: add imu flags
> io_uring/rsrc: extended reg buffer registration
> io_uring/rsrc: add dmabuf-backed buffer registeration
> io_uring/rsrc: implement dmabuf regbuf import
>
> block/Makefile | 1 +
> block/bdev.c | 14 ++
> block/bio.c | 21 +++
> block/blk-merge.c | 23 +++
> block/blk-mq-dma-token.c | 236 +++++++++++++++++++++++++++++++
> block/blk-mq.c | 20 +++
> block/blk.h | 3 +-
> block/fops.c | 3 +
> drivers/nvme/host/pci.c | 217 ++++++++++++++++++++++++++++
> include/linux/bio.h | 49 ++++---
> include/linux/blk-mq-dma-token.h | 60 ++++++++
> include/linux/blk-mq.h | 21 +++
> include/linux/blk_types.h | 8 +-
> include/linux/blkdev.h | 3 +
> include/linux/dma_token.h | 35 +++++
> include/linux/fs.h | 4 +
> include/linux/uio.h | 10 ++
> include/uapi/linux/io_uring.h | 13 +-
> io_uring/rsrc.c | 201 +++++++++++++++++++++++---
> io_uring/rsrc.h | 23 ++-
> io_uring/rw.c | 7 +-
> lib/iov_iter.c | 30 +++-
> 22 files changed, 948 insertions(+), 54 deletions(-)
> create mode 100644 block/blk-mq-dma-token.c
> create mode 100644 include/linux/blk-mq-dma-token.h
> create mode 100644 include/linux/dma_token.h
>
On Thu, Nov 20, 2025 at 08:04:37AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> > +
> > +static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf,
> > + struct pfn_batch *batch,
> > + unsigned long start_index,
> > + unsigned long last_index)
> > +{
> > + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE;
> > +
> > + /*
> > + * This works in PAGE_SIZE indexes, if the dmabuf is sliced and
> > + * starts/ends at a sub page offset then the batch to domain code will
> > + * adjust it.
> > + */
>
> dmabuf->start_offset comes from pages->dmabuf.start, which is initialized as:
>
> pages->dmabuf.start = start - start_byte;
>
> so it's always page-aligned. Where is the sub-page offset coming from?
I need to go over this again to check it; this sub-page stuff is
a bit convoluted. start_offset should include the sub-page offset
here.
> > @@ -1687,6 +1737,12 @@ static void __iopt_area_unfill_domain(struct
> > iopt_area *area,
> >
> > lockdep_assert_held(&pages->mutex);
> >
> > + if (iopt_is_dmabuf(pages)) {
> > + iopt_area_unmap_domain_range(area, domain, start_index,
> > + last_index);
> > + return;
> > + }
> > +
>
> this belongs to patch3?
This is part of programming the domain with the dmabuf; patch 3 was
about the revoke, which is a slightly different topic, though they are
both similar.
Thanks,
Jason
On Thu, Nov 20, 2025 at 07:55:04AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> >
> >
> > @@ -2031,7 +2155,10 @@ int iopt_pages_rw_access(struct iopt_pages
> > *pages, unsigned long start_byte,
> > if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
> > return -EPERM;
> >
> > - if (pages->type == IOPT_ADDRESS_FILE)
> > + if (iopt_is_dmabuf(pages))
> > + return -EINVAL;
> > +
>
> probably also add helpers for other types, e.g.:
>
> iopt_is_user()
> iopt_is_memfd()
The helper was to integrate the IS_ENABLED() check for DMABUF; there
are not so many other uses. I think we should leave it as-is to not bloat
the patch.
> > + if (pages->type != IOPT_ADDRESS_USER)
> > return iopt_pages_rw_slow(pages, start_index, last_index,
> > start_byte % PAGE_SIZE, data,
> > length,
> > flags);
> > --
>
> then the following WARN_ON() becomes useless:
>
> if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
> WARN_ON(pages->type != IOPT_ADDRESS_USER))
> return -EINVAL;
Yep
Thanks,
Jason
On Thu, Nov 20, 2025 at 05:04:13PM -0700, Alex Williamson wrote:
> On Thu, 20 Nov 2025 11:28:29 +0200
> Leon Romanovsky <leon(a)kernel.org> wrote:
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 142b84b3f225..51a3bcc26f8b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> ...
> > @@ -2487,8 +2500,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >
> > err_undo:
> > list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
> > - vdev.dev_set_list)
> > + vdev.dev_set_list) {
> > + if (__vfio_pci_memory_enabled(vdev))
> > + vfio_pci_dma_buf_move(vdev, false);
> > up_write(&vdev->memory_lock);
> > + }
>
> I ran into a bug here. In the hot reset path we can have dev_sets
> where one or more devices are not opened by the user. The vconfig
> buffer for the device is established on open. However:
>
> bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
> ...
>
> Leads to a NULL pointer dereference.
>
> I think the most straightforward fix is simply to test the open_count
> on the vfio_device, which is also protected by the dev_set->lock that
> we already hold here:
>
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2501,7 +2501,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> err_undo:
> list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
> vdev.dev_set_list) {
> - if (__vfio_pci_memory_enabled(vdev))
> + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
> vfio_pci_dma_buf_move(vdev, false);
> up_write(&vdev->memory_lock);
> }
>
> Any other suggestions? This should be the only reset path with this
> nuance of affecting non-opened devices. Thanks,
It seems right to me.
Thanks
>
> Alex
Hi Barry,
On Fri, 21 Nov 2025 at 06:54, Barry Song <21cnbao(a)gmail.com> wrote:
>
> Hi Sumit,
>
> >
> > Using the micro-benchmark below, we see that mmap becomes
> > 3.5X faster:
>
>
> Marcin pointed out to me off-tree that it is actually 35x faster,
> not 3.5x faster. Sorry for my poor math. I assume you can fix this
> when merging it?
Sure, I corrected this, and it is merged to drm-misc-next.
Thanks,
Sumit.
>
> >
> > W/ patch:
> >
> > ~ # ./a.out
> > mmap 512MB took 200266.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 198151.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 197069.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 196781.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 198102.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 195552.000 us, verify OK
> >
> > W/o patch:
> >
> > ~ # ./a.out
> > mmap 512MB took 6987470.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6970739.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6984383.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6971311.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6991680.000 us, verify OK
>
>
> Thanks
> Barry
On Thu, Nov 20, 2025 at 05:04:13PM -0700, Alex Williamson wrote:
> @@ -2501,7 +2501,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> err_undo:
> list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
> vdev.dev_set_list) {
> - if (__vfio_pci_memory_enabled(vdev))
> + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
> vfio_pci_dma_buf_move(vdev, false);
> up_write(&vdev->memory_lock);
> }
>
> Any other suggestions? This should be the only reset path with this
> nuance of affecting non-opened devices. Thanks,
Seems reasonable, but should it be in __vfio_pci_memory_enabled() just
to be robust?
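Something like this (untested sketch):

--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
 bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
+	u16 cmd;
+
+	/* Never-opened devices have no vconfig buffer. */
+	if (!vdev->vdev.open_count)
+		return false;
+
+	cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
 	...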
Jason
On Thu, Nov 20, 2025 at 07:59:19AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> >
> > +enum batch_kind {
> > + BATCH_CPU_MEMORY = 0,
> > + BATCH_MMIO,
> > +};
>
> does 'CPU_MEMORY' (instead of plain 'MEMORY') imply future
> support of 'DEV_MEMORY'?
Maybe, but I don't have an immediate thought on this. CXL "MMIO" that
is cacheable is a thing, but we can also label it as CPU_MEMORY.
We might have something for CC shared/protected memory down the road.
Thanks,
Jason