- Linaro-mm-sig - lists.linaro.org

Re: [PATCH v6 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf

by Jason Gunthorpe

On Tue, Nov 04, 2025 at 11:19:43AM -0800, Nicolin Chen wrote: > On Sun, Nov 02, 2025 at 10:00:48AM +0200, Leon Romanovsky wrote: > > Changelog: > > v6: > > * Fixed wrong error check from pcim_p2pdma_init(). > > * Documented pcim_p2pdma_provider() function. > > * Improved commit messages. > > * Added VFIO DMA-BUF selftest. > > * Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf. > > * Fixed error unwind when dma_buf_fd() fails. > > * Document latest changes to p2pmem. > > * Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type. > > * Moved DMA mapping logic to DMA-BUF. > > * Removed types patch to avoid dependencies between subsystems. > > * Moved vfio_pci_dma_buf_move() in err_undo block. > > * Added nvgrace patch. > > I have verified this v6 using Jason's iommufd dmabuf branch: > https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf/ > > by drafting a QEMU patch on top of Shameer's vSMMU v5 series: > https://github.com/nicolinc/qemu/commits/wip/iommufd_dmabuf/ > > with that, I see GPU BAR memory be correctly fetched in the QEMU: > vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 0", offset: 0x0, size: 0x1000000 > vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 2", offset: 0x0, size: 0x44f00000 > vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 4", offset: 0x0, size: 0x17a0000000 Great thanks! This means we finally have a solution to that follow_pfn lifetime problem in type 1! What a long journey :) For those following along this same flow will be used with KVM to allow it to map VFIO as well. Confidential Compute will require this because some arches can't put confidential MMIO (or RAM) into a VMA. Jason

1 month, 1 week

1
0
0 0

Re: [PATCH v1 11/20] drm/amdgpu: use multiple entities in amdgpu_fill_buffer

by Christian König

On 11/4/25 09:35, Pierre-Eric Pelloux-Prayer wrote: > The benefits of using multiple entities is that multiple fill jobs > can run in parallel. Otherwise, even if the entity has access > to multiple engines, a burst of N independent jobs will all > run on the same engine because an entity guarantees the ordering > of execution matches the ordering of the submission. > > Callers can opt-out of this behavior by passing the entity they > want to use (see amdgpu_move_blit). That still sounds like a really bad idea to me. First of all we can't reserve so many fence slots in the release handler, previously we basically just relied on the fact that the BO will most likely be mostly idle. I think we should just use a single SDMA engine for each clear and distribute clearing different BOs over multiple engines. Regards, Christian. > > Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 84 ++++++++++++++++++------- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 1 + > 2 files changed, 64 insertions(+), 21 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > index c357a6d9763a..839ea8c7f6be 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > @@ -2224,6 +2224,7 @@ u32 amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) > adev->mman.clear_entities = kcalloc(num_clear_entities, > sizeof(struct amdgpu_ttm_entity), > GFP_KERNEL); > + atomic_set(&adev->mman.next_clear_entity, 0); > if (!adev->mman.clear_entities) > goto error_free_entity; > > @@ -2498,10 +2499,12 @@ int amdgpu_fill_buffer(struct amdgpu_ttm_entity *entity, > { > struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring; > + struct dma_fence *fences[TTM_FENCES_MAX_SLOT_COUNT] = {}; > struct dma_fence *fence = NULL; > struct dma_resv *resv = NULL; > struct amdgpu_res_cursor dst; > - int r; > + uint64_t cur_size, to; > + int r, e, n_fences; > > /* The fences will be either added to the resv object or the last fence > * will be returned to the caller. In the latter case, all fill jobs will > @@ -2515,53 +2518,92 @@ int amdgpu_fill_buffer(struct amdgpu_ttm_entity *entity, > } > > if (!entity) { > - entity = &adev->mman.clear_entities[0]; > resv = &bo->tbo.base._resv; > - r = dma_resv_reserve_fences(resv, 1); > + > + /* Determine how much fences we're going to add to the > + * resv object. > + */ > + n_fences = 0; > + amdgpu_res_first(bo->tbo.resource, 0, amdgpu_bo_size(bo), &dst); > + while (dst.remaining) { > + cur_size = min(dst.size, 256ULL << 20); > + > + n_fences += 1; > + amdgpu_res_next(&dst, cur_size); > + } > + if (n_fences == 0) > + return 0; > + > + /* One slot per entity at most. */ > + n_fences = MIN(n_fences, adev->mman.num_clear_entities); > + > + r = dma_resv_reserve_fences(resv, n_fences); > if (r) > return r; > + } else { > + mutex_lock(&entity->gart_window_lock); > } > > amdgpu_res_first(bo->tbo.resource, 0, amdgpu_bo_size(bo), &dst); > > - mutex_lock(&entity->gart_window_lock); > while (dst.remaining) { > - struct dma_fence *next; > - uint64_t cur_size, to; > - > /* Never fill more than 256MiB at once to avoid timeouts */ > cur_size = min(dst.size, 256ULL << 20); > > + if (resv) { > + /* Pick a new entity for each partial clear so they can > + * execute in parallel. > + */ > + e = atomic_inc_return(&adev->mman.next_clear_entity) % > + adev->mman.num_clear_entities; > + entity = &adev->mman.clear_entities[e]; > + mutex_lock(&entity->gart_window_lock); > + } > + > r = amdgpu_ttm_map_buffer(&entity->base, > &bo->tbo, bo->tbo.resource, &dst, > entity->gart_window_id1, ring, false, > &cur_size, &to, > dependency, > resv); > - if (r) > + if (r) { > + mutex_unlock(&entity->gart_window_lock); > goto error; > + } > > r = amdgpu_ttm_fill_mem(ring, &entity->base, > src_data, to, cur_size, resv, > - &next, true, k_job_id); > - if (r) > + &fence, true, k_job_id); > + if (r) { > + mutex_unlock(&entity->gart_window_lock); > goto error; > - > - if (resv) { > - dma_resv_add_fence(resv, next, DMA_RESV_USAGE_KERNEL); > - dma_fence_put(next); > - } else { > - dma_fence_put(fence); > - fence = next; > } > > amdgpu_res_next(&dst, cur_size); > + > + if (resv) { > + /* Delay the addition of the fences to resv, otherwise the next partial > + * clears will depend on this one. > + */ > + fences[e] = fence; > + mutex_unlock(&entity->gart_window_lock); > + } else { > + dma_fence_put(*f); > + *f = fence; > + } > } > error: > - mutex_unlock(&entity->gart_window_lock); > - if (f) > - *f = dma_fence_get(fence); > - dma_fence_put(fence); > + if (resv) { > + for (e = 0; e < adev->mman.num_clear_entities; e++) { > + if (fences[e]) { > + dma_resv_add_fence(resv, fences[e], DMA_RESV_USAGE_KERNEL); > + dma_fence_put(fences[e]); > + } > + } > + } else { > + mutex_unlock(&entity->gart_window_lock); > + } > + > return r; > } > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > index 38df2b5b4bc7..3fc31c7c6bfe 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > @@ -73,6 +73,7 @@ struct amdgpu_mman { > > struct amdgpu_ttm_entity default_entity; /* has no gart windows */ > struct amdgpu_ttm_entity *clear_entities; > + atomic_t next_clear_entity; > u32 num_clear_entities; > struct amdgpu_ttm_entity move_entities[TTM_FENCES_MAX_SLOT_COUNT]; > u32 num_move_entities;

1 month, 1 week

1
0
0 0

Re: [PATCH v1 03/20] drm/amdgpu: remove direct_submit arg from amdgpu_copy_buffer

by Christian König

On 11/4/25 09:35, Pierre-Eric Pelloux-Prayer wrote: > It was always false. > > Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com> Reviewed-by: Christian König <christian.koenig(a)amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 20 +++++++------------ > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +- > 4 files changed, 10 insertions(+), 16 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c > index 199693369c7c..02c2479a8840 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c > @@ -39,7 +39,7 @@ static int amdgpu_benchmark_do_move(struct amdgpu_device *adev, unsigned size, > for (i = 0; i < n; i++) { > struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring; > r = amdgpu_copy_buffer(ring, saddr, daddr, size, NULL, &fence, > - false, false, 0); > + false, 0); > if (r) > goto exit_do_move; > r = dma_fence_wait(fence, false); > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > index c66f00434991..fce22712396b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > @@ -354,7 +354,7 @@ static int amdgpu_ttm_copy_mem_to_mem(struct amdgpu_device *adev, > } > > r = amdgpu_copy_buffer(ring, from, to, cur_size, resv, > - &next, false, true, copy_flags); > + &next, true, copy_flags); > if (r) > goto error; > > @@ -2211,16 +2211,13 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) > } > > static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev, > - bool direct_submit, > unsigned int num_dw, > struct dma_resv *resv, > bool vm_needs_flush, > struct amdgpu_job **job, > bool delayed, u64 k_job_id) > { > - enum amdgpu_ib_pool_type pool = direct_submit ? > - AMDGPU_IB_POOL_DIRECT : > - AMDGPU_IB_POOL_DELAYED; > + enum amdgpu_ib_pool_type pool = AMDGPU_IB_POOL_DELAYED; > int r; > struct drm_sched_entity *entity = delayed ? &adev->mman.low_pr : > &adev->mman.high_pr; > @@ -2246,7 +2243,7 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev, > int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset, > uint64_t dst_offset, uint32_t byte_count, > struct dma_resv *resv, > - struct dma_fence **fence, bool direct_submit, > + struct dma_fence **fence, > bool vm_needs_flush, uint32_t copy_flags) > { > struct amdgpu_device *adev = ring->adev; > @@ -2256,7 +2253,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset, > unsigned int i; > int r; > > - if (!direct_submit && !ring->sched.ready) { > + if (!ring->sched.ready) { > dev_err(adev->dev, > "Trying to move memory with ring turned off.\n"); > return -EINVAL; > @@ -2265,7 +2262,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset, > max_bytes = adev->mman.buffer_funcs->copy_max_bytes; > num_loops = DIV_ROUND_UP(byte_count, max_bytes); > num_dw = ALIGN(num_loops * adev->mman.buffer_funcs->copy_num_dw, 8); > - r = amdgpu_ttm_prepare_job(adev, direct_submit, num_dw, > + r = amdgpu_ttm_prepare_job(adev, num_dw, > resv, vm_needs_flush, &job, false, > AMDGPU_KERNEL_JOB_ID_TTM_COPY_BUFFER); > if (r) > @@ -2283,10 +2280,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset, > > amdgpu_ring_pad_ib(ring, &job->ibs[0]); > WARN_ON(job->ibs[0].length_dw > num_dw); > - if (direct_submit) > - r = amdgpu_job_submit_direct(job, ring, fence); > - else > - *fence = amdgpu_job_submit(job); > + *fence = amdgpu_job_submit(job); > if (r) > goto error_free; > > @@ -2315,7 +2309,7 @@ static int amdgpu_ttm_fill_mem(struct amdgpu_ring *ring, uint32_t src_data, > max_bytes = adev->mman.buffer_funcs->fill_max_bytes; > num_loops = DIV_ROUND_UP_ULL(byte_count, max_bytes); > num_dw = ALIGN(num_loops * adev->mman.buffer_funcs->fill_num_dw, 8); > - r = amdgpu_ttm_prepare_job(adev, false, num_dw, resv, vm_needs_flush, > + r = amdgpu_ttm_prepare_job(adev, num_dw, resv, vm_needs_flush, > &job, delayed, k_job_id); > if (r) > return r; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > index 577ee04ce0bf..50e40380fe95 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h > @@ -166,7 +166,7 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, > int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset, > uint64_t dst_offset, uint32_t byte_count, > struct dma_resv *resv, > - struct dma_fence **fence, bool direct_submit, > + struct dma_fence **fence, > bool vm_needs_flush, uint32_t copy_flags); > int amdgpu_ttm_clear_buffer(struct amdgpu_bo *bo, > struct dma_resv *resv, > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > index 46c84fc60af1..378af0b2aaa9 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > @@ -153,7 +153,7 @@ svm_migrate_copy_memory_gart(struct amdgpu_device *adev, dma_addr_t *sys, > } > > r = amdgpu_copy_buffer(ring, gart_s, gart_d, size * PAGE_SIZE, > - NULL, &next, false, true, 0); > + NULL, &next, true, 0); > if (r) { > dev_err(adev->dev, "fail %d to copy memory\n", r); > goto out_unlock;

1 month, 1 week

1
0
0 0

Re: [PATCH v1 02/20] drm/ttm: rework pipelined eviction fence handling

by Christian König

On 11/4/25 09:35, Pierre-Eric Pelloux-Prayer wrote: > Until now ttm stored a single pipelined eviction fence which means > drivers had to use a single entity for these evictions. > > To lift this requirement, this commit allows up to 8 entities to > be used. > > Ideally a dma_resv object would have been used as a container of > the eviction fences, but the locking rules makes it complex. > dma_resv all have the same ww_class, which means "Attempting to > lock more mutexes after ww_acquire_done." is an error. > > One alternative considered was to introduced a 2nd ww_class for > specific resv to hold a single "transient" lock (= the resv lock > would only be held for a short period, without taking any other > locks). > > The other option, is to statically reserve a fence array, and > extend the existing code to deal with N fences, instead of 1. > > The driver is still responsible to reserve the correct number > of fence slots. > > Lastly ttm_resource_manager.pipelined_eviction.n_fences is > initialized to 1, so the new behavior is opt-in. > > Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 ++- > .../gpu/drm/ttm/tests/ttm_bo_validate_test.c | 13 +++-- > drivers/gpu/drm/ttm/tests/ttm_resource_test.c | 5 +- > drivers/gpu/drm/ttm/ttm_bo.c | 56 ++++++++++++------- > drivers/gpu/drm/ttm/ttm_bo_util.c | 36 ++++++++++-- > drivers/gpu/drm/ttm/ttm_resource.c | 45 ++++++++++----- > include/drm/ttm/ttm_resource.h | 34 ++++++++--- > 7 files changed, 139 insertions(+), 58 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > index 326476089db3..c66f00434991 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > @@ -2156,7 +2156,7 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) > { > struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM); > uint64_t size; > - int r; > + int r, i; > > if (!adev->mman.initialized || amdgpu_in_reset(adev) || > adev->mman.buffer_funcs_enabled == enable || adev->gmc.is_app_apu) > @@ -2190,8 +2190,10 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) > } else { > drm_sched_entity_destroy(&adev->mman.high_pr); > drm_sched_entity_destroy(&adev->mman.low_pr); > - dma_fence_put(man->move); > - man->move = NULL; > + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++) { > + dma_fence_put(man->pipelined_eviction.fences[i]); > + man->pipelined_eviction.fences[i] = NULL; > + } > } > > /* this just adjusts TTM size idea, which sets lpfn to the correct value */ > diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c > index 3148f5d3dbd6..1396674e1923 100644 > --- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c > +++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c > @@ -651,7 +651,8 @@ static void ttm_bo_validate_move_fence_signaled(struct kunit *test) > int err; > > man = ttm_manager_type(priv->ttm_dev, mem_type); > - man->move = dma_fence_get_stub(); > + man->pipelined_eviction.n_fences = 1; > + man->pipelined_eviction.fences[0] = dma_fence_get_stub(); > > bo = ttm_bo_kunit_init(test, test->priv, size, NULL); > bo->type = bo_type; > @@ -668,7 +669,7 @@ static void ttm_bo_validate_move_fence_signaled(struct kunit *test) > KUNIT_EXPECT_EQ(test, ctx.bytes_moved, size); > > ttm_bo_put(bo); > - dma_fence_put(man->move); > + dma_fence_put(man->pipelined_eviction.fences[0]); > } > > static const struct ttm_bo_validate_test_case ttm_bo_validate_wait_cases[] = { > @@ -732,9 +733,10 @@ static void ttm_bo_validate_move_fence_not_signaled(struct kunit *test) > > spin_lock_init(&fence_lock); > man = ttm_manager_type(priv->ttm_dev, fst_mem); > - man->move = alloc_mock_fence(test); > + man->pipelined_eviction.n_fences = 1; > + man->pipelined_eviction.fences[0] = alloc_mock_fence(test); > > - task = kthread_create(threaded_fence_signal, man->move, "move-fence-signal"); > + task = kthread_create(threaded_fence_signal, man->pipelined_eviction.fences[0], "move-fence-signal"); > if (IS_ERR(task)) > KUNIT_FAIL(test, "Couldn't create move fence signal task\n"); > > @@ -742,7 +744,8 @@ static void ttm_bo_validate_move_fence_not_signaled(struct kunit *test) > err = ttm_bo_validate(bo, placement_val, &ctx_val); > dma_resv_unlock(bo->base.resv); > > - dma_fence_wait_timeout(man->move, false, MAX_SCHEDULE_TIMEOUT); > + dma_fence_wait_timeout(man->pipelined_eviction.fences[0], false, MAX_SCHEDULE_TIMEOUT); > + man->pipelined_eviction.fences[0] = NULL; > > KUNIT_EXPECT_EQ(test, err, 0); > KUNIT_EXPECT_EQ(test, ctx_val.bytes_moved, size); > diff --git a/drivers/gpu/drm/ttm/tests/ttm_resource_test.c b/drivers/gpu/drm/ttm/tests/ttm_resource_test.c > index e6ea2bd01f07..6dfdf759a491 100644 > --- a/drivers/gpu/drm/ttm/tests/ttm_resource_test.c > +++ b/drivers/gpu/drm/ttm/tests/ttm_resource_test.c > @@ -207,6 +207,7 @@ static void ttm_resource_manager_init_basic(struct kunit *test) > struct ttm_resource_test_priv *priv = test->priv; > struct ttm_resource_manager *man; > size_t size = SZ_16K; > + int i; > > man = kunit_kzalloc(test, sizeof(*man), GFP_KERNEL); > KUNIT_ASSERT_NOT_NULL(test, man); > @@ -216,8 +217,8 @@ static void ttm_resource_manager_init_basic(struct kunit *test) > KUNIT_ASSERT_PTR_EQ(test, man->bdev, priv->devs->ttm_dev); > KUNIT_ASSERT_EQ(test, man->size, size); > KUNIT_ASSERT_EQ(test, man->usage, 0); > - KUNIT_ASSERT_NULL(test, man->move); > - KUNIT_ASSERT_NOT_NULL(test, &man->move_lock); > + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++) > + KUNIT_ASSERT_NULL(test, man->pipelined_eviction.fences[i]); > > for (int i = 0; i < TTM_MAX_BO_PRIORITY; ++i) > KUNIT_ASSERT_TRUE(test, list_empty(&man->lru[i])); > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c > index f4d9e68b21e7..bc6d4a6c6d70 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo.c > +++ b/drivers/gpu/drm/ttm/ttm_bo.c > @@ -658,34 +658,48 @@ void ttm_bo_unpin(struct ttm_buffer_object *bo) > EXPORT_SYMBOL(ttm_bo_unpin); > > /* > - * Add the last move fence to the BO as kernel dependency and reserve a new > - * fence slot. > + * Add the pipelined eviction fencesto the BO as kernel dependency and reserve new > + * fence slots. > */ > -static int ttm_bo_add_move_fence(struct ttm_buffer_object *bo, > - struct ttm_resource_manager *man, > - bool no_wait_gpu) > +static int ttm_bo_add_pipelined_eviction_fences(struct ttm_buffer_object *bo, > + struct ttm_resource_manager *man, > + bool no_wait_gpu) > { > + struct dma_fence *fences_to_add[TTM_FENCES_MAX_SLOT_COUNT] = {}; > struct dma_fence *fence; > - int ret; > + bool all_signaled = true, signaled; > + int i, n = 0; > > - spin_lock(&man->move_lock); > - fence = dma_fence_get(man->move); > - spin_unlock(&man->move_lock); > + spin_lock(&man->pipelined_eviction.lock); > + for (i = 0; i < man->pipelined_eviction.n_fences; i++) { > + fence = man->pipelined_eviction.fences[i]; > + if (!fence) > + continue; > + signaled = dma_fence_is_signaled(fence); > > - if (!fence) > + if (signaled) { > + dma_fence_put(man->pipelined_eviction.fences[i]); > + man->pipelined_eviction.fences[i] = NULL; Please completely drop that, only check if the fences are signaled when the no_wait_gpu flag is set. > + } else { > + all_signaled = false; > + if (no_wait_gpu) { > + spin_unlock(&man->pipelined_eviction.lock); > + return -EBUSY; > + } > + fences_to_add[n++] = dma_fence_get(fence); > + } > + } > + spin_unlock(&man->pipelined_eviction.lock); > + > + if (all_signaled) > return 0; > > - if (no_wait_gpu) { > - ret = dma_fence_is_signaled(fence) ? 0 : -EBUSY; > - dma_fence_put(fence); > - return ret; > + for (i = 0; i < n; i++) { > + dma_resv_add_fence(bo->base.resv, fences_to_add[i], DMA_RESV_USAGE_KERNEL); > + dma_fence_put(fences_to_add[i]); > } > > - dma_resv_add_fence(bo->base.resv, fence, DMA_RESV_USAGE_KERNEL); > - > - ret = dma_resv_reserve_fences(bo->base.resv, 1); > - dma_fence_put(fence); > - return ret; > + return dma_resv_reserve_fences(bo->base.resv, TTM_FENCES_MAX_SLOT_COUNT); Please separate out a patch where the call to dma_resv_reserve_fences() is removed here. > } > > /** > @@ -718,7 +732,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo, > int i, ret; > > ticket = dma_resv_locking_ctx(bo->base.resv); > - ret = dma_resv_reserve_fences(bo->base.resv, 1); > + ret = dma_resv_reserve_fences(bo->base.resv, TTM_FENCES_MAX_SLOT_COUNT); > if (unlikely(ret)) > return ret; > > @@ -757,7 +771,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo, > return ret; > } > > - ret = ttm_bo_add_move_fence(bo, man, ctx->no_wait_gpu); > + ret = ttm_bo_add_pipelined_eviction_fences(bo, man, ctx->no_wait_gpu); > if (unlikely(ret)) { > ttm_resource_free(bo, res); > if (ret == -EBUSY) > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > index acbbca9d5c92..ada8af965acf 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > @@ -258,7 +258,7 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, > ret = dma_resv_trylock(&fbo->base.base._resv); > WARN_ON(!ret); > > - ret = dma_resv_reserve_fences(&fbo->base.base._resv, 1); > + ret = dma_resv_reserve_fences(&fbo->base.base._resv, TTM_FENCES_MAX_SLOT_COUNT); > if (ret) { > dma_resv_unlock(&fbo->base.base._resv); > kfree(fbo); > @@ -646,6 +646,8 @@ static void ttm_bo_move_pipeline_evict(struct ttm_buffer_object *bo, > { > struct ttm_device *bdev = bo->bdev; > struct ttm_resource_manager *from; > + struct dma_fence *tmp; > + int i, free_slot = -1; > > from = ttm_manager_type(bdev, bo->resource->mem_type); > > @@ -653,13 +655,35 @@ static void ttm_bo_move_pipeline_evict(struct ttm_buffer_object *bo, > * BO doesn't have a TTM we need to bind/unbind. Just remember > * this eviction and free up the allocation > */ > - spin_lock(&from->move_lock); > - if (!from->move || dma_fence_is_later(fence, from->move)) { > - dma_fence_put(from->move); > - from->move = dma_fence_get(fence); > + spin_lock(&from->pipelined_eviction.lock); > + for (i = 0; i < from->pipelined_eviction.n_fences; i++) { > + tmp = from->pipelined_eviction.fences[i]; > + if (!tmp) { > + if (free_slot < 0) > + free_slot = i; > + continue; Just break here. > + } > + if (fence->context != tmp->context) > + continue; > + if (dma_fence_is_later(fence, tmp)) { > + dma_fence_put(tmp); > + free_slot = i; > + break; > + } > + goto unlock; > + } > + if (free_slot >= 0) { Drop free_slot and check i here. > + from->pipelined_eviction.fences[free_slot] = dma_fence_get(fence); > + } else { > + WARN(1, "not enough fence slots for all fence contexts"); > + spin_unlock(&from->pipelined_eviction.lock); > + dma_fence_wait(fence, false); > + goto end; > } > - spin_unlock(&from->move_lock); > > +unlock: > + spin_unlock(&from->pipelined_eviction.lock); > +end: > ttm_resource_free(bo, &bo->resource); > } > > diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c > index e2c82ad07eb4..ae0d4621cc55 100644 > --- a/drivers/gpu/drm/ttm/ttm_resource.c > +++ b/drivers/gpu/drm/ttm/ttm_resource.c > @@ -523,14 +523,19 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man, > { > unsigned i; > > - spin_lock_init(&man->move_lock); > man->bdev = bdev; > man->size = size; > man->usage = 0; > > for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) > INIT_LIST_HEAD(&man->lru[i]); > - man->move = NULL; > + spin_lock_init(&man->pipelined_eviction.lock); > + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++) > + man->pipelined_eviction.fences[i] = NULL; > + /* Can be overridden by drivers that wants to use more than 1 entity > + * for moves and evictions (limited to TTM_FENCES_MAX_SLOT_COUNT). > + */ > + man->pipelined_eviction.n_fences = 1; > } > EXPORT_SYMBOL(ttm_resource_manager_init); > > @@ -551,7 +556,7 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev, > .no_wait_gpu = false, > }; > struct dma_fence *fence; > - int ret; > + int ret, i; > > do { > ret = ttm_bo_evict_first(bdev, man, &ctx); > @@ -561,18 +566,32 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev, > if (ret && ret != -ENOENT) > return ret; > > - spin_lock(&man->move_lock); > - fence = dma_fence_get(man->move); > - spin_unlock(&man->move_lock); > + ret = 0; > > - if (fence) { > - ret = dma_fence_wait(fence, false); > - dma_fence_put(fence); > - if (ret) > - return ret; > - } > + do { > + fence = NULL; > > - return 0; > + spin_lock(&man->pipelined_eviction.lock); > + for (i = 0; i < man->pipelined_eviction.n_fences; i++) { > + fence = man->pipelined_eviction.fences[i]; > + man->pipelined_eviction.fences[i] = NULL; Drop that. We should never set man->pipelined_eviction.fences to NULL. Potentially even initialize all move fences with a stub fence. > + if (fence) > + break; > + } > + spin_unlock(&man->pipelined_eviction.lock); > + > + if (fence) { > + ret = dma_fence_wait(fence, false); > + dma_fence_put(fence); > + > + if (ret) > + break; > + } else { > + break; > + } > + } while (1); > + > + return ret; > } > EXPORT_SYMBOL(ttm_resource_manager_evict_all); > > diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h > index f49daa504c36..898c429b37ad 100644 > --- a/include/drm/ttm/ttm_resource.h > +++ b/include/drm/ttm/ttm_resource.h > @@ -50,6 +50,15 @@ struct io_mapping; > struct sg_table; > struct scatterlist; > > +/** > + * define TTM_FENCES_MAX_SLOT_COUNT - How many entities can be used for evictions > + * > + * Pipelined evictions can be spread on multiple entities. This > + * is the max number of entities that can be used by the driver > + * for that purpose. > + */ > +#define TTM_FENCES_MAX_SLOT_COUNT 8 Make that TTM_NUM_MOVE_FENCES. > + > /** > * enum ttm_lru_item_type - enumerate ttm_lru_item subclasses > */ > @@ -180,8 +189,10 @@ struct ttm_resource_manager_func { > * @size: Size of the managed region. > * @bdev: ttm device this manager belongs to > * @func: structure pointer implementing the range manager. See above > - * @move_lock: lock for move fence > - * @move: The fence of the last pipelined move operation. > + * @pipelined_eviction.lock: lock for eviction fences > + * @pipelined_eviction.n_fences: The number of fences allowed in the array. If > + * 0, pipelined evictions aren't used. > + * @pipelined_eviction.fences: The fences of the last pipelined move operation. > * @lru: The lru list for this memory type. > * > * This structure is used to identify and manage memory types for a device. > @@ -195,12 +206,15 @@ struct ttm_resource_manager { > struct ttm_device *bdev; > uint64_t size; > const struct ttm_resource_manager_func *func; > - spinlock_t move_lock; > > - /* > - * Protected by @move_lock. > + /* This is very similar to a dma_resv object, but locking rules make > + * it difficult to use a it in this context. > */ > - struct dma_fence *move; > + struct { > + spinlock_t lock; > + int n_fences; > + struct dma_fence *fences[TTM_FENCES_MAX_SLOT_COUNT]; > + } pipelined_eviction; Drop the separate structure, just make move an array instead. And also drop n_fences. Just always take a look at all fences. Regards, Christian. > > /* > * Protected by the bdev->lru_lock. > @@ -421,8 +435,12 @@ static inline bool ttm_resource_manager_used(struct ttm_resource_manager *man) > static inline void > ttm_resource_manager_cleanup(struct ttm_resource_manager *man) > { > - dma_fence_put(man->move); > - man->move = NULL; > + int i; > + > + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++) { > + dma_fence_put(man->pipelined_eviction.fences[i]); > + man->pipelined_eviction.fences[i] = NULL; > + } > } > > void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);

1 month, 1 week

1
0
0 0

[PATCH v6 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf

by Leon Romanovsky

Changelog: v6: * Fixed wrong error check from pcim_p2pdma_init(). * Documented pcim_p2pdma_provider() function. * Improved commit messages. * Added VFIO DMA-BUF selftest. * Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf. * Fixed error unwind when dma_buf_fd() fails. * Document latest changes to p2pmem. * Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type. * Moved DMA mapping logic to DMA-BUF. * Removed types patch to avoid dependencies between subsystems. * Moved vfio_pci_dma_buf_move() in err_undo block. * Added nvgrace patch. v5: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org * Rebased on top of v6.18-rc1. * Added more validation logic to make sure that DMA-BUF length doesn't overflow in various scenarios. * Hide kernel config from the users. * Fixed type conversion issue. DMA ranges are exposed with u64 length, but DMA-BUF uses "unsigned int" as a length for SG entries. * Added check to prevent from VFIO drivers which reports BAR size different from PCI, do not use DMA-BUF functionality. v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org * Split pcim_p2pdma_provider() to two functions, one that initializes array of providers and another to return right provider pointer. v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org * Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider(). * Cache provider in vfio_pci_dma_buf struct instead of BAR index. * Removed misleading comment from pcim_p2pdma_provider(). * Moved MMIO check to be in pcim_p2pdma_provider(). v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/ * Added extra patch which adds new CONFIG, so next patches can reuse * it. * Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state" into the other patch. * Fixed revoke calls to be aligned with true->false semantics. * Extended p2pdma_providers to be per-BAR and not global to whole * device. * Fixed possible race between dmabuf states and revoke. * Moved revoke to PCI BAR zap block. v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org * Changed commit messages. * Reused DMA_ATTR_MMIO attribute. * Returned support for multiple DMA ranges per-dMABUF. v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com --------------------------------------------------------------------------- Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API" https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series. --------------------------------------------------------------------------- This series extends the VFIO PCI subsystem to support exporting MMIO regions from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations. The series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P. However, as a general mechanism, it can support many other scenarios with VFIO. This dmabuf approach can be usable by iommufd as well for generic and safe P2P mappings. In addition to the SPDK use-case mentioned above, the capability added in this patch series can also be useful when a buffer (located in device memory such as VRAM) needs to be shared between any two dGPU devices or instances (assuming one of them is bound to VFIO PCI) as long as they are P2P DMA compatible. The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace. The series includes significant refactoring of the PCI P2PDMA subsystem to separate core P2P functionality from memory allocation features, making it more modular and suitable for VFIO use cases that don't need struct page support. ----------------------------------------------------------------------- The series is based originally on https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c… but heavily rewritten to be based on DMA physical API. ----------------------------------------------------------------------- The WIP branch can be found here: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=… Thanks --- Jason Gunthorpe (2): PCI/P2PDMA: Document DMABUF model vfio/nvgrace: Support get_dmabuf_phys Leon Romanovsky (7): PCI/P2PDMA: Separate the mmap() support from the core logic PCI/P2PDMA: Simplify bus address mapping API PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function dma-buf: provide phys_vec to scatter-gather mapping routine vfio/pci: Enable peer-to-peer DMA transactions by default vfio/pci: Add dma-buf export support for MMIO regions Vivek Kasireddy (2): vfio: Export vfio device get and put registration helpers vfio/pci: Share the core device pointer while invoking feature functions Documentation/driver-api/pci/p2pdma.rst | 95 +++++++--- block/blk-mq-dma.c | 2 +- drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++ drivers/iommu/dma-iommu.c | 4 +- drivers/pci/p2pdma.c | 182 +++++++++++++----- drivers/vfio/pci/Kconfig | 3 + drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/nvgrace-gpu/main.c | 56 ++++++ drivers/vfio/pci/vfio_pci.c | 5 + drivers/vfio/pci/vfio_pci_config.c | 22 ++- drivers/vfio/pci/vfio_pci_core.c | 56 ++++-- drivers/vfio/pci/vfio_pci_dmabuf.c | 315 ++++++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 23 +++ drivers/vfio/vfio_main.c | 2 + include/linux/dma-buf.h | 18 ++ include/linux/pci-p2pdma.h | 120 +++++++----- include/linux/vfio.h | 2 + include/linux/vfio_pci_core.h | 42 +++++ include/uapi/linux/vfio.h | 27 +++ kernel/dma/direct.c | 4 +- mm/hmm.c | 2 +- 21 files changed, 1077 insertions(+), 139 deletions(-) --- base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787 change-id: 20251016-dmabuf-vfio-6cef732adf5a Best regards, -- Leon Romanovsky <leonro(a)nvidia.com>

1 month, 1 week

2
14
0 0

Re: [PATCH v6 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf

by Jason Gunthorpe

On Mon, Nov 03, 2025 at 12:07:12PM -0800, Alex Mastro wrote: > On Sun, Nov 02, 2025 at 10:00:48AM +0200, Leon Romanovsky wrote: > > Changelog: > > v6: > > * Fixed wrong error check from pcim_p2pdma_init(). > > * Documented pcim_p2pdma_provider() function. > > * Improved commit messages. > > * Added VFIO DMA-BUF selftest. > > * Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf. > > * Fixed error unwind when dma_buf_fd() fails. > > * Document latest changes to p2pmem. > > * Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type. > > * Moved DMA mapping logic to DMA-BUF. > > * Removed types patch to avoid dependencies between subsystems. > > * Moved vfio_pci_dma_buf_move() in err_undo block. > > * Added nvgrace patch. > > Thanks Leon. Attaching a toy program which sanity tests the dma-buf export UAPI > by feeding the allocated dma-buf into an dma-buf importer (libibverbs + CX-7). Oh! Here is my toy program to do the same with iommufd as the importer: #define _GNU_SOURCE #define __user #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <unistd.h> #include "include/uapi/linux/vfio.h" #include "include/uapi/linux/iommufd.h" #include <string.h> #include <sys/mman.h> #include <errno.h> int main(int argc, const char *argv[]) { int vfio_dev_fd, iommufd_fd, ret; // Open the per-device VFIO file (e.g., /dev/vfio/devices/vfio3) vfio_dev_fd = open("/dev/vfio/devices/vfio0", O_RDWR); if (vfio_dev_fd < 0) { perror("Failed to open VFIO per-device file"); return 1; } // Open /dev/iommu for iommufd iommufd_fd = open("/dev/iommu", O_RDWR); if (iommufd_fd < 0) { perror("Failed to open /dev/iommu"); close(vfio_dev_fd); return 1; } // Bind device FD to iommufd struct vfio_device_bind_iommufd bind = { .argsz = sizeof(bind), .flags = 0, .iommufd = iommufd_fd, }; ret = ioctl(vfio_dev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind); if (ret < 0) { perror("VFIO_DEVICE_BIND_IOMMUFD failed"); close(vfio_dev_fd); close(iommufd_fd); return 1; } // Allocate an IOAS (I/O address space) struct iommu_ioas_alloc alloc_data = { .size = sizeof(alloc_data), .flags = 0, }; ret = ioctl(iommufd_fd, IOMMU_IOAS_ALLOC, &alloc_data); if (ret < 0) { perror("IOMMU_IOAS_ALLOC failed"); close(vfio_dev_fd); close(iommufd_fd); return 1; } // Attach the device to the IOAS struct vfio_device_attach_iommufd_pt attach_data = { .argsz = sizeof(attach_data), .flags = 0, .pt_id = alloc_data.out_ioas_id, }; ret = ioctl(vfio_dev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data); if (ret < 0) { perror("VFIO_DEVICE_ATTACH_IOMMUFD_PT failed"); close(vfio_dev_fd); close(iommufd_fd); return 1; } #if 0 int mapfd = memfd_create("test", MFD_CLOEXEC); if (mapfd == -1) { perror("memfd_create failed"); return 1; } ftruncate(mapfd, 4096); #else struct dmabuf_arg { struct vfio_device_feature hdr; struct vfio_device_feature_dma_buf dma_buf; struct vfio_region_dma_range range; } dma_buf_feature = { .hdr = { .argsz = sizeof(dma_buf_feature), .flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF }, .dma_buf = { .region_index = VFIO_PCI_BAR0_REGION_INDEX, .open_flags = O_CLOEXEC, .nr_ranges = 1 }, .range = { .length = 4096 }, }; ret = ioctl(vfio_dev_fd, VFIO_DEVICE_FEATURE, &dma_buf_feature); if (ret < 0) { perror("VFIO_DEVICE_FEATURE_GET failed"); return 1; } int mapfd = ret; #endif struct iommu_ioas_map_file map_file = { .size = sizeof(map_file), .flags = IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE, .ioas_id = alloc_data.out_ioas_id, .fd = mapfd, .start = 0, .length = 4096, }; ret = ioctl(iommufd_fd, IOMMU_IOAS_MAP_FILE, &map_file); if (ret < 0) { perror("IOMMU_IOAS_MAP_FILE failed"); return 1; } printf("Successfully attached device to IOAS ID: %u\n", alloc_data.out_ioas_id); close(vfio_dev_fd); close(iommufd_fd); return 0; }

1 month, 1 week

1
0
0 0

Re: [PATCH v6 10/11] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Sun, Nov 2, 2025, at 19:11, Alex Williamson wrote: > On Sun, 2 Nov 2025 17:12:53 +0200 > Leon Romanovsky <leon(a)kernel.org> wrote: >> On Sun, Nov 02, 2025 at 08:01:37AM -0700, Alex Williamson wrote: >> > We don't need the separate loop or flag, and adding it breaks the >> > existing reverse list walk. Thanks, >> >> Do you want me to send v7? I have a feeling that v6 is good to be merged. > > Let's hold off, if this ends up being the only fixup I can roll it in. > Thanks, Thanks > > Alex > >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index 24204893e221..51a3bcc26f8b 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -2403,7 +2403,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> struct iommufd_ctx *iommufd_ctx) >> { >> struct vfio_pci_core_device *vdev; >> - bool restore_revoke = false; >> struct pci_dev *pdev; >> int ret; >> >> @@ -2473,7 +2472,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> } >> >> vfio_pci_dma_buf_move(vdev, true); >> - restore_revoke = true; >> vfio_pci_zap_bars(vdev); >> } >> >> @@ -2501,15 +2499,12 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> struct vfio_pci_core_device, vdev.dev_set_list); >> >> err_undo: >> - if (restore_revoke) { >> - list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) >> - if (__vfio_pci_memory_enabled(vdev)) >> - vfio_pci_dma_buf_move(vdev, false); >> - } >> - >> list_for_each_entry_from_reverse(vdev, &dev_set->device_list, >> - vdev.dev_set_list) >> + vdev.dev_set_list) { >> + if (__vfio_pci_memory_enabled(vdev)) >> + vfio_pci_dma_buf_move(vdev, false); >> up_write(&vdev->memory_lock); >> + } >> >> list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) >> pm_runtime_put(&vdev->pdev->dev); >> >> >> > >> > Alex >> > >>

1 month, 1 week

1
0
0 0

Re: [PATCH v6 10/11] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Sun, Nov 02, 2025 at 08:01:37AM -0700, Alex Williamson wrote: > On Sun, 2 Nov 2025 10:00:58 +0200 > Leon Romanovsky <leon(a)kernel.org> wrote: > > @@ -2391,6 +2403,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > > struct iommufd_ctx *iommufd_ctx) > > { > > struct vfio_pci_core_device *vdev; > > + bool restore_revoke = false; > > struct pci_dev *pdev; > > int ret; > > > > @@ -2459,6 +2472,8 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > > break; > > } > > > > + vfio_pci_dma_buf_move(vdev, true); > > + restore_revoke = true; > > vfio_pci_zap_bars(vdev); > > } > > > > @@ -2486,6 +2501,12 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > > struct vfio_pci_core_device, vdev.dev_set_list); > > > > err_undo: > > + if (restore_revoke) { > > + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) > > + if (__vfio_pci_memory_enabled(vdev)) > > + vfio_pci_dma_buf_move(vdev, false); > > + } > > + > > list_for_each_entry_from_reverse(vdev, &dev_set->device_list, > > vdev.dev_set_list) > > We don't need the separate loop or flag, and adding it breaks the > existing reverse list walk. Thanks, Do you want me to send v7? I have a feeling that v6 is good to be merged. Thanks diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 24204893e221..51a3bcc26f8b 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -2403,7 +2403,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, struct iommufd_ctx *iommufd_ctx) { struct vfio_pci_core_device *vdev; - bool restore_revoke = false; struct pci_dev *pdev; int ret; @@ -2473,7 +2472,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, } vfio_pci_dma_buf_move(vdev, true); - restore_revoke = true; vfio_pci_zap_bars(vdev); } @@ -2501,15 +2499,12 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, struct vfio_pci_core_device, vdev.dev_set_list); err_undo: - if (restore_revoke) { - list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) - if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); - } - list_for_each_entry_from_reverse(vdev, &dev_set->device_list, - vdev.dev_set_list) + vdev.dev_set_list) { + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); + } list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) pm_runtime_put(&vdev->pdev->dev); > > Alex >

1 month, 1 week

1
0
0 0

Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via interconnects

by Christian König

On 10/31/25 06:15, Kasireddy, Vivek wrote: > Hi Jason, > >> Subject: Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via >> interconnects >> >> On Thu, Oct 30, 2025 at 06:17:11AM +0000, Kasireddy, Vivek wrote: >>> It mostly looks OK to me but there are a few things that I want to discuss, >>> after briefly looking at the patches in your branch: >>> - I am wondering what is the benefit of the SGT compatibility stuff especially >>> when Christian suggested that he'd like to see SGT usage gone from >>> dma-buf >> >> I think to get rid of SGT we do need to put it in a little well >> defined box and then create alternatives and remove things using >> SGT. This is a long journey, and I think this is the first step. >> >> If SGT is some special case it will be harder to excise. >> >> So the next steps would be to make all the exporters directly declare >> a SGT and then remove the SGT related ops from dma_ops itself and >> remove the compat sgt in the attach logic. This is not hard, it is all >> simple mechanical work. > IMO, this SGT compatibility stuff should ideally be a separate follow-on > effort (and patch series) that would also probably include updates to > various drivers to add the SGT mapping type. Nope, just the other way around. In other words the SGT compatibility is a pre-requisite. We should first demonstrate with existing drivers that the new interface works and does what it promised to do and then extend it with new functionality. Regards, Christian. > >> >> This way the only compat requirement is to automatically give an >> import match list for a SGT only importer which is very little code in >> the core. >> >> The point is we make the SGT stuff nonspecial and fully aligned with >> the mapping type in small steps. This way neither importer nor >> exporter should have any special code to deal with interworking. >> >> To remove SGT we'd want to teach the core code how to create some kind >> of conversion mapping type, eg exporter uses SGT importer uses NEW so >> the magic conversion mapping type does the adapatation. >> >> In this way we can convert importers and exporters to use NEW in any >> order and they still interwork with each other. >> >>> eventually. Also, if matching fails, IMO, indicating that to the >>> importer (allow_ic) and having both exporter/importer fallback to >>> the current legacy mechanism would be simpler than the SGT >>> compatibility stuff. >> >> I don't want to have three paths in importers. >> >> If the importer supports SGT it should declare it in a match and the >> core code should always return a SGT match for the importer to use >> >> The importer should not have to code 'oh it is sgt but it somehow a >> little different' via an allow_ic type idea. >> >>> - Also, I thought PCIe P2P (along with SGT) use-cases are already well >> handled >>> by the existing map_dma_buf() and other interfaces. So, it might be >> confusing >>> if the newer interfaces also provide a mechanism to handle P2P although a >>> bit differently. I might be missing something here but shouldn't the existing >>> allow_peer2peer and other related stuff be left alone? >> >> P2P is part of SGT, it gets pulled into the SGT stuff as steps toward >> isolating SGT properly. Again as we move things to use native SGT >> exporters we would remove the exporter related allow_peer2peer items >> when they become unused. >> >>> - You are also adding custom attach/detach ops for each mapping_type. I >> think >>> it makes sense to reuse existing attach/detach ops if possible and initiate >> the >>> matching process from there, at-least initially. >> >> I started there, but as soon as I went to adding PAL I realized the >> attach/detach logic was completely different for each of the mapping >> types. So this is looking alot simpler. >> >> If the driver wants to share the same attach/detach ops for some of >> its mapping types then it can just set the same function pointer to >> all of them and pick up the mapping type from the attach->map_type. >> >>> - Looks like your design doesn't call for a dma_buf_map_interconnect() or >> other >>> similar helpers provided by dma-buf core that the importers can use. Is that >>> because the return type would not be known to the core? >> >> I don't want to have a single shared 'map' operation, that is the >> whole point of this design. Each mapping type has its own ops, own >> types, own function signatures that the client calls directly. >> >> No more type confusion or trying to abuse phys_addr_t, dma_addr_t, or >> scatterlist for in appropriate things. If your driver wants something >> special, like IOV, then give it proper clear types so it is >> understandable. >> >>> - And, just to confirm, with your design if I want to add a new interconnect/ >>> mapping_type (not just IOV but in general), all that is needed is to provide >> custom >>> attach/detach, match ops and one or more ops to map/unmap the address >> list >>> right? Does this mean that the role of dma-buf core would be limited to just >>> match and the exporters are expected to do most of the heavy lifting and >>> checking for stuff like dynamic importers, resv lock held, etc? >> >> I expect the core code would continue to provide wrappers and helpers >> to call the ops that can do any required common stuff. >> >> However, keep in mind, when the importer moves to use mapping type it >> also must be upgraded to use the dynamic importer flow as this API >> doesn't support non-dynamic importers using mapping type. >> >> I will add some of these remarks to the commit messages.. > Sounds good. I'll start testing/working on IOV interconnect patches based on > your design. > > Thanks, > Vivek >> >> Thanks! >> Jason

1 month, 2 weeks

2
1
0 0

Re: [PATCH v2] drm/sched: Fix deadlock in drm_sched_entity_kill_jobs_cb

by Christian König

On 10/31/25 12:50, Tvrtko Ursulin wrote: > > On 31/10/2025 09:07, Pierre-Eric Pelloux-Prayer wrote: >> The Mesa issue referenced below pointed out a possible deadlock: >> >> [ 1231.611031] Possible interrupt unsafe locking scenario: >> >> [ 1231.611033] CPU0 CPU1 >> [ 1231.611034] ---- ---- >> [ 1231.611035] lock(&xa->xa_lock#17); >> [ 1231.611038] local_irq_disable(); >> [ 1231.611039] lock(&fence->lock); >> [ 1231.611041] lock(&xa->xa_lock#17); >> [ 1231.611044] <Interrupt> >> [ 1231.611045] lock(&fence->lock); >> [ 1231.611047] >> *** DEADLOCK *** >> >> In this example, CPU0 would be any function accessing job->dependencies >> through the xa_* functions that doesn't disable interrupts (eg: >> drm_sched_job_add_dependency, drm_sched_entity_kill_jobs_cb). >> >> CPU1 is executing drm_sched_entity_kill_jobs_cb as a fence signalling >> callback so in an interrupt context. It will deadlock when trying to >> grab the xa_lock which is already held by CPU0. >> >> Replacing all xa_* usage by their xa_*_irq counterparts would fix >> this issue, but Christian pointed out another issue: dma_fence_signal >> takes fence.lock and so does dma_fence_add_callback. >> >> dma_fence_signal() // locks f1.lock >> -> drm_sched_entity_kill_jobs_cb() >> -> foreach dependencies >> -> dma_fence_add_callback() // locks f2.lock >> >> This will deadlock if f1 and f2 share the same spinlock. > > Is it possible to hit this case? > > Same lock means same execution timeline Nope, exactly that is incorrect. It's completely up to the implementation what they use this lock for. >, which should mean dependency should have been squashed in drm_sched_job_add_dependency(), no? This makes it less likely, but not impossible to trigger. Regards, Christian. > > Or would sharing the lock but not sharing the entity->fence_context be considered legal? It would be surprising at least. > > Also, would anyone have time to add a kunit test? ;) > > Regards, > > Tvrtko > >> To fix both issues, the code iterating on dependencies and re-arming them >> is moved out to drm_sched_entity_kill_jobs_work. >> >> Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 >> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com> >> Suggested-by: Christian König <christian.koenig(a)amd.com> >> Reviewed-by: Christian König <christian.koenig(a)amd.com> >> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com> >> --- >> drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++----------- >> 1 file changed, 19 insertions(+), 15 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c >> index c8e949f4a568..fe174a4857be 100644 >> --- a/drivers/gpu/drm/scheduler/sched_entity.c >> +++ b/drivers/gpu/drm/scheduler/sched_entity.c >> @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity) >> } >> EXPORT_SYMBOL(drm_sched_entity_error); >> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, >> + struct dma_fence_cb *cb); >> + >> static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk) >> { >> struct drm_sched_job *job = container_of(wrk, typeof(*job), work); >> - >> - drm_sched_fence_scheduled(job->s_fence, NULL); >> - drm_sched_fence_finished(job->s_fence, -ESRCH); >> - WARN_ON(job->s_fence->parent); >> - job->sched->ops->free_job(job); >> -} >> - >> -/* Signal the scheduler finished fence when the entity in question is killed. */ >> -static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, >> - struct dma_fence_cb *cb) >> -{ >> - struct drm_sched_job *job = container_of(cb, struct drm_sched_job, >> - finish_cb); >> + struct dma_fence *f; >> unsigned long index; >> - dma_fence_put(f); >> - >> /* Wait for all dependencies to avoid data corruptions */ >> xa_for_each(&job->dependencies, index, f) { >> struct drm_sched_fence *s_fence = to_drm_sched_fence(f); >> @@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, >> dma_fence_put(f); >> } >> + drm_sched_fence_scheduled(job->s_fence, NULL); >> + drm_sched_fence_finished(job->s_fence, -ESRCH); >> + WARN_ON(job->s_fence->parent); >> + job->sched->ops->free_job(job); >> +} >> + >> +/* Signal the scheduler finished fence when the entity in question is killed. */ >> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, >> + struct dma_fence_cb *cb) >> +{ >> + struct drm_sched_job *job = container_of(cb, struct drm_sched_job, >> + finish_cb); >> + >> + dma_fence_put(f); >> + >> INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); >> schedule_work(&job->work); >> } >

1 month, 2 weeks

1
0
0 0

Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Thu, Oct 30, 2025 at 02:38:36PM -0600, Alex Williamson wrote: > On Mon, 13 Oct 2025 18:26:11 +0300 > Leon Romanovsky <leon(a)kernel.org> wrote: > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > > index fe247d0e2831..56b1320238a9 100644 > > --- a/drivers/vfio/pci/vfio_pci_core.c > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > @@ -1511,6 +1520,19 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, > > return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); > > case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: > > return vfio_pci_core_feature_token(vdev, flags, arg, argsz); > > + case VFIO_DEVICE_FEATURE_DMA_BUF: > > + if (device->ops->ioctl != vfio_pci_core_ioctl) > > + /* > > + * Devices that overwrite general .ioctl() callback > > + * usually do it to implement their own > > + * VFIO_DEVICE_GET_REGION_INFO handlerm and they present > > Typo, "handlerm" Thanks, this part of code is going to be different in v6. > <...> > > @@ -2482,6 +2506,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > > > > ret = pci_reset_bus(pdev); > > > > + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) > > + if (__vfio_pci_memory_enabled(vdev)) > > + vfio_pci_dma_buf_move(vdev, false); > > + > > vdev = list_last_entry(&dev_set->device_list, > > struct vfio_pci_core_device, vdev.dev_set_list); > > > > This needs to be placed in the existing undo loop with the up_write(), > otherwise it can be missed in the error case. I'll move, but it caused me to wonder what did you want to achieve with this "vdev = list_last_entry ..." line? vdev is overwritten immediately after that line. > > > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c > > new file mode 100644 > > index 000000000000..eaba010777f3 > > --- /dev/null > > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c > > +static unsigned int calc_sg_nents(struct vfio_pci_dma_buf *priv, > > + struct dma_iova_state *state) > > +{ > > + struct phys_vec *phys_vec = priv->phys_vec; > > + unsigned int nents = 0; > > + u32 i; > > + > > + if (!state || !dma_use_iova(state)) > > + for (i = 0; i < priv->nr_ranges; i++) > > + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); > > + else > > + /* > > + * In IOVA case, there is only one SG entry which spans > > + * for whole IOVA address space, but we need to make sure > > + * that it fits sg->length, maybe we need more. > > + */ > > + nents = DIV_ROUND_UP(priv->size, UINT_MAX); > > I think we're arguably running afoul of the coding style standard here > that this is not a single simple statement and should use braces. > <...> > > +err_unmap_dma: > > + if (!i || !state) > > + ; /* Do nothing */ > > + else if (dma_use_iova(state)) > > + dma_iova_destroy(attachment->dev, state, mapped_len, dir, > > + attrs); > > + else > > + for_each_sgtable_dma_sg(sgt, sgl, i) > > + dma_unmap_phys(attachment->dev, sg_dma_address(sgl), > > + sg_dma_len(sgl), dir, attrs); > > Same, here for braces. > <...> > > + if (!state) > > + ; /* Do nothing */ > > + else if (dma_use_iova(state)) > > + dma_iova_destroy(attachment->dev, state, priv->size, dir, > > + attrs); > > + else > > + for_each_sgtable_dma_sg(sgt, sgl, i) > > + dma_unmap_phys(attachment->dev, sg_dma_address(sgl), > > + sg_dma_len(sgl), dir, attrs); > > + > > Here too. I will change it, but it is worth to admit that I'm consistent in my coding style. > > > + sg_free_table(sgt); > > + kfree(sgt); > > +} > ... > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index 75100bf009ba..63214467c875 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -1478,6 +1478,31 @@ struct vfio_device_feature_bus_master { > > }; > > #define VFIO_DEVICE_FEATURE_BUS_MASTER 10 > > > > +/** > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the > > + * regions selected. > > + * > > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, > > + * etc. offset/length specify a slice of the region to create the dmabuf from. > > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf. > > + * > > Probably worth noting that .flags should be zero, I see we enforce > that. Thanks, Added, thanks > > Alex >

1 month, 2 weeks

1
0
0 0

Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via interconnects

by Jason Gunthorpe

On Thu, Oct 30, 2025 at 06:17:11AM +0000, Kasireddy, Vivek wrote: > It mostly looks OK to me but there are a few things that I want to discuss, > after briefly looking at the patches in your branch: > - I am wondering what is the benefit of the SGT compatibility stuff especially > when Christian suggested that he'd like to see SGT usage gone from > dma-buf I think to get rid of SGT we do need to put it in a little well defined box and then create alternatives and remove things using SGT. This is a long journey, and I think this is the first step. If SGT is some special case it will be harder to excise. So the next steps would be to make all the exporters directly declare a SGT and then remove the SGT related ops from dma_ops itself and remove the compat sgt in the attach logic. This is not hard, it is all simple mechanical work. This way the only compat requirement is to automatically give an import match list for a SGT only importer which is very little code in the core. The point is we make the SGT stuff nonspecial and fully aligned with the mapping type in small steps. This way neither importer nor exporter should have any special code to deal with interworking. To remove SGT we'd want to teach the core code how to create some kind of conversion mapping type, eg exporter uses SGT importer uses NEW so the magic conversion mapping type does the adapatation. In this way we can convert importers and exporters to use NEW in any order and they still interwork with each other. > eventually. Also, if matching fails, IMO, indicating that to the > importer (allow_ic) and having both exporter/importer fallback to > the current legacy mechanism would be simpler than the SGT > compatibility stuff. I don't want to have three paths in importers. If the importer supports SGT it should declare it in a match and the core code should always return a SGT match for the importer to use The importer should not have to code 'oh it is sgt but it somehow a little different' via an allow_ic type idea. > - Also, I thought PCIe P2P (along with SGT) use-cases are already well handled > by the existing map_dma_buf() and other interfaces. So, it might be confusing > if the newer interfaces also provide a mechanism to handle P2P although a > bit differently. I might be missing something here but shouldn't the existing > allow_peer2peer and other related stuff be left alone? P2P is part of SGT, it gets pulled into the SGT stuff as steps toward isolating SGT properly. Again as we move things to use native SGT exporters we would remove the exporter related allow_peer2peer items when they become unused. > - You are also adding custom attach/detach ops for each mapping_type. I think > it makes sense to reuse existing attach/detach ops if possible and initiate the > matching process from there, at-least initially. I started there, but as soon as I went to adding PAL I realized the attach/detach logic was completely different for each of the mapping types. So this is looking alot simpler. If the driver wants to share the same attach/detach ops for some of its mapping types then it can just set the same function pointer to all of them and pick up the mapping type from the attach->map_type. > - Looks like your design doesn't call for a dma_buf_map_interconnect() or other > similar helpers provided by dma-buf core that the importers can use. Is that > because the return type would not be known to the core? I don't want to have a single shared 'map' operation, that is the whole point of this design. Each mapping type has its own ops, own types, own function signatures that the client calls directly. No more type confusion or trying to abuse phys_addr_t, dma_addr_t, or scatterlist for in appropriate things. If your driver wants something special, like IOV, then give it proper clear types so it is understandable. > - And, just to confirm, with your design if I want to add a new interconnect/ > mapping_type (not just IOV but in general), all that is needed is to provide custom > attach/detach, match ops and one or more ops to map/unmap the address list > right? Does this mean that the role of dma-buf core would be limited to just > match and the exporters are expected to do most of the heavy lifting and > checking for stuff like dynamic importers, resv lock held, etc? I expect the core code would continue to provide wrappers and helpers to call the ops that can do any required common stuff. However, keep in mind, when the importer moves to use mapping type it also must be upgraded to use the dynamic importer flow as this API doesn't support non-dynamic importers using mapping type. I will add some of these remarks to the commit messages.. Thanks! Jason

1 month, 2 weeks

1
0
0 0

Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Wed, Oct 29, 2025 at 05:25:03PM -0700, Samiullah Khawaja wrote: > On Mon, Oct 13, 2025 at 8:27 AM Leon Romanovsky <leon(a)kernel.org> wrote: > > > > From: Leon Romanovsky <leonro(a)nvidia.com> > > > > Add support for exporting PCI device MMIO regions through dma-buf, > > enabling safe sharing of non-struct page memory with controlled > > lifetime management. This allows RDMA and other subsystems to import > > dma-buf FDs and build them into memory regions for PCI P2P operations. > > > > The implementation provides a revocable attachment mechanism using > > dma-buf move operations. MMIO regions are normally pinned as BARs > > don't change physical addresses, but access is revoked when the VFIO > > device is closed or a PCI reset is issued. This ensures kernel > > self-defense against potentially hostile userspace. > > > > Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com> > > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com> > > --- > > drivers/vfio/pci/Kconfig | 3 + > > drivers/vfio/pci/Makefile | 2 + > > drivers/vfio/pci/vfio_pci_config.c | 22 +- > > drivers/vfio/pci/vfio_pci_core.c | 28 ++ > > drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++ > > drivers/vfio/pci/vfio_pci_priv.h | 23 ++ > > include/linux/vfio_pci_core.h | 1 + > > include/uapi/linux/vfio.h | 25 ++ > > 8 files changed, 546 insertions(+), 4 deletions(-) > > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c <...> > > +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) > > +{ > > + struct vfio_pci_dma_buf *priv; > > + struct vfio_pci_dma_buf *tmp; > > + > > + lockdep_assert_held_write(&vdev->memory_lock); > > + > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { > > + if (!get_file_active(&priv->dmabuf->file)) > > + continue; > > + > > + if (priv->revoked != revoked) { > > + dma_resv_lock(priv->dmabuf->resv, NULL); > > + priv->revoked = revoked; > > + dma_buf_move_notify(priv->dmabuf); > > I think this should only be called when revoked is true, otherwise > this will be calling move_notify on the already revoked dmabuf > attachments. This case is protected by "if (priv->revoked)" check both in vfio_pci_dma_buf_map and vfio_pci_dma_buf_attach. They will prevent DMABUF recreation if revoked is false. VTW, please trim your replies, it is time consuming to find your reply among 600 lines of unrelated text. Thanks > > + dma_resv_unlock(priv->dmabuf->resv); > > + } > > + dma_buf_put(priv->dmabuf); > > + } > > +}

1 month, 2 weeks

2
1
0 0

Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Wed, Oct 29, 2025, at 18:50, Alex Mastro wrote: > On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote: >> + /* >> + * dma_buf_fd() consumes the reference, when the file closes the dmabuf >> + * will be released. >> + */ >> + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); > > I think this still needs to unwind state on fd allocation error. Reference > ownership is only transferred on success. Yes, you are correct, i need to call to dma_buf_put() in case of error. I will fix. Thanks > >> + >> +err_dev_put: >> + vfio_device_put_registration(&vdev->vdev); >> +err_free_phys: >> + kfree(priv->phys_vec); >> +err_free_priv: >> + kfree(priv); >> +err_free_ranges: >> + kfree(dma_ranges); >> + return ret; >> +}

1 month, 2 weeks

1
0
0 0

Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb

by Christian König

On 10/29/25 10:11, Pierre-Eric Pelloux-Prayer wrote: > https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out > a possible deadlock: > > [ 1231.611031] Possible interrupt unsafe locking scenario: > > [ 1231.611033] CPU0 CPU1 > [ 1231.611034] ---- ---- > [ 1231.611035] lock(&xa->xa_lock#17); > [ 1231.611038] local_irq_disable(); > [ 1231.611039] lock(&fence->lock); > [ 1231.611041] lock(&xa->xa_lock#17); > [ 1231.611044] <Interrupt> > [ 1231.611045] lock(&fence->lock); > [ 1231.611047] > *** DEADLOCK *** > > My initial fix was to replace xa_erase by xa_erase_irq, but Christian > pointed out that calling dma_fence_add_callback from a callback can > also deadlock if the signalling fence and the one passed to > dma_fence_add_callback share the same lock. > > To fix both issues, the code iterating on dependencies and re-arming them > is moved out to drm_sched_entity_kill_jobs_work. > > Suggested-by: Christian König <christian.koenig(a)amd.com> > Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer(a)amd.com> Reviewed-by: Christian König <christian.koenig(a)amd.com> > --- > drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++----------- > 1 file changed, 19 insertions(+), 15 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c > index c8e949f4a568..fe174a4857be 100644 > --- a/drivers/gpu/drm/scheduler/sched_entity.c > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity) > } > EXPORT_SYMBOL(drm_sched_entity_error); > > +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, > + struct dma_fence_cb *cb); > + > static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk) > { > struct drm_sched_job *job = container_of(wrk, typeof(*job), work); > - > - drm_sched_fence_scheduled(job->s_fence, NULL); > - drm_sched_fence_finished(job->s_fence, -ESRCH); > - WARN_ON(job->s_fence->parent); > - job->sched->ops->free_job(job); > -} > - > -/* Signal the scheduler finished fence when the entity in question is killed. */ > -static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, > - struct dma_fence_cb *cb) > -{ > - struct drm_sched_job *job = container_of(cb, struct drm_sched_job, > - finish_cb); > + struct dma_fence *f; > unsigned long index; > > - dma_fence_put(f); > - > /* Wait for all dependencies to avoid data corruptions */ > xa_for_each(&job->dependencies, index, f) { > struct drm_sched_fence *s_fence = to_drm_sched_fence(f); > @@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, > dma_fence_put(f); > } > > + drm_sched_fence_scheduled(job->s_fence, NULL); > + drm_sched_fence_finished(job->s_fence, -ESRCH); > + WARN_ON(job->s_fence->parent); > + job->sched->ops->free_job(job); > +} > + > +/* Signal the scheduler finished fence when the entity in question is killed. */ > +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, > + struct dma_fence_cb *cb) > +{ > + struct drm_sched_job *job = container_of(cb, struct drm_sched_job, > + finish_cb); > + > + dma_fence_put(f); > + > INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); > schedule_work(&job->work); > }

1 month, 2 weeks

1
0
0 0

Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via interconnects

by Jason Gunthorpe

On Wed, Oct 29, 2025 at 11:25:34AM +0200, Leon Romanovsky wrote: > On Tue, Oct 28, 2025 at 09:27:26PM -0300, Jason Gunthorpe wrote: > > On Sun, Oct 26, 2025 at 09:44:12PM -0700, Vivek Kasireddy wrote: > > > In a typical dma-buf use case, a dmabuf exporter makes its buffer > > > buffer available to an importer by mapping it using DMA APIs > > > such as dma_map_sgtable() or dma_map_resource(). However, this > > > is not desirable in some cases where the exporter and importer > > > are directly connected via a physical or virtual link (or > > > interconnect) and the importer can access the buffer without > > > having it DMA mapped. > > > > I think my explanation was not so clear, I spent a few hours and typed > > in what I was thinking about here: > > > > https://github.com/jgunthorpe/linux/commits/dmabuf_map_type > > > > I didn't type in the last patch for iommufd side, hopefully it is > > clear enough. Adding iov should follow the pattern of the "physical > > address list" patch. > > > > I think the use of EXPORT_SYMBOL_FOR_MODULES() to lock down the > > physical addres list mapping type to iommufd is clever and I'm hoping > > addresses Chrsitian's concerns about abuse. > > > > Single GPU drivers can easilly declare their own mapping type for > > their own private interconnect without needing to change the core > > code. > > > > This seems to be fairly straightforward and reasonably type safe.. > > It makes me wonder what am I supposed to do with my series now [1]? > How do you see submission plan now? > > [1] https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/ IMHO that series needs the small tweaks and should go this merge window, ideally along with the iommufd half. I think this thread is a topic for the next cycle, I expect it will take some time to converge on the dmabuf core changes, and adapting your series is quite simple. Jason

1 month, 2 weeks

1
0
0 0

Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via interconnects

by Jason Gunthorpe

On Sun, Oct 26, 2025 at 09:44:12PM -0700, Vivek Kasireddy wrote: > In a typical dma-buf use case, a dmabuf exporter makes its buffer > buffer available to an importer by mapping it using DMA APIs > such as dma_map_sgtable() or dma_map_resource(). However, this > is not desirable in some cases where the exporter and importer > are directly connected via a physical or virtual link (or > interconnect) and the importer can access the buffer without > having it DMA mapped. I think my explanation was not so clear, I spent a few hours and typed in what I was thinking about here: https://github.com/jgunthorpe/linux/commits/dmabuf_map_type I didn't type in the last patch for iommufd side, hopefully it is clear enough. Adding iov should follow the pattern of the "physical address list" patch. I think the use of EXPORT_SYMBOL_FOR_MODULES() to lock down the physical addres list mapping type to iommufd is clever and I'm hoping addresses Chrsitian's concerns about abuse. Single GPU drivers can easilly declare their own mapping type for their own private interconnect without needing to change the core code. This seems to be fairly straightforward and reasonably type safe.. What do you think? Jason

1 month, 2 weeks

1
0
0 0

Re: [RFC 1/8] dma-buf: Add support for map/unmap APIs for interconnects

by Christian König

On 10/14/25 09:08, Vivek Kasireddy wrote: > For the map operation, the dma-buf core will create an xarray but > the exporter is expected to populate it with the interconnect > specific addresses. And, similarly for unmap, the exporter is > expected to cleanup the individual entries of the xarray. > > Cc: Jason Gunthorpe <jgg(a)nvidia.com> > Cc: Christian Koenig <christian.koenig(a)amd.com> > Cc: Sumit Semwal <sumit.semwal(a)linaro.org> > Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com> > Cc: Simona Vetter <simona.vetter(a)ffwll.ch> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com> > --- > drivers/dma-buf/dma-buf.c | 68 ++++++++++++++++++++++++++++ > include/linux/dma-buf-interconnect.h | 29 ++++++++++++ > include/linux/dma-buf.h | 11 +++++ > 3 files changed, 108 insertions(+) > create mode 100644 include/linux/dma-buf-interconnect.h > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 2bcf9ceca997..162642bd53e8 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -1612,6 +1612,74 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) > } > EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, "DMA_BUF"); > > +struct dma_buf_ranges * > +dma_buf_map_interconnect(struct dma_buf_attachment *attach) > +{ > + const struct dma_buf_interconnect_ops *ic_ops; > + struct dma_buf *dmabuf = attach->dmabuf; > + struct dma_buf_ranges *ranges; > + int ret; > + > + might_sleep(); > + > + if (WARN_ON(!attach || !attach->dmabuf)) > + return ERR_PTR(-EINVAL); > + > + if (!dma_buf_attachment_is_dynamic(attach)) > + return ERR_PTR(-EINVAL); > + > + if (!attach->allow_ic) > + return ERR_PTR(-EOPNOTSUPP); > + > + dma_resv_assert_held(attach->dmabuf->resv); > + > + ic_ops = dmabuf->ops->interconnect_ops; > + if (!ic_ops || !ic_ops->map_interconnect) > + return ERR_PTR(-EINVAL); > + > + ranges = kzalloc(sizeof(*ranges), GFP_KERNEL); > + if (!ranges) > + return ERR_PTR(-ENOMEM); > + > + xa_init(&ranges->ranges); > + ret = ic_ops->map_interconnect(attach, ranges); > + if (ret) > + goto err_free_ranges; > + > + return ranges; > + > +err_free_ranges: > + xa_destroy(&ranges->ranges); > + kfree(ranges); > + return ERR_PTR(ret); > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_map_interconnect, "DMA_BUF"); > + > +void dma_buf_unmap_interconnect(struct dma_buf_attachment *attach, > + struct dma_buf_ranges *ranges) > +{ > + const struct dma_buf_interconnect_ops *ic_ops; > + struct dma_buf *dmabuf = attach->dmabuf; > + > + if (WARN_ON(!attach || !attach->dmabuf || !ranges)) > + return; > + > + if (!attach->allow_ic) > + return; > + > + ic_ops = dmabuf->ops->interconnect_ops; > + if (!ic_ops || !ic_ops->unmap_interconnect) > + return; > + > + dma_resv_assert_held(attach->dmabuf->resv); > + > + ic_ops->unmap_interconnect(attach, ranges); > + > + xa_destroy(&ranges->ranges); > + kfree(ranges); > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_interconnect, "DMA_BUF"); > + > #ifdef CONFIG_DEBUG_FS > static int dma_buf_debug_show(struct seq_file *s, void *unused) > { > diff --git a/include/linux/dma-buf-interconnect.h b/include/linux/dma-buf-interconnect.h > new file mode 100644 > index 000000000000..17504dea9691 > --- /dev/null > +++ b/include/linux/dma-buf-interconnect.h > @@ -0,0 +1,29 @@ > +/* SPDX-License-Identifier: MIT */ > + > +#ifndef __DMA_BUF_INTERCONNECT_H__ > +#define __DMA_BUF_INTERCONNECT_H__ > + > +#include <linux/xarray.h> > + > +struct dma_buf_attachment; > + > +struct dma_buf_ranges { > + struct xarray ranges; > + unsigned int nranges; > +}; Hui? How is that supposed to work? Should the exporter fill in the xarray with values? That clearly needs more description. And IIRC xarray can only contain pointers because the lower bits are used for internal handling. Some kind of iterator like interface would be preferred where you have first and next callbacks. > + > +enum dma_buf_interconnect_type { > + DMA_BUF_INTERCONNECT_NONE = 0, Let's start with a DMA_BUF_DMA_ADDR type. > +}; > + > +struct dma_buf_interconnect { > + enum dma_buf_interconnect_type type; > +}; > + > +struct dma_buf_interconnect_ops { > + int (*map_interconnect)(struct dma_buf_attachment *attach, > + struct dma_buf_ranges *ranges); > + void (*unmap_interconnect)(struct dma_buf_attachment *attach, > + struct dma_buf_ranges *ranges); > +}; Please put those directly into the dma_buf_ops structure, I don't really see a value in separating them. Additional to that I'm not sure if the "interconnect" is a good naming, essentially we want to use the new mapping functions to replace the sg_table as well. Regards, Christian. > +#endif > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index d58e329ac0e7..db91c67c00d6 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -23,6 +23,8 @@ > #include <linux/dma-fence.h> > #include <linux/wait.h> > > +#include <linux/dma-buf-interconnect.h> > + > struct device; > struct dma_buf; > struct dma_buf_attachment; > @@ -276,6 +278,8 @@ struct dma_buf_ops { > > int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map); > void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map); > + > + const struct dma_buf_interconnect_ops *interconnect_ops; > }; > > /** > @@ -502,7 +506,9 @@ struct dma_buf_attachment { > struct device *dev; > struct list_head node; > bool peer2peer; > + bool allow_ic; > const struct dma_buf_attach_ops *importer_ops; > + struct dma_buf_interconnect interconnect; > void *importer_priv; > void *priv; > }; > @@ -589,6 +595,11 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, > enum dma_data_direction); > void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *, > enum dma_data_direction); > + > +struct dma_buf_ranges *dma_buf_map_interconnect(struct dma_buf_attachment *); > +void dma_buf_unmap_interconnect(struct dma_buf_attachment *, > + struct dma_buf_ranges *); > + > void dma_buf_move_notify(struct dma_buf *dma_buf); > int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, > enum dma_data_direction dir);

1 month, 2 weeks

2
3
0 0

Re: [RFC v2 1/8] dma-buf: Add support for map/unmap APIs for interconnects

by Jason Gunthorpe

On Tue, Oct 28, 2025 at 05:39:39AM +0000, Kasireddy, Vivek wrote: > Hi Jason, > > > Subject: Re: [RFC v2 1/8] dma-buf: Add support for map/unmap APIs for > > interconnects > > > > On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote: > > > For the map operation, the dma-buf core will create an xarray but > > > the exporter needs to populate it with the interconnect specific > > > addresses. And, similarly for unmap, the exporter is expected to > > > cleanup the individual entries of the xarray. > > > > I don't think we should limit this to xarrays, nor do I think it is a > > great datastructure for what is usually needed here.. > One of the goals (as suggested by Christian) is to have a container that > can be used with an iterator. I thought Christian was suggesting to avoid the container and have some kind of iterator? > So, instead of creating a new data structure, > I figured using an xarray would make sense here. And, since the entries > of an xarray can be of any type, I think another advantage is that the > dma-buf core only needs to be aware of the xarray but the exporter can > use an interconnect specific type to populate the entries that the importer > would be aware of. It is excessively memory wasteful. > > I just posted the patches showing what iommufd needs, and it wants > > something like > > > > struct mapping { > > struct p2p_provider *provider; > > size_t nelms; > > struct phys_vec *phys; > > }; > > > > Which is not something that make sense as an xarray. > If we do not want to use an xarray, I guess we can try to generalize the > struct that holds the addresses and any additional info (such as provider). > Would any of the following look OK to you: I think just don't try to have a general struct, it is not required once we have interconnects. Each interconnect can define what makes sense for it. > struct dma_buf_ranges { > struct range *ranges; > unsigned int nranges; > void *ranges_data; > }; Like this is just pointless, it destroys type safety for no benifit. > > struct dma_buf_iov_interconnect_ops { > > struct dma_buf_interconnect_ops ic_ops; > > struct xx *(*map)(struct dma_buf_attachment *attach, > Do we want each specific interconnect to have its own return type for map? I think yes, then you have type safety and so on. The types should all be different. We need to get away from using dma_addr_t or phys_addr_t for something that is not in those address spaces. Jason

1 month, 2 weeks

1
0
0 0

Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions

by Leon Romanovsky

On Mon, Oct 27, 2025 at 04:13:05PM -0700, David Matlack wrote: > On Mon, Oct 13, 2025 at 8:44 AM Leon Romanovsky <leon(a)kernel.org> wrote: > > > > From: Leon Romanovsky <leonro(a)nvidia.com> > > > > Add support for exporting PCI device MMIO regions through dma-buf, > > enabling safe sharing of non-struct page memory with controlled > > lifetime management. This allows RDMA and other subsystems to import > > dma-buf FDs and build them into memory regions for PCI P2P operations. > > > +/** > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the > > + * regions selected. > > + * > > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, > > + * etc. offset/length specify a slice of the region to create the dmabuf from. > > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf. > > + * > > + * Return: The fd number on success, -1 and errno is set on failure. > > + */ > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11 > > + > > +struct vfio_region_dma_range { > > + __u64 offset; > > + __u64 length; > > +}; > > + > > +struct vfio_device_feature_dma_buf { > > + __u32 region_index; > > + __u32 open_flags; > > + __u32 flags; > > + __u32 nr_ranges; > > + struct vfio_region_dma_range dma_ranges[]; > > +}; > > This uAPI would be a good candidate for a VFIO selftest. You can test > that it returns an error when it's supposed to, and a valid fd when > it's supposed to. And once the iommufd importer side is ready, we can > extend the test and verify that the fd can be mapped into iommufd. No problem, I'll add such test, but let's focus on making sure that this series is accepted first. Thanks

1 month, 2 weeks

1
0
0 0

[PATCH 0/8] Initial DMABUF support for iommufd

by Jason Gunthorpe

This series is the start of adding full DMABUF support to iommufd. Currently it is limited to only work with VFIO's DMABUF exporter. It sits on top of Leon's series to add a DMABUF exporter to VFIO: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org/ The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but otherwise works the same as it does today for a memfd. The user can select a slice of the FD to map into the ioas and if the underliyng alignment requirements are met it will be placed in the iommu_domain. Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR memory from VFIO to an iommu_domain controlled by iommufd. This is used for PCI Peer to Peer support in VMs, and is the last feature that the VFIO type 1 container has that iommufd couldn't do. The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime control and is a use-after-free security problem. Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO it can shoot down the mapping in iommufd which will unmap it from the iommu_domain. There is no automatic remap, this is a safety protocol so the kernel doesn't get stuck. Userspace is expected to know it is doing something that will revoke the dmabuf and map/unmap it around the activity. Eg when QEMU goes to issue FLR it should do the map/unmap to iommufd. Since DMABUF is missing some key general features for this use case it relies on a "private interconnect" between VFIO and iommufd via the vfio_pci_dma_buf_iommufd_map() call. The call confirms the DMABUF has revoke semantics and delivers a phys_addr for the memory suitable for use with iommu_map(). Medium term there is a desire to expand the supported DMABUFs to include GPU drivers to support DPDK/SPDK type use cases so future series will work to add a general concept of revoke and a general negotiation of interconnect to remove vfio_pci_dma_buf_iommufd_map(). I also plan another series to modify iommufd's vfio_compat to transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI of type1. The latest series for interconnect negotation to exchange a phys_addr is: https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com And the discussion for design of revoke is here: https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/ This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf The branch has various modifications to Leon's series I've suggested. Jason Gunthorpe (8): iommufd: Add DMABUF to iopt_pages iommufd: Do not map/unmap revoked DMABUFs iommufd: Allow a DMABUF to be revoked iommufd: Allow MMIO pages in a batch iommufd: Have pfn_reader process DMABUF iopt_pages iommufd: Have iopt_map_file_pages convert the fd to a file iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE iommufd/selftest: Add some tests for the dmabuf flow drivers/iommu/iommufd/io_pagetable.c | 74 +++- drivers/iommu/iommufd/io_pagetable.h | 53 ++- drivers/iommu/iommufd/ioas.c | 8 +- drivers/iommu/iommufd/iommufd_private.h | 13 +- drivers/iommu/iommufd/iommufd_test.h | 10 + drivers/iommu/iommufd/main.c | 10 + drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++-- drivers/iommu/iommufd/selftest.c | 142 ++++++ tools/testing/selftests/iommu/iommufd.c | 43 ++ tools/testing/selftests/iommu/iommufd_utils.h | 44 ++ 10 files changed, 741 insertions(+), 63 deletions(-) base-commit: fc882154e421f82677925d33577226e776bb07a4 -- 2.43.0

1 month, 2 weeks

2
9
0 0

Re: [RFC v2 2/8] dma-buf: Add a helper to match interconnects between exporter/importer

by Jason Gunthorpe

On Sun, Oct 26, 2025 at 09:44:14PM -0700, Vivek Kasireddy wrote: > +/** > + * dma_buf_match_interconnects - determine if there is a specific interconnect > + * that is supported by both exporter and importer. > + * @attach: [in] attachment to populate ic_match field > + * @exp: [in] array of interconnects supported by exporter > + * @exp_ics: [in] number of interconnects supported by exporter > + * @imp: [in] array of interconnects supported by importer > + * @imp_ics: [in] number of interconnects supported by importer > + * > + * This helper function iterates through the list interconnects supported by > + * both exporter and importer to find a match. A successful match means that > + * a common interconnect type is supported by both parties and the exporter's > + * match_interconnect() callback also confirms that the importer is compatible > + * with the exporter for that interconnect type. Document which of the exporter/importer is supposed to call this > + * > + * If a match is found, the attach->ic_match field is populated with a copy > + * of the exporter's match data. > + * Return: true if a match is found, false otherwise. > + */ > +bool dma_buf_match_interconnects(struct dma_buf_attachment *attach, > + const struct dma_buf_interconnect_match *exp, > + unsigned int exp_ics, > + const struct dma_buf_interconnect_match *imp, > + unsigned int imp_ics) > +{ > + const struct dma_buf_interconnect_ops *ic_ops; > + struct dma_buf_interconnect_match *ic_match; > + struct dma_buf *dmabuf = attach->dmabuf; > + unsigned int i, j; > + > + if (!exp || !imp) > + return false; > + > + if (!attach->allow_ic) > + return false; Seems redundant with this check for ic_ops == NULL: > + ic_ops = dmabuf->ops->interconnect_ops; > + if (!ic_ops || !ic_ops->match_interconnect) > + return false; This seems like too much of a maze to me.. I think you should structure it like this. First declare an interconnect: struct dma_buf_interconnect iov_interconnect { .name = "IOV interconnect", .match =.. } Then the exporters "subclass" struct dma_buf_interconnect_ops vfio_iov_interconnect { .interconnect = &iov_interconnect, .map = vfio_map, } I guess no container_of technique.. Then in VFIO's attach trigger the new code: const struct dma_buf_interconnect_match vfio_exp_ics[] = { {&vfio_iov_interconnect}, }; dma_buf_match_interconnects(attach, &vfio_exp_ics)) Which will callback to the importer: static const struct dma_buf_attach_ops xe_dma_buf_attach_ops = { .get_importer_interconnects } dma_buf_match_interconnects() would call aops->get_importer_interconnects and matchs first on .interconnect, then call the interconnect->match function with exp/inpt match structs if not NULL. > +struct dma_buf_interconnect_match { > + const struct dma_buf_interconnect *type; > + struct device *dev; > + unsigned int bar; > +}; This should be more general, dev and bar are unique to the iov importer. Maybe just simple: struct dma_buf_interconnect_match { struct dma_buf_interconnect *ic; // no need for type const struct dma_buf_interconnct_ops *exporter_ic_ops; u64 match_data[2]; // dev and bar are IOV specific, generalize }; Then some helper const struct dma_buf_interconnect_match supports_ics[] = { IOV_INTERCONNECT(&vfio_iov_interconnect, dev, bar), } And it would be nice if interconnect aware drivers could more easially interwork with non-interconnect importers. So I'd add a exporter type of 'p2p dma mapped scatterlist' that just matches the legacy importer. Jason

1 month, 2 weeks

1
0
0 0

Re: [RFC v2 1/8] dma-buf: Add support for map/unmap APIs for interconnects

by Jason Gunthorpe

On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote: > For the map operation, the dma-buf core will create an xarray but > the exporter needs to populate it with the interconnect specific > addresses. And, similarly for unmap, the exporter is expected to > cleanup the individual entries of the xarray. I don't think we should limit this to xarrays, nor do I think it is a great datastructure for what is usually needed here.. I just posted the patches showing what iommufd needs, and it wants something like struct mapping { struct p2p_provider *provider; size_t nelms; struct phys_vec *phys; }; Which is not something that make sense as an xarray. I think the interconnect should have its own functions for map/unmap, ie instead of trying to have them as a commmon dma_buf_interconnect_ops do something like struct dma_buf_interconnect_ops { const char *name; bool (*supports_interconnects)(struct dma_buf_attachment *attach, const struct dma_buf_interconnect_match *, unsigned int num_ics); }; struct dma_buf_iov_interconnect_ops { struct dma_buf_interconnect_ops ic_ops; struct xx *(*map)(struct dma_buf_attachment *attach, unsigned int *bar_number, size_t *nelms); // No unmap for iov }; static inline struct xx *dma_buf_iov_map(struct dma_buf_attachment *attach, unsigned int *bar_number, size_t *nelms) { return container_of(attach->ic_ops, struct dma_buf_iov_interconnect_ops, ic_ops)->map( attach, bar_number, nelms)); } > +/** > + * dma_buf_attachment_is_dynamic - check if the importer can handle move_notify. > + * @attach: the attachment to check > + * > + * Returns true if a DMA-buf importer has indicated that it can handle dmabuf > + * location changes through the move_notify callback. > + */ > +static inline bool > +dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach) > +{ > + return !!attach->importer_ops; > +} Why is this in this patch? I also think this patch should be second in the series, it makes more sense to figure out how to attach with an interconnect then show how to map/unmap with that interconnect Like I'm not sure why this introduces allow_ic? Jason

1 month, 2 weeks

1
0
0 0

Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions

by Jason Gunthorpe

On Sun, Oct 26, 2025 at 03:55:04PM +0800, Shuai Xue wrote: > > > 在 2025/10/22 20:50, Jason Gunthorpe 写道: > > On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote: > > > From: Leon Romanovsky <leonro(a)nvidia.com> > > > > > > Add support for exporting PCI device MMIO regions through dma-buf, > > > enabling safe sharing of non-struct page memory with controlled > > > lifetime management. This allows RDMA and other subsystems to import > > > dma-buf FDs and build them into memory regions for PCI P2P operations. > > > > > > The implementation provides a revocable attachment mechanism using > > > dma-buf move operations. MMIO regions are normally pinned as BARs > > > don't change physical addresses, but access is revoked when the VFIO > > > device is closed or a PCI reset is issued. This ensures kernel > > > self-defense against potentially hostile userspace. > > > > Let's enhance this: > > > > Currently VFIO can take MMIO regions from the device's BAR and map > > them into a PFNMAP VMA with special PTEs. This mapping type ensures > > the memory cannot be used with things like pin_user_pages(), hmm, and > > so on. In practice only the user process CPU and KVM can safely make > > use of these VMA. When VFIO shuts down these VMAs are cleaned by > > unmap_mapping_range() to prevent any UAF of the MMIO beyond driver > > unbind. > > > > However, VFIO type 1 has an insecure behavior where it uses > > follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back > > into the IOMMU. This has a long history of enabling P2P DMA inside > > VMs, but has serious lifetime problems by allowing a UAF of the MMIO > > after the VFIO driver has been unbound. > > Hi, Jason, > > Can you elaborate on this more? > > From my understanding of the VFIO type 1 implementation: > > - When a device is opened through VFIO type 1, it increments the > device->refcount > - During unbind, the driver waits for this refcount to drop to zero via > wait_for_completion(&device->comp) > - This should prevent the unbind() from completing while the device is > still in use > > Given this refcount mechanism, I do not figure out how the UAF can > occur. A second vfio device can be opened and then use follow_pfnmap_*() to read the first vfio device's PTEs. There is no relationship betweent the first and second VFIO devices, so once the first is unbound it sails through the device->comp while the second device retains the PFN in its type1 iommu_domain. Jason

1 month, 2 weeks

1
0
0 0

Re: [PATCH] dma-fence: Fix safe access wrapper to call timeline name method

by Christian König

On 10/21/25 20:36, Tvrtko Ursulin wrote: > > On 21/10/2025 17:09, Akash Goel wrote: >> This commit fixes the wrapper function dma_fence_timeline_name(), that >> was added for safe access, to actually call the timeline name method of >> dma_fence_ops. >> >> Cc: <stable(a)vger.kernel.org> # v6.17+ >> Signed-off-by: Akash Goel <akash.goel(a)arm.com> > > Fixes: 506aa8b02a8d ("dma-fence: Add safe access helpers and document the rules") > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> Good catch, Reviewed-by: Christian König <christian.koenig(a)amd.com> as well. Please ping me if you need somebody to push this to drm-misc-fixes. Thanks, Christian. > > Apologies for the copy and paste snafu. > > Regards, > > Tvrtko > >> --- >> drivers/dma-buf/dma-fence.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c >> index 3f78c56b58dc..39e6f93dc310 100644 >> --- a/drivers/dma-buf/dma-fence.c >> +++ b/drivers/dma-buf/dma-fence.c >> @@ -1141,7 +1141,7 @@ const char __rcu *dma_fence_timeline_name(struct dma_fence *fence) >> "RCU protection is required for safe access to returned string"); >> if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) >> - return fence->ops->get_driver_name(fence); >> + return fence->ops->get_timeline_name(fence); >> else >> return "signaled-timeline"; >> } >

1 month, 2 weeks

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Linaro-mm-sig