Hi everybody,
we have documented at https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#dma-fence-cr… that dma_fence objects must signal within a reasonable amount of time, but at the same time we note that drivers might have different ideas of what "reasonable" means.
Recently I realized that this is actually not a good idea. The background is that an undefined wall-clock timeout means that, for example, the OOM killer might have to wait for this timeout before it can terminate a process and reclaim its memory. And that is just one example of how core kernel features can end up depending on fences signaling promptly.
Some drivers and fence implementations used 10 seconds, and that drew complaints from end users. So at least amdgpu recently switched to 2 seconds, which triggered an internal discussion about it.
This patch set adds a define to the dma_fence header that sets 2 seconds as a reasonable amount of time. SW-sync is modified to always taint the kernel (since it doesn't have a timeout), VGEM is switched over to the new define, and the scheduler warns and taints the kernel if a driver uses a timeout longer than that.
I don't have much intention of actually committing the patches (except maybe the SW-sync one); the question is whether 2 seconds is a reasonable value.
Regards,
Christian.
From: Barry Song <v-songbaohua(a)oppo.com>
In many cases, the pages passed to vmap() may include
high-order pages: for example, the system heap often allocates
pages in descending order (order 8, then 4, then 0). Currently,
vmap() iterates over every page individually, so even the pages
inside a high-order block are handled one by one. This patch
detects high-order pages and maps them as a single contiguous
block whenever possible.
Another possibility is to implement a new API, vmap_sg().
However, that change seems to be quite large in scope.
When vmapping a 128 MB dma-buf using the system heap,
this RFC appears to make system_heap_do_vmap() roughly 16x faster:
W/ patch:
[ 51.363682] system_heap_do_vmap took 2474000 ns
[ 53.307044] system_heap_do_vmap took 2469008 ns
[ 55.061985] system_heap_do_vmap took 2519008 ns
[ 56.653810] system_heap_do_vmap took 2674000 ns
W/o patch:
[ 8.260880] system_heap_do_vmap took 39490000 ns
[ 32.513292] system_heap_do_vmap took 38784000 ns
[ 82.673374] system_heap_do_vmap took 40711008 ns
[ 84.579062] system_heap_do_vmap took 40236000 ns
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: John Stultz <jstultz(a)google.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
---
mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 43 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0832f944544c..af2e3e8c052a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,34 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
return err;
}
+static inline int get_vmap_batch_order(struct page **pages,
+ unsigned int stride,
+ int max_steps,
+ unsigned int idx)
+{
+ /*
+ * Currently, batching is only supported in vmap_pages_range
+ * when page_shift == PAGE_SHIFT.
+ */
+ if (stride != 1)
+ return 0;
+
+ struct page *base = pages[idx];
+ if (!PageHead(base))
+ return 0;
+
+ int order = compound_order(base);
+ int nr_pages = 1 << order;
+
+ if (max_steps < nr_pages)
+ return 0;
+
+ for (int i = 0; i < nr_pages; i++)
+ if (pages[idx + i] != base + i)
+ return 0;
+ return order;
+}
+
/*
* vmap_pages_range_noflush is similar to vmap_pages_range, but does not
* flush caches.
@@ -655,23 +683,32 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+ unsigned int stride;
WARN_ON(page_shift < PAGE_SHIFT);
+ /*
+ * Some users may allocate pages from high-order down to order 0.
+ * We roughly check if the first page is a compound page. If so,
+ * there is a chance to batch multiple pages together.
+ */
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
+ (page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
return vmap_small_pages_range_noflush(addr, end, prot, pages);
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
+ stride = 1U << (page_shift - PAGE_SHIFT);
+ for (i = 0; i < nr; ) {
+ int err, order;
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+ order = get_vmap_batch_order(pages, stride, nr - i, i);
+ err = vmap_range_noflush(addr, addr + (1UL << (page_shift + order)),
page_to_phys(pages[i]), prot,
- page_shift);
+ page_shift + order);
if (err)
return err;
- addr += 1UL << page_shift;
+ addr += 1UL << (page_shift + order);
+ i += 1U << (order + page_shift - PAGE_SHIFT);
}
return 0;
--
2.39.3 (Apple Git-146)
clear_page() translates into memset(page, 0, PAGE_SIZE) on some
architectures, but on the major architectures it calls an
optimized assembly snippet, so use it instead of open-coding
a memset().
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
drivers/dma-buf/heaps/cma_heap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 0df007111975..9eaff80050f2 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -315,7 +315,7 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap,
while (nr_clear_pages > 0) {
void *vaddr = kmap_local_page(page);
- memset(vaddr, 0, PAGE_SIZE);
+ clear_page(vaddr);
kunmap_local(vaddr);
/*
* Avoid wasting time zeroing memory if the process
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20251129-dma-buf-heap-clear-page-248bb236e4c4
Best regards,
--
Linus Walleij <linus.walleij(a)linaro.org>
Barely anyone uses dma_fence_signal()'s (and similar functions') return
code. Checking it is pretty much useless anyway, because what are you
going to do if a fence was already signaled? Unsignal it and signal it
again? ;p
Removing the return code simplifies the API and makes it easier for me
to build the Rust DmaFence abstraction on top.
Philipp Stanner (6):
dma-buf/dma-fence: Add dma_fence_test_signaled_flag()
amd/amdkfd: Ignore return code of dma_fence_signal()
drm/gpu/xe: Ignore dma_fence_signal() return code
dma-buf: Don't misuse dma_fence_signal()
drm/ttm: Remove return check of dma_fence_signal()
dma-buf/dma-fence: Remove return code of signaling-functions
drivers/dma-buf/dma-fence.c | 59 ++++++-------------
drivers/dma-buf/st-dma-fence.c | 7 +--
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +-
.../gpu/drm/ttm/tests/ttm_bo_validate_test.c | 3 +-
drivers/gpu/drm/xe/xe_hw_fence.c | 5 +-
include/linux/dma-fence.h | 33 ++++++++---
6 files changed, 53 insertions(+), 59 deletions(-)
--
2.49.0
fill_sg_entry() splits large DMA buffers into multiple scatter-gather
entries, each holding up to UINT_MAX bytes. When calculating the DMA
address for entries beyond the second one, the expression (i * UINT_MAX)
causes integer overflow due to 32-bit arithmetic.
This manifests when the input length is >= 8 GiB, which results in the
loop reaching i >= 2.
Fix by casting i to dma_addr_t before multiplication.
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Signed-off-by: Alex Mastro <amastro(a)fb.com>
---
More color about how I discovered this in [1] for the commit at [2]:
[1] https://lore.kernel.org/all/aSZHO6otK0Heh+Qj@devgpu015.cco6.facebook.com
[2] https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.c…
---
drivers/dma-buf/dma-buf-mapping.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index b4819811a64a..b7352e609fbd 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -24,7 +24,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
* does not require the CPU list for mapping or unmapping.
*/
sg_set_page(sgl, NULL, 0, 0);
- sg_dma_address(sgl) = addr + i * UINT_MAX;
+ sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
sg_dma_len(sgl) = len;
sgl = sg_next(sgl);
}
---
base-commit: 5415d887db0e059920cb5673a32cc4d66daa280f
change-id: 20251125-dma-buf-overflow-e3253f108e36
Best regards,
--
Alex Mastro <amastro(a)fb.com>
Hi everyone,
dma_fences have always lived under the tyranny of their issuer's module
lifetime, leading to crashes when anybody still holds a reference to a
dma_fence after the issuer's module has been unloaded.
The basic problem is that when buffers are shared between drivers,
dma_fence objects can leak into external drivers and stay there even
after they have signaled. The dma_resv object, for example, only lazily
releases dma_fences.
So when the module that originally created the dma_fence is unloaded,
the dma_fence_ops function table becomes unavailable as well, and any
attempt to release the fence crashes the system.
Various approaches have been discussed previously, including changing the
locking semantics of the dma_fence callbacks (by me) and using the
DRM scheduler as an intermediate layer (by Sima) to disconnect dma_fences
from their actual users, but none of them solve all the problems.
Tvrtko did some really nice prerequisite work by protecting the returned
strings of the dma_fence_ops with RCU. This way dma_fence creators only
need to wait for an RCU grace period after fence signaling before it is
safe to free those data structures.
Now this patch set goes a step further and protects the whole
dma_fence_ops structure with RCU: after the fence signals, the
pointer to the dma_fence_ops is set to NULL when neither a wait nor a
release callback is given. All functionality that uses the dma_fence_ops
reference is put inside an RCU critical section, except for the
deprecated issuer-specific wait and, of course, the optional release
callback.
In addition to the RCU changes: the lock protecting the dma_fence state
previously had to be allocated externally. This set makes that external
lock optional, allowing dma_fences to use an inline lock and be
self-contained.
This patch set addresses all previous code review comments and is based
on drm-tip; it includes my changes for amdgpu as well as Mathew's patches for XE.
Going to push the core DMA-buf changes to drm-misc-next as soon as I get
the appropriate rb. The driver specific changes can go upstream through
the driver channels as necessary.
Please review and comment,
Christian.