[Linaro-mm-sig] Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk

22 Jun 2026

On Mon, Jun 22, 2026 at 4:13 AM David Laight
<david.laight.linux@gmail.com> wrote:
...
Hi David,
Thank you for your review. You raised many good points regarding
optimizations here. I'll switch to using 2G as the max entry size
(`SZ_2G` from `linux/sizes.h`), and remove divisions and
multiplications. I'll also replace the `for()` loop with `while
(length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
to `size_t`. I'll send out a v2 with these changes shortly.
Thanks,
David
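For illustration only, the loop shape described above could look roughly like the following userspace sketch. This is not the v2 patch: `struct demo_ent`, `demo_fill()` and `DEMO_SZ_2G` are hypothetical stand-ins for the scatterlist entry, `fill_sg_entry()` and `SZ_2G`, and the `max_ents` bound is an addition for safety in the demo.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as the kernel's SZ_2G, redefined for userspace. */
#define DEMO_SZ_2G 0x80000000ULL

/* Hypothetical stand-in for the DMA fields of a scatterlist entry. */
struct demo_ent {
	uint64_t dma_addr;
	uint32_t dma_len;
};

/*
 * Split [addr, addr + length) into entries of at most 2G each.
 * No division or multiplication is needed: the loop runs until the
 * remaining length is zero and the address simply advances by len.
 * Returns the number of entries written (at most max_ents).
 */
static size_t demo_fill(struct demo_ent *ents, size_t max_ents,
			uint64_t addr, uint64_t length)
{
	size_t n = 0;

	while (length && n < max_ents) {
		uint64_t len = length < DEMO_SZ_2G ? length : DEMO_SZ_2G;

		ents[n].dma_addr = addr;
		ents[n].dma_len = (uint32_t)len;	/* len <= 2G fits in 32 bits */
		addr += len;
		length -= len;
		n++;
	}
	return n;
}
```

With a page-aligned base address, every entry except possibly the last has a length of exactly 2G, so each entry's start address stays page aligned.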
...
...
Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
first entry, resulting in non-page-aligned DMA addresses for all
subsequent entries.
How did you find this?
It requires a single buffer over 4GB - seems highly unlikely.
It was observed during experiments with buffers over 8GB on an accelerator.
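To make the misalignment concrete, here is a small userspace sketch of the address arithmetic. It assumes 4 KiB pages and a 32-bit `unsigned int`; `DEMO_PAGE_SIZE`, `entry_addr()` and `is_page_aligned()` are hypothetical names used only for this illustration.

```c
#include <limits.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* DMA address of entry i when every earlier entry covers chunk bytes. */
static uint64_t entry_addr(uint64_t base, unsigned int i, uint64_t chunk)
{
	return base + (uint64_t)i * chunk;
}

/* Returns 1 if addr sits on a page boundary, 0 otherwise. */
static int is_page_aligned(uint64_t addr)
{
	return (addr & (DEMO_PAGE_SIZE - 1)) == 0;
}
```

For a page-aligned base of `0x100000000` and a chunk of `UINT_MAX`, entry 0 is aligned, but entry 1 starts at `0x1FFFFFFFF`, one byte short of a page boundary, and every later entry drifts by one more byte.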
...
...
While the underlying IOMMU mapping may be contiguous, hardware
DMA engines often require explicit address alignment (e.g., page,
cacheline, or storage sector boundaries). Passing unaligned
addresses and lengths can cause explicit failures in DMA descriptor
creation or silent data corruption if lower unaligned bits are
truncated.
Fix this by splitting the scatterlist by the largest possible page
aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
This ensures all scatterlist DMA addresses and lengths remain page
aligned and satisfy hardware constraints.
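As a rough check of the value this produces, the following userspace sketch mirrors the macro. It assumes 4 KiB pages and a 32-bit `unsigned int`; `DEMO_ALIGN_DOWN()` is a simplified stand-in for the kernel's `ALIGN_DOWN()` that only handles power-of-two alignments, and the other `DEMO_`/`nth_entry_addr()` names are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Simplified stand-in for ALIGN_DOWN(); valid for power-of-two a only. */
#define DEMO_ALIGN_DOWN(x, a) ((x) & ~((a) - 1u))

/* Largest page-aligned length that still fits in an unsigned int. */
#define DEMO_MAX_ENT_SZ DEMO_ALIGN_DOWN(UINT_MAX, DEMO_PAGE_SIZE)

/* Start address of entry n when all earlier entries are full-sized. */
static uint64_t nth_entry_addr(uint64_t base, unsigned int n)
{
	return base + (uint64_t)n * DEMO_MAX_ENT_SZ;
}
```

Under these assumptions the chunk size is `0xFFFFF000`, so a page-aligned base yields page-aligned start addresses for every entry.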
It would almost certainly be better to split into 2G chunks.
That removes any need for any divisions.
I agree. 2G naturally aligns with most hardware boundaries, while also
allowing compiler optimizations with simple bit shifts.
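A minimal sketch of why a power-of-two chunk size avoids division when counting entries. `DEMO_SZ_2G`, `DEMO_2G_SHIFT`, `nents_div()` and `nents_shift()` are hypothetical names for this illustration, not kernel helpers.

```c
#include <stdint.h>

#define DEMO_SZ_2G    0x80000000ULL	/* assumption: same value as SZ_2G */
#define DEMO_2G_SHIFT 31		/* 2G == 1 << 31 */

/* Entry count via a rounding-up division, as DIV_ROUND_UP() would do. */
static uint64_t nents_div(uint64_t length)
{
	return (length + DEMO_SZ_2G - 1) / DEMO_SZ_2G;
}

/* Same count using only a shift and a mask, possible because 2G is 2^31. */
static uint64_t nents_shift(uint64_t length)
{
	return (length >> DEMO_2G_SHIFT) + ((length & (DEMO_SZ_2G - 1)) != 0);
}
```

Both functions return the same count for any length small enough that the addition in `nents_div()` does not overflow; with a chunk size such as `0xFFFFF000` no equivalent shift-and-mask form exists.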
...
...
Page-aligned entries allow the system to cleanly chunk payloads into
PCIe MaxPayloadSize (MPS) units (e.g., 128, 256, or 512 bytes).
As a result, this may help reduce TLP fragmentation in P2P transfers
and alleviate potential congestion within a logical PCIe switch
partition, especially when Relaxed Ordering is not possible due to
hardware constraints.
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>

drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a..f2bde38fdb1f 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -5,6 +5,9 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/dma-resv.h>
+#include <linux/align.h>
+
+#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
...
static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
                                       dma_addr_t addr)
@@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
      unsigned int len, nents;
      int i;

-     nents = DIV_ROUND_UP(length, UINT_MAX);
+     nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
      for (i = 0; i < nents; i++) {

Why not change that to 'while (length) {' to avoid the division above.
Sounds good, will do.
...
...

-             len = min_t(size_t, length, UINT_MAX);
+             len = min_t(size_t, length, MAX_ENT_SZ);

I bet that doesn't need to be min_t()
Agreed.
...
...
          length -= len;
          /*
           * DMABUF abuses scatterlist to create a scatterlist

@@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
               * does not require the CPU list for mapping or unmapping.
               */
              sg_set_page(sgl, NULL, 0, 0);
-             sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+             sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
              sg_dma_len(sgl) = len;

Replace the multiply with 'addr += len'.
Will update this as well.
...
-- David
...
          sgl = sg_next(sgl);
  }

@@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
  if (!state || !dma_use_iova(state)) {
          for (i = 0; i < nr_ranges; i++)
-                 nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+                 nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
  } else {
          /*
           * In IOVA case, there is only one SG entry which spans
           * for whole IOVA address space, but we need to make sure
           * that it fits sg->length, maybe we need more.
           */
-         nents = DIV_ROUND_UP(size, UINT_MAX);
+         nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
  }

  return nents;