On Thu, Nov 21, 2024 at 06:31:34PM +0530, Jyothi Kumar Seerapu wrote:
> I2C functionality has dependencies on the GPI driver.
> Ensure that the GPI driver is enabled when using the I2C
> driver functionality.
> Therefore, update the I2C GENI driver to depend on the GPI driver.
>
> Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> ---
> v2 -> v3:
> - Moved this change to patch3.
> - Updated commit description.
>
> v1 -> v2:
> - This patch is added in v2 to address the kernel test robot
> reported compilation error.
> ERROR: modpost: "gpi_multi_desc_process" [drivers/i2c/busses/i2c-qcom-geni.ko] undefined!
>
> drivers/i2c/busses/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index 0aa948014008..87634a682855 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -1049,6 +1049,7 @@ config I2C_QCOM_GENI
> tristate "Qualcomm Technologies Inc.'s GENI based I2C controller"
> depends on ARCH_QCOM || COMPILE_TEST
> depends on QCOM_GENI_SE
> + depends on QCOM_GPI_DMA
So... without this change the previous patch is broken, which is a
no-go. And anyway, adding a dependency on a particular DMA driver is a
bad idea. Please make use of the DMA API instead.
> help
> This driver supports GENI serial engine based I2C controller in
> master mode on the Qualcomm Technologies Inc.'s SoCs. If you say
> --
> 2.17.1
>
--
With best wishes
Dmitry
On Mon, Nov 18, 2024 at 1:15 AM Tvrtko Ursulin <tursulin(a)ursulin.net> wrote:
>
>
> On 17/11/2024 17:03, T.J. Mercier wrote:
> > The arguments for __dma_buf_debugfs_list_del do not match for both the
> > CONFIG_DEBUG_FS case and the !CONFIG_DEBUG_FS case. The !CONFIG_DEBUG_FS
> > case should take a struct dma_buf *, but it's currently struct file *.
> > This can lead to the build error:
> >
> > error: passing argument 1 of ‘__dma_buf_debugfs_list_del’ from
> > incompatible pointer type [-Werror=incompatible-pointer-types]
> >
> > dma-buf.c:63:53: note: expected ‘struct file *’ but argument is of
> > type ‘struct dma_buf *’
> > 63 | static void __dma_buf_debugfs_list_del(struct file *file)
> >
> > Fixes: bfc7bc539392 ("dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS")
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > ---
> > drivers/dma-buf/dma-buf.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8892bc701a66..afb8c1c50107 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -60,7 +60,7 @@ static void __dma_buf_debugfs_list_add(struct dma_buf *dmabuf)
> > {
> > }
> >
> > -static void __dma_buf_debugfs_list_del(struct file *file)
> > +static void __dma_buf_debugfs_list_del(struct dma_buf *dmabuf)
> > {
> > }
> > #endif
>
> Huh I wonder how this sneaked by until now.. thanks for fixing!
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>
> Regards,
>
> Tvrtko
Thanks Tvrtko. Upstream there is currently only the one use, where it's
called with a void pointer, which doesn't generate the error, but
KernelCI caught the problem on an Android branch where it's also
called with a dma_buf pointer:
https://dashboard.kernelci.org/tree/5a4c93e2f794001a5efa13c0dec931235240d38…
Am 14.11.24 um 12:14 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>
> One alternative to the fix Christian proposed in
> https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@am…
> is to replace the rather complex open coded sorting loops with the kernel
> standard sort followed by a context squashing pass.
>
> Proposed advantage of this would be readability but one concern Christian
> raised was that there could be many fences, that they are typically mostly
> sorted, and so the kernel's heap sort would be much worse by the proposed
> algorithm.
>
> I had a look running some games and vkcube to see what are the typical
> number of input fences. Tested scenarios:
>
> 1) Hogwarts Legacy under Gamescope
>
> 450 calls per second to __dma_fence_unwrap_merge.
>
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
>
> N Before After
> 0 0.91%
> 1 69.40%
> 2-3 28.72% 9.4% (90.6% resolved to one fence)
> 4-5 0.93%
> 6-9 0.03%
> 10+
>
> 2) Cyberpunk 2077 under Gamescope
>
> 1050 calls per second, amounting to 0.01% CPU time according to perf top.
>
> N Before After
> 0 1.13%
> 1 52.30%
> 2-3 40.34% 55.57%
> 4-5 1.46% 0.50%
> 6-9 2.44%
> 10+ 2.34%
>
> 3) vkcube under Plasma
>
> 90 calls per second.
>
> N Before After
> 0
> 1
> 2-3 100% 0% (Ie. all resolved to a single fence)
> 4-5
> 6-9
> 10+
>
> In the case of vkcube all invocations in the 2-3 bucket were actually
> just two input fences.
>
> From these numbers it looks like the heap sort should not be a
> disadvantage, given how the dominant case is <= 2 input fences which heap
> sort solves with just one compare and swap. (And for the case of one input
> fence we have a fast path in the previous patch.)
>
> A complementary possibility is to implement a different sorting algorithm
> under the same API as the kernel's sort() and so keep the simplicity,
> potentially moving the new sort under lib/ if it would be found more
> widely useful.
>
> v2:
> * Hold on to fence references and reduce commentary. (Christian)
> * Record and use latest signaled timestamp in the 2nd loop too.
> * Consolidate zero or one fences fast paths.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> Fixes: 245a4a7b531c ("dma-buf: generalize dma_fence unwrap & merging v3")
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617
> Cc: Christian König <christian.koenig(a)amd.com>
> Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
> Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
> Cc: Gustavo Padovan <gustavo(a)padovan.org>
> Cc: Friedrich Vock <friedrich.vock(a)gmx.de>
> Cc: linux-media(a)vger.kernel.org
> Cc: dri-devel(a)lists.freedesktop.org
> Cc: linaro-mm-sig(a)lists.linaro.org
> Cc: <stable(a)vger.kernel.org> # v6.0+
> ---
> drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++---------------
> 1 file changed, 64 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..26cad03340ce 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -12,6 +12,7 @@
> #include <linux/dma-fence-chain.h>
> #include <linux/dma-fence-unwrap.h>
> #include <linux/slab.h>
> +#include <linux/sort.h>
>
> /* Internal helper to start new array iteration, don't use directly */
> static struct dma_fence *
> @@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct dma_fence_unwrap *cursor)
> }
> EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>
> +
> +static int fence_cmp(const void *_a, const void *_b)
> +{
> + struct dma_fence *a = *(struct dma_fence **)_a;
> + struct dma_fence *b = *(struct dma_fence **)_b;
> +
> + if (a->context < b->context)
> + return -1;
> + else if (a->context > b->context)
> + return 1;
> +
> + if (dma_fence_is_later(b, a))
> + return -1;
> + else if (dma_fence_is_later(a, b))
> + return 1;
> +
> + return 0;
> +}
> +
> /* Implementation for the dma_fence_merge() marco, don't use directly */
> struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
> struct dma_fence **fences,
> @@ -67,8 +87,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
> struct dma_fence_array *result;
> struct dma_fence *tmp, **array;
> ktime_t timestamp;
> - unsigned int i;
> - size_t count;
> + int i, j, count;
>
> count = 0;
> timestamp = ns_to_ktime(0);
> @@ -96,78 +115,58 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
> if (!array)
> return NULL;
>
> - /*
> - * This trashes the input fence array and uses it as position for the
> - * following merge loop. This works because the dma_fence_merge()
> - * wrapper macro is creating this temporary array on the stack together
> - * with the iterators.
> - */
> - for (i = 0; i < num_fences; ++i)
> - fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
> -
> count = 0;
> - do {
> - unsigned int sel;
> -
> -restart:
> - tmp = NULL;
> - for (i = 0; i < num_fences; ++i) {
> - struct dma_fence *next;
> -
> - while (fences[i] && dma_fence_is_signaled(fences[i]))
> - fences[i] = dma_fence_unwrap_next(&iter[i]);
> -
> - next = fences[i];
> - if (!next)
> - continue;
> -
> - /*
> - * We can't guarantee that inpute fences are ordered by
> - * context, but it is still quite likely when this
> - * function is used multiple times. So attempt to order
> - * the fences by context as we pass over them and merge
> - * fences with the same context.
> - */
> - if (!tmp || tmp->context > next->context) {
> - tmp = next;
> - sel = i;
> -
> - } else if (tmp->context < next->context) {
> - continue;
> -
> - } else if (dma_fence_is_later(tmp, next)) {
> - fences[i] = dma_fence_unwrap_next(&iter[i]);
> - goto restart;
> + for (i = 0; i < num_fences; ++i) {
> + dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
> + if (!dma_fence_is_signaled(tmp)) {
> + array[count++] = dma_fence_get(tmp);
> } else {
> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> - goto restart;
> + ktime_t t = dma_fence_timestamp(tmp);
> +
> + if (ktime_after(t, timestamp))
> + timestamp = t;
> }
> }
> + }
>
> - if (tmp) {
> - array[count++] = dma_fence_get(tmp);
> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> + if (count == 0 || count == 1)
> + goto return_fastpath;
> +
> + sort(array, count, sizeof(*array), fence_cmp, NULL);
> +
> + /*
> + * Only keep the most recent fence for each context.
> + */
> + j = 0;
> + tmp = array[0];
> + for (i = 1; i < count; i++) {
> + if (array[i]->context != tmp->context)
> + array[j++] = tmp;
> + else
> + dma_fence_put(tmp);
If I'm not completely mistaken, that can result in dropping the first
element but not assigning it again.
E.g. array[0] is potentially invalid after the loop.
> + tmp = array[i];
> + }
> + if (j == 0 || tmp->context != array[j - 1]->context) {
> + array[j++] = tmp;
> + }
Maybe adjust the sort criteria so that the highest seqno comes first.
This reduces the whole loop to something like this:
j = 0;
for (i = 1; i < count; i++) {
if (array[i]->context == array[j]->context)
dma_fence_put(array[i]);
else
array[++j] = array[i];
}
count = ++j;
Regards,
Christian.
> + count = j;
> +
> + if (count > 1) {
> + result = dma_fence_array_create(count, array,
> + dma_fence_context_alloc(1),
> + 1, false);
> + if (!result) {
> + tmp = NULL;
> + goto return_tmp;
> }
> - } while (tmp);
> -
> - if (count == 0) {
> - tmp = dma_fence_allocate_private_stub(ktime_get());
> - goto return_tmp;
> + return &result->base;
> }
>
> - if (count == 1) {
> +return_fastpath:
> + if (count == 0)
> + tmp = dma_fence_allocate_private_stub(timestamp);
> + else
> tmp = array[0];
> - goto return_tmp;
> - }
> -
> - result = dma_fence_array_create(count, array,
> - dma_fence_context_alloc(1),
> - 1, false);
> - if (!result) {
> - tmp = NULL;
> - goto return_tmp;
> - }
> - return &result->base;
>
> return_tmp:
> kfree(array);
On Mon, Nov 11, 2024 at 07:32:42PM +0530, Jyothi Kumar Seerapu wrote:
> GSI hardware generates an interrupt for each transfer completion.
> For multiple messages within a single transfer, this results
> in receiving N interrupts for N messages, which can introduce
> significant software interrupt latency.
Here's an excellent opportunity to split your problem description
and solution description into two easy-to-read paragraphs by adding
some newlines.
> To mitigate this latency,
> utilize Block Event Interrupt (BEI) only when an interrupt is necessary.
> When using BEI, consider splitting a single multi-message transfer into
> chunks of 8. This approach can enhance overall transfer time and
The reason for the number 8 must be documented.
"This approach..." would also benefit from having its own paragraph,
once again.
> efficiency.
>
> Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> ---
>
> v1 -> v2:
> - Changed dma_addr type from array of pointers to array.
> - To support BEI functionality with the TRE size of 64 defined in GPI driver,
> updated QCOM_GPI_MAX_NUM_MSGS to 16 and NUM_MSGS_PER_IRQ to 8.
>
> drivers/dma/qcom/gpi.c | 49 ++++++++++++++++++++++++++++++++
> include/linux/dma/qcom-gpi-dma.h | 37 ++++++++++++++++++++++++
> 2 files changed, 86 insertions(+)
>
> diff --git a/drivers/dma/qcom/gpi.c b/drivers/dma/qcom/gpi.c
> index 52a7c8f2498f..a98de3178764 100644
> --- a/drivers/dma/qcom/gpi.c
> +++ b/drivers/dma/qcom/gpi.c
> @@ -1693,6 +1693,9 @@ static int gpi_create_i2c_tre(struct gchan *chan, struct gpi_desc *desc,
>
> tre->dword[3] = u32_encode_bits(TRE_TYPE_DMA, TRE_FLAGS_TYPE);
> tre->dword[3] |= u32_encode_bits(1, TRE_FLAGS_IEOT);
> +
> + if (i2c->flags & QCOM_GPI_BLOCK_EVENT_IRQ)
> + tre->dword[3] |= u32_encode_bits(1, TRE_FLAGS_BEI);
> }
>
> for (i = 0; i < tre_idx; i++)
> @@ -2098,6 +2101,52 @@ static int gpi_find_avail_gpii(struct gpi_dev *gpi_dev, u32 seid)
> return -EIO;
> }
>
> +/**
> + * gpi_multi_desc_process() - Process received transfers from GSI HW
> + * @dev: pointer to the corresponding dev node
> + * @multi_xfer: pointer to the gpi_multi_xfer
> + * @num_xfers: total number of transfers
> + * @transfer_timeout_msecs: transfer timeout value
> + * @transfer_comp: completion object of the transfer
> + *
> + * This function is used to process the received transfers based on the
> + * completion events
As far as I can tell it doesn't "process" anything. All it does is
reinit the completion (n + 7) / 8 times, and for the first n / 8
iterations it will wait for an externally defined completion.
Why is this function even defined here? It solely operates on
parameters coming from the I2C driver.
> + *
> + * Return: On success returns 0, otherwise return error code
> + */
> +int gpi_multi_desc_process(struct device *dev, struct gpi_multi_xfer *multi_xfer,
> + u32 num_xfers, u32 transfer_timeout_msecs,
> + struct completion *transfer_comp)
> +{
> + int i;
> + u32 max_irq_cnt, time_left;
> +
> + max_irq_cnt = num_xfers / NUM_MSGS_PER_IRQ;
> + if (num_xfers % NUM_MSGS_PER_IRQ)
> + max_irq_cnt++;
> +
> + /*
> + * Wait for the interrupts of the processed transfers in multiple
> + * of 64 and for the last transfer. If the hardware is fast and
I'm confused, where does this 64 come from?
> + * already processed all the transfers then no need to wait.
> + */
> + for (i = 0; i < max_irq_cnt; i++) {
> + reinit_completion(transfer_comp);
I'm trying to convince myself that this isn't racy, but the split
ownership of updating and checking multi_xfer->irq_cnt between the GPI
and I2C drivers is just too hard for me to follow.
> + if (max_irq_cnt != multi_xfer->irq_cnt) {
> + time_left = wait_for_completion_timeout(transfer_comp,
> + transfer_timeout_msecs);
> + if (!time_left) {
> + dev_err(dev, "%s: Transfer timeout\n", __func__);
> + return -ETIMEDOUT;
> + }
> + }
> + if (num_xfers > multi_xfer->msg_idx_cnt)
> + return 0;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(gpi_multi_desc_process);
The dmaengine framework is expected to provide an abstraction between
clients and DMA engines, so this doesn't look right.
> +
> /* gpi_of_dma_xlate: open client requested channel */
> static struct dma_chan *gpi_of_dma_xlate(struct of_phandle_args *args,
> struct of_dma *of_dma)
> diff --git a/include/linux/dma/qcom-gpi-dma.h b/include/linux/dma/qcom-gpi-dma.h
> index 6680dd1a43c6..1341ff0db808 100644
> --- a/include/linux/dma/qcom-gpi-dma.h
> +++ b/include/linux/dma/qcom-gpi-dma.h
> @@ -15,6 +15,12 @@ enum spi_transfer_cmd {
> SPI_DUPLEX,
> };
>
> +#define QCOM_GPI_BLOCK_EVENT_IRQ BIT(0)
> +
> +#define QCOM_GPI_MAX_NUM_MSGS 16
> +#define NUM_MSGS_PER_IRQ 8
> +#define MIN_NUM_OF_MSGS_MULTI_DESC 4
Prefixing these with QCOM_GPI_ seems like an excellent idea. Still puzzled
about the numbers 8 and 4 though: are they universal for all variants of
GPI, or are they just arbitrary numbers picked by experimentation?
> +
> /**
> * struct gpi_spi_config - spi config for peripheral
> *
> @@ -51,6 +57,29 @@ enum i2c_op {
> I2C_READ,
> };
>
> +/**
> + * struct gpi_multi_xfer - Used for multi transfer support
> + *
> + * @msg_idx_cnt: message index for the transfer
> + * @buf_idx: dma buffer index
> + * @unmap_msg_cnt: unampped transfer index
s/unampped/unmapped
> + * @freed_msg_cnt: freed transfer index
> + * @irq_cnt: received interrupt count
> + * @irq_msg_cnt: transfer message count for the received irqs
> + * @dma_buf: virtual address of the buffer
> + * @dma_addr: dma address of the buffer
"the buffer"? There's up to 16 of them...
As mentioned above, I'm skeptical about this custom API - but if we
were to go this route, the exact responsibilities and semantics should
be documented.
Regards,
Bjorn
> + */
> +struct gpi_multi_xfer {
> + u32 msg_idx_cnt;
> + u32 buf_idx;
> + u32 unmap_msg_cnt;
> + u32 freed_msg_cnt;
> + u32 irq_cnt;
> + u32 irq_msg_cnt;
> + void *dma_buf[QCOM_GPI_MAX_NUM_MSGS];
> + dma_addr_t dma_addr[QCOM_GPI_MAX_NUM_MSGS];
> +};
> +
> /**
> * struct gpi_i2c_config - i2c config for peripheral
> *
> @@ -65,6 +94,8 @@ enum i2c_op {
> * @rx_len: receive length for buffer
> * @op: i2c cmd
> * @muli-msg: is part of multi i2c r-w msgs
> + * @flags: true for block event interrupt support
> + * @multi_xfer: indicates transfer has multi messages
> */
> struct gpi_i2c_config {
> u8 set_config;
> @@ -78,6 +109,12 @@ struct gpi_i2c_config {
> u32 rx_len;
> enum i2c_op op;
> bool multi_msg;
> + u8 flags;
> + struct gpi_multi_xfer multi_xfer;
> };
>
> +int gpi_multi_desc_process(struct device *dev, struct gpi_multi_xfer *multi_xfer,
> + u32 num_xfers, u32 tranfer_timeout_msecs,
> + struct completion *transfer_comp);
> +
> #endif /* QCOM_GPI_DMA_H */
> --
> 2.17.1
>
>
On Mon, Nov 11, 2024 at 07:32:43PM +0530, Jyothi Kumar Seerapu wrote:
> I2C_QCOM_GENI has compile-time dependencies on QCOM_GPI_DMA, so
> update I2C_QCOM_GENI to depend on QCOM_GPI_DMA.
>
Given that this is a separate patch, your wording can only be
interpreted as this being an existing problem.
> Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> ---
>
> v1 -> v2:
> This patch is added in v2 to address the kernel test robot
> reported compilation error.
> ERROR: modpost: "gpi_multi_desc_process" [drivers/i2c/busses/i2c-qcom-geni.ko] undefined!
But as far as I can tell you introduce this problem in patch 3. If so,
this addition should be part of patch 3.
Also, you have different subject prefixes for patches 2 and 3, yet they
relate to the same driver. Not pretty.
Regards,
Bjorn
>
> drivers/i2c/busses/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index 0aa948014008..87634a682855 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -1049,6 +1049,7 @@ config I2C_QCOM_GENI
> tristate "Qualcomm Technologies Inc.'s GENI based I2C controller"
> depends on ARCH_QCOM || COMPILE_TEST
> depends on QCOM_GENI_SE
> + depends on QCOM_GPI_DMA
> help
> This driver supports GENI serial engine based I2C controller in
> master mode on the Qualcomm Technologies Inc.'s SoCs. If you say
> --
> 2.17.1
>
>
Am 08.11.24 um 12:22 schrieb Tvrtko Ursulin:
> On 07/11/2024 16:00, Tvrtko Ursulin wrote:
>> On 24/10/2024 13:41, Christian König wrote:
>>> The merge function initially handled only individual fences and
>>> arrays which in turn were created by the merge function. This allowed
>>> to create the new array by a simple merge sort based on the fence
>>> context number.
>>>
>>> The problem is now that since the addition of timeline sync objects
>>> userspace can create chain containers in basically any fence context
>>> order.
>>>
>>> If those are merged together it can happen that we create really
>>> large arrays since the merge sort algorithm doesn't work any more.
>>>
>>> So put an insert sort behind the merge sort which kicks in when the
>>> input fences are not in the expected order. This isn't as efficient
>>> as a heap sort, but has better properties for the most common use
>>> case.
>>>
>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>> ---
>>> drivers/dma-buf/dma-fence-unwrap.c | 39
>>> ++++++++++++++++++++++++++----
>>> 1 file changed, 34 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>>> b/drivers/dma-buf/dma-fence-unwrap.c
>>> index 628af51c81af..d9aa280d9ff6 100644
>>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>>> @@ -106,7 +106,7 @@ struct dma_fence
>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>> fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>> count = 0;
>>> - do {
>>> + while (true) {
>>> unsigned int sel;
>>> restart:
>>> @@ -144,11 +144,40 @@ struct dma_fence
>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>> }
>>> }
>>> - if (tmp) {
>>> - array[count++] = dma_fence_get(tmp);
>>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> + if (!tmp)
>>> + break;
>>> +
>>> + /*
>>> + * We could use a binary search here, but since the assumption
>>> + * is that the main input are already sorted dma_fence_arrays
>>> + * just looking from end has a higher chance of finding the
>>> + * right location on the first try
>>> + */
>>> +
>>> + for (i = count; i--;) {
>>> + if (likely(array[i]->context < tmp->context))
>>> + break;
>>> +
>>> + if (array[i]->context == tmp->context) {
>>> + if (dma_fence_is_later(tmp, array[i])) {
>>> + dma_fence_put(array[i]);
>>> + array[i] = dma_fence_get(tmp);
>>> + }
>>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> + goto restart;
>>> + }
>>> }
>>> - } while (tmp);
>>> +
>>> + ++i;
>>> + /*
>>> + * Make room for the fence, this should be a nop most of the
>>> + * time.
>>> + */
>>> + memcpy(&array[i + 1], &array[i], (count - i) *
>>> sizeof(*array));
>>> + array[i] = dma_fence_get(tmp);
>>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> + count++;
>>
>> Having ventured into this function for the first time, I can say that
>> this is some smart code which is not easy to grasp. It could
>> definitely benefit from a high level comment before the do-while loop
>> to explain what it is going to do.
>>
>> I also wonder whether the next and tmp local variable names could be
>> renamed to something more descriptive.
>>
>> And the algorithmic complexity of the end result, given the multiple
>> loops and gotos, I have no idea what it could be.
>>
>> Has a dumb solution been considered, like a two-pass approach with a
>> pessimistically allocated fence array? Like:
>>
>> 1) Populate array with all unsignalled unwrapped fences. (O(count))
>>
>> 2) Bog standard include/linux/sort.h by context and seqno.
>> (O(count*log (count)))
>>
>> 3) Walk array and squash same context to latest fence. (Before this
>> patch that wasn't there, right?). (O(count)) (Overwrite in place, no
>> memcpy needed.)
>>
>> Algorithmic complexity of that would be obvious and code much simpler.
>
> FWIW something like the below passes selftests. How does it look to
> you? Do you think it is more or less efficient, and more or less readable?
Yeah, I was considering the exact same thing.
What held me back was the fact that the heap sort() implementation is
really inefficient for the most common use case here. In other words,
merging two arrays of already sorted fences is basically just O(count).
And I'm also not sure how many fences we see in those arrays in
practice. With Vulkan basically trying to feed multiple contexts to keep
all CPUs busy we might have quite a number here.
Regards,
Christian.
>
> commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9
> Author: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
> Date: Fri Nov 8 10:14:15 2024 +0000
>
> __dma_fence_unwrap_merge
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
> b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..47d67e482e96 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -12,6 +12,7 @@
> #include <linux/dma-fence-chain.h>
> #include <linux/dma-fence-unwrap.h>
> #include <linux/slab.h>
> +#include <linux/sort.h>
>
> /* Internal helper to start new array iteration, don't use directly */
> static struct dma_fence *
> @@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct
> dma_fence_unwrap *cursor)
> }
> EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>
> +
> +static int fence_cmp(const void *_a, const void *_b)
> +{
> + const struct dma_fence *a = *(const struct dma_fence **)_a;
> + const struct dma_fence *b = *(const struct dma_fence **)_b;
> +
> + if (a->context < b->context)
> + return -1;
> + else if (a->context > b->context)
> + return 1;
> +
> + if (a->seqno < b->seqno)
> + return -1;
> + else if (a->seqno > b->seqno)
> + return 1;
> +
> + return 0;
> +}
> +
> /* Implementation for the dma_fence_merge() marco, don't use directly */
> struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
> struct dma_fence **fences,
> struct dma_fence_unwrap *iter)
> {
> - struct dma_fence_array *result;
> struct dma_fence *tmp, **array;
> + struct dma_fence_array *result;
> ktime_t timestamp;
> - unsigned int i;
> - size_t count;
> + int i, j, count;
>
> + /*
> + * Count number of unwrapped fences and fince the latest signaled
> + * timestamp.
> + */
> count = 0;
> timestamp = ns_to_ktime(0);
> for (i = 0; i < num_fences; ++i) {
> @@ -92,63 +115,41 @@ struct dma_fence
> *__dma_fence_unwrap_merge(unsigned int num_fences,
> if (count == 0)
> return dma_fence_allocate_private_stub(timestamp);
>
> + /*
> + * Allocate and populate the array.
> + */
> array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
> if (!array)
> return NULL;
>
> - /*
> - * This trashes the input fence array and uses it as position for
> the
> - * following merge loop. This works because the dma_fence_merge()
> - * wrapper macro is creating this temporary array on the stack
> together
> - * with the iterators.
> - */
> - for (i = 0; i < num_fences; ++i)
> - fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
> -
> count = 0;
> - do {
> - unsigned int sel;
> -
> -restart:
> - tmp = NULL;
> - for (i = 0; i < num_fences; ++i) {
> - struct dma_fence *next;
> -
> - while (fences[i] && dma_fence_is_signaled(fences[i]))
> - fences[i] = dma_fence_unwrap_next(&iter[i]);
> -
> - next = fences[i];
> - if (!next)
> - continue;
> -
> - /*
> - * We can't guarantee that inpute fences are ordered by
> - * context, but it is still quite likely when this
> - * function is used multiple times. So attempt to order
> - * the fences by context as we pass over them and merge
> - * fences with the same context.
> - */
> - if (!tmp || tmp->context > next->context) {
> - tmp = next;
> - sel = i;
> -
> - } else if (tmp->context < next->context) {
> - continue;
> -
> - } else if (dma_fence_is_later(tmp, next)) {
> - fences[i] = dma_fence_unwrap_next(&iter[i]);
> - goto restart;
> - } else {
> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> - goto restart;
> - }
> + for (i = 0; i < num_fences; ++i) {
> + dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
> + if (!dma_fence_is_signaled(tmp))
> + array[count++] = tmp;
> }
> + }
> +
> + /*
> + * Sort in context and seqno order.
> + */
> + sort(array, count, sizeof(*array), fence_cmp, NULL);
>
> - if (tmp) {
> - array[count++] = dma_fence_get(tmp);
> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> + /*
> + * Only keep the most recent fence for each context.
> + */
> + j = 0;
> + tmp = array[0];
> + for (i = 1; i < count; i++) {
> + if (array[i]->context != tmp->context) {
> + array[j++] = dma_fence_get(tmp);
> }
> - } while (tmp);
> + tmp = array[i];
> + }
> + if (tmp->context != array[j - 1]->context) {
> + array[j++] = dma_fence_get(tmp);
> + }
> + count = j;
>
> if (count == 0) {
> tmp = dma_fence_allocate_private_stub(ktime_get());
>
>
> Regards,
>
> Tvrtko
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> + };
>>> if (count == 0) {
>>> tmp = dma_fence_allocate_private_stub(ktime_get());
Am 07.11.24 um 12:29 schrieb Tvrtko Ursulin:
>
> On 28/10/2024 10:34, Christian König wrote:
>> Am 25.10.24 um 11:05 schrieb Tvrtko Ursulin:
>>>
>>> On 25/10/2024 09:59, Tvrtko Ursulin wrote:
>>>>
>>>> On 24/10/2024 13:41, Christian König wrote:
>>>>> Reports indicate that some userspace applications try to merge
>>>>> more than
>>>>> 80k fences into a single dma_fence_array, leading to a warning from
>>>>> kzalloc() that the requested size becomes too big.
>>>>>
>>>>> While that is clearly a userspace bug, we should probably handle
>>>>> that case
>>>>> gracefully in the kernel.
>>>>>
>>>>> So we can either reject requests to merge more than a reasonable
>>>>> amount of
>>>>> fences (64k maybe?) or we can start to use kvzalloc() instead of
>>>>> kzalloc().
>>>>> This patch here does the latter.
>>>>
>>>> Rejecting would potentially be safer, otherwise there is a path for
>>>> userspace to trigger a warn in kvmalloc_node (see 0829b5bcdd3b
>>>> ("drm/i915: 2 GiB of relocations ought to be enough for anybody*"))
>>>> and spam dmesg at will.
>>>
>>> Actually that is a WARN_ON_*ONCE* there so maybe not so critical to
>>> invent a limit. Up for discussion I suppose.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> Question is what limit to set...
>>
>> That's one of the reasons why I opted for kvzalloc() initially.
>
> I didn't get that, what was the reason? To not have to invent an
> arbitrary limit?
Well, the reason was that I couldn't come up with any arbitrary limit
that I was confident would work and not block real-world use cases.
Switching to kvzalloc() just seemed the more defensive approach.
>
>> I mean we could use some nice round number like 65536, but that would
>> be totally arbitrary.
>
> Yeah.. Set an arbitrary limit so a warning in __kvmalloc_node_noprof()
> is avoided? Or pass __GFP_NOWARN?
Well are we sure that will never hit 65536 in a real world use case?
It's still pretty low.
>
>> Any comments on the other two patches? I need to get them upstream.
>
> Will look into them shortly.
Thanks,
Christian.
>
> Regards,
>
> Tvrtko
>
>
>> Thanks,
>> Christian.
>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>>>> CC: stable(a)vger.kernel.org
>>>>> ---
>>>>> drivers/dma-buf/dma-fence-array.c | 6 +++---
>>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-fence-array.c
>>>>> b/drivers/dma-buf/dma-fence-array.c
>>>>> index 8a08ffde31e7..46ac42bcfac0 100644
>>>>> --- a/drivers/dma-buf/dma-fence-array.c
>>>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>>>> @@ -119,8 +119,8 @@ static void dma_fence_array_release(struct
>>>>> dma_fence *fence)
>>>>> for (i = 0; i < array->num_fences; ++i)
>>>>> dma_fence_put(array->fences[i]);
>>>>> - kfree(array->fences);
>>>>> - dma_fence_free(fence);
>>>>> + kvfree(array->fences);
>>>>> + kvfree_rcu(fence, rcu);
>>>>> }
>>>>> static void dma_fence_array_set_deadline(struct dma_fence *fence,
>>>>> @@ -153,7 +153,7 @@ struct dma_fence_array
>>>>> *dma_fence_array_alloc(int num_fences)
>>>>> {
>>>>> struct dma_fence_array *array;
>>>>> - return kzalloc(struct_size(array, callbacks, num_fences),
>>>>> GFP_KERNEL);
>>>>> + return kvzalloc(struct_size(array, callbacks, num_fences),
>>>>> GFP_KERNEL);
>>>>> }
>>>>> EXPORT_SYMBOL(dma_fence_array_alloc);
>>
Am 30.10.24 um 19:10 schrieb Friedrich Vock:
> On 24.10.24 14:41, Christian König wrote:
>> The merge function initially handled only individual fences and
>> arrays which in turn were created by the merge function. This allowed
>> to create the new array by a simple merge sort based on the fence
>> context number.
>>
>> The problem is now that since the addition of timeline sync objects
>> userspace can create chain containers in basically any fence context
>> order.
>>
>> If those are merged together it can happen that we create really
>> large arrays since the merge sort algorithm doesn't work any more.
>>
>> So put an insert sort behind the merge sort which kicks in when the
>> input fences are not in the expected order. This isn't as efficient
>> as a heap sort, but has better properties for the most common use
>> case.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>> 1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..d9aa280d9ff6 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -106,7 +106,7 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>
>> count = 0;
>> - do {
>> + while (true) {
>> unsigned int sel;
>>
>> restart:
>> @@ -144,11 +144,40 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> }
>> }
>>
>> - if (tmp) {
>> - array[count++] = dma_fence_get(tmp);
>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> + if (!tmp)
>> + break;
>> +
>> + /*
>> + * We could use a binary search here, but since the assumption
>> + * is that the main input are already sorted dma_fence_arrays
>> + * just looking from end has a higher chance of finding the
>> + * right location on the first try
>> + */
>> +
>> + for (i = count; i--;) {
>
> This is broken. The first iteration of this loop will always index out
> of bounds.
Nope, that is correct. The condition is evaluated before each iteration,
so the i-- reduces the index to the last element of the array before the
body runs.
Regards,
Christian.
> What you probably want here is:
>
> + for (i = count - 1; count && i--;) {
>
> This intentionally overflows for count == 0, but the ++i after the loop
> undoes that. Maybe it would be worth a comment to point out that's
> intentional.
>
>> + if (likely(array[i]->context < tmp->context))
>> + break;
>> +
>> + if (array[i]->context == tmp->context) {
>> + if (dma_fence_is_later(tmp, array[i])) {
>> + dma_fence_put(array[i]);
>> + array[i] = dma_fence_get(tmp);
>> + }
>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> + goto restart;
>> + }
>> }
>> - } while (tmp);
>> +
>> + ++i;
>> + /*
>> + * Make room for the fence, this should be a nop most of the
>> + * time.
>> + */
>> + memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
>
> Need memmove here, src and dst alias.
>
> I took it for a spin with these things fixed and it seemed to resolve
> the issue as well. How do you want to proceed? I guess I would be
> comfortable putting a Reviewed-by and/or Tested-by on a version with
> these things fixed (with the usual caveat that I'm not a maintainer - I
> guess the process requires (at least one) reviewer to be?).
>
> By the way, I guess you might've had some internal branches where this
> fix needed to go into quick or something? Usually I'm happy to make a v2
> for my patches myself, too ;)
>
> Regards,
> Friedrich
>
>> + array[i] = dma_fence_get(tmp);
>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> + count++;
>> + };
>>
>> if (count == 0) {
>> tmp = dma_fence_allocate_private_stub(ktime_get());
>
On Thu, Oct 31, 2024 at 05:45:23PM +0100, Maxime Ripard wrote:
> On Wed, Oct 30, 2024 at 12:16:22PM +0100, metux wrote:
> > On 22.10.24 10:38, Maxime Ripard wrote:
> > > I'm still interested in merging a carve-out driver[1], since it seems to be
> > > in every vendor BSP and got asked again last week.
> > >
> > > I remember from our discussion that for new heap types to be merged, we
> > > needed a kernel use-case. Looking back, I'm not entirely sure how one
> > > can provide that given that heaps are essentially facilities for
> > > user-space.
> >
> > For those who didn't follow your work, could you please give a short
> > intro what's that all about ?
> >
> > If I understand you correctly, you'd like the infrastructure of
> > kmalloc() et al for things / memory regions that aren't the usual heap,
> > right ?
>
> No, not really. The discussion is about dma-buf heaps. They allow to
> allocate buffers suitable for DMA from userspace. The memory might or
> might not come from system memory, at the heap driver's discretion.
I'm afraid you've misinterpreted that - our hexapedal friend had just
* noticed that excessive crossposting can get it banned
* got itself a new address
* posted a solitary ping as the first test
* followed that by testing the ability to cross-post (posting you'd
been replying to, contents on chatGPT level)
* proceeded to use its shiny new address for more of the chorus
whinge exercise they'd been holding with several other similar fellows
(or sock puppets, for all I know).
Just ignore the wanker - it wasn't trying to get any information other
than "will the posting get through" anyway.
On Tue, Oct 29, 2024 at 04:35:16PM +0000, Pavel Begunkov wrote:
> I see, the reply is about your phrase about additional memory
> abstractions:
>
> "... don't really need to build memory buffer abstraction over
> memory buffer abstraction."
Yes, over the existing memory buffer abstraction (dma_buf).
> If you mean internals, making up a dmabuf that has never existed in the
> picture in the first place is not cleaner or easier in any way. If that
> changes, e.g. there is more code to reuse in the future, we can unify it
> then.
I'm not sure what "making up" means here, they are all made up :)
> > with pre-registering the memry with the iommu to get good performance
> > in IOMMU-enabled setups.
>
> The page pool already does that just like it handles the normal
> path without providers.
In which case it basically is a dma-buf. If you'd expose it as such,
we could actually use it to communicate between subsystems in the
kernel.
Am 25.10.24 um 08:52 schrieb Friedrich Vock:
> On 24.10.24 22:29, Matthew Brost wrote:
>> On Thu, Oct 24, 2024 at 02:41:57PM +0200, Christian König wrote:
>>> Reports indicates that some userspace applications try to merge more
>>> than
>>> 80k of fences into a single dma_fence_array leading to a warning from
>>
>> Really, yikes.
>
> Not really IME. Unless Christian means some reports I don't have access
> to, the cases where userspace applications tried to do that were really
> just cases where the fence count exploded exponentially because
> dma_fence_unwrap_merge failed to actually merge identical fences (see
> patch 2). At no point have I actually seen apps trying to merge 80k+
> unique fences.
While working on it I modified our stress test tool to send the same
1GiB SDMA copy to 100k different contexts.
It turned out it's perfectly possible to create that many fences; there
is nothing blocking userspace from doing so.
While this isn't a realistic use case, the kernel should at least not
crash or spew a warning, but either handle the request or reject it
gracefully.
Friedrich can you confirm that patch two in this series fixes the
problem? I would really like to get it into drm-misc-fixes before 6.12
comes out.
Thanks,
Christian.
>
> Regards,
> Friedrich
>
>>
>>> kzalloc() that the requested size becomes to big.
>>>
>>> While that is clearly an userspace bug we should probably handle
>>> that case
>>> gracefully in the kernel.
>>>
>>> So we can either reject requests to merge more than a reasonable
>>> amount of
>>> fences (64k maybe?) or we can start to use kvzalloc() instead of
>>> kzalloc().
>>> This patch here does the later.
>>>
>>
>> This patch seems reasonable to me if the above use is in fact valid.
>>
>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>> CC: stable(a)vger.kernel.org
>>
>> Fixes tag?
>>
>> Patch itself LGTM:
>> Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
>>
>>> ---
>>> drivers/dma-buf/dma-fence-array.c | 6 +++---
>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-array.c
>>> b/drivers/dma-buf/dma-fence-array.c
>>> index 8a08ffde31e7..46ac42bcfac0 100644
>>> --- a/drivers/dma-buf/dma-fence-array.c
>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>> @@ -119,8 +119,8 @@ static void dma_fence_array_release(struct
>>> dma_fence *fence)
>>> for (i = 0; i < array->num_fences; ++i)
>>> dma_fence_put(array->fences[i]);
>>>
>>> - kfree(array->fences);
>>> - dma_fence_free(fence);
>>> + kvfree(array->fences);
>>> + kvfree_rcu(fence, rcu);
>>> }
>>>
>>> static void dma_fence_array_set_deadline(struct dma_fence *fence,
>>> @@ -153,7 +153,7 @@ struct dma_fence_array
>>> *dma_fence_array_alloc(int num_fences)
>>> {
>>> struct dma_fence_array *array;
>>>
>>> - return kzalloc(struct_size(array, callbacks, num_fences),
>>> GFP_KERNEL);
>>> + return kvzalloc(struct_size(array, callbacks, num_fences),
>>> GFP_KERNEL);
>>> }
>>> EXPORT_SYMBOL(dma_fence_array_alloc);
>>>
>>> --
>>> 2.34.1
>>>
>
On Thu, Oct 24, 2024 at 05:40:02PM +0100, Pavel Begunkov wrote:
> On 10/24/24 17:06, Christoph Hellwig wrote:
> > On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
> > > > That's not what this series does. It adds the new memory_provider_ops
> > > > set of hooks, with once implementation for dmabufs, and one for
> > > > io_uring zero copy.
> > >
> > > First, it's not a _new_ abstraction over a buffer as you called it
> > > before, the abstraction (net_iov) is already merged.
> >
> > Umm, it is a new ops vector.
>
> I don't understand what you mean. Callback?
struct memory_provider_ops. It's a method table or ops vector, no
callbacks involved.
> Then please go ahead and take a look at the patchset in question
> and see how much of dmabuf handling is there comparing to pure
> networking changes. The point that it's a new set of API and lots
> of changes not related directly to dmabufs stand. dmabufs is useful
> there as an abstraction there, but it's a very long stretch saying
> that the series is all about it.
I did take a look, that's why I replied.
> > > on an existing network specific abstraction, which are not restricted to
> > > pages or anything specific in the long run, but the flow of which from
> > > net stack to user and back is controlled by io_uring. If you worry about
> > > abuse, io_uring can't even sanely initialise those buffers itself and
> > > therefore asking the page pool code to do that.
> >
> > No, I worry about trying to io_uring for not good reason. This
>
> It sounds that the argument is that you just don't want any
> io_uring APIs, I don't think you'd be able to help you with
> that.
No, that's completely misinterpreting what I'm saying. Of course an
io_uring API is fine. But tying it to low-level implementation details
is not.
> > pre-cludes in-kernel uses which would be extremly useful for
>
> Uses of what? devmem TCP is merged, I'm not removing it,
> and the net_iov abstraction is in there, which can be potentially
> be reused by other in-kernel users if that'd even make sense.
How, when you are hardcoding io_uring memory registrations instead
of making them a generic dmabuf? Which btw would also really help
with pre-registering the memory with the IOMMU to get good performance
in IOMMU-enabled setups.
Am 25.10.24 um 11:05 schrieb Tvrtko Ursulin:
>
> On 25/10/2024 09:59, Tvrtko Ursulin wrote:
>>
>> On 24/10/2024 13:41, Christian König wrote:
>>> Reports indicates that some userspace applications try to merge more
>>> than
>>> 80k of fences into a single dma_fence_array leading to a warning from
>>> kzalloc() that the requested size becomes to big.
>>>
>>> While that is clearly an userspace bug we should probably handle
>>> that case
>>> gracefully in the kernel.
>>>
>>> So we can either reject requests to merge more than a reasonable
>>> amount of
>>> fences (64k maybe?) or we can start to use kvzalloc() instead of
>>> kzalloc().
>>> This patch here does the later.
>>
>> Rejecting would potentially be safer, otherwise there is a path for
>> userspace to trigger a warn in kvmalloc_node (see 0829b5bcdd3b
>> ("drm/i915: 2 GiB of relocations ought to be enough for anybody*"))
>> and spam dmesg at will.
>
> Actually that is a WARN_ON_*ONCE* there so maybe not so critical to
> invent a limit. Up for discussion I suppose.
>
> Regards,
>
> Tvrtko
>
>>
>> Question is what limit to set...
That's one of the reasons why I opted for kvzalloc() initially.
I mean we could use some nice round number like 65536, but that would be
totally arbitrary.
Any comments on the other two patches? I need to get them upstream.
Thanks,
Christian.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>> CC: stable(a)vger.kernel.org
>>> ---
>>> drivers/dma-buf/dma-fence-array.c | 6 +++---
>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-array.c
>>> b/drivers/dma-buf/dma-fence-array.c
>>> index 8a08ffde31e7..46ac42bcfac0 100644
>>> --- a/drivers/dma-buf/dma-fence-array.c
>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>> @@ -119,8 +119,8 @@ static void dma_fence_array_release(struct
>>> dma_fence *fence)
>>> for (i = 0; i < array->num_fences; ++i)
>>> dma_fence_put(array->fences[i]);
>>> - kfree(array->fences);
>>> - dma_fence_free(fence);
>>> + kvfree(array->fences);
>>> + kvfree_rcu(fence, rcu);
>>> }
>>> static void dma_fence_array_set_deadline(struct dma_fence *fence,
>>> @@ -153,7 +153,7 @@ struct dma_fence_array
>>> *dma_fence_array_alloc(int num_fences)
>>> {
>>> struct dma_fence_array *array;
>>> - return kzalloc(struct_size(array, callbacks, num_fences),
>>> GFP_KERNEL);
>>> + return kvzalloc(struct_size(array, callbacks, num_fences),
>>> GFP_KERNEL);
>>> }
>>> EXPORT_SYMBOL(dma_fence_array_alloc);
On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
> > That's not what this series does. It adds the new memory_provider_ops
> > set of hooks, with once implementation for dmabufs, and one for
> > io_uring zero copy.
>
> First, it's not a _new_ abstraction over a buffer as you called it
> before, the abstraction (net_iov) is already merged.
Umm, it is a new ops vector.
> Second, you mention devmem TCP, and it's not just a page pool with
> "dmabufs", it's a user API to use it and other memory agnostic
> allocation logic. And yes, dmabufs there is the least technically
> important part. Just having a dmabuf handle solves absolutely nothing.
It solves a lot, because it provides a proper abstraction.
> > So you are precluding zero copy RX into anything but your magic
> > io_uring buffers, and using an odd abstraction for that.
>
> Right io_uring zero copy RX API expects transfer to happen into io_uring
> controlled buffers, and that's the entire idea. Buffers that are based
> on an existing network specific abstraction, which are not restricted to
> pages or anything specific in the long run, but the flow of which from
> net stack to user and back is controlled by io_uring. If you worry about
> abuse, io_uring can't even sanely initialise those buffers itself and
> therefore asking the page pool code to do that.
No, I worry about trying to use io_uring for no good reason. This
precludes in-kernel uses which would be extremely useful for
network storage drivers, and it precludes device memory of all
kinds.
> I'm even more confused how that would help. The user API has to
> be implemented and adding a new dmabuf gives nothing, not even
> mentioning it's not clear what semantics of that beast is
> supposed to be.
>
The dma-buf maintainers already explained to you last time
that there is absolutely no need to use the dmabuf UAPI, you
can use dma-bufs through in-kernel interfaces just fine.
Hi guys,
it turned out that userspace can also merge dma_fence_chain containers,
which can result in really huge arrays.
Fix those merges to sort the arrays and remove the duplicates.
Additional to that start to use kvzalloc() for dma_fence_array
containers so that can handle much larger arrays if necessary.
Please review and comment,
Christian.
On Wed, Oct 23, 2024 at 03:34:53PM +0100, Pavel Begunkov wrote:
> It doesn't care much what kind of memory it is, nor it's important
> for internals how it's imported, it's user addresses -> pages for
> user convenience sake. All the net_iov setup code is in the page pool
> core code. What it does, however, is implementing the user API, so
That's not what this series does. It adds the new memory_provider_ops
set of hooks, with one implementation for dmabufs and one for
io_uring zero copy.
So you are precluding zero copy RX into anything but your magic
io_uring buffers, and using an odd abstraction for that.
The right way would be to support zero copy RX into every
designated dmabuf, and make io_uring work with udmabuf or, if
absolutely needed, its own kind of dmabuf. Instead we create
a maze of incompatible abstractions here. The use case of e.g.
doing zero copy receive into a NVMe CMB using PCIe P2P transactions
is anything but made up, so this does create a problem.
On Wed, Oct 16, 2024 at 11:52:39AM -0700, David Wei wrote:
> From: Pavel Begunkov <asml.silence(a)gmail.com>
>
> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> which serves as a useful abstraction to share data and provide a
> context. However, it's too devmem specific, and we want to reuse it for
> other memory providers, and for that we need to decouple net_iov from
> devmem. Make net_iov to point to a new base structure called
> net_iov_area, which dmabuf_genpool_chunk_owner extends.
We've been there before. Instead of reinventing your own memory
provider please enhance dmabufs for your use case. We don't really
need to build memory buffer abstraction over memory buffer abstraction.
On Tue, Oct 22, 2024 at 1:38 AM Maxime Ripard <mripard(a)redhat.com> wrote:
>
> I wanted to follow-up on the discussion we had at Plumbers with John and
> T.J. about (among other things) adding new heaps to the kernel.
>
> I'm still interested in merging a carve-out driver[1], since it seems to be
> in every vendor BSP and got asked again last week.
>
> I remember from our discussion that for new heap types to be merged, we
> needed a kernel use-case. Looking back, I'm not entirely sure how one
> can provide that given that heaps are essentially facilities for
> user-space.
>
> Am I misremembering or missing something? What are the requirements for
> you to consider adding a new heap driver?
It's basically the same as the DRM subsystem rules.
https://docs.kernel.org/gpu/drm-uapi.html#open-source-userspace-requirements
ie: there has to be an open-source user for it, and the user has to be
more significant than a "toy" implementation (which can be a bit
subjective and contentious when trying to get out of a chicken-and-egg
loop).
thanks
-john