On Tue, 15 Oct 2024 17:37:46 +0530, Jyothi Kumar Seerapu wrote:
> When high performance with multiple I2C messages in a single transfer
> is required, employ the Block Event Interrupt (BEI) to trigger
> interrupts only after specific message transfers and after the last
> message transfer, thereby reducing the number of interrupts.
>
> For each I2C message transfer, a series of Transfer Request Elements
> (TREs) must be programmed: a config TRE for frequency configuration,
> a go TRE holding the I2C address, and a DMA TRE holding the DMA
> buffer address and length, as per the hardware programming guide.
> For transfers using BEI, multiple I2C messages may necessitate the
> preparation of config, go, and TX DMA TREs. However, a channel TRE
> size of 64 is often insufficient, potentially leading to failures
> due to inadequate memory space.
>
> Add an additional argument to the dma-cells property for the channel
> TRE size. With this, the channel TRE size can be adjusted via the
> device tree. The default size is 64, but clients can modify this
> value based on their specific requirements.
>
> Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> ---
> Documentation/devicetree/bindings/dma/qcom,gpi.yaml | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
My bot found errors running 'make dt_binding_check' on your patch:
yamllint warnings/errors:
dtschema/dtc warnings/errors:
/builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/dma/qcom,gpi.yaml: properties:#dma-cells: 'minItems' is not one of ['description', 'deprecated', 'const', 'enum', 'minimum', 'maximum', 'multipleOf', 'default', '$ref', 'oneOf']
from schema $id: http://devicetree.org/meta-schemas/core.yaml#
/builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/dma/qcom,gpi.yaml: properties:#dma-cells: 'maxItems' is not one of ['description', 'deprecated', 'const', 'enum', 'minimum', 'maximum', 'multipleOf', 'default', '$ref', 'oneOf']
from schema $id: http://devicetree.org/meta-schemas/core.yaml#
doc reference errors (make refcheckdocs):
See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/202410151207…
The base for the series is generally the latest rc1. A different dependency
should be noted in *this* patch.
If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:
pip3 install dtschema --upgrade
Please check and re-submit after running the above command yourself. Note
that DT_SCHEMA_FILES can be set to your schema file to speed up checking
your schema. However, it must be unset to test all examples with your schema.
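Concretely, the meta-schema only allows scalar constraints under
'#dma-cells', so the array keywords have to be replaced. A minimal
sketch of a conforming form, assuming the intent is to allow either
the existing three cells or an optional fourth one (the exact values
here are an assumption, not taken from the patch):

    "#dma-cells":
      enum: [3, 4]
      description: >
        The optional fourth cell selects the channel TRE size.

A typical way to re-run the check against just this schema, and then
against all examples:

    pip3 install dtschema --upgrade
    make dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/dma/qcom,gpi.yaml
    make dt_binding_check    # DT_SCHEMA_FILES unset: validate everything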
On Sat, Oct 5, 2024 at 11:10 AM Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> These warnings/errors are reported by checkpatch.
> Fix them with minor changes to make it clean.
> No other functional changes.
>
> WARNING: Block comments use * on subsequent lines
> + /* only support discovering the end of the buffer,
> + but also allow SEEK_SET to maintain the idiomatic
>
> WARNING: Block comments use a trailing */ on a separate line
> + SEEK_END(0), SEEK_CUR(0) pattern */
>
> WARNING: Block comments use a trailing */ on a separate line
> + * before passing the sgt back to the exporter. */
>
> ERROR: "foo * bar" should be "foo *bar"
> +static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
>
> WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
> + d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
>
> total: 1 errors, 4 warnings, 1746 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
Looks ok to me. Thanks for sending these cleanups!
Acked-by: John Stultz <jstultz(a)google.com>
Hello,
We held a workshop at XDC 2024 titled "Towards a universal buffer
allocator for Linux", whose abstract was as follows:
Buffer allocation for media contents, despite being required for any
framework or application dealing with image capture, processing,
decoding, encoding, rendering and display, remains an area plagued by
many unsolved problems. Over time improvements have been made to APIs
for buffer allocation, both on the kernel side (standardization of the
DRM dumb buffer API, or DMA heaps, to name a few) and in userspace
(most notably with GBM, and the buffer management API in Vulkan), or
for specific use cases (e.g. gralloc in Android). Unfortunately, no
universal solution exists to allocate buffers shared by multiple
devices. This is hindering interoperability and forces userspace to
pile hacks and workarounds.
(https://indico.freedesktop.org/event/6/contributions/395/).
Here are the raw notes from the workshop.
XDC 2024 - Buffer allocation workshop
=====================================
Attendees:
- Erico Nunes <ernunes(a)redhat.com>
- James Jones <jajones(a)nvidia.com>
- Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
- Nicolas Dufresne <nicolas(a)collabora.com>
- Yunxiang (Teddy) Li <yunxiang.li(a)amd.com>
Relevant content:
- XDC 2016: https://www.x.org/wiki/Events/XDC2016/Program/Unix_Device_Memory_Allocation…
- XDC 2020: https://lpc.events/event/9/contributions/615/attachments/704/1301/XDC_2020_…
Discussions
-----------
NICs are relevant to the discussion, but likely only on large servers.
The proposal needs to accommodate that.
Nicolas asked about support for the CPU as a device. A lot of pipelines
are hybrid, with CPU processing and dedicated hardware processing. James
said previous proposals were able to support the CPU as a device. The
last proposal was focused on usages, so it could support CPUs.
Demi thinks this could be viewed in a similar way to type checking in
compilers, which have to resolve constraints.
Nicolas is annoyed that we force applications to make a decision,
requiring them to go the hard route of reallocation if they get it
wrong. Some of the parameters are the MMU configuration, which can't be
changed afterwards in Linux.
Demi said that there could be cases where the fastest thing to do is to
use shadow memory and copy. Nicolas said the first question that would
come back is if this is something userspace should decide. Yunxiang
asked if we're looking at designing a kernel interface to convey
constraints. Nicolas said some kernel APIs expose some constraints. For
instance, V4L2 allows discovering some stride constraints. Yunxiang
thinks constraints won't be able to scale. Capabilities would be
better. The difference is that the hardware device can have lots of
constraints, but could more easily say some of the things it can do.
Adding a new capability later wouldn't break anything, but adding a new
constraint would require all components to understand it. James said the
latest proposal handled constraints in a forward-compatible way with
versioned constraints. Demi prefers the capabilities approach. Yunxiang
said devices could expose which memory they can access.
Nicolas said that in GStreamer, the current approach is trial and error.
Capabilities would reduce trial and error. Everybody has hacks on top.
James is instead bringing a formula that solves this. As soon as each of
the nodes involved can be identified (which GStreamer can do now), we
can calculate and combine but can't sustain adding capabilities. Liviu
said GStreamer could tell which capabilities it supports, and push
hardware vendors to comply with that. If the hardware is not compatible,
GStreamer would provide converters. Nicolas said buffer sharing is a
trade-off. We will find things that won't break it but will make it
slower. Demi said that cache-incoherent DMA will usually have higher
bandwidth than cache-coherent DMA, unless all the components share the
same cache.
Discussion followed on capabilities vs. constraints. It's partly a
vocabulary matter. The first proposal from James was using the word
capability. Constraints in the last proposal describe what a device can
do.
Nicolas has only stride and number of planes as constraints in
GStreamer. James' mechanism has support for much more. Yunxiang asked if
we can include the pixel format as constraints. James said we first have
to resolve formats and modifiers, and then handle allocation
constraints. Yunxiang thinks the allocator should also take care of the
format. Nicolas said that when using a tiling modifier, we already
reduce the scope of incompatibilities. When using e.g. an Intel tiling
modifier, there's a bigger guarantee that the devices that support it
will be able to inter-operate as they're designed for that purpose.
Demi said it should be possible to validate parameters for buffers
without being Mesa. The format modifiers are way too opaque. James said
that for each format we need to query possible modifiers, and for each
format+modifier we need to query constraints. Modifiers are mostly
considered opaque to applications, constraints are semi-opaque, and
formats are not opaque.
The problem that remains hard to James is locality: how do we say
"local to a device"? There's no serializable representation
of a device. Nicolas says that as soon as it stays in the graphics
stack, the stack will hide the problem from applications. A sysfs path
could be used on Linux to identify a device.
James asked how an application would be allowed to allocate buffers from
e.g. a DRI device, if the allocator library told it that the allocation
needs to be done there. This starts becoming a permission issue or a
logind issue. Passing fds could help, but when sitting on top of GL, we
have no way to get an fd. Nicolas said Vulkan should be OK, but GL is
dead. Nicolas asked why it would be useful to get access to fds. James
said that at the bottom of the problem we have the question of which fd
we send an ioctl to in order to allocate memory.
We could start small to have something and break the chicken and egg
problem. James thought about using GBM to start with, but this isn't
good for GStreamer. James and Nicolas thought that kernel drivers would
register dmabuf heaps. One issue with dmabuf heaps today is that
allocation isn't tracked, so we bypass cgroups. DRM and V4L2 have the
same issue, offering an infinite amount of memory. logind has a
responsibility of not making it worse, so didn't allow access to dmabuf
heaps. The problem needs to be fixed in the kernel first, adding
accounting. This would then be an incentive to abandon device-specific
allocation APIs, as memory accounting would come for free when using
dmabuf heaps. Nicolas wonders why the obvious path forward of
implementing heaps for devices that have specific constraints didn't
happen. James thinks people may just have been too busy.
Demi brought up the VFIO case. The hardware device may not be managed by
the kernel, it may be managed by userspace, or passed to a guest.
Capabilities for those devices can't come from a kernel device. James
said the key is communicating the constraints in a serializable way.
Allocation would involve passing dmabufs between guests and hosts.
Allocation from DMA heaps and from subsystem-specific APIs will coexist
for a long time. This needs to be abstracted in a userspace library, so
that the underlying allocators can evolve.
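As a concrete reference point for the dmabuf-heaps path discussed
above, the existing uAPI is a single allocation ioctl on a per-heap
character device; a minimal userspace sketch (heap name and length
are illustrative):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/dma-heap.h>

    int heap = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
    struct dma_heap_allocation_data alloc = {
            .len = 4096,
            .fd_flags = O_RDWR | O_CLOEXEC,
    };
    ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);
    /* on success, alloc.fd is a dma-buf fd usable across devices */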
Demi asked if GBM is needed when using Vulkan. Internally in mesa,
everything uses GBM for allocation. James said it's a good argument to
ask why not use Vulkan only. If we decide to use GBM to bootstrap this
evolution, and GBM allocates for a specific device, then it doesn't seem
to be a good fit. We don't want to add support for system memory
allocation to GBM. Erico said that there's a use case for GBM even when
using Vulkan when there's a separate display device that has no Vulkan
implementation.
We need an iterative approach. For instance a simple useful improvement
would be to expose stride constraints on DRM devices.
Conclusions
-----------
- We need some library (existing or new one)
- We need some API
- We need someone to bootstrap this
- We need a more iterative approach
Action points
-------------
- Add memory accounting to DMA heaps
- Push drivers to implement their own heaps to replace subsystem allocation APIs
--
Regards,
Laurent Pinchart
Hi,
On Sat, 5 Oct 2024 at 23:40, Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> These warnings/errors are reported by checkpatch.
> Fix them with minor changes to make it clean.
> No other functional changes.
>
> WARNING: Block comments use * on subsequent lines
> + /* only support discovering the end of the buffer,
> + but also allow SEEK_SET to maintain the idiomatic
>
> WARNING: Block comments use a trailing */ on a separate line
> + SEEK_END(0), SEEK_CUR(0) pattern */
>
> WARNING: Block comments use a trailing */ on a separate line
> + * before passing the sgt back to the exporter. */
>
> ERROR: "foo * bar" should be "foo *bar"
> +static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
>
> WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
> + d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
>
> total: 1 errors, 4 warnings, 1746 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
>
> ---
> Changes in V1 suggested by Sumit Semwal:
> Change commit title, and mention exact reason of fix in commit log.
> V1: https://lore.kernel.org/all/CAOuPNLg1=YCUFXW-76A_gZm_PE1MFSugNvg3dEdkfujXV_…
> ---
> drivers/dma-buf/dma-buf.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8892bc701a66..2e63d50e46d3 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -176,8 +176,9 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
> dmabuf = file->private_data;
>
> /* only support discovering the end of the buffer,
> - but also allow SEEK_SET to maintain the idiomatic
> - SEEK_END(0), SEEK_CUR(0) pattern */
> + * but also allow SEEK_SET to maintain the idiomatic
> + * SEEK_END(0), SEEK_CUR(0) pattern.
> + */
> if (whence == SEEK_END)
> base = dmabuf->size;
> else if (whence == SEEK_SET)
> @@ -782,13 +783,14 @@ static void mangle_sg_table(struct sg_table *sg_table)
> /* To catch abuse of the underlying struct page by importers mix
> * up the bits, but take care to preserve the low SG_ bits to
> * not corrupt the sgt. The mixing is undone in __unmap_dma_buf
> - * before passing the sgt back to the exporter. */
> + * before passing the sgt back to the exporter.
> + */
> for_each_sgtable_sg(sg_table, sg, i)
> sg->page_link ^= ~0xffUL;
> #endif
>
> }
> -static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
> +static struct sg_table *__map_dma_buf(struct dma_buf_attachment *attach,
> enum dma_data_direction direction)
> {
> struct sg_table *sg_table;
> @@ -1694,7 +1696,7 @@ static int dma_buf_init_debugfs(void)
>
> dma_buf_debugfs_dir = d;
>
> - d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
> + d = debugfs_create_file("bufinfo", 0444, dma_buf_debugfs_dir,
> NULL, &dma_buf_debug_fops);
> if (IS_ERR(d)) {
> pr_debug("dma_buf: debugfs: failed to create node bufinfo\n");
> --
Pushed V2 here. Any further comments on this?
Thanks,
Pintu
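For context, the idiom that fixed comment refers to is the usual
userspace way of discovering a dma-buf's size; a minimal sketch,
assuming an already-open dmabuf_fd (per the code above, only SEEK_SET
and SEEK_END with a zero offset are meaningful here):

    off_t size = lseek(dmabuf_fd, 0, SEEK_END);  /* buffer size */
    lseek(dmabuf_fd, 0, SEEK_SET);               /* rewind to start */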
On Tue, Oct 1, 2024 at 7:51 PM Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> Use of kmap_atomic/kunmap_atomic is deprecated, use
> kmap_local_page/kunmap_local instead.
>
> This is reported by checkpatch.
> Also fix repeated word issue.
>
> WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
> + void *vaddr = kmap_atomic(page);
>
> WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead
> + kunmap_atomic(vaddr);
>
> WARNING: Possible repeated word: 'by'
> + * has been killed by by SIGKILL
>
> total: 0 errors, 3 warnings, 405 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
Reviewed-by: T.J. Mercier <tjmercier(a)google.com>
The Android kernels have been doing this for over a year, so should be
pretty well tested at this point:
https://r.android.com/c/kernel/common/+/2500840
> ---
> drivers/dma-buf/heaps/cma_heap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
> index 93be88b805fe..8c55431cc16c 100644
> --- a/drivers/dma-buf/heaps/cma_heap.c
> +++ b/drivers/dma-buf/heaps/cma_heap.c
> @@ -309,13 +309,13 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap,
> struct page *page = cma_pages;
>
> while (nr_clear_pages > 0) {
> - void *vaddr = kmap_atomic(page);
> + void *vaddr = kmap_local_page(page);
>
> memset(vaddr, 0, PAGE_SIZE);
> - kunmap_atomic(vaddr);
> + kunmap_local(vaddr);
> /*
> * Avoid wasting time zeroing memory if the process
> - * has been killed by by SIGKILL
> + * has been killed by SIGKILL.
> */
> if (fatal_signal_pending(current))
> goto free_cma;
> --
> 2.17.1
>
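For anyone making the same conversion elsewhere, the semantic
difference is worth noting: kmap_local_page() mappings are thread- and
CPU-local but, unlike kmap_atomic(), do not disable preemption or
pagefaults, and nested mappings must be released in reverse order. A
minimal sketch of the resulting pattern:

    #include <linux/highmem.h>

    static void zero_page(struct page *page)
    {
            void *vaddr = kmap_local_page(page);

            memset(vaddr, 0, PAGE_SIZE);
            kunmap_local(vaddr);
    }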
Hello Pintu,
On Tue, 1 Oct 2024 at 23:16, Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> Symbolic permissions are not preferred, instead use the octal.
> Also, fix other warnings/errors as well for cleanup.
>
> WARNING: Block comments use * on subsequent lines
> + /* only support discovering the end of the buffer,
> + but also allow SEEK_SET to maintain the idiomatic
>
> WARNING: Block comments use a trailing */ on a separate line
> + SEEK_END(0), SEEK_CUR(0) pattern */
>
> WARNING: Block comments use a trailing */ on a separate line
> + * before passing the sgt back to the exporter. */
>
> ERROR: "foo * bar" should be "foo *bar"
> +static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
>
> WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
> + d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
>
> total: 1 errors, 4 warnings, 1746 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
Thanks for this patch - could you please also mention in the commit
log how you found this? It looks like you ran checkpatch, but it's
not clear from the commit log.
Since this patch does multiple things related to checkpatch warnings
(changing S_IRUGO to 0444, correcting comments, and fixing a function
declaration), can I please ask you to change the commit title to also
reflect that?
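(For what it's worth, a typical way to generate such a log is
something like the following; the patch file name is a placeholder:)

    ./scripts/checkpatch.pl --strict 0001-dma-buf-fix-checkpatch-warnings.patch
    ./scripts/checkpatch.pl --strict -g HEAD    # or check the top commit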
> ---
> drivers/dma-buf/dma-buf.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8892bc701a66..2e63d50e46d3 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -176,8 +176,9 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
> dmabuf = file->private_data;
>
> /* only support discovering the end of the buffer,
> - but also allow SEEK_SET to maintain the idiomatic
> - SEEK_END(0), SEEK_CUR(0) pattern */
> + * but also allow SEEK_SET to maintain the idiomatic
> + * SEEK_END(0), SEEK_CUR(0) pattern.
> + */
> if (whence == SEEK_END)
> base = dmabuf->size;
> else if (whence == SEEK_SET)
> @@ -782,13 +783,14 @@ static void mangle_sg_table(struct sg_table *sg_table)
> /* To catch abuse of the underlying struct page by importers mix
> * up the bits, but take care to preserve the low SG_ bits to
> * not corrupt the sgt. The mixing is undone in __unmap_dma_buf
> - * before passing the sgt back to the exporter. */
> + * before passing the sgt back to the exporter.
> + */
> for_each_sgtable_sg(sg_table, sg, i)
> sg->page_link ^= ~0xffUL;
> #endif
>
> }
> -static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
> +static struct sg_table *__map_dma_buf(struct dma_buf_attachment *attach,
> enum dma_data_direction direction)
> {
> struct sg_table *sg_table;
> @@ -1694,7 +1696,7 @@ static int dma_buf_init_debugfs(void)
>
> dma_buf_debugfs_dir = d;
>
> - d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
> + d = debugfs_create_file("bufinfo", 0444, dma_buf_debugfs_dir,
> NULL, &dma_buf_debug_fops);
> if (IS_ERR(d)) {
> pr_debug("dma_buf: debugfs: failed to create node bufinfo\n");
> --
> 2.17.1
>
Best,
Sumit.
On 02/10/2024 09:38, Boris Brezillon wrote:
> On Tue, 24 Sep 2024 00:06:21 +0100
> Adrián Larumbe <adrian.larumbe(a)collabora.com> wrote:
>
>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>> + u32 cs_ringbuf_size)
>> +{
>> + u32 min_profiled_job_instrs = U32_MAX;
>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>> +
>> + /*
>> + * We want to calculate the minimum size of a profiled job's CS,
>> + * because since they need additional instructions for the sampling
>> + * of performance metrics, they might take up further slots in
>> + * the queue's ringbuffer. This means we might not need as many job
>> + * slots for keeping track of their profiling information. What we
>> + * need is the maximum number of slots we should allocate to this end,
>> + * which matches the maximum number of profiled jobs we can place
>> + * simultaneously in the queue's ring buffer.
>> + * That has to be calculated separately for every single job profiling
>> + * flag, but not in the case job profiling is disabled, since unprofiled
>> + * jobs don't need to keep track of this at all.
>> + */
>> + for (u32 i = 0; i < last_flag; i++) {
>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>
> I'll get rid of this check when applying, as suggested by Steve. Steve,
> with this modification do you want me to add your R-b?
Yes, please do.
Thanks,
Steve
> BTW, I've also fixed a bunch of checkpatch errors/warnings, so you
> might want to run checkpatch --strict next time.
>
>> + min_profiled_job_instrs =
>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>> + }
>> +
>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>> +}
Hi,
On 30.09.24 at 21:38, Zichen Xie wrote:
> Dear Linux Developers for DMA BUFFER SHARING FRAMEWORK,
>
> We are curious about the function 'dma_resv_get_fences' here:
> https://elixir.bootlin.com/linux/v6.11/source/drivers/dma-buf/dma-resv.c#L5…,
> and the logic below:
> ```
> dma_resv_for_each_fence_unlocked(&cursor, fence) {
>
> if (dma_resv_iter_is_restarted(&cursor)) {
> struct dma_fence **new_fences;
> unsigned int count;
>
> while (*num_fences)
> dma_fence_put((*fences)[--(*num_fences)]);
>
> count = cursor.num_fences + 1;
>
> /* Eventually re-allocate the array */
> new_fences = krealloc_array(*fences, count,
> sizeof(void *),
> GFP_KERNEL);
> if (count && !new_fences) {
> kfree(*fences);
> *fences = NULL;
> *num_fences = 0;
> dma_resv_iter_end(&cursor);
> return -ENOMEM;
> }
> *fences = new_fences;
> }
>
> (*fences)[(*num_fences)++] = dma_fence_get(fence);
> }
> ```
> The existing check 'if (count && !new_fences)' may fail if count==0,
> and 'krealloc_array' with count==0 is undefined behavior. The
> realloc may fail and return a NULL pointer, leading to a NULL pointer
> dereference in '(*fences)[(*num_fences)++] = dma_fence_get(fence);'
You already answered the question yourself "count = cursor.num_fences +
1;". So count can never be 0.
What could theoretically be possible is that num_fences overflows, but
this value isn't userspace controllable and we would run into memory
allocation failures long before that happened.
But we could potentially remove this whole handling since if there are
no fences in the dma_resv object we don't enter the loop in the first place.
Regards,
Christian.
>
> Please correct us if we miss some key prerequisites for this function!
> Thank you very much!
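For reference, a caller-side sketch of the API under discussion: on
success the caller owns one reference per returned fence plus the
array itself, so both must be released (error handling trimmed):

    #include <linux/dma-resv.h>
    #include <linux/slab.h>

    struct dma_fence **fences;
    unsigned int i, num_fences;
    int ret;

    ret = dma_resv_get_fences(resv, DMA_RESV_USAGE_READ,
                              &num_fences, &fences);
    if (ret)
            return ret;

    /* ... inspect fences[0..num_fences - 1] ... */

    for (i = 0; i < num_fences; i++)
            dma_fence_put(fences[i]);
    kfree(fences);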
On 27/09/2024 15:53, Adrián Larumbe wrote:
> On 25.09.2024 10:56, Steven Price wrote:
>> On 23/09/2024 21:43, Adrián Larumbe wrote:
>>> Hi Steve,
>>>
>>> On 23.09.2024 09:55, Steven Price wrote:
>>>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>>>> Hi Steve, thanks for the review.
>>>>
>>>> Hi Adrián,
>>>>
>>>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>>>> function further down below.
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>>> .free_job = queue_free_job,
>>>>>>> };
>>>>>>>
>>>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>>>> + u32 cs_ringbuf_size)
>>>>>>> +{
>>>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * We want to calculate the minimum size of a profiled job's CS,
>>>>>>> + * because since they need additional instructions for the sampling
>>>>>>> + * of performance metrics, they might take up further slots in
>>>>>>> + * the queue's ringbuffer. This means we might not need as many job
>>>>>>> + * slots for keeping track of their profiling information. What we
>>>>>>> + * need is the maximum number of slots we should allocate to this end,
>>>>>>> + * which matches the maximum number of profiled jobs we can place
>>>>>>> + * simultaneously in the queue's ring buffer.
>>>>>>> + * That has to be calculated separately for every single job profiling
>>>>>>> + * flag, but not in the case job profiling is disabled, since unprofiled
>>>>>>> + * jobs don't need to keep track of this at all.
>>>>>>> + */
>>>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>>>> + min_profiled_job_instrs =
>>>>>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>>>> + }
>>>>>>> +
>>>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>>>> +}
>>>>>>
>>>>>> I may be missing something, but is there a situation where this is
>>>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>>>> can only add extra instructions to the no-flags case - whereas this
>>>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>>>
>>>>>> Steve
>>>>>
>>>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>>>> need one that would be able to accommodate as many slots as the maximum number of
>>>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>>>> once in the ring buffer.
>>>>>
>>>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>>>> size of the CS for every single metric enabled in isolation, since enabling more
>>>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>>>> in the queue's ring buffer.
>>>>>
>>>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>>>> instructions we need for a normal job with no profiled features enabled, which
>>>>> would always require fewer instructions than profiled ones, and therefore more
>>>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>>>> number.
>>>>>
>>>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>>>> be wasted, since profiled jobs always require more slots because they hold more
>>>>> instructions, so fewer profiling slots in said kernel BO.
>>>>>
>>>>> The value of this approach will eventually manifest if we decide to keep track of
>>>>> more profiling metrics, since this code won't have to change at all, other than
>>>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>>>
>>>> Thanks for the detailed explanation. I think what I was missing is that
>>>> the loop is checking each bit flag independently and *not* checking
>>>> calc_job_credits(0).
>>>>
>>>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>>>> confused me - that should be completely redundant. Or at least we need
>>>> something more intelligent if we have profiling bits which are not
>>>> mutually compatible.
>>>
>>> I thought of an alternative that would only test bits that are actually part of
>>> the mask:
>>>
>>> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> u32 cs_ringbuf_size)
>>> {
>>> u32 min_profiled_job_instrs = U32_MAX;
>>> u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>>>
>>> while (profiling_mask) {
>>> u32 i = ffs(profiling_mask) - 1;
>>> profiling_mask &= ~BIT(i);
>>> min_profiled_job_instrs =
>>> min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> }
>>>
>>> return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> }
>>>
>>> However, I don't think this would be more efficient, because ffs() is probably
>>> fetching the first set bit by performing register shifts, and I guess this would
>>> take somewhat longer than iterating over every single bit from the last one,
>>> even if also matching them against the whole mask, just in case in future
>>> additions of performance metrics we decide to leave some of the lower
>>> significance bits untouched.
>>
>> Efficiency isn't very important here - we're not on a fast path, so it's
>> more about ensuring the code is readable. I don't think the above is
>> more readable then the original for loop.
>>
>>> Regarding your question about mutual compatibility, I don't think that is an
>>> issue here, because we're testing bits in isolation. If in the future we find
>>> out that some of the values we're profiling cannot be sampled at once, we can
>>> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
>>> profiling masks.
>>
>> My comment about compatibility is because in the original above you were
>> calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
>>
>>> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>
>> then looping between 0 and that bit:
>>
>>> for (u32 i = 0; i < last_flag; i++) {
>>
>> So the test:
>>
>>> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>
>> would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
>> that it set. The only reason I can think for that to be true in the
>> future is if there is some sort of incompatibility - e.g. maybe there's
>> an old and new way of doing some form of profiling with the old way
>> being kept for backwards compatibility. But I suspect if/when that is
>> required we'll need to revisit this function anyway. So that 'if'
>> statement seems completely redundant (it's trivially always true).
>
> I think you're right about this. Would you be fine with the rest of the patch
> as it is in revision 8 if I also deleted this bitmask check?
Yes the rest of it looks fine.
Thanks,
Steve
>> Steve
>>
>>>> I'm also not entirely sure that the amount of RAM saved is significant,
>>>> but you've already written the code so we might as well have the saving ;)
>>>
>>> I think this was more evident before Boris suggested we reduce the basic slot
>>> size to that of a single cache line, because then the minimum profiled job
>>> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
>>> case, we would need a BO half as big for holding the sampled data (in case the
>>> smallest profiled job's CS would extend over the 16-instruction boundary).
>>> I still think this is a good idea so that in the future we don't need to worry
>>> about adjusting the code that deals with preparing the right boilerplate CS,
>>> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Adrian
>>>>>
>>>>>>> +
>>>>>>> static struct panthor_queue *
>>>>>>> group_create_queue(struct panthor_group *group,
>>>>>>> const struct drm_panthor_queue_create *args)
>>>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>>> goto err_free_queue;
>>>>>>> }
>>>>>>>
>>>>>>> + queue->profiling.slot_count =
>>>>>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>>>> +
>>>>>>> + queue->profiling.slots =
>>>>>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>>>> + queue->profiling.slot_count *
>>>>>>> + sizeof(struct panthor_job_profiling_data),
>>>>>>> + DRM_PANTHOR_BO_NO_MMAP,
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>>>>>> +
>>>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>>>> + ret = PTR_ERR(queue->profiling.slots);
>>>>>>> + goto err_free_queue;
>>>>>>> + }
>>>>>>> +
>>>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>>>> + if (ret)
>>>>>>> + goto err_free_queue;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Credit limit argument tells us the total number of instructions
>>>>>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>>>>>> + * twice as many as others, depending on their profiling status.
>>>>>>> + */
>>>>>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>>> group->ptdev->scheduler->wq, 1,
>>>>>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>>>> + args->ringbuf_size / sizeof(u64),
>>>>>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>>> group->ptdev->reset.wq,
>>>>>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> {
>>>>>>> struct panthor_group_pool *gpool = pfile->groups;
>>>>>>> struct panthor_job *job;
>>>>>>> + u32 credits;
>>>>>>> int ret;
>>>>>>>
>>>>>>> if (qsubmit->pad)
>>>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>>>> + if (credits == 0) {
>>>>>>> + ret = -EINVAL;
>>>>>>> + goto err_put_job;
>>>>>>> + }
>>>>>>> +
>>>>>>> ret = drm_sched_job_init(&job->base,
>>>>>>> &job->group->queues[job->queue_idx]->entity,
>>>>>>> - 1, job->group);
>>>>>>> + credits, job->group);
>>>>>>> if (ret)
>>>>>>> goto err_put_job;
>>>>>>>
>>>>>
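With the bitmask check deleted as agreed above, the helper would
presumably reduce to the following (a sketch of that simplification,
not the posted v9; it assumes PANTHOR_DEVICE_PROFILING_ALL has no
gaps in its bit range):

    static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
                                                u32 cs_ringbuf_size)
    {
            u32 min_profiled_job_instrs = U32_MAX;
            u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);

            for (u32 i = 0; i < last_flag; i++)
                    min_profiled_job_instrs =
                            min(min_profiled_job_instrs, calc_job_credits(BIT(i)));

            return DIV_ROUND_UP(cs_ringbuf_size,
                                min_profiled_job_instrs * sizeof(u64));
    }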
Hi,
This patch set is based on top of Yong Wu's restricted heap patch set [1].
It's also a continuation of Olivier's "Add dma-buf secure-heap" patch set [2].
The Linaro restricted heap uses genalloc in the kernel to manage the heap
carvout. This is a difference from the Mediatek restricted heap which
relies on the secure world to manage the carveout.
I've tried to address the comments on [2], but [1] introduces changes, so I'm
afraid I've had to skip some comments.
This can be tested on QEMU with the following steps:
repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
-b prototype/sdp-v1
repo sync -j8
cd build
make toolchains -j4
make all -j$(nproc)
make run-only
# login and at the prompt:
xtest --sdp-basic
https://optee.readthedocs.io/en/latest/building/prerequisites.html
lists the dependencies needed to build the above.
The tests are pretty basic, mostly checking that a Trusted Application in
the secure world can access and manipulate the memory.
Cheers,
Jens
[1] https://lore.kernel.org/dri-devel/20240515112308.10171-1-yong.wu@mediatek.c…
[2] https://lore.kernel.org/lkml/20220805135330.970-1-olivier.masse@nxp.com/
Changes since Olivier's post [2]:
* Based on Yong Wu's post [1] where much of dma-buf handling is done in
the generic restricted heap
* Simplifications and cleanup
* New commit message for "dma-buf: heaps: add Linaro restricted dmabuf heap
support"
* Replaced the word "secure" with "restricted" where applicable
Etienne Carriere (1):
tee: new ioctl to register a tee_shm from a dmabuf file descriptor
Jens Wiklander (2):
dma-buf: heaps: restricted_heap: add no_map attribute
dma-buf: heaps: add Linaro restricted dmabuf heap support
Olivier Masse (1):
dt-bindings: reserved-memory: add linaro,restricted-heap
.../linaro,restricted-heap.yaml | 56 ++++++
drivers/dma-buf/heaps/Kconfig | 10 ++
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/restricted_heap.c | 17 +-
drivers/dma-buf/heaps/restricted_heap.h | 2 +
.../dma-buf/heaps/restricted_heap_linaro.c | 165 ++++++++++++++++++
drivers/tee/tee_core.c | 38 ++++
drivers/tee/tee_shm.c | 104 ++++++++++-
include/linux/tee_drv.h | 11 ++
include/uapi/linux/tee.h | 29 +++
10 files changed, 426 insertions(+), 7 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/linaro,restricted-heap.yaml
create mode 100644 drivers/dma-buf/heaps/restricted_heap_linaro.c
--
2.34.1
Hi Adrián,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.11 next-20240927]
[cannot apply to drm-misc/drm-misc-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-i…
base: linus/master
patch link: https://lore.kernel.org/r/20240923230912.2207320-4-adrian.larumbe%40collabo…
patch subject: [PATCH v8 3/5] drm/panthor: add DRM fdinfo support
config: arm-randconfig-002-20240929 (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)
compiler: arm-linux-gnueabi-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409291048.zLqDeqpO-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/linux/math64.h:6,
from include/linux/time.h:6,
from include/linux/stat.h:19,
from include/linux/module.h:13,
from drivers/gpu/drm/panthor/panthor_drv.c:7:
drivers/gpu/drm/panthor/panthor_drv.c: In function 'panthor_gpu_show_fdinfo':
>> drivers/gpu/drm/panthor/panthor_drv.c:1389:45: error: implicit declaration of function 'arch_timer_get_cntfrq' [-Wimplicit-function-declaration]
1389 | arch_timer_get_cntfrq()));
| ^~~~~~~~~~~~~~~~~~~~~
include/linux/math.h:40:39: note: in definition of macro 'DIV_ROUND_DOWN_ULL'
40 | ({ unsigned long long _tmp = (ll); do_div(_tmp, d); _tmp; })
| ^~
drivers/gpu/drm/panthor/panthor_drv.c:1388:28: note: in expansion of macro 'DIV_ROUND_UP_ULL'
1388 | DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
| ^~~~~~~~~~~~~~~~
vim +/arch_timer_get_cntfrq +1389 drivers/gpu/drm/panthor/panthor_drv.c
1377
1378 static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
1379 struct panthor_file *pfile,
1380 struct drm_printer *p)
1381 {
1382 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_ALL)
1383 panthor_fdinfo_gather_group_samples(pfile);
1384
1385 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP) {
1386 #ifdef CONFIG_ARM_ARCH_TIMER
1387 drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
1388 DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
> 1389 arch_timer_get_cntfrq()));
1390 #endif
1391 }
1392 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
1393 drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
1394
1395 drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
1396 drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
1397 }
1398
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
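The likely fix for the build error above (an assumption on my part,
not a posted follow-up) is to include the header that declares
arch_timer_get_cntfrq(); it also provides a stub returning 0 when
CONFIG_ARM_ARCH_TIMER is disabled:

    #include <clocksource/arm_arch_timer.h>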
On 23/09/2024 21:43, Adrián Larumbe wrote:
> Hi Steve,
>
> On 23.09.2024 09:55, Steven Price wrote:
>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>> Hi Steve, thanks for the review.
>>
>> Hi Adrián,
>>
>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>> function further down below.
>>>
>>
>> [...]
>>
>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>> .free_job = queue_free_job,
>>>>> };
>>>>>
>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>> + u32 cs_ringbuf_size)
>>>>> +{
>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>> +
>>>>> + /*
>>>>> + * We want to calculate the minimum size of a profiled job's CS,
>>>>> + * because since they need additional instructions for the sampling
>>>>> + * of performance metrics, they might take up further slots in
>>>>> + * the queue's ringbuffer. This means we might not need as many job
>>>>> + * slots for keeping track of their profiling information. What we
>>>>> + * need is the maximum number of slots we should allocate to this end,
>>>>> + * which matches the maximum number of profiled jobs we can place
>>>>> + * simultaneously in the queue's ring buffer.
>>>>> + * That has to be calculated separately for every single job profiling
>>>>> + * flag, but not in the case job profiling is disabled, since unprofiled
>>>>> + * jobs don't need to keep track of this at all.
>>>>> + */
>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>> + min_profiled_job_instrs =
>>>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>> + }
>>>>> +
>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>> +}
>>>>
>>>> I may be missing something, but is there a situation where this is
>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>> can only add extra instructions to the no-flags case - whereas this
>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>
>>>> Steve
>>>
>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>> need one that would be able to accommodate as many slots as the maximum number of
>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>> once in the ring buffer.
>>>
>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>> size of the CS for every single metric enabled in isolation, since enabling more
>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>> in the queue's ring buffer.
>>>
>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>> instructions we need for a normal job with no profiled features enabled, which
>>> would always require fewer instructions than profiled ones, and therefore more
>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>> number.
>>>
>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>> be wasted, since profiled jobs always require more slots because they hold more
>>> instructions, so fewer profiling slots in said kernel BO.
>>>
>>> The value of this approach will eventually manifest if we decide to keep track of
>>> more profiling metrics, since this code won't have to change at all, other than
>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>
>> Thanks for the detailed explanation. I think what I was missing is that
>> the loop is checking each bit flag independently and *not* checking
>> calc_job_credits(0).
>>
>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>> confused me - that should be completely redundant. Or at least we need
>> something more intelligent if we have profiling bits which are not
>> mutually compatible.
>
> I thought of an alternative that would only test bits that are actually part of
> the mask:
>
> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> u32 cs_ringbuf_size)
> {
> u32 min_profiled_job_instrs = U32_MAX;
> u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>
> while (profiling_mask) {
> u32 i = ffs(profiling_mask) - 1;
> profiling_mask &= ~BIT(i);
> min_profiled_job_instrs =
> min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> }
>
> return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> }
>
> However, I don't think this would be more efficient, because ffs() is probably
> fetching the first set bit by performing register shifts, and I guess this would
> take somewhat longer than iterating over every single bit from the last one,
> even if also matching them against the whole mask, just in case in future
> additions of performance metrics we decide to leave some of the lower
> significance bits untouched.
Efficiency isn't very important here - we're not on a fast path, so it's
more about ensuring the code is readable. I don't think the above is
more readable then the original for loop.
> Regarding your question about mutual compatibility, I don't think that is an
> issue here, because we're testing bits in isolation. If in the future we find
> out that some of the values we're profiling cannot be sampled at once, we can
> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
> profiling masks.
My comment about compatibility is because in the original above you were
calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
then looping between 0 and that bit:
> for (u32 i = 0; i < last_flag; i++) {
So the test:
> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
that it set. The only reason I can think for that to be true in the
future is if there is some sort of incompatibility - e.g. maybe there's
an old and new way of doing some form of profiling with the old way
being kept for backwards compatibility. But I suspect if/when that is
required we'll need to revisit this function anyway. So that 'if'
statement seems completely redundant (it's trivially always true).
Steve
>> I'm also not entirely sure that the amount of RAM saved is significant,
>> but you've already written the code so we might as well have the saving ;)
>
> I think this was more evident before Boris suggested we reduce the basic slot
> size to that of a single cache line, because then the minimum profiled job
> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
> case, we would need a BO half as big for holding the sampled data (in case the
> smallest profiled job's CS would extend over the 16-instruction boundary).
> I still think this is a good idea so that in the future we don't need to worry
> about adjusting the code that deals with preparing the right boilerplate CS,
> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>
>> Thanks,
>> Steve
>>
>>> Regards,
>>> Adrian
>>>
>>>>> +
>>>>> static struct panthor_queue *
>>>>> group_create_queue(struct panthor_group *group,
>>>>> const struct drm_panthor_queue_create *args)
>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>> goto err_free_queue;
>>>>> }
>>>>>
>>>>> + queue->profiling.slot_count =
>>>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>> +
>>>>> + queue->profiling.slots =
>>>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>> + queue->profiling.slot_count *
>>>>> + sizeof(struct panthor_job_profiling_data),
>>>>> + DRM_PANTHOR_BO_NO_MMAP,
>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>>>> +
>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>> + ret = PTR_ERR(queue->profiling.slots);
>>>>> + goto err_free_queue;
>>>>> + }
>>>>> +
>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>> + if (ret)
>>>>> + goto err_free_queue;
>>>>> +
>>>>> + /*
>>>>> + * Credit limit argument tells us the total number of instructions
>>>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>>>> + * twice as many as others, depending on their profiling status.
>>>>> + */
>>>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>> group->ptdev->scheduler->wq, 1,
>>>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>> + args->ringbuf_size / sizeof(u64),
>>>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>> group->ptdev->reset.wq,
>>>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>> {
>>>>> struct panthor_group_pool *gpool = pfile->groups;
>>>>> struct panthor_job *job;
>>>>> + u32 credits;
>>>>> int ret;
>>>>>
>>>>> if (qsubmit->pad)
>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>> }
>>>>> }
>>>>>
>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>> + if (credits == 0) {
>>>>> + ret = -EINVAL;
>>>>> + goto err_put_job;
>>>>> + }
>>>>> +
>>>>> ret = drm_sched_job_init(&job->base,
>>>>> &job->group->queues[job->queue_idx]->entity,
>>>>> - 1, job->group);
>>>>> + credits, job->group);
>>>>> if (ret)
>>>>> goto err_put_job;
>>>>>
>>>
>
>
> Adrian Larumbe
Nothing wrong with this, I just didn't have time to double-check it
myself and then forgot about it.
Going to push it to drm-misc-next.
Regards,
Christian.
On 23.09.24 at 11:22, Tommy Chiang wrote:
> Ping.
> Please let me know if I'm doing something wrong.
>
> On Mon, Feb 19, 2024 at 11:00 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>> Kindly ping :)
>>
>> On Fri, Jan 19, 2024 at 11:33 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>>> This patch tries to improve the display of the code listing
>>> on The Linux Kernel documentation website for dma-buf [1].
>>>
>>> Originally, it appears that it was attempting to escape
>>> the '*' character, but it looks like it's not necessary (now),
>>> so we are seeing something like '\*' on the website.
>>>
>>> This patch removes these unnecessary backslashes and adds syntax
>>> highlighting to improve the readability of the code listing.
>>>
>>> [1] https://docs.kernel.org/driver-api/dma-buf.html
>>>
>>> Signed-off-by: Tommy Chiang <ototot(a)chromium.org>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 15 +++++++++------
>>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 8fe5aa67b167..e083a0ab06d7 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1282,10 +1282,12 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>> * vmap interface is introduced. Note that on very old 32-bit architectures
>>> * vmalloc space might be limited and result in vmap calls failing.
>>> *
>>> - * Interfaces::
>>> + * Interfaces:
>>> *
>>> - * void \*dma_buf_vmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> - * void dma_buf_vunmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> + * .. code-block:: c
>>> + *
>>> + * void *dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>> + * void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>> *
>>> * The vmap call can fail if there is no vmap support in the exporter, or if
>>> * it runs out of vmalloc space. Note that the dma-buf layer keeps a reference
>>> @@ -1342,10 +1344,11 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>> * enough, since adding interfaces to intercept pagefaults and allow pte
>>> * shootdowns would increase the complexity quite a bit.
>>> *
>>> - * Interface::
>>> + * Interface:
>>> + *
>>> + * .. code-block:: c
>>> *
>>> - * int dma_buf_mmap(struct dma_buf \*, struct vm_area_struct \*,
>>> - * unsigned long);
>>> + * int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
>>> *
>>> * If the importing subsystem simply provides a special-purpose mmap call to
>>> * set up a mapping in userspace, calling do_mmap with &dma_buf.file will
>>> --
>>> 2.43.0.381.gb435a96ce8-goog
>>>
On 20/09/2024 23:36, Adrián Larumbe wrote:
> Hi Steve, thanks for the review.
Hi Adrián,
> I've applied all of your suggestions for the next patch series revision, so I'll
> only be answering to your question about the calc_profiling_ringbuf_num_slots
> function further down below.
>
[...]
>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>> .free_job = queue_free_job,
>>> };
>>>
>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> + u32 cs_ringbuf_size)
>>> +{
>>> + u32 min_profiled_job_instrs = U32_MAX;
>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>> +
>>> + /*
>>> + * We want to calculate the minimum size of a profiled job's CS,
>>> + * because since they need additional instructions for the sampling
>>> + * of performance metrics, they might take up further slots in
>>> + * the queue's ringbuffer. This means we might not need as many job
>>> + * slots for keeping track of their profiling information. What we
>>> + * need is the maximum number of slots we should allocate to this end,
>>> + * which matches the maximum number of profiled jobs we can place
>>> + * simultaneously in the queue's ring buffer.
>>> + * That has to be calculated separately for every single job profiling
>>> + * flag, but not in the case job profiling is disabled, since unprofiled
>>> + * jobs don't need to keep track of this at all.
>>> + */
>>> + for (u32 i = 0; i < last_flag; i++) {
>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>> + min_profiled_job_instrs =
>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> + }
>>> +
>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> +}
>>
>> I may be missing something, but is there a situation where this is
>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>> can only add extra instructions to the no-flags case - whereas this
>> implies you're thinking that instructions may also be removed (or replaced).
>>
>> Steve
>
> Since we create a separate kernel BO to hold the profiling information slot, we
> need one that would be able to accommodate as many slots as the maximum number of
> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
> profiled jobs always take more instructions than unprofiled ones, then we would
> usually need fewer slots than the number of unprofiled jobs we could insert at
> once in the ring buffer.
>
> Because we represent profiling metrics with a bit mask, then we need to test the
> size of the CS for every single metric enabled in isolation, since enabling more
> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
> in the queue's ring buffer.
>
> In our case, calling calc_job_credits(0) would simply tell us the number of
> instructions we need for a normal job with no profiled features enabled, which
> would always require fewer instructions than profiled ones, and therefore more
> slots in the profiling info kernel BO. But we don't need to keep track of
> profiling numbers for unprofiled jobs, so there's no point in calculating this
> number.
>
> At first I was simply allocating a profiling info kernel BO as big as the number
> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
> be wasted, since profiled jobs always require more slots because they hold more
> instructions, so fewer profiling slots in said kernel BO.
>
> The value of this approach will eventually manifest if we decide to keep track of
> more profiling metrics, since this code won't have to change at all, other than
> adding new profiling flags in the panthor_device_profiling_flags enum.
Thanks for the detailed explanation. I think what I was missing is that
the loop is checking each bit flag independently and *not* checking
calc_job_credits(0).
The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
confused me - that should be completely redundant. Or at least we need
something more intelligent if we have profiling bits which are not
mutually compatible.
I'm also not entirely sure that the amount of RAM saved is significant,
but you've already written the code so we might as well have the saving ;)
Thanks,
Steve
> Regards,
> Adrian
>
>>> +
>>> static struct panthor_queue *
>>> group_create_queue(struct panthor_group *group,
>>> const struct drm_panthor_queue_create *args)
>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>> goto err_free_queue;
>>> }
>>>
>>> + queue->profiling.slot_count =
>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>> +
>>> + queue->profiling.slots =
>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>> + queue->profiling.slot_count *
>>> + sizeof(struct panthor_job_profiling_data),
>>> + DRM_PANTHOR_BO_NO_MMAP,
>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>> +
>>> + if (IS_ERR(queue->profiling.slots)) {
>>> + ret = PTR_ERR(queue->profiling.slots);
>>> + goto err_free_queue;
>>> + }
>>> +
>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>> + if (ret)
>>> + goto err_free_queue;
>>> +
>>> + /*
>>> + * Credit limit argument tells us the total number of instructions
>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>> + * twice as many as others, depending on their profiling status.
>>> + */
>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>> group->ptdev->scheduler->wq, 1,
>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>> + args->ringbuf_size / sizeof(u64),
>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>> group->ptdev->reset.wq,
>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>> {
>>> struct panthor_group_pool *gpool = pfile->groups;
>>> struct panthor_job *job;
>>> + u32 credits;
>>> int ret;
>>>
>>> if (qsubmit->pad)
>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>> }
>>> }
>>>
>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>> + credits = calc_job_credits(job->profiling.mask);
>>> + if (credits == 0) {
>>> + ret = -EINVAL;
>>> + goto err_put_job;
>>> + }
>>> +
>>> ret = drm_sched_job_init(&job->base,
>>> &job->group->queues[job->queue_idx]->entity,
>>> - 1, job->group);
>>> + credits, job->group);
>>> if (ret)
>>> goto err_put_job;
>>>
>
Consider the following call sequence:
/* Upper layer */
dma_fence_begin_signalling();
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
...
The driver might here use a utility that is annotated as intended for the
dma-fence signalling critical path. Now if the upper layer isn't correctly
annotated yet for whatever reason, we end up with
/* Upper layer */
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
We will receive a false lockdep locking order violation notification from
dma_fence_begin_signalling(). However, entering a dma-fence signalling
critical section itself doesn't block and cannot cause a deadlock.
So use a successful read_trylock() annotation instead for
dma_fence_begin_signalling(). That will make sure that the locking order
is correctly registered in the first case, and doesn't register any
locking order in the second case.
The alternative is of course to make sure that the "Upper layer" is always
correctly annotated. But experience shows that's not easily achievable
in all cases.
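For reference, lock_acquire()'s signature (from <linux/lockdep.h>) is:

    void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
                      int trylock, int read, int check,
                      struct lockdep_map *nest_lock, unsigned long ip);

so flipping the third argument from 0 to 1 below turns the annotation
from a blocking read lock into a successful read trylock, which
registers no locking order of its own.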
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/dma-buf/dma-fence.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..17f632768ef9 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -308,8 +308,8 @@ bool dma_fence_begin_signalling(void)
if (in_atomic())
return true;
- /* ... and non-recursive readlock */
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+ /* ... and non-recursive successful read_trylock */
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _RET_IP_);
return false;
}
@@ -340,7 +340,7 @@ void __dma_fence_might_wait(void)
lock_map_acquire(&dma_fence_lockdep_map);
lock_map_release(&dma_fence_lockdep_map);
if (tmp)
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _THIS_IP_);
}
#endif
--
2.39.2