On Tue, Oct 1, 2024 at 7:51 PM Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> Use of kmap_atomic/kunmap_atomic is deprecated; use
> kmap_local_page/kunmap_local instead.
>
> This is reported by checkpatch.
> Also fix a repeated-word issue.
>
> WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
> + void *vaddr = kmap_atomic(page);
>
> WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead
> + kunmap_atomic(vaddr);
>
> WARNING: Possible repeated word: 'by'
> + * has been killed by by SIGKILL
>
> total: 0 errors, 3 warnings, 405 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
Reviewed-by: T.J. Mercier <tjmercier(a)google.com>
The Android kernels have been doing this for over a year, so it should
be pretty well tested at this point:
https://r.android.com/c/kernel/common/+/2500840
> ---
> drivers/dma-buf/heaps/cma_heap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
> index 93be88b805fe..8c55431cc16c 100644
> --- a/drivers/dma-buf/heaps/cma_heap.c
> +++ b/drivers/dma-buf/heaps/cma_heap.c
> @@ -309,13 +309,13 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap,
> struct page *page = cma_pages;
>
> while (nr_clear_pages > 0) {
> - void *vaddr = kmap_atomic(page);
> + void *vaddr = kmap_local_page(page);
>
> memset(vaddr, 0, PAGE_SIZE);
> - kunmap_atomic(vaddr);
> + kunmap_local(vaddr);
> /*
> * Avoid wasting time zeroing memory if the process
> - * has been killed by by SIGKILL
> + * has been killed by SIGKILL.
> */
> if (fatal_signal_pending(current))
> goto free_cma;
> --
> 2.17.1
>
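For context on the API change above: kmap_local_page() creates a
CPU-local mapping that, unlike kmap_atomic(), does not disable
preemption or page faults, and mappings must be released in reverse
order of acquisition. A minimal sketch of the pattern (illustrative
only, not the exact driver code; it assumes the pages are physically
contiguous, as in the CMA case):

static void zero_pages(struct page *page, unsigned long nr_pages)
{
	/* Map, zero and unmap one page at a time. */
	while (nr_pages--) {
		void *vaddr = kmap_local_page(page);

		memset(vaddr, 0, PAGE_SIZE);
		kunmap_local(vaddr);
		page++;
	}
}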
Hello Pintu,
On Tue, 1 Oct 2024 at 23:16, Pintu Kumar <quic_pintu(a)quicinc.com> wrote:
>
> Symbolic permissions are not preferred; use octal permissions instead.
> Also fix other warnings/errors for cleanup.
>
> WARNING: Block comments use * on subsequent lines
> + /* only support discovering the end of the buffer,
> + but also allow SEEK_SET to maintain the idiomatic
>
> WARNING: Block comments use a trailing */ on a separate line
> + SEEK_END(0), SEEK_CUR(0) pattern */
>
> WARNING: Block comments use a trailing */ on a separate line
> + * before passing the sgt back to the exporter. */
>
> ERROR: "foo * bar" should be "foo *bar"
> +static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
>
> WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
> + d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
>
> total: 1 errors, 4 warnings, 1746 lines checked
>
> Signed-off-by: Pintu Kumar <quic_pintu(a)quicinc.com>
Thanks for this patch - could you please also mention in the commit
log how you found this? It looks like you ran checkpatch, but it's
not clear from the commit log.
Since this patch does multiple things related to checkpatch warnings
(changing S_IRUGO to 0444, correcting comments, fixing a function
declaration), could you please change the commit title to reflect
that as well?
> ---
> drivers/dma-buf/dma-buf.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8892bc701a66..2e63d50e46d3 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -176,8 +176,9 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
> dmabuf = file->private_data;
>
> /* only support discovering the end of the buffer,
> - but also allow SEEK_SET to maintain the idiomatic
> - SEEK_END(0), SEEK_CUR(0) pattern */
> + * but also allow SEEK_SET to maintain the idiomatic
> + * SEEK_END(0), SEEK_CUR(0) pattern.
> + */
> if (whence == SEEK_END)
> base = dmabuf->size;
> else if (whence == SEEK_SET)
> @@ -782,13 +783,14 @@ static void mangle_sg_table(struct sg_table *sg_table)
> /* To catch abuse of the underlying struct page by importers mix
> * up the bits, but take care to preserve the low SG_ bits to
> * not corrupt the sgt. The mixing is undone in __unmap_dma_buf
> - * before passing the sgt back to the exporter. */
> + * before passing the sgt back to the exporter.
> + */
> for_each_sgtable_sg(sg_table, sg, i)
> sg->page_link ^= ~0xffUL;
> #endif
>
> }
> -static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
> +static struct sg_table *__map_dma_buf(struct dma_buf_attachment *attach,
> enum dma_data_direction direction)
> {
> struct sg_table *sg_table;
> @@ -1694,7 +1696,7 @@ static int dma_buf_init_debugfs(void)
>
> dma_buf_debugfs_dir = d;
>
> - d = debugfs_create_file("bufinfo", S_IRUGO, dma_buf_debugfs_dir,
> + d = debugfs_create_file("bufinfo", 0444, dma_buf_debugfs_dir,
> NULL, &dma_buf_debug_fops);
> if (IS_ERR(d)) {
> pr_debug("dma_buf: debugfs: failed to create node bufinfo\n");
> --
> 2.17.1
>
Best,
Sumit.
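For reference, the comment style checkpatch asks for above is the
kernel's usual block comment convention (a generic sketch, not the
patched code itself):

/*
 * Block comments have no text on the opening line, a leading '*' on
 * each subsequent line, and the closing marker on a line of its own.
 */

And 0444 is simply the octal spelling of S_IRUGO (read permission for
user, group and other), which is the form checkpatch prefers.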
On 02/10/2024 09:38, Boris Brezillon wrote:
> On Tue, 24 Sep 2024 00:06:21 +0100
> Adrián Larumbe <adrian.larumbe(a)collabora.com> wrote:
>
>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>> + u32 cs_ringbuf_size)
>> +{
>> + u32 min_profiled_job_instrs = U32_MAX;
>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>> +
>> + /*
>> + * We want to calculate the minimum size of a profiled job's CS:
>> + * because profiled jobs need additional instructions for sampling
>> + * performance metrics, they might take up more slots in
>> + * the queue's ringbuffer. This means we might not need as many job
>> + * slots for keeping track of their profiling information. What we
>> + * need is the maximum number of slots we should allocate to this end,
>> + * which matches the maximum number of profiled jobs we can place
>> + * simultaneously in the queue's ring buffer.
>> + * That has to be calculated separately for every single job profiling
>> + * flag, but not when job profiling is disabled, since unprofiled
>> + * jobs don't need to keep track of this at all.
>> + */
>> + for (u32 i = 0; i < last_flag; i++) {
>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>
> I'll get rid of this check when applying, as suggested by Steve. Steve,
> with this modification do you want me to add your R-b?
Yes, please do.
Thanks,
Steve
> BTW, I've also fixed a bunch of checkpatch errors/warnings, so you
> might want to run checkpatch --strict next time.
>
>> + min_profiled_job_instrs =
>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>> + }
>> +
>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>> +}
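As a worked example of the calculation above (made-up numbers): if the
cheapest profiled job's CS takes 32 instructions of sizeof(u64) = 8
bytes each, a 32 KiB ring buffer can hold at most
DIV_ROUND_UP(32768, 32 * 8) = 128 profiled jobs at once, so 128
profiling slots suffice, fewer than the number of unprofiled jobs the
same ring buffer could hold.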
Hi,
Am 30.09.24 um 21:38 schrieb Zichen Xie:
> Dear Linux Developers for DMA BUFFER SHARING FRAMEWORK,
>
> We are curious about the function 'dma_resv_get_fences' here:
> https://elixir.bootlin.com/linux/v6.11/source/drivers/dma-buf/dma-resv.c#L5…,
> and the logic below:
> ```
> dma_resv_for_each_fence_unlocked(&cursor, fence) {
>
> if (dma_resv_iter_is_restarted(&cursor)) {
> struct dma_fence **new_fences;
> unsigned int count;
>
> while (*num_fences)
> dma_fence_put((*fences)[--(*num_fences)]);
>
> count = cursor.num_fences + 1;
>
> /* Eventually re-allocate the array */
> new_fences = krealloc_array(*fences, count,
> sizeof(void *),
> GFP_KERNEL);
> if (count && !new_fences) {
> kfree(*fences);
> *fences = NULL;
> *num_fences = 0;
> dma_resv_iter_end(&cursor);
> return -ENOMEM;
> }
> *fences = new_fences;
> }
>
> (*fences)[(*num_fences)++] = dma_fence_get(fence);
> }
> ```
> The existing check 'if (count && !new_fences)' is bypassed if count==0,
> and 'krealloc_array' with count==0 is undefined behavior. The
> realloc may fail and return a NULL pointer, leading to a NULL pointer
> dereference in '(*fences)[(*num_fences)++] = dma_fence_get(fence);'
You already answered the question yourself: "count = cursor.num_fences +
1;", so count can never be 0.
What could theoretically happen is that num_fences overflows, but
this value isn't userspace-controllable and we would run into memory
allocation failures long before that happened.
But we could potentially remove this whole handling, since if there are
no fences in the dma_resv object we don't enter the loop in the first place.
Regards,
Christian.
>
> Please correct us if we missed some key prerequisites for this function!
> Thank you very much!
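To make Christian's point concrete, the invariant can be annotated on
a condensed version of the quoted logic (a sketch, not a proposed
patch):

	count = cursor.num_fences + 1;	/* >= 1, barring a num_fences overflow */

	/*
	 * Because count is never 0 here, the "count &&" part of the
	 * guard is always true, so a NULL return from krealloc_array()
	 * is caught and the feared dereference cannot be reached.
	 */
	new_fences = krealloc_array(*fences, count, sizeof(void *),
				    GFP_KERNEL);
	if (count && !new_fences) {
		/* free *fences, reset *num_fences, return -ENOMEM */
	}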
On 27/09/2024 15:53, Adrián Larumbe wrote:
> On 25.09.2024 10:56, Steven Price wrote:
>> On 23/09/2024 21:43, Adrián Larumbe wrote:
>>> Hi Steve,
>>>
>>> On 23.09.2024 09:55, Steven Price wrote:
>>>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>>>> Hi Steve, thanks for the review.
>>>>
>>>> Hi Adrián,
>>>>
>>>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>>>> only be answering your question about the calc_profiling_ringbuf_num_slots
>>>>> function further down.
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>>> .free_job = queue_free_job,
>>>>>>> };
>>>>>>>
>>>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>>>> + u32 cs_ringbuf_size)
>>>>>>> +{
>>>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * We want to calculate the minimum size of a profiled job's CS:
>>>>>>> + * because profiled jobs need additional instructions for sampling
>>>>>>> + * performance metrics, they might take up more slots in
>>>>>>> + * the queue's ringbuffer. This means we might not need as many job
>>>>>>> + * slots for keeping track of their profiling information. What we
>>>>>>> + * need is the maximum number of slots we should allocate to this end,
>>>>>>> + * which matches the maximum number of profiled jobs we can place
>>>>>>> + * simultaneously in the queue's ring buffer.
>>>>>>> + * That has to be calculated separately for every single job profiling
>>>>>>> + * flag, but not when job profiling is disabled, since unprofiled
>>>>>>> + * jobs don't need to keep track of this at all.
>>>>>>> + */
>>>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>>>> + min_profiled_job_instrs =
>>>>>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>>>> + }
>>>>>>> +
>>>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>>>> +}
>>>>>>
>>>>>> I may be missing something, but is there a situation where this is
>>>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>>>> can only add extra instructions to the no-flags case - whereas this
>>>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>>>
>>>>>> Steve
>>>>>
>>>>> Since we create a separate kernel BO to hold the profiling information slots, we
>>>>> need one that can accommodate as many slots as the maximum number of
>>>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>>>> profiled jobs always take more instructions than unprofiled ones, we would
>>>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>>>> once in the ring buffer.
>>>>>
>>>>> Because we represent profiling metrics with a bit mask, we need to test the
>>>>> size of the CS for every single metric enabled in isolation, since enabling more
>>>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>>>> in the queue's ring buffer.
>>>>>
>>>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>>>> instructions we need for a normal job with no profiled features enabled, which
>>>>> always requires fewer instructions than profiled ones, and would therefore mean
>>>>> more slots in the profiling info kernel BO. But we don't need to keep track of
>>>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>>>> number.
>>>>>
>>>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>>>> that since queue ringbuffers can be as big as 2 GiB, a lot of this memory would
>>>>> be wasted: profiled jobs hold more instructions and thus require more ring-buffer
>>>>> slots, which means fewer profiling slots in said kernel BO.
>>>>>
>>>>> The value of this approach will eventually manifest if we decide to keep track of
>>>>> more profiling metrics, since this code won't have to change at all, other than
>>>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>>>
>>>> Thanks for the detailed explanation. I think what I was missing is that
>>>> the loop is checking each bit flag independently and *not* checking
>>>> calc_job_credits(0).
>>>>
>>>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>>>> confused me - that should be completely redundant. Or at least we need
>>>> something more intelligent if we have profiling bits which are not
>>>> mutually compatible.
>>>
>>> I thought of an alternative that would only test bits that are actually part of
>>> the mask:
>>>
>>> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> u32 cs_ringbuf_size)
>>> {
>>> u32 min_profiled_job_instrs = U32_MAX;
>>> u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>>>
>>> while (profiling_mask) {
>>> u32 i = ffs(profiling_mask) - 1;
>>> profiling_mask &= ~BIT(i);
>>> min_profiled_job_instrs =
>>> min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> }
>>>
>>> return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> }
>>>
>>> However, I don't think this would be more efficient, because ffs() probably
>>> fetches the first set bit by performing register shifts, and I guess this would
>>> take somewhat longer than iterating over every single bit up to the last one,
>>> even if each is also matched against the whole mask, in case future additions
>>> of performance metrics leave some of the lower-significance bits untouched.
>>
>> Efficiency isn't very important here - we're not on a fast path, so it's
>> more about ensuring the code is readable. I don't think the above is
>> more readable than the original for loop.
>>
>>> Regarding your question about mutual compatibility, I don't think that is an
>>> issue here, because we're testing bits in isolation. If in the future we find
>>> out that some of the values we're profiling cannot be sampled at once, we can
>>> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
>>> profiling masks.
>>
>> My comment about compatibility is because in the original above you were
>> calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
>>
>>> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>
>> then looping between 0 and that bit:
>>
>>> for (u32 i = 0; i < last_flag; i++) {
>>
>> So the test:
>>
>>> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>
>> would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
>> that it set. The only reason I can think for that to be true in the
>> future is if there is some sort of incompatibility - e.g. maybe there's
>> an old and new way of doing some form of profiling with the old way
>> being kept for backwards compatibility. But I suspect if/when that is
>> required we'll need to revisit this function anyway. So that 'if'
>> statement seems completely redundant (it's trivially always true).
>
> I think you're right about this. Would you be fine with the rest of the patch
> as it is in revision 8 if I also deleted this bitmask check?
Yes the rest of it looks fine.
Thanks,
Steve
>> Steve
>>
>>>> I'm also not entirely sure that the amount of RAM saved is significant,
>>>> but you've already written the code so we might as well have the saving ;)
>>>
>>> I think this was more evident before Boris suggested we reduce the basic slot
>>> size to that of a single cache line, because then the minimum profiled job
>>> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
>>> case, we would need a BO half as big for holding the sampled data (in case the
>>> smallest profiled job's CS extended over the 16-instruction boundary).
>>> I still think this is a good idea so that in the future we don't need to worry
>>> about adjusting the code that deals with preparing the right boilerplate CS,
>>> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Adrian
>>>>>
>>>>>>> +
>>>>>>> static struct panthor_queue *
>>>>>>> group_create_queue(struct panthor_group *group,
>>>>>>> const struct drm_panthor_queue_create *args)
>>>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>>> goto err_free_queue;
>>>>>>> }
>>>>>>>
>>>>>>> + queue->profiling.slot_count =
>>>>>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>>>> +
>>>>>>> + queue->profiling.slots =
>>>>>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>>>> + queue->profiling.slot_count *
>>>>>>> + sizeof(struct panthor_job_profiling_data),
>>>>>>> + DRM_PANTHOR_BO_NO_MMAP,
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>>>>>> +
>>>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>>>> + ret = PTR_ERR(queue->profiling.slots);
>>>>>>> + goto err_free_queue;
>>>>>>> + }
>>>>>>> +
>>>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>>>> + if (ret)
>>>>>>> + goto err_free_queue;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * The credit limit argument tells us the total number of instructions
>>>>>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>>>>>> + * twice as many as others, depending on their profiling status.
>>>>>>> + */
>>>>>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>>> group->ptdev->scheduler->wq, 1,
>>>>>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>>>> + args->ringbuf_size / sizeof(u64),
>>>>>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>>> group->ptdev->reset.wq,
>>>>>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> {
>>>>>>> struct panthor_group_pool *gpool = pfile->groups;
>>>>>>> struct panthor_job *job;
>>>>>>> + u32 credits;
>>>>>>> int ret;
>>>>>>>
>>>>>>> if (qsubmit->pad)
>>>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>>>> + if (credits == 0) {
>>>>>>> + ret = -EINVAL;
>>>>>>> + goto err_put_job;
>>>>>>> + }
>>>>>>> +
>>>>>>> ret = drm_sched_job_init(&job->base,
>>>>>>> &job->group->queues[job->queue_idx]->entity,
>>>>>>> - 1, job->group);
>>>>>>> + credits, job->group);
>>>>>>> if (ret)
>>>>>>> goto err_put_job;
>>>>>>>
>>>>>
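To make the credit accounting in the last hunk concrete (illustrative
numbers only): with a 4 KiB ring buffer, the credit limit passed to
drm_sched_init() becomes 4096 / sizeof(u64) = 512 instruction credits.
An unprofiled job whose CS takes, say, 16 instructions then consumes
16 credits, while a fully profiled one taking 32 consumes 32, so the
scheduler naturally admits fewer in-flight jobs as profiling makes
each CS bigger, rather than counting every job as one fixed-size slot.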