Hi,
On Mon, Oct 23, 2023 at 10:25:50AM -0700, Doug Anderson wrote:
> On Mon, Oct 23, 2023 at 9:31 AM Yuran Pereira <yuran.pereira(a)hotmail.com> wrote:
> >
> > Since "Clean up checks for already prepared/enabled in panels" has
> > already been done and merged [1], I think there is no longer a need
> > for this item to be in the gpu TODO.
> >
> > [1] https://patchwork.freedesktop.org/patch/551421/
> >
> > Signed-off-by: Yuran Pereira <yuran.pereira(a)hotmail.com>
> > ---
> > Documentation/gpu/todo.rst | 25 -------------------------
> > 1 file changed, 25 deletions(-)
>
> It's not actually all done. It's in a bit of a limbo state right now,
> unfortunately. I landed all of the "simple" cases where panels were
> needlessly tracking prepare/enable, but the less simple cases are
> still outstanding.
>
> Specifically the issue is that many panels have code to properly power
> cycle themselves off at shutdown time and in order to do that they
> need to keep track of the prepare/enable state. After a big, long
> discussion [1] it was decided that we could get rid of all the panel
> code handling shutdown if only all relevant DRM KMS drivers would
> properly call drm_atomic_helper_shutdown().
>
> I made an attempt to get DRM KMS drivers to call
> drm_atomic_helper_shutdown() [2] [3] [4]. I was able to land the
> patches that went through drm-misc, but currently many of the
> non-drm-misc ones are blocked waiting for attention.
>
> ...so things that could be done to help out:
>
> a) Could review patches that haven't landed in [4]. Maybe adding a
> Reviewed-by tag would help wake up maintainers?
>
> b) Could see if you can identify panels that are exclusively used w/
> DRM drivers that have already been converted and then we could post
> patches for just those panels. I have no idea how easy this task would
> be. Is it enough to look at upstream dts files by "compatible" string?
I think it is, yes.
Maxime
On 27/09/2024 15:53, Adrián Larumbe wrote:
> On 25.09.2024 10:56, Steven Price wrote:
>> On 23/09/2024 21:43, Adrián Larumbe wrote:
>>> Hi Steve,
>>>
>>> On 23.09.2024 09:55, Steven Price wrote:
>>>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>>>> Hi Steve, thanks for the review.
>>>>
>>>> Hi Adrián,
>>>>
>>>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>>>> function further down below.
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>>> .free_job = queue_free_job,
>>>>>>> };
>>>>>>>
>>>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>>>> + u32 cs_ringbuf_size)
>>>>>>> +{
>>>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * We want to calculate the minimum size of a profiled job's CS,
>>>>>>> + * because since they need additional instructions for the sampling
>>>>>>> + * of performance metrics, they might take up further slots in
>>>>>>> + * the queue's ringbuffer. This means we might not need as many job
>>>>>>> + * slots for keeping track of their profiling information. What we
>>>>>>> + * need is the maximum number of slots we should allocate to this end,
>>>>>>> + * which matches the maximum number of profiled jobs we can place
>>>>>>> + * simultaneously in the queue's ring buffer.
>>>>>>> + * That has to be calculated separately for every single job profiling
>>>>>>> + * flag, but not in the case job profiling is disabled, since unprofiled
>>>>>>> + * jobs don't need to keep track of this at all.
>>>>>>> + */
>>>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>>>> + min_profiled_job_instrs =
>>>>>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>>>> + }
>>>>>>> +
>>>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>>>> +}
>>>>>>
>>>>>> I may be missing something, but is there a situation where this is
>>>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>>>> can only add extra instructions to the no-flags case - whereas this
>>>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>>>
>>>>>> Steve
>>>>>
>>>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>>>> need one that would be able to accomodate as many slots as the maximum number of
>>>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>>>> once in the ring buffer.
>>>>>
>>>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>>>> size of the CS for every single metric enabled in isolation, since enabling more
>>>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>>>> in the queue's ring buffer.
>>>>>
>>>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>>>> instructions we need for a normal job with no profiled features enabled, which
>>>>> would always requiere less instructions than profiled ones, and therefore more
>>>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>>>> number.
>>>>>
>>>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>>>> be wasted, since profiled jobs always require more slots because they hold more
>>>>> instructions, so fewer profiling slots in said kernel BO.
>>>>>
>>>>> The value of this approach will eventually manifest if we decided to keep track of
>>>>> more profiling metrics, since this code won't have to change at all, other than
>>>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>>>
>>>> Thanks for the detailed explanation. I think what I was missing is that
>>>> the loop is checking each bit flag independently and *not* checking
>>>> calc_job_credits(0).
>>>>
>>>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>>>> confused me - that should be completely redundant. Or at least we need
>>>> something more intelligent if we have profiling bits which are not
>>>> mutually compatible.
>>>
>>> I thought of an alternative that would only test bits that are actually part of
>>> the mask:
>>>
>>> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> u32 cs_ringbuf_size)
>>> {
>>> u32 min_profiled_job_instrs = U32_MAX;
>>> u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>>>
>>> while (profiling_mask) {
>>> u32 i = ffs(profiling_mask) - 1;
>>> profiling_mask &= ~BIT(i);
>>> min_profiled_job_instrs =
>>> min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> }
>>>
>>> return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> }
>>>
>>> However, I don't think this would be more efficient, because ffs() is probably
>>> fetching the first set bit by performing register shifts, and I guess this would
>>> take somewhat longer than iterating over every single bit from the last one,
>>> even if also matching them against the whole mask, just in case in future
>>> additions of performance metrics we decide to leave some of the lower
>>> significance bits untouched.
>>
>> Efficiency isn't very important here - we're not on a fast path, so it's
>> more about ensuring the code is readable. I don't think the above is
>> more readable then the original for loop.
>>
>>> Regarding your question about mutual compatibility, I don't think that is an
>>> issue here, because we're testing bits in isolation. If in the future we find
>>> out that some of the values we're profiling cannot be sampled at once, we can
>>> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
>>> profiling masks.
>>
>> My comment about compatibility is because in the original above you were
>> calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
>>
>>> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>
>> then looping between 0 and that bit:
>>
>>> for (u32 i = 0; i < last_flag; i++) {
>>
>> So the test:
>>
>>> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>
>> would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
>> that it set. The only reason I can think for that to be true in the
>> future is if there is some sort of incompatibility - e.g. maybe there's
>> an old and new way of doing some form of profiling with the old way
>> being kept for backwards compatibility. But I suspect if/when that is
>> required we'll need to revisit this function anyway. So that 'if'
>> statement seems completely redundant (it's trivially always true).
>
> I think you're right about this. Would you be fine with the rest of the patch
> as it is in revision 8 if I also deleted this bitmask check?
Yes the rest of it looks fine.
Thanks,
Steve
>> Steve
>>
>>>> I'm also not entirely sure that the amount of RAM saved is significant,
>>>> but you've already written the code so we might as well have the saving ;)
>>>
>>> I think this was more evident before Boris suggested we reduce the basic slot
>>> size to that of a single cache line, because then the minimum profiled job
>>> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
>>> case, we would need a half as big BO for holding the sampled data (in case the
>>> least size profiled job CS would extend over the 16 instruction boundary).
>>> I still think this is a good idea so that in the future we don't need to worry
>>> about adjusting the code that deals with preparing the right boilerplate CS,
>>> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Adrian
>>>>>
>>>>>>> +
>>>>>>> static struct panthor_queue *
>>>>>>> group_create_queue(struct panthor_group *group,
>>>>>>> const struct drm_panthor_queue_create *args)
>>>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>>> goto err_free_queue;
>>>>>>> }
>>>>>>>
>>>>>>> + queue->profiling.slot_count =
>>>>>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>>>> +
>>>>>>> + queue->profiling.slots =
>>>>>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>>>> + queue->profiling.slot_count *
>>>>>>> + sizeof(struct panthor_job_profiling_data),
>>>>>>> + DRM_PANTHOR_BO_NO_MMAP,
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>>>>>> +
>>>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>>>> + ret = PTR_ERR(queue->profiling.slots);
>>>>>>> + goto err_free_queue;
>>>>>>> + }
>>>>>>> +
>>>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>>>> + if (ret)
>>>>>>> + goto err_free_queue;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Credit limit argument tells us the total number of instructions
>>>>>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>>>>>> + * twice as many as others, depending on their profiling status.
>>>>>>> + */
>>>>>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>>> group->ptdev->scheduler->wq, 1,
>>>>>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>>>> + args->ringbuf_size / sizeof(u64),
>>>>>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>>> group->ptdev->reset.wq,
>>>>>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> {
>>>>>>> struct panthor_group_pool *gpool = pfile->groups;
>>>>>>> struct panthor_job *job;
>>>>>>> + u32 credits;
>>>>>>> int ret;
>>>>>>>
>>>>>>> if (qsubmit->pad)
>>>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>>>> + if (credits == 0) {
>>>>>>> + ret = -EINVAL;
>>>>>>> + goto err_put_job;
>>>>>>> + }
>>>>>>> +
>>>>>>> ret = drm_sched_job_init(&job->base,
>>>>>>> &job->group->queues[job->queue_idx]->entity,
>>>>>>> - 1, job->group);
>>>>>>> + credits, job->group);
>>>>>>> if (ret)
>>>>>>> goto err_put_job;
>>>>>>>
>>>>>
Hi,
This patch set is based on top of Yong Wu's restricted heap patch set [1].
It's also a continuation on Olivier's Add dma-buf secure-heap patch set [2].
The Linaro restricted heap uses genalloc in the kernel to manage the heap
carvout. This is a difference from the Mediatek restricted heap which
relies on the secure world to manage the carveout.
I've tried to adress the comments on [2], but [1] introduces changes so I'm
afraid I've had to skip some comments.
This can be tested on QEMU with the following steps:
repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
-b prototype/sdp-v1
repo sync -j8
cd build
make toolchains -j4
make all -j$(nproc)
make run-only
# login and at the prompt:
xtest --sdp-basic
https://optee.readthedocs.io/en/latest/building/prerequisites.html
list dependencies needed to build the above.
The tests are pretty basic, mostly checking that a Trusted Application in
the secure world can access and manipulate the memory.
Cheers,
Jens
[1] https://lore.kernel.org/dri-devel/20240515112308.10171-1-yong.wu@mediatek.c…
[2] https://lore.kernel.org/lkml/20220805135330.970-1-olivier.masse@nxp.com/
Changes since Olivier's post [2]:
* Based on Yong Wu's post [1] where much of dma-buf handling is done in
the generic restricted heap
* Simplifications and cleanup
* New commit message for "dma-buf: heaps: add Linaro restricted dmabuf heap
support"
* Replaced the word "secure" with "restricted" where applicable
Etienne Carriere (1):
tee: new ioctl to a register tee_shm from a dmabuf file descriptor
Jens Wiklander (2):
dma-buf: heaps: restricted_heap: add no_map attribute
dma-buf: heaps: add Linaro restricted dmabuf heap support
Olivier Masse (1):
dt-bindings: reserved-memory: add linaro,restricted-heap
.../linaro,restricted-heap.yaml | 56 ++++++
drivers/dma-buf/heaps/Kconfig | 10 ++
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/restricted_heap.c | 17 +-
drivers/dma-buf/heaps/restricted_heap.h | 2 +
.../dma-buf/heaps/restricted_heap_linaro.c | 165 ++++++++++++++++++
drivers/tee/tee_core.c | 38 ++++
drivers/tee/tee_shm.c | 104 ++++++++++-
include/linux/tee_drv.h | 11 ++
include/uapi/linux/tee.h | 29 +++
10 files changed, 426 insertions(+), 7 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/linaro,restricted-heap.yaml
create mode 100644 drivers/dma-buf/heaps/restricted_heap_linaro.c
--
2.34.1
Hi Adrián,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.11 next-20240927]
[cannot apply to drm-misc/drm-misc-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-i…
base: linus/master
patch link: https://lore.kernel.org/r/20240923230912.2207320-4-adrian.larumbe%40collabo…
patch subject: [PATCH v8 3/5] drm/panthor: add DRM fdinfo support
config: arm-randconfig-002-20240929 (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)
compiler: arm-linux-gnueabi-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409291048.zLqDeqpO-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/linux/math64.h:6,
from include/linux/time.h:6,
from include/linux/stat.h:19,
from include/linux/module.h:13,
from drivers/gpu/drm/panthor/panthor_drv.c:7:
drivers/gpu/drm/panthor/panthor_drv.c: In function 'panthor_gpu_show_fdinfo':
>> drivers/gpu/drm/panthor/panthor_drv.c:1389:45: error: implicit declaration of function 'arch_timer_get_cntfrq' [-Wimplicit-function-declaration]
1389 | arch_timer_get_cntfrq()));
| ^~~~~~~~~~~~~~~~~~~~~
include/linux/math.h:40:39: note: in definition of macro 'DIV_ROUND_DOWN_ULL'
40 | ({ unsigned long long _tmp = (ll); do_div(_tmp, d); _tmp; })
| ^~
drivers/gpu/drm/panthor/panthor_drv.c:1388:28: note: in expansion of macro 'DIV_ROUND_UP_ULL'
1388 | DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
| ^~~~~~~~~~~~~~~~
vim +/arch_timer_get_cntfrq +1389 drivers/gpu/drm/panthor/panthor_drv.c
1377
1378 static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
1379 struct panthor_file *pfile,
1380 struct drm_printer *p)
1381 {
1382 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_ALL)
1383 panthor_fdinfo_gather_group_samples(pfile);
1384
1385 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP) {
1386 #ifdef CONFIG_ARM_ARCH_TIMER
1387 drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
1388 DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
> 1389 arch_timer_get_cntfrq()));
1390 #endif
1391 }
1392 if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
1393 drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
1394
1395 drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
1396 drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
1397 }
1398
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 23/09/2024 21:43, Adrián Larumbe wrote:
> Hi Steve,
>
> On 23.09.2024 09:55, Steven Price wrote:
>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>> Hi Steve, thanks for the review.
>>
>> Hi Adrián,
>>
>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>> function further down below.
>>>
>>
>> [...]
>>
>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>> .free_job = queue_free_job,
>>>>> };
>>>>>
>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>> + u32 cs_ringbuf_size)
>>>>> +{
>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>> +
>>>>> + /*
>>>>> + * We want to calculate the minimum size of a profiled job's CS,
>>>>> + * because since they need additional instructions for the sampling
>>>>> + * of performance metrics, they might take up further slots in
>>>>> + * the queue's ringbuffer. This means we might not need as many job
>>>>> + * slots for keeping track of their profiling information. What we
>>>>> + * need is the maximum number of slots we should allocate to this end,
>>>>> + * which matches the maximum number of profiled jobs we can place
>>>>> + * simultaneously in the queue's ring buffer.
>>>>> + * That has to be calculated separately for every single job profiling
>>>>> + * flag, but not in the case job profiling is disabled, since unprofiled
>>>>> + * jobs don't need to keep track of this at all.
>>>>> + */
>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>> + min_profiled_job_instrs =
>>>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>> + }
>>>>> +
>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>> +}
>>>>
>>>> I may be missing something, but is there a situation where this is
>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>> can only add extra instructions to the no-flags case - whereas this
>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>
>>>> Steve
>>>
>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>> need one that would be able to accomodate as many slots as the maximum number of
>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>> once in the ring buffer.
>>>
>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>> size of the CS for every single metric enabled in isolation, since enabling more
>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>> in the queue's ring buffer.
>>>
>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>> instructions we need for a normal job with no profiled features enabled, which
>>> would always requiere less instructions than profiled ones, and therefore more
>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>> number.
>>>
>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>> be wasted, since profiled jobs always require more slots because they hold more
>>> instructions, so fewer profiling slots in said kernel BO.
>>>
>>> The value of this approach will eventually manifest if we decided to keep track of
>>> more profiling metrics, since this code won't have to change at all, other than
>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>
>> Thanks for the detailed explanation. I think what I was missing is that
>> the loop is checking each bit flag independently and *not* checking
>> calc_job_credits(0).
>>
>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>> confused me - that should be completely redundant. Or at least we need
>> something more intelligent if we have profiling bits which are not
>> mutually compatible.
>
> I thought of an alternative that would only test bits that are actually part of
> the mask:
>
> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> u32 cs_ringbuf_size)
> {
> u32 min_profiled_job_instrs = U32_MAX;
> u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>
> while (profiling_mask) {
> u32 i = ffs(profiling_mask) - 1;
> profiling_mask &= ~BIT(i);
> min_profiled_job_instrs =
> min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> }
>
> return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> }
>
> However, I don't think this would be more efficient, because ffs() is probably
> fetching the first set bit by performing register shifts, and I guess this would
> take somewhat longer than iterating over every single bit from the last one,
> even if also matching them against the whole mask, just in case in future
> additions of performance metrics we decide to leave some of the lower
> significance bits untouched.
Efficiency isn't very important here - we're not on a fast path, so it's
more about ensuring the code is readable. I don't think the above is
more readable then the original for loop.
> Regarding your question about mutual compatibility, I don't think that is an
> issue here, because we're testing bits in isolation. If in the future we find
> out that some of the values we're profiling cannot be sampled at once, we can
> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
> profiling masks.
My comment about compatibility is because in the original above you were
calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
then looping between 0 and that bit:
> for (u32 i = 0; i < last_flag; i++) {
So the test:
> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
that it set. The only reason I can think for that to be true in the
future is if there is some sort of incompatibility - e.g. maybe there's
an old and new way of doing some form of profiling with the old way
being kept for backwards compatibility. But I suspect if/when that is
required we'll need to revisit this function anyway. So that 'if'
statement seems completely redundant (it's trivially always true).
Steve
>> I'm also not entirely sure that the amount of RAM saved is significant,
>> but you've already written the code so we might as well have the saving ;)
>
> I think this was more evident before Boris suggested we reduce the basic slot
> size to that of a single cache line, because then the minimum profiled job
> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
> case, we would need a half as big BO for holding the sampled data (in case the
> least size profiled job CS would extend over the 16 instruction boundary).
> I still think this is a good idea so that in the future we don't need to worry
> about adjusting the code that deals with preparing the right boilerplate CS,
> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>
>> Thanks,
>> Steve
>>
>>> Regards,
>>> Adrian
>>>
>>>>> +
>>>>> static struct panthor_queue *
>>>>> group_create_queue(struct panthor_group *group,
>>>>> const struct drm_panthor_queue_create *args)
>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>> goto err_free_queue;
>>>>> }
>>>>>
>>>>> + queue->profiling.slot_count =
>>>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>> +
>>>>> + queue->profiling.slots =
>>>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>> + queue->profiling.slot_count *
>>>>> + sizeof(struct panthor_job_profiling_data),
>>>>> + DRM_PANTHOR_BO_NO_MMAP,
>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>>>> +
>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>> + ret = PTR_ERR(queue->profiling.slots);
>>>>> + goto err_free_queue;
>>>>> + }
>>>>> +
>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>> + if (ret)
>>>>> + goto err_free_queue;
>>>>> +
>>>>> + /*
>>>>> + * Credit limit argument tells us the total number of instructions
>>>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>>>> + * twice as many as others, depending on their profiling status.
>>>>> + */
>>>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>> group->ptdev->scheduler->wq, 1,
>>>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>> + args->ringbuf_size / sizeof(u64),
>>>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>> group->ptdev->reset.wq,
>>>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>> {
>>>>> struct panthor_group_pool *gpool = pfile->groups;
>>>>> struct panthor_job *job;
>>>>> + u32 credits;
>>>>> int ret;
>>>>>
>>>>> if (qsubmit->pad)
>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>> }
>>>>> }
>>>>>
>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>> + if (credits == 0) {
>>>>> + ret = -EINVAL;
>>>>> + goto err_put_job;
>>>>> + }
>>>>> +
>>>>> ret = drm_sched_job_init(&job->base,
>>>>> &job->group->queues[job->queue_idx]->entity,
>>>>> - 1, job->group);
>>>>> + credits, job->group);
>>>>> if (ret)
>>>>> goto err_put_job;
>>>>>
>>>
>
>
> Adrian Larumbe
Nothing wrong with this, I just didn't had time to double check it
myself and then forgotten about it.
Going to push it to drm-misc-next.
Regards,
Christian.
Am 23.09.24 um 11:22 schrieb Tommy Chiang:
> Ping.
> Please let me know if I'm doing something wrong.
>
> On Mon, Feb 19, 2024 at 11:00 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>> Kindly ping :)
>>
>> On Fri, Jan 19, 2024 at 11:33 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>>> This patch tries to improve the display of the code listing
>>> on The Linux Kernel documentation website for dma-buf [1] .
>>>
>>> Originally, it appears that it was attempting to escape
>>> the '*' character, but looks like it's not necessary (now),
>>> so we are seeing something like '\*' on the webite.
>>>
>>> This patch removes these unnecessary backslashes and adds syntax
>>> highlighting to improve the readability of the code listing.
>>>
>>> [1] https://docs.kernel.org/driver-api/dma-buf.html
>>>
>>> Signed-off-by: Tommy Chiang <ototot(a)chromium.org>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 15 +++++++++------
>>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 8fe5aa67b167..e083a0ab06d7 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1282,10 +1282,12 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>> * vmap interface is introduced. Note that on very old 32-bit architectures
>>> * vmalloc space might be limited and result in vmap calls failing.
>>> *
>>> - * Interfaces::
>>> + * Interfaces:
>>> *
>>> - * void \*dma_buf_vmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> - * void dma_buf_vunmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> + * .. code-block:: c
>>> + *
>>> + * void *dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>> + * void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>> *
>>> * The vmap call can fail if there is no vmap support in the exporter, or if
>>> * it runs out of vmalloc space. Note that the dma-buf layer keeps a reference
>>> @@ -1342,10 +1344,11 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>> * enough, since adding interfaces to intercept pagefaults and allow pte
>>> * shootdowns would increase the complexity quite a bit.
>>> *
>>> - * Interface::
>>> + * Interface:
>>> + *
>>> + * .. code-block:: c
>>> *
>>> - * int dma_buf_mmap(struct dma_buf \*, struct vm_area_struct \*,
>>> - * unsigned long);
>>> + * int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
>>> *
>>> * If the importing subsystem simply provides a special-purpose mmap call to
>>> * set up a mapping in userspace, calling do_mmap with &dma_buf.file will
>>> --
>>> 2.43.0.381.gb435a96ce8-goog
>>>