Linaro-mm-sig September 2024

linaro-mm-sig@lists.linaro.org

20 participants
27 discussions

Re: [PATCH] drm: docs: Remove item from TODO list

by Maxime Ripard

Hi,

On Mon, Oct 23, 2023 at 10:25:50AM -0700, Doug Anderson wrote:
> On Mon, Oct 23, 2023 at 9:31 AM Yuran Pereira <yuran.pereira(a)hotmail.com> wrote:
> >
> > Since "Clean up checks for already prepared/enabled in panels" has
> > already been done and merged [1], I think there is no longer a need
> > for this item to be in the gpu TODO.
> >
> > [1] https://patchwork.freedesktop.org/patch/551421/
> >
> > Signed-off-by: Yuran Pereira <yuran.pereira(a)hotmail.com>
> > ---
> >  Documentation/gpu/todo.rst | 25 -------------------------
> >  1 file changed, 25 deletions(-)
>
> It's not actually all done. It's in a bit of a limbo state right now,
> unfortunately. I landed all of the "simple" cases where panels were
> needlessly tracking prepare/enable, but the less simple cases are
> still outstanding.
>
> Specifically the issue is that many panels have code to properly power
> cycle themselves off at shutdown time and in order to do that they
> need to keep track of the prepare/enable state. After a big, long
> discussion [1] it was decided that we could get rid of all the panel
> code handling shutdown if only all relevant DRM KMS drivers would
> properly call drm_atomic_helper_shutdown().
>
> I made an attempt to get DRM KMS drivers to call
> drm_atomic_helper_shutdown() [2] [3] [4]. I was able to land the
> patches that went through drm-misc, but currently many of the
> non-drm-misc ones are blocked waiting for attention.
>
> ...so things that could be done to help out:
>
> a) Could review patches that haven't landed in [4]. Maybe adding a
> Reviewed-by tag would help wake up maintainers?
>
> b) Could see if you can identify panels that are exclusively used w/
> DRM drivers that have already been converted and then we could post
> patches for just those panels. I have no idea how easy this task would
> be. Is it enough to look at upstream dts files by "compatible" string?

I think it is, yes.

Maxime


[PATCH] Documentation: dma-buf: heaps: Add heap name definitions

by Maxime Ripard

Following a recent discussion at last Plumbers, John Stultz, Sumit
Semwal, TJ Mercier and I came to an agreement that we should document
what the dma-buf heap names are expected to be, and what buffer
attributes you'll get.

Let's create that doc to make sure those attributes and names are
guaranteed going forward.

Signed-off-by: Maxime Ripard <mripard(a)kernel.org>

---

To: Jonathan Corbet <corbet(a)lwn.net>
To: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard(a)collabora.com>
Cc: Brian Starkey <Brian.Starkey(a)arm.com>
Cc: John Stultz <jstultz(a)google.com>
Cc: "T.J. Mercier" <tjmercier(a)google.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-media(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
---
 Documentation/userspace-api/dma-buf-heaps.rst | 71 +++++++++++++++++++
 Documentation/userspace-api/index.rst         |  1 +
 2 files changed, 72 insertions(+)
 create mode 100644 Documentation/userspace-api/dma-buf-heaps.rst

diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
new file mode 100644
index 000000000000..00436227b542
--- /dev/null
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Allocating dma-buf using heaps
+==============================
+
+Dma-buf Heaps are a way for userspace to allocate dma-buf objects. They are
+typically used to allocate buffers from a specific allocation pool, or to share
+buffers across frameworks.
+
+Heaps
+=====
+
+A heap represent a specific allocator. The Linux kernel currently supports the
+following heaps:
+
+ - The ``system`` heap allocates virtually contiguous, cacheable, buffers
+
+ - The ``reserved`` heap allocates physically contiguous, cacheable, buffers.
+   Depending on the platform, it might be called differently:
+
+    - Acer Iconia Tab A500: ``linux,cma``
+    - Allwinner sun4i, sun5i and sun7i families: ``default-pool``
+    - Amlogic A1: ``linux,cma``
+    - Amlogic G12A/G12B/SM1: ``linux,cma``
+    - Amlogic GXBB/GXL: ``linux,cma``
+    - ASUS EeePad Transformer TF101: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Bach / ME370TG) E1565: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Nakasi / ME370T) E1565: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Nakasi / ME370T) PM269: ``linux,cma``
+    - Asus Transformer Infinity TF700T: ``linux,cma``
+    - Asus Transformer Pad 3G TF300TG: ``linux,cma``
+    - Asus Transformer Pad TF300T: ``linux,cma``
+    - Asus Transformer Pad TF701T: ``linux,cma``
+    - Asus Transformer Prime TF201: ``linux,cma``
+    - ASUS Vivobook S 15: ``linux,cma``
+    - Cadence KC705: ``linux,cma``
+    - Digi International ConnectCore 6UL: ``linux,cma``
+    - Freescale i.MX8DXL EVK: ``linux,cma``
+    - Freescale TQMa8Xx: ``linux,cma``
+    - Hisilicon Hikey: ``linux,cma``
+    - Lenovo ThinkPad T14s Gen 6: ``linux,cma``
+    - Lenovo ThinkPad X13s: ``linux,cma``
+    - Lenovo Yoga Slim 7x: ``linux,cma``
+    - LG Optimus 4X HD P880: ``linux,cma``
+    - LG Optimus Vu P895: ``linux,cma``
+    - Loongson 2k0500, 2k1000 and 2k2000: ``linux,cma``
+    - Microsoft Romulus: ``linux,cma``
+    - NXP i.MX8ULP EVK: ``linux,cma``
+    - NXP i.MX93 9x9 QSB: ``linux,cma``
+    - NXP i.MX93 11X11 EVK: ``linux,cma``
+    - NXP i.MX93 14X14 EVK: ``linux,cma``
+    - NXP i.MX95 19X19 EVK: ``linux,cma``
+    - Ouya Game Console: ``linux,cma``
+    - Pegatron Chagall: ``linux,cma``
+    - PHYTEC phyCORE-AM62A SOM: ``linux,cma``
+    - PHYTEC phyCORE-i.MX93 SOM: ``linux,cma``
+    - Qualcomm SC8280XP CRD: ``linux,cma``
+    - Qualcomm X1E80100 CRD: ``linux,cma``
+    - Qualcomm X1E80100 QCP: ``linux,cma``
+    - RaspberryPi: ``linux,cma``
+    - Texas Instruments AM62x SK board family: ``linux,cma``
+    - Texas Instruments AM62A7 SK: ``linux,cma``
+    - Toradex Apalis iMX8: ``linux,cma``
+    - TQ-Systems i.MX8MM TQMa8MxML: ``linux,cma``
+    - TQ-Systems i.MX8MN TQMa8MxNL: ``linux,cma``
+    - TQ-Systems i.MX8MPlus TQMa8MPxL: ``linux,cma``
+    - TQ-Systems i.MX8MQ TQMa8MQ: ``linux,cma``
+    - TQ-Systems i.MX93 TQMa93xxLA/TQMa93xxCA SOM: ``linux,cma``
+    - TQ-Systems MBA6ULx Baseboard: ``linux,cma``
+
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 274cc7546efc..4901ce7c6cb7 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -41,10 +41,11 @@ Devices and I/O

 .. toctree::
    :maxdepth: 1

    accelerators/ocxl
+   dma-buf-heaps
    dma-buf-alloc-exchange
    gpio/index
    iommufd
    media/index
    dcdbas
-- 
2.46.1
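
For readers who have not used the heaps from userspace, here is a minimal
sketch of how a heap name from the list above ends up being used. It is not
part of the patch. It assumes the heap is exposed as /dev/dma_heap/<name> and
relies on DMA_HEAP_IOCTL_ALLOC and struct dma_heap_allocation_data from the
<linux/dma-heap.h> uapi header; heap_path() and heap_alloc() are hypothetical
helper names made up for this illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/*
 * Hypothetical helper: build the device node path for a heap name,
 * e.g. "system" -> "/dev/dma_heap/system".
 * Returns 0 on success, -1 if the path does not fit in the buffer.
 */
static int heap_path(const char *name, char *buf, size_t len)
{
        int n = snprintf(buf, len, "/dev/dma_heap/%s", name);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: allocate a dma-buf of `size` bytes from the
 * named heap. Returns the dma-buf file descriptor, or -1 on failure,
 * for instance when the heap does not exist on this platform.
 */
static int heap_alloc(const char *name, size_t size)
{
        struct dma_heap_allocation_data data = {
                .len = size,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        char path[128];
        int heap_fd, ret;

        if (heap_path(name, path, sizeof(path)))
                return -1;

        heap_fd = open(path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        /* On success the kernel stores the new dma-buf fd in data.fd. */
        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}
```

A caller would try heap_alloc("system", 4096) first and fall back to the
platform-specific name of the reserved heap (for example "linux,cma") only
when physically contiguous memory is needed, which is exactly why having
those names documented and stable matters.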


Re: [PATCH v6 1/5] drm/panthor: introduce job cycle and timestamp accounting

by Steven Price

On 27/09/2024 15:53, Adrián Larumbe wrote:
> On 25.09.2024 10:56, Steven Price wrote:
>> On 23/09/2024 21:43, Adrián Larumbe wrote:
>>> Hi Steve,
>>>
>>> On 23.09.2024 09:55, Steven Price wrote:
>>>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>>>> Hi Steve, thanks for the review.
>>>>
>>>> Hi Adrián,
>>>>
>>>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>>>> function further down below.
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>>>          .free_job = queue_free_job,
>>>>>>>  };
>>>>>>>
>>>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>>>> +                                            u32 cs_ringbuf_size)
>>>>>>> +{
>>>>>>> +        u32 min_profiled_job_instrs = U32_MAX;
>>>>>>> +        u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>>>> +
>>>>>>> +        /*
>>>>>>> +         * We want to calculate the minimum size of a profiled job's CS,
>>>>>>> +         * because since they need additional instructions for the sampling
>>>>>>> +         * of performance metrics, they might take up further slots in
>>>>>>> +         * the queue's ringbuffer. This means we might not need as many job
>>>>>>> +         * slots for keeping track of their profiling information. What we
>>>>>>> +         * need is the maximum number of slots we should allocate to this end,
>>>>>>> +         * which matches the maximum number of profiled jobs we can place
>>>>>>> +         * simultaneously in the queue's ring buffer.
>>>>>>> +         * That has to be calculated separately for every single job profiling
>>>>>>> +         * flag, but not in the case job profiling is disabled, since unprofiled
>>>>>>> +         * jobs don't need to keep track of this at all.
>>>>>>> +         */
>>>>>>> +        for (u32 i = 0; i < last_flag; i++) {
>>>>>>> +                if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>>>> +                        min_profiled_job_instrs =
>>>>>>> +                                min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>>>> +}
>>>>>>
>>>>>> I may be missing something, but is there a situation where this is
>>>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>>>> can only add extra instructions to the no-flags case - whereas this
>>>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>>>
>>>>>> Steve
>>>>>
>>>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>>>> need one that would be able to accommodate as many slots as the maximum number of
>>>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>>>> once in the ring buffer.
>>>>>
>>>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>>>> size of the CS for every single metric enabled in isolation, since enabling more
>>>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>>>> in the queue's ring buffer.
>>>>>
>>>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>>>> instructions we need for a normal job with no profiled features enabled, which
>>>>> would always require fewer instructions than profiled ones, and therefore more
>>>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>>>> number.
>>>>>
>>>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>>>> be wasted, since profiled jobs always require more slots because they hold more
>>>>> instructions, so fewer profiling slots in said kernel BO.
>>>>>
>>>>> The value of this approach will eventually manifest if we decide to keep track of
>>>>> more profiling metrics, since this code won't have to change at all, other than
>>>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>>>
>>>> Thanks for the detailed explanation. I think what I was missing is that
>>>> the loop is checking each bit flag independently and *not* checking
>>>> calc_job_credits(0).
>>>>
>>>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>>>> confused me - that should be completely redundant. Or at least we need
>>>> something more intelligent if we have profiling bits which are not
>>>> mutually compatible.
>>>
>>> I thought of an alternative that would only test bits that are actually part of
>>> the mask:
>>>
>>> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>                                             u32 cs_ringbuf_size)
>>> {
>>>         u32 min_profiled_job_instrs = U32_MAX;
>>>         u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>>>
>>>         while (profiling_mask) {
>>>                 u32 i = ffs(profiling_mask) - 1;
>>>                 profiling_mask &= ~BIT(i);
>>>                 min_profiled_job_instrs =
>>>                         min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>         }
>>>
>>>         return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> }
>>>
>>> However, I don't think this would be more efficient, because ffs() is probably
>>> fetching the first set bit by performing register shifts, and I guess this would
>>> take somewhat longer than iterating over every single bit from the last one,
>>> even if also matching them against the whole mask, just in case in future
>>> additions of performance metrics we decide to leave some of the lower
>>> significance bits untouched.
>>
>> Efficiency isn't very important here - we're not on a fast path, so it's
>> more about ensuring the code is readable. I don't think the above is
>> more readable than the original for loop.
>>
>>> Regarding your question about mutual compatibility, I don't think that is an
>>> issue here, because we're testing bits in isolation. If in the future we find
>>> out that some of the values we're profiling cannot be sampled at once, we can
>>> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
>>> profiling masks.
>>
>> My comment about compatibility is because in the original above you were
>> calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:
>>
>>>         u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>
>> then looping between 0 and that bit:
>>
>>>         for (u32 i = 0; i < last_flag; i++) {
>>
>> So the test:
>>
>>>                 if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>
>> would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
>> that it set. The only reason I can think for that to be true in the
>> future is if there is some sort of incompatibility - e.g. maybe there's
>> an old and new way of doing some form of profiling with the old way
>> being kept for backwards compatibility. But I suspect if/when that is
>> required we'll need to revisit this function anyway. So that 'if'
>> statement seems completely redundant (it's trivially always true).
>
> I think you're right about this. Would you be fine with the rest of the patch
> as it is in revision 8 if I also deleted this bitmask check?

Yes the rest of it looks fine.

Thanks,
Steve

>> Steve
>>
>>>> I'm also not entirely sure that the amount of RAM saved is significant,
>>>> but you've already written the code so we might as well have the saving ;)
>>>
>>> I think this was more evident before Boris suggested we reduce the basic slot
>>> size to that of a single cache line, because then the minimum profiled job
>>> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
>>> case, we would need a half as big BO for holding the sampled data (in case the
>>> least size profiled job CS would extend over the 16 instruction boundary).
>>> I still think this is a good idea so that in the future we don't need to worry
>>> about adjusting the code that deals with preparing the right boilerplate CS,
>>> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Adrian
>>>>>
>>>>>>> +
>>>>>>>  static struct panthor_queue *
>>>>>>>  group_create_queue(struct panthor_group *group,
>>>>>>>                     const struct drm_panthor_queue_create *args)
>>>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>>>                  goto err_free_queue;
>>>>>>>          }
>>>>>>>
>>>>>>> +        queue->profiling.slot_count =
>>>>>>> +                calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>>>> +
>>>>>>> +        queue->profiling.slots =
>>>>>>> +                panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>>>> +                                         queue->profiling.slot_count *
>>>>>>> +                                         sizeof(struct panthor_job_profiling_data),
>>>>>>> +                                         DRM_PANTHOR_BO_NO_MMAP,
>>>>>>> +                                         DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>>>> +                                         DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>>>> +                                         PANTHOR_VM_KERNEL_AUTO_VA);
>>>>>>> +
>>>>>>> +        if (IS_ERR(queue->profiling.slots)) {
>>>>>>> +                ret = PTR_ERR(queue->profiling.slots);
>>>>>>> +                goto err_free_queue;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>>>> +        if (ret)
>>>>>>> +                goto err_free_queue;
>>>>>>> +
>>>>>>> +        /*
>>>>>>> +         * Credit limit argument tells us the total number of instructions
>>>>>>> +         * across all CS slots in the ringbuffer, with some jobs requiring
>>>>>>> +         * twice as many as others, depending on their profiling status.
>>>>>>> +         */
>>>>>>>          ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>>>                               group->ptdev->scheduler->wq, 1,
>>>>>>> -                             args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>>>> +                             args->ringbuf_size / sizeof(u64),
>>>>>>>                               0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>>>                               group->ptdev->reset.wq,
>>>>>>>                               NULL, "panthor-queue", group->ptdev->base.dev);
>>>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>>  {
>>>>>>>          struct panthor_group_pool *gpool = pfile->groups;
>>>>>>>          struct panthor_job *job;
>>>>>>> +        u32 credits;
>>>>>>>          int ret;
>>>>>>>
>>>>>>>          if (qsubmit->pad)
>>>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>>>                  }
>>>>>>>          }
>>>>>>>
>>>>>>> +        job->profiling.mask = pfile->ptdev->profile_mask;
>>>>>>> +        credits = calc_job_credits(job->profiling.mask);
>>>>>>> +        if (credits == 0) {
>>>>>>> +                ret = -EINVAL;
>>>>>>> +                goto err_put_job;
>>>>>>> +        }
>>>>>>> +
>>>>>>>          ret = drm_sched_job_init(&job->base,
>>>>>>>                                   &job->group->queues[job->queue_idx]->entity,
>>>>>>> -                                 1, job->group);
>>>>>>> +                                 credits, job->group);
>>>>>>>          if (ret)
>>>>>>>                  goto err_put_job;
>>>>>>>
>>>>>
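
The sizing arithmetic discussed in this thread can be sketched in plain
userspace C. This is only an illustration of the two formulas visible in the
quoted patch (the scheduler credit limit and the profiling slot count); the
function names are hypothetical and the instruction counts used in the note
below are made-up numbers, not values taken from the driver.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel macro of the same name. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Scheduler credit limit: one credit per 64-bit CS instruction that
 * fits in the ring buffer, as in the drm_sched_init() call above.
 */
static uint32_t credit_limit(uint32_t ringbuf_size)
{
        return ringbuf_size / sizeof(uint64_t);
}

/*
 * Upper bound on the number of profiled jobs that can sit in the ring
 * buffer at once, given the instruction count of the smallest profiled
 * job. This is the number of profiling slots worth allocating.
 */
static uint32_t profiling_slots(uint32_t ringbuf_size,
                                uint32_t min_profiled_job_instrs)
{
        return DIV_ROUND_UP(ringbuf_size,
                            min_profiled_job_instrs * sizeof(uint64_t));
}
```

With a 4096-byte ring buffer the credit limit is 512 instructions. If the
smallest profiled job needed 32 instructions, 16 profiling slots would be
enough, compared to 32 slots if the BO were sized for 16-instruction
unprofiled jobs.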


[RFC PATCH 0/4] Linaro restricted heap

by Jens Wiklander

Hi,

This patch set is based on top of Yong Wu's restricted heap patch set [1].
It's also a continuation of Olivier's Add dma-buf secure-heap patch set [2].

The Linaro restricted heap uses genalloc in the kernel to manage the heap
carveout. This is a difference from the Mediatek restricted heap which
relies on the secure world to manage the carveout.

I've tried to address the comments on [2], but [1] introduces changes so I'm
afraid I've had to skip some comments.

This can be tested on QEMU with the following steps:
repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
        -b prototype/sdp-v1
repo sync -j8
cd build
make toolchains -j4
make all -j$(nproc)
make run-only
# login and at the prompt:
xtest --sdp-basic

https://optee.readthedocs.io/en/latest/building/prerequisites.html
lists the dependencies needed to build the above.

The tests are pretty basic, mostly checking that a Trusted Application in
the secure world can access and manipulate the memory.

Cheers,
Jens

[1] https://lore.kernel.org/dri-devel/20240515112308.10171-1-yong.wu@mediatek.c…
[2] https://lore.kernel.org/lkml/20220805135330.970-1-olivier.masse@nxp.com/

Changes since Olivier's post [2]:
* Based on Yong Wu's post [1] where much of dma-buf handling is done in
  the generic restricted heap
* Simplifications and cleanup
* New commit message for "dma-buf: heaps: add Linaro restricted dmabuf heap
  support"
* Replaced the word "secure" with "restricted" where applicable

Etienne Carriere (1):
  tee: new ioctl to a register tee_shm from a dmabuf file descriptor

Jens Wiklander (2):
  dma-buf: heaps: restricted_heap: add no_map attribute
  dma-buf: heaps: add Linaro restricted dmabuf heap support

Olivier Masse (1):
  dt-bindings: reserved-memory: add linaro,restricted-heap

 .../linaro,restricted-heap.yaml               |  56 ++++++
 drivers/dma-buf/heaps/Kconfig                 |  10 ++
 drivers/dma-buf/heaps/Makefile                |   1 +
 drivers/dma-buf/heaps/restricted_heap.c       |  17 +-
 drivers/dma-buf/heaps/restricted_heap.h       |   2 +
 .../dma-buf/heaps/restricted_heap_linaro.c    | 165 ++++++++++++++++++
 drivers/tee/tee_core.c                        |  38 ++++
 drivers/tee/tee_shm.c                         | 104 ++++++++++-
 include/linux/tee_drv.h                       |  11 ++
 include/uapi/linux/tee.h                      |  29 +++
 10 files changed, 426 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/reserved-memory/linaro,restricted-heap.yaml
 create mode 100644 drivers/dma-buf/heaps/restricted_heap_linaro.c

-- 
2.34.1


Re: [PATCH v8 3/5] drm/panthor: add DRM fdinfo support

by kernel test robot

Hi Adrián,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.11 next-20240927]
[cannot apply to drm-misc/drm-misc-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-i…
base:   linus/master
patch link:    https://lore.kernel.org/r/20240923230912.2207320-4-adrian.larumbe%40collabo…
patch subject: [PATCH v8 3/5] drm/panthor: add DRM fdinfo support
config: arm-randconfig-002-20240929 (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)
compiler: arm-linux-gnueabi-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240929/202409291048.zLqDeqpO-lkp@…)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409291048.zLqDeqpO-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/math64.h:6,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:13,
                    from drivers/gpu/drm/panthor/panthor_drv.c:7:
   drivers/gpu/drm/panthor/panthor_drv.c: In function 'panthor_gpu_show_fdinfo':
>> drivers/gpu/drm/panthor/panthor_drv.c:1389:45: error: implicit declaration of function 'arch_timer_get_cntfrq' [-Wimplicit-function-declaration]
    1389 |                                             arch_timer_get_cntfrq()));
         |                                             ^~~~~~~~~~~~~~~~~~~~~
   include/linux/math.h:40:39: note: in definition of macro 'DIV_ROUND_DOWN_ULL'
      40 |         ({ unsigned long long _tmp = (ll); do_div(_tmp, d); _tmp; })
         |                                       ^~
   drivers/gpu/drm/panthor/panthor_drv.c:1388:28: note: in expansion of macro 'DIV_ROUND_UP_ULL'
    1388 |                            DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
         |                            ^~~~~~~~~~~~~~~~

vim +/arch_timer_get_cntfrq +1389 drivers/gpu/drm/panthor/panthor_drv.c

  1377
  1378  static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
  1379                                      struct panthor_file *pfile,
  1380                                      struct drm_printer *p)
  1381  {
  1382          if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_ALL)
  1383                  panthor_fdinfo_gather_group_samples(pfile);
  1384
  1385          if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP) {
  1386  #ifdef CONFIG_ARM_ARCH_TIMER
  1387                  drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
  1388                             DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
> 1389                                              arch_timer_get_cntfrq()));
  1390  #endif
  1391          }
  1392          if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
  1393                  drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
  1394
  1395          drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
  1396          drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
  1397  }
  1398

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
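
For context, the expression that fails to build converts a count of timer
ticks into nanoseconds, rounding up. Below is a minimal userspace sketch of
that conversion, assuming the counter frequency is passed in as a parameter
rather than read through arch_timer_get_cntfrq(); ticks_to_ns() is a
hypothetical name and this is not a proposed fix for the report.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Convert `ticks` of a counter running at `freq_hz` into nanoseconds,
 * rounding up like DIV_ROUND_UP_ULL() does in the quoted code.
 * Returns 0 for a zero frequency instead of dividing by zero.
 * Note: ticks * NSEC_PER_SEC can overflow 64 bits for large tick
 * counts; this sketch does not guard against that.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t freq_hz)
{
        if (!freq_hz)
                return 0;

        return (ticks * NSEC_PER_SEC + freq_hz - 1) / freq_hz;
}
```

With a hypothetical 24 MHz counter, 24000000 ticks come out as exactly one
second, and a single tick rounds up to 42 ns.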


Re: [PATCH v6 1/5] drm/panthor: introduce job cycle and timestamp accounting

by Steven Price

On 23/09/2024 21:43, Adrián Larumbe wrote:
> Hi Steve,
>
> On 23.09.2024 09:55, Steven Price wrote:
>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>> Hi Steve, thanks for the review.
>>
>> Hi Adrián,
>>
>>> I've applied all of your suggestions for the next patch series revision, so I'll
>>> only be answering to your question about the calc_profiling_ringbuf_num_slots
>>> function further down below.
>>>
>>
>> [...]
>>
>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>          .free_job = queue_free_job,
>>>>>  };
>>>>>
>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>> +                                            u32 cs_ringbuf_size)
>>>>> +{
>>>>> +        u32 min_profiled_job_instrs = U32_MAX;
>>>>> +        u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>> +
>>>>> +        /*
>>>>> +         * We want to calculate the minimum size of a profiled job's CS,
>>>>> +         * because since they need additional instructions for the sampling
>>>>> +         * of performance metrics, they might take up further slots in
>>>>> +         * the queue's ringbuffer. This means we might not need as many job
>>>>> +         * slots for keeping track of their profiling information. What we
>>>>> +         * need is the maximum number of slots we should allocate to this end,
>>>>> +         * which matches the maximum number of profiled jobs we can place
>>>>> +         * simultaneously in the queue's ring buffer.
>>>>> +         * That has to be calculated separately for every single job profiling
>>>>> +         * flag, but not in the case job profiling is disabled, since unprofiled
>>>>> +         * jobs don't need to keep track of this at all.
>>>>> +         */
>>>>> +        for (u32 i = 0; i < last_flag; i++) {
>>>>> +                if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>> +                        min_profiled_job_instrs =
>>>>> +                                min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>> +        }
>>>>> +
>>>>> +        return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>> +}
>>>>
>>>> I may be missing something, but is there a situation where this is
>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>> can only add extra instructions to the no-flags case - whereas this
>>>> implies you're thinking that instructions may also be removed (or replaced).
>>>>
>>>> Steve
>>>
>>> Since we create a separate kernel BO to hold the profiling information slot, we
>>> need one that would be able to accommodate as many slots as the maximum number of
>>> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
>>> profiled jobs always take more instructions than unprofiled ones, then we would
>>> usually need fewer slots than the number of unprofiled jobs we could insert at
>>> once in the ring buffer.
>>>
>>> Because we represent profiling metrics with a bit mask, then we need to test the
>>> size of the CS for every single metric enabled in isolation, since enabling more
>>> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
>>> in the queue's ring buffer.
>>>
>>> In our case, calling calc_job_credits(0) would simply tell us the number of
>>> instructions we need for a normal job with no profiled features enabled, which
>>> would always require fewer instructions than profiled ones, and therefore more
>>> slots in the profiling info kernel BO. But we don't need to keep track of
>>> profiling numbers for unprofiled jobs, so there's no point in calculating this
>>> number.
>>>
>>> At first I was simply allocating a profiling info kernel BO as big as the number
>>> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
>>> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
>>> be wasted, since profiled jobs always require more slots because they hold more
>>> instructions, so fewer profiling slots in said kernel BO.
>>>
>>> The value of this approach will eventually manifest if we decide to keep track of
>>> more profiling metrics, since this code won't have to change at all, other than
>>> adding new profiling flags in the panthor_device_profiling_flags enum.
>>
>> Thanks for the detailed explanation. I think what I was missing is that
>> the loop is checking each bit flag independently and *not* checking
>> calc_job_credits(0).
>>
>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>> confused me - that should be completely redundant. Or at least we need
>> something more intelligent if we have profiling bits which are not
>> mutually compatible.
>
> I thought of an alternative that would only test bits that are actually part of
> the mask:
>
> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>                                             u32 cs_ringbuf_size)
> {
>         u32 min_profiled_job_instrs = U32_MAX;
>         u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
>
>         while (profiling_mask) {
>                 u32 i = ffs(profiling_mask) - 1;
>                 profiling_mask &= ~BIT(i);
>                 min_profiled_job_instrs =
>                         min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>         }
>
>         return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> }
>
> However, I don't think this would be more efficient, because ffs() is probably
> fetching the first set bit by performing register shifts, and I guess this would
> take somewhat longer than iterating over every single bit from the last one,
> even if also matching them against the whole mask, just in case in future
> additions of performance metrics we decide to leave some of the lower
> significance bits untouched.

Efficiency isn't very important here - we're not on a fast path, so it's
more about ensuring the code is readable. I don't think the above is
more readable than the original for loop.

> Regarding your question about mutual compatibility, I don't think that is an
> issue here, because we're testing bits in isolation. If in the future we find
> out that some of the values we're profiling cannot be sampled at once, we can
> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
> profiling masks.

My comment about compatibility is because in the original above you were
calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:

>         u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);

then looping between 0 and that bit:

>         for (u32 i = 0; i < last_flag; i++) {

So the test:

>                 if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)

would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
that it set. The only reason I can think for that to be true in the
future is if there is some sort of incompatibility - e.g. maybe there's
an old and new way of doing some form of profiling with the old way
being kept for backwards compatibility. But I suspect if/when that is
required we'll need to revisit this function anyway. So that 'if'
statement seems completely redundant (it's trivially always true).

Steve

>> I'm also not entirely sure that the amount of RAM saved is significant,
>> but you've already written the code so we might as well have the saving ;)
>
> I think this was more evident before Boris suggested we reduce the basic slot
> size to that of a single cache line, because then the minimum profiled job
> might've taken twice as many ringbuffer slots as a nonprofiled one. In that
> case, we would need a half as big BO for holding the sampled data (in case the
> least size profiled job CS would extend over the 16 instruction boundary).
> I still think this is a good idea so that in the future we don't need to worry
> about adjusting the code that deals with preparing the right boilerplate CS,
> since it'll only be a matter of adding new instructions inside prepare_job_instrs().
>
>> Thanks,
>> Steve
>>
>>> Regards,
>>> Adrian
>>>
>>>>> +
>>>>>  static struct panthor_queue *
>>>>>  group_create_queue(struct panthor_group *group,
>>>>>                     const struct drm_panthor_queue_create *args)
>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>                  goto err_free_queue;
>>>>>          }
>>>>>
>>>>> +        queue->profiling.slot_count =
>>>>> +                calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>> +
>>>>> +        queue->profiling.slots =
>>>>> +                panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>> +                                         queue->profiling.slot_count *
>>>>> +                                         sizeof(struct panthor_job_profiling_data),
>>>>> +                                         DRM_PANTHOR_BO_NO_MMAP,
>>>>> +                                         DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>> +                                         DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>> +                                         PANTHOR_VM_KERNEL_AUTO_VA);
>>>>> +
>>>>> +        if (IS_ERR(queue->profiling.slots)) {
>>>>> +                ret = PTR_ERR(queue->profiling.slots);
>>>>> +                goto err_free_queue;
>>>>> +        }
>>>>> +
>>>>> +        ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>> +        if (ret)
>>>>> +                goto err_free_queue;
>>>>> +
>>>>> +        /*
>>>>> +         * Credit limit argument tells us the total number of instructions
>>>>> +         * across all CS slots in the ringbuffer, with some jobs requiring
>>>>> +         * twice as many as others, depending on their profiling status.
>>>>> +         */
>>>>>          ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>                               group->ptdev->scheduler->wq, 1,
>>>>> -                             args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>> +                             args->ringbuf_size / sizeof(u64),
>>>>>                               0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>                               group->ptdev->reset.wq,
>>>>>                               NULL, "panthor-queue", group->ptdev->base.dev);
>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>  {
>>>>>          struct panthor_group_pool *gpool = pfile->groups;
>>>>>          struct panthor_job *job;
>>>>> +        u32 credits;
>>>>>          int ret;
>>>>>
>>>>>          if (qsubmit->pad)
>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>                  }
>>>>>          }
>>>>>
>>>>> +        job->profiling.mask = pfile->ptdev->profile_mask;
>>>>> +        credits = calc_job_credits(job->profiling.mask);
>>>>> +        if (credits == 0) {
>>>>> +                ret = -EINVAL;
>>>>> +                goto err_put_job;
>>>>> +        }
>>>>> +
>>>>>          ret = drm_sched_job_init(&job->base,
>>>>>                                   &job->group->queues[job->queue_idx]->entity,
>>>>> -                                 1, job->group);
>>>>> +                                 credits, job->group);
>>>>>          if (ret)
>>>>>                  goto err_put_job;
>>>>>
>>>
>
>
> Adrian Larumbe

1 year, 9 months
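
The sizing arithmetic discussed in the message above can be sketched as a small standalone C program. This is only an illustration of the two divisions quoted in the patch: the function names `credit_limit` and `profiling_slot_count` are made up for this example, and the instruction counts passed in are illustrative values, not figures taken from the driver.

```c
#include <stdint.h>

/* Hypothetical helper: round-up integer division, mirroring DIV_ROUND_UP. */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Scheduler credit limit: one credit per 64-bit instruction that fits in
 * the ring buffer (the patch passes ringbuf_size / sizeof(u64)).
 */
uint32_t credit_limit(uint32_t ringbuf_size)
{
	return (uint32_t)(ringbuf_size / sizeof(uint64_t));
}

/*
 * Number of profiling slots: the most profiled jobs that can sit in the
 * ring buffer at once, given the smallest profiled job in instructions.
 */
uint32_t profiling_slot_count(uint32_t ringbuf_size,
			      uint32_t min_profiled_instrs)
{
	uint32_t job_bytes = min_profiled_instrs * (uint32_t)sizeof(uint64_t);

	return SKETCH_DIV_ROUND_UP(ringbuf_size, job_bytes);
}
```

With a 4096-byte ring buffer, for instance, the credit limit is 512 instructions; a smallest profiled job of 16 instructions would give 32 profiling slots, while 24 instructions would give 22.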

Re: [PATCH] dma-buf: Add syntax highlighting to code listings in the document

by Christian König

Nothing wrong with this, I just didn't have time to double check it myself and then forgot about it.

Going to push it to drm-misc-next.

Regards,
Christian.

On 23.09.24 at 11:22, Tommy Chiang wrote:
> Ping.
> Please let me know if I'm doing something wrong.
>
> On Mon, Feb 19, 2024 at 11:00 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>> Kindly ping :)
>>
>> On Fri, Jan 19, 2024 at 11:33 AM Tommy Chiang <ototot(a)chromium.org> wrote:
>>> This patch tries to improve the display of the code listing
>>> on The Linux Kernel documentation website for dma-buf [1] .
>>>
>>> Originally, it appears that it was attempting to escape
>>> the '*' character, but looks like it's not necessary (now),
>>> so we are seeing something like '\*' on the website.
>>>
>>> This patch removes these unnecessary backslashes and adds syntax
>>> highlighting to improve the readability of the code listing.
>>>
>>> [1] https://docs.kernel.org/driver-api/dma-buf.html
>>>
>>> Signed-off-by: Tommy Chiang <ototot(a)chromium.org>
>>> ---
>>>  drivers/dma-buf/dma-buf.c | 15 +++++++++------
>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 8fe5aa67b167..e083a0ab06d7 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1282,10 +1282,12 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>>   * vmap interface is introduced. Note that on very old 32-bit architectures
>>>   * vmalloc space might be limited and result in vmap calls failing.
>>>   *
>>> - * Interfaces::
>>> + * Interfaces:
>>>   *
>>> - *   void \*dma_buf_vmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> - *   void dma_buf_vunmap(struct dma_buf \*dmabuf, struct iosys_map \*map)
>>> + * .. code-block:: c
>>> + *
>>> + *   void *dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>> + *   void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
>>>   *
>>>   * The vmap call can fail if there is no vmap support in the exporter, or if
>>>   * it runs out of vmalloc space. Note that the dma-buf layer keeps a reference
>>> @@ -1342,10 +1344,11 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_move_notify, DMA_BUF);
>>>   * enough, since adding interfaces to intercept pagefaults and allow pte
>>>   * shootdowns would increase the complexity quite a bit.
>>>   *
>>> - * Interface::
>>> + * Interface:
>>> + *
>>> + * .. code-block:: c
>>>   *
>>> - *   int dma_buf_mmap(struct dma_buf \*, struct vm_area_struct \*,
>>> - *		       unsigned long);
>>> + *   int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
>>>   *
>>>   * If the importing subsystem simply provides a special-purpose mmap call to
>>>   * set up a mapping in userspace, calling do_mmap with &dma_buf.file will
>>> --
>>> 2.43.0.381.gb435a96ce8-goog
>>>

1 year, 9 months

Re: [PATCH v7 1/5] drm/panthor: introduce job cycle and timestamp accounting

by Steven Price

On 23/09/2024 11:18, Boris Brezillon wrote:
> On Mon, 23 Sep 2024 10:07:14 +0100
> Steven Price <steven.price(a)arm.com> wrote:
>
>>> +static struct dma_fence *
>>> +queue_run_job(struct drm_sched_job *sched_job)
>>> +{
>>> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
>>> +	struct panthor_group *group = job->group;
>>> +	struct panthor_queue *queue = group->queues[job->queue_idx];
>>> +	struct panthor_device *ptdev = group->ptdev;
>>> +	struct panthor_scheduler *sched = ptdev->scheduler;
>>> +	struct panthor_job_ringbuf_instrs instrs;
>>
>> instrs isn't initialised...
>>
>>> +	struct panthor_job_cs_params cs_params;
>>> +	struct dma_fence *done_fence;
>>> +	int ret;
>>>
>>>  	/* Stream size is zero, nothing to do except making sure all previously
>>>  	 * submitted jobs are done before we signal the
>>> @@ -2900,17 +3062,23 @@ queue_run_job(struct drm_sched_job *sched_job)
>>>  		       queue->fence_ctx.id,
>>>  		       atomic64_inc_return(&queue->fence_ctx.seqno));
>>>
>>> -	memcpy(queue->ringbuf->kmap + ringbuf_insert,
>>> -	       call_instrs, sizeof(call_instrs));
>>> +	job->profiling.slot = queue->profiling.seqno++;
>>> +	if (queue->profiling.seqno == queue->profiling.slot_count)
>>> +		queue->profiling.seqno = 0;
>>> +
>>> +	job->ringbuf.start = queue->iface.input->insert;
>>> +
>>> +	get_job_cs_params(job, &cs_params);
>>> +	prepare_job_instrs(&cs_params, &instrs);
>>
>> ...but it's passed into prepare_job_instrs() which depends on
>> instrs.count (same bug as was in calc_job_credits()) - sorry I didn't
>> spot it last review.
>
> Hm, can't we initialize instr::count to zero in prepare_job_instrs()
> instead?

Indeed that would probably be better! I hadn't noticed there were two places in the previous review.

Steve

1 year, 9 months
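
The suggestion in the exchange above, resetting the count inside the helper rather than relying on every caller to zero-initialise the struct, can be sketched in standalone C. This is a simplified userspace illustration and not the driver code: the names `ringbuf_instrs`, `job_instr` and `prepare_instrs`, the explicit sequence-length parameter, and the error return are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSTRS_PER_CACHE_LINE 8u	/* 64-byte line / 8-byte instruction */
#define MAX_INSTRS 24u

struct ringbuf_instrs {
	uint64_t buffer[MAX_INSTRS];
	uint32_t count;
};

struct job_instr {
	uint32_t profile_mask;	/* 0 means the instruction is always emitted */
	uint64_t instr;
};

/*
 * Fill 'out' with the instructions enabled by 'enabled_mask', then pad
 * with zeroes up to a cache-line multiple. The count is reset here, so a
 * caller passing an uninitialised stack struct still gets a sane result.
 */
int prepare_instrs(const struct job_instr *seq, size_t n,
		   uint32_t enabled_mask, struct ringbuf_instrs *out)
{
	uint32_t padded;

	if (n > MAX_INSTRS)
		return -1;

	out->count = 0;	/* the fix: do not trust the caller's value */

	for (size_t i = 0; i < n; i++) {
		/* Skip instructions whose profiling flag is not enabled. */
		if (seq[i].profile_mask &&
		    !(seq[i].profile_mask & enabled_mask))
			continue;
		out->buffer[out->count++] = seq[i].instr;
	}

	padded = (out->count + INSTRS_PER_CACHE_LINE - 1) &
		 ~(INSTRS_PER_CACHE_LINE - 1);
	memset(&out->buffer[out->count], 0,
	       (padded - out->count) * sizeof(out->buffer[0]));
	out->count = padded;
	return 0;
}
```

Resetting the field in one place keeps the invariant next to the code that depends on it, which is why it covers both call sites mentioned in the review.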

Re: [PATCH v7 1/5] drm/panthor: introduce job cycle and timestamp accounting

by Steven Price

On 21/09/2024 00:43, Adrián Larumbe wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before and
> after a user CS.
>
> A separate kernel BO is created per queue to store those values. Jobs can
> access their sampled data through an index different from that of the
> queue's ringbuffer. The reason for this is saving memory on the profiling
> information kernel BO, since the amount of simultaneous profiled jobs we
> can write into the queue's ringbuffer might be much smaller than for
> regular jobs, as the former take more CSF instructions.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> A profile mode mask has been added that will in a future commit allow UM to
> toggle performance metric sampling behaviour, which is disabled by default
> to save power. When a ringbuffer CS is constructed, timestamp and cycling
> sampling instructions are added depending on the enabled flags in the
> profiling mask.
>
> A helper was provided that calculates the number of instructions for a
> given set of enablement mask, and these are passed as the number of credits
> when initialising a DRM scheduler job.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe(a)collabora.com>
> Reviewed-by: Boris Brezillon <boris.brezillon(a)collabora.com>
> Reviewed-by: Liviu Dudau <liviu.dudau(a)arm.com>

I think just one bug remaining - see below...

> ---
>  drivers/gpu/drm/panthor/panthor_device.h |  22 ++
>  drivers/gpu/drm/panthor/panthor_sched.c  | 328 +++++++++++++++++++----
>  2 files changed, 301 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index e388c0472ba7..a48e30d0af30 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -66,6 +66,25 @@ struct panthor_irq {
>  	atomic_t suspended;
>  };
>
> +/**
> + * enum panthor_device_profiling_mode - Profiling state
> + */
> +enum panthor_device_profiling_flags {
> +	/** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
> +	PANTHOR_DEVICE_PROFILING_DISABLED = 0,
> +
> +	/** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
> +	PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
> +
> +	/** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
> +	PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
> +
> +	/** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
> +	PANTHOR_DEVICE_PROFILING_ALL =
> +	PANTHOR_DEVICE_PROFILING_CYCLES |
> +	PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> +};
> +
>  /**
>   * struct panthor_device - Panthor device
>   */
> @@ -162,6 +181,9 @@ struct panthor_device {
>  		 */
>  		struct page *dummy_latest_flush;
>  	} pm;
> +
> +	/** @profile_mask: User-set profiling flags for job accounting. */
> +	u32 profile_mask;
>  };
>
>  /**
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 42afdf0ddb7e..6da5c3d0015e 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
>  #define MIN_CSGS				3
>  #define MAX_CSG_PRIO				0xf
>
> +#define NUM_INSTRS_PER_CACHE_LINE		(64 / sizeof(u64))
> +#define MAX_INSTRS_PER_JOB			24
> +
>  struct panthor_group;
>
>  /**
> @@ -476,6 +479,18 @@ struct panthor_queue {
>  		 */
>  		struct list_head in_flight_jobs;
>  	} fence_ctx;
> +
> +	/** @profiling: Job profiling data slots and access information. */
> +	struct {
> +		/** @slots: Kernel BO holding the slots. */
> +		struct panthor_kernel_bo *slots;
> +
> +		/** @slot_count: Number of jobs ringbuffer can hold at once. */
> +		u32 slot_count;
> +
> +		/** @seqno: Index of the next available profiling information slot. */
> +		u32 seqno;
> +	} profiling;
>  };
>
>  /**
> @@ -661,6 +676,18 @@ struct panthor_group {
>  	struct list_head wait_node;
>  };
>
> +struct panthor_job_profiling_data {
> +	struct {
> +		u64 before;
> +		u64 after;
> +	} cycles;
> +
> +	struct {
> +		u64 before;
> +		u64 after;
> +	} time;
> +};
> +
>  /**
>   * group_queue_work() - Queue a group work
>   * @group: Group to queue the work for.
> @@ -774,6 +801,15 @@ struct panthor_job {
>
>  	/** @done_fence: Fence signaled when the job is finished or cancelled. */
>  	struct dma_fence *done_fence;
> +
> +	/** @profiling: Job profiling information. */
> +	struct {
> +		/** @mask: Current device job profiling enablement bitmask. */
> +		u32 mask;
> +
> +		/** @slot: Job index in the profiling slots BO. */
> +		u32 slot;
> +	} profiling;
>  };
>
>  static void
> @@ -838,6 +874,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
>
>  	panthor_kernel_bo_destroy(queue->ringbuf);
>  	panthor_kernel_bo_destroy(queue->iface.mem);
> +	panthor_kernel_bo_destroy(queue->profiling.slots);
>
>  	/* Release the last_fence we were holding, if any. */
>  	dma_fence_put(queue->fence_ctx.last_fence);
> @@ -1982,8 +2019,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
>  	}
>  }
>
> -#define NUM_INSTRS_PER_SLOT		16
> -
>  static void
>  group_term_post_processing(struct panthor_group *group)
>  {
> @@ -2815,65 +2850,192 @@ static void group_sync_upd_work(struct work_struct *work)
>  	group_put(group);
>  }
>
> -static struct dma_fence *
> -queue_run_job(struct drm_sched_job *sched_job)
> +struct panthor_job_ringbuf_instrs {
> +	u64 buffer[MAX_INSTRS_PER_JOB];
> +	u32 count;
> +};
> +
> +struct panthor_job_instr {
> +	u32 profile_mask;
> +	u64 instr;
> +};
> +
> +#define JOB_INSTR(__prof, __instr) \
> +	{ \
> +		.profile_mask = __prof, \
> +		.instr = __instr, \
> +	}
> +
> +static void
> +copy_instrs_to_ringbuf(struct panthor_queue *queue,
> +		       struct panthor_job *job,
> +		       struct panthor_job_ringbuf_instrs *instrs)
> +{
> +	u64 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> +	u64 start = job->ringbuf.start & (ringbuf_size - 1);
> +	u64 size, written;
> +
> +	/*
> +	 * We need to write a whole slot, including any trailing zeroes
> +	 * that may come at the end of it. Also, because instrs.buffer has
> +	 * been zero-initialised, there's no need to pad it with 0's
> +	 */
> +	instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> +	size = instrs->count * sizeof(u64);
> +	WARN_ON(size > ringbuf_size);
> +	written = min(ringbuf_size - start, size);
> +
> +	memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
> +
> +	if (written < size)
> +		memcpy(queue->ringbuf->kmap,
> +		       &instrs->buffer[written/sizeof(u64)],
> +		       size - written);
> +}
> +
> +struct panthor_job_cs_params {
> +	u32 profile_mask;
> +	u64 addr_reg; u64 val_reg;
> +	u64 cycle_reg; u64 time_reg;
> +	u64 sync_addr; u64 times_addr;
> +	u64 cs_start; u64 cs_size;
> +	u32 last_flush; u32 waitall_mask;
> +};
> +
> +static void
> +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
>  {
> -	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
>  	struct panthor_group *group = job->group;
>  	struct panthor_queue *queue = group->queues[job->queue_idx];
>  	struct panthor_device *ptdev = group->ptdev;
>  	struct panthor_scheduler *sched = ptdev->scheduler;
> -	u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> -	u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> -	u64 addr_reg = ptdev->csif_info.cs_reg_count -
> -		       ptdev->csif_info.unpreserved_cs_reg_count;
> -	u64 val_reg = addr_reg + 2;
> -	u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> -			job->queue_idx * sizeof(struct panthor_syncobj_64b);
> -	u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> -	struct dma_fence *done_fence;
> -	int ret;
>
> -	u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> -		/* MOV32 rX+2, cs.latest_flush */
> -		(2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> +	params->addr_reg = ptdev->csif_info.cs_reg_count -
> +			   ptdev->csif_info.unpreserved_cs_reg_count;
> +	params->val_reg = params->addr_reg + 2;
> +	params->cycle_reg = params->addr_reg;
> +	params->time_reg = params->val_reg;
>
> -		/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> -		(36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> +	params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> +			    job->queue_idx * sizeof(struct panthor_syncobj_64b);
> +	params->times_addr = panthor_kernel_bo_gpuva(queue->profiling.slots) +
> +			     (job->profiling.slot * sizeof(struct panthor_job_profiling_data));
> +	params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
>
> -		/* MOV48 rX:rX+1, cs.start */
> -		(1ull << 56) | (addr_reg << 48) | job->call_info.start,
> +	params->cs_start = job->call_info.start;
> +	params->cs_size = job->call_info.size;
> +	params->last_flush = job->call_info.latest_flush;
>
> -		/* MOV32 rX+2, cs.size */
> -		(2ull << 56) | (val_reg << 48) | job->call_info.size,
> +	params->profile_mask = job->profiling.mask;
> +}
>
> -		/* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> -		(3ull << 56) | (1 << 16),
> +#define JOB_INSTR_ALWAYS(instr) \
> +	JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (instr))
> +#define JOB_INSTR_TIMESTAMP(instr) \
> +	JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP, (instr))
> +#define JOB_INSTR_CYCLES(instr) \
> +	JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES, (instr))
>
> +static void
> +prepare_job_instrs(const struct panthor_job_cs_params *params,
> +		   struct panthor_job_ringbuf_instrs *instrs)
> +{
> +	const struct panthor_job_instr instr_seq[] = {
> +		/* MOV32 rX+2, cs.latest_flush */
> +		JOB_INSTR_ALWAYS((2ull << 56) | (params->val_reg << 48) | params->last_flush),
> +		/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> +		JOB_INSTR_ALWAYS((36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
> +		/* MOV48 rX:rX+1, cycles_offset */
> +		JOB_INSTR_CYCLES((1ull << 56) | (params->cycle_reg << 48) |
> +				 (params->times_addr + offsetof(struct panthor_job_profiling_data, cycles.before))),
> +		/* STORE_STATE cycles */
> +		JOB_INSTR_CYCLES((40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> +		/* MOV48 rX:rX+1, time_offset */
> +		JOB_INSTR_TIMESTAMP((1ull << 56) | (params->time_reg << 48) | (params->times_addr +
> +				    offsetof(struct panthor_job_profiling_data, time.before))),
> +		/* STORE_STATE timer */
> +		JOB_INSTR_TIMESTAMP((40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
> +		/* MOV48 rX:rX+1, cs.start */
> +		JOB_INSTR_ALWAYS((1ull << 56) | (params->addr_reg << 48) | params->cs_start),
> +		/* MOV32 rX+2, cs.size */
> +		JOB_INSTR_ALWAYS((2ull << 56) | (params->val_reg << 48) | params->cs_size),
> +		/* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> +		JOB_INSTR_ALWAYS((3ull << 56) | (1 << 16)),
>  		/* CALL rX:rX+1, rX+2 */
> -		(32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> -
> +		JOB_INSTR_ALWAYS((32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
> +		/* MOV48 rX:rX+1, cycles_offset */
> +		JOB_INSTR_CYCLES((1ull << 56) | (params->cycle_reg << 48) |
> +				 (params->times_addr + offsetof(struct panthor_job_profiling_data, cycles.after))),
> +		/* STORE_STATE cycles */
> +		JOB_INSTR_CYCLES((40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> +		/* MOV48 rX:rX+1, time_offset */
> +		JOB_INSTR_TIMESTAMP((1ull << 56) | (params->time_reg << 48) |
> +				    (params->times_addr + offsetof(struct panthor_job_profiling_data, time.after))),
> +		/* STORE_STATE timer */
> +		JOB_INSTR_TIMESTAMP((40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
>  		/* MOV48 rX:rX+1, sync_addr */
> -		(1ull << 56) | (addr_reg << 48) | sync_addr,
> -
> +		JOB_INSTR_ALWAYS((1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
>  		/* MOV48 rX+2, #1 */
> -		(1ull << 56) | (val_reg << 48) | 1,
> -
> +		JOB_INSTR_ALWAYS((1ull << 56) | (params->val_reg << 48) | 1),
>  		/* WAIT(all) */
> -		(3ull << 56) | (waitall_mask << 16),
> -
> +		JOB_INSTR_ALWAYS((3ull << 56) | (params->waitall_mask << 16)),
>  		/* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> -		(51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> +		JOB_INSTR_ALWAYS((51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
> +				 (params->val_reg << 32) | (0 << 16) | 1),
> +		/* ERROR_BARRIER, so we can recover from faults at job boundaries. */
> +		JOB_INSTR_ALWAYS((47ull << 56)),
> +	};
> +	u32 pad;
>
> -		/* ERROR_BARRIER, so we can recover from faults at job
> -		 * boundaries.
> -		 */
> -		(47ull << 56),
> +	/* NEED to be cacheline aligned to please the prefetcher. */
> +	static_assert(sizeof(instrs->buffer) % 64 == 0,
> +		      "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
> +
> +	/* Make sure we have enough storage to store the whole sequence. */
> +	static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) ==
> +		      ARRAY_SIZE(instrs->buffer),
> +		      "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
> +
> +	for (u32 i = 0; i < ARRAY_SIZE(instr_seq); i++) {
> +		/* If the profile mask of this instruction is not enabled, skip it. */
> +		if (instr_seq[i].profile_mask &&
> +		    !(instr_seq[i].profile_mask & params->profile_mask))
> +			continue;
> +
> +		instrs->buffer[instrs->count++] = instr_seq[i].instr;
> +	}
> +
> +	pad = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> +	memset(&instrs->buffer[instrs->count], 0,
> +	       (pad - instrs->count) * sizeof(instrs->buffer[0]));
> +	instrs->count = pad;
> +}
> +
> +static u32 calc_job_credits(u32 profile_mask)
> +{
> +	struct panthor_job_ringbuf_instrs instrs = {
> +		.count = 0,
> +	};
> +	struct panthor_job_cs_params params = {
> +		.profile_mask = profile_mask,
>  	};
>
> -	/* Need to be cacheline aligned to please the prefetcher. */
> -	static_assert(sizeof(call_instrs) % 64 == 0,
> -		      "call_instrs is not aligned on a cacheline");
> +	prepare_job_instrs(&params, &instrs);
> +	return instrs.count;
> +}
> +
> +static struct dma_fence *
> +queue_run_job(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +	struct panthor_group *group = job->group;
> +	struct panthor_queue *queue = group->queues[job->queue_idx];
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_job_ringbuf_instrs instrs;

instrs isn't initialised...

> +	struct panthor_job_cs_params cs_params;
> +	struct dma_fence *done_fence;
> +	int ret;
>
>  	/* Stream size is zero, nothing to do except making sure all previously
>  	 * submitted jobs are done before we signal the
> @@ -2900,17 +3062,23 @@ queue_run_job(struct drm_sched_job *sched_job)
>  		       queue->fence_ctx.id,
>  		       atomic64_inc_return(&queue->fence_ctx.seqno));
>
> -	memcpy(queue->ringbuf->kmap + ringbuf_insert,
> -	       call_instrs, sizeof(call_instrs));
> +	job->profiling.slot = queue->profiling.seqno++;
> +	if (queue->profiling.seqno == queue->profiling.slot_count)
> +		queue->profiling.seqno = 0;
> +
> +	job->ringbuf.start = queue->iface.input->insert;
> +
> +	get_job_cs_params(job, &cs_params);
> +	prepare_job_instrs(&cs_params, &instrs);

...but it's passed into prepare_job_instrs() which depends on
instrs.count (same bug as was in calc_job_credits()) - sorry I didn't
spot it last review.

Initializing instrs makes everything work for me. I'm not sure quite
what kernel configuration you are using but I wonder if you've got a
'hardening' option enabled which is causing the stack to be
zero-initialised. It's worth turning it off for testing purposes ;)

Steve

> +	copy_instrs_to_ringbuf(queue, job, &instrs);
> +
> +	job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
>
>  	panthor_job_get(&job->base);
>  	spin_lock(&queue->fence_ctx.lock);
>  	list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
>  	spin_unlock(&queue->fence_ctx.lock);
>
> -	job->ringbuf.start = queue->iface.input->insert;
> -	job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> -
>  	/* Make sure the ring buffer is updated before the INSERT
>  	 * register.
>  	 */
> @@ -3003,6 +3171,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>  	.free_job = queue_free_job,
>  };
>
> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> +					    u32 cs_ringbuf_size)
> +{
> +	u32 min_profiled_job_instrs = U32_MAX;
> +	u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
> +
> +	/*
> +	 * We want to calculate the minimum size of a profiled job's CS,
> +	 * because since they need additional instructions for the sampling
> +	 * of performance metrics, they might take up further slots in
> +	 * the queue's ringbuffer. This means we might not need as many job
> +	 * slots for keeping track of their profiling information. What we
> +	 * need is the maximum number of slots we should allocate to this end,
> +	 * which matches the maximum number of profiled jobs we can place
> +	 * simultaneously in the queue's ring buffer.
> +	 * That has to be calculated separately for every single job profiling
> +	 * flag, but not in the case job profiling is disabled, since unprofiled
> +	 * jobs don't need to keep track of this at all.
> +	 */
> +	for (u32 i = 0; i < last_flag; i++) {
> +		if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
> +			min_profiled_job_instrs =
> +				min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> +	}
> +
> +	return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> +}
> +
>  static struct panthor_queue *
>  group_create_queue(struct panthor_group *group,
>  		   const struct drm_panthor_queue_create *args)
> @@ -3056,9 +3252,35 @@ group_create_queue(struct panthor_group *group,
>  		goto err_free_queue;
>  	}
>
> +	queue->profiling.slot_count =
> +		calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
> +
> +	queue->profiling.slots =
> +		panthor_kernel_bo_create(group->ptdev, group->vm,
> +					 queue->profiling.slot_count *
> +					 sizeof(struct panthor_job_profiling_data),
> +					 DRM_PANTHOR_BO_NO_MMAP,
> +					 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> +					 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +					 PANTHOR_VM_KERNEL_AUTO_VA);
> +
> +	if (IS_ERR(queue->profiling.slots)) {
> +		ret = PTR_ERR(queue->profiling.slots);
> +		goto err_free_queue;
> +	}
> +
> +	ret = panthor_kernel_bo_vmap(queue->profiling.slots);
> +	if (ret)
> +		goto err_free_queue;
> +
> +	/*
> +	 * Credit limit argument tells us the total number of instructions
> +	 * across all CS slots in the ringbuffer, with some jobs requiring
> +	 * twice as many as others, depending on their profiling status.
> +	 */
>  	ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>  			     group->ptdev->scheduler->wq, 1,
> -			     args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> +			     args->ringbuf_size / sizeof(u64),
>  			     0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>  			     group->ptdev->reset.wq,
>  			     NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3354,6 +3576,7 @@ panthor_job_create(struct panthor_file *pfile,
>  {
>  	struct panthor_group_pool *gpool = pfile->groups;
>  	struct panthor_job *job;
> +	u32 credits;
>  	int ret;
>
>  	if (qsubmit->pad)
> @@ -3407,9 +3630,16 @@ panthor_job_create(struct panthor_file *pfile,
>  		}
>  	}
>
> +	job->profiling.mask = pfile->ptdev->profile_mask;
> +	credits = calc_job_credits(job->profiling.mask);
> +	if (credits == 0) {
> +		ret = -EINVAL;
> +		goto err_put_job;
> +	}
> +
>  	ret = drm_sched_job_init(&job->base,
>  				 &job->group->queues[job->queue_idx]->entity,
> -				 1, job->group);
> +				 credits, job->group);
>  	if (ret)
>  		goto err_put_job;
>

1 year, 9 months
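
The wrap-around copy performed by copy_instrs_to_ringbuf() in the patch above follows a common ring-buffer pattern, which can be sketched in standalone C. This is a minimal illustration under stated assumptions: the function name `ring_copy` and its byte-oriented signature are invented for this example, the ring size is assumed to be a power of two, and the caller is assumed to guarantee that the payload is no larger than the ring.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy 'size' bytes from 'src' into a ring of 'ring_size' bytes, starting
 * at the free-running offset 'insert'. ring_size must be a power of two so
 * that masking wraps the offset; size must not exceed ring_size.
 */
void ring_copy(uint8_t *ring, uint64_t ring_size, uint64_t insert,
	       const void *src, uint64_t size)
{
	uint64_t start = insert & (ring_size - 1);
	uint64_t room = ring_size - start;		/* bytes before the end */
	uint64_t first = size < room ? size : room;

	/* First chunk: from the insert point up to the end of the ring. */
	memcpy(ring + start, src, first);

	/* Second chunk, if any: wrap to the beginning of the ring. */
	if (first < size)
		memcpy(ring, (const uint8_t *)src + first, size - first);
}
```

Because job sizes now vary with the profiling mask, a job is no longer guaranteed to end exactly at the ring boundary, which is presumably why the single memcpy() of the old code had to become a two-part copy.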

Re: [PATCH v7 2/5] drm/panthor: record current and maximum device clock frequencies

by Steven Price

On 21/09/2024 00:43, Adrián Larumbe wrote:
> In order to support UM in calculating rates of GPU utilisation, the current
> operating and maximum GPU clock frequencies must be recorded during device
> initialisation, and also during OPP state transitions.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe(a)collabora.com>

I thought I gave my r-b on v6 and I can't actually see any change:

Reviewed-by: Steven Price <steven.price(a)arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_devfreq.c | 18 +++++++++++++++++-
>  drivers/gpu/drm/panthor/panthor_device.h  |  6 ++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index c6d3c327cc24..9d0f891b9b53 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
>  static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
>  				  u32 flags)
>  {
> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
>  	struct dev_pm_opp *opp;
> +	int err;
>
>  	opp = devfreq_recommended_opp(dev, freq, flags);
>  	if (IS_ERR(opp))
>  		return PTR_ERR(opp);
>  	dev_pm_opp_put(opp);
>
> -	return dev_pm_opp_set_rate(dev, *freq);
> +	err = dev_pm_opp_set_rate(dev, *freq);
> +	if (!err)
> +		ptdev->current_frequency = *freq;
> +
> +	return err;
>  }
>
>  static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
> @@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>  	struct panthor_devfreq *pdevfreq;
>  	struct dev_pm_opp *opp;
>  	unsigned long cur_freq;
> +	unsigned long freq = ULONG_MAX;
>  	int ret;
>
>  	pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
> @@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>  		return PTR_ERR(opp);
>
>  	panthor_devfreq_profile.initial_freq = cur_freq;
> +	ptdev->current_frequency = cur_freq;
>
>  	/* Regulator coupling only takes care of synchronizing/balancing voltage
>  	 * updates, but the coupled regulator needs to be enabled manually.
> @@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>
>  	dev_pm_opp_put(opp);
>
> +	/* Find the fastest defined rate */
> +	opp = dev_pm_opp_find_freq_floor(dev, &freq);
> +	if (IS_ERR(opp))
> +		return PTR_ERR(opp);
> +	ptdev->fast_rate = freq;
> +
> +	dev_pm_opp_put(opp);
> +
>  	/*
>  	 * Setup default thresholds for the simple_ondemand governor.
>  	 * The values are chosen based on experiments.
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index a48e30d0af30..2109905813e8 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -184,6 +184,12 @@ struct panthor_device {
>
>  	/** @profile_mask: User-set profiling flags for job accounting. */
>  	u32 profile_mask;
> +
> +	/** @current_frequency: Device clock frequency at present. Set by DVFS*/
> +	unsigned long current_frequency;
> +
> +	/** @fast_rate: Maximum device clock frequency. Set by DVFS */
> +	unsigned long fast_rate;
>  };
>
>  /**

1 year, 9 months
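
The commit message above says the recorded frequencies are meant to let userspace calculate GPU utilisation rates, but it does not say how. One plausible way a tool might combine a cycle-count delta with a clock frequency is sketched below. Everything here is an assumption made for illustration: the function name `gpu_busy_fraction`, its parameters, and the formula itself are not taken from the patch or from any existing tool.

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper: fraction of the available GPU cycles that
 * were actually consumed over a sampling interval.
 *
 * cycles_delta: GPU cycles accounted to the client during the interval
 * elapsed_ns:   wall-clock length of the interval, in nanoseconds
 * freq_hz:      reference clock frequency (e.g. the maximum frequency)
 */
double gpu_busy_fraction(uint64_t cycles_delta, uint64_t elapsed_ns,
			 uint64_t freq_hz)
{
	double available;

	if (elapsed_ns == 0 || freq_hz == 0)
		return 0.0;

	/* Cycles the GPU could have executed at freq_hz in this interval. */
	available = (double)freq_hz * ((double)elapsed_ns / 1e9);

	return (double)cycles_delta / available;
}
```

For example, 500 million cycles over one second against a 1 GHz reference frequency gives a busy fraction of 0.5. Using the maximum frequency as the reference expresses load relative to peak capacity, whereas the current frequency would express it relative to the present operating point.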
