Hi Qu,
On 02.04.21 at 05:18, Qu Huang wrote:
> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
> the bo->base.resv lock may be held by ttm_mem_evict_first(),
That can't happen: by the time bo_release_notify is called, the BO has no
more references and is therefore deleted.
And we never evict a deleted BO, we just wait for it to become idle.
Regards,
Christian.
> and the VRAM mem will be evicted, mem region was replaced
> by Gtt mem region. amdgpu_bo_release_notify() will then
> hold the bo->base.resv lock, and SDMA will get an invalid
> address in amdgpu_fill_buffer(), resulting in a VMFAULT
> or memory corruption.
>
> To avoid it, we have to hold bo->base.resv lock first, and
> check whether the mem.mem_type is TTM_PL_VRAM.
>
> Signed-off-by: Qu Huang <jinsdb(a)126.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 4b29b82..8018574 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
> if (bo->base.resv == &bo->base._resv)
> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>
> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
> return;
>
> dma_resv_lock(bo->base.resv, NULL);
>
> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
> + dma_resv_unlock(bo->base.resv);
> + return;
> + }
> +
> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
> if (!WARN_ON(r)) {
> amdgpu_bo_fence(abo, fence, false);
> --
> 1.8.3.1
>
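Editorial sketch of the pattern the patch above proposes: test the immutable flag without the lock, take the lock, and only then test state that another lock holder may change. This is a userspace model using pthreads, not the amdgpu code; struct buffer, enum placement, buffer_init() and release_notify() are hypothetical stand-ins. It says nothing about whether the race can actually occur, which is the point Christian disputes in his reply.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; not the real TTM structures. */
enum placement { PL_SYSTEM, PL_GTT, PL_VRAM };

struct buffer {
	pthread_mutex_t lock;     /* plays the role of bo->base.resv */
	enum placement placement; /* plays the role of bo->mem.mem_type */
	bool wipe_on_release;     /* set at creation, never changed afterwards */
	int wiped;                /* counts how often the wipe ran */
};

void buffer_init(struct buffer *b, enum placement p, bool wipe)
{
	pthread_mutex_init(&b->lock, NULL);
	b->placement = p;
	b->wipe_on_release = wipe;
	b->wiped = 0;
}

/* Returns true if the buffer was wiped. */
bool release_notify(struct buffer *b)
{
	bool did_wipe = false;

	/* Immutable after creation, so this test needs no lock. */
	if (!b->wipe_on_release)
		return false;

	pthread_mutex_lock(&b->lock);
	/* Placement may change under another lock holder: test it only now. */
	if (b->placement == PL_VRAM) {
		b->wiped++;
		did_wipe = true;
	}
	pthread_mutex_unlock(&b->lock);
	return did_wipe;
}
```

Calling release_notify() on a buffer whose placement has moved away from PL_VRAM returns false and leaves the wipe counter untouched.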
On Thu, Apr 1, 2021 at 8:34 AM Doug Anderson <dianders(a)chromium.org> wrote:
>
> Hi,
>
> On Wed, Mar 31, 2021 at 6:24 PM Rob Clark <robdclark(a)gmail.com> wrote:
> >
> > @@ -45,6 +30,9 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
> > list_for_each_entry(msm_obj, &priv->inactive_dontneed, mm_list) {
> > if (freed >= sc->nr_to_scan)
> > break;
> > + /* Use trylock, because we cannot block on a obj that
> > + * might be trying to acquire mm_lock
> > + */
>
> nit: I thought the above multi-line commenting style was only for
> "net" subsystem?
we do use the "net" style a fair bit already.. (OTOH I tend to not
really care what checkpatch says)
> > if (!msm_gem_trylock(&msm_obj->base))
> > continue;
> > if (is_purgeable(msm_obj)) {
> > @@ -56,8 +44,11 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
> >
> > mutex_unlock(&priv->mm_lock);
> >
> > - if (freed > 0)
> > + if (freed > 0) {
> > trace_msm_gem_purge(freed << PAGE_SHIFT);
> > + } else {
> > + return SHRINK_STOP;
> > + }
>
> It probably doesn't matter, but I wonder if we should still be
> returning SHRINK_STOP if we got any trylock failures. It could
> possibly be worth returning 0 in that case?
On the surface, you'd think that, but there be mm dragons.. we can hit
the shrinker from the submit path when the obj is already locked and we
are trying to allocate backing pages. We don't want to tell vmscan to
keep trying, because we'll keep failing to grab that object's lock.
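Editorial sketch of the behaviour described above: a scan loop that skips any object it cannot lock without blocking, and reports "stop" rather than "retry" when nothing was reclaimed. This is a userspace model built on C11 atomic_flag, not the msm driver code; struct obj, obj_init(), obj_trylock(), obj_unlock(), scan() and SCAN_STOP are hypothetical names, with SCAN_STOP standing in for SHRINK_STOP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SCAN_STOP (~0UL) /* hypothetical stand-in for SHRINK_STOP */

struct obj {
	atomic_flag locked;  /* set while some thread holds the object */
	int purgeable;
	unsigned long pages;
};

void obj_init(struct obj *o, int purgeable, unsigned long pages)
{
	atomic_flag_clear(&o->locked);
	o->purgeable = purgeable;
	o->pages = pages;
}

/* Take the object lock without blocking; returns 1 on success. */
int obj_trylock(struct obj *o)
{
	return !atomic_flag_test_and_set(&o->locked);
}

void obj_unlock(struct obj *o)
{
	atomic_flag_clear(&o->locked);
}

unsigned long scan(struct obj *objs, size_t n, unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	for (size_t i = 0; i < n && freed < nr_to_scan; i++) {
		if (!obj_trylock(&objs[i]))
			continue; /* held elsewhere, possibly by our own caller */
		if (objs[i].purgeable) {
			freed += objs[i].pages;
			objs[i].pages = 0;
			objs[i].purgeable = 0;
		}
		obj_unlock(&objs[i]);
	}
	/* Nothing reclaimed: ask the caller to stop instead of retrying. */
	return freed ? freed : SCAN_STOP;
}
```

If the caller itself holds the only purgeable object, every retry would fail the same trylock, which is why the model returns SCAN_STOP rather than 0 in that case.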
>
> > @@ -75,6 +66,9 @@ vmap_shrink(struct list_head *mm_list)
> > unsigned unmapped = 0;
> >
> > list_for_each_entry(msm_obj, mm_list, mm_list) {
> > + /* Use trylock, because we cannot block on a obj that
> > + * might be trying to acquire mm_lock
> > + */
>
> If you end up changing the commenting style above, should also be here.
>
> At this point this seems fine to land to me. Though I'm not an expert
> on every interaction in this code, I've spent enough time staring at
> it that I'm comfortable with:
>
> Reviewed-by: Douglas Anderson <dianders(a)chromium.org>
thanks
BR,
-R
From: Rob Clark <robdclark(a)chromium.org>
I've been spending some time looking into how things behave under high
memory pressure. The first patch is a random cleanup I noticed along
the way. The second improves the situation significantly when we are
getting shrinker called from many threads in parallel. And the last
two are $debugfs/gem fixes I needed so I could monitor the state of GEM
objects (i.e. how many are active/purgeable/purged) while triggering high
memory pressure.
We could probably go a bit further with dropping the mm_lock in the
shrinker->scan() loop, but this is already a pretty big improvement.
The next step is probably actually to add support to unpin/evict
inactive objects. (We are part way there since we have already
decoupled the iova lifetime from the pages lifetime, but there are a
few sharp corners to work through.)
Rob Clark (4):
drm/msm: Remove unused freed llist node
drm/msm: Avoid mutex in shrinker_count()
drm/msm: Fix debugfs deadlock
drm/msm: Improved debugfs gem stats
drivers/gpu/drm/msm/msm_debugfs.c | 14 ++---
drivers/gpu/drm/msm/msm_drv.c | 4 ++
drivers/gpu/drm/msm/msm_drv.h | 15 ++++--
drivers/gpu/drm/msm/msm_fb.c | 3 +-
drivers/gpu/drm/msm/msm_gem.c | 65 ++++++++++++++++++-----
drivers/gpu/drm/msm/msm_gem.h | 72 +++++++++++++++++++++++---
drivers/gpu/drm/msm/msm_gem_shrinker.c | 28 ++++------
7 files changed, 150 insertions(+), 51 deletions(-)
--
2.30.2
Applied. Thanks!
Alex
On Thu, Mar 25, 2021 at 5:26 AM Nirmoy <nirmodas(a)amd.com> wrote:
>
>
> Reviewed-by: Nirmoy Das <nirmoy.das(a)amd.com>
>
> On 3/25/21 9:53 AM, Bhaskar Chowdhury wrote:
> > s/acccess/access/
> > s/inferface/interface/
> > s/sequnce/sequence/ .....two different places.
> > s/retrive/retrieve/
> > s/sheduling/scheduling/
> > s/independant/independent/
> > s/wether/whether/ ......two different places.
> > s/emmit/emit/
> > s/synce/sync/
> >
> >
> > Signed-off-by: Bhaskar Chowdhury <unixbhaskar(a)gmail.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 22 +++++++++++-----------
> > 1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> > index a368724c3dfc..4502b95ddf6b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> > @@ -1877,7 +1877,7 @@ static void gfx_v7_0_init_compute_vmid(struct amdgpu_device *adev)
> > mutex_unlock(&adev->srbm_mutex);
> >
> > /* Initialize all compute VMIDs to have no GDS, GWS, or OA
> > - acccess. These should be enabled by FW for target VMIDs. */
> > + access. These should be enabled by FW for target VMIDs. */
> > for (i = adev->vm_manager.first_kfd_vmid; i < AMDGPU_NUM_VMID; i++) {
> > WREG32(amdgpu_gds_reg_offset[i].mem_base, 0);
> > WREG32(amdgpu_gds_reg_offset[i].mem_size, 0);
> > @@ -2058,7 +2058,7 @@ static void gfx_v7_0_constants_init(struct amdgpu_device *adev)
> > * @adev: amdgpu_device pointer
> > *
> > * Set up the number and offset of the CP scratch registers.
> > - * NOTE: use of CP scratch registers is a legacy inferface and
> > + * NOTE: use of CP scratch registers is a legacy interface and
> > * is not used by default on newer asics (r6xx+). On newer asics,
> > * memory buffers are used for fences rather than scratch regs.
> > */
> > @@ -2172,7 +2172,7 @@ static void gfx_v7_0_ring_emit_vgt_flush(struct amdgpu_ring *ring)
> > * @seq: sequence number
> > * @flags: fence related flags
> > *
> > - * Emits a fence sequnce number on the gfx ring and flushes
> > + * Emits a fence sequence number on the gfx ring and flushes
> > * GPU caches.
> > */
> > static void gfx_v7_0_ring_emit_fence_gfx(struct amdgpu_ring *ring, u64 addr,
> > @@ -2215,7 +2215,7 @@ static void gfx_v7_0_ring_emit_fence_gfx(struct amdgpu_ring *ring, u64 addr,
> > * @seq: sequence number
> > * @flags: fence related flags
> > *
> > - * Emits a fence sequnce number on the compute ring and flushes
> > + * Emits a fence sequence number on the compute ring and flushes
> > * GPU caches.
> > */
> > static void gfx_v7_0_ring_emit_fence_compute(struct amdgpu_ring *ring,
> > @@ -2245,14 +2245,14 @@ static void gfx_v7_0_ring_emit_fence_compute(struct amdgpu_ring *ring,
> > * gfx_v7_0_ring_emit_ib - emit an IB (Indirect Buffer) on the ring
> > *
> > * @ring: amdgpu_ring structure holding ring information
> > - * @job: job to retrive vmid from
> > + * @job: job to retrieve vmid from
> > * @ib: amdgpu indirect buffer object
> > * @flags: options (AMDGPU_HAVE_CTX_SWITCH)
> > *
> > * Emits an DE (drawing engine) or CE (constant engine) IB
> > * on the gfx ring. IBs are usually generated by userspace
> > * acceleration drivers and submitted to the kernel for
> > - * sheduling on the ring. This function schedules the IB
> > + * scheduling on the ring. This function schedules the IB
> > * on the gfx ring for execution by the GPU.
> > */
> > static void gfx_v7_0_ring_emit_ib_gfx(struct amdgpu_ring *ring,
> > @@ -2402,7 +2402,7 @@ static int gfx_v7_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
> >
> > /*
> > * CP.
> > - * On CIK, gfx and compute now have independant command processors.
> > + * On CIK, gfx and compute now have independent command processors.
> > *
> > * GFX
> > * Gfx consists of a single ring and can process both gfx jobs and
> > @@ -2630,7 +2630,7 @@ static int gfx_v7_0_cp_gfx_resume(struct amdgpu_device *adev)
> > ring->wptr = 0;
> > WREG32(mmCP_RB0_WPTR, lower_32_bits(ring->wptr));
> >
> > - /* set the wb address wether it's enabled or not */
> > + /* set the wb address whether it's enabled or not */
> > rptr_addr = adev->wb.gpu_addr + (ring->rptr_offs * 4);
> > WREG32(mmCP_RB0_RPTR_ADDR, lower_32_bits(rptr_addr));
> > WREG32(mmCP_RB0_RPTR_ADDR_HI, upper_32_bits(rptr_addr) & 0xFF);
> > @@ -2985,7 +2985,7 @@ static void gfx_v7_0_mqd_init(struct amdgpu_device *adev,
> > mqd->cp_hqd_pq_wptr_poll_addr_lo = wb_gpu_addr & 0xfffffffc;
> > mqd->cp_hqd_pq_wptr_poll_addr_hi = upper_32_bits(wb_gpu_addr) & 0xffff;
> >
> > - /* set the wb address wether it's enabled or not */
> > + /* set the wb address whether it's enabled or not */
> > wb_gpu_addr = adev->wb.gpu_addr + (ring->rptr_offs * 4);
> > mqd->cp_hqd_pq_rptr_report_addr_lo = wb_gpu_addr & 0xfffffffc;
> > mqd->cp_hqd_pq_rptr_report_addr_hi =
> > @@ -3198,7 +3198,7 @@ static int gfx_v7_0_cp_resume(struct amdgpu_device *adev)
> > /**
> > * gfx_v7_0_ring_emit_vm_flush - cik vm flush using the CP
> > *
> > - * @ring: the ring to emmit the commands to
> > + * @ring: the ring to emit the commands to
> > *
> > * Sync the command pipeline with the PFP. E.g. wait for everything
> > * to be completed.
> > @@ -3220,7 +3220,7 @@ static void gfx_v7_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
> > amdgpu_ring_write(ring, 4); /* poll interval */
> >
> > if (usepfp) {
> > - /* synce CE with ME to prevent CE fetch CEIB before context switch done */
> > + /* sync CE with ME to prevent CE fetch CEIB before context switch done */
> > amdgpu_ring_write(ring, PACKET3(PACKET3_SWITCH_BUFFER, 0));
> > amdgpu_ring_write(ring, 0);
> > amdgpu_ring_write(ring, PACKET3(PACKET3_SWITCH_BUFFER, 0));
> > --
> > 2.30.1
> >
Hi,
On 25.03.21 at 09:17, Oleksandr Natalenko wrote:
> Hello.
>
> On Thu, Mar 25, 2021 at 07:57:33AM +0200, Ilkka Prusi wrote:
>> On 24.3.2021 16.16, Chris Rankin wrote:
>>> Hi,
>>>
>>> These warnings are not present in my dmesg log from 5.11.8:
>>>
>>> [ 43.390159] ------------[ cut here ]------------
>>> [ 43.393574] WARNING: CPU: 2 PID: 1268 at
>>> drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x172/0x282 [ttm]
>>> [ 43.401940] Modules linked in: nf_nat_ftp nf_conntrack_ftp cfg80211
>> Changing WARN_ON to WARN_ON_ONCE in drivers/gpu/drm/ttm/ttm_bo.c
>> ttm_bo_release() reduces the flood of messages into single splat.
>>
>> This warning appears to come from 57fcd550eb15bce ("drm/ttm: Warn on pinning
>> without holding a reference") and reverting it might be one choice.
>>
>>
>>> There are others, but I am assuming there is a common cause here.
>>>
>>> Cheers,
>>> Chris
>>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index a76eb2c14e8c..50b53355b265 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -514,7 +514,7 @@ static void ttm_bo_release(struct kref *kref)
>> * shrinkers, now that they are queued for
>> * destruction.
>> */
>> - if (WARN_ON(bo->pin_count)) {
>> + if (WARN_ON_ONCE(bo->pin_count)) {
>> bo->pin_count = 0;
>> ttm_bo_del_from_lru(bo);
>> ttm_bo_add_mem_to_lru(bo, &bo->mem);
>>
>>
>>
>> --
>> - Ilkka
>>
> WARN_ON_ONCE() will just hide the underlying problem. Do we know why
> this happens at all?
The patch was incorrectly backported to 5.11 without also backporting the
driver changes that keep this warning from triggering.
We are probably going to revert it for 5.11.10.
Regards,
Christian.
>
> Same for me, BTW, with v5.11.9:
>
> ```
> [~]> lspci | grep VGA
> 0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
>
> [ 3676.033140] ------------[ cut here ]------------
> [ 3676.033153] WARNING: CPU: 7 PID: 1318 at drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x375/0x500 [ttm]
> …
> [ 3676.033340] Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021
> …
> [ 3676.033469] Call Trace:
> [ 3676.033473] ttm_bo_move_accel_cleanup+0x1ab/0x3a0 [ttm]
> [ 3676.033478] amdgpu_bo_move+0x334/0x860 [amdgpu]
> [ 3676.033580] ttm_bo_validate+0x1f1/0x2d0 [ttm]
> [ 3676.033585] amdgpu_cs_bo_validate+0x9b/0x1c0 [amdgpu]
> [ 3676.033665] amdgpu_cs_list_validate+0x115/0x150 [amdgpu]
> [ 3676.033743] amdgpu_cs_ioctl+0x873/0x20a0 [amdgpu]
> [ 3676.033960] drm_ioctl_kernel+0xb8/0x140 [drm]
> [ 3676.033977] drm_ioctl+0x222/0x3c0 [drm]
> [ 3676.034071] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [ 3676.034145] __x64_sys_ioctl+0x83/0xb0
> [ 3676.034149] do_syscall_64+0x33/0x40
> …
> [ 3676.034171] ---[ end trace 66e9865b027112f3 ]---
> ```
>
> Thanks.
>
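Editorial sketch of the WARN_ON() versus WARN_ON_ONCE() point raised in this thread: the "once" variant throttles the report but still returns the condition, so the fixup path runs every time and the underlying problem stays. This is a userspace model, not the kernel macros; warn_on(), warn_on_once(), struct bo and bo_release() are hypothetical, and the per-call-site "warned" flag is passed explicitly instead of being hidden in a macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

int splats; /* number of warnings printed so far */

/* Model of WARN_ON(): report every time the condition is true. */
bool warn_on(bool cond, const char *what)
{
	if (cond) {
		splats++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Model of WARN_ON_ONCE(): same return value, at most one report. */
bool warn_on_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		splats++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

struct bo { int pin_count; };

void bo_release(struct bo *bo)
{
	static bool warned; /* one flag for this call site */

	/* The fixup below still runs on every bad release. */
	if (warn_on_once(bo->pin_count != 0, &warned, "pin_count != 0"))
		bo->pin_count = 0;
}
```

Releasing two still-pinned objects resets both pin counts but prints a single warning, which matches the observation that the change only reduces the flood of messages.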
On Tue, 23 March 2021 at 03:46, Jiapeng Chong
<jiapeng.chong(a)linux.alibaba.com> wrote:
>
> Fix the following coccicheck warnings:
>
> ./drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:622:2-8: WARNING: NULL
> check before some freeing functions is not needed.
>
> ./drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:618:2-8: WARNING: NULL
> check before some freeing functions is not needed.
>
> ./drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:616:2-8: WARNING: NULL
> check before some freeing functions is not needed.
>
> Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
> Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
Reviewed-by: Christian Gmeiner <christian.gmeiner(a)gmail.com>
--
greets
--
Christian Gmeiner, MSc
https://christian-gmeiner.info/privacypolicy
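Editorial sketch of what the coccicheck warning above is about: freeing functions that accept NULL make a preceding NULL test redundant. The example uses userspace free(), which the C standard defines as a no-op for NULL; the kernel's kfree() and kvfree() behave the same way. struct submit and its field names are hypothetical and are not taken from etnaviv_gem_submit.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container; field names are illustrative only. */
struct submit {
	int *bos;
	int *pmrs;
	int *stream;
};

/*
 * The flagged style would read "if (s->bos) free(s->bos);".
 * free(NULL) does nothing, so the test can simply be dropped.
 */
void submit_cleanup(struct submit *s)
{
	free(s->bos);
	free(s->pmrs);
	free(s->stream);
	/* Clear the pointers so a second cleanup call is harmless. */
	s->bos = NULL;
	s->pmrs = NULL;
	s->stream = NULL;
}
```

Calling submit_cleanup() on a structure where only some members were allocated, or calling it twice, is safe in this model.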
On 19.03.21 at 03:58, Wang Qing wrote:
> Using wake_up_process() is simpler and friendlier,
> and it is more convenient for analysis and statistics.
>
> Signed-off-by: Wang Qing <wangqing(a)vivo.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Should I pick it up or do you want to push it through some other tree
than DRM?
Thanks,
Christian.
> ---
> drivers/dma-buf/dma-fence.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 7475e09..de51326
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -655,7 +655,7 @@ dma_fence_default_wait_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
> struct default_wait_cb *wait =
> container_of(cb, struct default_wait_cb, base);
>
> - wake_up_state(wait->task, TASK_NORMAL);
> + wake_up_process(wait->task);
> }
>
> /**
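Editorial sketch of why the one-line change above should not alter behaviour: to the best of my knowledge both kernel helpers are thin wrappers around try_to_wake_up(), and wake_up_process() passes TASK_NORMAL as the state mask. The code is a userspace model only; struct task, try_wake() and the *_model() functions are hypothetical, while the TASK_* values follow include/linux/sched.h.

```c
#include <assert.h>

/* State bits as in include/linux/sched.h; the rest is a model. */
#define TASK_RUNNING         0x0000
#define TASK_INTERRUPTIBLE   0x0001
#define TASK_UNINTERRUPTIBLE 0x0002
#define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

struct task { unsigned int state; };

/* Model of try_to_wake_up(): wake only tasks sleeping in a masked state. */
int try_wake(struct task *t, unsigned int mask)
{
	if (!(t->state & mask))
		return 0;
	t->state = TASK_RUNNING;
	return 1;
}

/* Caller chooses the mask. */
int wake_up_state_model(struct task *t, unsigned int mask)
{
	return try_wake(t, mask);
}

/* Mask is fixed to TASK_NORMAL. */
int wake_up_process_model(struct task *t)
{
	return try_wake(t, TASK_NORMAL);
}
```

With TASK_NORMAL as the mask, the two model functions give the same result for any starting state.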
On 3/18/21 3:19 AM, Bhaskar Chowdhury wrote:
>
> s/bariers/barriers/
>
> Signed-off-by: Bhaskar Chowdhury <unixbhaskar(a)gmail.com>
Acked-by: Randy Dunlap <rdunlap(a)infradead.org>
> ---
> drivers/gpu/drm/i915/gt/intel_timeline.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 037b0e3ccbed..25fc7f44fee0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -435,7 +435,7 @@ void intel_timeline_exit(struct intel_timeline *tl)
> spin_unlock(&timelines->lock);
>
> /*
> - * Since this timeline is idle, all bariers upon which we were waiting
> + * Since this timeline is idle, all barriers upon which we were waiting
> * must also be complete and so we can discard the last used barriers
> * without loss of information.
> */
> --
> 2.26.2
>
--
~Randy