In the face of unprivileged userspace being able to submit bogus gpu
workloads the kernel needs gpu timeout and reset (tdr) to guarantee
that dma_fences actually complete. Annotate this worker to make sure
we don't have any accidental locking inversions or other problems
lurking.
Originally this was part of the overall scheduler annotation patch.
But amdgpu has some glorious inversions here:
- grabs console_lock
- does a full modeset, which grabs all kinds of locks
(drm_modeset_lock, dma_resv_lock) which can deadlock with
dma_fence_wait held inside them.
- almost minor at that point, but the modeset code also allocates
memory
These all look like they'll be very hard to fix properly, since the
hardware seems to require a full display reset with any gpu recovery.
Hence split out as a separate patch.
Since amdgpu isn't the only hardware driver that needs to reset the
display (at least gen2/3 on intel have the same problem) we need a
generic solution for this. There's two tricks we could still from
drm/i915 and lift to dma-fence:
- The big whack, aka force-complete all fences. i915 does this for all
pending jobs if the reset is somehow stuck. Trouble is we'd need to
do this for all fences in the entire system, and just the
book-keeping for that will be fun. Plus lots of drivers use fences
for all kinds of internal stuff like memory management, so
unconditionally resetting all of them doesn't work.
I'm also hoping that with these fence annotations we could enlist
lockdep in finding the last offenders causing deadlocks, and we
could remove this get-out-of-jail trick.
- The more feasible approach (across drivers at least as part of the
dma_fence contract) is what drm/i915 does for gen2/3: When we need
to reset the display we wake up all dma_fence_wait_interruptible
calls, or well at least the equivalent of those in i915 internally.
Relying on ioctl restart we force all other threads to release their
locks, which means the tdr thread is guaranteed to be able to get
them. I think we could implement this at the dma_fence level,
including proper lockdep annotations.
dma_fence_begin_tdr():
- must be nested within a dma_fence_begin/end_signalling section
- will wake up all interruptible (but not the non-interruptible)
dma_fence_wait() calls and force them to complete with a
-ERESTARTSYS errno code. All new interruptible calls to
dma_fence_wait() will immediately fail with the same error code.
dma_fence_end_tdr():
- this will convert dma_fence_wait() calls back to normal.
Of course interrupting dma_fence_wait is only ok if the caller
specified that, which means we need to split the annotations into
interruptible and non-interruptible version. If we then make sure
that we only use interruptible dma_fence_wait() calls while holding
drm_modeset_lock we can grab them in tdr code, and allow display
resets. Doing the same for dma_resv_lock might be a lot harder, so
buffer updates must be avoided.
What's worse, we're not going to be able to make the dma_fence_wait
calls in mmu-notifiers interruptible, that doesn't work. So
allocating memory still won't be allowed, even in tdr sections. Plus
obviously we can use this trick only in tdr, it is rather intrusive.
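The proposed begin/end_tdr semantics can be sketched as a tiny userspace
model. This is only an illustration of the rules listed above, not kernel
code: every model_* name is hypothetical, MODEL_ERESTARTSYS mirrors the
kernel-internal value 512, and MODEL_WOULD_BLOCK stands in for "the wait
would keep sleeping". It also assumes that a wait on an already signalled
fence still returns success during tdr, which the text above leaves open.

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-internal restart code, redefined because userspace errno.h lacks it. */
#define MODEL_ERESTARTSYS 512
/* Hypothetical marker: a real wait would keep blocking at this point. */
#define MODEL_WOULD_BLOCK 1

static bool model_tdr_active; /* true between begin_tdr and end_tdr */

void model_fence_begin_tdr(void)
{
	model_tdr_active = true;
}

void model_fence_end_tdr(void)
{
	assert(model_tdr_active); /* end without begin is a caller bug */
	model_tdr_active = false;
}

/*
 * Returns 0 if the fence has signalled, -MODEL_ERESTARTSYS if an
 * interruptible wait is kicked out by tdr, MODEL_WOULD_BLOCK otherwise.
 * Non-interruptible waits are never kicked out.
 */
int model_fence_wait(bool fence_signalled, bool interruptible)
{
	if (fence_signalled)
		return 0;
	if (interruptible && model_tdr_active)
		return -MODEL_ERESTARTSYS;
	return MODEL_WOULD_BLOCK;
}
```

In this model only the interruptible waiter backs out during tdr, which is
why any lock the tdr path wants to take must only ever be held across
interruptible waits.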
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 06a736e506ad..e34a44376e87 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -279,9 +279,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
{
struct drm_gpu_scheduler *sched;
struct drm_sched_job *job;
+ bool fence_cookie;
sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+ fence_cookie = dma_fence_begin_signalling();
+
/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
spin_lock(&sched->job_list_lock);
job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -313,6 +316,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
spin_lock(&sched->job_list_lock);
drm_sched_start_timeout(sched);
spin_unlock(&sched->job_list_lock);
+
+ dma_fence_end_signalling(fence_cookie);
}
/**
--
2.26.2
Trying to grab dma_resv_lock while in commit_tail before we've done
all the code that leads to the eventual signalling of the vblank event
(which can be a dma_fence) is deadlock-y. Don't do that.
Here the solution is easy because just grabbing locks to read
something races anyway. We don't need to bother, READ_ONCE is
equivalent. And avoids the locking issue.
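As a rough userspace illustration of why a single marked load is enough
here: the sketch below approximates the idea behind READ_ONCE with a
volatile access. MODEL_READ_ONCE, struct model_bo and
model_get_tiling_flags are hypothetical names for this example only, and
the macro relies on the GNU __typeof__ extension (gcc and clang).

```c
#include <assert.h>
#include <stdint.h>

/* Approximation of the READ_ONCE() idea: exactly one volatile load. */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct model_bo {
	uint64_t tiling_flags; /* may be updated concurrently elsewhere */
};

/* Snapshot the flags without taking any lock. */
uint64_t model_get_tiling_flags(struct model_bo *bo)
{
	assert(bo);
	return MODEL_READ_ONCE(bo->tiling_flags);
}
```

The value can be stale the moment it is returned, but that is equally true
of a value read under a lock that is dropped right afterwards.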
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c575e7394d03..04c11443b9ca 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6910,7 +6910,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
* explicitly on fences instead
* and in general should be called for
* blocking commit to as per framework helpers
+ *
+ * Yes, this deadlocks, since you're calling dma_resv_lock in a
+ * path that leads to a dma_fence_signal(). Don't do that.
*/
+#if 0
r = amdgpu_bo_reserve(abo, true);
if (unlikely(r != 0))
DRM_ERROR("failed to reserve buffer before flip\n");
@@ -6920,6 +6924,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
tmz_surface = amdgpu_bo_encrypted(abo);
amdgpu_bo_unreserve(abo);
+#endif
+ /*
+ * this races anyway, so READ_ONCE isn't any better or worse
+ * than the stuff above. Except the stuff above can deadlock.
+ */
+ tiling_flags = READ_ONCE(abo->tiling_flags);
fill_dc_plane_info_and_addr(
dm->adev, new_plane_state, tiling_flags,
--
2.26.2
My dma-fence lockdep annotations caught an inversion because we
allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0
amdgpu_fence_emit+0x30/0x330 [amdgpu]
amdgpu_ib_schedule+0x306/0x550 [amdgpu]
amdgpu_job_run+0x10f/0x260 [amdgpu]
drm_sched_main+0x1b9/0x490 [gpu_sched]
kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which
would be good enough for shrinkers. But for mmu_notifiers we actually
need !GFP_ATOMIC, since they can be called from any page laundering,
even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for
fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into
the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
the way.
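The preallocation idea mentioned above could look roughly like the
following userspace sketch: allocate the fence when the job is created
(where sleeping allocations are fine) and only initialise it in the run
path. All model_* names and both structs are made up for illustration and
do not match the amdgpu structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_fence {
	unsigned int seqno;
	bool emitted;
};

struct model_job {
	struct model_fence *fence; /* preallocated at submission time */
};

/* Submission path: allocations that may sleep are fine here. */
struct model_job *model_job_create(void)
{
	struct model_job *job = calloc(1, sizeof(*job));

	if (!job)
		return NULL;
	job->fence = calloc(1, sizeof(*job->fence));
	if (!job->fence) {
		free(job);
		return NULL;
	}
	return job;
}

/* Scheduler path: no allocation, only fill in the preallocated fence. */
struct model_fence *model_job_run(struct model_job *job, unsigned int seqno)
{
	assert(job && job->fence);
	job->fence->seqno = seqno;
	job->fence->emitted = true;
	return job->fence;
}

void model_job_free(struct model_job *job)
{
	if (!job)
		return;
	free(job->fence);
	free(job);
}
```

With this shape the run path cannot fail with -ENOMEM and never enters
reclaim, so there is nothing left for the annotations to complain about.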
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720
amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
amdgpu_job_dependency+0xf9/0x120 [amdgpu]
drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0
amdgpu_sync_fence+0x7e/0x110 [amdgpu]
amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
amdgpu_job_dependency+0xf9/0x120 [amdgpu]
drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d878fe7fee51..055b47241bb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
uint32_t seq;
int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
if (fence == NULL)
return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index fe92dcd94d4a..fdcd6659f5ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
if (!fences)
return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index b87ca171986a..330476cc0c86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f,
if (amdgpu_sync_add_later(sync, f, explicit))
return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
if (!e)
return -ENOMEM;
--
2.26.2
This is a bit tricky, since ->notifier_lock is held while calling
dma_fence_wait we must ensure that also the read side (i.e.
dma_fence_begin_signalling) is on the same side. If we mix this up
lockdep complains, and that's again why we want to have these
annotations.
A nice side effect of this is that because of the fs_reclaim priming
for dma_fence_enable lockdep now automatically checks for us that
nothing in here allocates memory, without even running any userptr
workloads.
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index a25fb59c127c..e109666aec14 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1212,6 +1212,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
struct amdgpu_job *job;
uint64_t seq;
int r;
+ bool fence_cookie;
job = p->job;
p->job = NULL;
@@ -1226,6 +1227,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
*/
mutex_lock(&p->adev->notifier_lock);
+ fence_cookie = dma_fence_begin_signalling();
+
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
* -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
*/
@@ -1262,12 +1265,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
+ dma_fence_end_signalling(fence_cookie);
mutex_unlock(&p->adev->notifier_lock);
return 0;
error_abort:
drm_sched_job_cleanup(&job->base);
+ dma_fence_end_signalling(fence_cookie);
mutex_unlock(&p->adev->notifier_lock);
error_unlock:
--
2.26.2
If the scheduler rt thread gets stuck on a mutex that we're holding
while waiting for gpu workloads to complete, we have a problem.
Add dma-fence annotations so that lockdep can check this for us.
I've tried to quite carefully review this, and I think it's at the
right spot. But obviously I'm no expert on the drm scheduler.
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 2f319102ae9f..06a736e506ad 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -763,9 +763,12 @@ static int drm_sched_main(void *param)
struct sched_param sparam = {.sched_priority = 1};
struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
int r;
+ bool fence_cookie;
sched_setscheduler(current, SCHED_FIFO, &sparam);
+ fence_cookie = dma_fence_begin_signalling();
+
while (!kthread_should_stop()) {
struct drm_sched_entity *entity = NULL;
struct drm_sched_fence *s_fence;
@@ -823,6 +826,9 @@ static int drm_sched_main(void *param)
wake_up(&sched->job_scheduled);
}
+
+ dma_fence_end_signalling(fence_cookie);
+
return 0;
}
--
2.26.2
This is a bit disappointing since we need to split the annotations
over all the different parts.
I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so bad flag day. And also a
bit leaky abstraction.
Hence just do it function-by-function.
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 7cd7fe0d57b4..bfcc7857a9a1 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
{
struct drm_device *dev = old_state->dev;
+ bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie);
+
drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
{
struct drm_device *dev = old_state->dev;
+ bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie);
+
drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
ktime_t start;
s64 commit_time_ms;
unsigned int i, new_self_refresh_mask = 0;
+ bool fence_cookie;
+
+ fence_cookie = dma_fence_begin_signalling();
funcs = dev->mode_config.helper_private;
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
if (new_crtc_state->self_refresh_active)
new_self_refresh_mask |= BIT(i);
+ dma_fence_end_signalling(fence_cookie);
+
if (funcs && funcs->atomic_commit_tail)
funcs->atomic_commit_tail(old_state);
else
@@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
bool nonblock)
{
int ret;
+ bool fence_cookie;
if (state->async_update) {
ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
if (ret)
return ret;
+ fence_cookie = dma_fence_begin_signalling();
+
if (!nonblock) {
ret = drm_atomic_helper_wait_for_fences(dev, state, true);
if (ret)
@@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
*/
drm_atomic_state_get(state);
+ dma_fence_end_signalling(fence_cookie);
if (nonblock)
queue_work(system_unbound_wq, &state->commit_work);
else
@@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
return 0;
err:
+ dma_fence_end_signalling(fence_cookie);
drm_atomic_helper_cleanup_planes(dev, state);
return ret;
}
--
2.26.2
This is rather overkill since currently all drivers call this from
hardirq (or at least timers). But maybe in the future we're going to
have threaded irq handlers and what not, so it doesn't hurt to be prepared.
Plus this is an easy start for sprinkling these fence annotations into
shared code.
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/gpu/drm/drm_vblank.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 85e5f2db1608..93a5bba5f665 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -24,6 +24,7 @@
* OTHER DEALINGS IN THE SOFTWARE.
*/
+#include <linux/dma-fence.h>
#include <linux/export.h>
#include <linux/moduleparam.h>
@@ -1908,7 +1909,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
{
struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
unsigned long irqflags;
- bool disable_irq;
+ bool disable_irq, fence_cookie;
if (drm_WARN_ON_ONCE(dev, !drm_dev_has_vblank(dev)))
return false;
@@ -1916,6 +1917,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
if (drm_WARN_ON(dev, pipe >= dev->num_crtcs))
return false;
+ fence_cookie = dma_fence_begin_signalling();
+
spin_lock_irqsave(&dev->event_lock, irqflags);
/* Need timestamp lock to prevent concurrent execution with
@@ -1928,6 +1931,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
if (!vblank->enabled) {
spin_unlock(&dev->vblank_time_lock);
spin_unlock_irqrestore(&dev->event_lock, irqflags);
+ dma_fence_end_signalling(fence_cookie);
return false;
}
@@ -1953,6 +1957,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
if (disable_irq)
vblank_disable_fn(&vblank->disable_timer);
+ dma_fence_end_signalling(fence_cookie);
+
return true;
}
EXPORT_SYMBOL(drm_handle_vblank);
--
2.26.2
Hi all,
I've dragged my feet for years on this, hoping that cross-release lockdep
would do this for us, but well that never really happened unfortunately.
So here we are.
Cc'ed quite a pile of people since this is about the cross-driver contract
around dma_fences. Which is heavily used for dma_buf, and I'm hearing more
noises that rdma folks are looking into this, hence also on cc.
There's a bunch of different parts to this RFC:
- The annotations themselves, in the 2nd patch after the prep patch to add
might_sleep annotations. Commit message has all the motivation for what
kind of deadlocks I want to catch, best you just read it.
Since lockdep doesn't understand cross-release natively we need to
cobble something together using rwlocks and a few more tricks, but from
the test rollout in a few places in drm/vkms, amdgpu & i915 I think what
I have now seems to actually work. Downside is that we have to
explicitly annotate all code involved in eventual dma_fence signalling.
- Second important part is locking down the current dma-fence cross-driver
contract, using lockdep priming like we already do for dma_resv_lock.
I've just started with my own take on what we probably need to make the
current code work (-ish), but both amdgpu and i915 have issues with
that. So this needs some careful discussions, and also some thought on
how we land it all eventually to not break lockdep completely for
everyone.
The important patch for that is "dma-fence: prime lockdep annotations"
plus of course the various annotations patches and driver hacks to
highlight some of the issues caught.
Note that depending upon what exactly we end up deciding we might need
to improve the annotations for fs_reclaim_acquire/release - for
dma_fence_wait in mmu notifiers we can only allow GFP_NOWAIT (afaiui),
and currently fs_reclaim_acquire/release has a lockdep class for
__GFP_FS only; we'd probably need to add another one for
__GFP_DIRECT_RECLAIM in general.
- Finally there's clearly some gaps in the current dma_fence driver
interfaces: Amdgpu's hang recovery is essentially impossible to fix
as-is - it needs to reset the display state and you can't get at modeset
locks from tdr without deadlock potential. i915 has an internal trick
(but it stops working once we involve real cross-driver fences) for this
issue, but then for i915 modeset reset is only needed on very ancient
gen2/3. Modern hw is a lot more reasonable.
I'm kinda hoping that the annotations and priming for basic command
submission and atomic modeset paths could be merged soonish, while
the tdr side clearly needs a pile more work to get going. But since we
have to explicitly annotate all code paths anyway we can hide bugs in
e.g. tdr code by simply not yet annotating those functions.
I'm trying to lay out at least one idea for solving the tdr issue in the
patch titled "drm/scheduler: use dma-fence annotations in tdr work".
Finally, once we have some agreement on where we're going with all this,
we also need some documentation. Currently that's missing because I don't
want to re-edit the text all the time while we still figure out the
details of the exact cross-driver semantics.
My goal here is that with this we can lock down the cross-driver contract
for the last bit of the dma_buf/resv/fence story and make sure this stops
being such a wobbly thing where everyone just does whatever they feel
like.
Ideas, thoughts, reviews, testing (with specific annotations for that
driver) on other drivers very much welcome.
Cheers, Daniel
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Daniel Vetter (17):
dma-fence: add might_sleep annotation to _wait()
dma-fence: basic lockdep annotations
dma-fence: prime lockdep annotations
drm/vkms: Annotate vblank timer
drm/vblank: Annotate with dma-fence signalling section
drm/atomic-helper: Add dma-fence annotations
drm/amdgpu: add dma-fence annotations to atomic commit path
drm/scheduler: use dma-fence annotations in main thread
drm/amdgpu: use dma-fence annotations in cs_submit()
drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
drm/amdgpu: DC also loves to allocate stuff where it shouldn't
drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
drm/scheduler: use dma-fence annotations in tdr work
drm/amdgpu: use dma-fence annotations for gpu reset code
Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
drm/amdgpu: gpu recovery does full modesets
drm/i915: Annotate dma_fence_work
drivers/dma-buf/dma-fence.c | 56 +++++++++++++++++++
drivers/dma-buf/dma-resv.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +-
drivers/gpu/drm/amd/amdgpu/atom.c | 2 +-
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++++-
drivers/gpu/drm/amd/display/dc/core/dc.c | 4 +-
drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++
drivers/gpu/drm/drm_vblank.c | 8 ++-
drivers/gpu/drm/i915/i915_sw_fence_work.c | 3 +
drivers/gpu/drm/scheduler/sched_main.c | 11 ++++
drivers/gpu/drm/vkms/vkms_crtc.c | 8 ++-
include/linux/dma-fence.h | 13 +++++
16 files changed, 160 insertions(+), 13 deletions(-)
--
2.26.2
On Thu, May 28, 2020 at 11:54 PM Luben Tuikov <luben.tuikov(a)amd.com> wrote:
>
> On 2020-05-12 4:59 a.m., Daniel Vetter wrote:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> > this explicit annotation can be more liberally sprinkled around.
> > With read locks lockdep isn't going to complain if the read-side
> > isn't nested the same way under all circumstances, so ABBA deadlocks
> > are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> > read lock mode lockdep does not catch read side hazards. And we
> > _very_ much want read side hazards to be caught. For full details of
> > this limitation see
> >
> > commit e91498589746065e3ae95d9a00b068e525eec34f
> > Author: Peter Zijlstra <peterz(a)infradead.org>
> > Date: Wed Aug 23 13:13:11 2017 +0200
> >
> > locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> > keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> > dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> > to call dma_fence_begin/end_signalling from soft/hardirq context.
> > First attempt was using the hardirq locking context for the write
> > side in lockdep, but this forces all normal spinlocks nested within
> > dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >
> > The approach now is to simply check in_atomic(), and for these cases
> > entirely rely on the might_sleep() check in dma_fence_wait(). That
> > will catch any wrong nesting against spinlocks from soft/hardirq
> > contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch are:
> >
> > Thread A:
> >
> > mutex_lock(A);
> > mutex_unlock(A);
> >
> > dma_fence_signal();
> >
> > Thread B:
> >
> > mutex_lock(A);
> > dma_fence_wait();
> > mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that if any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindlessly to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > Cc: linux-media(a)vger.kernel.org
> > Cc: linaro-mm-sig(a)lists.linaro.org
> > Cc: linux-rdma(a)vger.kernel.org
> > Cc: amd-gfx(a)lists.freedesktop.org
> > Cc: intel-gfx(a)lists.freedesktop.org
> > Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
> > Cc: Christian König <christian.koenig(a)amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
> > ---
> > drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
> > include/linux/dma-fence.h | 12 +++++++++
> > 2 files changed, 65 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 6802125349fb..d5c0fd2efc70 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> > }
> > EXPORT_SYMBOL(dma_fence_context_alloc);
> >
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map dma_fence_lockdep_map = {
> > + .name = "dma_fence_map"
> > +};
> > +
> > +bool dma_fence_begin_signalling(void)
> > +{
> > + /* explicitly nesting ... */
> > + if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> > + return true;
> > +
> > + /* rely on might_sleep check for soft/hardirq locks */
> > + if (in_atomic())
> > + return true;
> > +
> > + /* ... and non-recursive readlock */
> > + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> > +
> > + return false;
> > +}
> > +EXPORT_SYMBOL(dma_fence_begin_signalling);
>
> Hi Daniel,
>
> This is great work and could help a lot.
>
> If you invert the result of dma_fence_begin_signalling()
> then it would naturally mean "locked", i.e. whether we need to
> later release "dma_fence_lockdep_map". Then,
> in dma_fence_end_signalling(), you can call the "cookie"
> argument "locked" and simply do:
>
> void dma_fence_end_signalling(bool locked)
> {
> if (locked)
> lock_release(&dma_fence_lockdep_map, _RET_IP_);
> }
> EXPORT_SYMBOL(dma_fence_end_signalling);
>
> It'll be more natural to understand as well.
It's intentionally called cookie so callers don't start doing funny
stuff with it. The thing is, after begin_signalling you are _always_
in the locked state. It's just that because of limitations with
lockdep we need to play a few tricks, and in some cases we do not take
the lockdep map. There's 2 cases:
- lockdep map already taken - we want recursive readlock semantics for
this, but lockdep does not correctly check recursive read locks. Hence
we only use readlock, and make sure we do not actually nest upon
ourselves with this explicit check.
- when we're in atomic sections - lockdep gets pissed at us if we take
the read lock in hard/softirq sections because of hard/softirq ctx
mismatch (lockdep thinks it's a real lock, but we don't treat it as
one). Simplest fix was to rely on the might_sleep check in patch 1
(already merged)
The commit message mentions this already a bit, but I'll try to
explain this implementation detail tersely in the kerneldoc too in the
next round.
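The two cases above can be modelled in plain userspace C to show why the
return value is an opaque cookie rather than a "locked" flag. This is a
sketch only: model_map_held stands in for lock_is_held_type() and
model_in_atomic for in_atomic(), and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_map_held;  /* stands in for lock_is_held_type() */
static bool model_in_atomic; /* stands in for in_atomic() */

/* Returns true when the map was NOT taken and end must do nothing. */
bool model_begin_signalling(void)
{
	if (model_map_held)    /* already in a section: explicit nesting */
		return true;
	if (model_in_atomic)   /* rely on the might_sleep() check instead */
		return true;
	model_map_held = true; /* models taking the read side of the map */
	return false;
}

void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	assert(model_map_held);
	model_map_held = false;
}

/* Test helpers for driving and inspecting the model. */
void model_set_atomic(bool in_atomic)
{
	model_in_atomic = in_atomic;
}

bool model_is_map_held(void)
{
	return model_map_held;
}
```

In every case the caller is logically inside a signalling section after
begin returns; the cookie only records whether this particular call has
to undo the lockdep bookkeeping.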
Thanks, Daniel
>
> Regards,
> Luben
>
> > +
> > +void dma_fence_end_signalling(bool cookie)
> > +{
> > + if (cookie)
> > + return;
> > +
> > + lock_release(&dma_fence_lockdep_map, _RET_IP_);
> > +}
> > +EXPORT_SYMBOL(dma_fence_end_signalling);
> > +
> > +void __dma_fence_might_wait(void)
> > +{
> > + bool tmp;
> > +
> > + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> > + if (tmp)
> > + lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> > + lock_map_acquire(&dma_fence_lockdep_map);
> > + lock_map_release(&dma_fence_lockdep_map);
> > + if (tmp)
> > + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> > +}
> > +#endif
> > +
> > +
> > /**
> > * dma_fence_signal_locked - signal completion of a fence
> > * @fence: the fence to signal
> > @@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
> > {
> > unsigned long flags;
> > int ret;
> > + bool tmp;
> >
> > if (!fence)
> > return -EINVAL;
> >
> > + tmp = dma_fence_begin_signalling();
> > +
> > spin_lock_irqsave(fence->lock, flags);
> > ret = dma_fence_signal_locked(fence);
> > spin_unlock_irqrestore(fence->lock, flags);
> >
> > + dma_fence_end_signalling(tmp);
> > +
> > return ret;
> > }
> > EXPORT_SYMBOL(dma_fence_signal);
> > @@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
> > if (timeout > 0)
> > might_sleep();
> >
> > + __dma_fence_might_wait();
> > +
> > trace_dma_fence_wait_start(fence);
> > if (fence->ops->wait)
> > ret = fence->ops->wait(fence, intr, timeout);
> > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> > index 3347c54f3a87..3f288f7db2ef 100644
> > --- a/include/linux/dma-fence.h
> > +++ b/include/linux/dma-fence.h
> > @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
> > } while (1);
> > }
> >
> > +#ifdef CONFIG_LOCKDEP
> > +bool dma_fence_begin_signalling(void);
> > +void dma_fence_end_signalling(bool cookie);
> > +#else
> > +static inline bool dma_fence_begin_signalling(void)
> > +{
> > + return true;
> > +}
> > +static inline void dma_fence_end_signalling(bool cookie) {}
> > +static inline void __dma_fence_might_wait(void) {}
> > +#endif
> > +
> > int dma_fence_signal(struct dma_fence *fence);
> > int dma_fence_signal_locked(struct dma_fence *fence);
> > signed long dma_fence_default_wait(struct dma_fence *fence,
> >
>
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, May 28, 2020 at 3:37 PM Thomas Hellström (Intel)
<thomas_os(a)shipmail.org> wrote:
>
> On 2020-05-12 10:59, Daniel Vetter wrote:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> > this explicit annotation can be more liberally sprinkled around.
> > With read locks lockdep isn't going to complain if the read-side
> > isn't nested the same way under all circumstances, so ABBA deadlocks
> > are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> > read lock mode lockdep does not catch read side hazards. And we
> > _very_ much want read side hazards to be caught. For full details of
> > this limitation see
> >
> > commit e91498589746065e3ae95d9a00b068e525eec34f
> > Author: Peter Zijlstra <peterz(a)infradead.org>
> > Date: Wed Aug 23 13:13:11 2017 +0200
> >
> > locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> > keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> > dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> > to call dma_fence_begin/end_signalling from soft/hardirq context.
> > First attempt was using the hardirq locking context for the write
> > side in lockdep, but this forces all normal spinlocks nested within
> > dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >
> > The approach now is to simply check in_atomic(), and for these cases
> > entirely rely on the might_sleep() check in dma_fence_wait(). That
> > will catch any wrong nesting against spinlocks from soft/hardirq
> > contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch are:
> >
> > Thread A:
> >
> > mutex_lock(A);
> > mutex_unlock(A);
> >
> > dma_fence_signal();
> >
> > Thread B:
> >
> > mutex_lock(A);
> > dma_fence_wait();
> > mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that if any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindlessly, to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > Cc: linux-media(a)vger.kernel.org
> > Cc: linaro-mm-sig(a)lists.linaro.org
> > Cc: linux-rdma(a)vger.kernel.org
> > Cc: amd-gfx(a)lists.freedesktop.org
> > Cc: intel-gfx(a)lists.freedesktop.org
> > Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
> > Cc: Christian König <christian.koenig(a)amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
>
> LGTM. Perhaps add some in-code documentation on how the new functions
> are to be called.
See cover letter, that's going to be done for next round. For this one
here I just wanted to showcase a bit how it's used in a few different
places, mostly selected to get as much feedback from across different
drivers. Hence e.g. annotating drm/scheduler.
> Otherwise for patch 2 and 3,
>
> Reviewed-by: Thomas Hellstrom <thomas.hellstrom(a)intel.com>
I think I'll just cc you for the next round with docs, so you can make
sure it looks ok :-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
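
The deadlock class from the quoted commit message (thread A takes mutex A on
its way to dma_fence_signal(), thread B calls dma_fence_wait() while holding
mutex A) can be illustrated with a toy dependency tracker. This is only a
sketch of the idea of recording lock-order edges; it is far simpler than
lockdep, handles just two lock classes, and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two lock classes: the driver mutex and the fence signalling annotation. */
enum { LOCK_A, FENCE_MAP, NR_CLASSES };

/* dep[x][y] is true once y has been acquired while x was held. */
static bool dep[NR_CLASSES][NR_CLASSES];

static void toy_reset(void)
{
	memset(dep, 0, sizeof(dep));
}

/*
 * Record that 'taken' was acquired while 'held' was held. Returns false if
 * the reverse order was seen before, i.e. the new edge closes a cycle.
 * With two classes a cycle is just the reverse edge, so no graph walk here.
 */
static bool toy_acquire_under(int held, int taken)
{
	dep[held][taken] = true;
	return !dep[taken][held];
}

/* Thread A: mutex_lock(A) inside a begin/end_signalling section. */
static bool toy_signaller_takes_lock_a(void)
{
	return toy_acquire_under(FENCE_MAP, LOCK_A);
}

/* Thread B: dma_fence_wait() while holding mutex A. */
static bool toy_waiter_holds_lock_a(void)
{
	return toy_acquire_under(LOCK_A, FENCE_MAP);
}
```

The point of the annotation is visible even in this toy: the second ordering
is flagged as soon as it is observed, whether or not the two threads ever race
in practice. Either ordering on its own is accepted.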
On Tue, May 26, 2020 at 07:58:08PM +0900, David Stevens wrote:
> This patchset implements the current proposal for virtio cross-device
> resource sharing [1]. It will be used to import virtio resources into
> the virtio-video driver currently under discussion [2]. The patch
> under consideration to add support in the virtio-video driver is [3].
> It uses the APIs from v3 of this series, but the changes to update it
> are relatively minor.
>
> This patchset adds a new flavor of dma-bufs that supports querying the
> underlying virtio object UUID, as well as adding support for exporting
> resources from virtgpu.
>
> [1] https://markmail.org/thread/2ypjt5cfeu3m6lxu
> [2] https://markmail.org/thread/p5d3k566srtdtute
> [3] https://markmail.org/thread/j4xlqaaim266qpks
>
> v3 -> v4 changes:
> - Replace dma-buf hooks with virtio dma-buf from v1.
> - Remove virtio_attach callback, as the work that had been done
> in that callback is now done on dma-buf export. The documented
> requirement that get_uuid only be called on attached virtio
> dma-bufs is also removed.
> - Rebase and add call to virtio_gpu_notify for ASSIGN_UUID.
>
> David Stevens (3):
> virtio: add dma-buf support for exported objects
> virtio-gpu: add VIRTIO_GPU_F_RESOURCE_UUID feature
> drm/virtio: Support virtgpu exported resources
Looks all sane to me. mst, have you looked at the virtio core changes?
How are we going to merge this? If you ack I can merge via
drm-misc-next. Merging through virtio queue would be fine too.
thanks,
Gerd
Dear All,
During the Exynos DRM GEM rework and while fixing the issues in the
drm_prime_sg_to_page_addr_arrays() function [1] I've noticed that most
drivers in DRM framework incorrectly use nents and orig_nents entries of
the struct sg_table.
In case of most DMA-mapping implementations, exchanging those two
entries or using nents for all loops on the scatterlist is harmless,
because they both have the same value. There exist, however, DMA-mapping
implementations for which such incorrect usage breaks things. The nents
returned by dma_map_sg() might be lower than the nents passed as its
parameter and this is perfectly fine. DMA framework or IOMMU is allowed
to join consecutive chunks while mapping if such operation is supported
by the underlying HW (bus, bridge, IOMMU, etc). Example of the case
where dma_map_sg() might return 1 'DMA' chunk for the 4 'physical' pages
is described here [2].
The DMA-mapping framework documentation [3] states that dma_map_sg()
returns the number of the created entries in the DMA address space.
However the subsequent calls to dma_sync_sg_for_{device,cpu} and
dma_unmap_sg must be called with the original number of entries passed to
dma_map_sg. The common pattern in DRM drivers was to assign the
dma_map_sg() return value to sg_table->nents and use that value for
the subsequent calls to dma_sync_sg_* or dma_unmap_sg functions. Also
the code iterated over nents times to access the pages stored in the
processed scatterlist, while it should use orig_nents as the number of
the page entries.
I've tried to identify all such incorrect usage of sg_table->nents and
this is a result of my research. It looks like the incorrect pattern has
been copied across many drivers, mainly in the DRM subsystem. Too bad in
most cases it even worked correctly if the system used a simple, linear
DMA-mapping implementation, for which swapping nents and orig_nents
doesn't make any difference. To avoid similar issues in the future, I've
introduced common wrappers for DMA-mapping calls, which operate directly
on the sg_table objects. I've also added wrappers for iterating over the
scatterlists stored in the sg_table objects and applied them where
possible. This, together with some common DRM prime helpers, allowed me
to get rid of almost all nents/orig_nents usage in the drivers. I hope
that such change makes the code robust, easier to follow and copy/paste
safe.
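
The nents/orig_nents distinction can be shown with a small userspace model.
This is a sketch under stated assumptions, not kernel code: struct toy_sg,
struct toy_sg_table and the toy_* functions are hypothetical, the "IOMMU" is
an identity mapping that merges physically consecutive pages, and every entry
is assumed to be exactly one page long.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* One scatterlist entry: a CPU-side page plus the DMA segment stored in it. */
struct toy_sg {
	unsigned long phys;      /* CPU-side page address */
	unsigned long dma_addr;  /* filled in by the mapping step */
	unsigned int dma_len;    /* filled in by the mapping step */
};

struct toy_sg_table {
	struct toy_sg *sgl;
	unsigned int nents;       /* DMA segments, valid only after mapping */
	unsigned int orig_nents;  /* CPU-side entries, never changed by mapping */
};

/* Toy mapping: merge consecutive pages and return the number of DMA segments. */
static unsigned int toy_map_sg(struct toy_sg *sgl, unsigned int orig_nents)
{
	unsigned int i, out = 0;

	for (i = 0; i < orig_nents; i++) {
		if (out && sgl[out - 1].dma_addr + sgl[out - 1].dma_len == sgl[i].phys) {
			sgl[out - 1].dma_len += TOY_PAGE_SIZE;  /* extend previous segment */
		} else {
			sgl[out].dma_addr = sgl[i].phys;        /* start a new segment */
			sgl[out].dma_len = TOY_PAGE_SIZE;
			out++;
		}
	}
	return out;
}

/* Wrapper in the spirit of the sgtable helpers: the table keeps both counts. */
static int toy_map_sgtable(struct toy_sg_table *sgt)
{
	unsigned int n = toy_map_sg(sgt->sgl, sgt->orig_nents);

	if (!n)
		return -1;
	sgt->nents = n;  /* orig_nents is left alone for sync/unmap and page walks */
	return 0;
}

/* Walking pages uses orig_nents. */
static unsigned int toy_count_pages(const struct toy_sg_table *sgt)
{
	unsigned int i, count = 0;

	for (i = 0; i < sgt->orig_nents; i++)
		count++;
	return count;
}

/* Walking DMA segments uses nents. */
static unsigned long toy_total_dma_len(const struct toy_sg_table *sgt)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i < sgt->nents; i++)
		total += sgt->sgl[i].dma_len;
	return total;
}
```

With four consecutive pages this model yields one DMA segment of four pages
while the table still holds four CPU-side entries, so a page loop bounded by
nents would stop after the first page.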
The biggest TODO is DRM/i915 driver and I don't feel brave enough to fix
it fully. The driver creatively uses sg_table->orig_nents to store the
size of the allocated scatterlist and ignores the number of the entries
returned by dma_map_sg function. In this patchset I only fixed the
sg_table objects exported by dmabuf related functions. I hope that I
didn't break anything there.
Patches are based on top of Linux next-20200512.
Christoph Hellwig already offered to take patches 1-3 into his immutable
branch [4]. If possible I would like to ask for merging most of the
remaining patches via DRM tree (on top of that immutable branch).
Best regards,
Marek Szyprowski
References:
[1] https://lkml.org/lkml/2020/3/27/555
[2] https://lkml.org/lkml/2020/3/29/65
[3] Documentation/DMA-API-HOWTO.txt
[4] https://lore.kernel.org/linux-iommu/20200512121931.GD20393@lst.de/T/#ma18c9…
Changelog:
v5:
- fixed some minor style issues and typos
- fixed lack of the attrs argument in ion, dmabuf, rapidio, fastrpc and
vfio patches
v4: https://lore.kernel.org/linux-iommu/20200512121931.GD20393@lst.de/T/
- added for_each_sgtable_* wrappers and applied where possible
- added drm_prime_get_contiguous_size() and applied where possible
- applied drm_prime_sg_to_page_addr_arrays() where possible to remove page
extraction from sg_table objects
- added documentation for the introduced wrappers
- improved patches description a bit
v3: https://lore.kernel.org/dri-devel/20200505083926.28503-1-m.szyprowski@samsu…
- introduce dma_*_sgtable_* wrappers and use them in all patches
v2: https://lore.kernel.org/linux-iommu/c01c9766-9778-fd1f-f36e-2dc7bd376ba4@ar…
- dropped most of the changes to drm/i915
- added fixes for rcar-du, xen, media and ion
- fixed a few issues pointed by kbuild test robot
- added wide cc: list for each patch
v1: https://lore.kernel.org/linux-iommu/c01c9766-9778-fd1f-f36e-2dc7bd376ba4@ar…
- initial version
Patch summary:
Marek Szyprowski (38):
dma-mapping: add generic helpers for mapping sgtable objects
scatterlist: add generic wrappers for iterating over sgtable objects
iommu: add generic helper for mapping sgtable objects
drm: prime: add common helper to check scatterlist contiguity
drm: prime: use sgtable iterators in
drm_prime_sg_to_page_addr_arrays()
drm: core: fix common struct sg_table related issues
drm: amdgpu: fix common struct sg_table related issues
drm: armada: fix common struct sg_table related issues
drm: etnaviv: fix common struct sg_table related issues
drm: exynos: use common helper for a scatterlist contiguity check
drm: exynos: fix common struct sg_table related issues
drm: i915: fix common struct sg_table related issues
drm: lima: fix common struct sg_table related issues
drm: mediatek: use common helper for a scatterlist contiguity check
drm: mediatek: use common helper for extracting pages array
drm: msm: fix common struct sg_table related issues
drm: omapdrm: use common helper for extracting pages array
drm: omapdrm: fix common struct sg_table related issues
drm: panfrost: fix common struct sg_table related issues
drm: radeon: fix common struct sg_table related issues
drm: rockchip: use common helper for a scatterlist contiguity check
drm: rockchip: fix common struct sg_table related issues
drm: tegra: fix common struct sg_table related issues
drm: v3d: fix common struct sg_table related issues
drm: virtio: fix common struct sg_table related issues
drm: vmwgfx: fix common struct sg_table related issues
xen: gntdev: fix common struct sg_table related issues
drm: host1x: fix common struct sg_table related issues
drm: rcar-du: fix common struct sg_table related issues
dmabuf: fix common struct sg_table related issues
staging: ion: remove dead code
staging: ion: fix common struct sg_table related issues
staging: tegra-vde: fix common struct sg_table related issues
misc: fastrpc: fix common struct sg_table related issues
rapidio: fix common struct sg_table related issues
samples: vfio-mdev/mbochs: fix common struct sg_table related issues
media: pci: fix common ALSA DMA-mapping related codes
videobuf2: use sgtable-based scatterlist wrappers
drivers/dma-buf/heaps/heap-helpers.c | 13 ++--
drivers/dma-buf/udmabuf.c | 7 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 8 +-
drivers/gpu/drm/armada/armada_gem.c | 12 +--
drivers/gpu/drm/drm_cache.c | 2 +-
drivers/gpu/drm/drm_gem_cma_helper.c | 23 +-----
drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++--
drivers/gpu/drm/drm_prime.c | 86 ++++++++++++----------
drivers/gpu/drm/etnaviv/etnaviv_gem.c | 12 ++-
drivers/gpu/drm/etnaviv/etnaviv_mmu.c | 13 +---
drivers/gpu/drm/exynos/exynos_drm_g2d.c | 10 +--
drivers/gpu/drm/exynos/exynos_drm_gem.c | 23 +-----
drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 11 +--
drivers/gpu/drm/i915/gem/selftests/mock_dmabuf.c | 7 +-
drivers/gpu/drm/lima/lima_gem.c | 11 ++-
drivers/gpu/drm/lima/lima_vm.c | 5 +-
drivers/gpu/drm/mediatek/mtk_drm_gem.c | 34 ++-------
drivers/gpu/drm/msm/msm_gem.c | 13 ++--
drivers/gpu/drm/msm/msm_gpummu.c | 14 ++--
drivers/gpu/drm/msm/msm_iommu.c | 2 +-
drivers/gpu/drm/omapdrm/omap_gem.c | 20 ++---
drivers/gpu/drm/panfrost/panfrost_gem.c | 4 +-
drivers/gpu/drm/panfrost/panfrost_mmu.c | 7 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 11 ++-
drivers/gpu/drm/rcar-du/rcar_du_vsp.c | 3 +-
drivers/gpu/drm/rockchip/rockchip_drm_gem.c | 42 +++--------
drivers/gpu/drm/tegra/gem.c | 27 +++----
drivers/gpu/drm/tegra/plane.c | 15 ++--
drivers/gpu/drm/v3d/v3d_mmu.c | 17 ++---
drivers/gpu/drm/virtio/virtgpu_object.c | 36 +++++----
drivers/gpu/drm/virtio/virtgpu_vq.c | 12 ++-
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 17 +----
drivers/gpu/host1x/job.c | 22 ++----
.../media/common/videobuf2/videobuf2-dma-contig.c | 41 +++++------
drivers/media/common/videobuf2/videobuf2-dma-sg.c | 32 +++-----
drivers/media/common/videobuf2/videobuf2-vmalloc.c | 12 +--
drivers/media/pci/cx23885/cx23885-alsa.c | 2 +-
drivers/media/pci/cx25821/cx25821-alsa.c | 2 +-
drivers/media/pci/cx88/cx88-alsa.c | 2 +-
drivers/media/pci/saa7134/saa7134-alsa.c | 2 +-
drivers/media/platform/vsp1/vsp1_drm.c | 8 +-
drivers/misc/fastrpc.c | 4 +-
drivers/rapidio/devices/rio_mport_cdev.c | 8 +-
drivers/staging/android/ion/ion.c | 25 +++----
drivers/staging/android/ion/ion.h | 1 -
drivers/staging/android/ion/ion_heap.c | 53 ++++---------
drivers/staging/android/ion/ion_system_heap.c | 2 +-
drivers/staging/media/tegra-vde/iommu.c | 4 +-
drivers/xen/gntdev-dmabuf.c | 13 ++--
include/drm/drm_prime.h | 2 +
include/linux/dma-mapping.h | 78 ++++++++++++++++++++
include/linux/iommu.h | 16 ++++
include/linux/scatterlist.h | 50 ++++++++++++-
samples/vfio-mdev/mbochs.c | 3 +-
56 files changed, 451 insertions(+), 477 deletions(-)
--
1.9.1
On Thu, May 14, 2020 at 02:38:38PM +0300, Oded Gabbay wrote:
> On Tue, May 12, 2020 at 9:12 AM Daniel Vetter <daniel.vetter(a)ffwll.ch> wrote:
> >
> > On Tue, May 12, 2020 at 4:14 AM Dave Airlie <airlied(a)gmail.com> wrote:
> > >
> > > On Mon, 11 May 2020 at 19:37, Oded Gabbay <oded.gabbay(a)gmail.com> wrote:
> > > >
> > > > On Mon, May 11, 2020 at 12:11 PM Daniel Vetter <daniel.vetter(a)ffwll.ch> wrote:
> > > > >
> > > > > It's the default.
> > > > Thanks for catching that.
> > > >
> > > > >
> > > > > Also so much for "we're not going to tell the graphics people how to
> > > > > review their code", dma_fence is a pretty core piece of gpu driver
> > > > > infrastructure. And it's very much uapi relevant, including piles of
> > > > > corresponding userspace protocols and libraries for how to pass these
> > > > > around.
> > > > >
> > > > > Would be great if habanalabs would not use this (from a quick look
> > > > > it's not needed at all), since open source the userspace and playing
> > > > > by the usual rules isn't on the table. If that's not possible (because
> > > > > it's actually using the uapi part of dma_fence to interact with gpu
> > > > > drivers) then we have exactly what everyone promised we'd want to
> > > > > avoid.
> > > >
> > > > We don't use the uapi parts, we are currently only using the fencing and
> > > > signaling ability of this module inside our kernel code. But maybe I
> > > > didn't understand what you request. You want us *not* to use this
> > > > well-written piece of kernel code because it is only used by graphics
> > > > drivers ?
> > > > I'm sorry but I don't get this argument, if this is indeed what you meant.
> > >
> > > We would rather drivers using a feature that has requirements on
> > > correct userspace implementations of the feature have a userspace that
> > > is open source and auditable.
> > >
> > > Fencing is tricky, cross-device fencing is really tricky, and having
> > > the ability for a closed userspace component to mess up other people's
> > > drivers, think i915 shared with closed habana userspace and shared
> > > fences, decreases ability to debug things.
> > >
> > > Ideally we wouldn't offer users known untested/broken scenarios, so
> > > yes we'd prefer that drivers that intend to expose a userspace fencing
> > > api around dma-fence would adhere to the rules of the gpu drivers.
> > >
> > > I'm not saying you have to drop using dma-fence, but if you move towards
> > > cross-device stuff I believe other drivers would be correct in
> > > refusing to interact with fences from here.
> >
> > The flip side is if you only used dma-fence.c "because it's there",
> > and not because it comes with an uapi attached and a cross-driver
> > kernel internal contract for how to interact with gpu drivers, then
> > there's really not much point in using it. It's a custom-rolled
> > wait_queue/event thing, that's all. Without the gpu uapi and gpu
> > cross-driver contract it would be much cleaner to just use wait_queue
> > directly, and that's a construct all kernel developers understand, not
> > just gpu folks. From a quick look at least habanalabs doesn't use any
> > of these uapi/cross-driver/gpu bits.
> > -Daniel
>
> Hi Daniel,
> I want to say explicitly that we don't use the dma-buf uapi parts, nor
> do we intend to use them to communicate with any GPU device. We only use
> it as a simple completion mechanism as it was convenient to use.
> I do understand I can exchange that mechanism with a simpler one, and
> I will add an internal task to do it (albeit not in a very high
> priority) and upstream it, it's just that it is part of our data path
> so we need to thoroughly validate it first.
Sounds good.
Wrt merging this patch here, can you include that in one of your next
pulls? Or should I toss it entirely, waiting for you to remove dma_fence
outright?
Thanks, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Do it unconditionally, there's a separate peek function with
dma_fence_is_signalled() which can be called from atomic context.
v2: Consensus calls for an unconditional might_sleep (Chris,
Christian)
Full audit:
- dma-fence.h: Uses MAX_SCHEDULE_TIMEOUT, good chance this sleeps
- dma-resv.c: Timeout always at least 1
- st-dma-fence.c: Safe to sleep in testcases
- amdgpu_cs.c: Both callers are for variants of the wait ioctl
- amdgpu_device.c: Two callers in vram recover code, both right next
to mutex_lock.
- amdgpu_vm.c: Use in the vm_wait ioctl, next to _reserve/unreserve
- remaining functions in amdgpu: All for test_ib implementations for
various engines, caller for that looks all safe (debugfs, driver
load, reset)
- etnaviv: another wait ioctl
- habanalabs: another wait ioctl
- nouveau_fence.c: hardcoded 15*HZ ... glorious
- nouveau_gem.c: hardcoded 2*HZ ... so not even super consistent, but
this one does have a WARN_ON :-/ At least this one is only a
fallback path for when kmalloc fails. Maybe this should be put onto
some worker list instead of a work per unmap ...
- i915/selftests: Hardcoded HZ / 4 or HZ / 8
- i915/gt/selftests: Going up the callchain looks safe looking at
nearby callers
- i915/gt/intel_gt_requests.c: Wrapped in a mutex_lock
- i915/gem/i915_gem_wait.c: The i915-version which is called instead
for i915 fences already has a might_sleep() annotation, so all good
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Lucas Stach <l.stach(a)pengutronix.de>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: "VMware Graphics" <linux-graphics-maintainer(a)vmware.com>
Cc: Oded Gabbay <oded.gabbay(a)gmail.com>
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-rdma(a)vger.kernel.org
Cc: amd-gfx(a)lists.freedesktop.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
drivers/dma-buf/dma-fence.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 90edf2b281b0..656e9ac2d028 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -208,6 +208,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
if (WARN_ON(timeout < 0))
return -EINVAL;
+ might_sleep();
+
trace_dma_fence_wait_start(fence);
if (fence->ops->wait)
ret = fence->ops->wait(fence, intr, timeout);
--
2.26.2
On Fri, May 15, 2020 at 02:07:06PM +0900, David Stevens wrote:
> On Thu, May 14, 2020 at 9:30 PM Daniel Vetter <daniel(a)ffwll.ch> wrote:
> > On Thu, May 14, 2020 at 05:19:40PM +0900, David Stevens wrote:
> > > Sorry for the duplicate reply, didn't notice this until now.
> > >
> > > > Just storing
> > > > the uuid should be doable (assuming this doesn't change during the
> > > > lifetime of the buffer), so no need for a callback.
> > >
> > > Directly storing the uuid doesn't work that well because of
> > > synchronization issues. The uuid needs to be shared between multiple
> > > virtio devices with independent command streams, so to prevent races
> > > between importing and exporting, the exporting driver can't share the
> > > uuid with other drivers until it knows that the device has finished
> > > registering the uuid. That requires a round trip to and then back from
> > > the device. Using a callback allows the latency from that round trip
> > > registration to be hidden.
> >
> > Uh, that means you actually do something and there's locking involved.
> > Makes stuff more complicated, invariant attributes are a lot easier
> > > generally. Would always registering that uuid not work, and then blocking
> > > when you're exporting?
>
> Registering the id at creation and blocking in gem export is feasible,
> but it doesn't work well for systems with a centralized buffer
> allocator that doesn't support batch allocations (e.g. gralloc). In
> such a system, the round trip latency would almost certainly be
> included in the buffer allocation time. At least on the system I'm
> working on, I suspect that would add 10s of milliseconds of startup
> latency to video pipelines (although I haven't benchmarked the
> difference). Doing the blocking as late as possible means most or all
> of the latency can be hidden behind other pipeline setup work.
>
> In terms of complexity, I think the synchronization would be basically
> the same in either approach, just in different locations. All it would
> do is alleviate the need for a callback to fetch the UUID.
Hm ok. I guess if we go with the older patch, where this all is a lot more
just code in virtio, doing an extra function to allocate the uuid sounds
fine. Then synchronization is entirely up to the virtio subsystem and not
a dma-buf problem (and hence not mine). You can use dma_resv_lock or so,
but no need to. But with callbacks potentially going both ways things
always get a bit interesting wrt locking - this is what makes peer2peer
dma-buf so painful right now. Hence I'd like to avoid that if possible, at
least at the dma-buf level. virtio code I don't mind what you do there :-)
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Wed, Mar 11, 2020 at 12:20 PM David Stevens <stevensd(a)chromium.org> wrote:
>
> This change adds a new dma-buf operation that allows dma-bufs to be used
> by virtio drivers to share exported objects. The new operation allows
> the importing driver to query the exporting driver for the UUID which
> identifies the underlying exported object.
>
> Signed-off-by: David Stevens <stevensd(a)chromium.org>
Adding Tomasz Figa, I've discussed this with him at elce last year I
think. Just to make sure.
Bunch of things:
- obviously we need the users of this in a few drivers, can't really
review anything stand-alone
- adding very specific ops to the generic interface is rather awkward,
eventually everyone wants that and we end up in a mess. I think the
best solution here would be if we create a struct virtio_dma_buf which
subclasses dma-buf, add (hopefully safe) runtime upcasting
functions, and then a virtio_dma_buf_get_uuid() function. Just storing
the uuid should be doable (assuming this doesn't change during the
lifetime of the buffer), so no need for a callback.
- for the runtime upcasting the usual approach is to check the ->ops
pointer. Which means that would need to be the same for all virtio
dma_bufs, which might get a bit awkward. But I'd really prefer we not
add allocator specific stuff like this to dma-buf.
-Daniel
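
The subclassing idea in the reply above (embed the generic buffer in a
virtio-specific struct and upcast only when the ->ops pointer matches) can be
sketched in plain C. This is a userspace illustration only: struct toy_dma_buf,
struct toy_virtio_dma_buf and the toy_* functions are hypothetical, and the
uuid is stored as an invariant attribute, which is the simple case the reply
assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dma_buf;

struct toy_dma_buf_ops {
	void (*release)(struct toy_dma_buf *buf);
};

/* Generic buffer: all a consumer can see without upcasting. */
struct toy_dma_buf {
	const struct toy_dma_buf_ops *ops;
};

/* "Subclass": embeds the generic buffer and adds the invariant uuid. */
struct toy_virtio_dma_buf {
	struct toy_dma_buf base;
	unsigned char uuid[16];
};

static void toy_virtio_release(struct toy_dma_buf *buf) { (void)buf; }
static void toy_other_release(struct toy_dma_buf *buf) { (void)buf; }

/* One shared ops table identifies every virtio-flavoured buffer. */
static const struct toy_dma_buf_ops toy_virtio_ops = {
	.release = toy_virtio_release,
};

/* Some unrelated exporter, used to show that the upcast is refused. */
static const struct toy_dma_buf_ops toy_other_ops = {
	.release = toy_other_release,
};

/* Checked upcast: NULL unless the buffer really is a virtio buffer. */
static struct toy_virtio_dma_buf *toy_to_virtio(struct toy_dma_buf *buf)
{
	if (!buf || buf->ops != &toy_virtio_ops)
		return NULL;
	/* container_of() by hand: step back from the member to the outer struct */
	return (struct toy_virtio_dma_buf *)
		((char *)buf - offsetof(struct toy_virtio_dma_buf, base));
}

/* Returns 0 and copies the uuid on success, -1 for non-virtio buffers. */
static int toy_virtio_get_uuid(struct toy_dma_buf *buf, unsigned char out[16])
{
	struct toy_virtio_dma_buf *vbuf = toy_to_virtio(buf);

	if (!vbuf)
		return -1;
	memcpy(out, vbuf->uuid, sizeof(vbuf->uuid));
	return 0;
}
```

The awkward part mentioned in the reply shows up here as well: the check only
works if every virtio exporter shares the same ops table.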
> ---
> drivers/dma-buf/dma-buf.c | 12 ++++++++++++
> include/linux/dma-buf.h | 18 ++++++++++++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index d4097856c86b..fa5210ba6aaa 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1158,6 +1158,18 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, void *vaddr)
> }
> EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>
> +int dma_buf_get_uuid(struct dma_buf *dmabuf, uuid_t *uuid)
> +{
> + if (WARN_ON(!dmabuf) || !uuid)
> + return -EINVAL;
> +
> + if (!dmabuf->ops->get_uuid)
> + return -ENODEV;
> +
> + return dmabuf->ops->get_uuid(dmabuf, uuid);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_get_uuid);
> +
> #ifdef CONFIG_DEBUG_FS
> static int dma_buf_debug_show(struct seq_file *s, void *unused)
> {
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index abf5459a5b9d..00758523597d 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -251,6 +251,21 @@ struct dma_buf_ops {
>
> void *(*vmap)(struct dma_buf *);
> void (*vunmap)(struct dma_buf *, void *vaddr);
> +
> + /**
> + * @get_uuid:
> + *
> + * This is called by dma_buf_get_uuid to get the UUID which identifies
> + * the buffer to virtio devices.
> + *
> + * This callback is optional.
> + *
> + * Returns:
> + *
> + * 0 on success or a negative error code on failure. On success uuid
> + * will be populated with the buffer's UUID.
> + */
> + int (*get_uuid)(struct dma_buf *dmabuf, uuid_t *uuid);
> };
>
> /**
> @@ -444,4 +459,7 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
> unsigned long);
> void *dma_buf_vmap(struct dma_buf *);
> void dma_buf_vunmap(struct dma_buf *, void *vaddr);
> +
> +int dma_buf_get_uuid(struct dma_buf *dmabuf, uuid_t *uuid);
> +
> #endif /* __DMA_BUF_H__ */
> --
> 2.25.1.481.gfbce0eb801-goog
>
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, May 14, 2020 at 05:19:40PM +0900, David Stevens wrote:
> Sorry for the duplicate reply, didn't notice this until now.
>
> > Just storing
> > the uuid should be doable (assuming this doesn't change during the
> > lifetime of the buffer), so no need for a callback.
>
> Directly storing the uuid doesn't work that well because of
> synchronization issues. The uuid needs to be shared between multiple
> virtio devices with independent command streams, so to prevent races
> between importing and exporting, the exporting driver can't share the
> uuid with other drivers until it knows that the device has finished
> registering the uuid. That requires a round trip to and then back from
> the device. Using a callback allows the latency from that round trip
> registration to be hidden.
Uh, that means you actually do something and there's locking involved.
Makes stuff more complicated, invariant attributes are a lot easier
generally. Would always registering that uuid not work, and then blocking
when you're exporting?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Thu, May 14, 2020 at 11:08:52AM +0900, David Stevens wrote:
> On Thu, May 14, 2020 at 12:45 AM Daniel Vetter <daniel(a)ffwll.ch> wrote:
> > On Wed, Mar 11, 2020 at 12:20 PM David Stevens <stevensd(a)chromium.org> wrote:
> > >
> > > This change adds a new dma-buf operation that allows dma-bufs to be used
> > > by virtio drivers to share exported objects. The new operation allows
> > > the importing driver to query the exporting driver for the UUID which
> > > identifies the underlying exported object.
> > >
> > > Signed-off-by: David Stevens <stevensd(a)chromium.org>
> >
> > Adding Tomasz Figa, I discussed this with him at ELCE last year, I
> > think. Just to make sure.
> >
> > Bunch of things:
> > - obviously we need the users of this in a few drivers, can't really
> > review anything stand-alone
>
> Here is a link to the usage of this feature by the currently under
> development virtio-video driver:
> https://markmail.org/thread/j4xlqaaim266qpks
>
> > - adding very specific ops to the generic interface is rather awkward,
> > eventually everyone wants that and we end up in a mess. I think the
> > best solution here would be if we create a struct virtio_dma_buf which
> > subclasses dma-buf, add a (hopefully safe) runtime upcasting
> > function, and then a virtio_dma_buf_get_uuid() function. Just storing
> > the uuid should be doable (assuming this doesn't change during the
> > lifetime of the buffer), so no need for a callback.
>
> So you would prefer a solution similar to the original version of this
> patchset? https://markmail.org/message/z7if4u56q5fmaok4
yup.
-Daniel
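
The subclassing approach suggested above can be illustrated with a minimal
userspace sketch. Everything here is hypothetical: the toy_* names are
stand-ins invented for illustration and are not the kernel's struct dma_buf
or any virtio API. The sketch assumes the uuid is invariant for the lifetime
of the buffer, and it uses the ops pointer as a type tag so that the cast
from the generic buffer to the subclass can be checked at runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the generic buffer and its ops table. */
struct toy_buf_ops {
	const char *name;
};

struct toy_buf {
	const struct toy_buf_ops *ops;	/* doubles as a runtime type tag */
	size_t size;
};

/* The "subclass": embeds the generic buffer and adds an invariant uuid. */
struct toy_virtio_buf {
	struct toy_buf base;
	unsigned char uuid[16];
};

const struct toy_buf_ops toy_virtio_ops = { "toy-virtio" };
const struct toy_buf_ops toy_other_ops = { "toy-other" };

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Checked cast: returns NULL unless the buffer really is a toy_virtio_buf. */
struct toy_virtio_buf *toy_to_virtio_buf(struct toy_buf *buf)
{
	if (buf == NULL || buf->ops != &toy_virtio_ops)
		return NULL;
	return toy_container_of(buf, struct toy_virtio_buf, base);
}

/* Copies the stored uuid; returns 0 on success, -1 for a foreign buffer. */
int toy_virtio_buf_get_uuid(struct toy_buf *buf, unsigned char out[16])
{
	struct toy_virtio_buf *vbuf = toy_to_virtio_buf(buf);

	assert(out != NULL);
	if (vbuf == NULL)
		return -1;
	memcpy(out, vbuf->uuid, sizeof(vbuf->uuid));
	return 0;
}
```

With this layout the generic interface gains no new operation; an importer
that does not understand the subclass simply gets an error from the helper.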
On Wed, May 13, 2020 at 05:40:26PM +0530, Charan Teja Kalla wrote:
>
> Thank you Greg for the comments.
> On 5/12/2020 2:22 PM, Greg KH wrote:
> > On Fri, May 08, 2020 at 12:11:03PM +0530, Charan Teja Reddy wrote:
> >> The following race occurs while accessing the dmabuf object exported as
> >> file:
> >> P1                           P2
> >> dma_buf_release()            dmabuffs_dname()
> >>                              [say lsof reading /proc/<P1 pid>/fd/<num>]
> >>
> >>                              read dmabuf stored in dentry->d_fsdata
> >> Free the dmabuf object
> >>                              Start accessing the dmabuf structure
> >>
> >> In the above description, the dmabuf object freed in P1 is being
> >> accessed from P2, which results in a use-after-free. The reported
> >> stack dump is below.
> >>
> >> We are reading the dmabuf object stored in dentry->d_fsdata, but
> >> there is no binding between the dentry and the dmabuf, which means that
> >> the dmabuf can be freed while it is being read from ->d_fsdata and is
> >> still in use. Reviews of patch V1 say that protecting the in-use dmabuf
> >> with an extra refcount is not a viable solution, as the exported dmabuf
> >> is already under the file's refcount and keeping multiple refcounts on
> >> the same object coordinated is not possible.
> >>
> >> As we are reading the dmabuf in ->d_fsdata just to get the user-passed
> >> name, we can store the name directly in d_fsdata and thus avoid
> >> reading the dmabuf altogether.
> >>
> >> Call Trace:
> >> kasan_report+0x12/0x20
> >> __asan_report_load8_noabort+0x14/0x20
> >> dmabuffs_dname+0x4f4/0x560
> >> tomoyo_realpath_from_path+0x165/0x660
> >> tomoyo_get_realpath
> >> tomoyo_check_open_permission+0x2a3/0x3e0
> >> tomoyo_file_open
> >> tomoyo_file_open+0xa9/0xd0
> >> security_file_open+0x71/0x300
> >> do_dentry_open+0x37a/0x1380
> >> vfs_open+0xa0/0xd0
> >> path_openat+0x12ee/0x3490
> >> do_filp_open+0x192/0x260
> >> do_sys_openat2+0x5eb/0x7e0
> >> do_sys_open+0xf2/0x180
> >>
> >> Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
> >> Reported-by: syzbot+3643a18836bce555bff6(a)syzkaller.appspotmail.com
> >> Cc: <stable(a)vger.kernel.org> [5.3+]
> >> Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
> >> ---
> >>
> >> Changes in v2:
> >>
> >> - Pass the user passed name in ->d_fsdata instead of dmabuf
> >> - Improve the commit message
> >>
> >> Changes in v1: (https://patchwork.kernel.org/patch/11514063/)
> >>
> >> drivers/dma-buf/dma-buf.c | 17 ++++++++++-------
> >> 1 file changed, 10 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >> index 01ce125..0071f7d 100644
> >> --- a/drivers/dma-buf/dma-buf.c
> >> +++ b/drivers/dma-buf/dma-buf.c
> >> @@ -25,6 +25,7 @@
> >> #include <linux/mm.h>
> >> #include <linux/mount.h>
> >> #include <linux/pseudo_fs.h>
> >> +#include <linux/dcache.h>
> >>
> >> #include <uapi/linux/dma-buf.h>
> >> #include <uapi/linux/magic.h>
> >> @@ -40,15 +41,13 @@ struct dma_buf_list {
> >>
> >> static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
> >> {
> >> - struct dma_buf *dmabuf;
> >> char name[DMA_BUF_NAME_LEN];
> >> size_t ret = 0;
> >>
> >> - dmabuf = dentry->d_fsdata;
> >> - dma_resv_lock(dmabuf->resv, NULL);
> >> - if (dmabuf->name)
> >> - ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
> >> - dma_resv_unlock(dmabuf->resv);
> >> + spin_lock(&dentry->d_lock);
> >
> > Are you sure this lock always protects d_fsdata?
>
> I think yes. In dma-buf.c, I have to make sure that d_fsdata is always
> accessed under d_lock, so that it is protected. (The posted patch misses
> one place, in dma_buf_set_name(); I will fix this in V3.)
>
> >
> >> + if (dentry->d_fsdata)
> >> + ret = strlcpy(name, dentry->d_fsdata, DMA_BUF_NAME_LEN);
> >> + spin_unlock(&dentry->d_lock);
> >>
> >> return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
> >> dentry->d_name.name, ret > 0 ? name : "");
> >
> > If the above check fails the name will be what? How could d_name.name
> > be valid but d_fsdata not be valid?
>
> If the check fails, the code appends the empty string "" to the name
> (ret > 0 ? name : "", with ret initialized to zero). Thus the name
> string will look like "/dmabuf:".
So multiple objects can have the same "name" if this happens to multiple
ones at once?
> Regarding the validity of d_fsdata, we are setting the dmabuf's
> dentry->d_fsdata to NULL in dma_buf_release(), so it can become invalid
> if that dmabuf is in the free path.
Why are we allowing the name to be set if the dmabuf is on the free path
at all? Shouldn't that be the real fix here?
> >> @@ -80,12 +79,16 @@ static int dma_buf_fs_init_context(struct fs_context *fc)
> >> static int dma_buf_release(struct inode *inode, struct file *file)
> >> {
> >> struct dma_buf *dmabuf;
> >> + struct dentry *dentry = file->f_path.dentry;
> >>
> >> if (!is_dma_buf_file(file))
> >> return -EINVAL;
> >>
> >> dmabuf = file->private_data;
> >>
> >> + spin_lock(&dentry->d_lock);
> >> + dentry->d_fsdata = NULL;
> >> + spin_unlock(&dentry->d_lock);
> >> BUG_ON(dmabuf->vmapping_counter);
> >>
> >> /*
> >> @@ -343,6 +346,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
> >> }
> >> kfree(dmabuf->name);
> >> dmabuf->name = name;
> >> + dmabuf->file->f_path.dentry->d_fsdata = name;
> >
> > You are just changing the use of d_fsdata from being a pointer to the
> > dmabuf to being a pointer to the name string? What's to keep that name
> > string around and not have the same reference counting issues that the
> > dmabuf structure itself has? Who frees that string memory?
> >
>
> Yes, I am just storing the name string in d_fsdata in place of the
> dmabuf, and this removes the need for any extra refcount. The
> user-passed name carried in d_fsdata is copied to the local buffer in
> dmabuffs_dname() under spin_lock(d_lock), and the same d_fsdata is set
> to NULL (also under d_lock) when that dmabuf is in the release path.
> So, when d_fsdata is NULL, the name string is not accessed from
> dmabuffs_dname(), and thus an extra refcount is not required.
>
> The string memory, stored in dmabuf->name, is released from
> dma_buf_release(). The flow is: it first sets d_fsdata = NULL and
> then frees dmabuf->name.
>
> However, from your comments I have realized that there is a race in this
> patch when using the name string between dma_buf_set_name() and
> dmabuffs_dname(). But if the idea of passing the name string in place of
> the dmabuf in d_fsdata looks fine, I can fix this in the next patch.
I'll leave that to the dmabuf authors/maintainers, but it feels odd to
me...
thanks,
greg k-h
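
For readers following the locking argument in this thread, here is a minimal
userspace sketch of the "copy the name out under the lock" pattern being
discussed. It is an illustration under stated assumptions, not the patch
itself: the toy_* names are hypothetical, a pthread mutex stands in for the
dentry's d_lock spinlock, and the setter takes the lock as well, which is
the fix the patch author says the posted version is missing. Ownership and
freeing of the name string are left to the caller here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define TOY_NAME_LEN 32

/* Hypothetical stand-in for a dentry; not the kernel structure. */
struct toy_dentry {
	pthread_mutex_t lock;	/* stands in for d_lock */
	const char *base;	/* stands in for d_name.name */
	char *fsdata;		/* stands in for d_fsdata: the name, or NULL */
};

/* Formats "/<base>:<name>"; the name is copied out while holding the lock. */
int toy_dname(struct toy_dentry *d, char *buf, size_t buflen)
{
	char name[TOY_NAME_LEN];
	size_t len = 0;

	assert(d != NULL && buf != NULL);
	pthread_mutex_lock(&d->lock);
	if (d->fsdata) {
		/* Bounded copy; the pointer is not used after unlock. */
		snprintf(name, sizeof(name), "%s", d->fsdata);
		len = strlen(name);
	}
	pthread_mutex_unlock(&d->lock);

	return snprintf(buf, buflen, "/%s:%s", d->base, len > 0 ? name : "");
}

/* Publishes a new name under the same lock the reader takes. */
void toy_set_name(struct toy_dentry *d, char *name)
{
	pthread_mutex_lock(&d->lock);
	d->fsdata = name;
	pthread_mutex_unlock(&d->lock);
}

/* Release path: clear the pointer first; the caller frees the string after. */
void toy_release(struct toy_dentry *d)
{
	pthread_mutex_lock(&d->lock);
	d->fsdata = NULL;
	pthread_mutex_unlock(&d->lock);
}
```

Note that, as raised in the review, every unnamed or released buffer formats
to the same "/<base>:" string in this scheme.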
Dear All,
During the Exynos DRM GEM rework and while fixing the issues in the
drm_prime_sg_to_page_addr_arrays() function [1], I've noticed that most
drivers in the DRM framework incorrectly use the nents and orig_nents
entries of struct sg_table.
With most DMA-mapping implementations, exchanging those two entries or
using nents for all loops over the scatterlist is harmless, because both
have the same value. There are, however, DMA-mapping implementations for
which such incorrect usage breaks things. The nents returned by
dma_map_sg() might be lower than the nents passed as its parameter, and
this is perfectly fine. The DMA framework or IOMMU is allowed to join
consecutive chunks while mapping if such an operation is supported by the
underlying HW (bus, bridge, IOMMU, etc). An example of a case where
dma_map_sg() might return 1 'DMA' chunk for 4 'physical' pages is
described here [2].
The DMA-mapping framework documentation [3] states that dma_map_sg()
returns the number of entries created in the DMA address space.
However, the subsequent calls to dma_sync_sg_for_{device,cpu} and
dma_unmap_sg must be made with the original number of entries passed to
dma_map_sg. The common pattern in DRM drivers was to assign the
dma_map_sg() return value to sg_table->nents and use that value for
the subsequent calls to the dma_sync_sg_* or dma_unmap_sg functions. The
code also iterated nents times to access the pages stored in the
processed scatterlist, while it should use orig_nents as the number of
page entries.
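
The difference between the two counters can be shown with a small userspace
model. This is only a sketch under stated assumptions: the toy_* types and
functions are hypothetical and are not the kernel's struct scatterlist,
struct sg_table or dma_map_sg(). The model uses an identity mapping and
merges physically consecutive chunks, which is the behaviour described
above for some IOMMU-backed implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct toy_sg {
	unsigned long phys;	/* CPU-side address of the chunk */
	unsigned long len;	/* CPU-side length of the chunk */
	unsigned long dma_addr;	/* filled in by toy_map_sg() */
	unsigned long dma_len;	/* filled in by toy_map_sg() */
};

/* Hypothetical stand-in for the table holding both counters. */
struct toy_sgt {
	struct toy_sg *sgl;
	unsigned int nents;		/* number of DMA-mapped entries */
	unsigned int orig_nents;	/* number of CPU-side entries */
};

/*
 * Models a mapper that joins consecutive chunks. Returns the number of
 * DMA entries, which may be lower than orig_nents.
 */
unsigned int toy_map_sg(struct toy_sg *sgl, unsigned int orig_nents)
{
	unsigned int i, mapped = 0;

	assert(sgl != NULL);
	for (i = 0; i < orig_nents; i++) {
		if (mapped > 0 && sgl[mapped - 1].dma_addr +
		    sgl[mapped - 1].dma_len == sgl[i].phys) {
			/* Chunk continues the previous DMA entry: extend it. */
			sgl[mapped - 1].dma_len += sgl[i].len;
		} else {
			sgl[mapped].dma_addr = sgl[i].phys;
			sgl[mapped].dma_len = sgl[i].len;
			mapped++;
		}
	}
	return mapped;
}

/* CPU-side walk: must iterate orig_nents entries. */
unsigned long toy_total_cpu_bytes(const struct toy_sgt *sgt)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i < sgt->orig_nents; i++)
		total += sgt->sgl[i].len;
	return total;
}

/* DMA-side walk: must iterate nents entries. */
unsigned long toy_total_dma_bytes(const struct toy_sgt *sgt)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i < sgt->nents; i++)
		total += sgt->sgl[i].dma_len;
	return total;
}
```

In this model, mapping four consecutive 4096-byte chunks yields a single DMA
entry, so a page walk bounded by nents would visit only the first of the
four pages, while both walks above still account for all 16384 bytes.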
I've tried to identify all such incorrect usage of sg_table->nents, and
this is the result of my research. It looks like the incorrect pattern has
been copied across many drivers, mainly in the DRM subsystem. Unfortunately,
in most cases it even worked correctly if the system used a simple, linear
DMA-mapping implementation, for which swapping nents and orig_nents
doesn't make any difference. To avoid similar issues in the future, I've
introduced common wrappers for the DMA-mapping calls, which operate directly
on sg_table objects. I've also added wrappers for iterating over the
scatterlists stored in sg_table objects and applied them where
possible. This, together with some common DRM prime helpers, allowed me
to get rid of almost all nents/orig_nents usage in the drivers. I hope
that this change makes the code robust, easier to follow and copy/paste
safe.
The biggest TODO is the DRM/i915 driver, and I don't feel brave enough to
fix it fully. The driver creatively uses sg_table->orig_nents to store the
size of the allocated scatterlist and ignores the number of entries
returned by the dma_map_sg function. In this patchset I only fixed the
sg_table objects exported by dmabuf-related functions. I hope that I
didn't break anything there.
Patches are based on top of Linux next-20200511.
Best regards,
Marek Szyprowski
References:
[1] https://lkml.org/lkml/2020/3/27/555
[2] https://lkml.org/lkml/2020/3/29/65
[3] Documentation/DMA-API-HOWTO.txt
Changelog:
v4:
- added for_each_sgtable_* wrappers and applied where possible
- added drm_prime_get_contiguous_size() and applied where possible
- applied drm_prime_sg_to_page_addr_arrays() where possible to remove page
extraction from sg_table objects
- added documentation for the introduced wrappers
- improved patches description a bit
v3: https://lore.kernel.org/dri-devel/20200505083926.28503-1-m.szyprowski@samsu…
- introduce dma_*_sgtable_* wrappers and use them in all patches
v2: https://lore.kernel.org/linux-iommu/c01c9766-9778-fd1f-f36e-2dc7bd376ba4@ar…
- dropped most of the changes to drm/i915
- added fixes for rcar-du, xen, media and ion
- fixed a few issues pointed by kbuild test robot
- added wide cc: list for each patch
v1: https://lore.kernel.org/linux-iommu/c01c9766-9778-fd1f-f36e-2dc7bd376ba4@ar…
- initial version
Patch summary:
Marek Szyprowski (38):
dma-mapping: add generic helpers for mapping sgtable objects
scatterlist: add generic wrappers for iterating over sgtable objects
iommu: add generic helper for mapping sgtable objects
drm: prime: add common helper to check scatterlist contiguity
drm: prime: use sgtable iterators in
drm_prime_sg_to_page_addr_arrays()
drm: core: fix common struct sg_table related issues
drm: amdgpu: fix common struct sg_table related issues
drm: armada: fix common struct sg_table related issues
drm: etnaviv: fix common struct sg_table related issues
drm: exynos: use common helper for a scatterlist contiguity check
drm: exynos: fix common struct sg_table related issues
drm: i915: fix common struct sg_table related issues
drm: lima: fix common struct sg_table related issues
drm: mediatek: use common helper for a scatterlist contiguity check
drm: mediatek: use common helper for extracting pages array
drm: msm: fix common struct sg_table related issues
drm: omapdrm: use common helper for extracting pages array
drm: omapdrm: fix common struct sg_table related issues
drm: panfrost: fix common struct sg_table related issues
drm: radeon: fix common struct sg_table related issues
drm: rockchip: use common helper for a scatterlist contiguity check
drm: rockchip: fix common struct sg_table related issues
drm: tegra: fix common struct sg_table related issues
drm: v3d: fix common struct sg_table related issues
drm: virtio: fix common struct sg_table related issues
drm: vmwgfx: fix common struct sg_table related issues
xen: gntdev: fix common struct sg_table related issues
drm: host1x: fix common struct sg_table related issues
drm: rcar-du: fix common struct sg_table related issues
dmabuf: fix common struct sg_table related issues
staging: ion: remove dead code
staging: ion: fix common struct sg_table related issues
staging: tegra-vde: fix common struct sg_table related issues
misc: fastrpc: fix common struct sg_table related issues
rapidio: fix common struct sg_table related issues
samples: vfio-mdev/mbochs: fix common struct sg_table related issues
media: pci: fix common ALSA DMA-mapping related codes
videobuf2: use sgtable-based scatterlist wrappers
drivers/dma-buf/heaps/heap-helpers.c | 13 ++--
drivers/dma-buf/udmabuf.c | 7 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 8 +-
drivers/gpu/drm/armada/armada_gem.c | 12 +--
drivers/gpu/drm/drm_cache.c | 2 +-
drivers/gpu/drm/drm_gem_cma_helper.c | 23 +-----
drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++--
drivers/gpu/drm/drm_prime.c | 86 ++++++++++++----------
drivers/gpu/drm/etnaviv/etnaviv_gem.c | 12 ++-
drivers/gpu/drm/etnaviv/etnaviv_mmu.c | 13 +---
drivers/gpu/drm/exynos/exynos_drm_g2d.c | 10 +--
drivers/gpu/drm/exynos/exynos_drm_gem.c | 23 +-----
drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 11 +--
drivers/gpu/drm/i915/gem/selftests/mock_dmabuf.c | 7 +-
drivers/gpu/drm/lima/lima_gem.c | 11 ++-
drivers/gpu/drm/lima/lima_vm.c | 5 +-
drivers/gpu/drm/mediatek/mtk_drm_gem.c | 34 ++-------
drivers/gpu/drm/msm/msm_gem.c | 13 ++--
drivers/gpu/drm/msm/msm_gpummu.c | 14 ++--
drivers/gpu/drm/msm/msm_iommu.c | 2 +-
drivers/gpu/drm/omapdrm/omap_gem.c | 20 ++---
drivers/gpu/drm/panfrost/panfrost_gem.c | 4 +-
drivers/gpu/drm/panfrost/panfrost_mmu.c | 7 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 11 ++-
drivers/gpu/drm/rcar-du/rcar_du_vsp.c | 3 +-
drivers/gpu/drm/rockchip/rockchip_drm_gem.c | 42 +++--------
drivers/gpu/drm/tegra/gem.c | 27 +++----
drivers/gpu/drm/tegra/plane.c | 15 ++--
drivers/gpu/drm/v3d/v3d_mmu.c | 17 ++---
drivers/gpu/drm/virtio/virtgpu_object.c | 36 +++++----
drivers/gpu/drm/virtio/virtgpu_vq.c | 12 ++-
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 17 +----
drivers/gpu/host1x/job.c | 22 ++----
.../media/common/videobuf2/videobuf2-dma-contig.c | 41 +++++------
drivers/media/common/videobuf2/videobuf2-dma-sg.c | 32 +++-----
drivers/media/common/videobuf2/videobuf2-vmalloc.c | 12 +--
drivers/media/pci/cx23885/cx23885-alsa.c | 2 +-
drivers/media/pci/cx25821/cx25821-alsa.c | 2 +-
drivers/media/pci/cx88/cx88-alsa.c | 2 +-
drivers/media/pci/saa7134/saa7134-alsa.c | 2 +-
drivers/media/platform/vsp1/vsp1_drm.c | 8 +-
drivers/misc/fastrpc.c | 4 +-
drivers/rapidio/devices/rio_mport_cdev.c | 8 +-
drivers/staging/android/ion/ion.c | 25 +++----
drivers/staging/android/ion/ion.h | 1 -
drivers/staging/android/ion/ion_heap.c | 53 ++++---------
drivers/staging/android/ion/ion_system_heap.c | 2 +-
drivers/staging/media/tegra-vde/iommu.c | 4 +-
drivers/xen/gntdev-dmabuf.c | 13 ++--
include/drm/drm_prime.h | 2 +
include/linux/dma-mapping.h | 79 ++++++++++++++++++++
include/linux/iommu.h | 16 ++++
include/linux/scatterlist.h | 50 ++++++++++++-
samples/vfio-mdev/mbochs.c | 3 +-
56 files changed, 452 insertions(+), 477 deletions(-)
--
1.9.1