On Wed, Jan 30, 2013 at 2:07 AM, Rob Clark robdclark@gmail.com wrote:
========================== Basic problem statement:
GPU's do operations that commonly involve many buffers. Those buffers can be shared across contexts/processes, exist in different memory domains (for example VRAM vs system memory), and so on. And with PRIME / dmabuf, they can even be shared across devices. So there are a handful of situations where the driver needs to wait for buffers to become ready. If you think about this in terms of waiting on a buffer mutex for it to become available, this presents a problem because there is no way to guarantee that buffers appear in a execbuf/batch in the same order in all contexts. That is directly under control of userspace, and a result of the sequence of GL calls that an application makes. Which results in the potential for deadlock. The problem gets more complex when you consider that the kernel may need to migrate the buffer(s) into VRAM before the GPU operates on the buffer(s), which main in turn require evicting some other buffers (and you don't want to evict other buffers which are already queued up to the GPU), but for a simplified understanding of the problem you can ignore this.
The algorithm that TTM came up with for dealing with this problem is quite simple. For each group of buffers (execbuf) that need to be locked, the caller would be assigned a unique reservation_id, from a global counter. In case of deadlock in the process of locking all the buffers associated with a execbuf, the one with the lowest reservation_id wins, and the one with the higher reservation_id unlocks all of the buffers that it has already locked, and then tries again.
Originally TTM implemented this algorithm on top of an event-queue and atomic-ops, but Maarten Lankhorst realized that by merging this with the mutex code we could take advantage of the existing mutex fast-path code and result in a simpler solution, and so ticket_mutex was born. (Well, there where also some additional complexities with the original implementation when you start adding in cross-device buffer sharing for PRIME.. Maarten could probably better explain.)
I think the motivational writeup above is really nice, but the example code below is a bit wrong
How it is used:
A very simplified version:
int submit_execbuf(execbuf) { /* acquiring locks, before queuing up to GPU: */ seqno = assign_global_seqno(); retry: for (buf in execbuf->buffers) { ret = mutex_reserve_lock(&buf->lock, seqno); switch (ret) { case 0: /* we got the lock */ break; case -EAGAIN: /* someone with a lower seqno, so unreserve and try again: */ for (buf2 in reverse order starting before buf in execbuf->buffers) mutex_unreserve_unlock(&buf2->lock); goto retry; default: goto err; } }
/* now everything is good to go, submit job to GPU: */ ...
}
int finish_execbuf(execbuf) { /* when GPU is finished: */ for (buf in execbuf->buffers) mutex_unreserve_unlock(&buf->lock); } ==========================
Since gpu command submission is all asnyc (hopefully at least) we don't unlock once it completes, but right away after the commands are submitted. Otherwise you wouldn't be able to submit new execbufs using the same buffer objects (and besides, holding locks while going back out to userspace is evil).
The trick is to add a fence object for async operation (essentially a waitqueue on steriods to support gpu->gpu direct signalling). And updating fences for a given execbuf needs to happen atomically for all buffers, for otherwise userspace could trick the kernel into creating a circular fence chain. This wouldn't deadlock the kernel, since everything is async, but it'll nicely deadlock the gpus involved. Hence why we need ticketing locks to get dma_buf fences off the ground.
Maybe wait for Maarten's feedback, then update your motivational blurb a bit?
Cheers, Daniel