(Disclaimer: I come from a graphics background, so sorry if I use graphicsy terminology; please let me know if any of this isn't clear. I tried.)
There is a wide range of hardware capabilities that require different programming approaches in order to perform optimally. We need to define an interface that is flexible enough to handle each of them, or else it won't be used and we'll be right back where we are today: with vendors rolling their own support for the things they need.
I'm going to try to enumerate here some of the more distinctive usage patterns as I see them.
- Many or all engines may sit behind asynchronous command stream interfaces. Programming is done through "batch buffers": a set of commands operating on a set of in-memory buffers is prepared and then submitted to the kernel to be queued. The kernel first makes sure all of the buffers are resident (which may require paging or mapping into an IOMMU/GART, a.k.a. "pinning"), then queues the batch of commands. The hardware processes the commands at its earliest convenience, and then interrupts the CPU to notify it that it's done with the buffers (i.e., they can now be "unpinned"). Those familiar with graphics may recognize this programming model as a classic GPU command stream, but it doesn't need to be exclusive to GPUs; any number of devices may have such an on-demand paging mechanism.
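The pin-on-submit / unpin-on-completion lifecycle above can be sketched roughly as follows. This is a minimal in-memory model, not a real driver: all names (struct buf, pin, submit_batch, complete_batch) are hypothetical, and the "make resident" step stands in for actual paging or IOMMU/GART mapping work.

```c
/* Hypothetical sketch of the batch-buffer submission flow: pin every
   referenced buffer, queue the commands, unpin on the completion
   interrupt. A real driver would talk to hardware, not this model. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf {
    bool resident;   /* currently backed by pages / mapped in the GART? */
    int  pin_count;  /* >0 while the hardware may still access it */
};

/* Pin: make resident (page in / map into the IOMMU) and hold it there. */
static void pin(struct buf *b)
{
    if (!b->resident)
        b->resident = true;  /* stand-in for the actual paging/mapping */
    b->pin_count++;
}

static void unpin(struct buf *b)
{
    assert(b->pin_count > 0);
    b->pin_count--;          /* eviction becomes legal at zero */
}

/* Submission: pin all referenced buffers, then queue the commands. */
static void submit_batch(struct buf **bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pin(bufs[i]);
    /* ... hand the command stream to the hardware queue here ... */
}

/* Completion interrupt: hardware is done, so the buffers can be unpinned. */
static void complete_batch(struct buf **bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        unpin(bufs[i]);
}
```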
- In contrast, some engines may also stream to or from memory continuously (e.g., video capture or scanout); such buffers need to be pinned for an extended period of time, with a lifetime that is not tied to the command streams described above.
- There can be multiple different command streams working at the same time on the same buffers. (There may be hardware synchronization primitives between the multiple command streams so the CPU doesn't have to babysit too much, for both performance and power reasons.)
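As a rough illustration of the kind of hardware synchronization primitive meant here, one common pattern is a semaphore the producing stream signals and the consuming stream waits on, so the CPU never has to get involved in the handoff. The names below (hw_semaphore, sem_signal, sem_check) are hypothetical; real hardware would execute the signal and wait as commands in the streams themselves.

```c
/* Hypothetical semaphore-style sync primitive between command streams.
   Stream A emits a signal command after its last write to a shared
   buffer; stream B emits a wait command that stalls until the
   semaphore reaches the expected value. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct hw_semaphore {
    uint32_t value;  /* monotonically increasing release count */
};

/* Signal command: executed by the producing stream when it's done. */
static void sem_signal(struct hw_semaphore *s, uint32_t v)
{
    s->value = v;
}

/* Wait condition: the consuming stream proceeds only once this holds. */
static bool sem_check(const struct hw_semaphore *s, uint32_t v)
{
    return s->value >= v;
}
```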
- In some systems, the IOMMU/GART may be much smaller than physical memory; older GPUs and SoCs have this. To support these, we need to be able to map and unmap pages into the IOMMU on demand in our host command stream flow. This model also requires patching up pending batch buffers before queueing them to the hardware, so that their pointers refer to the newly-mapped locations in the IOMMU.
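The patch-up step might look something like the sketch below: the driver keeps a relocation list of offsets within the batch where device addresses live, and rewrites each one with the address the buffer happened to get mapped at this time. struct reloc and apply_relocs are hypothetical names, though real drivers keep similar relocation lists.

```c
/* Hypothetical relocation pass for the small-IOMMU case: before a
   batch is queued, every embedded device pointer is rewritten to the
   IOMMU address the target buffer was just mapped at. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct reloc {
    size_t   offset;      /* byte offset in the batch where a pointer lives */
    uint64_t iommu_addr;  /* where the target buffer is mapped this time */
};

static void apply_relocs(uint8_t *batch, const struct reloc *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t addr = r[i].iommu_addr;
        /* write the 64-bit device address, little-endian, into the stream */
        for (int b = 0; b < 8; b++)
            batch[r[i].offset + b] = (uint8_t)(addr >> (8 * b));
    }
}
```

As the point about large IOMMUs below suggests, this scan over the command buffer is exactly the cost you'd like to avoid when addresses can stay stable.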
- In other systems, the IOMMU/GART may be much larger than physical memory; more modern GPUs and SoCs have this. With these, we can reserve virtual (IOMMU) address space for each buffer up front. To userspace, the buffers always appear "mapped". This is similar in concept to how CPU virtual address space in userspace sticks around even when the underlying memory is paged out to disk. In this case, pinning is performed at the same point as in the small-IOMMU case above, but in the normal/fast case the pages are never paged out of the IOMMU, and the pin step just increments a refcount to prevent the pages from being evicted. It is desirable to keep the same IOMMU address for two reasons:
a) Features such as http://www.opengl.org/registry/specs/NV/shader_buffer_load.txt, where OpenGL client applications and shaders manipulate GPU virtual-address pointers directly, and a GPU virtual address is assumed to be valid forever.
b) Performance: scanning through the command buffers to patch up pointers can be very expensive.
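The fast path in the large-IOMMU case reduces to a refcount bump, since the device address is fixed at allocation time and never changes. A minimal sketch, with hypothetical names (buf_init, buf_pin, buf_unpin) and a trivial bump allocator standing in for real address-space reservation:

```c
/* Hypothetical large-IOMMU pin path: device VA is reserved once at
   allocation and stays stable, so pinning is just a reference count
   that keeps the pages from being evicted. No batch patching needed. */
#include <assert.h>
#include <stdint.h>

struct buf {
    uint64_t iommu_addr; /* reserved for the buffer's whole lifetime */
    int      refcount;   /* >0 prevents eviction from the IOMMU */
};

/* Toy device-VA allocator; a real driver would manage this properly. */
static uint64_t next_va = 0x100000;

static void buf_init(struct buf *b, uint64_t size)
{
    b->iommu_addr = next_va;  /* reserve device VA up front */
    next_va += size;
    b->refcount = 0;
}

/* Fast-path pin: no mapping work, just hold a reference. */
static uint64_t buf_pin(struct buf *b)
{
    b->refcount++;
    return b->iommu_addr;     /* same stable address every time */
}

static void buf_unpin(struct buf *b)
{
    assert(b->refcount > 0);
    b->refcount--;
}
```

Because buf_pin always returns the same address, features like shader_buffer_load can hand that pointer to shaders indefinitely, and no relocation scan of the command buffers is ever required.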
One other important note: buffer format properties may be necessary to set up mappings (both CPU and IOMMU mappings). For example, both types of mappings may need to know the tiling properties of the buffer. This may be a property of the mapping itself (consider it baked into the page table entries), not necessarily something a different driver or userspace can program later independently.
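In other words, the format properties travel with the mapping rather than with the buffer alone, along these lines (all names hypothetical):

```c
/* Hypothetical illustration of format properties being part of the
   mapping itself: tiling is fixed when the page tables are built, so
   it can't be reprogrammed independently after the fact. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum tiling { TILING_LINEAR, TILING_TILED };

struct map_attrs {
    enum tiling tiling;  /* baked into the PTEs at map time */
    bool        cached;  /* CPU cacheability, likewise fixed per mapping */
};

struct mapping {
    uint64_t         addr;
    struct map_attrs attrs;  /* immutable once the mapping exists */
};

/* Creating the mapping is the only point where attributes are chosen. */
static struct mapping map_buffer(uint64_t addr, struct map_attrs attrs)
{
    struct mapping m = { addr, attrs };
    return m;
}
```

The point is simply that whatever sharing interface we define needs a way to communicate these properties to whoever creates the mapping.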
Some of the discussion I heard this morning tended towards being overly simplistic and didn't seem to cover each of these cases well. Hopefully this will help get everyone on the same page.
Thanks, Robert