Hello Ilias,
I would prefer a fortnightly meeting at 14:00 UTC (to suit IN and
time zones further east). I would also prefer conference calls.
Regards,
Subash
Samsung India - Linaro,
Bangalore - India.
Launchpad: https://launchpad.net/~subashp/
On 02/06/11 01:58, Jesse Barker wrote:
> > * Communication and Meetings
> > - New IRC channel #linaro-mm-sig for meetings and general
> > communication between those working on and interested in these topics
> > (already created).
> > - IRC meetings will be weekly with an option for the constituency to
> > decide on ultimate frequency (logs to be emailed to linaro-mm-sig
> > list).
> > - Linaro can provide wiki services and any dial-in needed.
> > - Next face-to-face meetings:
> > . Linaro mid-cycle summit (August 1-5, see
> > https://wiki.linaro.org/Events/2011-08-LDS)
> > . Linux Plumbers Conference (September 7-9, see
> > http://www.linuxplumbersconf.org/2011/ocw/proposals/567)
> > . V4L2 brainstorm meeting (Hans Verkuil to update with details)
> >
Since this is an area of key interest to many parties, a periodic
meeting could provide a channel to all who are interested to participate
and discuss. I can set it up and send Google calendar invitations.
One obvious issue with this idea: with participants on this list spread
across 5-6 time zones, finding a single meeting time would be challenging,
but perhaps a time slot around 16:00 UTC would suit most?
Means: IRC as Jesse mentioned above; we can also set up a call via
Canonical's conferencing system.
Frequency: Is there a specific need for discussing weekly? Assuming
once-every-fortnight frequency, there could be around 2-3 meetings
before the Linaro sprint on August 1-5. Of course, if there is
participation on the IRC channel, then communication can happen more
often...
There are a couple more items I wanted to ask about:
1. I think we need a single wiki page with all the relevant pointers and
consolidated info, at wiki.linaro.org. I can collect the available
pointers and set up the wiki page.
2. Tracking work progress: certainly this work has been planned via
Launchpad blueprints added by Jesse. I do not know if everyone is on
Launchpad - I'd like to ask for suggestions on how to track work
progress especially from those who are not using Launchpad. Would
progress updates via the wiki suffice? If you have other suggestions
please let me know.
BR,
-- Ilias Biris, Aallonkohina 2D 19, 02320 Espoo, Finland Tel: +358 50
4839608 (mobile) Email: ilias dot biris at linaro dot org Skype:
ilias_biris
Memory Management Mini-Summit
Linaro Developer Summit, Budapest, May 9-11, 2011
=================================================
Hi all. Apologies for this report being so long in coming. I know
others have thrown in their perceptions and opinions on how the
mini-summit went, so I suppose it's my turn.
Outcomes:
---------
* Approach (full proposal under draft, to be sent to the lists below)
- Modified CMA for additional physically contiguous buffer support.
- dma-mapping API changes, enhancements and ARM architecture support.
- "struct dma_buf" based buffer sharing infrastructure with support
from device drivers.
- Pick any "low-hanging fruit" with respect to consolidation
(supporting the ARM arch/sub-arch goals).
* Proposal for work around allocation, mapping and buffer sharing to
be announced on:
- dri-devel
- linux-arm-kernel
- linux-kernel
- linux-media
- linux-mm
- linux-mm-sig
* Communication and Meetings
- New IRC channel #linaro-mm-sig for meetings and general
communication between those working on and interested in these topics
(already created).
- IRC meetings will be weekly with an option for the constituency to
decide on ultimate frequency (logs to be emailed to linaro-mm-sig
list).
- Linaro can provide wiki services and any dial-in needed.
- Next face-to-face meetings:
. Linaro mid-cycle summit (August 1-5, see
https://wiki.linaro.org/Events/2011-08-LDS)
. Linux Plumbers Conference (September 7-9, see
http://www.linuxplumbersconf.org/2011/ocw/proposals/567)
. V4L2 brainstorm meeting (Hans Verkuil to update with details)
Overview and Goals for the 3 days:
----------------------------------
* Day 1 - Component overviews, expected to spill over into day 2
* Day 2 - Concrete use case that outlines a definition of the problem
that we are trying to solve, and shows that we have solved it.
* Day 3 - Dig into the lower level details of the current
implementations. What do we have, what's missing, what's not
implemented for ARM.
This is about memory management, zero-copy pipelines, kernel/userspace
interfaces, memory reservations and much more :-)
In particular, what we would like to end up with is:
* Understand who is working on what; avoid work duplication.
* Focus on a specific problem we want to solve and discuss possible solutions.
* Come up with a plan to fix this specific problem.
* Start enumerating work items that the Linaro Graphics WG can work
on in this cycle.
Day 1:
------
The first day got off to a little bit of a stutter start as the summit
scheduler would not let us indicate that our desired starting time was
immediately after lunch, during the plenaries. However, that didn't
stop people from flocking to the session in droves. By the time I
made the kickoff comments on why we were there, and what we were there
to accomplish (see "Overview and Goals for the 3 days" above), we had
brought in an extra 10 chairs and there were people on the floor and
spilling out into the hallway.
Based upon our experiences from the birds-of-a-feather at the Embedded
Linux Conference, 2 things dominated day 1. First things first, I
assigned someone to take notes ;-). Etherpad made it really easy for
people to take notes collectively, including those participating
remotely, and for everyone to see who was writing what, but we
definitely needed someone whose focus would be capturing the
proceedings, so thanks to Dave Rusling for shouldering that burden.
The second thing was that we desperately needed an education in each
other's components and subsystems. Without this, we would risk missing
significant areas of discussion, or possibly even be violently
agreeing on something without realizing it. So, we started with a
series of component overviews. These were presentations on the order
of 20 minutes with some room for Q&A. On day 1, we had:
* V4L2 - Hans Verkuil
* DRM/GEM/KMS - Daniel Vetter
* TTM - Thomas Hellstrom
* CMA - Marek Szyprowski
* VCMM - Zach Pfeffer
All of these (as well as the ones from day 2) are available through
links on the mini-summit wiki
(https://wiki.linaro.org/Events/2011-05-MM).
Day 2:
------
The second day got off to a bit better a start than did day 1 as we
more clearly communicated the start time to everyone involved, and
forgot about the summit scheduler. We (conceptually) picked up where
day 1 left off with one more component overview:
* UMP - Ketil Johnson
and covered the MediaController API for good measure. From there, we
spent a fair amount of time discussing use cases to illustrate our
problem space. We started (via pre-summit submissions) with a couple
of variations on what amounted to basically the same thing. I think
the actual case is probably best illustrated by the pdf slides from
Sakari Ailus (see the link on the mini-summit wiki). Basically, we
want to take a video input, either from a camera or from a file,
decode it, process it, render to it and/or with it and display it.
These pipeline stages may be handled by hardware, by software on the
CPU or some combination of the two; each stage should be handled by
accepting a buffer from the last stage and operating on it in some
fashion (no copies wherever possible). It turned out that still image
capture can actually be a more complicated version of this use case,
but even something as simple as taking input through the camera and
displaying it (image preview) can involve much of the underpinnings
required to support the more complicated cases. We may indeed start
with this simple case as a proof-of-concept.
Once we had the use case nailed down, we moved onto the actual
components/subsystems that would need to share buffers in order for
the use case to work properly with the zero-copy (or at least
minimal-copy) requirement. We had:
* DRM
* V4L2
* fbdev
* ALSA
* DSP
* User-space (kind of all encompassing and could include things like
OpenCL, which also makes an interesting use case).
* DVB
* Out-of-tree GPU drivers
We rounded out the day by discussing exactly what metadata we would want
to track in order to enable the desired levels of sharing with
simultaneous device mappings, cache management and other
considerations (e.g., device peculiarities). What we came up with is
a struct (we called it "dma_buf") that has the following info:
* Size
* Creator/Allocator
* Attributes:
- sharable?
- contiguous?
- device-local?
* Reference count
* Pinning reference count
* CPU cache management data
* Device private data (e.g., quirky tiling modes)
* Scatter list
* Synchronization data (for managing in-flight device transactions)
* Mapping data
These last few (device privates through mapping data) are lists of
data, one for each device that has a mapping of the buffer. The
mapping data is nominally an address and per-device cache management
data. We actually got through this part fairly quickly. The
biggest part of the discussion was what to use for handles/identifiers
in the buffer sharing scheme. The discussion was between global
identifiers like GEM uses, or file descriptors as favored by Android.
Initially, there was an informal consensus around unique IDs, though
it was not a definitive decision (yet). The atomicity of passing file
descriptors between processes makes them quite attractive for the
task.
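As a rough illustration of the "dma_buf" metadata listed above, here is a
small userspace model in Python. It is only a sketch of the shape of the
data, not kernel code: the class names, field names and the attach() helper
are all made up for illustration, and the per-device entries are grouped
into one record per device for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class DeviceAttachment:
    """Hypothetical per-device record: one for each device mapping the buffer."""
    private: Any = None  # device private data, e.g. quirky tiling modes
    scatter_list: List[Tuple[int, int]] = field(default_factory=list)  # (address, length)
    sync: Any = None  # state for in-flight device transactions
    mapping_addr: int = 0  # address of the buffer as seen by this device
    cache_data: Any = None  # per-device cache management data


@dataclass
class DmaBufSketch:
    """Hypothetical model of the buffer-wide metadata from the list above."""
    size: int
    allocator: str
    sharable: bool = False
    contiguous: bool = False
    device_local: bool = False
    refcount: int = 1
    pin_count: int = 0
    cpu_cache_data: Any = None
    attachments: Dict[str, DeviceAttachment] = field(default_factory=dict)

    def attach(self, device: str, mapping_addr: int) -> DeviceAttachment:
        """Record that `device` now has a mapping of this buffer."""
        attachment = DeviceAttachment(mapping_addr=mapping_addr)
        self.attachments[device] = attachment
        return attachment
```

For example, DmaBufSketch(size=4096, allocator="v4l2").attach("drm", 0x8000)
adds one per-device entry without touching the buffer-wide fields.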
Day 3:
------
By the third day, there was a sense of running out of time and really
needing to ensure that we left with a reasonable set of outcomes (see
the overview and goals section above). In short, we wanted to make
sure that we had a plan/roadmap, a reasonably actionable set of tasks
that could be picked up by Linaro engineers and community members
alike, and that we would not only avoid duplicating new work, but also
reduce some of the existing code duplication that got us to this point
in the first place.
But, we weren't done. We still had to cover the requirements around
allocation and explore the dma-mapping and IOMMU APIs.
This took most of the day, but was a quite fruitful set of
discussions. As with the rest of the discussions, we focused on
leveraging existing technologies as much as possible. With
allocations, however, this wasn't entirely possible as we have devices
on ARM SoCs that do not have an IOMMU and require physically
contiguous buffers in order to operate. After a fair amount of
discussion, it was decided to use a modified version of the current CMA
(see Marek's slides linked from the wiki). It assumes the pages are
movable and manages them and not the mappings. There was concern that
the API didn't quite fit with other related APIs, so the changes from
the current state will be around those details.
On the mapping side, we focused on the dma-mapping API, layered on the
IOMMU API where appropriate. Without
going into crazy detail, we are looking at something like 4
implementations of the dma_map_ops functions for ARM: with and
without IOMMU, with and without bounce buffer (these last two exist,
but not using the dma_map_ops API). Marek has put out patches for
comment on the IOMMU based implementation of this based upon work he
had in progress. Also in the area of dma_map_ops, the sync related
APIs need a start address and offset, and the alloc and free need
attribute parameters like map and unmap already have (to support
cacheable/coherent/write-combined). In the "not involving
dma_map_ops" category, we have a couple of changes that are likely to
be non-trivial (not that any of the other proposed work is). It was
proposed to modify (actually, the word thrown about in the discussions
was "fix") dma_alloc_coherent for ARM to support unmapping from the
kernel linear mapping and the use of HIGHMEM; two separate
implementations, configured at build-time. And, last but not least,
there was a fair amount of concern over the cache management API and
its ability to live cleanly with the IOMMU code and to resist breakage
from other architecture implementations.
At this point, we reviewed what we had done and finalized the outcomes
(see the outcomes section at the top). And, with half an hour to
spare, I re-instigated the file descriptors versus unique identifiers
discussion from day 2. I think file descriptors were winning by the
end (especially after people started posting pointers to code samples
of how to actually pass them between processes)....
Attendees:
----------
I will likely miss people here trying to list out everyone, especially
given that some of the sessions were quite literally overflowing the
room we were in. For as accurate an account of attendance as I can
muster, check out the list of attendees on the mini-summit wiki page
or the discussion blueprints we used for scheduling:
https://wiki.linaro.org/Events/2011-05-MM#Attendees
https://blueprints.launchpad.net/linaro-graphics-wg/+spec/linaro-graphics-m…
https://blueprints.launchpad.net/linaro-graphics-wg/+spec/linaro-graphics-m…
https://blueprints.launchpad.net/linaro-graphics-wg/+spec/linaro-graphics-m…
The occupants of the fishbowl (the front/center of the room in closest
proximity to the microphones) were primarily:
Arnd Bergmann
Laurent Pinchart
Hans Verkuil
Mauro Chehab
Daniel Vetter
Sakari Ailus
Thomas Hellstrom
Marek Szyprowski
Jesse Barker
The IRC fishbowl seemed to consist of:
Rob Morell
Jordan Crouse
David Brown
There were certainly others both local and remote participating to
varying degrees that I do not intend to omit, and a special thanks
goes out to Joey Stanford for arranging a larger room for us on days 2
and 3 when we had people sitting on the floor and spilling into the
hallway during day 1.
On Mon, May 30, 2011 at 12:30 PM, PRASANNA KUMAR
<prasanna_tsm_kumar(a)yahoo.co.in> wrote:
> USB graphics devices from DisplayLink do not have 3D hardware. To get 3D
> effects (compiz, GNOME 3, KWin, OpenGL apps etc.) with these devices in Linux,
> the native (primary) GPU can be used to provide hardware acceleration. All
> the graphics operations are done using the native (primary) GPU and the end
> result is taken and sent to the DisplayLink device. Can this be achieved? If
> so, is it possible to implement a generic framework so that any device (USB,
> Thunderbolt or any new technology) can use this just by implementing device
> specific (compression and) data transport? I am not sure this is the correct
> mailing list.
fwiw, this situation is not too different from the SoC world. For
example, there are multiple ARM SoCs that share the same IMG/PowerVR
core or ARM/Mali 3D core, but each has its own unique display
controller.
I don't know quite the best way to deal with this (either at the
DRM/kernel layer or xorg driver layer), but there would certainly be
some benefit to being able to make the DRM driver a bit more modular to
combine a SoC specific display driver (mostly the KMS part) with a
different 2d and/or 3d accelerator IP. Of course the (or some of the)
challenge here is that different display controllers might have
different memory mgmt requirements (for ex, depending on whether the
display controller has an IOMMU or not) and formats, and that the flip
command should somehow come via the 2d/3d command stream.
I have an (experimental) DRM/KMS driver for OMAP which tries to solve
the issue by way of a simple plugin API, ie the idea being to separate
the PVR part from the OMAP display controller part more cleanly. I
don't think it is perfect, but it is an attempt. (I'll send patches
as an RFC, but wanted to do some cleanup first.. just haven't had time
yet.) But I'm definitely open to suggestions here.
BR,
-R
> Thanks,
> Prasanna Kumar
After the mm panels, I had a few discussions with Hans, Rob and Daniel,
among others, during the V4L and KMS discussions and after that. Based
on those discussions, I'm pretty much convinced that the normal MMAP
way of streaming (the VIDIOC_[REQBUFS|STREAMON|STREAMOFF|QBUF|DQBUF] ioctls)
is not the best way to share data with framebuffers. We probably need
something that is close to VIDIOC_FBUF/VIDIOC_OVERLAY, but it is
still not the same thing.
I suspect that working on such an API is somewhat orthogonal to the decision of
using a file pointer based or a buffer ID based kABI for passing the
buffer parameters to the new V4L calls, but we cannot decide on the type
of buffer ID that we'll use until we finish working on an initial RFC
for the V4L API, as the way the buffers will be passed into it will depend
on how we design such an API.
It should also be noted that, while for the shared buffers some
definitions can be postponed (as it is basically
a kernelspace-only ABI, at least initially), the V4L API should be
designed to consider all possible scenarios, as "diamonds and userspace
APIs are forever"(tm).
It seems to me that the proper way to develop such an API is to start
with the Xorg V4L driver, changing it to work with KMS and with the new API
(probably porting some parts of it to kernelspace).
One of the problems with a shared framebuffer is that an overlaid V4L stream
may, in the worst case, be sent to up to 4 different GPUs and/or displays, like:
===================+===================
| | |
| D1 +----|---+ D2 |
| | V4L| | |
+-------------|----+---|--------------|
| | | | |
| D3 +----+---+ D4 |
| | |
=======================================
Where D1, D2, D3 and D4 are 4 different displays, and the same V4L framebuffer is
partially shared between them (the above is an example of a V4L input, although
the reverse scenario of having one frame buffer divided into 4 V4L outputs
also seems to be possible).
As the same image may be divided into 4 monitors, the buffer filling should be
synced with all of them, in order to avoid flipping effects. Also, the buffer
can't be re-used until all displays finish reading.
Display APIs currently have similar issues. From what I understood from Rob and
Daniel, this is solved there by dynamically allocating buffers. So, we may need to
do something similar in V4L as well (as a matter of fact, there's currently a
proposal to hack REQBUFS in order to extend the V4L API to allow dynamically creating
more buffers than used by a stream). It makes sense to me to discuss that proposal
together with the above discussions, in order to keep the API consistent.
From my side, I expect those responsible for the API proposals to
also provide open source drivers and userspace application(s)
that allow testing and validating such an API RFC.
Thanks,
Mauro
Hi,
Here are my own notes from the Linaro memory management mini-summit in
Budapest. I've written them from my own point of view, which is mostly
V4L2 in embedded devices and camera related use cases. I attempted to
summarise the discussion, concentrating mostly on the parts which I
considered important and ignoring the rest.
So please do not consider this as the generic notes of the mini-summit.
:-) I still felt like sharing this since it might be found useful by
those who are working with similar systems with similar problems.
Memory buffer management --- the future
=======================================
The memory buffer management can be split into the following sub-problems
which may have dependencies, both in implementation and possibly in
the APIs as well:
- Fulfilling buffer allocation requirements
- API to allocate buffers
- Sharing buffers among kernel subsystems (e.g. V4L2, DRM, FB)
- Sharing buffers between processes
- Cache coherency
What has been agreed is that we need the kernel to recognise a DMA buffer
which may be passed between user processes and different kernel subsystems.
Fulfilling buffer allocation requirements
-----------------------------------------
APIs, as well as devices, have different requirements on the buffers.
It is difficult to come up with generic requirements for buffer
allocation, and to keep the solution future-proof is challenging as
well. In principle the user is interested in being able to share
buffers between subsystems without knowing the exact requirements of
the devices, which makes it possible to keep the requirement handling
internal to the kernel. Whether this is the way to go or not, will be
seen in the future. The buffer allocation remains a problem to be
resolved in the future.
The majority of the devices' requirements could be fulfilled using a few
allocators: one for physically contiguous memory and the other for
physically non-contiguous memory of single page allocations. Being
able to allocate large pages would also be beneficial in many cases.
API to allocate buffers
-----------------------
It was agreed there was a need to have a generic interface for buffer
object creation. This could be either a new system call which would be
supported by all devices supporting such buffers in subsystem APIs
(such as V4L2), or a new dedicated character device.
Different subsystems have different ways of describing the properties
of the buffers, such as how the data in the buffer should be
interpreted. The V4L2 has width, height, bytesperline and pixel
format, for example. The generic buffers should not recognise such
properties since this is very subsystem specific information. Instead,
the user, who is aware of the different subsystems, must come up with a
matching set of buffer properties using the subsystem specific
interfaces.
Sharing buffers among kernel subsystems
---------------------------------------
There was discussion on how to refer to generic DMA buffers, and the
audience was first mostly split between using buffer IDs to refer to
the buffers and using file handles for the purpose. Using file handles
has pros and cons compared to the numeric IDs:
+ Easy life cycle management. Deallocation of buffers no longer in use
is trivial.
+ Access control for files exists already. Passing file descriptors
between processes is possible through Unix sockets.
- Allocating extremely large number of buffers would require as many
file descriptors. This is not likely to be an important issue.
Before the day ended, it was felt that the file handles are the right
way to go.
The generic DMA buffers further need to be associated to the subsystem
buffers. This is up to the subsystem APIs. In V4L2, this would most
likely mean that there will be a new buffer type for the generic DMA
buffers.
Sharing buffers between processes
---------------------------------
Numeric IDs can be easily shared between processes while sharing file
handles is more difficult. However, it can be done using the Unix
sockets between any two processes. This also gives automatically
the same access control mechanism as every other file. Access control
mechanisms are mandatory when making the buffer shareable between
processes.
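For those curious how a file descriptor travels over a Unix socket, here is
a minimal sketch in Python using the SCM_RIGHTS ancillary message. The
helper names (send_fd, recv_fd, demo) are made up for this example, and a
pipe stands in for a buffer object. To stay short, demo() sends the
descriptor across a socketpair within a single process; between two real
processes the same two helpers would run on either end of a connected
AF_UNIX socket.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a connected Unix-domain socket."""
    fds = array.array("i", [fd])
    # Ordinary data (here one byte) accompanies the ancillary message.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock: socket.socket) -> int:
    """Receive one descriptor; the kernel installs a new fd in the receiver."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo() -> bytes:
    """Pass the read end of a pipe through a socketpair and read via the copy."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_fd(left, read_end)
        received = recv_fd(right)
        os.write(write_end, b"buffer contents")
        data = os.read(received, 64)
        os.close(received)
        return data
    finally:
        os.close(read_end)
        os.close(write_end)
        left.close()
        right.close()
```

The received descriptor is a new number in the receiving process that
refers to the same open file, so the object stays alive until every holder
has closed its descriptor.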
Cache coherency
---------------
Cache coherency is seen as largely orthogonal to the other sub-problems
in memory buffer management. In a few cases this might have something in
common with buffer allocation. Some architectures, ARM in particular, do
not have coherent caches, meaning that the operating system must know
when to invalidate or clean various parts of the cache. There are two
ways to approach the issue, independently of the cache implementation:
1. Allocate non-cacheable memory, or
2. invalidate or clean (or flush) the cache when necessary.
Allocating non-cacheable memory is a valid solution to cache coherency
handling in some situations, but mostly only when the buffer is only
partially accessed by the CPU or at least not multiple times. In other
cases, invalidating or cleaning the cache is the way to go.
The exact circumstances in which using non-cacheable memory gives a
performance benefit over invalidating or cleaning the cache when
necessary are very system and use case dependent. This should be
selectable from the user space.
The cache invalidation or cleaning can be either on the whole (data)
cache or a particular memory area. Performing the operation on a
particular memory area may be difficult since it should be done to all
mappings of the memory in the system. Also, there is a limit beyond
which performing invalidation or clean for an area is always more
expensive than a full cache flush: on many machines the cache line
size is 64 bytes, and the invalidate/clean must be performed once per
cache line over the whole buffer, which in cameras could be tens of
megabytes in size.
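To put rough numbers on the per-line cost, here is a small back-of-envelope
sketch in Python. The 64-byte line size comes from the text above; the
function names, the 512 KiB cache size used in the example, and the cost
model (a full flush costs roughly one operation per line in the cache) are
assumptions made only for illustration.

```python
CACHE_LINE_BYTES = 64  # cache line size quoted in the text


def cache_line_ops(buffer_bytes: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Number of per-line clean/invalidate operations needed to cover a buffer."""
    return -(-buffer_bytes // line_bytes)  # ceiling division


def range_op_is_cheaper(buffer_bytes: int, cache_bytes: int,
                        line_bytes: int = CACHE_LINE_BYTES) -> bool:
    """Assumed model: a full flush costs about one operation per cache line."""
    return (cache_line_ops(buffer_bytes, line_bytes)
            < cache_line_ops(cache_bytes, line_bytes))
```

Under these assumptions a 16 MiB camera buffer needs 262144 per-line
operations, while a hypothetical 512 KiB cache holds only 8192 lines, so
the ranged operation loses; for a single 4 KiB page (64 lines) it wins.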
Mapping buffers to application memory is not always necessary --- the
buffers may only be used by the devices, in which case a scatterlist
of the pages in the buffer is necessary to map the buffer to the IOMMU.
More (impartial :-)) information can be found here:
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
<URL:http://summit.ubuntu.com/uds-o/meeting/linaro-graphics-memory-managemen…>
Regards,
--
Sakari Ailus
sakari.ailus(a)maxwell.research.nokia.com
Hi all,
During the Budapest meetings it was mentioned that you can pass a fd between
processes. How does that work? Does someone have a code example or a link to
code that does that? Just to satisfy my curiosity.
Regards,
Hans
Thanks Jesse for initiating the mailing list.
We need to address the requirements of Graphics and Multimedia Accelerators
(IPs).
What we really need is a permanent solution (at upstream) which accommodates
the following requirements and conforms to Graphics and Multimedia use
cases.
1. Mechanism to map/unmap the memory. Some of the IPs have the ability to
address virtual memory and some can address only physically contiguous
address space. We need to address both these cases.
2. Mechanism to allocate and release memory.
3. Method to share the memory (ZERO copy is a MUST for better performance)
between different device drivers (for example, output of camera to multimedia
encoder).
4. Method to share the memory with different processes in userspace. The
sharing mechanism should include built-in security features.
Are there any special requirements from V4L or DRM perspectives?
Thanks,
Sree
(Disclaimer: I come from a graphics background, so sorry if I use graphicsy
terminology; please let me know if any of this isn't clear. I tried.)
There is a wide range of hardware capabilities that require different
programming approaches in order to perform optimally. We need to define an
interface that is flexible enough to handle each of them, or else it won't be
used and we'll be right back where we are today: with vendors rolling their own
support for the things they need.
I'm going to try to enumerate some of the more unique usage patterns as
I see them here.
- Many or all engines may sit behind asynchronous command stream interfaces.
Programming is done through "batch buffers"; a set of commands operating on a
set of in-memory buffers is prepared and then submitted to the kernel to be
queued. The kernel will first make sure all of the buffers are resident
(which may require paging or mapping into an IOMMU/GART, a.k.a. "pinning"),
then queue the batch of commands. The hardware will process the commands at
its earliest convenience, and then interrupt the CPU to notify it that it's
done with the buffers (i.e. it can now be "unpinned").
Those familiar with graphics may recognize this programming model as a
classic GPU command stream. But it doesn't need to be used exclusively with
GPUs; any number of devices may have such an on-demand paging mechanism.
- In contrast, some engines may also stream to or from memory continuously
(e.g., video capture or scanout); such buffers need to be pinned for an
extended period of time, not tied to the command streams described above.
- There can be multiple different command streams working at the same time on
the same buffers. (There may be hardware synchronization primitives between
the multiple command streams so the CPU doesn't have to babysit too much, for
both performance and power reasons.)
- In some systems, IOMMU/GART may be much smaller than physical memory; older
GPUs and SoCs have this. To support these, we need to be able to map and
unmap pages into the IOMMU on demand in our host command stream flow. This
model also requires patching up pending batch buffers before queueing them to
the hardware, to update them to point to the newly-mapped location in the
IOMMU.
- In other systems, IOMMU/GART may be much larger than physical memory; more
modern GPUs and SoCs have this. With these, we can reserve virtual (IOMMU)
address space for each buffer up front. To userspace, the buffers always
appear "mapped". This is similar in concept to how the CPU virtual space in
userspace sticks around even when the underlying memory is paged out to disk.
In this case, pinning is performed at the same time as the small-IOMMU case
above, but in the normal/fast case, the pages are never paged out of the
IOMMU, and the pin step just increments a refcount to prevent the pages from
being evicted.
It is desirable to keep the same IOMMU address for:
a) implementing features such as
http://www.opengl.org/registry/specs/NV/shader_buffer_load.txt
(OpenGL client applications and shaders manipulate GPU vaddr pointers
directly; a GPU virtual address is assumed to be valid forever).
b) performance: scanning through the command buffers to patch up pointers can
be very expensive.
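The large-IOMMU case above can be sketched as a toy model in Python: the
device address is reserved once, pinning is just a counter, and eviction is
refused while the counter is non-zero. The class and method names are
invented for this illustration and do not correspond to any driver API.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical buffer whose IOMMU address is reserved once, up front."""
    iommu_addr: int
    pin_count: int = 0
    resident: bool = True

    def pin(self) -> int:
        """Called when a batch referencing the buffer is queued."""
        if not self.resident:
            self.resident = True  # slow path: page back in at the same address
        self.pin_count += 1
        return self.iommu_addr

    def unpin(self) -> None:
        """Called when the hardware signals it is done with the buffer."""
        if self.pin_count == 0:
            raise RuntimeError("unbalanced unpin")
        self.pin_count -= 1

    def try_evict(self) -> bool:
        """Eviction succeeds only while nothing has the buffer pinned."""
        if self.pin_count > 0:
            return False
        self.resident = False
        return True
```

Because pin() always returns the same address, queued command buffers never
need their pointers patched, which is the performance point made in (b).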
One other important note: buffer format properties may be necessary to set up
mappings (both CPU and IOMMU mappings). For example, both types of mappings
may need to know tiling properties of the buffer. This may be a property of
the mapping itself (consider it baked into the page table entries), not
necessarily something a different driver or userspace can program later
independently.
Some of the discussion I heard this morning tended towards being overly
simplistic and didn't seem to cover each of these cases well. Hopefully this
will help get everyone on the same page.
Thanks,
Robert
Hi all,
A bit later than what I've hoped for, but here we go [Jesse and Dave,
please correct/clarify/extend where you see fit]:
The core idea of GEM is to identify graphic buffer objects with 32bit ids.
The reason being "X runs out of open fds" (KDE easily reaches a few
thousand).
The core design principle behind GEM is that the kernel is in full control
of the allocation of these buffer objects and is free to move them around
in any way it sees fit. This is to make concurrent rendering by multiple
processes possible while userspace can still assume that it is in sole
possession of the gpu - GEM means "graphics execution manager".
Below some more details on what GEM is and does, what it does
_not_ do and how it relates to other graphic subsystems.
GEM does ...
------------
- lifecycle management. Userspace references are associated with the drm
fd and get reaped on close (in case userspace forgets about them).
- per-device global names to exchange buffers between processes (eg dri2).
These names are again 32bit ids. These global ids do not count as
userspace references and don't prevent a buffer from being reaped.
- it implements very few generic ioctls:
* flink for creating a global name for a buffer object
* open for getting a per-fd handle to a buffer object with a global name
* close for dropping a per-fd handle.
- a little bit of kernel-internal helpers to facilitate mmap (by blending
multiple buffer objects into the single drm device address space) and a
few other things.
That's it, i.e. GEM is very much meant to be as simple as possible.
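To make the handle/name distinction concrete, here is a toy Python model of
the bookkeeping described above. It is not the DRM API: the class, the
create() stand-in for a driver-specific creation ioctl, and the single
shared id counter are simplifications invented for this sketch. What it
does try to mirror is that per-fd handles hold references, global names do
not, and closing the fd reaps whatever handles are left.

```python
import itertools


class ToyGem:
    """Toy model of per-fd handles and global names (not the real DRM API)."""

    def __init__(self):
        self.handle_refs = {}  # buffer -> number of per-fd handles
        self.names = {}        # global name -> buffer (holds no reference)
        self.files = {}        # open file -> {handle: buffer}
        self._ids = itertools.count(1)

    def create(self, file, buffer):
        """Stand-in for a driver-specific creation ioctl; returns a handle."""
        handle = next(self._ids)
        self.files.setdefault(file, {})[handle] = buffer
        self.handle_refs[buffer] = self.handle_refs.get(buffer, 0) + 1
        return handle

    def flink(self, file, handle):
        """Create (or look up) the global name for a buffer object."""
        buffer = self.files[file][handle]
        for name, existing in self.names.items():
            if existing == buffer:
                return name
        name = next(self._ids)
        self.names[name] = buffer
        return name

    def open(self, file, name):
        """Get a per-fd handle for a buffer object with a global name."""
        return self.create(file, self.names[name])

    def close(self, file, handle):
        """Drop a per-fd handle; the last one going away frees the buffer."""
        buffer = self.files[file].pop(handle)
        self.handle_refs[buffer] -= 1
        if self.handle_refs[buffer] == 0:
            del self.handle_refs[buffer]
            self.names = {n: b for n, b in self.names.items() if b != buffer}

    def close_file(self, file):
        """Closing the fd reaps every handle userspace forgot about."""
        for handle in list(self.files.get(file, {})):
            self.close(file, handle)
        self.files.pop(file, None)
```

In this model a buffer shared by name from process A to process B survives
A closing its fd, but disappears (together with its name) once B drops its
handle as well.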
Driver-specific GEM ioctls
--------------------------
The generic GEM stuff is obviously not very useful on its own. So drivers
implement quite a few driver-specific ioctls, like:
- buffer creation. In recent kernels there is some support to create dumb
scanout objects for KMS. But they're only really useful for boot splashes
and unaccelerated dumb KMS drivers. Creating buffers usable for
rendering is only possible with driver specific ioctls.
- command submission. An important part is mapping abstract buffer ids to
actual gpu address (and rewriting batchbuffers with these). In the
future, with support for virtual gpu address spaces this might change.
- tiling management. The kernel needs to know this to correctly
tile/detile buffers when moving them around (e.g. evicting from vram).
- command completion signalling and gpu/cpu synchronization.
There are currently two approaches for implementing a GEM driver:
- roll-your-own, used by drm/i915 (and sometimes getting flak for NIH).
- ttm-based: radeon & nouveau.
GEM does not ...
----------------
This still leaves out a few things that I've seen mentioned as
ideas/requirements here and elsewhere:
- cross-device buffer sharing and namespaces (see below) and
- buffer format handling and mediation between different users (except
tiling as mentioned above). The reason here is that gpus are a mess
and one of the worst parts is format handling. Better keep that out
of the kernel ...
KMS (kernel mode setting)
-------------------------
KMS is essentially just a port of the xrandr api to the kernel as an ioctl
interface:
- crtcs feed (possibly multiple) outputs and get their data from a
framebuffer object. A major part of KMS is also the support for
vsynced-pageflipping of framebuffers.
- Internally there's some support infrastructure to simplify drivers (all
the drm_*_helper.c code).
- framebuffers are created from an opaque driver-specific 32bit id and a
format description. For GEM drivers these ids name GEM objects, but that
need not be: The recently merged qemu kms driver does not implement gem
and has one unique buffer object with id 0.
- as mentioned above there is now a generic ioctl to create an object
suitable as a dumb scanout (plus some support to mmap it).
- currently KMS has no generic support for overlays (there are
driver-specific ioctls in i915 and vmwgfx, though). Jesse Barnes has
posted an RFC to remedy this:
http://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg10415.html
GEM and PRIME
-------------
PRIME is a proof-of-concept implementation from Dave Airlie for sharing
GEM objects between drivers/devices: Buffer sharing is done with a list of
struct page pointers. While being shared, buffers can't be moved anymore.
No further buffer description is passed along in the kernel, format/layout
mediation is to be handled in userspace.
Blog-post describing the initial design for sharing buffers between an
integrated Intel igd and a discrete ATI gpu:
http://airlied.livejournal.com/71734.html
Other code using the same framework to render on an Intel igd and display
the framebuffer on a USB-connected displayport:
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-testing.git;a=shortlo…
GEM/KMS and fbdev
-----------------
There's some minimal support to emulate an fbdev with a gem/kms driver.
Resolution can't be changed and it's unaccelerated. There's been some
muttering once in a while to better integrate this with either a kms
kernel console driver or by routing fbdev resolution changes to kms.
But the main use case is to display a kernel oops, which works. For
everything else there's X (or an EGL client that understands kms).
-Daniel
--
Daniel Vetter
Mail: daniel(a)ffwll.ch
Mobile: +41 (0)79 365 57 48