On Tuesday, September 27, 2011 16:19:56 Daniel Vetter wrote:
Hi Hans,
I'll try to explain a bit; after all, I've been pushing this attachment business quite a bit.
On Tue, Sep 27, 2011 at 03:24:24PM +0200, Hans Verkuil wrote:
OK, it is not clear to me what the purpose is of the attachments.
If I understand the discussion from this list correctly, then the idea is that each device that wants to use this buffer first attaches itself by calling dma_buf_attach().
Then at some point the application asks some driver to export the buffer. So the driver calls dma_buf_export() and passes its own dma_buf_ops. In other words, this becomes the driver that 'controls' the memory, right?
Actually, the ordering is the other way round. First, some driver calls dma_buf_export; userspace then passes the fd around to all other drivers, which do an import and call attach. While all this happens, the driver that exported the dma_buf does not (yet) allocate any backing storage.
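Roughly: the exporting driver calls dma_buf_export() and hands the resulting fd to userspace; every importing driver then does something like the sketch below. This is only a sketch of the intended flow, not the finalised API - struct my_importer and the exact get_scatterlist signature are made up for illustration.

/* Importer-side sketch (signatures assumed from this thread, not a
 * finalised API): resolve the fd, attach the device, and only ask for
 * a scatterlist when the memory is actually needed. */
static int my_import_buffer(struct my_importer *imp, int fd)
{
	struct dma_buf *buf;
	struct dma_buf_attachment *att;
	struct sg_table *sgt;

	buf = dma_buf_get(fd);               /* fd received from userspace */
	if (IS_ERR(buf))
		return PTR_ERR(buf);

	att = dma_buf_attach(buf, imp->dev); /* no backing storage allocated yet */
	if (IS_ERR(att))
		return PTR_ERR(att);

	/* ... much later, right before DMA actually starts ... */
	sgt = buf->ops->get_scatterlist(att, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return PTR_ERR(sgt);

	imp->buf = buf;
	imp->att = att;
	imp->sgt = sgt;
	return 0;
}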
Ah. This really should be documented better :-)
Another driver that receives the fd will call dma_buf_get() and can then call e.g. get_scatterlist from dma_buf->ops. (As an aside: I would make inline functions that take a dma_buf pointer and call the corresponding op, rather than requiring drivers to go through ops directly)
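Something along these lines is what I mean; the op signature (attachment plus DMA direction, returning an sg_table) is just my guess from this patch:

/* Hypothetical inline wrappers; the op signature is an assumption. */
static inline struct sg_table *
dma_buf_get_scatterlist(struct dma_buf *dmabuf,
			struct dma_buf_attachment *attach,
			enum dma_data_direction dir)
{
	return dmabuf->ops->get_scatterlist(attach, dir);
}

static inline void
dma_buf_put_scatterlist(struct dma_buf *dmabuf,
			struct dma_buf_attachment *attach,
			struct sg_table *sgt)
{
	dmabuf->ops->put_scatterlist(attach, sgt);
}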
Well, drivers should only call get_scatterlist when they actually need to access the memory. This way the originating driver can go through the list of all attached devices and decide where to allocate backing storage on the first get_scatterlist.
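As an exporter-side sketch of what I mean - the my_* helpers, the lock, dmabuf->priv and the attachments list are all made up for illustration, not part of the patch:

/* Exporter-side sketch of deferred allocation: nothing is allocated at
 * export time; the first get_scatterlist walks the attachments and only
 * then allocates backing storage that suits every attached device. */
static struct sg_table *my_get_scatterlist(struct dma_buf_attachment *attach,
					   enum dma_data_direction dir)
{
	struct my_buffer *buf = attach->dmabuf->priv;
	struct dma_buf_attachment *a;

	mutex_lock(&buf->lock);
	if (!buf->backing_storage) {
		/* first user: look at every attached device and pick a
		 * placement that works for all of them */
		list_for_each_entry(a, &attach->dmabuf->attachments, node)
			my_account_constraints(buf, a->dev);
		buf->backing_storage = my_alloc_backing(buf);
	}
	mutex_unlock(&buf->lock);

	return my_map_for_device(buf, attach->dev, dir);
}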
Unless I am mistaken, there is currently nothing in the attachment data structure to help determine the best device to use for the scatterlist. Right? It's just a list of devices together with an opaque pointer.
Another thing I don't get (sorry, I must be really dense today) is that the get_scatterlist op is set by the dma_buf_export call. I would expect that op to be part of the attachment data, since I would expect it to be device-specific.
At least, my assumption is that the actual memory is allocated by the first call to get_scatterlist?
Actually, I think this might be key: who allocates the actual memory and when? There is no documentation whatsoever on this crucial topic.
Before attempting to post this to a wider audience you really need to have proper documentation for this API and an example implementation.
But what I miss in this picture is the role of dma_buf_attachment. I'm passing it to get_scatterlist, but which attachment is that? That of the calling driver? And what is the get_scatterlist implementation supposed to do with it?
See above, essentially an attachment is just list bookkeeping for all the devices that take part in a buffer sharing.
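I.e. an attachment is essentially nothing more than this (the field names here are my assumption, but it really is just the device, the buffer, the list linkage and an opaque pointer for the exporter):

struct dma_buf_attachment {
	struct dma_buf *dmabuf;   /* the shared buffer */
	struct device *dev;       /* device taking part in the sharing */
	struct list_head node;    /* entry in the buffer's attachment list */
	void *priv;               /* opaque pointer for the exporter */
};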
I also read some discussion about what is supposed to happen if another device is attached after get_scatterlist was already called. Apparently the idea was that the old scatterlist is somehow migrated to a new one if that should be necessary? Although I got the impression that that involved a lot of hand-waving with a pinch of wishful thinking. But I may be wrong about that.
Now this is where it's getting "interesting". If all drivers guard their usage with get_scatterlist/put_scatterlist, and we add a new driver, and all drivers that currently hold onto a mapping are known to release it with put_scatterlist in finite time, we can do Cool Stuff (tm). First, that scenario actually happens for e.g. a video pipe, where we cycle through buffers.
Now when adding a new device with stricter backing storage constraints, the originator can just stall in the get_scatterlist call until all outstanding access has completed (signalled by put_scatterlist), move the object around and let things continue merrily. The video pipe might stutter a bit when e.g. switching on the encoder until all buffers have settled into the new place, but it should Just Work (tm).
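A very rough sketch of that stall-and-migrate flow - all names (pin_count, idle_wq, placement_ok and friends) are made up and the locking is elided, so treat this as pseudocode rather than a proposed API:

/* If a newly attached device cannot use the current placement, wait
 * until every outstanding mapping has been released, move the buffer,
 * and only then hand out new scatterlists. */
static struct sg_table *my_get_scatterlist(struct dma_buf_attachment *attach,
					   enum dma_data_direction dir)
{
	struct my_buffer *buf = attach->dmabuf->priv;

	if (!placement_ok(buf, attach->dev)) {
		/* wait for all drivers to call put_scatterlist */
		wait_event(buf->idle_wq, atomic_read(&buf->pin_count) == 0);
		my_move_buffer(buf, my_pick_placement(buf));
	}

	atomic_inc(&buf->pin_count);
	return my_map_for_device(buf, attach->dev, dir);
}

static void my_put_scatterlist(struct dma_buf_attachment *attach,
			       struct sg_table *sgt)
{
	struct my_buffer *buf = attach->dmabuf->priv;

	my_unmap_for_device(buf, attach->dev, sgt);
	if (atomic_dec_and_test(&buf->pin_count))
		wake_up(&buf->idle_wq);
}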
Sounds great. Of course, how to determine 'stricter backing storage constraints' is currently just a lot of hand-waving :-)
Anyway, I guess my main point is that this patch does not explain the role of the attachments and how they should be used (and who uses them).
I agree.
One other thing: once you call REQBUFS on a V4L device the V4L spec says that the memory should be allocated at that time. Because V4L often needs a lot of memory that behavior makes sense: you know immediately if you can get the memory or not. In addition, that memory is mmap-ed before the DMA is started.
If that is actually a fixed requirement for v4l, that's a good reason for mmap support on the dma_buf object. We could hide all the complexity of shooting down userspace mappings on buffer moves from the drivers. Can you elaborate a bit on this?
There's not much to elaborate on. Calling VIDIOC_REQBUFS is supposed to allocate all buffers and pin them in memory, thus ensuring that all is ready for DMA.
Note that drivers based around the videobuf framework actually postpone the allocation until the first use. This violates the spec and is fixed in the videobuf2 framework.
As Rob suggested, this is a requirement that can probably be relaxed for new memory types (such as DMA_BUF).
While not a requirement, it is common practice that applications mmap the buffers immediately after the call to REQBUFS. For dma_buf it is very likely that such mmap operations would go through the dma_buf fd. Of course, if you just hand over the buffer to another device, then there is no need for userspace to call mmap.
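For reference, the usual application sequence looks roughly like this (error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

void *map_first_buffer(int fd)
{
	struct v4l2_requestbuffers req;
	struct v4l2_buffer buf;

	memset(&req, 0, sizeof(req));
	req.count  = 4;
	req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	req.memory = V4L2_MEMORY_MMAP;
	ioctl(fd, VIDIOC_REQBUFS, &req);   /* driver allocates the buffers here */

	memset(&buf, 0, sizeof(buf));
	buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	buf.memory = V4L2_MEMORY_MMAP;
	buf.index  = 0;
	ioctl(fd, VIDIOC_QUERYBUF, &buf);  /* get offset/length for mmap */

	return mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, buf.m.offset);
}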
This behavior may pose a problem if the idea is to wait with actually allocating memory until the pipeline is started.
I think you're looking at v4lv3 ;-)
More seriously, all modern Linux APIs for pushing frames out use one of two modes:
- gimme the next frame to draw into (dri2)
- here's the next frame I've drawn into (wayland)
To make that fast, we obviously need to recycle buffers. But from a semantic point of view, you only ever have one buffer, namely the current one. All the other N buffers to make the graphics pipeline not stutter are transparently in-flight somewhere.
Imo such a dynamic scheme has a few advantages:
- there's just no way to know the number of buffers you need up front in any reasonably complex graphics pipeline. As soon as a GPU is in the mix, it's best effort. With a dynamic limit on the in-flight buffers you can cope with latencies until you hit -ENOMEM. With a fixed set you always have to compromise and can't really allocate for the worst case - it would hinder stuff running in the background.
Currently in V4L2 the number of buffers is fixed after calling REQBUFS. That is, you can't add more buffers later. However, work is being done to lift this limitation (it's almost finished, I expect to see this in 3.2 or 3.3).
- in the usual case you need far fewer buffers than in the worst case to make any given pipeline run stutter-free. No point wasting that memory.
Now I have no idea how you could shoe-horn that onto the current v4l interfaces.
Not in the current API, but we hope to have the required flexibility quite soon. Certainly in time for dma_buf.
Hmm, I'm rambling a bit, but I hope the gist of my mail is clear.
It's clear and I think you're raising good points.
Good :-)
Hans