On Tuesday, September 27, 2011 16:19:56 Daniel Vetter wrote:
Hi Hans,
I'll try to explain a bit; after all, I've been pushing this attachment business quite a bit.
On Tue, Sep 27, 2011 at 03:24:24PM +0200, Hans Verkuil wrote:
OK, it is not clear to me what the purpose is of the attachments.
If I understand the discussion from this list correctly, then the idea is that each device that wants to use this buffer first attaches itself by calling dma_buf_attach().
Then at some point the application asks some driver to export the buffer. So the driver calls dma_buf_export() and passes its own dma_buf_ops. In other words, this becomes the driver that 'controls' the memory, right?
Actually, the ordering is the other way round. First, some driver calls dma_buf_export; userspace then passes the fd around to all other drivers, which do an import and call attach. While all this happens, the driver that exported the dma_buf does not (yet) allocate any backing storage.
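Roughly: the exporting driver calls dma_buf_export() and hands the resulting fd to userspace; every importing driver then does something like the sketch below. This is only a sketch of the intended flow, not the finalised API - struct my_importer and the exact get_scatterlist signature are made up for illustration.

/* Importer-side sketch (signatures assumed from this thread, not a
 * finalised API): resolve the fd, attach the device, and only ask for
 * a scatterlist when the memory is actually needed. */
static int my_import_buffer(struct my_importer *imp, int fd)
{
	struct dma_buf *buf;
	struct dma_buf_attachment *att;
	struct sg_table *sgt;

	buf = dma_buf_get(fd);               /* fd received from userspace */
	if (IS_ERR(buf))
		return PTR_ERR(buf);

	att = dma_buf_attach(buf, imp->dev); /* no backing storage allocated yet */
	if (IS_ERR(att))
		return PTR_ERR(att);

	/* ... much later, right before DMA actually starts ... */
	sgt = buf->ops->get_scatterlist(att, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return PTR_ERR(sgt);

	imp->buf = buf;
	imp->att = att;
	imp->sgt = sgt;
	return 0;
}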
Ah. This really should be documented better :-)
Another driver that receives the fd will call dma_buf_get() and can then call e.g. get_scatterlist from dma_buf->ops. (As an aside: I would make inline functions that take a dma_buf pointer and call the corresponding op, rather than requiring drivers to go through ops directly)
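Something along these lines is what I mean; the op signature (attachment plus DMA direction, returning an sg_table) is just my guess from this patch:

/* Hypothetical inline wrappers; the op signature is an assumption. */
static inline struct sg_table *
dma_buf_get_scatterlist(struct dma_buf *dmabuf,
			struct dma_buf_attachment *attach,
			enum dma_data_direction dir)
{
	return dmabuf->ops->get_scatterlist(attach, dir);
}

static inline void
dma_buf_put_scatterlist(struct dma_buf *dmabuf,
			struct dma_buf_attachment *attach,
			struct sg_table *sgt)
{
	dmabuf->ops->put_scatterlist(attach, sgt);
}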
Well, drivers should only call get_scatterlist when they actually need to access the memory. This way the originating driver can go through the list of all attached devices and decide where to allocate backing storage on the first get_scatterlist.
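As an exporter-side sketch of what I mean - the my_* helpers, the lock, dmabuf->priv and the attachments list are all made up for illustration, not part of the patch:

/* Exporter-side sketch of deferred allocation: nothing is allocated at
 * export time; the first get_scatterlist walks the attachments and only
 * then allocates backing storage that suits every attached device. */
static struct sg_table *my_get_scatterlist(struct dma_buf_attachment *attach,
					   enum dma_data_direction dir)
{
	struct my_buffer *buf = attach->dmabuf->priv;
	struct dma_buf_attachment *a;

	mutex_lock(&buf->lock);
	if (!buf->backing_storage) {
		/* first user: look at every attached device and pick a
		 * placement that works for all of them */
		list_for_each_entry(a, &attach->dmabuf->attachments, node)
			my_account_constraints(buf, a->dev);
		buf->backing_storage = my_alloc_backing(buf);
	}
	mutex_unlock(&buf->lock);

	return my_map_for_device(buf, attach->dev, dir);
}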
Unless I am mistaken, there is currently nothing in the attachment data structure to help determine the best device to use for the scatterlist. Right? It's just a list of devices together with an opaque pointer.
Another thing I don't get (sorry, I must be really dense today) is that the get_scatterlist op is set by the dma_buf_export call. I would expect that op to be part of the attachment data, since I would expect it to be device-specific.
At least, my assumption is that the actual memory is allocated by the first call to get_scatterlist?
Actually, I think this might be key: who allocates the actual memory and when? There is no documentation whatsoever on this crucial topic.
Before attempting to post this to a wider audience you really need to have proper documentation for this API and an example implementation.
But what I miss in this picture is the role of dma_buf_attachment. I'm passing it to get_scatterlist, but which attachment is that? That of the calling driver? And what is the get_scatterlist implementation supposed to do with it?
See above, essentially an attachment is just list bookkeeping for all the devices that take part in a buffer sharing.
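I.e. an attachment is essentially nothing more than this (the field names here are my assumption, but it really is just the device, the buffer, the list linkage and an opaque pointer for the exporter):

struct dma_buf_attachment {
	struct dma_buf *dmabuf;   /* the shared buffer */
	struct device *dev;       /* device taking part in the sharing */
	struct list_head node;    /* entry in the buffer's attachment list */
	void *priv;               /* opaque pointer for the exporter */
};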
I also read some discussion about what is supposed to happen if another device is attached after get_scatterlist was already called. Apparently the idea was that the old scatterlist is somehow migrated to a new one if that should be necessary? Although I got the impression that that involved a lot of hand-waving with a pinch of wishful thinking. But I may be wrong about that.
Now this is where it's getting "interesting". If all drivers guard their usage with get_scatterlist/put_scatterlist, and we add a new driver, and all drivers that currently hold onto a mapping are known to release it with put_scatterlist in finite time, we can do Cool Stuff (tm). First, that scenario actually happens for e.g. a video pipe, where we cycle through buffers.
Now when adding a new device with stricter backing storage constraints, the originator can just stall in the get_scatterlist call until all outstanding access has completed (signalled by put_scatterlist), move the object around and let things continue merrily. The video pipe might stutter a bit when e.g. switching on the encoder until all buffers have settled into the new place, but it should Just Work (tm).
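A very rough sketch of that stall-and-migrate flow - all names (pin_count, idle_wq, placement_ok and friends) are made up and the locking is elided, so treat this as pseudocode rather than a proposed API:

/* If a newly attached device cannot use the current placement, wait
 * until every outstanding mapping has been released, move the buffer,
 * and only then hand out new scatterlists. */
static struct sg_table *my_get_scatterlist(struct dma_buf_attachment *attach,
					   enum dma_data_direction dir)
{
	struct my_buffer *buf = attach->dmabuf->priv;

	if (!placement_ok(buf, attach->dev)) {
		/* wait for all drivers to call put_scatterlist */
		wait_event(buf->idle_wq, atomic_read(&buf->pin_count) == 0);
		my_move_buffer(buf, my_pick_placement(buf));
	}

	atomic_inc(&buf->pin_count);
	return my_map_for_device(buf, attach->dev, dir);
}

static void my_put_scatterlist(struct dma_buf_attachment *attach,
			       struct sg_table *sgt)
{
	struct my_buffer *buf = attach->dmabuf->priv;

	my_unmap_for_device(buf, attach->dev, sgt);
	if (atomic_dec_and_test(&buf->pin_count))
		wake_up(&buf->idle_wq);
}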
Sounds great. Of course, how to determine 'stricter backing storage constraints' is currently just a lot of hand-waving :-)
Anyway, I guess my main point is that this patch does not explain the role of the attachments and how they should be used (and who uses them).
I agree.
One other thing: once you call REQBUFS on a V4L device the V4L spec says that the memory should be allocated at that time. Because V4L often needs a lot of memory that behavior makes sense: you know immediately if you can get the memory or not. In addition, that memory is mmap-ed before the DMA is started.
If that is actually a fixed requirement for v4l, that's a good reason for mmap support on the dma_buf object. We could hide all the complexity of shooting down userspace mappings on buffer moves from the drivers. Can you elaborate a bit on this?
There's not much to elaborate on. Calling VIDIOC_REQBUFS is supposed to allocate all buffers and pin them in memory, thus ensuring that all is ready for DMA.
Note that drivers based around the videobuf framework actually postpone the allocation until the first use. This violates the spec and is fixed in the videobuf2 framework.
As Rob suggested, this is a requirement that can probably be relaxed for new memory types (such as DMA_BUF).
While not a requirement, it is common practice that applications mmap the buffers immediately after the call to REQBUFS. For dma_buf it is very likely that such mmap operations would go through the dma_buf fd. Of course, if you just hand over the buffer to another device, then there is no need for userspace to call mmap.
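For reference, the usual application sequence looks roughly like this (error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

void *map_first_buffer(int fd)
{
	struct v4l2_requestbuffers req;
	struct v4l2_buffer buf;

	memset(&req, 0, sizeof(req));
	req.count  = 4;
	req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	req.memory = V4L2_MEMORY_MMAP;
	ioctl(fd, VIDIOC_REQBUFS, &req);   /* driver allocates the buffers here */

	memset(&buf, 0, sizeof(buf));
	buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	buf.memory = V4L2_MEMORY_MMAP;
	buf.index  = 0;
	ioctl(fd, VIDIOC_QUERYBUF, &buf);  /* get offset/length for mmap */

	return mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, buf.m.offset);
}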
This behavior may pose a problem if the idea is to wait with actually allocating memory until the pipeline is started.
I think you're looking at v4lv3 ;-)
More seriously, all modern Linux APIs for pushing frames out use one of two modes:
- gimme the next frame to draw into (dri2)
- here's the next frame I've drawn into (wayland)
To make that fast, we obviously need to recycle buffers. But from a semantic point of view, you only ever have one buffer, namely the current one. All the other N buffers to make the graphics pipeline not stutter are transparently in-flight somewhere.
Imo such a dynamic scheme has a few advantages:
- there's just no way to know the number of buffers you need up front in any reasonably complex graphics pipeline. As soon as a GPU is in the mix, it's best effort. With a dynamic limit on the in-flight buffers you can cope with latencies until you hit -ENOMEM. With a fixed set you always have to compromise and can't really allocate for the worst case - it would hinder stuff running in the background.
Currently in V4L2 the number of buffers is fixed after calling REQBUFS. That is, you can't add more buffers later. However, work is being done to lift this limitation (it's almost finished, I expect to see this in 3.2 or 3.3).
- in the usual case you need far fewer buffers than in the worst case to make any given pipeline run stutter-free. No point wasting that memory.
Now I have no idea how you could shoe-horn that onto the current v4l interfaces.
Not in the current API, but we hope to have the required flexibility quite soon. Certainly in time for dma_buf.
Hmm, I'm rambling a bit, but I hope the gist of my mail is clear.
It's clear and I think you're raising good points.
Good :-)
Hans