Each dma-buf has an associated size and it's reasonable for userspace
to want to know what it is.
Since userspace already has an fd, expose the size using the
size = lseek(fd, SEEK_END, 0); lseek(fd, SEEK_CUR, 0);
idiom.
Signed-off-by: Christopher James Halse Rogers <christopher.halse.rogers(a)canonical.com>
---
I've run into a point in the radeon DRM userspace where I need the
size of a dma-buf. I could add a radeon-specific mechanism to get that,
but this seems like something that would be more generally useful.
I'm not entirely sure about supporting both SEEK_END and SEEK_CUR; this
is somewhat of an abuse of lseek, as seeking obviously doesn't make sense.
It's the obivous idiom for getting the size of what's on the other end of a
file descriptor, though.
I didn't notice anywhere to document this; Documentation/dma-buf-api didn't
seem like the right place. Is there somewhere I've overlooked?
drivers/base/dma-buf.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 6687ba7..c33a857 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -77,9 +77,36 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
return dmabuf->ops->mmap(dmabuf, vma);
}
+static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
+{
+ struct dma_buf *dmabuf;
+ loff_t base;
+
+ if (!is_dma_buf_file(file))
+ return -EBADF;
+
+ dmabuf = file->private_data;
+
+ /* only support discovering the end of the buffer,
+ but also allow SEEK_SET to maintain the idiomatic
+ SEEK_END(0), SEEK_CUR(0) pattern */
+ if (whence == SEEK_END)
+ base = dmabuf->size;
+ else if (whence == SEEK_SET)
+ base = 0;
+ else
+ return -EINVAL;
+
+ if (offset != 0)
+ return -EINVAL;
+
+ return base + offset;
+}
+
static const struct file_operations dma_buf_fops = {
.release = dma_buf_release,
.mmap = dma_buf_mmap_internal,
+ .llseek = dma_buf_llseek,
};
/*
@@ -133,6 +160,7 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops,
dmabuf->exp_name = exp_name;
file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf, flags);
+ file->f_mode |= FMODE_LSEEK;
dmabuf->file = file;
--
1.8.3.2
On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > Turning to DRM/KMS, it seems the supported formats of a plane can be
>> > queried using drm_mode_get_plane. However, there doesn't seem to be a
>> > way to query the supported formats of a crtc? If display HW only
>> > supports scanning out from a single buffer (like pl111 does), I think
>> > it won't have any planes and a fb can only be set on the crtc. In
>> > which case, how should user-space query which pixel formats that crtc
>> > supports?
>>
>> it is exposed for drm plane's. What is missing is to expose the
>> primary-plane associated with the crtc.
>
> Cool - so a patch which adds a way to query the what formats a crtc
> supports would be welcome?
well, I kinda think we want something that exposes the "primary plane"
of the crtc.. I'm thinking something roughly like:
---------
diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
index 53db7ce..c7ffca8 100644
--- a/include/uapi/drm/drm_mode.h
+++ b/include/uapi/drm/drm_mode.h
@@ -157,6 +157,12 @@ struct drm_mode_get_plane {
struct drm_mode_get_plane_res {
__u64 plane_id_ptr;
__u32 count_planes;
+ /* The primary planes are in matching order to crtc_id_ptr in
+ * drm_mode_card_res (and same length). For crtc_id[n], it's
+ * primary plane is given by primary_plane_id[n].
+ */
+ __u32 count_primary_planes;
+ __u64 primary_plane_id_ptr;
};
#define DRM_MODE_ENCODER_NONE 0
---------
then use the existing GETPLANE ioctl to query the capabilities
> What about a way to query the stride alignment constraints?
>
> Presumably using the drm_mode_get_property mechanism would be the
> right way to implement that?
>
I suppose you could try that.. typically that is in userspace,
however. It seems like get_property would get messy quickly (ie. is
it a pitch alignment constraint, or stride alignment? What if this is
different for different formats (in particular tiled)? etc)
>
>> > As with v4l2, DRM doesn't appear to have a way to query the stride
>> > constraints? Assuming there is a way to query the stride constraints,
>> > there also isn't a way to specify them when creating a buffer with
>> > DRM, though perhaps the existing pitch parameter of
>> > drm_mode_create_dumb could be used to allow user-space to pass in a
>> > minimum stride as well as receive the allocated stride?
>> >
>>
>> well, you really shouldn't be using create_dumb.. you should have a
>> userspace piece that is specific to the drm driver, and knows how to
>> use that driver's gem allocate ioctl.
>
> Sorry, why does this need a driver-specific allocation function? It's
> just a display controller driver and I just want to allocate a scan-
> out buffer - all I'm asking is for the display controller driver to
> use a minimum stride alignment so I can export the buffer and use
> another device to fill it with data.
Sure.. but userspace has more information readily available to make a
better choice. For example, for omapdrm I'd do things differently
depending on whether I wanted to scan out that buffer (or a portion of
it) rotated. This is something I know in the DDX driver, but not in
the kernel. And it is quite likely something that is driver specific.
Sure, we could add that to a generic "allocate me a buffer" ioctl.
But that doesn't really seem better, and it becomes a problem as soon
as we come across some hw that needs to know something different. In
userspace, you have a lot more flexibility, since you don't really
need to commit to an API for life.
And to bring back the GStreamer argument (since that seems a fitting
example when you start talking about sharing buffers between many
devices, for example camera+codec+display), it would already be
negotiating format between v4l2src + fooencoder + displaysink.. the
pitch/stride is part of that format information. If the display isn't
the one with the strictest requirements, we don't want the display
driver deciding what pitch to use.
> The whole point is to be able to allocate the buffer in such a way
> that another device can access it. So the driver _can't_ use a
> special, device specific format, nor can it allocate it from a
> private memory pool because doing so would preclude it from being
> shared with another device.
>
> That other device doesn't need to be a GPU wither, it could just as
> easily be a camera/ISP or video decoder.
>
>
>
>> >> > So presumably you're talking about a GPU driver being the exporter
>> >> > here? If so, how could the GPU driver do these kind of tricks on
>> >> > memory shared with another device?
>> >>
>> >> Yes, that is gpu-as-exporter. If someone else is allocating
>> >> buffers, it is up to them to do these tricks or not. Probably
>> >> there is a pretty good chance that if you aren't a GPU you don't
>> >> need those sort of tricks for fast allocation of transient upload
>> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't
>> >> really think a v4l camera or video decoder would benefit from that
>> >> sort of optimization.
>> >
>> > Right - but none of those are really buffers you'd want to export
>>
>> > with dma_buf to share with another device are they? In which case,
>> > why not just have dma_buf figure out the constraints and allocate
>> > the memory?
>>
>> maybe not.. but (a) you don't necessarily know at creation time if it
>> is going to be exported (maybe you know if it is definitely not going
>> to be exported, but the converse is not true),
>
> I can't actually think of an example where you would not know if a
> buffer was going to be exported or not at allocation time? Do you have
> a case in mind?
yeah, dri2.. when the front buffer is allocated it is just a regular
pixmap. If you swap/flip it becomes the back buffer and now shared
;-)
And pixmaps are allocated w/ enough frequency that it is the sort of
thing you might want to optimize.
And even when you know it will be shared, you don't know with who.
> Regardless, you'd certainly have to know if a buffer will be exported
> pretty quickly, before it's used so that you can import it into
> whatever devices are going to access it. Otherwise if it gets
> allocated before you export it, the allocation won't satisfy the
> constraints of the other devices which will need to access it and
> importing will fail. Assuming of course deferred allocation of the
> backing pages as discussed earlier in the thread.
>
>
>
>> and (b) there isn't
>> really any reason to special case the allocation in the driver because
>> it is going to be exported.
>
> Not sure I follow you here? Surely you absolutely have to special-case
> the allocation if the buffer is to be exported because you have to
> take the other devices' constraints into account when you allocate? Or
> do you mean you don't need to special-case the GEM buffer object
> creation, only the allocation of the backing pages? Though I'm not
> sure how that distinction is useful - at the end of the day, you need
> to special-case allocation of the backing pages.
>
well, you need to consider separately what is (a) in the pages, and
(b) where the pages come from.
By moving the allocation into dmabuf you restrict (b). For sharing
buffers, (a) may be restricted, but there is at least some examples of
hardware where (b) would not otherwise be restricted by sharing.
>
>> helpers that can be used by simple drivers, yes. Forcing the way the
>> buffer is allocated, for sure not. Currently, for example, there is
>> no issue to export a buffer allocated from stolen-mem.
>
> Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk
> of memory reserved at boot for the GPU which the OS can't touch? I
> guess I view such memory as accessible to all media devices on the
> system and as such, needs to be managed by a central allocator which
> dma_buf can use to allocate from.
think carve-out created by bios. In all the cases I am aware of, the
drm driver handles allocation of buffer(s) from the carveout.
> I guess if that stolen-mem is managed by a single device then in
> essence that device becomes the central allocator you have to use to
> be able to allocate from that stolen mem?
>
>
>> > If a driver needs to allocate memory in a special way for a
>> > particular device, I can't really imagine how it would be able
>> > to share that buffer with another device using dma_buf? I guess
>> > a driver is likely to need some magic voodoo to configure access
>> > to the buffer for its device, but surely that would be done by
>> > the dma_mapping framework when dma_buf_map happens?
>> >
>>
>> if, what it has to configure actually manages to fit in the
>> dma-mapping framework
>
> But if it doesn't, surely that's an issue which needs to be addressed
> in the dma_mapping framework or else you won't be able to import
> buffers for use by that device anyway?
>
I'm not sure if we have to fit everything in dma-mapping framework, at
least in cases where you have something that is specific to one
platform.
Currently dma-buf provides enough flexibility for other drivers to be
able to import these buffers.
>
>> anyways, where the pages come from has nothing to do with whether a
>> buffer can be shared or not
>
> Sure, but where they are located in physical memory really does
> matter.
>
s/does/can/
it doesn't always matter. And in cases where it does matter, as long
as we can express the restrictions in dma_parms (which we can already
for the case of range-of-memory restrictions) we are covered
>
>> >> At any rate, for both xorg and wayland/gbm, you know when a buffer
>> >> is going to be a scanout buffer. What I'd recommend is define a
>> >> small userspace API that your customers (the SoC vendors) implement
>> >> to allocate a scanout buffer and hand you back a dmabuf fd. That
>> >> could be used both for x11 and for gbm. Inputs should be requested
>> >> width/height and format. And outputs pitch plus dmabuf fd.
>> >>
>> >> (Actually you might even just want to use gbm as your starting
>> >> point. You could probably just use gbm from xf86-video-armsoc for
>> >> allocation, to have one thing that works for both wayland and x11.
>> >> Scanout and cursor buffers should go to vendor/SoC specific fxn,
>> >> rest can be allocated from mali kernel driver.)
>> >
>> > What does that buy us over just using drm_mode_create_dumb on the
>> > display's DRM driver?
>>
>> well, for example, if there was actually some hw w/ omap's dss + mali,
>> you could actually have mali render transparently to tiled buffers
>> which could be scanned out rotated. Which would not be possible w/
>> dumb buffers.
>
> Why not? As you said earlier, the format is defined when you setup the
> fb with drm_mode_fb_cmd2. If you wanted to share the buffer between
> devices, you have to be explicit about what format that buffer is in,
> so you'd have to add an entry to drm_fourcc.h for the tiled format.
no, that doesn't really work
in this case, the format to any device (or userspace) accessing the
buffer is not tiled. (Ie. it would look like normal NV12 or
whatever). But there are some different requirements on stride. And
there are cases where you would prefer not to use tiled buffers, but
the kernel doesn't know enough in the dumb-buffer alloc ioctl to make
the correct decision.
> So userspace queries what formats the GPU DRM supports and what
> formats the OMAP DSS DRM supports, selects the tiled format and then
> uses drm_mode_create_dumb to allocate a buffer of the correct size and
> sets the appropriate drm_fourcc.h enum value when creating an fb for
> that buffer. Or have I missed something?
>
>
>
>> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> >> (where phys contig is not required for scanout), but causes CMA
>> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> >> It just knows that it wants to be able to scanout that particular
>> >> >> buffer.
>> >> >
>> >> > I think that's the idea? The omap3's allocator driver would use
>> >> > contiguous memory when it detects the SCANOUT flag whereas the
>> >> > omap4 allocator driver wouldn't have to. No complex negotiation
>> >> > of constraints - it just "knows".
>> >> >
>> >>
>> >> well, it is same allocating driver in both cases (although maybe
>> >> that is unimportant). The "it" that just knows it wants to scanout
>> >> is userspace. The "it" that just knows that scanout translates to
>> >> contiguous (or not) is the kernel. Perhaps we are saying the same
>> >> thing ;-)
>> >
>> > Yeah - I think we are... so what's the issue with having a per-SoC
>> > allocation driver again?
>> >
>>
>> In a way the display driver is a per-SoC allocator. But not
>> necessarily the *central* allocator for everything. Ie. no need for
>> display driver to allocate vertex buffers for a separate gpu driver,
>> and that sort of thing.
>
> Again, I'm only talking about allocating buffers which will be shared
> between different devices. At no point have I mentioned the allocation
> of buffers which aren't to be shared between devices. Sorry if that's
> not been clear.
ok, I guess we were talking about slightly different things ;-)
> So for buffers which are to be shared between devices, your suggesting
> that the display driver is the per-SoC allocator? But as I say, and
> how this thread got started, the same display driver can be used on
> different SoCs, so having _it_ be the central allocator isn't ideal.
> Though this is our current solution and why we're "abusing" the dumb
> buffer allocation functions. :-)
>
which is why you want to let userspace figure out the pitch and then
tell the display driver what size it wants, rather than using dumb
buffer ioctl ;-)
Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or
property or what have you.. but I think that would be hard to get
right for all cases, and most people don't really care about that
because they already need a gpu/display specific xorg driver and/or
gl/egl talking to their kernel driver. You are in a slightly special
case, since you are providing GL driver independently of the display
driver. But I think that is easier to handle by just telling your
customers "here, fill out this function(s) to allocate buffer for
scanout" (and, well, I guess you'd need one to query for
pitch/stride), rather than trying to cram everything into the kernel.
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the device sit behind?
>
perhaps yes. But the device is associated w/ the iommu it is attached
to, so this shouldn't be a problem
>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be device-
> specific or opaque, it has to be explicit. I think drm_fourcc.h is
> what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?
it is exposed for drm plane's. What is missing is to expose the
primary-plane associated with the crtc.
> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?
Right, the gem object has no inherent format, it is just some bytes.
The format/width/height/pitch are all attributes of the fb.
> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>
well, you really shouldn't be using create_dumb.. you should have a
userspace piece that is specific to the drm driver, and knows how to
use that driver's gem allocate ioctl.
>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine of all the constraints (assuming that
>> >> > can be done) and allocate pages such that the lowest-common-
>> >> > denominator is satisfied. Perhaps rather than duplicating that
>> >> > logic in every driver, we could instead move allocation of the
>> >> > backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>>
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?
maybe not.. but (a) you don't necessarily know at creation time if it
is going to be exported (maybe you know if it is definitely not going
to be exported, but the converse is not true), and (b) there isn't
really any reason to special case the allocation in the driver because
it is going to be exported.
helpers that can be used by simple drivers, yes. Forcing the way the
buffer is allocated, for sure not. Currently, for example, there is
no issue to export a buffer allocated from stolen-mem. If we put the
page allocation in dma-buf, this would not be possible. That is just
one quick example off the top of my head, I'm sure there are plenty
more. But we definitely do not want the allocate in dma_buf itself.
> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf? I guess a driver is likely
> to need some magic voodoo to configure access to the buffer for its
> device, but surely that would be done by the dma_mapping framework
> when dma_buf_map happens?
>
if, what it has to configure actually manages to fit in the
dma-mapping framework
anyways, where the pages come from has nothing to do with whether a
buffer can be shared or not
>
>
>> >> You probably want to get out of the SoC mindset, otherwise you are
>> >> going to make bad assumptions that come back to bite you later on.
>> >
>> > Sure - there are always going to be PC-like devices where the
>> > hardware configuration isn't fixed like it is on a traditional SoC.
>> > But I'd rather have a simple solution which works on traditional SoCs
>> > than no solution at all. Today our solution is to over-load the dumb
>> > buffer alloc functions of the display's DRM driver - For now I'm just
>> > looking for the next step up from that! ;-)
>>
>> True.. the original intention, which is perhaps a bit desktop-centric,
>> really was for there to be a userspace component talking to the drm
>> driver for allocation, ie. xf86-video-foo and/or
>> src/gallium/drivers/foo (for example) ;-)
>>
>> Which means for x11 having a SoC vendor specific xf86-video-foo for
>> x11.. or vendor specific gbm implementation for wayland. (Although
>> at least in the latter case it is a pretty small piece of code.) But
>> that is probably what you are trying to avoid.
>
> I've been trying to get my head around how PRIME relates to DDX
> drivers. As I understand it (which is likely wrong), you have a laptop
> with both an Intel & an nVidia GPU. You have both the i915 & nouveau
> kernel drivers loaded. What I'm not sure about is which GPU's display
> controller is actually hooked up to the physical connector? Perhaps
> there is a MUX like there is on Versatile Express?
afaiu it can be a, b, or c (ie. either gpu can have the display or
there can be a mux)..
> What I also don't understand is what DDX driver is loaded? Is it
> xf86-video-intel, xf86-video-nouveau or both? I get the impression
> that there's a "master" DDX which implements 2D operations but can
> import buffers using PRIME from the other driver and draw to them.
> Or is it more that it's able to export rendered buffers to the
> "slave" DRM for scanout? Either way, it's pretty similar to an ARM
> SoC setup which has the GPU and the display as two totally
> independent devices.
>
>
>
>> At any rate, for both xorg and wayland/gbm, you know when a buffer is
>> going to be a scanout buffer. What I'd recommend is define a small
>> userspace API that your customers (the SoC vendors) implement to
>> allocate a scanout buffer and hand you back a dmabuf fd. That could
>> be used both for x11 and for gbm. Inputs should be requested
>> width/height and format. And outputs pitch plus dmabuf fd.
>>
>> (Actually you might even just want to use gbm as your starting point.
>> You could probably just use gbm from xf86-video-armsoc for allocation,
>> to have one thing that works for both wayland and x11. Scanout and
>> cursor buffers should go to vendor/SoC specific fxn, rest can be
>> allocated from mali kernel driver.)
>
> What does that buy us over just using drm_mode_create_dumb on the
> display's DRM driver?
>
well, for example, if there was actually some hw w/ omap's dss + mali,
you could actually have mali render transparently to tiled buffers
which could be scanned out rotated. Which would not be possible w/
dumb buffers.
>
>
>> >> > wouldn't need a way to programmatically describe the constraints
>> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would
>> >> > just "know" that on this SoC, that buffer needs to be physically
>> >> > contiguous for example.
>> >>
>> >> not really.. it just knows it wants to scanout the buffer, and tells
>> >> this as a hint to the kernel.
>> >>
>> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> (where phys contig is not required for scanout), but causes CMA
>> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> It just knows that it wants to be able to scanout that particular
>> >> buffer.
>> >
>> > I think that's the idea? The omap3's allocator driver would use
>> > contiguous memory when it detects the SCANOUT flag whereas the omap4
>> > allocator driver wouldn't have to. No complex negotiation of
>> > constraints - it just "knows".
>> >
>>
>> well, it is same allocating driver in both cases (although maybe that
>> is unimportant). The "it" that just knows it wants to scanout is
>> userspace. The "it" that just knows that scanout translates to
>> contiguous (or not) is the kernel. Perhaps we are saying the same
>> thing ;-)
>
> Yeah - I think we are... so what's the issue with having a per-SoC
> allocation driver again?
>
In a way the display driver is a per-SoC allocator. But not
necessarily the *central* allocator for everything. Ie. no need for
display driver to allocate vertex buffers for a separate gpu driver,
and that sort of thing.
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the device sit behind?
>
>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be device-
> specific or opaque, it has to be explicit. I think drm_fourcc.h is
> what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?
>
> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?
>
> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>
>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine of all the constraints (assuming that
>> >> > can be done) and allocate pages such that the lowest-common-
>> >> > denominator is satisfied. Perhaps rather than duplicating that
>> >> > logic in every driver, we could instead move allocation of the
>> >> > backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>>
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?
>
> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf? I guess a driver is likely
> to need some magic voodoo to configure access to the buffer for its
> device, but surely that would be done by the dma_mapping framework
> when dma_buf_map happens?
>
>
>
>> >> You probably want to get out of the SoC mindset, otherwise you are
>> >> going to make bad assumptions that come back to bite you later on.
>> >
>> > Sure - there are always going to be PC-like devices where the
>> > hardware configuration isn't fixed like it is on a traditional SoC.
>> > But I'd rather have a simple solution which works on traditional SoCs
>> > than no solution at all. Today our solution is to over-load the dumb
>> > buffer alloc functions of the display's DRM driver - For now I'm just
>> > looking for the next step up from that! ;-)
>>
>> True.. the original intention, which is perhaps a bit desktop-centric,
>> really was for there to be a userspace component talking to the drm
>> driver for allocation, ie. xf86-video-foo and/or
>> src/gallium/drivers/foo (for example) ;-)
>>
>> Which means for x11 having a SoC vendor specific xf86-video-foo for
>> x11.. or vendor specific gbm implementation for wayland. (Although
>> at least in the latter case it is a pretty small piece of code.) But
>> that is probably what you are trying to avoid.
>
> I've been trying to get my head around how PRIME relates to DDX
> drivers. As I understand it (which is likely wrong), you have a laptop
> with both an Intel & an nVidia GPU. You have both the i915 & nouveau
> kernel drivers loaded. What I'm not sure about is which GPU's display
> controller is actually hooked up to the physical connector? Perhaps
> there is a MUX like there is on Versatile Express?
>
> What I also don't understand is what DDX driver is loaded? Is it
> xf86-video-intel, xf86-video-nouveau or both? I get the impression
> that there's a "master" DDX which implements 2D operations but can
> import buffers using PRIME from the other driver and draw to them.
> Or is it more that it's able to export rendered buffers to the
> "slave" DRM for scanout? Either way, it's pretty similar to an ARM
> SoC setup which has the GPU and the display as two totally
> independent devices.
>
In the early days, there was a MUX to switch the displays between the
two GPUs, but most modern systems are MUX-less and the dGPU is either
connected to no displays or in some cases the local panel is attached
to the integrated GPU and the external displays are connected to the
dGPU.
In the MUX-less case, the dGPU can be used to render, and then the
result is copied to the integrated GPU for display.
Alex
On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> ... This is the purpose of the attach step,
>> >> so you know all the devices involved in sharing up front before
>> >> allocating the backing pages. (Or in the worst case, if you have a
>> >> "late attacher" you at least know when no device is doing dma access
>> >> to a buffer and can reallocate and move the buffer.) A long time
>> >> back, I had a patch that added a field or two to 'struct
>> >> device_dma_parameters' so that it could be known if a device
>> >> required contiguous buffers.. looks like that never got merged, so
>> >> I'd need to dig that back up and resend it. But the idea was to
>> >> have the 'struct device' encapsulate all the information that would
>> >> be needed to do-the-right-thing when it comes to placement.
>> >
>> > As I understand it, it's up to the exporting device to allocate the
>> > memory backing the dma_buf buffer. I guess the latest possible point
>> > you can allocate the backing pages is when map_dma_buf is first
>> > called? At that point the exporter can iterate over the current set
>> > of attachments, programmatically determine the all the constraints of
>> > all the attached drivers and attempt to allocate the backing pages
>> > in such a way as to satisfy all those constraints?
>>
>> yes, this is the idea.. possibly some room for some helpers to help
>> out with this, but that is all under the hood from userspace
>> perspective
>>
>> > Didn't you say that programmatically describing device placement
>> > constraints was an unbounded problem? I guess we would have to
>> > accept that it's not possible to describe all possible constraints
>> > and instead find a way to describe the common ones?
>>
>> well, the point I'm trying to make, is by dividing your constraints
>> into two groups, one that impacts and is handled by userspace, and one
>> that is in the kernel (ie. where the pages go), you cut down the
>> number of permutations that the kernel has to care about considerably.
>> And kernel already cares about, for example, what range of addresses
>> that a device can dma to/from. I think really the only thing missing
>> is the max # of sglist entries (contiguous or not)
>
> I think it's more than physically contiguous or not.
>
> For example, it can be more efficient to use large page sizes on
> devices with IOMMUs to reduce TLB traffic. I think the size and even
> the availability of large pages varies between different IOMMUs.
sure.. but I suppose if we can spiff out dma_params to express "I need
contiguous", perhaps we can add some way to express "I prefer
as-contiguous-as-possible".. either way, this is about where the pages
are placed, and not about the layout of pixels within the page, so
should be in kernel. It's something that is missing, but I believe
that it belongs in dma_params and hidden behind dma_alloc_*() for
simple drivers.
> There's also the issue of buffer stride alignment. As I say, if the
> buffer is to be written by a tile-based GPU like Mali, it's more
> efficient if the buffer's stride is aligned to the max AXI bus burst
> length. Though I guess a buffer stride only makes sense as a concept
> when interpreting the data as a linear-layout 2D image, so perhaps
> belongs in user-space along with format negotiation?
>
Yeah.. this isn't about where the pages go, but about the arrangement
within a page.
And, well, except for hw that supports the same tiling (or
compressed-fb) in display+gpu, you probably aren't sharing tiled
buffers.
>
>> > One problem with this is it duplicates a lot of logic in each
>> > driver which can export a dma_buf buffer. Each exporter will need to
>> > do pretty much the same thing: iterate over all the attachments,
>> > determine of all the constraints (assuming that can be done) and
>> > allocate pages such that the lowest-common-denominator is satisfied.
>> >
>> > Perhaps rather than duplicating that logic in every driver, we could
>> > Instead move allocation of the backing pages into dma_buf itself?
>> >
>>
>> I tend to think it is better to add helpers as we see common patterns
>> emerge, which drivers can opt-in to using. I don't think that we
>> should move allocation into dma_buf itself, but it would perhaps be
>> useful to have dma_alloc_*() variants that could allocate for multiple
>> devices.
>
> A helper could work I guess, though I quite like the idea of having
> dma_alloc_*() variants which take a list of devices to allocate memory
> for.
>
>
>> That would help for simple stuff, although I'd suspect
>> eventually a GPU driver will move away from that. (Since you probably
>> want to play tricks w/ pools of pages that are pre-zero'd and in the
>> correct cache state, use spare cycles on the gpu or dma engine to
>> pre-zero uncached pages, and games like that.)
>
> So presumably you're talking about a GPU driver being the exporter
> here? If so, how could the GPU driver do these kind of tricks on
> memory shared with another device?
>
Yes, that is gpu-as-exporter. If someone else is allocating buffers,
it is up to them to do these tricks or not. Probably there is a
pretty good chance that if you aren't a GPU you don't need those sort
of tricks for fast allocation of transient upload buffers, staging
textures, temporary pixmaps, etc. Ie. I don't really think a v4l
camera or video decoder would benefit from that sort of optimization.
>
>> >> > Anyway, assuming user-space can figure out how a buffer should be
>> >> > stored in memory, how does it indicate this to a kernel driver and
>> >> > actually allocate it? Which ioctl on which device does user-space
>> >> > call, with what parameters? Are you suggesting using something
>> >> > like ION which exposes the low-level details of how buffers are
>> > >> laid out in physical memory to userspace? If not, what?
>> >> >
>> >>
>> >> no, userspace should not need to know this. And having a central
>> >> driver that knows this for all the other drivers in the system
>> >> doesn't really solve anything and isn't really scalable. At best
>> >> you might want, in some cases, a flag you can pass when allocating.
>> >> For example, some of the drivers have a 'SCANOUT' flag that can be
>> >> passed when allocating a GEM buffer, as a hint to the kernel that
>> >> 'if this hw requires contig memory for scanout, allocate this
>> >> buffer contig'. But really, when it comes to sharing buffers
>> >> between devices, we want this sort of information in dev->dma_params
>> >> of the importing device(s).
>> >
>> > If you had a single driver which knew the constraints of all devices
>> > on that particular SoC and the interface allowed user-space to
>> > specify which devices a buffer is intended to be used with, I guess
>> > it could pretty trivially allocate pages which satisfy those
> constraints?
>>
>> keep in mind, even a number of SoC's come with pcie these days. You
>> already have things like
>>
>> https://developer.nvidia.com/content/kayla-platform
>>
>> You probably want to get out of the SoC mindset, otherwise you are
>> going to make bad assumptions that come back to bite you later on.
>
> Sure - there are always going to be PC-like devices where the
> hardware configuration isn't fixed like it is on a traditional SoC.
> But I'd rather have a simple solution which works on traditional SoCs
> than no solution at all. Today our solution is to over-load the dumb
> buffer alloc functions of the display's DRM driver - For now I'm just
> looking for the next step up from that! ;-)
>
True.. the original intention, which is perhaps a bit desktop-centric,
really was for there to be a userspace component talking to the drm
driver for allocation, ie. xf86-video-foo and/or
src/gallium/drivers/foo (for example) ;-)
Which means for x11 having a SoC vendor specific xf86-video-foo for
x11.. or vendor specific gbm implementation for wayland. (Although
at least in the latter case it is a pretty small piece of code.) But
that is probably what you are trying to avoid.
At any rate, for both xorg and wayland/gbm, you know when a buffer is
going to be a scanout buffer. What I'd recommend is define a small
userspace API that your customers (the SoC vendors) implement to
allocate a scanout buffer and hand you back a dmabuf fd. That could
be used both for x11 and for gbm. Inputs should be requested
width/height and format. And outputs pitch plus dmabuf fd.
(Actually you might even just want to use gbm as your starting point.
You could probably just use gbm from xf86-video-armsoc for allocation,
to have one thing that works for both wayland and x11. Scanout and
cursor buffers should go to vendor/SoC specific fxn, rest can be
allocated from mali kernel driver.)
>
>> > wouldn't need a way to programmatically describe the constraints
>> > either: As you say, if userspace sets the "SCANOUT" flag, it would
>> > just "know" that on this SoC, that buffer needs to be physically
>> > contiguous for example.
>>
>> not really.. it just knows it wants to scanout the buffer, and tells
>> this as a hint to the kernel.
>>
>> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> (where phys contig is not required for scanout), but causes CMA
>> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just
>> knows that it wants to be able to scanout that particular buffer.
>
> I think that's the idea? The omap3's allocator driver would use
> contiguous memory when it detects the SCANOUT flag whereas the omap4
> allocator driver wouldn't have to. No complex negotiation of
> constraints - it just "knows".
>
well, it is same allocating driver in both cases (although maybe that
is unimportant). The "it" that just knows it wants to scanout is
userspace. The "it" that just knows that scanout translates to
contiguous (or not) is the kernel. Perhaps we are saying the same
thing ;-)
BR,
-R
>
> Cheers,
>
> Tom
>
>
>
>
Hello,
This is a fourth version of my proposal for device tree integration for
reserved memory and Contiguous Memory Allocator. After the comments from
Grant Likely I moved back memory region definitions back to /memory node
(as it was in the first version of this proposal). I've also extended
the code and made it more generic, added support for so called reserved
dma memory (special dma memory regions created by dma_alloc_coherent()
function, for exclusive usage for dma allocation for the given device).
Just a few words for those who see this code for the first time:
The proposed bindings allows to define contiguous memory regions of
specified base address and size. Then, the defined regions can be
assigned to the given device(s) by adding a property with a phanle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all the memory allocations from
dma-mapping subsystem will be served from the defined contiguous memory
regions.
Contiguous Memory Allocator is a framework, which lets to provide a
large contiguous memory buffers for (usually a multimedia) devices. The
contiguous memory is reserved during early boot and then shared with
kernel, which is allowed to allocate it for movable pages. Then, when
device driver requests a contigouous buffer, the framework migrates
movable pages out of contiguous region and gives it to the driver. When
device driver frees the buffer, it is added to kernel memory pool again.
For more information, please refer to commit c64be2bb1c6eb43c838b2c6d57
("drivers: add Contiguous Memory Allocator") and d484864dd96e1830e76895
(CMA merge commit).
Why we need device tree bindings for CMA at all?
Older ARM kernels used so called board-based initialization. Those board
files contained a definition of all hardware blocks available on the
target system and particular kernel and driver software configuration
selected by the board maintainer.
In the new approach the board files will be removed completely and
Device Tree approach is used to describe all hardware blocks available
on the target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with other operating systems than Linux kernel.
Reserved memory configuration belongs to the grey area. It might depend
on hardware restriction of the board or modules and low-level
configuration done by bootloader. Putting reserved and contiguous memory
regions to /memory node and having phandles to those regions in the
device nodes however matches well with the device-tree typical style of
linking devices with other resources like clocks, interrupts,
regulators, power domains, etc. This is the main reason to use such
approach instead of putting everything to /chosen node as it has been
proposed in v2 and v3.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v4:
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to /chosen/contiguous-memory/
node to avoid spreading Linux specific parameters over the whole device
tree definitions
- added support for autoconfigured regions (use zero base)
- fixes minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (4):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add function to scan fdt nodes given by path
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++++
arch/arm/mm/init.c | 3 +
drivers/base/dma-contiguous.c | 147 +++++++++++-----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 76 +++++++++++
drivers/of/of_reserved_mem.c | 175 ++++++++++++++++++++++++++
include/asm-generic/dma-coherent.h | 6 +
include/asm-generic/dma-contiguous.h | 2 -
include/linux/dma-contiguous.h | 49 +++++++-
include/linux/of_fdt.h | 3 +
11 files changed, 541 insertions(+), 79 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
--
1.7.9.5
On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz(a)linaro.org> wrote:
> On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark(a)gmail.com> wrote:
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace. This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>
> One concern I know the Android folks have expressed previously (and
> correct me if its no longer an objection), is that this attach time
> in-kernel constraint solving / moving or reallocating buffers is
> likely to hurt determinism. If I understood, their perspective was
> that userland knows the device path the buffers will travel through,
> so why not leverage that knowledge, rather then having the kernel have
> to sort it out for itself after the fact.
If you know the device path, then attach the buffer at all the devices
before you start using it. Problem solved.. kernel knows all devices
before pages need be allocated ;-)
BR,
-R
On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> Hi Rob,
>
>> >> > We may also then have additional constraints when sharing buffers
>> >> > between the display HW and video decode or even camera ISP HW.
>> >> > Programmatically describing buffer allocation constraints is very
>> >> > difficult and I'm not sure you can actually do it - there's some
>> >> > pretty complex constraints out there! E.g. I believe there's a
>> >> > platform where Y and UV planes of the reference frame need to be
>> >> > in separate DRAM banks for real-time 1080p decode, or something
>> >> > like that?
>> >>
>> >> yes, this was discussed. This is different from pitch/format/size
>> >> constraints.. it is really just a placement constraint (ie. where
>> >> do the physical pages go). IIRC the conclusion was to use a dummy
>> >> devices with it's own CMA pool for attaching the Y vs UV buffers.
>> >>
>> >> > Anyway, I guess my point is that even if we solve how to allocate
>> >> > buffers which will be shared between the GPU and display HW such
>> >> > that both sets of constraints are satisfied, that may not be the
>> >> > end of the story.
>> >> >
>> >>
>> >> that was part of the reason to punt this problem to userspace ;-)
>> >>
>> >> In practice, the kernel drivers doesn't usually know too much about
>> >> the dimensions/format/etc.. that is really userspace level
>> >> knowledge. There are a few exceptions when the kernel needs to know
>> >> how to setup GTT/etc for tiled buffers, but normally this sort of
>> >> information is up at the next level up (userspace, and
>> >> drm_framebuffer in case of scanout). Userspace media frameworks
>> >> like GStreamer already have a concept of format/caps negotiation.
>> >> For non-display<->gpu sharing, I think this is probably where this
>> >> sort of constraint negotiation should be handled.
>> >
>> > I agree that user-space will know which devices will access the
>> > buffer and thus can figure out at least a common pixel format.
>> > Though I'm not so sure userspace can figure out more low-level
>> > details like alignment and placement in physical memory, etc.
>> >
>>
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>
> Absolutely. Pixel format has to be negotiated by user-space as in
> most cases, user-space can map the buffer and thus will need to
> know how to interpret the data.
>
>
>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace.
>
> Seems sensible to me.
>
>
>> ... This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>> back, I had a patch that added a field or two to 'struct
>> device_dma_parameters' so that it could be known if a device required
>> contiguous buffers.. looks like that never got merged, so I'd need to
>> dig that back up and resend it. But the idea was to have the 'struct
>> device' encapsulate all the information that would be needed to
>> do-the-right-thing when it comes to placement.
>
> As I understand it, it's up to the exporting device to allocate the
> memory backing the dma_buf buffer. I guess the latest possible point
> you can allocate the backing pages is when map_dma_buf is first
> called? At that point the exporter can iterate over the current set
> of attachments, programmatically determine the all the constraints of
> all the attached drivers and attempt to allocate the backing pages
> in such a way as to satisfy all those constraints?
yes, this is the idea.. possibly some room for some helpers to help
out with this, but that is all under the hood from userspace
perspective
> Didn't you say that programmatically describing device placement
> constraints was an unbounded problem? I guess we would have to
> accept that it's not possible to describe all possible constraints
> and instead find a way to describe the common ones?
well, the point I'm trying to make, is by dividing your constraints
into two groups, one that impacts and is handled by userspace, and one
that is in the kernel (ie. where the pages go), you cut down the
number of permutations that the kernel has to care about considerably.
And kernel already cares about, for example, what range of addresses
that a device can dma to/from. I think really the only thing missing
is the max # of sglist entries (contiguous or not)
> One problem with this is it duplicates a lot of logic in each
> driver which can export a dma_buf buffer. Each exporter will need to
> do pretty much the same thing: iterate over all the attachments,
> determine of all the constraints (assuming that can be done) and
> allocate pages such that the lowest-common-denominator is satisfied.
>
> Perhaps rather than duplicating that logic in every driver, we could
> Instead move allocation of the backing pages into dma_buf itself?
>
I tend to think it is better to add helpers as we see common patterns
emerge, which drivers can opt-in to using. I don't think that we
should move allocation into dma_buf itself, but it would perhaps be
useful to have dma_alloc_*() variants that could allocate for multiple
devices. That would help for simple stuff, although I'd suspect
eventually a GPU driver will move away from that. (Since you probably
want to play tricks w/ pools of pages that are pre-zero'd and in the
correct cache state, use spare cycles on the gpu or dma engine to
pre-zero uncached pages, and games like that.)
>
>> > Anyway, assuming user-space can figure out how a buffer should be
>> > stored in memory, how does it indicate this to a kernel driver and
>> > actually allocate it? Which ioctl on which device does user-space
>> > call, with what parameters? Are you suggesting using something like
>> > ION which exposes the low-level details of how buffers are laid out
>> in
>> > physical memory to userspace? If not, what?
>>
>> no, userspace should not need to know this. And having a central
>> driver that knows this for all the other drivers in the system doesn't
>> really solve anything and isn't really scalable. At best you might
>> want, in some cases, a flag you can pass when allocating. For
>> example, some of the drivers have a 'SCANOUT' flag that can be passed
>> when allocating a GEM buffer, as a hint to the kernel that 'if this hw
>> requires contig memory for scanout, allocate this buffer contig'. But
>> really, when it comes to sharing buffers between devices, we want this
>> sort of information in dev->dma_params of the importing device(s).
>
> If you had a single driver which knew the constraints of all devices
> on that particular SoC and the interface allowed user-space to specify
> which devices a buffer is intended to be used with, I guess it could
> pretty trivially allocate pages which satisfy those constraints? It
keep in mind, even a number of SoC's come with pcie these days. You
already have things like
https://developer.nvidia.com/content/kayla-platform
You probably want to get out of the SoC mindset, otherwise you are
going to make bad assumptions that come back to bite you later on.
> wouldn't need a way to programmatically describe the constraints
> either: As you say, if userspace sets the "SCANOUT" flag, it would
> just "know" that on this SoC, that buffer needs to be physically
> contiguous for example.
not really.. it just knows it wants to scanout the buffer, and tells
this as a hint to the kernel.
For example, on omapdrm, the SCANOUT flag does nothing on omap4+
(where phys contig is not required for scanout), but causes CMA
(dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just
knows that it wants to be able to scanout that particular buffer.
> Though It would effectively mean you'd need an "allocation" driver per
> SoC, which as you say may not be scalable?
Right.. and not actually even possible in the general sense (see SoC +
external pcie gfx card)
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey:
> Hi Rob,
>
> +lkml
>
> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
> > >> wrote:
> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to
> > >> >> > also allocate buffers for the GPU. Still not sure how to
> > >> >> > resolve this as we don't use DRM for our GPU driver.
> > >> >>
> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term
> > >> >> (esp. once the dma-fence stuff is in place), we'd have
> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display
> > >> >> specific drm/kms driver, using prime/dmabuf to share between
> > >> >> the two.
> > >> >
> > >> > The "extra" buffers we were allocating from armsoc DDX were really
> > >> > being allocated through DRM/GEM so we could get an flink name
> > >> > for them and pass a reference to them back to our GPU driver on
> > >> > the client side. If it weren't for our need to access those
> > >> > extra off-screen buffers with the GPU we wouldn't need to
> > >> > allocate them with DRM at all. So, given they are really "GPU"
> > >> > buffers, it does absolutely make sense to allocate them in a
> > >> > different driver to the display driver.
> > >> >
> > >> > However, to avoid unnecessary memcpys & related cache
> > >> > maintenance ops, we'd also like the GPU to render into buffers
> > >> > which are scanned out by the display controller. So let's say
> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> > >> > out buffers with the display's DRM driver but a custom ioctl
> > >> > on the GPU's DRM driver to allocate non scanout, off-screen
> > >> > buffers. Sounds great, but I don't think that really works
> > >> > with DRI2. If we used two drivers to allocate buffers, which
> > >> > of those drivers do we return in DRI2ConnectReply? Even if we
> > >> > solve that somehow, GEM flink names are name-spaced to a
> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> > >> > how does the EGL in the client know which DRM device owns GEM
> > >> > flink name "1234"? We'd need some pretty dirty hacks.
> > >>
> > >> You would return the name of the display driver allocating the
> > >> buffers. On the client side you can use generic ioctls to go from
> > >> flink -> handle -> dmabuf. So the client side would end up opening
> > >> both the display drm device and the gpu, but without needing to know
> > >> too much about the display.
> > >
> > > I think the bit I was missing was that a GEM bo for a buffer imported
> > > using dma_buf/PRIME can still be flink'd. So the display controller's
> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate
> > > ioctl. Those scan-out buffers than then be exported from the
> > > dispaly's DRM driver and imported into the GPU's DRM driver using
> > > PRIME. Once imported into the GPU's driver, we can use flink to get a
> > > name for that buffer within the GPU DRM driver's name-space to return
> > > to the DRI2 client. That same namespace is also what DRI2 back-
> > > buffers are allocated from, so I think that could work... Except...
> >
> > (and.. the general direction is that things will move more to just use
> > dmabuf directly, ie. wayland or dri3)
>
> I agree, DRI2 is the only reason why we need a system-wide ID. I also
> prefer buffers to be passed around by dma_buf fd, but we still need to
> support DRI2 and will do for some time I expect.
>
>
>
> > >> > Anyway, that latter case also gets quite difficult. The "GPU"
> > >> > DRM driver would need to know the constraints of the display
> > >> > controller when allocating buffers intended to be scanned out.
> > >> > For example, pl111 typically isn't behind an IOMMU and so
> > >> > requires physically contiguous memory. We'd have to teach the
> > >> > GPU's DRM driver about the constraints of the display HW. Not
> > >> > exactly a clean driver model. :-(
> > >> >
> > >> > I'm still a little stuck on how to proceed, so any ideas
> > >> > would greatly appreciated! My current train of thought is
> > >> > having a kind of SoC-specific DRM driver which allocates
> > >> > buffers for both display and GPU within a single GEM
> > >> > namespace. That SoC-specific DRM driver could then know the
> > >> > constraints of both the GPU and the display HW. We could then
> > >> > use PRIME to export buffers allocated with the SoC DRM driver
> > >> > and import them into the GPU and/or display DRM driver.
> > >>
> > >> Usually if the display drm driver is allocating the buffers that
> > >> might be scanned out, it just needs to have minimal knowledge of
> > >> the GPU (pitch alignment constraints). I don't think we need a
> > >> 3rd device just to allocate buffers.
> > >
> > > While Mali can render to pretty much any buffer, there is a mild
> > > performance improvement to be had if the buffer stride is aligned to
> > > the AXI bus's max burst length when drawing to the buffer.
> >
> > I suspect the display controllers might frequently benefit if the
> > pitch is aligned to AXI burst length too..
>
> If the display controller is going to be reading from linear memory
> I don't think it will make much difference - you'll just get an extra
> 1-2 bus transactions per scanline. With a tile-based GPU like Mali,
> you get those extra transactions per _tile_ scan-line and as such,
> the overhead is more pronounced.
>
>
>
> > > So in some respects, there is a constraint on how buffers which will
> > > be drawn to using the GPU are allocated. I don't really like the idea
> > > of teaching the display controller DRM driver about the GPU buffer
> > > constraints, even if they are fairly trivial like this. If the same
> > > display HW IP is being used on several SoCs, it seems wrong somehow
> > > to enforce those GPU constraints if some of those SoCs don't have a
> > > GPU.
> >
> > Well, I suppose you could get min_pitch_alignment from devicetree, or
> > something like this..
> >
> > In the end, the easy solution is just to make the display allocate to
> > the worst-case pitch alignment. In the early days of dma-buf
> > discussions, we kicked around the idea of negotiating or
> > programatically describing the constraints, but that didn't really
> > seem like a bounded problem.
>
> Yeah - I was around for some of those discussions and agree it's not
> really an easy problem to solve.
>
>
>
> > > We may also then have additional constraints when sharing buffers
> > > between the display HW and video decode or even camera ISP HW.
> > > Programmatically describing buffer allocation constraints is very
> > > difficult and I'm not sure you can actually do it - there's some
> > > pretty complex constraints out there! E.g. I believe there's a
> > > platform where Y and UV planes of the reference frame need to be in
> > > separate DRAM banks for real-time 1080p decode, or something like
> > > that?
> >
> > yes, this was discussed. This is different from pitch/format/size
> > constraints.. it is really just a placement constraint (ie. where do
> > the physical pages go). IIRC the conclusion was to use a dummy
> > devices with it's own CMA pool for attaching the Y vs UV buffers.
> >
> > > Anyway, I guess my point is that even if we solve how to allocate
> > > buffers which will be shared between the GPU and display HW such that
> > > both sets of constraints are satisfied, that may not be the end of
> > > the story.
> > >
> >
> > that was part of the reason to punt this problem to userspace ;-)
> >
> > In practice, the kernel drivers doesn't usually know too much about
> > the dimensions/format/etc.. that is really userspace level knowledge.
> > There are a few exceptions when the kernel needs to know how to setup
> > GTT/etc for tiled buffers, but normally this sort of information is up
> > at the next level up (userspace, and drm_framebuffer in case of
> > scanout). Userspace media frameworks like GStreamer already have a
> > concept of format/caps negotiation. For non-display<->gpu sharing, I
> > think this is probably where this sort of constraint negotiation
> > should be handled.
>
> I agree that user-space will know which devices will access the buffer
> and thus can figure out at least a common pixel format. Though I'm not
> so sure userspace can figure out more low-level details like alignment
> and placement in physical memory, etc.
>
> Anyway, assuming user-space can figure out how a buffer should be
> stored in memory, how does it indicate this to a kernel driver and
> actually allocate it? Which ioctl on which device does user-space
> call, with what parameters? Are you suggesting using something like
> ION which exposes the low-level details of how buffers are laid out in
> physical memory to userspace? If not, what?
>
I strongly disagree with exposing low-level hardware details like tiling
to userspace. If we have to do the negotiation of those things in
userspace we will end up with having to pipe those information through
things like the wayland protocol. I don't see how this could ever be
considered a good idea.
I would rather see kernel drivers negotiating those things at dmabuf
attach time in way invisible to userspace. I agree that this negotiation
thing isn't easy to get right for the plethora of different hardware
constraints we see today, but I would rather see this in-kernel, where
we have the chance to fix things up if needed, than in a fixed userspace
interface.
Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > So in some respects, there is a constraint on how buffers which will
>> > be drawn to using the GPU are allocated. I don't really like the idea
>> > of teaching the display controller DRM driver about the GPU buffer
>> > constraints, even if they are fairly trivial like this. If the same
>> > display HW IP is being used on several SoCs, it seems wrong somehow
>> > to enforce those GPU constraints if some of those SoCs don't have a
>> > GPU.
>>
>> Well, I suppose you could get min_pitch_alignment from devicetree, or
>> something like this..
>>
>> In the end, the easy solution is just to make the display allocate to
>> the worst-case pitch alignment. In the early days of dma-buf
>> discussions, we kicked around the idea of negotiating or
>> programatically describing the constraints, but that didn't really
>> seem like a bounded problem.
>
> Yeah - I was around for some of those discussions and agree it's not
> really an easy problem to solve.
>
>
>
>> > We may also then have additional constraints when sharing buffers
>> > between the display HW and video decode or even camera ISP HW.
>> > Programmatically describing buffer allocation constraints is very
>> > difficult and I'm not sure you can actually do it - there's some
>> > pretty complex constraints out there! E.g. I believe there's a
>> > platform where Y and UV planes of the reference frame need to be in
>> > separate DRAM banks for real-time 1080p decode, or something like
>> > that?
>>
>> yes, this was discussed. This is different from pitch/format/size
>> constraints.. it is really just a placement constraint (ie. where do
>> the physical pages go). IIRC the conclusion was to use a dummy
>> devices with it's own CMA pool for attaching the Y vs UV buffers.
>>
>> > Anyway, I guess my point is that even if we solve how to allocate
>> > buffers which will be shared between the GPU and display HW such that
>> > both sets of constraints are satisfied, that may not be the end of
>> > the story.
>> >
>>
>> that was part of the reason to punt this problem to userspace ;-)
>>
>> In practice, the kernel drivers doesn't usually know too much about
>> the dimensions/format/etc.. that is really userspace level knowledge.
>> There are a few exceptions when the kernel needs to know how to setup
>> GTT/etc for tiled buffers, but normally this sort of information is up
>> at the next level up (userspace, and drm_framebuffer in case of
>> scanout). Userspace media frameworks like GStreamer already have a
>> concept of format/caps negotiation. For non-display<->gpu sharing, I
>> think this is probably where this sort of constraint negotiation
>> should be handled.
>
> I agree that user-space will know which devices will access the buffer
> and thus can figure out at least a common pixel format. Though I'm not
> so sure userspace can figure out more low-level details like alignment
> and placement in physical memory, etc.
well, let's divide things up into two categories:
1) the arrangement and format of pixels.. ie. what userspace would
need to know if it mmap's a buffer. This includes pixel format,
stride, etc. This should be negotiated in userspace, it would be
crazy to try to do this in the kernel.
2) the physical placement of the pages. Ie. whether it is contiguous
or not. Which bank the pages in the buffer are placed in, etc. This
is not visible to userspace. This is the purpose of the attach step,
so you know all the devices involved in sharing up front before
allocating the backing pages. (Or in the worst case, if you have a
"late attacher" you at least know when no device is doing dma access
to a buffer and can reallocate and move the buffer.) A long time
back, I had a patch that added a field or two to 'struct
device_dma_parameters' so that it could be known if a device required
contiguous buffers.. looks like that never got merged, so I'd need to
dig that back up and resend it. But the idea was to have the 'struct
device' encapsulate all the information that would be needed to
do-the-right-thing when it comes to placement.
> Anyway, assuming user-space can figure out how a buffer should be
> stored in memory, how does it indicate this to a kernel driver and
> actually allocate it? Which ioctl on which device does user-space
> call, with what parameters? Are you suggesting using something like
> ION which exposes the low-level details of how buffers are laid out in
> physical memory to userspace? If not, what?
no, userspace should not need to know this. And having a central
driver that knows this for all the other drivers in the system doesn't
really solve anything and isn't really scalable. At best you might
want, in some cases, a flag you can pass when allocating. For
example, some of the drivers have a 'SCANOUT' flag that can be passed
when allocating a GEM buffer, as a hint to the kernel that 'if this hw
requires contig memory for scanout, allocate this buffer contig'. But
really, when it comes to sharing buffers between devices, we want this
sort of information in dev->dma_params of the importing device(s).
BR,
-R
On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> Hi Rob,
>
> +linux-media, +linaro-mm-sig for discussion of video/camera
> buffer constraints...
>
>
>> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
>> wrote:
>> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
>> >> > allocate buffers for the GPU. Still not sure how to resolve
>> >> > this as we don't use DRM for our GPU driver.
>> >>
>> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp.
>> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
>> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
>> >> using prime/dmabuf to share between the two.
>> >
>> > The "extra" buffers we were allocating from armsoc DDX were really
>> > being allocated through DRM/GEM so we could get an flink name
>> > for them and pass a reference to them back to our GPU driver on
>> > the client side. If it weren't for our need to access those
>> > extra off-screen buffers with the GPU we wouldn't need to
>> > allocate them with DRM at all. So, given they are really "GPU"
>> > buffers, it does absolutely make sense to allocate them in a
>> > different driver to the display driver.
>> >
>> > However, to avoid unnecessary memcpys & related cache
>> > maintenance ops, we'd also like the GPU to render into buffers
>> > which are scanned out by the display controller. So let's say
>> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
>> > out buffers with the display's DRM driver but a custom ioctl
>> > on the GPU's DRM driver to allocate non scanout, off-screen
>> > buffers. Sounds great, but I don't think that really works
>> > with DRI2. If we used two drivers to allocate buffers, which
>> > of those drivers do we return in DRI2ConnectReply? Even if we
>> > solve that somehow, GEM flink names are name-spaced to a
>> > single device node (AFAIK). So when we do a DRI2GetBuffers,
>> > how does the EGL in the client know which DRM device owns GEM
>> > flink name "1234"? We'd need some pretty dirty hacks.
>>
>> You would return the name of the display driver allocating the
>> buffers. On the client side you can use generic ioctls to go from
>> flink -> handle -> dmabuf. So the client side would end up opening
>> both the display drm device and the gpu, but without needing to know
>> too much about the display.
>
> I think the bit I was missing was that a GEM bo for a buffer imported
> using dma_buf/PRIME can still be flink'd. So the display controller's
> DRM driver allocates scan-out buffers via the DUMB buffer allocate
> ioctl. Those scan-out buffers than then be exported from the
> dispaly's DRM driver and imported into the GPU's DRM driver using
> PRIME. Once imported into the GPU's driver, we can use flink to get a
> name for that buffer within the GPU DRM driver's name-space to return
> to the DRI2 client. That same namespace is also what DRI2 back-buffers
> are allocated from, so I think that could work... Except...
>
(and.. the general direction is that things will move more to just use
dmabuf directly, ie. wayland or dri3)
>
>> > Anyway, that latter case also gets quite difficult. The "GPU"
>> > DRM driver would need to know the constraints of the display
>> > controller when allocating buffers intended to be scanned out.
>> > For example, pl111 typically isn't behind an IOMMU and so
>> > requires physically contiguous memory. We'd have to teach the
>> > GPU's DRM driver about the constraints of the display HW. Not
>> > exactly a clean driver model. :-(
>> >
>> > I'm still a little stuck on how to proceed, so any ideas
>> > would greatly appreciated! My current train of thought is
>> > having a kind of SoC-specific DRM driver which allocates
>> > buffers for both display and GPU within a single GEM
>> > namespace. That SoC-specific DRM driver could then know the
>> > constraints of both the GPU and the display HW. We could then
>> > use PRIME to export buffers allocated with the SoC DRM driver
>> > and import them into the GPU and/or display DRM driver.
>>
>> Usually if the display drm driver is allocating the buffers that might
>> be scanned out, it just needs to have minimal knowledge of the GPU
>> (pitch alignment constraints). I don't think we need a 3rd device
>> just to allocate buffers.
>
> While Mali can render to pretty much any buffer, there is a mild
> performance improvement to be had if the buffer stride is aligned to
> the AXI bus's max burst length when drawing to the buffer.
I suspect the display controllers might frequently benefit if the
pitch is aligned to AXI burst length too..
> So in some respects, there is a constraint on how buffers which will
> be drawn to using the GPU are allocated. I don't really like the idea
> of teaching the display controller DRM driver about the GPU buffer
> constraints, even if they are fairly trivial like this. If the same
> display HW IP is being used on several SoCs, it seems wrong somehow
> to enforce those GPU constraints if some of those SoCs don't have a
> GPU.
Well, I suppose you could get min_pitch_alignment from devicetree, or
something like this..
In the end, the easy solution is just to make the display allocate to
the worst-case pitch alignment. In the early days of dma-buf
discussions, we kicked around the idea of negotiating or
programatically describing the constraints, but that didn't really
seem like a bounded problem.
> We may also then have additional constraints when sharing buffers
> between the display HW and video decode or even camera ISP HW.
> Programmatically describing buffer allocation constraints is very
> difficult and I'm not sure you can actually do it - there's some
> pretty complex constraints out there! E.g. I believe there's a
> platform where Y and UV planes of the reference frame need to be in
> separate DRAM banks for real-time 1080p decode, or something like
> that?
yes, this was discussed. This is different from pitch/format/size
constraints.. it is really just a placement constraint (ie. where do
the physical pages go). IIRC the conclusion was to use a dummy
devices with it's own CMA pool for attaching the Y vs UV buffers.
> Anyway, I guess my point is that even if we solve how to allocate
> buffers which will be shared between the GPU and display HW such that
> both sets of constraints are satisfied, that may not be the end of
> the story.
>
that was part of the reason to punt this problem to userspace ;-)
In practice, the kernel drivers doesn't usually know too much about
the dimensions/format/etc.. that is really userspace level knowledge.
There are a few exceptions when the kernel needs to know how to setup
GTT/etc for tiled buffers, but normally this sort of information is up
at the next level up (userspace, and drm_framebuffer in case of
scanout). Userspace media frameworks like GStreamer already have a
concept of format/caps negotiation. For non-display<->gpu sharing, I
think this is probably where this sort of constraint negotiation
should be handled.
BR,
-R
>
> Cheers,
>
> Tom
>
>
>
>
>
Hi Rob,
+linux-media, +linaro-mm-sig for discussion of video/camera
buffer constraints...
> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
> wrote:
> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
> >> > allocate buffers for the GPU. Still not sure how to resolve
> >> > this as we don't use DRM for our GPU driver.
> >>
> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp.
> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
> >> using prime/dmabuf to share between the two.
> >
> > The "extra" buffers we were allocating from armsoc DDX were really
> > being allocated through DRM/GEM so we could get an flink name
> > for them and pass a reference to them back to our GPU driver on
> > the client side. If it weren't for our need to access those
> > extra off-screen buffers with the GPU we wouldn't need to
> > allocate them with DRM at all. So, given they are really "GPU"
> > buffers, it does absolutely make sense to allocate them in a
> > different driver to the display driver.
> >
> > However, to avoid unnecessary memcpys & related cache
> > maintenance ops, we'd also like the GPU to render into buffers
> > which are scanned out by the display controller. So let's say
> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> > out buffers with the display's DRM driver but a custom ioctl
> > on the GPU's DRM driver to allocate non scanout, off-screen
> > buffers. Sounds great, but I don't think that really works
> > with DRI2. If we used two drivers to allocate buffers, which
> > of those drivers do we return in DRI2ConnectReply? Even if we
> > solve that somehow, GEM flink names are name-spaced to a
> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> > how does the EGL in the client know which DRM device owns GEM
> > flink name "1234"? We'd need some pretty dirty hacks.
>
> You would return the name of the display driver allocating the
> buffers. On the client side you can use generic ioctls to go from
> flink -> handle -> dmabuf. So the client side would end up opening
> both the display drm device and the gpu, but without needing to know
> too much about the display.
I think the bit I was missing was that a GEM bo for a buffer imported
using dma_buf/PRIME can still be flink'd. So the display controller's
DRM driver allocates scan-out buffers via the DUMB buffer allocate
ioctl. Those scan-out buffers than then be exported from the
dispaly's DRM driver and imported into the GPU's DRM driver using
PRIME. Once imported into the GPU's driver, we can use flink to get a
name for that buffer within the GPU DRM driver's name-space to return
to the DRI2 client. That same namespace is also what DRI2 back-buffers
are allocated from, so I think that could work... Except...
> > Anyway, that latter case also gets quite difficult. The "GPU"
> > DRM driver would need to know the constraints of the display
> > controller when allocating buffers intended to be scanned out.
> > For example, pl111 typically isn't behind an IOMMU and so
> > requires physically contiguous memory. We'd have to teach the
> > GPU's DRM driver about the constraints of the display HW. Not
> > exactly a clean driver model. :-(
> >
> > I'm still a little stuck on how to proceed, so any ideas
> > would greatly appreciated! My current train of thought is
> > having a kind of SoC-specific DRM driver which allocates
> > buffers for both display and GPU within a single GEM
> > namespace. That SoC-specific DRM driver could then know the
> > constraints of both the GPU and the display HW. We could then
> > use PRIME to export buffers allocated with the SoC DRM driver
> > and import them into the GPU and/or display DRM driver.
>
> Usually if the display drm driver is allocating the buffers that might
> be scanned out, it just needs to have minimal knowledge of the GPU
> (pitch alignment constraints). I don't think we need a 3rd device
> just to allocate buffers.
While Mali can render to pretty much any buffer, there is a mild
performance improvement to be had if the buffer stride is aligned to
the AXI bus's max burst length when drawing to the buffer.
So in some respects, there is a constraint on how buffers which will
be drawn to using the GPU are allocated. I don't really like the idea
of teaching the display controller DRM driver about the GPU buffer
constraints, even if they are fairly trivial like this. If the same
display HW IP is being used on several SoCs, it seems wrong somehow
to enforce those GPU constraints if some of those SoCs don't have a
GPU.
We may also then have additional constraints when sharing buffers
between the display HW and video decode or even camera ISP HW.
Programmatically describing buffer allocation constraints is very
difficult and I'm not sure you can actually do it - there's some
pretty complex constraints out there! E.g. I believe there's a
platform where Y and UV planes of the reference frame need to be in
separate DRAM banks for real-time 1080p decode, or something like
that?
Anyway, I guess my point is that even if we solve how to allocate
buffers which will be shared between the GPU and display HW such that
both sets of constraints are satisfied, that may not be the end of
the story.
Cheers,
Tom
Hello,
This is a fourth version of my proposal for device tree integration for
reserved memory and Contiguous Memory Allocator. After the comments from
Grant Likely I moved back memory region definitions back to /memory node
(as it was in the first version of this proposal). I've also extended
the code and made it more generic, added support for so called reserved
dma memory (special dma memory regions created by dma_alloc_coherent()
function, for exclusive usage for dma allocation for the given device).
Just a few words for those who see this code for the first time:
The proposed bindings allows to define contiguous memory regions of
specified base address and size. Then, the defined regions can be
assigned to the given device(s) by adding a property with a phanle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all the memory allocations from
dma-mapping subsystem will be served from the defined contiguous memory
regions.
Contiguous Memory Allocator is a framework, which lets to provide a
large contiguous memory buffers for (usually a multimedia) devices. The
contiguous memory is reserved during early boot and then shared with
kernel, which is allowed to allocate it for movable pages. Then, when
device driver requests a contigouous buffer, the framework migrates
movable pages out of contiguous region and gives it to the driver. When
device driver frees the buffer, it is added to kernel memory pool again.
For more information, please refer to commit c64be2bb1c6eb43c838b2c6d57
("drivers: add Contiguous Memory Allocator") and d484864dd96e1830e76895
(CMA merge commit).
Why we need device tree bindings for CMA at all?
Older ARM kernels used so called board-based initialization. Those board
files contained a definition of all hardware blocks available on the
target system and particular kernel and driver software configuration
selected by the board maintainer.
In the new approach the board files will be removed completely and
Device Tree approach is used to describe all hardware blocks available
on the target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with other operating systems than Linux kernel.
Reserved memory configuration belongs to the grey area. It might depend
on hardware restriction of the board or modules and low-level
configuration done by bootloader. Putting reserved and contiguous memory
regions to /memory node and having phandles to those regions in the
device nodes however matches well with the device-tree typical style of
linking devices with other resources like clocks, interrupts,
regulators, power domains, etc. This is the main reason to use such
approach instead of putting everything to /chosen node as it has been
proposed in v2 and v3.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v4:
- corrected Devcie Tree mailing list address (resend)
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to /chosen/contiguous-memory/
node to avoid spreading Linux specific parameters over the whole device
tree definitions
- added support for autoconfigured regions (use zero base)
- fixes minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (4):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add function to scan fdt nodes given by path
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++++
arch/arm/mm/init.c | 3 +
drivers/base/dma-contiguous.c | 147 +++++++++++-----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 76 +++++++++++
drivers/of/of_reserved_mem.c | 175 ++++++++++++++++++++++++++
include/asm-generic/dma-coherent.h | 6 +
include/asm-generic/dma-contiguous.h | 2 -
include/linux/dma-contiguous.h | 49 +++++++-
include/linux/of_fdt.h | 3 +
11 files changed, 541 insertions(+), 79 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
--
1.7.9.5
From: "Luis R. Rodriguez" <mcgrof(a)do-not-panic.com>
This backports the kernel's wound/wait style locks 040a0a371,
using the linux-stable v3.11-rc2 as a base for development.
Given the complexity to support debugging mutexes this backport
implementation is simplified by only making this feature availabe
if you to have DEBUG_MUTEXES and DEBUG_LOCK_ALLOC disabled.
Given that ww mutex is required for DRM this also means we must
update the kconfig for DRM and require you to also not be able to build
DRM if you have either of these options enabled. Support for
DEBUG_MUTEXES and DEBUG_LOCK_ALLOC can be added later by anyone
daring. This uses the new dependencies file kconfig language
extension to specify the backport feature build restrictions
for DRM.
Part of the ww mutex addition to the kernel required modifying
the fast path mutex locking scheme by requiring you to deal
with the slow path alternatives on your own (refer to a41b56ef).
The reason for this change was that the mutex fastpath implementation
assumed your slowpath alternative can only be passed one argument
and the addition of ww mutexes requires dealing with the slow
path with a context passed.
It'd be painful to backport all asm for an optimized fastpath
implementation so we penalize the backport ww mutex fast path
by using the generic atomic_dec_return().
To backport a clean our own mutex_lock_common() with the least
amount of changes against upstream commits 2bd2c92c and 41fcb9f2
also needed to be backported. Commit 2bd2c92c dealt with adding
support for queue mutex spinners with an MCS lock, since this
cannot be backported for older kernels we provide empty inlines.
Commit 41fcb9f2 just removed SCHED_FEAT_OWNER_SPIN as it was an
early hack, the only thing required to backport this commit was
to provide an alternative declaration for mutex_spin_on_owner()
as it was declared non-inline for older kernels.
Finally c5491ea7 required backporting schedule_preempt_disabled()
as well but that just consisted of carrying over the original
implementation. Since its not exported we need to reimplement
it to make it available to our internal core ww mutex port.
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 040a0a371
v3.11-rc1~147^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains a41b56ef
v3.11-rc1~147^2~6
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 2bd2c92c
v3.10-rc1~200^2~3
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 41fcb9f2
v3.10-rc1~200^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains c5491ea7
v3.4-rc1~3^2~27
commit 040a0a37100563754bb1fee6ff6427420bcfa609
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Mon Jun 24 10:30:04 2013 +0200
mutex: Add support for wound/wait style locks
Wound/wait mutexes are used when other multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older tasks waits until it can acquire
the contended lock. The younger tasks needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.
For full documentation please read Documentation/ww-mutex-design.txt.
References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Rob Clark <robdclark(a)gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit a41b56efa70e060f650aeb54740aaf52044a1ead
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Thu Jun 20 13:31:05 2013 +0200
arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.
Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.
This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: robclark(a)gmail.com
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:13 2013 -0400
mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
<-- snip -->
commit 41fcb9f230bf773656d1768b73000ef720bf00c3
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:11 2013 -0400
mutex: Move mutex spinning code from sched/core.c back to mutex.c
<-- snip -->
commit c5491ea779793f977d282754db478157cc409d82
Author: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon Mar 21 12:09:35 2011 +0100
sched/rt: Add schedule_preempt_disabled()
<-- snip -->
Cc: maarten.lankhorst(a)canonical.com
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Rob Clark <robdclark(a)gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
---
backport/backport-include/linux/ww_mutex.h | 333 ++++++++++++++
backport/compat/Kconfig | 11 +
backport/compat/Makefile | 1 +
backport/compat/kernel/ww_mutex.c | 667 ++++++++++++++++++++++++++++
dependencies | 6 +
5 files changed, 1018 insertions(+)
create mode 100644 backport/backport-include/linux/ww_mutex.h
create mode 100644 backport/compat/kernel/ww_mutex.c
diff --git a/backport/backport-include/linux/ww_mutex.h b/backport/backport-include/linux/ww_mutex.h
new file mode 100644
index 0000000..0953939
--- /dev/null
+++ b/backport/backport-include/linux/ww_mutex.h
@@ -0,0 +1,333 @@
+#ifndef __BACKPORT_LINUX_WW_MUTEX_H
+#define __BACKPORT_LINUX_WW_MUTEX_H
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+#include_next <linux/ww_mutex.h>
+#else
+#ifdef CPTCFG_BACKPORT_BUILD_WW_MUTEX
+/*
+ * Wound/Wait Mutexes: blocking mutual exclusion locks with deadlock avoidance
+ *
+ * Original mutex implementation started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Wound/wait implementation:
+ * Copyright (C) 2013 Canonical Ltd.
+ *
+ * This file contains the main data structure and API definitions.
+ */
+
+#include <linux/mutex.h>
+
+struct ww_class {
+ atomic_long_t stamp;
+ struct lock_class_key acquire_key;
+ struct lock_class_key mutex_key;
+ const char *acquire_name;
+ const char *mutex_name;
+};
+
+struct ww_acquire_ctx {
+ struct task_struct *task;
+ unsigned long stamp;
+ unsigned acquired;
+};
+
+struct ww_mutex {
+ struct mutex base;
+ struct ww_acquire_ctx *ctx;
+};
+
+# define __WW_CLASS_MUTEX_INITIALIZER(lockname, ww_class)
+
+#define __WW_CLASS_INITIALIZER(ww_class) \
+ { .stamp = ATOMIC_LONG_INIT(0) \
+ , .acquire_name = #ww_class "_acquire" \
+ , .mutex_name = #ww_class "_mutex" }
+
+#define __WW_MUTEX_INITIALIZER(lockname, class) \
+ { .base = { \__MUTEX_INITIALIZER(lockname) } \
+ __WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
+
+#define DEFINE_WW_CLASS(classname) \
+ struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+
+#define DEFINE_WW_MUTEX(mutexname, ww_class) \
+ struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
+
+/**
+ * ww_mutex_init - initialize the w/w mutex
+ * @lock: the mutex to be initialized
+ * @ww_class: the w/w class the mutex should belong to
+ *
+ * Initialize the w/w mutex to unlocked state and associate it with the given
+ * class.
+ *
+ * It is not allowed to initialize an already locked mutex.
+ */
+#define ww_mutex_init LINUX_BACKPORT(ww_mutex_init)
+static inline void ww_mutex_init(struct ww_mutex *lock,
+ struct ww_class *ww_class)
+{
+ __mutex_init(&lock->base, ww_class->mutex_name, &ww_class->mutex_key);
+ lock->ctx = NULL;
+}
+
+/**
+ * ww_acquire_init - initialize a w/w acquire context
+ * @ctx: w/w acquire context to initialize
+ * @ww_class: w/w class of the context
+ *
+ * Initializes an context to acquire multiple mutexes of the given w/w class.
+ *
+ * Context-based w/w mutex acquiring can be done in any order whatsoever within
+ * a given lock class. Deadlocks will be detected and handled with the
+ * wait/wound logic.
+ *
+ * Mixing of context-based w/w mutex acquiring and single w/w mutex locking can
+ * result in undetected deadlocks and is so forbidden. Mixing different contexts
+ * for the same w/w class when acquiring mutexes can also result in undetected
+ * deadlocks, and is hence also forbidden. Both types of abuse will be caught by
+ * enabling CONFIG_PROVE_LOCKING.
+ *
+ * Nesting of acquire contexts for _different_ w/w classes is possible, subject
+ * to the usual locking rules between different lock classes.
+ *
+ * An acquire context must be released with ww_acquire_fini by the same task
+ * before the memory is freed. It is recommended to allocate the context itself
+ * on the stack.
+ */
+#define ww_acquire_init LINUX_BACKPORT(ww_acquire_init)
+static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
+ struct ww_class *ww_class)
+{
+ ctx->task = current;
+ ctx->stamp = atomic_long_inc_return(&ww_class->stamp);
+ ctx->acquired = 0;
+}
+
+/**
+ * ww_acquire_done - marks the end of the acquire phase
+ * @ctx: the acquire context
+ *
+ * Marks the end of the acquire phase, any further w/w mutex lock calls using
+ * this context are forbidden.
+ *
+ * Calling this function is optional, it is just useful to document w/w mutex
+ * code and clearly designated the acquire phase from actually using the locked
+ * data structures.
+ */
+#define ww_acquire_done LINUX_BACKPORT(ww_acquire_done)
+static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
+{
+}
+
+/**
+ * ww_acquire_fini - releases a w/w acquire context
+ * @ctx: the acquire context to free
+ *
+ * Releases a w/w acquire context. This must be called _after_ all acquired w/w
+ * mutexes have been released with ww_mutex_unlock.
+ */
+#define ww_acquire_fini LINUX_BACKPORT(ww_acquire_fini)
+static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
+{
+}
+
+#define __ww_mutex_lock LINUX_BACKPORT(__ww_mutex_lock)
+extern int __must_check __ww_mutex_lock(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+#define __ww_mutex_lock_interruptible LINUX_BACKPORT(__ww_mutex_lock_interruptible)
+extern int __must_check __ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+
+/**
+ * ww_mutex_lock - acquire the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context, or NULL to acquire only a single lock.
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow. Alternatively callers can opt to not acquire this
+ * lock and proceed with trying to acquire further w/w mutexes (e.g. when
+ * scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock LINUX_BACKPORT(ww_mutex_lock)
+static inline int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock(lock, ctx);
+
+ mutex_lock(&lock->base);
+ return 0;
+}
+
+/**
+ * ww_mutex_lock_interruptible - acquire the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired. If a
+ * signal arrives while waiting for the lock then this function returns -EINTR.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow_interruptible. Alternatively callers can opt to
+ * not acquire this lock and proceed with trying to acquire further w/w mutexes
+ * (e.g. when scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock_interruptible LINUX_BACKPORT(ww_mutex_lock_interruptible)
+static inline int __must_check ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock_interruptible(lock, ctx);
+ else
+ return mutex_lock_interruptible(&lock->base);
+}
+
+/**
+ * ww_mutex_lock_slow - slowpath acquiring of the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the context held. It is forbidden to call this on anything else than the
+ * contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock directly. This function here is simply to help w/w mutex
+ * locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow LINUX_BACKPORT(ww_mutex_lock_slow)
+static inline void
+ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+ ret = ww_mutex_lock(lock, ctx);
+ (void)ret;
+}
+
+/**
+ * ww_mutex_lock_slow_interruptible - slowpath acquiring of the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available and returns 0 when the lock has
+ * been acquired. If a signal arrives while waiting for the lock then this
+ * function returns -EINTR.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the given context held. It is forbidden to call this on anything else
+ * than the contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock_interruptible directly. This function here is simply to help
+ * w/w mutex locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow_interruptible LINUX_BACKPORT(ww_mutex_lock_slow_interruptible)
+static inline int __must_check
+ww_mutex_lock_slow_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return ww_mutex_lock_interruptible(lock, ctx);
+}
+
+#define ww_mutex_unlock LINUX_BACKPORT(ww_mutex_unlock)
+extern void ww_mutex_unlock(struct ww_mutex *lock);
+
+/**
+ * ww_mutex_trylock - tries to acquire the w/w mutex without acquire context
+ * @lock: mutex to lock
+ *
+ * Trylocks a mutex without acquire context, so no deadlock detection is
+ * possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
+ */
+#define ww_mutex_trylock LINUX_BACKPORT(ww_mutex_trylock)
+static inline int __must_check ww_mutex_trylock(struct ww_mutex *lock)
+{
+ return mutex_trylock(&lock->base);
+}
+
+/***
+ * ww_mutex_destroy - mark a w/w mutex unusable
+ * @lock: the mutex to be destroyed
+ *
+ * This function marks the mutex uninitialized, and any subsequent
+ * use of the mutex is forbidden. The mutex must not be locked when
+ * this function is called.
+ */
+#define ww_mutex_destroy LINUX_BACKPORT(ww_mutex_destroy)
+static inline void ww_mutex_destroy(struct ww_mutex *lock)
+{
+ mutex_destroy(&lock->base);
+}
+
+/**
+ * ww_mutex_is_locked - is the w/w mutex locked
+ * @lock: the mutex to be queried
+ *
+ * Returns 1 if the mutex is locked, 0 if unlocked.
+ */
+#define ww_mutex_is_locked LINUX_BACKPORT(ww_mutex_is_locked)
+static inline bool ww_mutex_is_locked(struct ww_mutex *lock)
+{
+ return mutex_is_locked(&lock->base);
+}
+
+#endif /* CPTCFG_BACKPORT_BUILD_WW_MUTEX */
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+#endif /* __BACKPORT_LINUX_WW_MUTEX_H */
diff --git a/backport/compat/Kconfig b/backport/compat/Kconfig
index e2f0cdd..f3c1ab3 100644
--- a/backport/compat/Kconfig
+++ b/backport/compat/Kconfig
@@ -185,6 +185,17 @@ config BACKPORT_LEDS_CLASS
config BACKPORT_LEDS_TRIGGERS
bool
+config BACKPORT_BUILD_WW_MUTEX
+ bool
+ # Build only if on kernels < 3.11
+ # For now only DRM drivers use ww mutexes.
+ depends on DRM && BACKPORT_KERNEL_3_11
+ default y if BACKPORT_USERSEL_BUILD_ALL
+ # probably a bad idea if you have these options given we
+ # ripped those options out.
+ depends on !DEBUG_MUTEXES
+ depends on !DEBUG_LOCK_ALLOC
+
config BACKPORT_BUILD_RADIX_HELPERS
bool
# You have selected to build backported DRM drivers
diff --git a/backport/compat/Makefile b/backport/compat/Makefile
index 252290e..fec01c4 100644
--- a/backport/compat/Makefile
+++ b/backport/compat/Makefile
@@ -41,3 +41,4 @@ compat-$(CPTCFG_BACKPORT_BUILD_KFIFO) += kfifo.o
compat-$(CPTCFG_BACKPORT_BUILD_GENERIC_ATOMIC64) += compat_atomic.o
compat-$(CPTCFG_BACKPORT_BUILD_DMA_SHARED_HELPERS) += dma-shared-helpers.o
compat-$(CPTCFG_BACKPORT_BUILD_RADIX_HELPERS) += lib-radix-tree-helpers.o
+compat-$(CPTCFG_BACKPORT_BUILD_WW_MUTEX) += kernel/ww_mutex.o
diff --git a/backport/compat/kernel/ww_mutex.c b/backport/compat/kernel/ww_mutex.c
new file mode 100644
index 0000000..257c2a4
--- /dev/null
+++ b/backport/compat/kernel/ww_mutex.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2013 Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
+ *
+ * Backport ww mutex for older kernels. This is not supported when
+ * DEBUG_MUTEXES or DEBUG_LOCK_ALLOC is enabled.
+ *
+ * Taken from: kernel/mutex.c - via linux-stable v3.11-rc2
+ *
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
+ * David Howells for suggestions and improvements.
+ *
+ * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ * from the -rt tree, where it was originally implemented for rtmutexes
+ * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ * and Sven Dietrich.
+ *
+ * Also see Documentation/mutex-design.txt.
+ */
+
+#include <linux/mutex.h>
+#include <linux/ww_mutex.h>
+#include <asm/mutex.h>
+#include <linux/sched.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0)
+#include <linux/sched/rt.h>
+#endif
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+#include <linux/version.h>
+
+/*
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
+ */
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+
+#define spin_lock_mutex(lock, flags) \
+ do { spin_lock(lock); (void)(flags); } while (0)
+#define spin_unlock_mutex(lock, flags) \
+ do { spin_unlock(lock); (void)(flags); } while (0)
+#define mutex_remove_waiter(lock, waiter, ti) \
+ __list_del((waiter)->list.prev, (waiter)->list.next)
+
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock)
+{
+ lock->owner = current;
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+ lock->owner = NULL;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock)
+{
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+}
+#endif
+
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) /* 2bd2c92c and 41fcb9f2 */
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
+static noinline
+#endif
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
+#else /* Backport 2bd2c92c: help keep backport_mutex_lock_common() clean */
+
+struct mspin_node {
+};
+#define MLOCK(mutex) NULL
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ return 1;
+}
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) */
+#endif /* CONFIG_MUTEX_SPIN_ON_OWNER */
+
+/*
+ * Release the lock, slowpath:
+ */
+static inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
+{
+ struct mutex *lock = container_of(lock_count, struct mutex, count);
+ unsigned long flags;
+
+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ /* debug_mutex_unlock(lock); */
+
+ /*
+ * some architectures leave the lock unlocked in the fastpath failure
+ * case, others need to leave it locked. In the later case we have to
+ * unlock it here
+ */
+ if (__mutex_slowpath_needs_to_unlock())
+ atomic_set(&lock->count, 1);
+
+ if (!list_empty(&lock->wait_list)) {
+ /* get the first entry from the wait-list: */
+ struct mutex_waiter *waiter =
+ list_entry(lock->wait_list.next,
+ struct mutex_waiter, list);
+
+ /* debug_mutex_wake_waiter(lock, waiter); */
+
+ wake_up_process(waiter->task);
+ }
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+}
+
+/*
+ * Release the lock, slowpath:
+ */
+static __used noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+/**
+ * ww_mutex_unlock - release the w/w mutex
+ * @lock: the mutex to be released
+ *
+ * Unlock a mutex that has been locked by this task previously with any of the
+ * ww_mutex_lock* functions (with or without an acquire context). It is
+ * forbidden to release the locks after releasing the acquire context.
+ *
+ * This function must not be used in interrupt context. Unlocking
+ * of a unlocked mutex is not allowed.
+ */
+void __sched ww_mutex_unlock(struct ww_mutex *lock)
+{
+ /*
+ * The unlocking fastpath is the 0->1 transition from 'locked'
+ * into 'unlocked' state:
+ */
+ if (lock->ctx) {
+ if (lock->ctx->acquired > 0)
+ lock->ctx->acquired--;
+ lock->ctx = NULL;
+ }
+
+ __mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+}
+EXPORT_SYMBOL_GPL(ww_mutex_unlock);
+
+static inline int __sched
+__mutex_lock_check_stamp(struct mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+ struct ww_acquire_ctx *hold_ctx = ACCESS_ONCE(ww->ctx);
+
+ if (!hold_ctx)
+ return 0;
+
+ if (unlikely(ctx == hold_ctx))
+ return -EALREADY;
+
+ if (ctx->stamp - hold_ctx->stamp <= LONG_MAX &&
+ (ctx->stamp != hold_ctx->stamp || ctx > hold_ctx)) {
+ return -EDEADLK;
+ }
+
+ return 0;
+}
+
+static __always_inline void ww_mutex_lock_acquired(struct ww_mutex *ww,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ ww_ctx->acquired++;
+}
+
+/*
+ * after acquiring lock with fastpath or when we lost out in contested
+ * slowpath, set ctx and wake up any waiters so they can recheck.
+ *
+ * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
+ * as the fastpath and opportunistic spinning are disabled in that case.
+ */
+static __always_inline void
+ww_mutex_set_context_fastpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ unsigned long flags;
+ struct mutex_waiter *cur;
+
+ ww_mutex_lock_acquired(lock, ctx);
+
+ lock->ctx = ctx;
+
+ /*
+ * The lock->ctx update should be visible on all cores before
+ * the atomic read is done, otherwise contended waiters might be
+ * missed. The contended waiters will either see ww_ctx == NULL
+ * and keep spinning, or it will acquire wait_lock, add itself
+ * to waiter list and sleep.
+ */
+ smp_mb(); /* ^^^ */
+
+ /*
+ * Check if lock is contended, if not there is nobody to wake up
+ */
+ if (likely(atomic_read(&lock->base.count) == 0))
+ return;
+
+ /*
+ * Uh oh, we raced in fastpath, wake up everyone in this case,
+ * so they can see the new lock->ctx.
+ */
+ spin_lock_mutex(&lock->base.wait_lock, flags);
+ list_for_each_entry(cur, &lock->base.wait_list, list) {
+ /* debug_mutex_wake_waiter(&lock->base, cur); */
+ wake_up_process(cur->task);
+ }
+ spin_unlock_mutex(&lock->base.wait_lock, flags);
+}
+
+/**
+ * backport_schedule_preempt_disabled - called with preemption disabled
+ *
+ * Backports c5491ea7. This is not exported so we leave it
+ * here as this is the only current core user on backports.
+ * Although available on >= 3.4 its only for in-kernel code so
+ * we provide our own.
+ *
+ * Returns with preemption disabled. Note: preempt_count must be 1
+ */
+static void __sched backport_schedule_preempt_disabled(void)
+{
+ preempt_enable_no_resched();
+ schedule();
+ preempt_disable();
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static __always_inline int __sched
+__backport_mutex_lock_common(struct mutex *lock, long state,
+ unsigned int subclass,
+ struct lockdep_map *nest_lock, unsigned long ip,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ struct task_struct *task = current;
+ struct mutex_waiter waiter;
+ unsigned long flags;
+ int ret;
+
+ preempt_disable();
+ mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ /*
+ * Optimistic spinning.
+ *
+ * We try to spin for acquisition when we find that there are no
+ * pending waiters and the lock owner is currently running on a
+ * (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
+ * doesn't track the owner atomically in the lock field, we need to
+ * track it non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
+ * to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
+ */
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;
+
+ for (;;) {
+ struct task_struct *owner;
+ struct mspin_node node;
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ struct ww_mutex *ww;
+
+ ww = container_of(lock, struct ww_mutex, base);
+ /*
+ * If ww->ctx is set the contents are undefined, only
+ * by acquiring wait_lock there is a guarantee that
+ * they are not invalid when reading.
+ *
+ * As such, when deadlock detection needs to be
+ * performed the optimistic spinning cannot be done.
+ */
+ if (ACCESS_ONCE(ww->ctx))
+ break;
+ }
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ mspin_lock(MLOCK(lock), &node);
+ owner = ACCESS_ONCE(lock->owner);
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
+ break;
+ }
+
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
+ lock_acquired(&lock->dep_map, ip);
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww;
+ ww = container_of(lock, struct ww_mutex, base);
+
+ ww_mutex_set_context_fastpath(ww, ww_ctx);
+ }
+
+ mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
+ preempt_enable();
+ return 0;
+ }
+ mspin_unlock(MLOCK(lock), &node);
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+ }
+slowpath:
+#endif
+ spin_lock_mutex(&lock->wait_lock, flags);
+
+ /* We don't support DEBUG_MUTEXES on the backport */
+ /* debug_mutex_lock_common(lock, &waiter); */
+ /* debug_mutex_add_waiter(lock, &waiter, task_thread_info(task)); */
+
+ /* add waiting tasks to the end of the waitqueue (FIFO): */
+ list_add_tail(&waiter.list, &lock->wait_list);
+ waiter.task = task;
+
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
+ goto done;
+
+ lock_contended(&lock->dep_map, ip);
+
+ for (;;) {
+ /*
+ * Lets try to take the lock again - this is needed even if
+ * we get here for the first time (shortly after failing to
+ * acquire the lock), to make sure that we get a wakeup once
+ * it's unlocked. Later on, if we sleep, this is the
+ * operation that gives us the lock. We xchg it to -1, so
+ * that when we release the lock, we properly wake up the
+ * other waiters:
+ */
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
+ break;
+
+ /*
+ * got a signal? (This code gets eliminated in the
+ * TASK_UNINTERRUPTIBLE case.)
+ */
+ if (unlikely(signal_pending_state(state, task))) {
+ ret = -EINTR;
+ goto err;
+ }
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ ret = __mutex_lock_check_stamp(lock, ww_ctx);
+ if (ret)
+ goto err;
+ }
+
+ __set_task_state(task, state);
+
+ /* didn't get the lock, go to sleep: */
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ backport_schedule_preempt_disabled();
+ spin_lock_mutex(&lock->wait_lock, flags);
+ }
+
+done:
+ lock_acquired(&lock->dep_map, ip);
+ /* got the lock - rejoice! */
+ mutex_remove_waiter(lock, &waiter, current_thread_info());
+ mutex_set_owner(lock);
+
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww = container_of(lock,
+ struct ww_mutex,
+ base);
+ struct mutex_waiter *cur;
+
+ /*
+ * This branch gets optimized out for the common case,
+ * and is only important for ww_mutex_lock.
+ */
+
+ ww_mutex_lock_acquired(ww, ww_ctx);
+ ww->ctx = ww_ctx;
+
+ /*
+ * Give any possible sleeping processes the chance to wake up,
+ * so they can recheck if they have to back off.
+ */
+ list_for_each_entry(cur, &lock->wait_list, list) {
+ /* debug_mutex_wake_waiter(lock, cur); */
+ wake_up_process(cur->task);
+ }
+ }
+
+ /* set it to 0 if there are no waiters left: */
+ if (likely(list_empty(&lock->wait_list)))
+ atomic_set(&lock->count, 0);
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+
+ /* debug_mutex_free_waiter(&waiter); */
+ preempt_enable();
+
+ return 0;
+
+err:
+ mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ /* debug_mutex_free_waiter(&waiter); */
+ mutex_release(&lock->dep_map, 1, ip);
+ preempt_enable();
+ return ret;
+}
+
+static noinline int __sched
+__ww_mutex_lock_slowpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+static noinline int __sched
+__ww_mutex_lock_interruptible_slowpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+/**
+ * __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ * from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ *
+ * For backporting purposes we can't use the older kernel's
+ * __mutex_fastpath_lock_retval() since upon failure of a fastpath
+ * lock we want to call our a failure routine with more than one argument, in
+ * this case the context for ww mutexes. Refer to commit a41b56ef the
+ * argument increase. It'd be painful to backport all asm code for the
+ * supported architectures so instead lets penalize the backport ww mutex
+ * fastpath lock with the not so efficient generic atomic_dec_return()
+ * implementation.
+ *
+ * Change the count from 1 to a value lower than 1. This function returns 0
+ * if the fastpath succeeds, or -1 otherwise.
+ */
+static inline int
+__backport_mutex_fastpath_lock_retval(atomic_t *count)
+{
+ if (unlikely(atomic_dec_return(count) < 0))
+ return -1;
+ return 0;
+}
+
+int __sched
+__ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock);
+
+int __sched
+__ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
diff --git a/dependencies b/dependencies
index 6841613..5c50571 100644
--- a/dependencies
+++ b/dependencies
@@ -48,6 +48,12 @@ MWIFIEX 2.6.27
# DRM stuff
HDMI 3.2
DRM 3.2
+# As of 3.11 DRM depends on the new ww_mutex which is
+# backported via BACKPORT_BUILD_WW_MUTEX. This backported
+# feature however has does not yet have support for
+# DEBUG_MUTEXES and DEBUG_LOCK_ALLOC.
+DRM kconfig: !BACKPORT_KERNEL_3_11 || !DEBUG_MUTEXES
+DRM kconfig: !BACKPORT_KERNEL_3_11 || !DEBUG_LOCK_ALLOC
DRM_TTM 3.2
# See e2bdb933, this was added on v3.3, in order to
# support DRM_QXL on 3.2 you'd have to backport 78c1d7848
--
1.7.10.4
From: "Luis R. Rodriguez" <mcgrof(a)do-not-panic.com>
This backports the kernel's wound/wait style locks 040a0a371,
using the linux-stable v3.11-rc2 as a base for development.
Given the complexity to support debugging mutexes this backport
implementation is simplified by only making this feature availabe
if you to have DEBUG_MUTEXES and DEBUG_LOCK_ALLOC disabled. Given
that ww mutex is required for DRM this also means we must update
the kconfig for DRM and require you to also not be able to build
DRM if you have either of these options enabled. Support for
DEBUG_MUTEXES and DEBUG_LOCK_ALLOC can be added later by anyone
daring.
Part of the ww mutex addition to the kernel required modifying
the fast path mutex locking scheme by requiring you to deal
with the slow path alternatives on your own (refer to a41b56ef).
The reason for this change was that the mutex fastpath implementation
assumed your slowpath alternative can only be passed one argument
and the addition of ww mutexes requires dealing with the slow
path with a context passed.
It'd be painful to backport all asm for an optimized fastpath
implementation so we penalize the backport ww mutex fast path
by using the generic atomic_dec_return().
To backport a clean our own mutex_lock_common() with the least
amount of changes against upstream commits 2bd2c92c and 41fcb9f2
also needed to be backported. Commit 2bd2c92c dealt with adding
support for queue mutex spinners with an MCS lock, since this
cannot be backported for older kernels we provide empty inlines.
Commit 41fcb9f2 just removed SCHED_FEAT_OWNER_SPIN as it was an
early hack, the only thing required to backport this commit was
to provide an alternative declaration for mutex_spin_on_owner()
as it was declared non-inline for older kernels.
Finally c5491ea7 required backporting schedule_preempt_disabled()
as well but that just consisted of carrying over the original
implementation. Since its not exported we need to reimplement
it to make it available to our internal core ww mutex port.
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 040a0a371
v3.11-rc1~147^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains a41b56ef
v3.11-rc1~147^2~6
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 2bd2c92c
v3.10-rc1~200^2~3
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 41fcb9f2
v3.10-rc1~200^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains c5491ea7
v3.4-rc1~3^2~27
commit 040a0a37100563754bb1fee6ff6427420bcfa609
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Mon Jun 24 10:30:04 2013 +0200
mutex: Add support for wound/wait style locks
Wound/wait mutexes are used when other multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older tasks waits until it can acquire
the contended lock. The younger tasks needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.
For full documentation please read Documentation/ww-mutex-design.txt.
References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Rob Clark <robdclark(a)gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit a41b56efa70e060f650aeb54740aaf52044a1ead
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Thu Jun 20 13:31:05 2013 +0200
arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.
Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.
This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: robclark(a)gmail.com
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:13 2013 -0400
mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
<-- snip -->
commit 41fcb9f230bf773656d1768b73000ef720bf00c3
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:11 2013 -0400
mutex: Move mutex spinning code from sched/core.c back to mutex.c
<-- snip -->
commit c5491ea779793f977d282754db478157cc409d82
Author: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon Mar 21 12:09:35 2011 +0100
sched/rt: Add schedule_preempt_disabled()
<-- snip -->
Cc: maarten.lankhorst(a)canonical.com
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Rob Clark <robdclark(a)gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
---
backport/backport-include/linux/ww_mutex.h | 333 ++++++++++++++
backport/compat/Kconfig | 11 +
backport/compat/Makefile | 1 +
backport/compat/kernel/ww_mutex.c | 667 ++++++++++++++++++++++++++++
4 files changed, 1012 insertions(+)
create mode 100644 backport/backport-include/linux/ww_mutex.h
create mode 100644 backport/compat/kernel/ww_mutex.c
diff --git a/backport/backport-include/linux/ww_mutex.h b/backport/backport-include/linux/ww_mutex.h
new file mode 100644
index 0000000..0953939
--- /dev/null
+++ b/backport/backport-include/linux/ww_mutex.h
@@ -0,0 +1,333 @@
+#ifndef __BACKPORT_LINUX_WW_MUTEX_H
+#define __BACKPORT_LINUX_WW_MUTEX_H
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+#include_next <linux/ww_mutex.h>
+#else
+#ifdef CPTCFG_BACKPORT_BUILD_WW_MUTEX
+/*
+ * Wound/Wait Mutexes: blocking mutual exclusion locks with deadlock avoidance
+ *
+ * Original mutex implementation started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Wound/wait implementation:
+ * Copyright (C) 2013 Canonical Ltd.
+ *
+ * This file contains the main data structure and API definitions.
+ */
+
+#include <linux/mutex.h>
+
+struct ww_class {
+ atomic_long_t stamp;
+ struct lock_class_key acquire_key;
+ struct lock_class_key mutex_key;
+ const char *acquire_name;
+ const char *mutex_name;
+};
+
+struct ww_acquire_ctx {
+ struct task_struct *task;
+ unsigned long stamp;
+ unsigned acquired;
+};
+
+struct ww_mutex {
+ struct mutex base;
+ struct ww_acquire_ctx *ctx;
+};
+
+# define __WW_CLASS_MUTEX_INITIALIZER(lockname, ww_class)
+
+#define __WW_CLASS_INITIALIZER(ww_class) \
+ { .stamp = ATOMIC_LONG_INIT(0) \
+ , .acquire_name = #ww_class "_acquire" \
+ , .mutex_name = #ww_class "_mutex" }
+
+#define __WW_MUTEX_INITIALIZER(lockname, class) \
+ { .base = { \__MUTEX_INITIALIZER(lockname) } \
+ __WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
+
+#define DEFINE_WW_CLASS(classname) \
+ struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+
+#define DEFINE_WW_MUTEX(mutexname, ww_class) \
+ struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
+
+/**
+ * ww_mutex_init - initialize the w/w mutex
+ * @lock: the mutex to be initialized
+ * @ww_class: the w/w class the mutex should belong to
+ *
+ * Initialize the w/w mutex to unlocked state and associate it with the given
+ * class.
+ *
+ * It is not allowed to initialize an already locked mutex.
+ */
+#define ww_mutex_init LINUX_BACKPORT(ww_mutex_init)
+static inline void ww_mutex_init(struct ww_mutex *lock,
+ struct ww_class *ww_class)
+{
+ __mutex_init(&lock->base, ww_class->mutex_name, &ww_class->mutex_key);
+ lock->ctx = NULL;
+}
+
+/**
+ * ww_acquire_init - initialize a w/w acquire context
+ * @ctx: w/w acquire context to initialize
+ * @ww_class: w/w class of the context
+ *
+ * Initializes an context to acquire multiple mutexes of the given w/w class.
+ *
+ * Context-based w/w mutex acquiring can be done in any order whatsoever within
+ * a given lock class. Deadlocks will be detected and handled with the
+ * wait/wound logic.
+ *
+ * Mixing of context-based w/w mutex acquiring and single w/w mutex locking can
+ * result in undetected deadlocks and is so forbidden. Mixing different contexts
+ * for the same w/w class when acquiring mutexes can also result in undetected
+ * deadlocks, and is hence also forbidden. Both types of abuse will be caught by
+ * enabling CONFIG_PROVE_LOCKING.
+ *
+ * Nesting of acquire contexts for _different_ w/w classes is possible, subject
+ * to the usual locking rules between different lock classes.
+ *
+ * An acquire context must be released with ww_acquire_fini by the same task
+ * before the memory is freed. It is recommended to allocate the context itself
+ * on the stack.
+ */
+#define ww_acquire_init LINUX_BACKPORT(ww_acquire_init)
+static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
+ struct ww_class *ww_class)
+{
+ ctx->task = current;
+ ctx->stamp = atomic_long_inc_return(&ww_class->stamp);
+ ctx->acquired = 0;
+}
+
+/**
+ * ww_acquire_done - marks the end of the acquire phase
+ * @ctx: the acquire context
+ *
+ * Marks the end of the acquire phase, any further w/w mutex lock calls using
+ * this context are forbidden.
+ *
+ * Calling this function is optional, it is just useful to document w/w mutex
+ * code and clearly designated the acquire phase from actually using the locked
+ * data structures.
+ */
+#define ww_acquire_done LINUX_BACKPORT(ww_acquire_done)
+static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
+{
+}
+
+/**
+ * ww_acquire_fini - releases a w/w acquire context
+ * @ctx: the acquire context to free
+ *
+ * Releases a w/w acquire context. This must be called _after_ all acquired w/w
+ * mutexes have been released with ww_mutex_unlock.
+ */
+#define ww_acquire_fini LINUX_BACKPORT(ww_acquire_fini)
+static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
+{
+}
+
+#define __ww_mutex_lock LINUX_BACKPORT(__ww_mutex_lock)
+extern int __must_check __ww_mutex_lock(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+#define __ww_mutex_lock_interruptible LINUX_BACKPORT(__ww_mutex_lock_interruptible)
+extern int __must_check __ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+
+/**
+ * ww_mutex_lock - acquire the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context, or NULL to acquire only a single lock.
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow. Alternatively callers can opt to not acquire this
+ * lock and proceed with trying to acquire further w/w mutexes (e.g. when
+ * scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock LINUX_BACKPORT(ww_mutex_lock)
+static inline int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock(lock, ctx);
+
+ mutex_lock(&lock->base);
+ return 0;
+}
+
+/**
+ * ww_mutex_lock_interruptible - acquire the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired. If a
+ * signal arrives while waiting for the lock then this function returns -EINTR.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow_interruptible. Alternatively callers can opt to
+ * not acquire this lock and proceed with trying to acquire further w/w mutexes
+ * (e.g. when scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock_interruptible LINUX_BACKPORT(ww_mutex_lock_interruptible)
+static inline int __must_check ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock_interruptible(lock, ctx);
+ else
+ return mutex_lock_interruptible(&lock->base);
+}
+
+/**
+ * ww_mutex_lock_slow - slowpath acquiring of the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the context held. It is forbidden to call this on anything else than the
+ * contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock directly. This function here is simply to help w/w mutex
+ * locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow LINUX_BACKPORT(ww_mutex_lock_slow)
+static inline void
+ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+ ret = ww_mutex_lock(lock, ctx);
+ (void)ret;
+}
+
+/**
+ * ww_mutex_lock_slow_interruptible - slowpath acquiring of the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available and returns 0 when the lock has
+ * been acquired. If a signal arrives while waiting for the lock then this
+ * function returns -EINTR.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the given context held. It is forbidden to call this on anything else
+ * than the contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock_interruptible directly. This function here is simply to help
+ * w/w mutex locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow_interruptible LINUX_BACKPORT(ww_mutex_lock_slow_interruptible)
+static inline int __must_check
+ww_mutex_lock_slow_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return ww_mutex_lock_interruptible(lock, ctx);
+}
+
+#define ww_mutex_unlock LINUX_BACKPORT(ww_mutex_unlock)
+extern void ww_mutex_unlock(struct ww_mutex *lock);
+
+/**
+ * ww_mutex_trylock - tries to acquire the w/w mutex without acquire context
+ * @lock: mutex to lock
+ *
+ * Trylocks a mutex without acquire context, so no deadlock detection is
+ * possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
+ */
+#define ww_mutex_trylock LINUX_BACKPORT(ww_mutex_trylock)
+static inline int __must_check ww_mutex_trylock(struct ww_mutex *lock)
+{
+ return mutex_trylock(&lock->base);
+}
+
+/***
+ * ww_mutex_destroy - mark a w/w mutex unusable
+ * @lock: the mutex to be destroyed
+ *
+ * This function marks the mutex uninitialized, and any subsequent
+ * use of the mutex is forbidden. The mutex must not be locked when
+ * this function is called.
+ */
+#define ww_mutex_destroy LINUX_BACKPORT(ww_mutex_destroy)
+static inline void ww_mutex_destroy(struct ww_mutex *lock)
+{
+ mutex_destroy(&lock->base);
+}
+
+/**
+ * ww_mutex_is_locked - is the w/w mutex locked
+ * @lock: the mutex to be queried
+ *
+ * Returns 1 if the mutex is locked, 0 if unlocked.
+ */
+#define ww_mutex_is_locked LINUX_BACKPORT(ww_mutex_is_locked)
+static inline bool ww_mutex_is_locked(struct ww_mutex *lock)
+{
+ return mutex_is_locked(&lock->base);
+}
+
+#endif /* CPTCFG_BACKPORT_BUILD_WW_MUTEX */
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+#endif /* __BACKPORT_LINUX_WW_MUTEX_H */
diff --git a/backport/compat/Kconfig b/backport/compat/Kconfig
index e2f0cdd..f3c1ab3 100644
--- a/backport/compat/Kconfig
+++ b/backport/compat/Kconfig
@@ -185,6 +185,17 @@ config BACKPORT_LEDS_CLASS
config BACKPORT_LEDS_TRIGGERS
bool
+config BACKPORT_BUILD_WW_MUTEX
+ bool
+ # Build only if on kernels < 3.11
+ # For now only DRM drivers use ww mutexes.
+ depends on DRM && BACKPORT_KERNEL_3_11
+ default y if BACKPORT_USERSEL_BUILD_ALL
+ # probably a bad idea if you have these options given we
+ # ripped those options out.
+ depends on !DEBUG_MUTEXES
+ depends on !DEBUG_LOCK_ALLOC
+
config BACKPORT_BUILD_RADIX_HELPERS
bool
# You have selected to build backported DRM drivers
diff --git a/backport/compat/Makefile b/backport/compat/Makefile
index 252290e..fec01c4 100644
--- a/backport/compat/Makefile
+++ b/backport/compat/Makefile
@@ -41,3 +41,4 @@ compat-$(CPTCFG_BACKPORT_BUILD_KFIFO) += kfifo.o
compat-$(CPTCFG_BACKPORT_BUILD_GENERIC_ATOMIC64) += compat_atomic.o
compat-$(CPTCFG_BACKPORT_BUILD_DMA_SHARED_HELPERS) += dma-shared-helpers.o
compat-$(CPTCFG_BACKPORT_BUILD_RADIX_HELPERS) += lib-radix-tree-helpers.o
+compat-$(CPTCFG_BACKPORT_BUILD_WW_MUTEX) += kernel/ww_mutex.o
diff --git a/backport/compat/kernel/ww_mutex.c b/backport/compat/kernel/ww_mutex.c
new file mode 100644
index 0000000..257c2a4
--- /dev/null
+++ b/backport/compat/kernel/ww_mutex.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2013 Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
+ *
+ * Backport ww mutex for older kernels. This is not supported when
+ * DEBUG_MUTEXES or DEBUG_LOCK_ALLOC is enabled.
+ *
+ * Taken from: kernel/mutex.c - via linux-stable v3.11-rc2
+ *
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
+ * David Howells for suggestions and improvements.
+ *
+ * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ * from the -rt tree, where it was originally implemented for rtmutexes
+ * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ * and Sven Dietrich.
+ *
+ * Also see Documentation/mutex-design.txt.
+ */
+
+#include <linux/mutex.h>
+#include <linux/ww_mutex.h>
+#include <asm/mutex.h>
+#include <linux/sched.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0)
+#include <linux/sched/rt.h>
+#endif
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+#include <linux/version.h>
+
+/*
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
+ */
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+
+#define spin_lock_mutex(lock, flags) \
+ do { spin_lock(lock); (void)(flags); } while (0)
+#define spin_unlock_mutex(lock, flags) \
+ do { spin_unlock(lock); (void)(flags); } while (0)
+#define mutex_remove_waiter(lock, waiter, ti) \
+ __list_del((waiter)->list.prev, (waiter)->list.next)
+
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock)
+{
+ lock->owner = current;
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+ lock->owner = NULL;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock)
+{
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+}
+#endif
+
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) /* 2bd2c92c and 41fcb9f2 */
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
+static noinline
+#endif
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
+#else /* Backport 2bd2c92c: help keep backport_mutex_lock_common() clean */
+
+struct mspin_node {
+};
+#define MLOCK(mutex) NULL
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ return 1;
+}
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) */
+#endif /* CONFIG_MUTEX_SPIN_ON_OWNER */
+
+/*
+ * Release the lock, slowpath:
+ */
+static inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
+{
+ struct mutex *lock = container_of(lock_count, struct mutex, count);
+ unsigned long flags;
+
+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ /* debug_mutex_unlock(lock); */
+
+ /*
+ * some architectures leave the lock unlocked in the fastpath failure
+ * case, others need to leave it locked. In the later case we have to
+ * unlock it here
+ */
+ if (__mutex_slowpath_needs_to_unlock())
+ atomic_set(&lock->count, 1);
+
+ if (!list_empty(&lock->wait_list)) {
+ /* get the first entry from the wait-list: */
+ struct mutex_waiter *waiter =
+ list_entry(lock->wait_list.next,
+ struct mutex_waiter, list);
+
+ /* debug_mutex_wake_waiter(lock, waiter); */
+
+ wake_up_process(waiter->task);
+ }
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+}
+
+/*
+ * Release the lock, slowpath:
+ */
+static __used noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+/**
+ * ww_mutex_unlock - release the w/w mutex
+ * @lock: the mutex to be released
+ *
+ * Unlock a mutex that has been locked by this task previously with any of the
+ * ww_mutex_lock* functions (with or without an acquire context). It is
+ * forbidden to release the locks after releasing the acquire context.
+ *
+ * This function must not be used in interrupt context. Unlocking
+ * of a unlocked mutex is not allowed.
+ */
+void __sched ww_mutex_unlock(struct ww_mutex *lock)
+{
+ /*
+ * The unlocking fastpath is the 0->1 transition from 'locked'
+ * into 'unlocked' state:
+ */
+ if (lock->ctx) {
+ if (lock->ctx->acquired > 0)
+ lock->ctx->acquired--;
+ lock->ctx = NULL;
+ }
+
+ __mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+}
+EXPORT_SYMBOL_GPL(ww_mutex_unlock);
+
+static inline int __sched
+__mutex_lock_check_stamp(struct mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+ struct ww_acquire_ctx *hold_ctx = ACCESS_ONCE(ww->ctx);
+
+ if (!hold_ctx)
+ return 0;
+
+ if (unlikely(ctx == hold_ctx))
+ return -EALREADY;
+
+ if (ctx->stamp - hold_ctx->stamp <= LONG_MAX &&
+ (ctx->stamp != hold_ctx->stamp || ctx > hold_ctx)) {
+ return -EDEADLK;
+ }
+
+ return 0;
+}
+
+static __always_inline void ww_mutex_lock_acquired(struct ww_mutex *ww,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ ww_ctx->acquired++;
+}
+
+/*
+ * after acquiring lock with fastpath or when we lost out in contested
+ * slowpath, set ctx and wake up any waiters so they can recheck.
+ *
+ * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
+ * as the fastpath and opportunistic spinning are disabled in that case.
+ */
+static __always_inline void
+ww_mutex_set_context_fastpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ unsigned long flags;
+ struct mutex_waiter *cur;
+
+ ww_mutex_lock_acquired(lock, ctx);
+
+ lock->ctx = ctx;
+
+ /*
+ * The lock->ctx update should be visible on all cores before
+ * the atomic read is done, otherwise contended waiters might be
+ * missed. The contended waiters will either see ww_ctx == NULL
+ * and keep spinning, or it will acquire wait_lock, add itself
+ * to waiter list and sleep.
+ */
+ smp_mb(); /* ^^^ */
+
+ /*
+ * Check if lock is contended, if not there is nobody to wake up
+ */
+ if (likely(atomic_read(&lock->base.count) == 0))
+ return;
+
+ /*
+ * Uh oh, we raced in fastpath, wake up everyone in this case,
+ * so they can see the new lock->ctx.
+ */
+ spin_lock_mutex(&lock->base.wait_lock, flags);
+ list_for_each_entry(cur, &lock->base.wait_list, list) {
+ /* debug_mutex_wake_waiter(&lock->base, cur); */
+ wake_up_process(cur->task);
+ }
+ spin_unlock_mutex(&lock->base.wait_lock, flags);
+}
+
+/**
+ * backport_schedule_preempt_disabled - called with preemption disabled
+ *
+ * Backports c5491ea7. This is not exported so we leave it
+ * here as this is the only current core user on backports.
+ * Although available on >= 3.4 its only for in-kernel code so
+ * we provide our own.
+ *
+ * Returns with preemption disabled. Note: preempt_count must be 1
+ */
+static void __sched backport_schedule_preempt_disabled(void)
+{
+ preempt_enable_no_resched();
+ schedule();
+ preempt_disable();
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static __always_inline int __sched
+__backport_mutex_lock_common(struct mutex *lock, long state,
+ unsigned int subclass,
+ struct lockdep_map *nest_lock, unsigned long ip,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ struct task_struct *task = current;
+ struct mutex_waiter waiter;
+ unsigned long flags;
+ int ret;
+
+ preempt_disable();
+ mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ /*
+ * Optimistic spinning.
+ *
+ * We try to spin for acquisition when we find that there are no
+ * pending waiters and the lock owner is currently running on a
+ * (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
+ * doesn't track the owner atomically in the lock field, we need to
+ * track it non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
+ * to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
+ */
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;
+
+ for (;;) {
+ struct task_struct *owner;
+ struct mspin_node node;
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ struct ww_mutex *ww;
+
+ ww = container_of(lock, struct ww_mutex, base);
+ /*
+ * If ww->ctx is set the contents are undefined, only
+ * by acquiring wait_lock there is a guarantee that
+ * they are not invalid when reading.
+ *
+ * As such, when deadlock detection needs to be
+ * performed the optimistic spinning cannot be done.
+ */
+ if (ACCESS_ONCE(ww->ctx))
+ break;
+ }
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ mspin_lock(MLOCK(lock), &node);
+ owner = ACCESS_ONCE(lock->owner);
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
+ break;
+ }
+
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
+ lock_acquired(&lock->dep_map, ip);
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww;
+ ww = container_of(lock, struct ww_mutex, base);
+
+ ww_mutex_set_context_fastpath(ww, ww_ctx);
+ }
+
+ mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
+ preempt_enable();
+ return 0;
+ }
+ mspin_unlock(MLOCK(lock), &node);
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+ }
+slowpath:
+#endif
+ spin_lock_mutex(&lock->wait_lock, flags);
+
+ /* We don't support DEBUG_MUTEXES on the backport */
+ /* debug_mutex_lock_common(lock, &waiter); */
+ /* debug_mutex_add_waiter(lock, &waiter, task_thread_info(task)); */
+
+ /* add waiting tasks to the end of the waitqueue (FIFO): */
+ list_add_tail(&waiter.list, &lock->wait_list);
+ waiter.task = task;
+
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
+ goto done;
+
+ lock_contended(&lock->dep_map, ip);
+
+ for (;;) {
+ /*
+ * Lets try to take the lock again - this is needed even if
+ * we get here for the first time (shortly after failing to
+ * acquire the lock), to make sure that we get a wakeup once
+ * it's unlocked. Later on, if we sleep, this is the
+ * operation that gives us the lock. We xchg it to -1, so
+ * that when we release the lock, we properly wake up the
+ * other waiters:
+ */
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
+ break;
+
+ /*
+ * got a signal? (This code gets eliminated in the
+ * TASK_UNINTERRUPTIBLE case.)
+ */
+ if (unlikely(signal_pending_state(state, task))) {
+ ret = -EINTR;
+ goto err;
+ }
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ ret = __mutex_lock_check_stamp(lock, ww_ctx);
+ if (ret)
+ goto err;
+ }
+
+ __set_task_state(task, state);
+
+ /* didn't get the lock, go to sleep: */
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ backport_schedule_preempt_disabled();
+ spin_lock_mutex(&lock->wait_lock, flags);
+ }
+
+done:
+ lock_acquired(&lock->dep_map, ip);
+ /* got the lock - rejoice! */
+ mutex_remove_waiter(lock, &waiter, current_thread_info());
+ mutex_set_owner(lock);
+
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww = container_of(lock,
+ struct ww_mutex,
+ base);
+ struct mutex_waiter *cur;
+
+ /*
+ * This branch gets optimized out for the common case,
+ * and is only important for ww_mutex_lock.
+ */
+
+ ww_mutex_lock_acquired(ww, ww_ctx);
+ ww->ctx = ww_ctx;
+
+ /*
+ * Give any possible sleeping processes the chance to wake up,
+ * so they can recheck if they have to back off.
+ */
+ list_for_each_entry(cur, &lock->wait_list, list) {
+ /* debug_mutex_wake_waiter(lock, cur); */
+ wake_up_process(cur->task);
+ }
+ }
+
+ /* set it to 0 if there are no waiters left: */
+ if (likely(list_empty(&lock->wait_list)))
+ atomic_set(&lock->count, 0);
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+
+ /* debug_mutex_free_waiter(&waiter); */
+ preempt_enable();
+
+ return 0;
+
+err:
+ mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ /* debug_mutex_free_waiter(&waiter); */
+ mutex_release(&lock->dep_map, 1, ip);
+ preempt_enable();
+ return ret;
+}
+
+static noinline int __sched
+__ww_mutex_lock_slowpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+static noinline int __sched
+__ww_mutex_lock_interruptible_slowpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+/**
+ * __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ * from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ *
+ * For backporting purposes we can't use the older kernel's
+ * __mutex_fastpath_lock_retval() since upon failure of a fastpath
+ * lock we want to call our a failure routine with more than one argument, in
+ * this case the context for ww mutexes. Refer to commit a41b56ef the
+ * argument increase. It'd be painful to backport all asm code for the
+ * supported architectures so instead lets penalize the backport ww mutex
+ * fastpath lock with the not so efficient generic atomic_dec_return()
+ * implementation.
+ *
+ * Change the count from 1 to a value lower than 1. This function returns 0
+ * if the fastpath succeeds, or -1 otherwise.
+ */
+static inline int
+__backport_mutex_fastpath_lock_retval(atomic_t *count)
+{
+ if (unlikely(atomic_dec_return(count) < 0))
+ return -1;
+ return 0;
+}
+
+int __sched
+__ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock);
+
+int __sched
+__ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
--
1.7.10.4
In Tegra SoC, IOMMU can set some mapping attribute for each page, for
exmaple, READable, and WRITEable. We'd like to use this feature
*newly* for robustness, where read-only pages can be protected from
being overwritten by wrong operation. We have been using IOMMU via DMA
mapping API(ARM). DMA mapping API currently doesn't use "prot"
parameter when calling IOMMU API. So this series tries to pass "struct
dma_attrs *attrs" via "int prot" in IOMMU API. I'm not so sure if this
implementation is right or not because:
- Casting (struct dma_attrs *) to (int) in DMA API doesn't look nice.
- IOMMU layer needs to cast (int) back to (struct dma_attrs *) again,
which can be considered as violation of layers.
If you have any implementations/suggestions, it would be really helpful.
This series isn't applied cleanly but this is posted to request for
comments.
Hiroshi Doyu (3):
common: DMA-mapping: add DMA_ATTR_READ_ONLY attribute
ARM: dma-mapping: Pass DMA attrs as IOMMU prot
iommu/tegra: smmu: Support read-only mapping
arch/arm/mm/dma-mapping.c | 34 +++++++++++++++++++++-------------
drivers/iommu/tegra-smmu.c | 41 +++++++++++++++++++++++++++++------------
include/linux/dma-attrs.h | 1 +
3 files changed, 51 insertions(+), 25 deletions(-)
--
1.8.1.5
In Tegra SoC, IOMMU can set some mapping attribute for each page, for
exmaple, READable, and WRITEable. We'd like to use this feature
*newly* for robustness, where read-only pages can be protected from
being overwritten by wrong operation. We have been using IOMMU via DMA
mapping API(ARM). DMA mapping API currently doesn't use "prot"
parameter when calling IOMMU API. So this series tries to pass "struct
dma_attrs *attrs" via "int prot" in IOMMU API. I'm not so sure if this
implementation is right or not because:
- Casting (struct dma_attrs *) to (int) in DMA API doesn't look nice.
- IOMMU layer needs to cast (int) back to (struct dma_attrs *) again,
which can be considered as violation of layers.
If you have any implementations/suggestions, it would be really helpful.
This series isn't applied cleanly but this is posted to request for
comments.
Hiroshi Doyu (3):
common: DMA-mapping: add DMA_ATTR_READ_ONLY attribute
ARM: dma-mapping: Pass DMA attrs as IOMMU prot
iommu/tegra: smmu: Support read-only mapping
arch/arm/mm/dma-mapping.c | 34 +++++++++++++++++++++-------------
drivers/iommu/tegra-smmu.c | 41 +++++++++++++++++++++++++++++------------
include/linux/dma-attrs.h | 1 +
3 files changed, 51 insertions(+), 25 deletions(-)
--
1.8.1.5
Earlier version of dma-contig allocator in user ptr mode assumed that in
all cases DMA address equals physical address. This was just a special case.
Commit e15dab752d4c588544ccabdbe020a7cc092e23c8 introduced correct support
for converting userpage to dma address, but unfortunately it broke the
support for simple dma address = physical address for the case, when given
physical frame has no struct page associated with it (this happens if one
use for example dma_declare_coherent api or other reserved memory approach).
This commit restores support for such cases.
Signed-off-by: Marek Szyprowski <m.szyprowski(a)samsung.com>
---
drivers/media/v4l2-core/videobuf2-dma-contig.c | 87 ++++++++++++++++++++++--
1 file changed, 82 insertions(+), 5 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf2-dma-contig.c b/drivers/media/v4l2-core/videobuf2-dma-contig.c
index fd56f25..1382749 100644
--- a/drivers/media/v4l2-core/videobuf2-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf2-dma-contig.c
@@ -423,6 +423,39 @@ static inline int vma_is_io(struct vm_area_struct *vma)
return !!(vma->vm_flags & (VM_IO | VM_PFNMAP));
}
+static int vb2_dc_get_user_pfn(unsigned long start, int n_pages,
+ struct vm_area_struct *vma, unsigned long *res)
+{
+ unsigned long pfn, start_pfn, prev_pfn;
+ unsigned int i;
+ int ret;
+
+ if (!vma_is_io(vma))
+ return -EFAULT;
+
+ ret = follow_pfn(vma, start, &pfn);
+ if (ret)
+ return ret;
+
+ start_pfn = pfn;
+ start += PAGE_SIZE;
+
+ for (i = 1; i < n_pages; ++i, start += PAGE_SIZE) {
+ prev_pfn = pfn;
+ ret = follow_pfn(vma, start, &pfn);
+
+ if (ret) {
+ pr_err("no page for address %lu\n", start);
+ return ret;
+ }
+ if (pfn != prev_pfn + 1)
+ return -EINVAL;
+ }
+
+ *res = start_pfn;
+ return 0;
+}
+
static int vb2_dc_get_user_pages(unsigned long start, struct page **pages,
int n_pages, struct vm_area_struct *vma, int write)
{
@@ -433,6 +466,9 @@ static int vb2_dc_get_user_pages(unsigned long start, struct page **pages,
unsigned long pfn;
int ret = follow_pfn(vma, start, &pfn);
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
if (ret) {
pr_err("no page for address %lu\n", start);
return ret;
@@ -468,16 +504,49 @@ static void vb2_dc_put_userptr(void *buf_priv)
struct vb2_dc_buf *buf = buf_priv;
struct sg_table *sgt = buf->dma_sgt;
- dma_unmap_sg(buf->dev, sgt->sgl, sgt->orig_nents, buf->dma_dir);
- if (!vma_is_io(buf->vma))
- vb2_dc_sgt_foreach_page(sgt, vb2_dc_put_dirty_page);
+ if (sgt) {
+ dma_unmap_sg(buf->dev, sgt->sgl, sgt->orig_nents, buf->dma_dir);
+ if (!vma_is_io(buf->vma))
+ vb2_dc_sgt_foreach_page(sgt, vb2_dc_put_dirty_page);
- sg_free_table(sgt);
- kfree(sgt);
+ sg_free_table(sgt);
+ kfree(sgt);
+ }
vb2_put_vma(buf->vma);
kfree(buf);
}
+/*
+ * For some kind of reserved memory there might be no struct page available,
+ * so all that can be done to support such 'pages' is to try to convert
+ * pfn to dma address or at the last resort just assume that
+ * dma address == physical address (like it has been assumed in earlier version
+ * of videobuf2-dma-contig
+ */
+
+#ifdef __arch_pfn_to_dma
+static inline dma_addr_t vb2_dc_pfn_to_dma(struct device *dev, unsigned long pfn)
+{
+ return (dma_addr_t)__arch_pfn_to_dma(dev, pfn);
+}
+#elsif defined(__pfn_to_bus)
+static inline dma_addr_t vb2_dc_pfn_to_dma(struct device *dev, unsigned long pfn)
+{
+ return (dma_addr_t)__pfn_to_bus(pfn);
+}
+#elsif defined(__pfn_to_phys)
+static inline dma_addr_t vb2_dc_pfn_to_dma(struct device *dev, unsigned long pfn)
+{
+ return (dma_addr_t)__pfn_to_phys(pfn);
+}
+#else
+static inline dma_addr_t vb2_dc_pfn_to_dma(struct device *dev, unsigned long pfn)
+{
+ /* really, we cannot do anything better at this point */
+ return (dma_addr_t)(pfn) << PAGE_SHIFT;
+}
+#endif
+
static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
unsigned long size, int write)
{
@@ -548,6 +617,14 @@ static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
/* extract page list from userspace mapping */
ret = vb2_dc_get_user_pages(start, pages, n_pages, vma, write);
if (ret) {
+ unsigned long pfn;
+ if (vb2_dc_get_user_pfn(start, n_pages, vma, &pfn) == 0) {
+ buf->dma_addr = vb2_dc_pfn_to_dma(buf->dev, pfn);
+ buf->size = size;
+ kfree(pages);
+ return buf;
+ }
+
pr_err("failed to get user pages\n");
goto fail_vma;
}
--
1.7.9.5
Hi,
Currently ARM64 supports swiotlb only as below[1]. AFAIU, swiotlb
provides some kind of bounce buffer for some restricted H/Ws, and it
cannot replace IOMMU H/W completely. So I'm wondering that we would
need the IOMMU versions of dma_map_ops as Marek did for ARM32. If my
understanding is correct, do you guys have any idea how it's going to
be implemented? Can we reuse the current version of "iommu_ops" in
"arch/arm/mm/dma-mapping.c" for ARM64 as well? Or do we need to
rewrite 64bit version of iommu_ops completely in the same file as one
with swiotlb, "arch/arm64/mm/dma-mapping.c"?
Any feedback would be really appreciated.
[1]:
commit 09b55412469dfe6797244dc5836c17ed0c2f191b
Author: Catalin Marinas <catalin.marinas(a)arm.com>
Date: Mon Mar 5 11:49:30 2012 +0000
arm64: DMA mapping API
This patch adds support for the DMA mapping API. It uses dma_map_ops for
flexibility and it currently supports swiotlb. This patch could be
simplified further if the DMA accesses are coherent (not mandated by the
architecture) or if corresponding hooks are placed in the generic
swiotlb code to deal with cache maintenance.