Linaro-mm-sig August 2013

linaro-mm-sig@lists.linaro.org

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > Turning to DRM/KMS, it seems the supported formats of a plane can be
>> > queried using drm_mode_get_plane. However, there doesn't seem to be a
>> > way to query the supported formats of a crtc? If display HW only
>> > supports scanning out from a single buffer (like pl111 does), I think
>> > it won't have any planes and a fb can only be set on the crtc. In
>> > which case, how should user-space query which pixel formats that crtc
>> > supports?
>>
>> it is exposed for drm planes. What is missing is to expose the
>> primary-plane associated with the crtc.
>
> Cool - so a patch which adds a way to query what formats a crtc
> supports would be welcome?

well, I kinda think we want something that exposes the "primary plane"
of the crtc.. I'm thinking something roughly like:

---------
diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
index 53db7ce..c7ffca8 100644
--- a/include/uapi/drm/drm_mode.h
+++ b/include/uapi/drm/drm_mode.h
@@ -157,6 +157,12 @@ struct drm_mode_get_plane {
 struct drm_mode_get_plane_res {
 	__u64 plane_id_ptr;
 	__u32 count_planes;
+	/* The primary planes are in matching order to crtc_id_ptr in
+	 * drm_mode_card_res (and same length). For crtc_id[n], its
+	 * primary plane is given by primary_plane_id[n].
+	 */
+	__u32 count_primary_planes;
+	__u64 primary_plane_id_ptr;
 };

 #define DRM_MODE_ENCODER_NONE 0
---------

then use the existing GETPLANE ioctl to query the capabilities

> What about a way to query the stride alignment constraints?
>
> Presumably using the drm_mode_get_property mechanism would be the
> right way to implement that?
>

I suppose you could try that.. typically that is in userspace,
however. It seems like get_property would get messy quickly (ie. is
it a pitch alignment constraint, or stride alignment? What if this is
different for different formats (in particular tiled)? etc)

>
>> > As with v4l2, DRM doesn't appear to have a way to query the stride
>> > constraints? Assuming there is a way to query the stride constraints,
>> > there also isn't a way to specify them when creating a buffer with
>> > DRM, though perhaps the existing pitch parameter of
>> > drm_mode_create_dumb could be used to allow user-space to pass in a
>> > minimum stride as well as receive the allocated stride?
>> >
>>
>> well, you really shouldn't be using create_dumb.. you should have a
>> userspace piece that is specific to the drm driver, and knows how to
>> use that driver's gem allocate ioctl.
>
> Sorry, why does this need a driver-specific allocation function? It's
> just a display controller driver and I just want to allocate a
> scan-out buffer - all I'm asking is for the display controller driver
> to use a minimum stride alignment so I can export the buffer and use
> another device to fill it with data.

Sure.. but userspace has more information readily available to make a
better choice. For example, for omapdrm I'd do things differently
depending on whether I wanted to scan out that buffer (or a portion of
it) rotated. This is something I know in the DDX driver, but not in
the kernel. And it is quite likely something that is driver specific.

Sure, we could add that to a generic "allocate me a buffer" ioctl.
But that doesn't really seem better, and it becomes a problem as soon
as we come across some hw that needs to know something different. In
userspace, you have a lot more flexibility, since you don't really
need to commit to an API for life.

And to bring back the GStreamer argument (since that seems a fitting
example when you start talking about sharing buffers between many
devices, for example camera+codec+display), it would already be
negotiating format between v4l2src + fooencoder + displaysink.. the
pitch/stride is part of that format information. If the display isn't
the one with the strictest requirements, we don't want the display
driver deciding what pitch to use.

> The whole point is to be able to allocate the buffer in such a way
> that another device can access it. So the driver _can't_ use a
> special, device specific format, nor can it allocate it from a
> private memory pool because doing so would preclude it from being
> shared with another device.
>
> That other device doesn't need to be a GPU either, it could just as
> easily be a camera/ISP or video decoder.
>
>
>
>> >> > So presumably you're talking about a GPU driver being the exporter
>> >> > here? If so, how could the GPU driver do these kind of tricks on
>> >> > memory shared with another device?
>> >>
>> >> Yes, that is gpu-as-exporter. If someone else is allocating
>> >> buffers, it is up to them to do these tricks or not. Probably
>> >> there is a pretty good chance that if you aren't a GPU you don't
>> >> need those sort of tricks for fast allocation of transient upload
>> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't
>> >> really think a v4l camera or video decoder would benefit from that
>> >> sort of optimization.
>> >
>> > Right - but none of those are really buffers you'd want to export
>> > with dma_buf to share with another device are they? In which case,
>> > why not just have dma_buf figure out the constraints and allocate
>> > the memory?
>>
>> maybe not.. but (a) you don't necessarily know at creation time if it
>> is going to be exported (maybe you know if it is definitely not going
>> to be exported, but the converse is not true),
>
> I can't actually think of an example where you would not know if a
> buffer was going to be exported or not at allocation time? Do you have
> a case in mind?

yeah, dri2.. when the front buffer is allocated it is just a regular
pixmap. If you swap/flip it becomes the back buffer and now shared ;-)

And pixmaps are allocated w/ enough frequency that it is the sort of
thing you might want to optimize. And even when you know it will be
shared, you don't know with who.

> Regardless, you'd certainly have to know if a buffer will be exported
> pretty quickly, before it's used so that you can import it into
> whatever devices are going to access it. Otherwise if it gets
> allocated before you export it, the allocation won't satisfy the
> constraints of the other devices which will need to access it and
> importing will fail. Assuming of course deferred allocation of the
> backing pages as discussed earlier in the thread.
>
>
>
>> and (b) there isn't
>> really any reason to special case the allocation in the driver because
>> it is going to be exported.
>
> Not sure I follow you here? Surely you absolutely have to special-case
> the allocation if the buffer is to be exported because you have to
> take the other devices' constraints into account when you allocate? Or
> do you mean you don't need to special-case the GEM buffer object
> creation, only the allocation of the backing pages? Though I'm not
> sure how that distinction is useful - at the end of the day, you need
> to special-case allocation of the backing pages.
>

well, you need to consider separately what is (a) in the pages, and
(b) where the pages come from. By moving the allocation into dmabuf
you restrict (b). For sharing buffers, (a) may be restricted, but
there is at least some examples of hardware where (b) would not
otherwise be restricted by sharing.

>
>> helpers that can be used by simple drivers, yes. Forcing the way the
>> buffer is allocated, for sure not. Currently, for example, there is
>> no issue to export a buffer allocated from stolen-mem.
>
> Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk
> of memory reserved at boot for the GPU which the OS can't touch? I
> guess I view such memory as accessible to all media devices on the
> system and as such, needs to be managed by a central allocator which
> dma_buf can use to allocate from.

think carve-out created by bios. In all the cases I am aware of, the
drm driver handles allocation of buffer(s) from the carveout.

> I guess if that stolen-mem is managed by a single device then in
> essence that device becomes the central allocator you have to use to
> be able to allocate from that stolen mem?
>
>
>> > If a driver needs to allocate memory in a special way for a
>> > particular device, I can't really imagine how it would be able
>> > to share that buffer with another device using dma_buf? I guess
>> > a driver is likely to need some magic voodoo to configure access
>> > to the buffer for its device, but surely that would be done by
>> > the dma_mapping framework when dma_buf_map happens?
>> >
>>
>> if, what it has to configure actually manages to fit in the
>> dma-mapping framework
>
> But if it doesn't, surely that's an issue which needs to be addressed
> in the dma_mapping framework or else you won't be able to import
> buffers for use by that device anyway?
>

I'm not sure if we have to fit everything in dma-mapping framework, at
least in cases where you have something that is specific to one
platform. Currently dma-buf provides enough flexibility for other
drivers to be able to import these buffers.

>
>> anyways, where the pages come from has nothing to do with whether a
>> buffer can be shared or not
>
> Sure, but where they are located in physical memory really does
> matter.
>

s/does/can/

it doesn't always matter. And in cases where it does matter, as long
as we can express the restrictions in dma_parms (which we can already
for the case of range-of-memory restrictions) we are covered

>
>> >> At any rate, for both xorg and wayland/gbm, you know when a buffer
>> >> is going to be a scanout buffer. What I'd recommend is define a
>> >> small userspace API that your customers (the SoC vendors) implement
>> >> to allocate a scanout buffer and hand you back a dmabuf fd. That
>> >> could be used both for x11 and for gbm. Inputs should be requested
>> >> width/height and format. And outputs pitch plus dmabuf fd.
>> >>
>> >> (Actually you might even just want to use gbm as your starting
>> >> point. You could probably just use gbm from xf86-video-armsoc for
>> >> allocation, to have one thing that works for both wayland and x11.
>> >> Scanout and cursor buffers should go to vendor/SoC specific fxn,
>> >> rest can be allocated from mali kernel driver.)
>> >
>> > What does that buy us over just using drm_mode_create_dumb on the
>> > display's DRM driver?
>>
>> well, for example, if there was actually some hw w/ omap's dss + mali,
>> you could actually have mali render transparently to tiled buffers
>> which could be scanned out rotated. Which would not be possible w/
>> dumb buffers.
>
> Why not? As you said earlier, the format is defined when you setup the
> fb with drm_mode_fb_cmd2. If you wanted to share the buffer between
> devices, you have to be explicit about what format that buffer is in,
> so you'd have to add an entry to drm_fourcc.h for the tiled format.

no, that doesn't really work in this case, the format to any device
(or userspace) accessing the buffer is not tiled. (Ie. it would look
like normal NV12 or whatever). But there are some different
requirements on stride. And there are cases where you would prefer
not to use tiled buffers, but the kernel doesn't know enough in the
dumb-buffer alloc ioctl to make the correct decision.

> So userspace queries what formats the GPU DRM supports and what
> formats the OMAP DSS DRM supports, selects the tiled format and then
> uses drm_mode_create_dumb to allocate a buffer of the correct size and
> sets the appropriate drm_fourcc.h enum value when creating an fb for
> that buffer. Or have I missed something?
>
>
>
>> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> >> (where phys contig is not required for scanout), but causes CMA
>> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> >> It just knows that it wants to be able to scanout that particular
>> >> >> buffer.
>> >> >
>> >> > I think that's the idea? The omap3's allocator driver would use
>> >> > contiguous memory when it detects the SCANOUT flag whereas the
>> >> > omap4 allocator driver wouldn't have to. No complex negotiation
>> >> > of constraints - it just "knows".
>> >> >
>> >>
>> >> well, it is same allocating driver in both cases (although maybe
>> >> that is unimportant). The "it" that just knows it wants to scanout
>> >> is userspace. The "it" that just knows that scanout translates to
>> >> contiguous (or not) is the kernel. Perhaps we are saying the same
>> >> thing ;-)
>> >
>> > Yeah - I think we are... so what's the issue with having a per-SoC
>> > allocation driver again?
>> >
>>
>> In a way the display driver is a per-SoC allocator. But not
>> necessarily the *central* allocator for everything. Ie. no need for
>> display driver to allocate vertex buffers for a separate gpu driver,
>> and that sort of thing.
>
> Again, I'm only talking about allocating buffers which will be shared
> between different devices. At no point have I mentioned the allocation
> of buffers which aren't to be shared between devices. Sorry if that's
> not been clear.

ok, I guess we were talking about slightly different things ;-)

> So for buffers which are to be shared between devices, you're suggesting
> that the display driver is the per-SoC allocator? But as I say, and
> how this thread got started, the same display driver can be used on
> different SoCs, so having _it_ be the central allocator isn't ideal.
> Though this is our current solution and why we're "abusing" the dumb
> buffer allocation functions. :-)
>

which is why you want to let userspace figure out the pitch and then
tell the display driver what size it wants, rather than using dumb
buffer ioctl ;-)

Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or
property or what have you.. but I think that would be hard to get
right for all cases, and most people don't really care about that
because they already need a gpu/display specific xorg driver and/or
gl/egl talking to their kernel driver. You are in a slightly special
case, since you are providing GL driver independently of the display
driver. But I think that is easier to handle by just telling your
customers "here, fill out this function(s) to allocate buffer for
scanout" (and, well, I guess you'd need one to query for
pitch/stride), rather than trying to cram everything into the kernel.

BR,
-R

>
>
> Cheers,
>
> Tom
>
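
As a rough illustration of the userspace pitch negotiation argued for in this message, the sketch below picks a pitch that satisfies the alignment of every device sharing a buffer, so the strictest device wins rather than the display driver. The names negotiate_pitch, lcm_u32 and gcd_u32, and the idea of passing alignments as a plain array, are assumptions made for this example; none of this is an existing DRM, v4l2 or GStreamer interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor (Euclid's algorithm). */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Least common multiple; both arguments must be non-zero. */
static uint32_t lcm_u32(uint32_t a, uint32_t b)
{
	return a / gcd_u32(a, b) * b;
}

/*
 * Hypothetical helper: return the smallest pitch (in bytes) that holds
 * `width` pixels of `bytes_per_pixel` bytes each and is a multiple of
 * every per-device alignment listed in `align[]`.
 */
static uint32_t negotiate_pitch(uint32_t width, uint32_t bytes_per_pixel,
				const uint32_t *align, size_t n)
{
	uint32_t common = 1;
	uint32_t min_pitch = width * bytes_per_pixel;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(align[i] != 0);
		common = lcm_u32(common, align[i]);
	}

	/* Round the minimum pitch up to the next multiple of `common`. */
	return (min_pitch + common - 1) / common * common;
}
```

With a display that wants 8-byte alignment and a GPU that prefers 64 bytes, a 1366-pixel-wide 32bpp buffer needs at least 5464 bytes per line and ends up with a 5504-byte pitch. Userspace would then pass that pitch (and the matching size) to whichever driver does the allocation.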

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the device sit behind?
>

perhaps yes. But the device is associated w/ the iommu it is attached
to, so this shouldn't be a problem

>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be
> device-specific or opaque, it has to be explicit. I think drm_fourcc.h
> is what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?

it is exposed for drm planes. What is missing is to expose the
primary-plane associated with the crtc.

> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?

Right, the gem object has no inherent format, it is just some bytes.
The format/width/height/pitch are all attributes of the fb.

> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>

well, you really shouldn't be using create_dumb.. you should have a
userspace piece that is specific to the drm driver, and knows how to
use that driver's gem allocate ioctl.

>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine all of the constraints (assuming that
>> >> > can be done) and allocate pages such that the
>> >> > lowest-common-denominator is satisfied. Perhaps rather than
>> >> > duplicating that logic in every driver, we could instead move
>> >> > allocation of the backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?

maybe not.. but (a) you don't necessarily know at creation time if it
is going to be exported (maybe you know if it is definitely not going
to be exported, but the converse is not true), and (b) there isn't
really any reason to special case the allocation in the driver because
it is going to be exported.

helpers that can be used by simple drivers, yes. Forcing the way the
buffer is allocated, for sure not. Currently, for example, there is
no issue to export a buffer allocated from stolen-mem. If we put the
page allocation in dma-buf, this would not be possible. That is just
one quick example off the top of my head, I'm sure there are plenty
more. But we definitely do not want the allocate in dma_buf itself.

> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf? I guess a driver is likely
> to need some magic voodoo to configure access to the buffer for its
> device, but surely that would be done by the dma_mapping framework
> when dma_buf_map happens?
>

if, what it has to configure actually manages to fit in the
dma-mapping framework

anyways, where the pages come from has nothing to do with whether a
buffer can be shared or not

>
>
>> >> You probably want to get out of the SoC mindset, otherwise you are
>> >> going to make bad assumptions that come back to bite you later on.
>> >
>> > Sure - there are always going to be PC-like devices where the
>> > hardware configuration isn't fixed like it is on a traditional SoC.
>> > But I'd rather have a simple solution which works on traditional SoCs
>> > than no solution at all. Today our solution is to over-load the dumb
>> > buffer alloc functions of the display's DRM driver - For now I'm just
>> > looking for the next step up from that! ;-)
>>
>> True.. the original intention, which is perhaps a bit desktop-centric,
>> really was for there to be a userspace component talking to the drm
>> driver for allocation, ie. xf86-video-foo and/or
>> src/gallium/drivers/foo (for example) ;-)
>>
>> Which means for x11 having a SoC vendor specific xf86-video-foo for
>> x11.. or vendor specific gbm implementation for wayland. (Although
>> at least in the latter case it is a pretty small piece of code.) But
>> that is probably what you are trying to avoid.
>
> I've been trying to get my head around how PRIME relates to DDX
> drivers. As I understand it (which is likely wrong), you have a laptop
> with both an Intel & an nVidia GPU. You have both the i915 & nouveau
> kernel drivers loaded. What I'm not sure about is which GPU's display
> controller is actually hooked up to the physical connector? Perhaps
> there is a MUX like there is on Versatile Express?

afaiu it can be a, b, or c (ie. either gpu can have the display or
there can be a mux)..

> What I also don't understand is what DDX driver is loaded? Is it
> xf86-video-intel, xf86-video-nouveau or both? I get the impression
> that there's a "master" DDX which implements 2D operations but can
> import buffers using PRIME from the other driver and draw to them.
> Or is it more that it's able to export rendered buffers to the
> "slave" DRM for scanout? Either way, it's pretty similar to an ARM
> SoC setup which has the GPU and the display as two totally
> independent devices.
>
>
>
>> At any rate, for both xorg and wayland/gbm, you know when a buffer is
>> going to be a scanout buffer. What I'd recommend is define a small
>> userspace API that your customers (the SoC vendors) implement to
>> allocate a scanout buffer and hand you back a dmabuf fd. That could
>> be used both for x11 and for gbm. Inputs should be requested
>> width/height and format. And outputs pitch plus dmabuf fd.
>>
>> (Actually you might even just want to use gbm as your starting point.
>> You could probably just use gbm from xf86-video-armsoc for allocation,
>> to have one thing that works for both wayland and x11. Scanout and
>> cursor buffers should go to vendor/SoC specific fxn, rest can be
>> allocated from mali kernel driver.)
>
> What does that buy us over just using drm_mode_create_dumb on the
> display's DRM driver?
>

well, for example, if there was actually some hw w/ omap's dss + mali,
you could actually have mali render transparently to tiled buffers
which could be scanned out rotated. Which would not be possible w/
dumb buffers.

>
>
>> >> > wouldn't need a way to programmatically describe the constraints
>> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would
>> >> > just "know" that on this SoC, that buffer needs to be physically
>> >> > contiguous for example.
>> >>
>> >> not really.. it just knows it wants to scanout the buffer, and tells
>> >> this as a hint to the kernel.
>> >>
>> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> (where phys contig is not required for scanout), but causes CMA
>> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> It just knows that it wants to be able to scanout that particular
>> >> buffer.
>> >
>> > I think that's the idea? The omap3's allocator driver would use
>> > contiguous memory when it detects the SCANOUT flag whereas the omap4
>> > allocator driver wouldn't have to. No complex negotiation of
>> > constraints - it just "knows".
>> >
>>
>> well, it is same allocating driver in both cases (although maybe that
>> is unimportant). The "it" that just knows it wants to scanout is
>> userspace. The "it" that just knows that scanout translates to
>> contiguous (or not) is the kernel. Perhaps we are saying the same
>> thing ;-)
>
> Yeah - I think we are... so what's the issue with having a per-SoC
> allocation driver again?
>

In a way the display driver is a per-SoC allocator. But not
necessarily the *central* allocator for everything. Ie. no need for
display driver to allocate vertex buffers for a separate gpu driver,
and that sort of thing.

BR,
-R

>
>
> Cheers,
>
> Tom
>
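
The split described in this message (userspace only says "I want to scan this out", the kernel decides what that means for page placement) can be sketched as below. This is only an illustration of the idea: BO_HINT_SCANOUT, struct soc_caps, enum backing and choose_backing are hypothetical names invented for the example and are not the omapdrm interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical usage hint passed by userspace at allocation time. */
#define BO_HINT_SCANOUT (1u << 0)

/* Hypothetical description of what one SoC's display hardware needs. */
struct soc_caps {
	/* true for omap3-like hw, false where scanout can go through a remapper */
	bool scanout_needs_contig;
};

/* Where the kernel side would take the backing pages from. */
enum backing {
	BACKING_SHMEM_PAGES,	/* ordinary, possibly scattered pages */
	BACKING_CONTIGUOUS	/* physically contiguous, e.g. via CMA */
};

/*
 * Kernel-side decision: the same hint from userspace maps to different
 * placement depending on the hardware. Userspace never sees the difference.
 */
static enum backing choose_backing(uint32_t hints, const struct soc_caps *caps)
{
	assert(caps != NULL);

	if ((hints & BO_HINT_SCANOUT) && caps->scanout_needs_contig)
		return BACKING_CONTIGUOUS;

	return BACKING_SHMEM_PAGES;
}
```

The point of keeping the decision behind a hint is that the same userspace code works on both kinds of hardware, and a buffer without the hint never pays for contiguous memory.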

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Alex Deucher

On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the device sit behind?
>
>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be
> device-specific or opaque, it has to be explicit. I think drm_fourcc.h
> is what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?
>
> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?
>
> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>
>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine all of the constraints (assuming that
>> >> > can be done) and allocate pages such that the
>> >> > lowest-common-denominator is satisfied. Perhaps rather than
>> >> > duplicating that logic in every driver, we could instead move
>> >> > allocation of the backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?
>
> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf?
I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? > > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. 
> Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > In the early days, there was a MUX to switch the displays between the two GPUs, but most modern systems are MUX-less and the dGPU is either connected to no displays or in some cases the local panel is attached to the integrated GPU and the external displays are connected to the dGPU. In the MUX-less case, the dGPU can be used to render, and then the result is copied to the integrated GPU for display. Alex
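The stride-alignment negotiation raised earlier in this message can be sketched in C. This is only an illustrative model, not an existing DRM or v4l2 interface: it assumes each device can report a stride alignment in bytes, and the helper names `common_stride_align` and `min_pitch` are hypothetical. The result is the kind of minimum pitch user-space might pass in when allocating a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor, used to build the least common multiple. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Hypothetical helper: combine the stride (pitch) alignment of every device
 * that will touch the buffer into one alignment that satisfies them all.
 * Alignments are in bytes and must be non-zero.
 */
static uint32_t common_stride_align(const uint32_t *aligns, size_t n)
{
    uint32_t result = 1;

    for (size_t i = 0; i < n; i++) {
        assert(aligns[i] != 0);
        /* lcm(result, aligns[i]) */
        result = result / gcd_u32(result, aligns[i]) * aligns[i];
    }
    return result;
}

/* Smallest pitch in bytes that holds one row and honours the alignment. */
static uint32_t min_pitch(uint32_t width, uint32_t bpp, uint32_t align)
{
    uint32_t row_bytes = (uint32_t)(((uint64_t)width * bpp + 7) / 8);

    assert(align != 0);
    return (row_bytes + align - 1) / align * align;
}
```

For example, a display that needs 8-byte alignment and a GPU that prefers 64-byte alignment combine to 64 bytes, so a 1366-pixel-wide 32 bpp buffer would get a pitch of 5504 bytes instead of 5464.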

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote: > >> >> ... This is the purpose of the attach step, >> >> so you know all the devices involved in sharing up front before >> >> allocating the backing pages. (Or in the worst case, if you have a >> >> "late attacher" you at least know when no device is doing dma access >> >> to a buffer and can reallocate and move the buffer.) A long time >> >> back, I had a patch that added a field or two to 'struct >> >> device_dma_parameters' so that it could be known if a device >> >> required contiguous buffers.. looks like that never got merged, so >> >> I'd need to dig that back up and resend it. But the idea was to >> >> have the 'struct device' encapsulate all the information that would >> >> be needed to do-the-right-thing when it comes to placement. >> > >> > As I understand it, it's up to the exporting device to allocate the >> > memory backing the dma_buf buffer. I guess the latest possible point >> > you can allocate the backing pages is when map_dma_buf is first >> > called? At that point the exporter can iterate over the current set >> > of attachments, programmatically determine the all the constraints of >> > all the attached drivers and attempt to allocate the backing pages >> > in such a way as to satisfy all those constraints? >> >> yes, this is the idea.. possibly some room for some helpers to help >> out with this, but that is all under the hood from userspace >> perspective >> >> > Didn't you say that programmatically describing device placement >> > constraints was an unbounded problem? I guess we would have to >> > accept that it's not possible to describe all possible constraints >> > and instead find a way to describe the common ones? >> >> well, the point I'm trying to make, is by dividing your constraints >> into two groups, one that impacts and is handled by userspace, and one >> that is in the kernel (ie. 
where the pages go), you cut down the >> number of permutations that the kernel has to care about considerably. >> And kernel already cares about, for example, what range of addresses >> that a device can dma to/from. I think really the only thing missing >> is the max # of sglist entries (contiguous or not) > > I think it's more than physically contiguous or not. > > For example, it can be more efficient to use large page sizes on > devices with IOMMUs to reduce TLB traffic. I think the size and even > the availability of large pages varies between different IOMMUs. sure.. but I suppose if we can spiff out dma_params to express "I need contiguous", perhaps we can add some way to express "I prefer as-contiguous-as-possible".. either way, this is about where the pages are placed, and not about the layout of pixels within the page, so should be in kernel. It's something that is missing, but I believe that it belongs in dma_params and hidden behind dma_alloc_*() for simple drivers. > There's also the issue of buffer stride alignment. As I say, if the > buffer is to be written by a tile-based GPU like Mali, it's more > efficient if the buffer's stride is aligned to the max AXI bus burst > length. Though I guess a buffer stride only makes sense as a concept > when interpreting the data as a linear-layout 2D image, so perhaps > belongs in user-space along with format negotiation? > Yeah.. this isn't about where the pages go, but about the arrangement within a page. And, well, except for hw that supports the same tiling (or compressed-fb) in display+gpu, you probably aren't sharing tiled buffers. > >> > One problem with this is it duplicates a lot of logic in each >> > driver which can export a dma_buf buffer. Each exporter will need to >> > do pretty much the same thing: iterate over all the attachments, >> > determine of all the constraints (assuming that can be done) and >> > allocate pages such that the lowest-common-denominator is satisfied. 
>> > >> > Perhaps rather than duplicating that logic in every driver, we could >> > Instead move allocation of the backing pages into dma_buf itself? >> > >> >> I tend to think it is better to add helpers as we see common patterns >> emerge, which drivers can opt-in to using. I don't think that we >> should move allocation into dma_buf itself, but it would perhaps be >> useful to have dma_alloc_*() variants that could allocate for multiple >> devices. > > A helper could work I guess, though I quite like the idea of having > dma_alloc_*() variants which take a list of devices to allocate memory > for. > > >> That would help for simple stuff, although I'd suspect >> eventually a GPU driver will move away from that. (Since you probably >> want to play tricks w/ pools of pages that are pre-zero'd and in the >> correct cache state, use spare cycles on the gpu or dma engine to >> pre-zero uncached pages, and games like that.) > > So presumably you're talking about a GPU driver being the exporter > here? If so, how could the GPU driver do these kind of tricks on > memory shared with another device? > Yes, that is gpu-as-exporter. If someone else is allocating buffers, it is up to them to do these tricks or not. Probably there is a pretty good chance that if you aren't a GPU you don't need those sort of tricks for fast allocation of transient upload buffers, staging textures, temporary pixmaps, etc. Ie. I don't really think a v4l camera or video decoder would benefit from that sort of optimization. > >> >> > Anyway, assuming user-space can figure out how a buffer should be >> >> > stored in memory, how does it indicate this to a kernel driver and >> >> > actually allocate it? Which ioctl on which device does user-space >> >> > call, with what parameters? Are you suggesting using something >> >> > like ION which exposes the low-level details of how buffers are >> > >> laid out in physical memory to userspace? If not, what? 
>> >> > >> >> >> >> no, userspace should not need to know this. And having a central >> >> driver that knows this for all the other drivers in the system >> >> doesn't really solve anything and isn't really scalable. At best >> >> you might want, in some cases, a flag you can pass when allocating. >> >> For example, some of the drivers have a 'SCANOUT' flag that can be >> >> passed when allocating a GEM buffer, as a hint to the kernel that >> >> 'if this hw requires contig memory for scanout, allocate this >> >> buffer contig'. But really, when it comes to sharing buffers >> >> between devices, we want this sort of information in dev->dma_params >> >> of the importing device(s). >> > >> > If you had a single driver which knew the constraints of all devices >> > on that particular SoC and the interface allowed user-space to >> > specify which devices a buffer is intended to be used with, I guess >> > it could pretty trivially allocate pages which satisfy those > constraints? >> >> keep in mind, even a number of SoC's come with pcie these days. You >> already have things like >> >> https://developer.nvidia.com/content/kayla-platform >> >> You probably want to get out of the SoC mindset, otherwise you are >> going to make bad assumptions that come back to bite you later on. > > Sure - there are always going to be PC-like devices where the > hardware configuration isn't fixed like it is on a traditional SoC. > But I'd rather have a simple solution which works on traditional SoCs > than no solution at all. Today our solution is to over-load the dumb > buffer alloc functions of the display's DRM driver - For now I'm just > looking for the next step up from that! ;-) > True.. the original intention, which is perhaps a bit desktop-centric, really was for there to be a userspace component talking to the drm driver for allocation, ie. xf86-video-foo and/or src/gallium/drivers/foo (for example) ;-) Which means for x11 having a SoC vendor specific xf86-video-foo for x11.. 
or vendor specific gbm implementation for wayland. (Although at least in the latter case it is a pretty small piece of code.) But that is probably what you are trying to avoid. At any rate, for both xorg and wayland/gbm, you know when a buffer is going to be a scanout buffer. What I'd recommend is define a small userspace API that your customers (the SoC vendors) implement to allocate a scanout buffer and hand you back a dmabuf fd. That could be used both for x11 and for gbm. Inputs should be requested width/height and format. And outputs pitch plus dmabuf fd. (Actually you might even just want to use gbm as your starting point. You could probably just use gbm from xf86-video-armsoc for allocation, to have one thing that works for both wayland and x11. Scanout and cursor buffers should go to vendor/SoC specific fxn, rest can be allocated from mali kernel driver.) > >> > wouldn't need a way to programmatically describe the constraints >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> > just "know" that on this SoC, that buffer needs to be physically >> > contiguous for example. >> >> not really.. it just knows it wants to scanout the buffer, and tells >> this as a hint to the kernel. >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> (where phys contig is not required for scanout), but causes CMA >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just >> knows that it wants to be able to scanout that particular buffer. > > I think that's the idea? The omap3's allocator driver would use > contiguous memory when it detects the SCANOUT flag whereas the omap4 > allocator driver wouldn't have to. No complex negotiation of > constraints - it just "knows". > well, it is same allocating driver in both cases (although maybe that is unimportant). The "it" that just knows it wants to scanout is userspace. The "it" that just knows that scanout translates to contiguous (or not) is the kernel. 
Perhaps we are saying the same thing ;-) BR, -R > > Cheers, > > Tom > > > >
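The split described above (user-space only states intent, the kernel turns that intent into a placement decision) can be sketched as a small C model. The flag name `BO_FLAG_SCANOUT`, the `display_caps` struct and the function name are assumptions made for illustration; they are not the omapdrm definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocation flag: "I intend to scan this buffer out". */
#define BO_FLAG_SCANOUT (1u << 0)

/* Hypothetical description of what the display hardware requires. */
struct display_caps {
    bool scanout_needs_contig; /* true on omap3-like hw, false on omap4+-like hw */
};

/*
 * Kernel-side decision: user-space only states intent through flags; the
 * driver maps that intent onto a placement policy for its own hardware.
 */
static bool alloc_wants_contiguous(const struct display_caps *caps,
                                   uint32_t flags)
{
    return (flags & BO_FLAG_SCANOUT) && caps->scanout_needs_contig;
}
```

The same user-space call therefore yields a contiguous allocation on hardware that needs it and an ordinary one elsewhere, without user-space knowing which case applies.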

12 years, 2 months

[PATCH v4 0/4] Device Tree support for CMA (Contiguous Memory Allocator)

by Marek Szyprowski

Hello,

This is the fourth version of my proposal for device tree integration
for reserved memory and the Contiguous Memory Allocator. After the
comments from Grant Likely I moved the memory region definitions back
to the /memory node (as it was in the first version of this proposal).
I've also extended the code and made it more generic, and added support
for so-called reserved dma memory (special dma memory regions created
by dma_alloc_coherent() function, for exclusive usage for dma
allocation for the given device).

Just a few words for those who see this code for the first time:

The proposed bindings allow defining contiguous memory regions with a
specified base address and size. The defined regions can then be
assigned to the given device(s) by adding a property with a phandle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all the memory allocations
from the dma-mapping subsystem will be served from the defined
contiguous memory regions.

Contiguous Memory Allocator is a framework which makes it possible to
provide large contiguous memory buffers for (usually multimedia)
devices. The contiguous memory is reserved during early boot and then
shared with the kernel, which is allowed to allocate it for movable
pages. Then, when a device driver requests a contiguous buffer, the
framework migrates movable pages out of the contiguous region and gives
it to the driver. When the device driver frees the buffer, it is added
to the kernel memory pool again. For more information, please refer to
commit c64be2bb1c6eb43c838b2c6d57 ("drivers: add Contiguous Memory
Allocator") and d484864dd96e1830e76895 (CMA merge commit).

Why do we need device tree bindings for CMA at all?

Older ARM kernels used so-called board-based initialization. Those
board files contained a definition of all hardware blocks available on
the target system and the particular kernel and driver software
configuration selected by the board maintainer.

In the new approach the board files will be removed completely and the
Device Tree is used to describe all hardware blocks available on the
target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with operating systems other than the Linux kernel.

Reserved memory configuration belongs to the grey area. It might depend
on hardware restrictions of the board or modules and on low-level
configuration done by the bootloader. However, putting reserved and
contiguous memory regions in the /memory node and having phandles to
those regions in the device nodes matches well with the typical
device-tree style of linking devices with other resources like clocks,
interrupts, regulators, power domains, etc. This is the main reason to
use such an approach instead of putting everything in the /chosen node
as was proposed in v2 and v3.

Best regards
Marek Szyprowski
Samsung R&D Institute Poland

Changelog:

v4:
- moved back contiguous-memory bindings from /chosen/contiguous-memory
  to /memory nodes as suggested by Grant (see
  http://article.gmane.org/gmane.linux.drivers.devicetree/41030 for
  more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree

v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed out by Laura and updated documentation

v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to
  /chosen/contiguous-memory/ node to avoid spreading Linux specific
  parameters over the whole device tree definitions
- added support for autoconfigured regions (use zero base)
- fixed minor bugs

v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal

Patch summary:

Marek Szyprowski (4):
  drivers: dma-contiguous: clean source code and prepare for device tree
  drivers: of: add function to scan fdt nodes given by path
  drivers: of: add initialization code for dma reserved memory
  ARM: init: add support for reserved memory defined by device tree

 Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++++
 arch/arm/mm/init.c                           |   3 +
 drivers/base/dma-contiguous.c                | 147 +++++++++++-----------
 drivers/of/Kconfig                           |   6 +
 drivers/of/Makefile                          |   1 +
 drivers/of/fdt.c                             |  76 +++++++++++
 drivers/of/of_reserved_mem.c                 | 175 ++++++++++++++++++++++++++
 include/asm-generic/dma-coherent.h           |   6 +
 include/asm-generic/dma-contiguous.h         |   2 -
 include/linux/dma-contiguous.h               |  49 +++++++-
 include/linux/of_fdt.h                       |   3 +
 11 files changed, 541 insertions(+), 79 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/memory.txt
 create mode 100644 drivers/of/of_reserved_mem.c

--
1.7.9.5
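As a rough illustration of the linking scheme in the cover letter above (regions defined once, devices pointing at them by phandle), here is a small C model. It is not the code from the patches: the struct names, the `lookup_region` helper and the convention that a phandle of 0 means "no dedicated region" are all assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one contiguous region from the device tree. */
struct mem_region {
    uint32_t phandle; /* identifier that device nodes refer to */
    uint64_t base;    /* physical base address */
    uint64_t size;    /* length in bytes */
};

/* Hypothetical device node: it only stores a phandle to its region. */
struct device_node_model {
    const char *name;
    uint32_t region_phandle; /* 0 means "no dedicated region" in this model */
};

/* Resolve a device's phandle to its region; NULL means use the default pool. */
static const struct mem_region *
lookup_region(const struct mem_region *regions, size_t n,
              const struct device_node_model *dev)
{
    if (dev->region_phandle == 0)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        if (regions[i].phandle == dev->region_phandle)
            return &regions[i];
    }
    return NULL;
}
```

The point of the model is that the device side carries no addresses at all; changing where a region lives only touches the region definition.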

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz(a)linaro.org> wrote:
> On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark(a)gmail.com> wrote:
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace. This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>
> One concern I know the Android folks have expressed previously (and
> correct me if it's no longer an objection), is that this attach time
> in-kernel constraint solving / moving or reallocating buffers is
> likely to hurt determinism. If I understood, their perspective was
> that userland knows the device path the buffers will travel through,
> so why not leverage that knowledge, rather than having the kernel have
> to sort it out for itself after the fact.

If you know the device path, then attach the buffer at all the devices
before you start using it. Problem solved.. kernel knows all devices
before pages need be allocated ;-)

BR,
-R
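The "attach everywhere first, allocate on first map" ordering from the reply above can be modelled in a few lines of C. This is a user-space toy, not the dma_buf API: `shared_buf`, `buf_attach`, `buf_map` and `buf_needs_move` are hypothetical names, and the only constraint modelled is whether a device needs contiguous memory.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ATTACH 8

/* Hypothetical per-device placement requirement. */
struct dev_model {
    const char *name;
    bool needs_contiguous;
};

/* Hypothetical shared buffer: pages are not placed until the first map. */
struct shared_buf {
    const struct dev_model *attached[MAX_ATTACH];
    size_t n_attached;
    bool allocated;
    bool contiguous; /* placement chosen at allocation time */
};

/* Record a device; returns false if the attachment table is full. */
static bool buf_attach(struct shared_buf *buf, const struct dev_model *dev)
{
    if (buf->n_attached == MAX_ATTACH)
        return false;
    buf->attached[buf->n_attached++] = dev;
    return true;
}

/* First map: pick a placement that suits every device attached so far. */
static void buf_map(struct shared_buf *buf)
{
    if (buf->allocated)
        return;

    buf->contiguous = false;
    for (size_t i = 0; i < buf->n_attached; i++)
        buf->contiguous = buf->contiguous ||
                          buf->attached[i]->needs_contiguous;
    buf->allocated = true;
}

/* A late attacher may find the existing placement unsuitable. */
static bool buf_needs_move(const struct shared_buf *buf,
                           const struct dev_model *dev)
{
    return buf->allocated && dev->needs_contiguous && !buf->contiguous;
}
```

If both the GPU and the display are attached before the first map, the buffer comes out contiguous straight away; if the display attaches only after the GPU has mapped it, the model reports that a move would be needed, which is the "late attacher" case.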

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote: > Hi Rob, > >> >> > We may also then have additional constraints when sharing buffers >> >> > between the display HW and video decode or even camera ISP HW. >> >> > Programmatically describing buffer allocation constraints is very >> >> > difficult and I'm not sure you can actually do it - there's some >> >> > pretty complex constraints out there! E.g. I believe there's a >> >> > platform where Y and UV planes of the reference frame need to be >> >> > in separate DRAM banks for real-time 1080p decode, or something >> >> > like that? >> >> >> >> yes, this was discussed. This is different from pitch/format/size >> >> constraints.. it is really just a placement constraint (ie. where >> >> do the physical pages go). IIRC the conclusion was to use a dummy >> >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> >> >> > Anyway, I guess my point is that even if we solve how to allocate >> >> > buffers which will be shared between the GPU and display HW such >> >> > that both sets of constraints are satisfied, that may not be the >> >> > end of the story. >> >> > >> >> >> >> that was part of the reason to punt this problem to userspace ;-) >> >> >> >> In practice, the kernel drivers doesn't usually know too much about >> >> the dimensions/format/etc.. that is really userspace level >> >> knowledge. There are a few exceptions when the kernel needs to know >> >> how to setup GTT/etc for tiled buffers, but normally this sort of >> >> information is up at the next level up (userspace, and >> >> drm_framebuffer in case of scanout). Userspace media frameworks >> >> like GStreamer already have a concept of format/caps negotiation. >> >> For non-display<->gpu sharing, I think this is probably where this >> >> sort of constraint negotiation should be handled. 
>> > >> > I agree that user-space will know which devices will access the >> > buffer and thus can figure out at least a common pixel format. >> > Though I'm not so sure userspace can figure out more low-level >> > details like alignment and placement in physical memory, etc. >> > >> >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. > > Absolutely. Pixel format has to be negotiated by user-space as in > most cases, user-space can map the buffer and thus will need to > know how to interpret the data. > > > >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. > > Seems sensible to me. > > >> ... This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time >> back, I had a patch that added a field or two to 'struct >> device_dma_parameters' so that it could be known if a device required >> contiguous buffers.. looks like that never got merged, so I'd need to >> dig that back up and resend it. But the idea was to have the 'struct >> device' encapsulate all the information that would be needed to >> do-the-right-thing when it comes to placement. > > As I understand it, it's up to the exporting device to allocate the > memory backing the dma_buf buffer. I guess the latest possible point > you can allocate the backing pages is when map_dma_buf is first > called? 
At that point the exporter can iterate over the current set > of attachments, programmatically determine the all the constraints of > all the attached drivers and attempt to allocate the backing pages > in such a way as to satisfy all those constraints? yes, this is the idea.. possibly some room for some helpers to help out with this, but that is all under the hood from userspace perspective > Didn't you say that programmatically describing device placement > constraints was an unbounded problem? I guess we would have to > accept that it's not possible to describe all possible constraints > and instead find a way to describe the common ones? well, the point I'm trying to make, is by dividing your constraints into two groups, one that impacts and is handled by userspace, and one that is in the kernel (ie. where the pages go), you cut down the number of permutations that the kernel has to care about considerably. And kernel already cares about, for example, what range of addresses that a device can dma to/from. I think really the only thing missing is the max # of sglist entries (contiguous or not) > One problem with this is it duplicates a lot of logic in each > driver which can export a dma_buf buffer. Each exporter will need to > do pretty much the same thing: iterate over all the attachments, > determine of all the constraints (assuming that can be done) and > allocate pages such that the lowest-common-denominator is satisfied. > > Perhaps rather than duplicating that logic in every driver, we could > Instead move allocation of the backing pages into dma_buf itself? > I tend to think it is better to add helpers as we see common patterns emerge, which drivers can opt-in to using. I don't think that we should move allocation into dma_buf itself, but it would perhaps be useful to have dma_alloc_*() variants that could allocate for multiple devices. That would help for simple stuff, although I'd suspect eventually a GPU driver will move away from that. 
(Since you probably want to play tricks w/ pools of pages that are pre-zero'd and in the correct cache state, use spare cycles on the gpu or dma engine to pre-zero uncached pages, and games like that.) > >> > Anyway, assuming user-space can figure out how a buffer should be >> > stored in memory, how does it indicate this to a kernel driver and >> > actually allocate it? Which ioctl on which device does user-space >> > call, with what parameters? Are you suggesting using something like >> > ION which exposes the low-level details of how buffers are laid out in >> > physical memory to userspace? If not, what? >> >> no, userspace should not need to know this. And having a central >> driver that knows this for all the other drivers in the system doesn't >> really solve anything and isn't really scalable. At best you might >> want, in some cases, a flag you can pass when allocating. For >> example, some of the drivers have a 'SCANOUT' flag that can be passed >> when allocating a GEM buffer, as a hint to the kernel that 'if this hw >> requires contig memory for scanout, allocate this buffer contig'. But >> really, when it comes to sharing buffers between devices, we want this >> sort of information in dev->dma_params of the importing device(s). > > If you had a single driver which knew the constraints of all devices > on that particular SoC and the interface allowed user-space to specify > which devices a buffer is intended to be used with, I guess it could > pretty trivially allocate pages which satisfy those constraints? It
keep in mind, even a number of SoC's come with pcie these days. You already have things like https://developer.nvidia.com/content/kayla-platform You probably want to get out of the SoC mindset, otherwise you are going to make bad assumptions that come back to bite you later on.
> wouldn't need a way to programmatically describe the constraints > either: As you say, if userspace sets the "SCANOUT" flag, it would > just "know" that on this SoC, that buffer needs to be physically > contiguous for example. not really.. it just knows it wants to scanout the buffer, and tells this as a hint to the kernel. For example, on omapdrm, the SCANOUT flag does nothing on omap4+ (where phys contig is not required for scanout), but causes CMA (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just knows that it wants to be able to scanout that particular buffer. > Though It would effectively mean you'd need an "allocation" driver per > SoC, which as you say may not be scalable? Right.. and not actually even possible in the general sense (see SoC + external pcie gfx card) BR, -R > > > Cheers, > > Tom > > > > >
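The idea of a dma_alloc_*() variant that allocates for several devices at once comes down to folding per-device placement constraints into the tightest common set. The following C sketch shows that folding step only. The `placement` struct and `merge_placement` are hypothetical; the fields were chosen to mirror the constraints named in the message (reachable address range and maximum number of sglist entries) and are not fields of any existing kernel structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placement constraints for one device. */
struct placement {
    uint64_t dma_mask;     /* highest address the device can reach */
    uint32_t max_segments; /* max scatterlist entries; 1 means contiguous */
};

/* Fold the constraints of all devices into the tightest common set. */
static struct placement merge_placement(const struct placement *devs, size_t n)
{
    struct placement out = { UINT64_MAX, UINT32_MAX };

    for (size_t i = 0; i < n; i++) {
        if (devs[i].dma_mask < out.dma_mask)
            out.dma_mask = devs[i].dma_mask;
        if (devs[i].max_segments < out.max_segments)
            out.max_segments = devs[i].max_segments;
    }
    return out;
}
```

A display limited to 32-bit addresses and a single segment, combined with a GPU behind an IOMMU, yields "contiguous, below 4 GiB", which is the lowest common denominator the thread talks about. Anything about pixel layout stays out of this structure, in line with the kernel/user-space split argued for above.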

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Lucas Stach

On Tuesday, 06.08.2013 at 12:31 +0100, Tom Cooksey wrote: > Hi Rob, > > +lkml > > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com> > > >> wrote: > > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > > >> >> > also allocate buffers for the GPU. Still not sure how to > > >> >> > resolve this as we don't use DRM for our GPU driver. > > >> >> > > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > > >> >> (esp. once the dma-fence stuff is in place), we'd have > > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > > >> >> specific drm/kms driver, using prime/dmabuf to share between > > >> >> the two. > > >> > > > >> > The "extra" buffers we were allocating from armsoc DDX were really > > >> > being allocated through DRM/GEM so we could get an flink name > > >> > for them and pass a reference to them back to our GPU driver on > > >> > the client side. If it weren't for our need to access those > > >> > extra off-screen buffers with the GPU we wouldn't need to > > >> > allocate them with DRM at all. So, given they are really "GPU" > > >> > buffers, it does absolutely make sense to allocate them in a > > >> > different driver to the display driver. > > >> > > > >> > However, to avoid unnecessary memcpys & related cache > > >> > maintenance ops, we'd also like the GPU to render into buffers > > >> > which are scanned out by the display controller. So let's say > > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > >> > out buffers with the display's DRM driver but a custom ioctl > > >> > on the GPU's DRM driver to allocate non scanout, off-screen > > >> > buffers. Sounds great, but I don't think that really works > > >> > with DRI2. If we used two drivers to allocate buffers, which > > >> > of those drivers do we return in DRI2ConnectReply? Even if we > > >> > solve that somehow, GEM flink names are name-spaced to a > > >> > single device node (AFAIK).
So when we do a DRI2GetBuffers, > > >> > how does the EGL in the client know which DRM device owns GEM > > >> > flink name "1234"? We'd need some pretty dirty hacks. > > >> > > >> You would return the name of the display driver allocating the > > >> buffers. On the client side you can use generic ioctls to go from > > >> flink -> handle -> dmabuf. So the client side would end up opening > > >> both the display drm device and the gpu, but without needing to know > > >> too much about the display. > > > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > > using dma_buf/PRIME can still be flink'd. So the display controller's > > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > > ioctl. Those scan-out buffers than then be exported from the > > > dispaly's DRM driver and imported into the GPU's DRM driver using > > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > > name for that buffer within the GPU DRM driver's name-space to return > > > to the DRI2 client. That same namespace is also what DRI2 back- > > > buffers are allocated from, so I think that could work... Except... > > > > (and.. the general direction is that things will move more to just use > > dmabuf directly, ie. wayland or dri3) > > I agree, DRI2 is the only reason why we need a system-wide ID. I also > prefer buffers to be passed around by dma_buf fd, but we still need to > support DRI2 and will do for some time I expect. > > > > > >> > Anyway, that latter case also gets quite difficult. The "GPU" > > >> > DRM driver would need to know the constraints of the display > > >> > controller when allocating buffers intended to be scanned out. > > >> > For example, pl111 typically isn't behind an IOMMU and so > > >> > requires physically contiguous memory. We'd have to teach the > > >> > GPU's DRM driver about the constraints of the display HW. Not > > >> > exactly a clean driver model. 
:-( > > >> > > > >> > I'm still a little stuck on how to proceed, so any ideas > > >> > would greatly appreciated! My current train of thought is > > >> > having a kind of SoC-specific DRM driver which allocates > > >> > buffers for both display and GPU within a single GEM > > >> > namespace. That SoC-specific DRM driver could then know the > > >> > constraints of both the GPU and the display HW. We could then > > >> > use PRIME to export buffers allocated with the SoC DRM driver > > >> > and import them into the GPU and/or display DRM driver. > > >> > > >> Usually if the display drm driver is allocating the buffers that > > >> might be scanned out, it just needs to have minimal knowledge of > > >> the GPU (pitch alignment constraints). I don't think we need a > > >> 3rd device just to allocate buffers. > > > > > > While Mali can render to pretty much any buffer, there is a mild > > > performance improvement to be had if the buffer stride is aligned to > > > the AXI bus's max burst length when drawing to the buffer. > > > > I suspect the display controllers might frequently benefit if the > > pitch is aligned to AXI burst length too.. > > If the display controller is going to be reading from linear memory > I don't think it will make much difference - you'll just get an extra > 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > you get those extra transactions per _tile_ scan-line and as such, > the overhead is more pronounced. > > > > > > So in some respects, there is a constraint on how buffers which will > > > be drawn to using the GPU are allocated. I don't really like the idea > > > of teaching the display controller DRM driver about the GPU buffer > > > constraints, even if they are fairly trivial like this. If the same > > > display HW IP is being used on several SoCs, it seems wrong somehow > > > to enforce those GPU constraints if some of those SoCs don't have a > > > GPU. 
> > > > Well, I suppose you could get min_pitch_alignment from devicetree, or > > something like this.. > > > > In the end, the easy solution is just to make the display allocate to > > the worst-case pitch alignment. In the early days of dma-buf > > discussions, we kicked around the idea of negotiating or > > programatically describing the constraints, but that didn't really > > seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > > > > We may also then have additional constraints when sharing buffers > > > between the display HW and video decode or even camera ISP HW. > > > Programmatically describing buffer allocation constraints is very > > > difficult and I'm not sure you can actually do it - there's some > > > pretty complex constraints out there! E.g. I believe there's a > > > platform where Y and UV planes of the reference frame need to be in > > > separate DRAM banks for real-time 1080p decode, or something like > > > that? > > > > yes, this was discussed. This is different from pitch/format/size > > constraints.. it is really just a placement constraint (ie. where do > > the physical pages go). IIRC the conclusion was to use a dummy > > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > > > Anyway, I guess my point is that even if we solve how to allocate > > > buffers which will be shared between the GPU and display HW such that > > > both sets of constraints are satisfied, that may not be the end of > > > the story. > > > > > > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. 
> > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? > I strongly disagree with exposing low-level hardware details like tiling to userspace. If we have to do the negotiation of those things in userspace we will end up with having to pipe those information through things like the wayland protocol. I don't see how this could ever be considered a good idea. I would rather see kernel drivers negotiating those things at dmabuf attach time in way invisible to userspace. I agree that this negotiation thing isn't easy to get right for the plethora of different hardware constraints we see today, but I would rather see this in-kernel, where we have the chance to fix things up if needed, than in a fixed userspace interface. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 
6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > So in some respects, there is a constraint on how buffers which will
>> > be drawn to using the GPU are allocated. I don't really like the idea
>> > of teaching the display controller DRM driver about the GPU buffer
>> > constraints, even if they are fairly trivial like this. If the same
>> > display HW IP is being used on several SoCs, it seems wrong somehow
>> > to enforce those GPU constraints if some of those SoCs don't have a
>> > GPU.
>>
>> Well, I suppose you could get min_pitch_alignment from devicetree, or
>> something like this..
>>
>> In the end, the easy solution is just to make the display allocate to
>> the worst-case pitch alignment. In the early days of dma-buf
>> discussions, we kicked around the idea of negotiating or
>> programmatically describing the constraints, but that didn't really
>> seem like a bounded problem.
>
> Yeah - I was around for some of those discussions and agree it's not
> really an easy problem to solve.
>
>> > We may also then have additional constraints when sharing buffers
>> > between the display HW and video decode or even camera ISP HW.
>> > Programmatically describing buffer allocation constraints is very
>> > difficult and I'm not sure you can actually do it - there's some
>> > pretty complex constraints out there! E.g. I believe there's a
>> > platform where Y and UV planes of the reference frame need to be in
>> > separate DRAM banks for real-time 1080p decode, or something like
>> > that?
>>
>> yes, this was discussed. This is different from pitch/format/size
>> constraints.. it is really just a placement constraint (ie. where do
>> the physical pages go). IIRC the conclusion was to use a dummy
>> device with its own CMA pool for attaching the Y vs UV buffers.
>>
>> > Anyway, I guess my point is that even if we solve how to allocate
>> > buffers which will be shared between the GPU and display HW such that
>> > both sets of constraints are satisfied, that may not be the end of
>> > the story.
>> >
>>
>> that was part of the reason to punt this problem to userspace ;-)
>>
>> In practice, the kernel drivers don't usually know too much about
>> the dimensions/format/etc.. that is really userspace level knowledge.
>> There are a few exceptions when the kernel needs to know how to setup
>> GTT/etc for tiled buffers, but normally this sort of information is up
>> at the next level up (userspace, and drm_framebuffer in case of
>> scanout). Userspace media frameworks like GStreamer already have a
>> concept of format/caps negotiation. For non-display<->gpu sharing, I
>> think this is probably where this sort of constraint negotiation
>> should be handled.
>
> I agree that user-space will know which devices will access the buffer
> and thus can figure out at least a common pixel format. Though I'm not
> so sure userspace can figure out more low-level details like alignment
> and placement in physical memory, etc.

well, let's divide things up into two categories:

1) the arrangement and format of pixels.. ie. what userspace would need
   to know if it mmap's a buffer. This includes pixel format, stride,
   etc. This should be negotiated in userspace, it would be crazy to
   try to do this in the kernel.

2) the physical placement of the pages. Ie. whether it is contiguous or
   not. Which bank the pages in the buffer are placed in, etc. This is
   not visible to userspace. This is the purpose of the attach step, so
   you know all the devices involved in sharing up front before
   allocating the backing pages. (Or in the worst case, if you have a
   "late attacher" you at least know when no device is doing dma access
   to a buffer and can reallocate and move the buffer.) A long time
   back, I had a patch that added a field or two to
   'struct device_dma_parameters' so that it could be known if a device
   required contiguous buffers.. looks like that never got merged, so
   I'd need to dig that back up and resend it. But the idea was to have
   the 'struct device' encapsulate all the information that would be
   needed to do-the-right-thing when it comes to placement.

> Anyway, assuming user-space can figure out how a buffer should be
> stored in memory, how does it indicate this to a kernel driver and
> actually allocate it? Which ioctl on which device does user-space
> call, with what parameters? Are you suggesting using something like
> ION which exposes the low-level details of how buffers are laid out in
> physical memory to userspace? If not, what?

no, userspace should not need to know this. And having a central driver
that knows this for all the other drivers in the system doesn't really
solve anything and isn't really scalable. At best you might want, in
some cases, a flag you can pass when allocating. For example, some of
the drivers have a 'SCANOUT' flag that can be passed when allocating a
GEM buffer, as a hint to the kernel that 'if this hw requires contig
memory for scanout, allocate this buffer contig'. But really, when it
comes to sharing buffers between devices, we want this sort of
information in dev->dma_parms of the importing device(s).

BR,
-R

12 years, 2 months

Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

by Rob Clark

On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> Hi Rob,
>
> +linux-media, +linaro-mm-sig for discussion of video/camera
> buffer constraints...
>
>> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
>> wrote:
>> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
>> >> > allocate buffers for the GPU. Still not sure how to resolve
>> >> > this as we don't use DRM for our GPU driver.
>> >>
>> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp.
>> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
>> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
>> >> using prime/dmabuf to share between the two.
>> >
>> > The "extra" buffers we were allocating from armsoc DDX were really
>> > being allocated through DRM/GEM so we could get an flink name
>> > for them and pass a reference to them back to our GPU driver on
>> > the client side. If it weren't for our need to access those
>> > extra off-screen buffers with the GPU we wouldn't need to
>> > allocate them with DRM at all. So, given they are really "GPU"
>> > buffers, it does absolutely make sense to allocate them in a
>> > different driver to the display driver.
>> >
>> > However, to avoid unnecessary memcpys & related cache
>> > maintenance ops, we'd also like the GPU to render into buffers
>> > which are scanned out by the display controller. So let's say
>> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
>> > out buffers with the display's DRM driver but a custom ioctl
>> > on the GPU's DRM driver to allocate non scanout, off-screen
>> > buffers. Sounds great, but I don't think that really works
>> > with DRI2. If we used two drivers to allocate buffers, which
>> > of those drivers do we return in DRI2ConnectReply? Even if we
>> > solve that somehow, GEM flink names are name-spaced to a
>> > single device node (AFAIK). So when we do a DRI2GetBuffers,
>> > how does the EGL in the client know which DRM device owns GEM
>> > flink name "1234"? We'd need some pretty dirty hacks.
>>
>> You would return the name of the display driver allocating the
>> buffers. On the client side you can use generic ioctls to go from
>> flink -> handle -> dmabuf. So the client side would end up opening
>> both the display drm device and the gpu, but without needing to know
>> too much about the display.
>
> I think the bit I was missing was that a GEM bo for a buffer imported
> using dma_buf/PRIME can still be flink'd. So the display controller's
> DRM driver allocates scan-out buffers via the DUMB buffer allocate
> ioctl. Those scan-out buffers can then be exported from the
> display's DRM driver and imported into the GPU's DRM driver using
> PRIME. Once imported into the GPU's driver, we can use flink to get a
> name for that buffer within the GPU DRM driver's name-space to return
> to the DRI2 client. That same namespace is also what DRI2 back-buffers
> are allocated from, so I think that could work... Except...
>

(and.. the general direction is that things will move more to just use
dmabuf directly, ie. wayland or dri3)

>
>> > Anyway, that latter case also gets quite difficult. The "GPU"
>> > DRM driver would need to know the constraints of the display
>> > controller when allocating buffers intended to be scanned out.
>> > For example, pl111 typically isn't behind an IOMMU and so
>> > requires physically contiguous memory. We'd have to teach the
>> > GPU's DRM driver about the constraints of the display HW. Not
>> > exactly a clean driver model. :-(
>> >
>> > I'm still a little stuck on how to proceed, so any ideas
>> > would be greatly appreciated! My current train of thought is
>> > having a kind of SoC-specific DRM driver which allocates
>> > buffers for both display and GPU within a single GEM
>> > namespace. That SoC-specific DRM driver could then know the
>> > constraints of both the GPU and the display HW. We could then
>> > use PRIME to export buffers allocated with the SoC DRM driver
>> > and import them into the GPU and/or display DRM driver.
>>
>> Usually if the display drm driver is allocating the buffers that might
>> be scanned out, it just needs to have minimal knowledge of the GPU
>> (pitch alignment constraints). I don't think we need a 3rd device
>> just to allocate buffers.
>
> While Mali can render to pretty much any buffer, there is a mild
> performance improvement to be had if the buffer stride is aligned to
> the AXI bus's max burst length when drawing to the buffer.

I suspect the display controllers might frequently benefit if the pitch
is aligned to AXI burst length too..

> So in some respects, there is a constraint on how buffers which will
> be drawn to using the GPU are allocated. I don't really like the idea
> of teaching the display controller DRM driver about the GPU buffer
> constraints, even if they are fairly trivial like this. If the same
> display HW IP is being used on several SoCs, it seems wrong somehow
> to enforce those GPU constraints if some of those SoCs don't have a
> GPU.

Well, I suppose you could get min_pitch_alignment from devicetree, or
something like this..

In the end, the easy solution is just to make the display allocate to
the worst-case pitch alignment. In the early days of dma-buf
discussions, we kicked around the idea of negotiating or
programmatically describing the constraints, but that didn't really
seem like a bounded problem.

> We may also then have additional constraints when sharing buffers
> between the display HW and video decode or even camera ISP HW.
> Programmatically describing buffer allocation constraints is very
> difficult and I'm not sure you can actually do it - there's some
> pretty complex constraints out there! E.g. I believe there's a
> platform where Y and UV planes of the reference frame need to be in
> separate DRAM banks for real-time 1080p decode, or something like
> that?

yes, this was discussed. This is different from pitch/format/size
constraints.. it is really just a placement constraint (ie. where do
the physical pages go). IIRC the conclusion was to use a dummy device
with its own CMA pool for attaching the Y vs UV buffers.

> Anyway, I guess my point is that even if we solve how to allocate
> buffers which will be shared between the GPU and display HW such that
> both sets of constraints are satisfied, that may not be the end of
> the story.
>

that was part of the reason to punt this problem to userspace ;-)

In practice, the kernel drivers don't usually know too much about the
dimensions/format/etc.. that is really userspace level knowledge. There
are a few exceptions when the kernel needs to know how to setup GTT/etc
for tiled buffers, but normally this sort of information is up at the
next level up (userspace, and drm_framebuffer in case of scanout).
Userspace media frameworks like GStreamer already have a concept of
format/caps negotiation. For non-display<->gpu sharing, I think this is
probably where this sort of constraint negotiation should be handled.

BR,
-R

>
> Cheers,
>
> Tom

12 years, 2 months
