I've been buried all day, but here's my response to a bunch of today's discussion points:

I definitely don't think it makes sense to solve part of the problem and then leave the GPUs to require their own allocator for their common case. Why solve the allocator problem twice? Also, we want to be able to support passing these buffers between the camera, GPU, video decoder, etc. It'd be much easier if they all came from a common source.

The Android team's graphics folks, and their counterparts at Intel, Imagination Tech, NVIDIA, Qualcomm and ARM, have all told me that they need a method for mapping buffers uncached to userspace. The common case for this is to write vertices, textures, etc. to these buffers once and never touch them again. This may happen several (or even dozens of) times per frame. My experience with cache flushes on ARM architectures matches Marek's. Typically write-combine makes streaming writes really fast, and on several SoCs we've found it cheaper to flush the whole cache than to flush by line. Clearly that impacts the rest of system performance, not to mention that a couple of large textures is enough to totally blow out the caches for the rest of the system.
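
For reference, here's a minimal sketch of what such an uncached (write-combined) userspace mapping can look like in a driver's mmap handler. my_buffer and its fields are hypothetical; pgprot_writecombine() and remap_pfn_range() are the standard kernel interfaces:

#include <linux/fs.h>
#include <linux/mm.h>

struct my_buffer {
        phys_addr_t phys;       /* physically contiguous backing store */
        size_t size;
        unsigned long flags;    /* hypothetical per-buffer flags */
};

/* Hypothetical mmap handler: map the buffer to userspace with
 * write-combine attributes, so streaming vertex/texture writes never
 * allocate (or evict) cache lines. */
static int my_buffer_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct my_buffer *buf = filp->private_data;
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > buf->size)
                return -EINVAL;

        /* Uncached from the CPU's point of view, but writes still
         * burst nicely through the write buffer. */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        return remap_pfn_range(vma, vma->vm_start,
                               buf->phys >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
}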

Once we've solved the multiple-mapping problem, it becomes quite easy to support both cached AND uncached accesses from the CPU, so I think that covers the cases where a cached mapping is actually desirable -- software rendering, processing frames from a camera, etc.
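
Concretely (still just a sketch, reusing the hypothetical handler above and a made-up MY_BUF_CACHED flag), the cached-vs-uncached choice becomes a per-mapping tweak:

        /* Cached mapping for software rendering or chewing on camera
         * frames; write-combined for the write-once streaming case.
         * The allocator still has to track every mapping of the buffer
         * so the two views can be kept coherent. */
        if (!(buf->flags & MY_BUF_CACHED))
                vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);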

I think the issues of contiguous memory allocation and cache attributes are totally separate, except that in cases where large contiguous regions are necessary -- the problem Qualcomm's pmem and friends were written to solve -- you pretty much end up needing to set aside a pool of buffers at boot anyway in order to guarantee the availability of large-order allocations. Once you've done that, the attributes problem goes away, since your memory is not in the direct map.
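
As a rough sketch of that boot-time carveout (memblock-based, with a made-up base and size; memblock_remove() and ioremap_wc() are real kernel interfaces):

#include <linux/init.h>
#include <linux/io.h>
#include <linux/memblock.h>

#define POOL_BASE       0x80000000UL    /* hypothetical carveout base */
#define POOL_SIZE       (64UL << 20)    /* hypothetical 64MB pool */

/* Board setup: pull the pool out of system RAM early, before the
 * linear mapping is built, so the kernel never creates a cached
 * alias for it. */
static void __init my_board_reserve(void)
{
        memblock_remove(POOL_BASE, POOL_SIZE);
}

/* Drivers can then map slices of the pool with whatever attributes
 * they want -- e.g. write-combined -- with no conflicting cached
 * mapping in the direct map to worry about. */
static void __iomem *my_pool_map_wc(unsigned long offset, size_t len)
{
        return ioremap_wc(POOL_BASE + offset, len);
}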

On Wed, Apr 20, 2011 at 1:34 PM, Dave Airlie <airlied@gmail.com> wrote:
>
>
> [TC] I’m not sure I completely agree with this being a use case. From my
> understanding, the problem we’re trying to solve here isn’t a generic
> graphics memory manager but rather a memory manager to facilitate
> cross-device sharing of 2D image buffers. GPU drivers will still have their
> own allocators for textures which will probably be in a tiled or other
> proprietary format no other device can understand anyway. The use case where
> we (GPU driver vendors) might want uncached memory is for one-time texture
> upload. If we have a texture we know we’re only going to write to once (with
> the CPU), there is no benefit in using cached memory. In fact, there’s a
> potential performance drop if you used cached memory because the texture
> upload will cause useful cache lines to be evicted and replaced with useless
> lines for the texture. However, I don’t see any use case where we’d then
> want to share that CPU-uploaded texture with another device, in which case
> we would use our own (uncached) allocator, not this “cross-device”
> allocator. There’s also a school of thought (so I’m told) that for one-time
> texture upload you still want to use cached memory because more modern cores
> have smaller write buffers (so you want to use the cache for better
> combining of writes) and are good at detecting large sequential writes and
> thus don’t use the whole cache for those anyway. So, other than one-time
> texture upload, are there other graphics use cases you know of where it
> might be more optimal to use uncached memory? What about video decoder
> use-cases?
>

The memory manager should be used for all internal GPU memory
management as well, if desired.

If we have any hope of ever getting open source ARM GPU drivers
upstream, they can't all just go reinventing the wheel. They need to
be based on a common layer.

Dave.