So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider object interface, so that Optimus-based laptops can use our driver to drive the discrete GPU and display on the integrated GPU. The good news is that I've got a proof of concept working.
During a review of the current code, we came up with a few concerns:
1. The output source is responsible for allocating the shared memory
Right now, the X server calls CreatePixmap on the output source screen and then expects the output sink screen to be able to display from whatever memory the source allocates. The source has no mechanism for asking the sink what its requirements are for the surface. I'm using our own internal pitch alignment requirements, and that seems to be good enough for the Intel device to scan out, but that could be pure luck.
Does it make sense to add a mechanism for drivers to negotiate this with each other, or is it sufficient to just define a lowest common denominator format and if your hardware can't deal with that format, you just don't get to share buffers?
One of my coworkers brought to my attention the fact that Tegra requires a specific pitch alignment, and cannot accommodate larger pitches. If other SoC designs have similar restrictions, we might need to add a handshake mechanism.
2. There's no fallback mechanism if sharing can't be negotiated
If RandR fails to share a pixmap with the output sink screen, the whole modeset fails. This means you'll end up not seeing anything on the screen and you'll probably think your computer locked up. Should there be some sort of software copy fallback to ensure that something at least shows up on the display?
3. How should the memory be allocated?
In the prototype I threw together, I'm allocating the shared memory using shm_open and then exporting that as a dma-buf file descriptor using an ioctl I added to the kernel, and then importing that memory back into our driver through dma_buf_attach & dma_buf_map_attachment. Does it make sense for user-space programs to be able to export shmfs files like that? Should that interface go in DRM / GEM / PRIME instead? Something else? I'm pretty unfamiliar with this kernel code so any suggestions would be appreciated.
-- Aaron
P.S. for those unfamiliar with PRIME: Dave Airlie added new support to the X Resize and Rotate extension version 1.4 to support offloading display and rendering to different drivers. PRIME is the DRM implementation in the kernel, layered on top of DMA-BUF, that implements the actual sharing of buffers between drivers.
http://cgit.freedesktop.org/xorg/proto/randrproto/tree/randrproto.txt?id=ran...
http://airlied.livejournal.com/75555.html - update on hotplug server
http://airlied.livejournal.com/76078.html - randr 1.5 demo videos
> So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider object interface, so that Optimus-based laptops can use our driver to drive the discrete GPU and display on the integrated GPU. The good news is that I've got a proof of concept working.
I don't suppose you'll be interested in adding the other method at some point as well? Since saving power is probably important to a lot of people :-)
> During a review of the current code, we came up with a few concerns:
> - The output source is responsible for allocating the shared memory
> Right now, the X server calls CreatePixmap on the output source screen and then expects the output sink screen to be able to display from whatever memory the source allocates. The source has no mechanism for asking the sink what its requirements are for the surface. I'm using our own internal pitch alignment requirements, and that seems to be good enough for the Intel device to scan out, but that could be pure luck.
Well, in theory it might be nice, but it would have been premature: so far the only interactions for PRIME are combinations of Intel, NVIDIA, and AMD, and I think everyone has fairly similar pitch alignment requirements. I'd be interested in adding such an interface, but I don't think it's something I personally would be working on.
> Does it make sense to add a mechanism for drivers to negotiate this with each other, or is it sufficient to just define a lowest common denominator format and if your hardware can't deal with that format, you just don't get to share buffers?
At the moment I'm happy to just go with linear, minimum pitch alignment of 64 or something as a base standard, but yeah, I'm happy for it to work either way; I just don't have enough evidence that it's worth it yet. I've not looked at ARM stuff, so patches are welcome if people consider they need to use this stuff for SoC devices.
> - There's no fallback mechanism if sharing can't be negotiated
> If RandR fails to share a pixmap with the output sink screen, the whole modeset fails. This means you'll end up not seeing anything on the screen and you'll probably think your computer locked up. Should there be some sort of software copy fallback to ensure that something at least shows up on the display?
Ugh, it would be fairly slow and unusable; I'd rather they saw nothing. But again, I'm open to suggestions on how to make this work, since it might fail for other reasons, and in that case there is still nothing a software copy can do. What happens if the slave Intel device just fails to allocate a pixmap? But yeah, I'm willing to think about it a bit more when we have some reference implementations.
> - How should the memory be allocated?
> In the prototype I threw together, I'm allocating the shared memory using shm_open and then exporting that as a dma-buf file descriptor using an ioctl I added to the kernel, and then importing that memory back into our driver through dma_buf_attach & dma_buf_map_attachment. Does it make sense for user-space programs to be able to export shmfs files like that? Should that interface go in DRM / GEM / PRIME instead? Something else? I'm pretty unfamiliar with this kernel code so any suggestions would be appreciated.
Your kernel driver should in theory be doing it all. If you allocate shared pixmaps in GTT-accessible memory, then you need an ioctl to tell your kernel driver to export the dma-buf to an fd handle (assuming we get rid of the _GPL, which people have mentioned they are open to doing). We have handle->fd and fd->handle interfaces on DRM; you'd need something similar on the nvidia kernel driver interface.
Yes, for 4, some sort of fencing is being worked on by Maarten for other stuff, but it would be a prerequisite for doing this. Also, some devices don't want fullscreen updates, like USB, so doing flipped updates would have to be optional or negotiated. It makes sense for us as well, since things like gnome-shell can do full-screen pageflips and we have to do full-screen dirty updates.
Dave.
On 08/31/2012 08:00 PM, Dave Airlie wrote:
>> So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider object interface, so that Optimus-based laptops can use our driver to drive the discrete GPU and display on the integrated GPU. The good news is that I've got a proof of concept working.
> I don't suppose you'll be interested in adding the other method at some point as well? Since saving power is probably important to a lot of people
That's milestone 2. I'm focusing on display offload to start because it's easier to implement and lays the groundwork for the kernel pieces. I have to emphasize that I'm just doing a feasibility study right now and I can't promise that we're going to officially support this stuff.
>> During a review of the current code, we came up with a few concerns:
>> - The output source is responsible for allocating the shared memory
>> Right now, the X server calls CreatePixmap on the output source screen and then expects the output sink screen to be able to display from whatever memory the source allocates. The source has no mechanism for asking the sink what its requirements are for the surface. I'm using our own internal pitch alignment requirements, and that seems to be good enough for the Intel device to scan out, but that could be pure luck.
> Well, in theory it might be nice, but it would have been premature: so far the only interactions for PRIME are combinations of Intel, NVIDIA, and AMD, and I think everyone has fairly similar pitch alignment requirements. I'd be interested in adding such an interface, but I don't think it's something I personally would be working on.
Okay. Hopefully that won't be too painful to add if we ever need it in the future.
>> Does it make sense to add a mechanism for drivers to negotiate this with each other, or is it sufficient to just define a lowest common denominator format and if your hardware can't deal with that format, you just don't get to share buffers?
> At the moment I'm happy to just go with linear, minimum pitch alignment of 64 or
256, for us.
> something as a base standard, but yeah, I'm happy for it to work either way; I just don't have enough evidence that it's worth it yet. I've not looked at ARM stuff, so patches are welcome if people consider they need to use this stuff for SoC devices.
We can always hack it to whatever is necessary if we see that the sink-side driver is Tegra, but I was hoping for something more general.
>> - There's no fallback mechanism if sharing can't be negotiated
>> If RandR fails to share a pixmap with the output sink screen, the whole modeset fails. This means you'll end up not seeing anything on the screen and you'll probably think your computer locked up. Should there be some sort of software copy fallback to ensure that something at least shows up on the display?
> Ugh, it would be fairly slow and unusable; I'd rather they saw nothing. But again, I'm open to suggestions on how to make this work, since it might fail for other reasons, and in that case there is still nothing a software copy can do. What happens if the slave Intel device just fails to allocate a pixmap? But yeah, I'm willing to think about it a bit more when we have some reference implementations.
Just rolling back the modeset operation to whatever was working before would be a good start.
It's worse than that on my current laptop, though, since our driver sees a phantom CRT output and we happily start driving pixels to it that end up going nowhere. I'll need to think about what the right behavior is there since I don't know if we want to rely on an X client to make that configuration work.
>> - How should the memory be allocated?
>> In the prototype I threw together, I'm allocating the shared memory using shm_open and then exporting that as a dma-buf file descriptor using an ioctl I added to the kernel, and then importing that memory back into our driver through dma_buf_attach & dma_buf_map_attachment. Does it make sense for user-space programs to be able to export shmfs files like that? Should that interface go in DRM / GEM / PRIME instead? Something else? I'm pretty unfamiliar with this kernel code so any suggestions would be appreciated.
> Your kernel driver should in theory be doing it all. If you allocate shared pixmaps in GTT-accessible memory, then you need an ioctl to tell your kernel driver to export the dma-buf to an fd handle (assuming we get rid of the _GPL, which people have mentioned they are open to doing). We have handle->fd and fd->handle interfaces on DRM; you'd need something similar on the nvidia kernel driver interface.
Okay, I can do that. We already have a mechanism for importing buffers allocated elsewhere so reusing that for shmfs and/or dma-buf seemed like a natural extension. I don't think adding a separate ioctl for exporting our own allocations will add too much extra code.
> Yes, for 4, some sort of fencing is being worked on by Maarten for other stuff, but it would be a prerequisite for doing this. Also, some devices don't want fullscreen updates, like USB, so doing flipped updates would have to be optional or negotiated. It makes sense for us as well, since things like gnome-shell can do full-screen pageflips and we have to do full-screen dirty updates.
Right now my implementation has two sources of tearing:
1. The dGPU reads the vidmem primary surface asynchronously from its own rendering to it.
2. The iGPU fetches the shared surface for display asynchronously from the dGPU writing into it.
#1 I can fix within our driver. For #2, I don't want to rely on the dGPU being able to push complete frames over the bus during vblank in response to an iGPU fence trigger, so I was thinking we would want double-buffering all the time. Also, I was hoping to set up a proper flip chain between the dGPU, the dGPU's DMA engine, and the Intel display engine so that for full-screen applications, glXSwapBuffers is stalled properly without relying on the CPU to schedule things. Maybe that's overly ambitious for now?
-- Aaron
On Tue, Sep 04, 2012 at 01:57:32PM -0700, Aaron Plattner wrote:
> On 08/31/2012 08:00 PM, Dave Airlie wrote:
>> Yes, for 4, some sort of fencing is being worked on by Maarten for other stuff, but it would be a prerequisite for doing this. Also, some devices don't want fullscreen updates, like USB, so doing flipped updates would have to be optional or negotiated. It makes sense for us as well, since things like gnome-shell can do full-screen pageflips and we have to do full-screen dirty updates.
> Right now my implementation has two sources of tearing:
> 1. The dGPU reads the vidmem primary surface asynchronously from its own rendering to it.
> 2. The iGPU fetches the shared surface for display asynchronously from the dGPU writing into it.
> #1 I can fix within our driver. For #2, I don't want to rely on the dGPU being able to push complete frames over the bus during vblank in response to an iGPU fence trigger, so I was thinking we would want double-buffering all the time. Also, I was hoping to set up a proper flip chain between the dGPU, the dGPU's DMA engine, and the Intel display engine so that for full-screen applications, glXSwapBuffers is stalled properly without relying on the CPU to schedule things. Maybe that's overly ambitious for now?
For the frontbuffer tearing, Chris Wilson added a special mode to the SNA Intel driver that uses pageflips for all buffer updates (like windowed Xv or DRI2 copy-buffers), mostly because vsync'ed blits are busted on SNB (and not yet proven to be fixed on IVB). So we could use that mode for an Optimus platform.
Wrt the full flip chain, that's what Maarten Lankhorst has running in his proof of concept (but only for a second or so, since nouveau is totally busted on his machine). The only place he wakes up the CPU is to sync from nv to intel, but even there we can kick off the intel GPU directly from the nv IRQ handler (with a simple register write). intel -> nv sync uses memory-based sequence numbers. It's only a proof of concept for rendering, though; IIRC the fence support isn't wired up with the pageflipping on the intel side yet. -Daniel