Hey,
So I took a look at the sync framework in android. In a lot of ways I believe it and my dma-fence work are similar, yet subtly different. Most of what I looked at is the sync.h header in drivers/staging, so my knowledge may be incomplete.
The timeline is similar to what I called a fence context. Each command stream on a gpu can have a context. Because nvidia hardware can have 4095 separate timelines, I didn't want to keep the bookkeeping for each timeline, although I guess the drivers are already doing that anyway. Maybe that bookkeeping could be done in a unified way for every driver, which would make a transition to timelines usable by android easier.
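To make that a bit more concrete, this is roughly the shape I have in mind; none of these names or members are final, they're just illustrative:

        struct fence_context {
                unsigned int id;        /* one per command stream on the gpu */
                atomic_t seqno;         /* last emitted sequence number */
        };

        struct dma_fence {
                struct fence_context *ctx;
                unsigned int seqno;     /* signals once ctx has reached this value */
                /* flags, lock, callback list, ... */
        };

Fences on the same context are ordered by seqno, much like sync_pt's on a timeline.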
I did not have an explicit syncpoint abstraction, but I think sync_pt + sync_fence map roughly onto what I did with my dma-fence stuff, with one difference: in my approach the dma-fence is signaled only after all sync points are done AND the queued commands have executed. In effect the dma-fence becomes the next syncpoint, depending on all previous dma-fence syncpoints.
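Building on the sketch above, the completion check would be something like this, where completed_seqno() is a made-up helper for however the driver reads back how far the context has progressed:

        static bool dma_fence_is_signaled(struct dma_fence *fence)
        {
                /* the context only advances past a seqno after all earlier
                 * syncpoints on it have signaled and the commands queued
                 * behind this fence have executed */
                return completed_seqno(fence->ctx) >= fence->seqno;
        }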
An important thing to note is that dma-fence is kernelspace only, so it might be better to rename it to syncpoint, and use fence for the userspace interface.
A big difference is locking. In my code I assume that most emitted fences are never waited on, so the fence_signal fastpath is just a test_and_set_bit plus a test_bit. A single lock protects the waitqueue and the callbacks, with the waitqueue implemented internally as just another asynchronous callback. The lock is provided by the driver, which makes it easier to add support for old hardware that has no reliable way of notifying completion of events.
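Roughly like this; the flag and member names are just what I have locally and may still change:

        int fence_signal(struct fence *fence)
        {
                struct fence_cb *cb, *tmp;
                unsigned long flags;

                /* fastpath: mark the fence signaled without taking any lock */
                if (test_and_set_bit(FENCE_BIT_SIGNALED, &fence->flags))
                        return -EINVAL; /* was already signaled */

                /* only take the driver-provided lock if someone actually
                 * enabled signaling, i.e. registered a callback or a waiter */
                if (test_bit(FENCE_BIT_ENABLE_SIGNALING, &fence->flags)) {
                        spin_lock_irqsave(fence->lock, flags);
                        list_for_each_entry_safe(cb, tmp, &fence->cb_list, node) {
                                list_del_init(&cb->node);
                                cb->func(fence, cb);
                        }
                        spin_unlock_irqrestore(fence->lock, flags);
                }
                return 0;
        }

The waitqueue is just one of those callbacks, its func does a wake_up.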
I avoided using global locks, but I think for debugfs support I may end up having to add some.
The dma-fence looks similar overall, except that I allow overriding some of the operations and keep less track of state. I do believe I can create a userspace interface around dma_fence that works similarly to android's, and the kernel-space interface could be done in a similar way too.
One thing though: is it really required to merge fences? It seems to me that if I add a poll callback, userspace could simply poll on a list of fences. This would give userspace all the information it needs about each individual fence.
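For example, assuming each fence is exposed to userspace as a pollable fd (the exact fd/ioctl details are not decided yet), waiting on several fences without a merged fence would just be:

        #include <poll.h>

        /* block until all fences have signaled; each pfd[i].revents tells
         * userspace about fence i individually */
        static int wait_fences(int *fence_fds, int count)
        {
                struct pollfd pfd[count];
                int i, remaining = count;

                for (i = 0; i < count; i++) {
                        pfd[i].fd = fence_fds[i];
                        pfd[i].events = POLLIN;
                }

                while (remaining) {
                        if (poll(pfd, count, -1) < 0)
                                return -1;

                        for (i = 0; i < count; i++) {
                                if (pfd[i].revents & POLLIN) {
                                        remaining--;
                                        pfd[i].fd = -1; /* done, ignored from now on */
                                }
                        }
                }
                return 0;
        }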
The thing about wait/wound mutexes can be ignored for this discussion. They're really just a method of attaching a fence to a dma-buf: building a list of all dma-fences to wait on in the kernel before starting a command buffer, and attaching a new fence to all the dma-bufs to signal completion. Regardless of the sync mechanism we decide on, this part wouldn't change.
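For reference, the flow is roughly the following; every helper here is made up, it's only meant to show the shape:

        static int submit_job(struct job *job)
        {
                struct fence *old, *new;
                int i, ret;

                ret = lock_all_bufs(job);       /* the wait/wound mutex part */
                if (ret)
                        return ret;

                /* collect the fences currently attached to the dma-bufs and
                 * make the hardware (or cpu) wait on them before the commands */
                for (i = 0; i < job->num_bufs; i++) {
                        old = get_attached_fence(job->bufs[i]);
                        if (old)
                                queue_wait(job, old);
                }

                queue_commands(job);
                new = create_fence(job->ctx);   /* signals when the commands complete */
                queue_signal(job, new);

                /* attach the new fence to every dma-buf involved */
                for (i = 0; i < job->num_bufs; i++)
                        attach_fence(job->bufs[i], new);

                unlock_all_bufs(job);
                return 0;
        }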
Depending on feedback I'll try reflashing my nexus 7 to stock android and work on converting android syncpoints to dma-fence, which I'll probably rename to syncpoints.
~Maarten