On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
If that means AI companies don't want to open up their hw specs enough to allow that, so be it - all you get in that case is offloading the kernel side of the stack for convenience, with zero long term prospects to ever make this into a cross vendor subsystem stack that does something useful.
I don't think this is true at all - nouveau is probably the best example.
nouveau reverse engineered a userspace stack for one of these devices.
How much further ahead would they have been by now if they had a vendor supported, fully featured, open kernel driver to build the userspace upon?
There are actually tons of examples here, most of the arm socs have fully open kernel drivers, supported by the vendor (out of tree).
The hard part is the userspace driver and all the things you're submitting to it. We've had open kernel drivers for mali/qualcomm/... years before any believable open implementation started existing. Typing up the memory manager and hw submission queue handling is comparatively trivial. Generally the kernel driver is also done last, you bring up the userspace first, often by just directly programming the hw from userspace. Kernel driver only gets in the way with this stuff (nouveau is entirely developed as a userspace driver, as the most extreme example).
This is a bit different for the display side, but nowadays those drivers are fully in-kernel so they're all open. Well except the nvidia one, and I've not heard of nvidia working on even an out-of-tree open display driver, so that won't help the in-tree effort at all.
Where it would have helped is if this open driver would come with redistributable firmware, because that is right now the thing making nouveau reverse-engineering painful enough to be non-feasible. Well not the reverse-engineering, but the "shipping the result as a working driver stack".
I don't think the facts on the ground support your claim here, aside from the practical problem that nvidia is unwilling to even create an open driver to begin with. So there isn't anything to merge.
open up your hw enough for that, I really don't see the point in merging such a driver, it'll be an unmaintainable stack by anyone else who's not having access to those NDA covered specs and patents and everything.
My perspective from RDMA is that the drivers are black boxes. I can hack around the interface layers but there is a lot of wild stuff in there that can't be understood without access to the HW documentation.
There are shipping gpu drivers with entirely reverse-engineered stacks. And I don't mean "shipping in fedora" but "shipping in Chrome tablets sold by OEM partners of Google". So it's very much possible, even if the vendor is maximally stubborn about things.
I think only HW that has open specs, like say NVMe, can really be properly community oriented. Otherwise we have to work in a community partnership with the vendor.
Well sure that's the ideal case, but most vendors in the accel space aren't interested in actual partnership with the wider community. It's "merge this kernel driver and have no further demands about anything else". Well there are some who are on board, but it does take pretty enormous amounts of coercion.
-Daniel