On Wed, 16 Dec 2020 10:24:59 PST (-0800), v.mayatskih@gmail.com wrote:
On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt palmer@dabbelt.com wrote:
I was really expecting someone to say that. It does seem kind of silly to build out the new interface, but not go all the way to a ring buffer. We just didn't really have any way to justify the extra complexity as our use cases aren't that high performance. I kind of like to have benchmarks for this sort of thing, though, and I didn't have anyone who had bothered avoiding the last copy to compare against.
I worked on something very similar, though performance was one of the goals. The implementation was floating around lockless ring buffers, shared memory for zerocopy, multiqueue and error handling. It could be that every disk storage vendor has to implement something like that in order to bridge Linux kernel to their own proprietary datapath running in userspace.
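For anyone following along who hasn't built one of these, the lockless ring part can be sketched roughly as below. This is only a minimal single-producer/single-consumer illustration in C11, not code from either implementation discussed in this thread: the names (struct ring_req, struct spsc_ring, ring_push, ring_pop) and the descriptor fields are hypothetical, and it leaves out the shared-memory mapping, multiqueue and error handling mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u  /* must be a power of two so index masking works */

/* Hypothetical request descriptor; the fields are illustrative only. */
struct ring_req {
    uint64_t sector;
    uint32_t len;
    uint32_t flags;
};

struct spsc_ring {
    _Atomic uint32_t head;  /* next slot to fill, written by the producer only */
    _Atomic uint32_t tail;  /* next slot to drain, written by the consumer only */
    struct ring_req slots[RING_SLOTS];
};

static void ring_init(struct spsc_ring *r)
{
    assert((RING_SLOTS & (RING_SLOTS - 1)) == 0);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct spsc_ring *r, const struct ring_req *req)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1)] = *req;
    /* Release: the slot contents become visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct spsc_ring *r, struct ring_req *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1)];
    /* Release: the slot is handed back to the producer only after the copy. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The indices are free-running 32-bit counters, so "full" is head - tail == RING_SLOTS and wraparound falls out of unsigned arithmetic. A real kernel/userspace ring would carry references to shared pages in the descriptors rather than copying payload through the slots.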
OK, good to know. That's kind of the feeling I'd gotten from having chatted to a handful of people about this, but I don't remember people having actually gotten all the way to zero-copy. That's how we ended up at this middle-ground ABI style: I thought people were, in practice, punting on zero copy because the complexity just wasn't worth the performance benefit. Maybe my view was just colored by how my own projects ended up going, but I've designed complicated interfaces in the past that allow for zero-copy only to never get around to actually making that work. I don't know if that's just because I've had the good fortune to avoid working on anything that ended up with users, though :).
For our use case I think we actually get better performance out of the copy-based (and probably more importantly kmalloc-based, but that's an implementation thing not an ABI thing) approach: essentially we're very sensitive to memory pressure and expect this first dm-user daemon to mostly be idle, so we're really worried about avoiding excess memory usage while idle and less worried about throughput when active. This stream-based interface means that userspace doesn't need much memory allocated to service a request, which helps with sleep/wake latencies and/or idle memory usage. That's also why we have the simple locking scheme: no sense splitting locks if there's no contention, and we only need a single thread to saturate the storage bandwidth on these phones.
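To make the memory argument concrete, here is a rough sketch of what a stream-style consumer can look like in userspace. It is not the dm-user ABI: struct msg_hdr, its fields, read_full and service_one are all hypothetical, and the fd could be anything that behaves like a byte stream. The point is only that the daemon can get by with a small fixed bounce buffer regardless of request size.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical message header; NOT the real dm-user message layout. */
struct msg_hdr {
    uint64_t seq;  /* request identifier */
    uint64_t len;  /* number of payload bytes following the header */
};

/* Read exactly n bytes, retrying on short reads and EINTR. */
static int read_full(int fd, void *buf, size_t n)
{
    char *p = buf;

    while (n > 0) {
        ssize_t r = read(fd, p, n);

        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            return -1;  /* EOF in the middle of a message */
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/*
 * Service one message through a fixed 4 KiB bounce buffer. Each chunk is
 * passed to sink (if non-NULL). Returns the payload length, or -1 on error.
 */
static int64_t service_one(int fd, void (*sink)(const void *, size_t))
{
    struct msg_hdr h;
    char chunk[4096];
    uint64_t left;

    if (read_full(fd, &h, sizeof(h)) < 0)
        return -1;
    for (left = h.len; left > 0; ) {
        size_t n = left < sizeof(chunk) ? (size_t)left : sizeof(chunk);

        if (read_full(fd, chunk, n) < 0)
            return -1;
        if (sink)
            sink(chunk, n);
        left -= n;
    }
    return (int64_t)h.len;
}
```

A zero-copy design trades this for pinned or mapped buffers that stay resident while the daemon is idle, which is the cost we're trying to avoid on these devices.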
That said, it does sound like people really do care about the sort of performance levels where zero copy is relevant in this space. I'll take a shot at something along those lines, and while it will add a degree of userspace complexity I'm not sure it'll add much in the way of kernel complexity -- at least compared to a fast version of this, where we'd need most of that stuff anyway (obviously the malloc+single lock design is simple, but probably wouldn't stick around for long). At a bare minimum it'll be interesting to play around with, but if people are doing it in practice then I'm more confident that I can put something together that at least serves as a starting point for further discussion.
I haven't gotten around to writing any code yet, but I had spent a bit of time thinking about how to put this zero-copy version together and am leaning towards it being a standalone block device (as opposed to a DM target). I'd avoided that before as I didn't want to mess around with my own device control scheme so I'll still try to do the DM thing, but I'm not sure it'll be viable. That's all speculation now, but it does bring up one interesting question:
IIUC, this version of dm-user handles BIOs before they reach the block scheduler while a standalone driver would likely handle them after blk-mq. I don't have direct experience with this, but the last time I ran into people who had these sorts of performance requirements for userspace drivers they weren't actually trying to write userspace drivers but were instead trying to write a userspace scheduler, with the userspace drivers just being the mechanism to implement that scheduler. This was a decade ago and I'm not sure that's what people are trying to do in the new blk-mq world, but if it is then it's going to be a major design consideration. I'm also not entirely sure that we're really solving the same problem at that point.