On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?
There's a short bit about that in the cover letter (and in some talks), but I'll expand on it here -- I suppose my most important question is "is this interesting enough to take upstream?", so there should be at least a bit of a description of what it actually enables:
I don't think there are any deep fundamental advantages to doing this as opposed to nbd/iscsi over localhost/unix (or by just writing a kernel implementation, for that matter), at least in terms of anything that was previously impossible now becoming possible. There are a handful of things that are easier and/or faster, though.
dm-user looks a lot like NBD without the networking. The major difference is which side initiates messages: in NBD the kernel initiates messages, while in dm-user userspace initiates messages (via a read that will block if there is no message, but presumably we'd want to add support for non-blocking userspace implementations eventually). The NBD approach certainly makes sense for a networked system, as one generally wants to have a single storage server handling multiple clients, but inverting that makes some things simpler in dm-user.
One specific advantage of this change is that a dm-user target can be transitioned from one daemon to another without any IO errors: just spin up the second daemon, signal the first to stop requesting new messages, and let it exit. We're using that mechanism to replace the daemon launched by early init (which runs before the security subsystem is up, as in our use case dm-user provides the root filesystem) with one that's properly sandboxed (which can only be launched after the root filesystem has come up). There are ways around this (replacing the DM table, for example), but they don't fit as cleanly.
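To make the handoff concrete, here is a toy model of it in Python. It is only a sketch of the pull-style idea, not the dm-user interface: the queue stands in for the kernel's pending messages, `get()` stands in for the blocking read, and every name in it (`daemon`, `handoff_demo`, "early-init", "sandboxed") is made up for illustration.

```python
import queue
import threading


def daemon(name, pending, completed, stop):
    """Pull-model server: it asks for the next message; nothing is pushed at it."""
    while not stop.is_set():
        try:
            # Stands in for the blocking read; the timeout only lets the
            # loop notice the stop flag in this toy model.
            req = pending.get(timeout=0.01)
        except queue.Empty:
            continue
        # A message that was already fetched is always finished, even if
        # the stop flag was raised in the meantime.
        completed.append((req, name))
        pending.task_done()


def handoff_demo(n_requests=200):
    pending = queue.Queue()  # messages waiting for some daemon to ask for them
    completed = []
    stop_a, stop_b = threading.Event(), threading.Event()
    a = threading.Thread(target=daemon, args=("early-init", pending, completed, stop_a))
    b = threading.Thread(target=daemon, args=("sandboxed", pending, completed, stop_b))

    a.start()
    for i in range(n_requests // 2):
        pending.put(i)

    b.start()      # spin up the second daemon
    stop_a.set()   # signal the first to stop requesting new messages
    a.join()       # ... and let it exit

    for i in range(n_requests // 2, n_requests):
        pending.put(i)

    pending.join()  # every queued message has been handled by someone
    stop_b.set()
    b.join()
    return completed


if __name__ == "__main__":
    done = handoff_demo()
    print(len(done), "requests completed, none dropped")
```

Because each daemon only ever takes a message by asking for one, the first daemon can stop simply by not asking again; anything it did not take stays queued for the second, so nothing has to be failed or retried.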
Unless I'm missing something, NBD servers aren't capable of that style of transition: soft disconnects can only be initiated by the client (the kernel, in this case), which leaves no way for the server to transition while guaranteeing that no IOs error out. It's usually possible to shoehorn this sort of direction-reversing concept into network protocols, but it's also usually ugly (I'm thinking of IDLE, for example). I didn't try to actually do it, but my guess would be that adding a way for the server to ask the client to stop sending messages until a new server shows up would be at least as much work as doing this.
There are also a handful of possible performance advantages, but I haven't gone through the work to prove any of them out yet as performance isn't all that important for our first use case. For example:
* Cutting out the network stack is unlikely to hurt performance. I'm not sure if it will help performance, though. I think if we really had a workload where the extra copy was likely to be an issue we'd want an explicit ring buffer, but I have a theory that it would be possible to get very good performance out of a stream-style API by using multiple channels and relying on io_uring to plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert itself into user_map(), likely by uploading a BPF fragment. There's a whole class of interesting block devices that could be written in this fashion: essentially you keep a cache on a regular block device that handles the common cases by remapping BIOs and passing them along, relegating the more complicated logic to fetch cache misses and watching some subset of the access stream where necessary.
We have a use case like this in Android, where we opportunistically store backups in a portion of the TRIM'd space on devices. It's currently implemented entirely in kernel by the dm-bow target, but IIUC that was deemed too Android-specific to merge. Assuming we could get good enough performance we could move that logic to userspace, which lets us shrink our diff with upstream. It feels like some other interesting block devices could be written in a similar fashion.
All in all, I've found it a bit hard to figure out what sort of interest people have in dm-user: when I bring this up I seem to run into people who've done similar things before and are vaguely interested, but certainly nobody is chomping at the bit. I'm sending it out in this early state to try and figure out if it's interesting enough to keep going.