On Tue, Sep 05, 2023 at 12:08:40PM -0600, Keith Busch wrote:
> On Tue, Sep 05, 2023 at 10:48:25AM +0530, Kanchan Joshi wrote:
> > On Fri, Sep 01, 2023 at 10:45:50AM -0400, Keith Busch wrote:
> > > And similar to this problem, what if the metadata is extended rather than separate, and the user's buffer is too short? That will lead to the same type of problem you're trying to fix here?
> > No. For extended metadata, userspace is using its own buffer. Since no intermediate kernel buffer exists, I do not have a problem to solve.
> We still use kernel memory if the user buffer is unaligned. If user space provides a short unaligned buffer, the device will corrupt kernel memory.
Ah yes. blk_rq_map_user_iov() does make a copy of the user buffer in that case.
> > > My main concern, though, is forward and backward compatibility. Even when metadata is enabled, there are IO commands that don't touch it, so some tool that erroneously requested it will stop working. Or perhaps some other future opcode will have some other metadata use that doesn't match up exactly with how read/write/compare/append use it. As much as I'd like to keep bad user commands from crashing, these kinds of checks can become problematic for maintenance.
> > For forward compatibility - if we have commands that need to specify metadata in a different way (than what is possible from this interface), we will anyway need a new passthrough command structure.
> Not sure about that. The existing struct is flexible enough to describe any possible nvme command.
> More specifically about compatibility: this patch assumes an "nlb" field exists inside an opaque structure at the DW12 offset, and that that field defines how large the metadata buffer needs to be. Some vendor-specific or future opcode may have DW12 mean something completely different, but still need to access metadata that this patch may prevent from working.
Right. It almost had me dropping the effort. But given the horrible bug at hand, I have added an untested patch [1] that handles all the shortcomings you mentioned. Please take a look.
> > Moreover, it's really about caring _only_ for the cases when the kernel allocates memory for metadata. And those cases are specific (i.e., when metadata and metalen are not zero). We don't have to think in terms of opcodes (existing or future), no?
> It looks like a little work, but I don't see why blk-integrity must use kernel memory. Introducing an API like 'bio_integrity_map_user()' might also address your concern, as long as the user buffer is aligned. It sounds like we're assuming user buffers are aligned, at least.
Would you really prefer to have nvme_add_user_metadata() changed to do away with the allocation and use the userspace meta-buffer directly? Even with that route, the extended-LBA-with-short-unaligned-buffer case remains unhandled. That will still require similar checks, which I would like to avoid but cannot.
So how about this -
[1]

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d8ff796fd5f2..d09b5691da3e 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -320,6 +320,67 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 			meta_len, lower_32_bits(io.slba), NULL, 0, 0);
 }
 
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch(opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
+
+static bool nvme_validate_passthru_meta(struct nvme_ctrl *ctrl,
+					struct nvme_ns *ns,
+					struct nvme_command *c,
+					__u64 meta, __u32 meta_len,
+					unsigned data_len)
+{
+	/*
+	 * User may specify smaller meta-buffer with a larger data-buffer.
+	 * Driver allocated meta buffer will also be small.
+	 * Device can do larger dma into that, overwriting unrelated kernel
+	 * memory.
+	 */
+	if (ns && (meta_len || meta || ns->features & NVME_NS_EXT_LBAS)) {
+		u16 nlb, control;
+		unsigned dlen, mlen;
+
+		/* Exclude commands that do not have nlb in cdw12 */
+		if (!nvme_nlb_in_cdw12(c->common.opcode))
+			return true;
+
+		control = upper_16_bits(le32_to_cpu(c->common.cdw12));
+		/* Exclude when meta transfer from/to host is not done */
+		if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
+			return true;
+
+		nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
+		mlen = (nlb + 1) * ns->ms;
+
+		/* sanity for interleaved buffer */
+		if (ns->features & NVME_NS_EXT_LBAS) {
+			dlen = (nlb + 1) << ns->lba_shift;
+			if (data_len < (dlen + mlen))
+				goto out_false;
+			return true;
+		}
+		/* sanity for separate meta buffer */
+		if (meta_len < mlen)
+			goto out_false;
+
+		return true;
+out_false:
+		dev_err(ctrl->device,
+			"%s: metadata length is small!\n", current->comm);
+		return false;
+	}
+
+	return true;
+}
+
 static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
 					struct nvme_ns *ns, __u32 nsid)
 {
@@ -364,6 +425,10 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	c.common.cdw14 = cpu_to_le32(cmd.cdw14);
 	c.common.cdw15 = cpu_to_le32(cmd.cdw15);
 
+	if (!nvme_validate_passthru_meta(ctrl, ns, &c, cmd.metadata,
+					 cmd.metadata_len, cmd.data_len))
+		return -EINVAL;
+
 	if (!nvme_cmd_allowed(ns, &c, 0, open_for_write))
 		return -EACCES;
 
@@ -411,6 +476,10 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	c.common.cdw14 = cpu_to_le32(cmd.cdw14);
 	c.common.cdw15 = cpu_to_le32(cmd.cdw15);
 
+	if (!nvme_validate_passthru_meta(ctrl, ns, &c, cmd.metadata,
+					 cmd.metadata_len, cmd.data_len))
+		return -EINVAL;
+
 	if (!nvme_cmd_allowed(ns, &c, flags, open_for_write))
 		return -EACCES;
 
@@ -593,6 +662,10 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	d.metadata_len = READ_ONCE(cmd->metadata_len);
 	d.timeout_ms = READ_ONCE(cmd->timeout_ms);
 
+	if (!nvme_validate_passthru_meta(ctrl, ns, &c, d.metadata,
+					 d.metadata_len, d.data_len))
+		return -EINVAL;
+
 	if (issue_flags & IO_URING_F_NONBLOCK) {
 		rq_flags |= REQ_NOWAIT;
 		blk_flags = BLK_MQ_REQ_NOWAIT;
-- 
2.25.1