User can specify a smaller meta buffer than what the device is wired to update/access. This may lead to Device doing a larger DMA operation, overwriting unrelated kernel memory.
Detect this situation for uring passthrough, and bail out.
Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device") Cc: stable@vger.kernel.org
Reported-by: Vincent Fu vincent.fu@samsung.com Signed-off-by: Kanchan Joshi joshi.k@samsung.com --- drivers/nvme/host/ioctl.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 19a5177bc360..717c7effaf8a 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -320,6 +320,35 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) meta_len, lower_32_bits(io.slba), NULL, 0, 0); }
+static bool nvme_validate_passthru_meta(struct nvme_ctrl *ctrl, + struct nvme_ns *ns, + struct nvme_command *c, + __u64 meta, __u32 meta_len) +{ + /* + * User may specify smaller meta-buffer with a larger data-buffer. + * Driver allocated meta buffer will also be small. + * Device can do larger dma into that, overwriting unrelated kernel + * memory. + */ + if (ns && (meta_len || meta)) { + u16 nlb = lower_16_bits(le32_to_cpu(c->common.cdw12)); + u16 control = upper_16_bits(le32_to_cpu(c->common.cdw12)); + + /* meta transfer from/to host is not done */ + if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size) + return true; + + if (meta_len != (nlb + 1) * ns->ms) { + dev_err(ctrl->device, + "%s: metadata length does not match!\n", current->comm); + return false; + } + } + + return true; +} + static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, struct nvme_ns *ns, __u32 nsid) { @@ -593,6 +622,10 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, d.metadata_len = READ_ONCE(cmd->metadata_len); d.timeout_ms = READ_ONCE(cmd->timeout_ms);
+ if (!nvme_validate_passthru_meta(ctrl, ns, &c, d.metadata, + d.metadata_len)) + return -EINVAL; + if (issue_flags & IO_URING_F_NONBLOCK) { rq_flags |= REQ_NOWAIT; blk_flags = BLK_MQ_REQ_NOWAIT;
I think the metadata size check is too strict. Commands where the metadata size is too small should result in errors but when the metadata size is larger than needed they should still go through.
In any case, I tested this patch on a QEMU NVMe device (which supports PI by default).
I formatted the device with a 512+16 lbaf with a separate buffer for metadata:
nvme format /dev/ng0n1 -m 0 -i 1 -p 0 --lbaf 2 --force
Using the latest fio I wrote some data to it:
./fio --name=difdix --ioengine=io_uring_cmd --cmd_type=nvme \ --filename=/dev/ng0n1 --rw=write --bs=512 --md_per_io_size=16 --pi_act=1 \ --pi_chk=APPTAG --apptag=0x8888 --apptag_mask=0xFFFF --number_ios=128
Then I wrote a small program to read 4096 bytes from the device with only a 16-byte (instead of 64-byte) metadata buffer. Without this patch the kernel crashes. With the patch the read fails with an error message in the kernel log.
Tested-by: Vincent Fu vincent.fu@samsung.com
On 8/31/2023 7:39 PM, Vincent Fu wrote:
I think the metadata size check is too strict. Commands where the metadata size is too small should result in errors but when the metadata size is larger than needed they should still go through.
Indeed. I will fold that change in the next version.
In any case, I tested this patch on a QEMU NVMe device (which supports PI by default).
I formatted the device with a 512+16 lbaf with a separate buffer for metadata:
nvme format /dev/ng0n1 -m 0 -i 1 -p 0 --lbaf 2 --force
Using the latest fio I wrote some data to it:
./fio --name=difdix --ioengine=io_uring_cmd --cmd_type=nvme \ --filename=/dev/ng0n1 --rw=write --bs=512 --md_per_io_size=16 --pi_act=1 \ --pi_chk=APPTAG --apptag=0x8888 --apptag_mask=0xFFFF --number_ios=128
Then I wrote a small program to read 4096 bytes from the device with only a 16-byte (instead of 64-byte) metadata buffer. Without this patch the kernel crashes. With the patch the read fails with an error message in the kernel log.
Tested-by: Vincent Fu vincent.fu@samsung.com
Thanks.
On Mon, Aug 14, 2023 at 12:32:12PM +0530, Kanchan Joshi wrote:
+static bool nvme_validate_passthru_meta(struct nvme_ctrl *ctrl,
struct nvme_ns *ns,
struct nvme_command *c,
__u64 meta, __u32 meta_len)
+{
- /*
* User may specify smaller meta-buffer with a larger data-buffer.
* Driver allocated meta buffer will also be small.
* Device can do larger dma into that, overwriting unrelated kernel
* memory.
*/
What if the user doesn't specify metadata or length for a command that uses it? The driver won't set MPTR in that case, causing the device to access NULL.
And similiar to this problem, what if the metadata is extended rather than separate, and the user's buffer is too short? That will lead to the same type of problem you're trying to fix here?
My main concern, though, is forward and backward compatibility. Even when metadata is enabled, there are IO commands that don't touch it, so some tool that erroneously requested it will stop working. Or perhaps some other future opcode will have some other metadata use that doesn't match up exactly with how read/write/compare/append use it. As much as I'd like to avoid bad user commands from crashing, these kinds of checks can become problematic for maintenance. I realize we already do similiar sanity checks in nvme_submit_io(), but that one is confined to only 3 opcodes where this interface you're changing is much more flexible.
On Fri, Sep 01, 2023 at 10:45:50AM -0400, Keith Busch wrote:
On Mon, Aug 14, 2023 at 12:32:12PM +0530, Kanchan Joshi wrote:
+static bool nvme_validate_passthru_meta(struct nvme_ctrl *ctrl,
struct nvme_ns *ns,
struct nvme_command *c,
__u64 meta, __u32 meta_len)
+{
- /*
* User may specify smaller meta-buffer with a larger data-buffer.
* Driver allocated meta buffer will also be small.
* Device can do larger dma into that, overwriting unrelated kernel
* memory.
*/
What if the user doesn't specify metadata or length for a command that uses it? The driver won't set MPTR in that case, causing the device to access NULL.
That is fine, because kernel is not going to allocate memory for that case. I am not trying to provide any saftey net to user-space in this patch. Rather, the objective is to prevent kernel-memory corruption.
And similiar to this problem, what if the metadata is extended rather than separate, and the user's buffer is too short? That will lead to the same type of problem you're trying to fix here?
No. For extended metadata, userspace is using its own buffer. Since intermediate kernel buffer does not exist, I do not have a problem to solve.
My main concern, though, is forward and backward compatibility. Even when metadata is enabled, there are IO commands that don't touch it, so some tool that erroneously requested it will stop working. Or perhaps some other future opcode will have some other metadata use that doesn't match up exactly with how read/write/compare/append use it. As much as I'd like to avoid bad user commands from crashing, these kinds of checks can become problematic for maintenance.
For forward compatibility - if we have commands that need to specify metadata in a different way (than what is possible from this interface), we anyway need a new passthrough command structure. Moreover, it's really about caring _only_ for cases when kernel allocates memory for metadata. And those cases are specific (i.e., when metadata and metalen are not zero). We don't have to think in terms of opcode (existing or future), no?
For backward comptability, should we care about erroneous tools. My concern is we currenly have a hole that can be exploited to affect other sane applications and bring the kernel down to its knees.
On Tue, Sep 05, 2023 at 10:48:25AM +0530, Kanchan Joshi wrote:
On Fri, Sep 01, 2023 at 10:45:50AM -0400, Keith Busch wrote:
And similiar to this problem, what if the metadata is extended rather than separate, and the user's buffer is too short? That will lead to the same type of problem you're trying to fix here?
No. For extended metadata, userspace is using its own buffer. Since intermediate kernel buffer does not exist, I do not have a problem to solve.
We still use kernel memory if the user buffer is unaligned. If the user space provides an short unaligned buffer, the device will corrupt kernel memory.
My main concern, though, is forward and backward compatibility. Even when metadata is enabled, there are IO commands that don't touch it, so some tool that erroneously requested it will stop working. Or perhaps some other future opcode will have some other metadata use that doesn't match up exactly with how read/write/compare/append use it. As much as I'd like to avoid bad user commands from crashing, these kinds of checks can become problematic for maintenance.
For forward compatibility - if we have commands that need to specify metadata in a different way (than what is possible from this interface), we anyway need a new passthrough command structure.
Not sure about that. The existing struct is flexible enough to describe any possible nvme command.
More specifically about compatibility is that this patch assumes an "nlb" field exists inside an opaque structure at DW12 offset, and that field defines how large the metadata buffer needs to be. Some vendor specific or future opcode may have DW12 mean something completely different, but still need to access metadata this patch may prevent from working.
Moreover, it's really about caring _only_ for cases when kernel allocates memory for metadata. And those cases are specific (i.e., when metadata and metalen are not zero). We don't have to think in terms of opcode (existing or future), no?
It looks like a little work, but I don't see why blk-integrity must use kernel memory. Introducing an API like 'bio_integrity_map_user()' might also address your concern, as long as the user buffer is aligned. It sounds like we're assuming user buffers are aligned, at least.
On Tue, Sep 05, 2023 at 12:08:40PM -0600, Keith Busch wrote:
On Tue, Sep 05, 2023 at 10:48:25AM +0530, Kanchan Joshi wrote:
On Fri, Sep 01, 2023 at 10:45:50AM -0400, Keith Busch wrote:
And similiar to this problem, what if the metadata is extended rather than separate, and the user's buffer is too short? That will lead to the same type of problem you're trying to fix here?
No. For extended metadata, userspace is using its own buffer. Since intermediate kernel buffer does not exist, I do not have a problem to solve.
We still use kernel memory if the user buffer is unaligned. If the user space provides an short unaligned buffer, the device will corrupt kernel memory.
Ah yes. blk_rq_map_user_iov() does make a copy of user-buffer in that case.
My main concern, though, is forward and backward compatibility. Even when metadata is enabled, there are IO commands that don't touch it, so some tool that erroneously requested it will stop working. Or perhaps some other future opcode will have some other metadata use that doesn't match up exactly with how read/write/compare/append use it. As much as I'd like to avoid bad user commands from crashing, these kinds of checks can become problematic for maintenance.
For forward compatibility - if we have commands that need to specify metadata in a different way (than what is possible from this interface), we anyway need a new passthrough command structure.
Not sure about that. The existing struct is flexible enough to describe any possible nvme command.
More specifically about compatibility is that this patch assumes an "nlb" field exists inside an opaque structure at DW12 offset, and that field defines how large the metadata buffer needs to be. Some vendor specific or future opcode may have DW12 mean something completely different, but still need to access metadata this patch may prevent from working.
Right. It almost had me dropping the effort. But given the horrible bug at hand, added an untested patch [1] that handles all the shortcomings you mentioned. Please take a look.
Moreover, it's really about caring _only_ for cases when kernel allocates memory for metadata. And those cases are specific (i.e., when metadata and metalen are not zero). We don't have to think in terms of opcode (existing or future), no?
It looks like a little work, but I don't see why blk-integrity must use kernel memory. Introducing an API like 'bio_integrity_map_user()' might also address your concern, as long as the user buffer is aligned. It sounds like we're assuming user buffers are aligned, at least.
Would you really prefer to have nvme_add_user_metadata() changed to do away with allocation and use userspace meta-buffer directly? Even with that route, extended-lba-with-short-unaligned-buffer remains unhandled. That will still require similar checks that I would like to avoid but cannnot.
So how about this -
[1] diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index d8ff796fd5f2..d09b5691da3e 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -320,6 +320,67 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) meta_len, lower_32_bits(io.slba), NULL, 0, 0); }
+static inline bool nvme_nlb_in_cdw12(u8 opcode) +{ + switch(opcode) { + case nvme_cmd_read: + case nvme_cmd_write: + case nvme_cmd_compare: + case nvme_cmd_zone_append: + return true; + } + return false; +} + +static bool nvme_validate_passthru_meta(struct nvme_ctrl *ctrl, + struct nvme_ns *ns, + struct nvme_command *c, + __u64 meta, __u32 meta_len, + unsigned data_len) +{ + /* + * User may specify smaller meta-buffer with a larger data-buffer. + * Driver allocated meta buffer will also be small. + * Device can do larger dma into that, overwriting unrelated kernel + * memory. + */ + if (ns && (meta_len || meta || ns->features & NVME_NS_EXT_LBAS)) { + u16 nlb, control; + unsigned dlen, mlen; + + /* Exclude commands that do not have nlb in cdw12 */ + if (!nvme_nlb_in_cdw12(c->common.opcode)) + return true; + + control = upper_16_bits(le32_to_cpu(c->common.cdw12)); + /* Exclude when meta transfer from/to host is not done */ + if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size) + return true; + + nlb = lower_16_bits(le32_to_cpu(c->common.cdw12)); + mlen = (nlb + 1) * ns->ms; + + /* sanity for interleaved buffer */ + if (ns->features & NVME_NS_EXT_LBAS) { + dlen = (nlb + 1) << ns->lba_shift; + if (data_len < (dlen + mlen)) + goto out_false; + return true; + } + /* sanity for separate meta buffer */ + if (meta_len < mlen) + goto out_false; + + return true; +out_false: + dev_err(ctrl->device, + "%s: metadata length is small!\n", current->comm); + return false; + } + + return true; +} + static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, struct nvme_ns *ns, __u32 nsid) { @@ -364,6 +425,10 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, c.common.cdw14 = cpu_to_le32(cmd.cdw14); c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+ if (!nvme_validate_passthru_meta(ctrl, ns, &c, cmd.metadata, + cmd.metadata_len, cmd.data_len)) + return -EINVAL; + if (!nvme_cmd_allowed(ns, &c, 0, open_for_write)) return -EACCES;
@@ -411,6 +476,10 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, c.common.cdw14 = cpu_to_le32(cmd.cdw14); c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+ if (!nvme_validate_passthru_meta(ctrl, ns, &c, cmd.metadata, + cmd.metadata_len, cmd.data_len)) + return -EINVAL; + if (!nvme_cmd_allowed(ns, &c, flags, open_for_write)) return -EACCES;
@@ -593,6 +662,10 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, d.metadata_len = READ_ONCE(cmd->metadata_len); d.timeout_ms = READ_ONCE(cmd->timeout_ms);
+ if (!nvme_validate_passthru_meta(ctrl, ns, &c, d.metadata, + d.metadata_len, d.data_len)) + return -EINVAL; + if (issue_flags & IO_URING_F_NONBLOCK) { rq_flags |= REQ_NOWAIT; blk_flags = BLK_MQ_REQ_NOWAIT; -- 2.25.1
On Wed, Sep 06, 2023 at 09:18:15PM +0530, Kanchan Joshi wrote:
Would you really prefer to have nvme_add_user_metadata() changed to do away with allocation and use userspace meta-buffer directly?
I mean, sure, if it's possible. We can avoid a costly copy if the user metabuffer is aligned and physically contiguous.
Even with that route, extended-lba-with-short-unaligned-buffer remains unhandled. That will still require similar checks that I would like to avoid but cannnot.
So how about this -
There's lots of bad things you can do with this interface. Example, provide an unaligned single byte user buffer and send an Identify command.
We never provided opcode decoding sanity checks before because it's a bad maintenance burden, adds performance killing overhead, couldn't catch all the cases anyway due to vendor specific and future opcodes, and harms the flexibility of the interface. The burden is usually on the user for these kinds of priviledged interfaces: if you abuse it, "you get to keep both pieces" territory.
On Thu, Sep 7, 2023 at 9:11 PM Keith Busch kbusch@kernel.org wrote:
On Wed, Sep 06, 2023 at 09:18:15PM +0530, Kanchan Joshi wrote:
Would you really prefer to have nvme_add_user_metadata() changed to do away with allocation and use userspace meta-buffer directly?
I mean, sure, if it's possible. We can avoid a costly copy if the user metabuffer is aligned and physically contiguous.
Seems possible, but that does not really solve the actual problem (which is not performance) this patch is for. It will require replicating big code of blk_rq_map_user_iov() for integrity metadata and map the pages into bip. But since the user-space meta buffer can be unaligned (and bunch of other conditions present there), it has to make the meta copy in kernel-space. And we will be back to where we started - how to avoid corruption into kernel memory.
Same situation for the case when we are dealing with extended-lba-format and interleaved user-buffer is unaligned. .
Handling both these anyway requires adding the kind of code/checks mentioned in the previous patch. Do you see another way?
While I agree there is value in avoiding the meta copy in general and I can look into it, but that should be a separate effort (with focus on performance).
Even with that route, extended-lba-with-short-unaligned-buffer remains unhandled. That will still require similar checks that I would like to avoid but cannnot.
So how about this -
There's lots of bad things you can do with this interface. Example, provide an unaligned single byte user buffer and send an Identify command.
Not sure I follow. Do you mean the patch does not handle these cases?
We never provided opcode decoding sanity checks before because it's a bad maintenance burden, adds performance killing overhead, couldn't catch all the cases anyway due to vendor specific and future opcodes, and harms the flexibility of the interface.
Given the way things are (integrity schemes, cdw12 etc.), I do not see a way to avoid opcode checks. Flexibility is not getting reduced in the previous patch. All the other commands (beyond read/write/compare/append) remain untouched. And metadata-io is not the fast path at the moment (given memory allocation, bunch of extra things by blk-integrity, bunch of extra things done by device etc.).
The burden is usually on the user for these kinds of priviledged interfaces: if you abuse it, "you get to keep both pieces" territory.
Not sure I got that. Have you seen the crash mentioned in this cover letter: https://lore.kernel.org/linux-nvme/20230811155906.15883-1-joshi.k@samsung.co... A simple unprivileged read by a rogue application can wreck other applications/system at the moment. Is it fine to keep the status quo?
linux-stable-mirror@lists.linaro.org