The user can specify a smaller meta buffer than what the device is wired to update/access. The kernel makes a copy of the meta buffer, and the device does DMA into that copy. As a result, the device overwrites unrelated kernel memory, causing random kernel crashes.
The same issue is possible for the extended-LBA case too: when the user specifies a short unaligned buffer, the kernel makes a copy and uses that for DMA.
Detect these situations for sync/uring passthrough, and bail out.
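For illustration, a minimal sketch of the mismatch the check guards against, using the same fields the patch below reads (not part of the patch itself):

	/* 'c' is the passthrough command, 'ns' the target namespace */
	u32 nlb  = lower_16_bits(le32_to_cpu(c->common.cdw12));	/* NLB is 0's based */
	u32 mlen = (nlb + 1) * ns->ms;		/* metadata bytes the device will DMA */

	/*
	 * If the user passed meta_len < mlen, the kmalloc'ed kernel copy is
	 * also only meta_len bytes long, and the device writes past its end.
	 */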
Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")
Cc: stable@vger.kernel.org
Reported-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
Changes since v2:
- Handle the extended-lba case: short unaligned buffer IO
- Reduce the scope of the check to only well-known commands
- Do not check early; move it deeper so that the check gets executed less often
- Combine the two patches into one

Changes since v1:
- Revise the check to exclude the PRACT=1 case
Random crash example:
[ 6815.014478] general protection fault, probably for non-canonical address 0x70e3cdbe9133b7a6: 0000 [#1] PREEMPT SMP PTI
[ 6815.014505] CPU: 1 PID: 434 Comm: systemd-timesyn Tainted: G OE 6.4.0-rc3+ #5
[ 6815.014516] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 6815.014522] RIP: 0010:__kmem_cache_alloc_node+0x100/0x440
[ 6815.014551] Code: 48 85 c0 0f 84 fb 02 00 00 41 83 ff ff 74 10 48 8b 00 48 c1 e8 36 41 39 c7 0f 85 e5 02 00 00 41 8b 45 28 49 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9d b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[ 6815.014559] RSP: 0018:ffffb510c0577d18 EFLAGS: 00010216
[ 6815.014569] RAX: 70e3cdbe9133b7a6 RBX: ffff8a9ec1042300 RCX: 0000000000000010
[ 6815.014575] RDX: 00000000048b0001 RSI: 0000000000000dc0 RDI: 0000000000037060
[ 6815.014581] RBP: ffffb510c0577d58 R08: ffffffffb9ffa280 R09: 0000000000000000
[ 6815.014586] R10: ffff8a9ecbcab1f0 R11: 0000000000000000 R12: 70e3cdbe9133b79e
[ 6815.014591] R13: ffff8a9ec1042300 R14: 0000000000000dc0 R15: 00000000ffffffff
[ 6815.014597] FS: 00007fce590d6940(0000) GS:ffff8a9f3dd00000(0000) knlGS:0000000000000000
[ 6815.014604] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6815.014609] CR2: 00005579abbb6498 CR3: 000000000d9b0000 CR4: 00000000000006e0
[ 6815.014622] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6815.014627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6815.014632] Call Trace:
[ 6815.014650]  <TASK>
[ 6815.014655]  ? apparmor_sk_alloc_security+0x40/0x80
[ 6815.014673]  kmalloc_trace+0x2a/0xa0
[ 6815.014684]  apparmor_sk_alloc_security+0x40/0x80
[ 6815.014694]  security_sk_alloc+0x3f/0x60
[ 6815.014703]  sk_prot_alloc+0x75/0x110
[ 6815.014712]  sk_alloc+0x31/0x200
[ 6815.014721]  inet_create+0xd8/0x3a0
[ 6815.014734]  __sock_create+0x11b/0x220
[ 6815.014749]  __sys_socket_create.part.0+0x49/0x70
[ 6815.014756]  ? __secure_computing+0x94/0xf0
[ 6815.014768]  __sys_socket+0x3c/0xc0
[ 6815.014776]  __x64_sys_socket+0x1a/0x30
[ 6815.014783]  do_syscall_64+0x3b/0x90
[ 6815.014794]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 6815.014804] RIP: 0033:0x7fce59aa795b
 drivers/nvme/host/ioctl.c | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d8ff796fd5f2..8147750beff4 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -158,6 +158,67 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 	return req;
 }
 
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
+
+static bool nvme_validate_passthru_meta(struct nvme_ns *ns,
+					struct nvme_command *c,
+					__u32 meta_len,
+					unsigned data_len)
+{
+	/*
+	 * User may specify smaller meta-buffer with a larger data-buffer.
+	 * Driver allocated meta buffer will also be small.
+	 * Device can do larger dma into that, overwriting unrelated kernel
+	 * memory.
+	 * For extended lba case, if user provides a short unaligned buffer,
+	 * the device will corrupt kernel memory.
+	 * Avoid running into corruption in both situations.
+	 */
+	bool ext_lba = ns->features & NVME_NS_EXT_LBAS;
+	u16 nlb, control;
+	unsigned dlen, mlen;
+
+	/* Exclude commands that do not have nlb in cdw12 */
+	if (!nvme_nlb_in_cdw12(c->common.opcode))
+		return true;
+
+	control = upper_16_bits(le32_to_cpu(c->common.cdw12));
+	/* Exclude when meta transfer from/to host is not done */
+	if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
+		return true;
+
+	nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
+	mlen = (nlb + 1) * ns->ms;
+
+	/* sanity for interleaved buffer */
+	if (ext_lba) {
+		dlen = (nlb + 1) << ns->lba_shift;
+		if (data_len < (dlen + mlen))
+			goto out_false;
+		return true;
+	}
+	/* sanity for separate meta buffer */
+	if (meta_len < mlen)
+		goto out_false;
+
+	return true;
+
+out_false:
+	dev_err(ns->ctrl->device,
+		"%s: metadata length is small!\n", current->comm);
+	return false;
+}
+
 static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
 		u32 meta_seed, void **metap, struct io_uring_cmd *ioucmd,
@@ -194,6 +255,12 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		bio_set_dev(bio, bdev);
 
 	if (bdev && meta_buffer && meta_len) {
+		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
+				meta_len, bufflen)) {
+			ret = -EINVAL;
+			goto out_unmap;
+		}
+
 		meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
 				meta_seed);
 		if (IS_ERR(meta)) {
@@ -203,6 +270,15 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		*metap = meta;
 	}
 
+	/* guard another case when kernel memory is being used */
+	if (bio->bi_private && ns && ns->features & NVME_NS_EXT_LBAS) {
+		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
+				meta_len, bufflen)) {
+			ret = -EINVAL;
+			goto out_unmap;
+		}
+	}
+
 	return ret;
 
 out_unmap:
On Fri, Oct 06, 2023 at 07:17:06PM +0530, Kanchan Joshi wrote:
The same issue is possible for the extended-LBA case too: when the user specifies a short unaligned buffer, the kernel makes a copy and uses that for DMA.
I fail to understand the extended-LBA case, and from looking at the code, mixing it up with the validation of metadata_len seems very confusing. Can you try to explain it clearly, and maybe split it into a separate patch?
Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")
Is this really io_uring specific? I think we also had the same issue before and this should go back to adding metadata support to the general passthrough ioctl?
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
Nitpick: I find it nicer to have a switch that catches everything with a default statement instead of falling out of it for checks like this. It makes no difference in practice, but it just reads a little nicer.
+	/* Exclude commands that do not have nlb in cdw12 */
+	if (!nvme_nlb_in_cdw12(c->common.opcode))
+		return true;
So we can still get exactly the same corruption for all commands that are not known? That's not a very safe way to deal with the issue..
+	control = upper_16_bits(le32_to_cpu(c->common.cdw12));
+	/* Exclude when meta transfer from/to host is not done */
+	if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
+		return true;
+
+	nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
I'd use the rw field of the union and the typed control and length fields to clean this up a bit.
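A rough sketch of what that cleanup could look like (Christoph's own patch further below does essentially this):

	/* use the typed rw fields of the union instead of open-coding cdw12 */
	u32 nlb = le16_to_cpu(c->rw.length) + 1;	/* length is 0's based */
	u32 mlen;

	if ((le16_to_cpu(c->rw.control) & NVME_RW_PRINFO_PRACT) &&
	    ns->ms == ns->pi_size)
		return true;	/* PI generated/stripped, no metadata transfer */
	mlen = nlb * ns->ms;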
 	if (bdev && meta_buffer && meta_len) {
+		if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
+				meta_len, bufflen)) {
+			ret = -EINVAL;
+			goto out_unmap;
+		}
+
 		meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
I'd move the check into nvme_add_user_metadata to keep it out of the hot path.
FYI: here is what I'd do for the external metadata only case:
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d8ff796fd5f21d..bf22c7953856f5 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -96,6 +96,71 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
 	return (void __user *)ptrval;
 }
 
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * NVMe has no separate field to encode the metadata length expected
+ * (except when using SGLs).
+ *
+ * Because of that we can't allow to transfer arbitrary metadata, as
+ * a metadata buffer that is shorter than what the device expects for
+ * the command will lead to arbitrary kernel (if bounce buffering) or
+ * userspace (if not) memory corruption.
+ *
+ * Check that external metadata is only specified for the few commands
+ * where we know the length based on other fields, and that it fits
+ * the actual data transfer from/to the device.
+ */
+static bool nvme_validate_metadata_len(struct request *req, unsigned meta_len)
+{
+	struct nvme_ns *ns = req->q->queuedata;
+	struct nvme_command *c = nvme_req(req)->cmd;
+	u32 len_by_nlb;
+
+	/* Exclude commands that do not have nlb in cdw12 */
+	if (!nvme_nlb_in_cdw12(c->common.opcode)) {
+		dev_err(ns->ctrl->device,
+			"unknown metadata command 0x%x (%s).\n",
+			c->common.opcode, current->comm);
+		return false;
+	}
+
+	/*
+	 * Skip the check when PI is inserted or stripped and not transferred.
+	 */
+	if (ns->ms == ns->pi_size &&
+	    (c->rw.control & cpu_to_le16(NVME_RW_PRINFO_PRACT)))
+		return true;
+
+	if (ns->features & NVME_NS_EXT_LBAS) {
+		dev_err(ns->ctrl->device,
+			"requires extended LBAs for metadata (%s).\n",
+			current->comm);
+		return false;
+	}
+
+	len_by_nlb = (le16_to_cpu(c->rw.length) + 1) * ns->ms;
+	if (meta_len < len_by_nlb) {
+		dev_err(ns->ctrl->device,
+			"metadata length (%u instead of %u) is too small (%s).\n",
+			meta_len, len_by_nlb, current->comm);
+		return false;
+	}
+
+	return true;
+}
+
 static void *nvme_add_user_metadata(struct request *req, void __user *ubuf,
 		unsigned len, u32 seed)
 {
@@ -104,6 +169,9 @@ static void *nvme_add_user_metadata(struct request *req, void __user *ubuf,
 	void *buf;
 	struct bio *bio = req->bio;
 
+	if (!nvme_validate_metadata_len(req, len))
+		return ERR_PTR(-EINVAL);
+
 	buf = kmalloc(len, GFP_KERNEL);
 	if (!buf)
 		goto out;
On Tue, Oct 10, 2023 at 1:16 PM Christoph Hellwig <hch@lst.de> wrote:
On Fri, Oct 06, 2023 at 07:17:06PM +0530, Kanchan Joshi wrote:
The same issue is possible for the extended-LBA case too: when the user specifies a short unaligned buffer, the kernel makes a copy and uses that for DMA.
I fail to understand the extended-LBA case, and from looking at the code, mixing it up with the validation of metadata_len seems very confusing. Can you try to explain it clearly, and maybe split it into a separate patch?
The case is for a single interleaved buffer carrying both data and metadata. When the driver sends this buffer to blk_rq_map_user_iov(), it may make a copy of it. That kernel buffer, rather than the user buffer, is then used for DMA. If the user buffer is short, the kernel buffer is also short.
Does this explanation help? I can move the part to a separate patch.
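To make it concrete, a hedged userspace sketch (the device path, namespace, and LBA format below are assumptions): an extended-LBA read through the passthrough ioctl whose interleaved buffer is both unaligned and shorter than (nlb + 1) * (lba_size + ms), so blk_rq_map_user_iov() bounces it and the device overruns the kernel copy:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	/* assumed geometry: 512B LBAs with 8B interleaved metadata (extended LBA) */
	const uint32_t lba_size = 512, ms = 8, nlb = 7;		/* NLB is 0's based: 8 blocks */
	const uint32_t expect = (nlb + 1) * (lba_size + ms);	/* 4160 bytes moved by the device */
	static char buf[2048 + 1];
	char *ubuf = buf + 1;					/* unaligned, so the kernel bounces it */

	struct nvme_passthru_cmd64 cmd = {
		.opcode   = 0x02,		/* nvme_cmd_read */
		.nsid     = 1,
		.addr     = (uintptr_t)ubuf,
		.data_len = 2048,		/* shorter than 'expect': device overruns the bounce buffer */
		.cdw10    = 0,			/* SLBA, low 32 bits */
		.cdw12    = nlb,
	};
	int fd = open("/dev/ng0n1", O_RDONLY);	/* char device node; path is an assumption */

	if (fd < 0 || ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd) < 0)
		perror("passthru");
	fprintf(stderr, "device transfers %u bytes, user buffer holds %u\n", expect, cmd.data_len);
	return 0;
}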
Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device")
Is this really io_uring specific? I think we also had the same issue before and this should go back to adding metadata support to the general passthrough ioctl?
Yes, not io_uring specific. Just that I was not sure on (i) whether to go back that far in history, and (ii) what patch to tag.
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
Nitpick: I find it nicer to have a switch that catches everything with a default statement instead of falling out of it for checks like this. It makes no difference in practice, but it just reads a little nicer.
Sure, I can change it.
/* Exclude commands that do not have nlb in cdw12 */
if (!nvme_nlb_in_cdw12(c->common.opcode))
return true;
So we can still get exactly the same corruption for all commands that are not known? That's not a very safe way to deal with the issue..
Given the way things are in NVMe, I do not find a better way. Maybe another day for commands that do (or can do) things very differently for nlb and PI representation.
control = upper_16_bits(le32_to_cpu(c->common.cdw12));
/* Exclude when meta transfer from/to host is not done */
if (control & NVME_RW_PRINFO_PRACT && ns->ms == ns->pi_size)
return true;
nlb = lower_16_bits(le32_to_cpu(c->common.cdw12));
I'd use the rw field of the union and the typed control and length fields to clean this up a bit.
if (bdev && meta_buffer && meta_len) {
if (!nvme_validate_passthru_meta(ns, nvme_req(req)->cmd,
meta_len, bufflen)) {
ret = -EINVAL;
goto out_unmap;
}
meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
I'd move the check into nvme_add_user_metadata to keep it out of the hot path.
FYI: here is what I'd do for the external metadata only case:
Since you have improved the comments too, I may just use this for the next iteration.
On Tuesday, October 10, 2023 6:40 AM Kanchan Joshi <joshi.k@samsung.com> wrote:
+static inline bool nvme_nlb_in_cdw12(u8 opcode)
+{
+	switch (opcode) {
+	case nvme_cmd_read:
+	case nvme_cmd_write:
+	case nvme_cmd_compare:
+	case nvme_cmd_zone_append:
+		return true;
+	}
+	return false;
+}
Nitpick: I find it nicer to have a switch that catches everything with a default statement instead of falling out of it for checks like this. It makes no difference in practice, but it just reads a little nicer.
Sure, I can change it.
What if the ns uses the KV command set? Store and Retrieve use the same opcodes as nvme_cmd_write and nvme_cmd_read.
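If that needs guarding, one possible sketch (an assumption on my part, keyed off the CSI the driver records in ns->head->ids.csi) is to restrict the opcode-based decoding to command sets where NLB actually lives in cdw12:

	/* sketch: NLB-in-cdw12 decoding only applies to the NVM and Zoned command sets */
	switch (ns->head->ids.csi) {
	case NVME_CSI_NVM:
	case NVME_CSI_ZNS:
		break;
	default:
		/* e.g. KV: opcodes 0x01/0x02 are Store/Retrieve, no NLB in cdw12 */
		return false;
	}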
On Tue, Oct 10, 2023 at 07:09:54PM +0530, Kanchan Joshi wrote:
The case is for a single interleaved buffer carrying both data and metadata. When the driver sends this buffer to blk_rq_map_user_iov(), it may make a copy of it. That kernel buffer, rather than the user buffer, is then used for DMA. If the user buffer is short, the kernel buffer is also short.
Yes. Note that we'll corrupt memory either way, so user vs kernel does not matter.
Does this explanation help? I can move the part to a separate patch.
Definitely a separate function please; not sure if a separate patch is required.
Yes, not io_uring specific. Just that I was not sure on (i) whether to go back that far in history, and (ii) what patch to tag.
I think the one that adds the original problem is:
63263d60e0f9f37bfd5e6a1e83a62f0e62fc459f
Author: Keith Busch <kbusch@kernel.org>
Date:   Tue Aug 29 17:46:04 2017 -0400

    nvme: Use metadata for passthrough commands
/* Exclude commands that do not have nlb in cdw12 */
if (!nvme_nlb_in_cdw12(c->common.opcode))
return true;
So we can still get exactly the same corruption for all commands that are not known? That's not a very safe way to deal with the issue..
Given the way things are in NVMe, I do not find a better way. Maybe another day for commands that do (or can do) things very differently for nlb and PI representation.
Fixing just a subset of these problems is pointless. If people want to use metadata on vendor specific commands they need to work with NVMe to figure out a generic way to pass the length.
On 10/11/2023 10:32 AM, Christoph Hellwig wrote:
Just that I was not sure on (i) whether to go back that far in history, and (ii) what patch to tag.
I think the one that adds the original problem is:
63263d60e0f9f37bfd5e6a1e83a62f0e62fc459f
Author: Keith Busch <kbusch@kernel.org>
Date:   Tue Aug 29 17:46:04 2017 -0400

    nvme: Use metadata for passthrough commands
Thanks.
/* Exclude commands that do not have nlb in cdw12 */
if (!nvme_nlb_in_cdw12(c->common.opcode))
return true;
So we can still get exactly the same corruption for all commands that are not known? That's not a very safe way to deal with the issue..
Given the way things are in NVMe, I do not find a better way. Maybe another day for commands that do (or can do) things very differently for nlb and PI representation.
Fixing just a subset of these problems is pointless. If people want to use metadata on vendor specific commands they need to work with NVMe to figure out a generic way to pass the length.
Do you suggest that vendor specific opcodes should be blocked here?
On Wed, Oct 11, 2023 at 10:56:44AM +0530, Kanchan Joshi wrote:
Fixing just a subset of these problems is pointless. If people want to use metadata on vendor specific commands they need to work with NVMe to figure out a generic way to pass the length.
Do you suggest that vendor specific opcodes should be blocked here?
We have to block everything that we can't calculate the length for. Otherwise you still leave the hole open.
On Wed, Oct 11, 2023 at 07:02:54AM +0200, Christoph Hellwig wrote:
On Tue, Oct 10, 2023 at 07:09:54PM +0530, Kanchan Joshi wrote:
Given the way things are in NVMe, I do not find a better way. Maybe another day for commands that do (or can do) things very differently for nlb and PI representation.
Fixing just a subset of these problems is pointless. If people want to use metadata on vendor specific commands they need to work with NVMe to figure out a generic way to pass the length.
NVMe already tried to solve that with NDT and NDM fields, but no vendor implemented it. Maybe just require SGL's for passthrough IO since that encodes the buffer sizes.
I don't think it's reasonable for the driver to decode every passthrough command to validate the data lengths, or reject ones that we don't know how to decode. SG_IO doesn't do that either.
On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
Given the way things are in NVMe, I do not find a better way. Maybe another day for commands that do (or can do) things very differently for nlb and PI representation.
Fixing just a subset of these problems is pointless. If people want to use metadata on vendor specific commands they need to work with NVMe to figure out a generic way to pass the length.
NVMe already tried to solve that with NDT and NDM fields, but no vendor implemented it. Maybe just require SGL's for passthrough IO since that encodes the buffer sizes.
How is that going to help us with metadata where neither we, nor most devices support SGLs?
I don't think it's reasonable for the driver to decode every passthrough command to validate the data lengths, or reject ones that we don't know how to decode. SG_IO doesn't do that either.
I don't want that either, but what can we do against a (possibly unprivileged) user corrupting data?
SCSI has it much easier because it has an explicit data transfer length (outside the CDB) instead of trying to build it from information that differs per opcode. One of the many places where it shows that NVMe is a very sloppy and badly thought out protocol.
On Thu, Oct 12, 2023 at 06:36:52AM +0200, Christoph Hellwig wrote:
On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
I don't think it's reasonable for the driver to decode every passthrough command to validate the data lengths, or reject ones that we don't know how to decode. SG_IO doesn't do that either.
I don't want that either, but what can we do against a (possibly unprivileged) user corrupting data?
The unprivileged access is kind of recent. Maybe limit the scope of decoding to that usage?
We've always known the interface can be misused to corrupt memory and/or data, and it was always the user's responsibility to use this interface responsibly. We shouldn't disable something people have relied on for over 10 years just because someone rediscovered ways to break it.
It's not like this is a "metadata" specific thing either; you can provide short user space buffers and corrupt memory with regular admin commands, and we have been able to do that from day 1. But if you abuse this interface, it was always your fault; the kernel never took responsibility to sanity check your nvme input, and I think it's a bad precedent to start doing it.
SCSI has it much easier because it has an explicit data transfer length (outside the CDB) instead of trying to build it from information that differs per opcode. One of the many places where it shows that NVMe is a very sloppy and badly thought out protocol.
Yeah, implicit PRP length has often been reported as one of the early protocol "regrets"...
On Thu, Oct 12, 2023 at 09:31:38AM -0600, Keith Busch wrote:
I don't want that either, but what can we do against a (possibly unprivileged) user corrupting data?
The unprivileged access is kind of recent. Maybe limit the scope of decoding to that usage?
Let's just drop support for unprivileged passthrough for now. That's easily backportable and gives us time to sort out what we can do. Probably only allowing it when SGLs are in use, including a flag to force using it.
On Thu, Oct 12, 2023 at 9:01 PM Keith Busch <kbusch@kernel.org> wrote:
On Thu, Oct 12, 2023 at 06:36:52AM +0200, Christoph Hellwig wrote:
On Wed, Oct 11, 2023 at 11:04:58AM -0600, Keith Busch wrote:
I don't think it's reasonable for the driver to decode every passthrough command to validate the data lengths, or reject ones that we don't know how to decode. SG_IO doesn't do that either.
I don't want that either, but what can we do against a (possibly unprivileged) user corrupting data?
The unprivileged access is kind of recent. Maybe limit the scope of decoding to that usage?
I can send an iteration today that takes this route. Maybe that can be considered over dropping a useful feature.
We've always known the interface can be misused to corrupt memory and/or data, and it was always the user's responsibility to use this interface responsibly. We shouldn't disable something people have relied on for over 10 years just because someone rediscovered ways to break it.
It's not like this is a "metadata" specific thing either; you can provide short user space buffers and corrupt memory with regular admin commands, and we have been able to do that from day 1. But if you abuse this interface, it was always your fault; the kernel never took responsibility to sanity check your nvme input, and I think it's a bad precedent to start doing it.
In my mind, this was about dealing with the specific case when the kernel memory is being used for device DMA. We have just two cases: (i) separate meta buffer, and (ii) bounce buffer for data (+metadata). I had not planned sanity checks for user inputs for anything beyond that. As opposed to being preventive (in all cases), it was about failing only when we are certain that DMA will take place and it will corrupt kernel memory.
In the long-term, it may be possible for the path to do away with memory copies. The checks can disappear with that.
On Fri, Oct 13, 2023 at 07:49:19AM +0530, Kanchan Joshi wrote:
precedent to start doing it.
In my mind, this was about dealing with the specific case when the kernel memory is being used for device DMA. We have just two cases: (i) separate meta buffer, and (ii) bounce buffer for data (+metadata). I had not planned sanity checks for user inputs for anything beyond that. As opposed to being preventive (in all cases), it was about failing only when we are certain that DMA will take place and it will corrupt kernel memory.
In the long-term, it may be possible for the path to do away with memory copies. The checks can disappear with that.
As soon as the user buffer is unaligned we need to bounce buffer, including for the data buffer.
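Roughly, as a conceptual sketch only (not the exact blk_rq_map_user_iov() logic), the bounce decision keys off the queue's DMA alignment:

#include <linux/blkdev.h>
#include <linux/uio.h>

/* conceptual only: bounce (copy) when any user segment violates the queue's DMA alignment */
static bool needs_bounce(struct request_queue *q, const struct iovec *iov,
			 unsigned long nr_segs)
{
	unsigned long align = queue_dma_alignment(q) | q->dma_pad_mask;
	unsigned long i;

	for (i = 0; i < nr_segs; i++)
		if (((unsigned long)iov[i].iov_base | iov[i].iov_len) & align)
			return true;	/* DMA then targets a kernel copy, not the user pages */
	return false;
}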
On Fri, Oct 13, 2023 at 10:08 AM Christoph Hellwig <hch@lst.de> wrote:
On Fri, Oct 13, 2023 at 07:49:19AM +0530, Kanchan Joshi wrote:
precedent to start doing it.
In my mind, this was about dealing with the specific case when the kernel memory is being used for device DMA. We have just two cases: (i) separate meta buffer, and (ii) bounce buffer for data (+metadata). I had not planned sanity checks for user inputs for anything beyond that. As opposed to being preventive (in all cases), it was about failing only when we are certain that DMA will take place and it will corrupt kernel memory.
In the long-term, it may be possible for the path to do away with memory copies. The checks can disappear with that.
As soon as the user buffer is unaligned we need to bounce buffer, including for the data buffer.
Yes, but that also sprinkles a bunch of checks and goes against the theme of doing as little as possible (at least for passthrough). Had the plain (potentially unaligned) buffer gone down, either it would have worked or the device would not have liked it and user space would have got the error anyway. No?