Unfortunately, we ended up merging an old version of the patch "fix info leak with DMA_FROM_DEVICE" instead of the latest one. After I pointed out the mix-up and asked for guidance, Christoph (the swiotlb maintainer) asked me to create an incremental fix. So here we go.
The main differences between what we got and what was agreed are:
* swiotlb_sync_single_for_device is also required to do an extra bounce
* we decided not to introduce DMA_ATTR_OVERWRITE until we have exploiters
* the implementation of DMA_ATTR_OVERWRITE is flawed: DMA_ATTR_OVERWRITE must take precedence over DMA_ATTR_SKIP_CPU_SYNC
Thus this patch removes DMA_ATTR_OVERWRITE, and makes swiotlb_sync_single_for_device() bounce unconditionally (that is, also when dir == DMA_TO_DEVICE) in order to avoid synchronising back stale data from the swiotlb buffer.
Let me note that if the size used with the dma_sync_* API is less than the size used with dma_[un]map_*, under certain circumstances we may still end up with swiotlb not being transparent. In that sense, this is not a perfect fix either.
To make this bulletproof, we would have to bounce the entire mapping/bounce buffer. For that, we would have to figure out the starting address and the size of the mapping in swiotlb_sync_single_for_device(). While this does seem possible, there seems to be no firm consensus on how things are supposed to work.
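To illustrate the remaining gap, consider a contrived driver sequence like the one below; the sizes and the device behaviour are made up for the example, and dev/buf are assumed to be set up elsewhere:

	dma_addr_t handle = dma_map_single(dev, buf, 4096, DMA_FROM_DEVICE);

	/* ... first transfer: device writes only the first 512 bytes ... */
	dma_sync_single_for_cpu(dev, handle, 512, DMA_FROM_DEVICE);
	/* ... CPU reads those, then updates buf beyond the first 512 bytes ... */
	dma_sync_single_for_device(dev, handle, 512, DMA_FROM_DEVICE);
	/*
	 * Only 512 bytes were bounced back into the tlb buffer, so the
	 * CPU's writes beyond them never reached it; the full-size unmap
	 * now copies 4096 stale bytes back over them.
	 */
	dma_unmap_single(dev, handle, 4096, DMA_FROM_DEVICE);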
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE")
Cc: stable@vger.kernel.org
---
I just realized that there are still scenarios where swiotlb may produce some strange effects -- I don't think we have discussed the dma_sync_* part in detail.
v1 -> v2:
* single patch instead of revert + right version

---
 Documentation/core-api/dma-attributes.rst |  8 --------
 include/linux/dma-mapping.h               |  8 --------
 kernel/dma/swiotlb.c                      | 23 +++++++++++++++--------
 3 files changed, 15 insertions(+), 24 deletions(-)
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 17706dc91ec9..1887d92e8e92 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,11 +130,3 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
 subsystem that the buffer is fully accessible at the elevated privilege
 level (and ideally inaccessible or at least read-only at the
 lesser-privileged levels).
-
-DMA_ATTR_OVERWRITE
-------------------
-
-This is a hint to the DMA-mapping subsystem that the device is expected to
-overwrite the entire mapped size, thus the caller does not require any of the
-previous buffer contents to be preserved. This allows bounce-buffering
-implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 6150d11a607e..dca2b1355bb1 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,14 +61,6 @@
  */
 #define DMA_ATTR_PRIVILEGED (1UL << 9)
 
-/*
- * This is a hint to the DMA-mapping subsystem that the device is expected
- * to overwrite the entire mapped size, thus the caller does not require any
- * of the previous buffer contents to be preserved. This allows
- * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
- */
-#define DMA_ATTR_OVERWRITE (1UL << 10)
-
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform. It can
  * be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index bfc56cb21705..6db1c475ec82 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -627,10 +627,14 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	for (i = 0; i < nr_slots(alloc_size + offset); i++)
 		mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(mem->start, index) + offset;
-	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-	    (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
-	    dir == DMA_BIDIRECTIONAL))
-		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
+	/*
+	 * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
+	 * to the tlb buffer, if we knew for sure the device will
+	 * overwrite the entire current content. But we don't. Thus
+	 * unconditional bounce may prevent leaking swiotlb content (i.e.
+	 * kernel memory) to user-space.
+	 */
+	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
 	return tlb_addr;
 }
 
@@ -697,10 +701,13 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
-		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-	else
-		BUG_ON(dir != DMA_FROM_DEVICE);
+	/*
+	 * Unconditional bounce is necessary to avoid corruption on
+	 * sync_*_for_cpu or dma_unmap_* when the device didn't overwrite
+	 * the whole length of the bounce buffer.
+	 */
+	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+	BUG_ON(!valid_dma_direction(dir));
 }
 
void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
base-commit: 38f80f42147ff658aff218edb0a88c37e58bf44f
On Sat, Mar 5, 2022 at 9:07 AM Halil Pasic <pasic@linux.ibm.com> wrote:
Unfortunately, we ended up merging an old version of the patch "fix info leak with DMA_FROM_DEVICE" instead of the latest one. After I pointed out the mix-up and asked for guidance, Christoph (the swiotlb maintainer) asked me to create an incremental fix. So here we go.
[...]
Christoph, I am assuming I'll get this from you, but if you have nothing else pending, just holler and I can take it directly.
That said, it seems sad to bounce the buffer just in case the device doesn't do what we expect it to do. Wouldn't it be better to just clear the buffer instead of copying the old data into it?
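Something like this, say (completely untested, just to show the idea, and reusing the variable names from the swiotlb_tbl_map_single() hunk above):

	/*
	 * Clear instead of bounce: for device-to-memory mappings the CPU
	 * can never see stale swiotlb contents, and we skip the copy-in.
	 */
	if (dir == DMA_FROM_DEVICE)
		memset(phys_to_virt(tlb_addr), 0, mapping_size);
	else
		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);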
Also, possibly stupid question - when is swiotlb used in practice these days? What are the performance implications of this? Will this mean completely unnecessary copies for all normal IO that will be overwritten by the DMA? Can't we limit it to just SG_IO (or is it already for some reason)?
Linus
On Sat, Mar 05, 2022 at 11:44:51AM -0800, Linus Torvalds wrote:
Christoph, I am assuming I'll get this from you, but if you have nothing else pending, just holler and I can take it directly.
I have nothing else pending, so feel free to take it directly:
Reviewed-by: Christoph Hellwig <hch@lst.de>
That said, it seems sad to bounce the buffer just in case the device doesn't do what we expect it to do. Wouldn't it be better to just clear the buffer instead of copying the old data into it?
That unfortunately does not always work, as we have interfaces where we expect data that is not written to by the device to remain unchanged, and there are many cases where the device does not write the whole buffer. In various places like SCSI the data transfer can be smaller, and even more common is the case of no data transfer at all in error cases.
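As a contrived example (the sizes and the 0xaa pattern are made up, dev is assumed from context):

	char buf[512];
	dma_addr_t handle;

	memset(buf, 0xaa, sizeof(buf));	/* the caller's own data */
	handle = dma_map_single(dev, buf, sizeof(buf), DMA_FROM_DEVICE);
	/* ... device does a short transfer: only 100 bytes are valid ... */
	dma_unmap_single(dev, handle, sizeof(buf), DMA_FROM_DEVICE);
	/*
	 * The caller expects buf[100..511] to still contain 0xaa.  With
	 * "clear instead of bounce" it would read back as zeroes.
	 */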
Also, possibly stupid question - when is swiotlb used in practice these days? What are the performance implications of this? Will this mean completely unnecessary copies for all normal IO that will be overwritten by the DMA? Can't we limit it to just SG_IO (or is it already for some reason)?
There are three use cases:
- bounce buffering due to address limitations when no IOMMU is present
- bounce buffering for not page aligned I/O on IOMMUs if the device is considered untrusted (which right now means thunderbolt attached)
- unconditional bounce buffering for trusted hypervisor schemes
All of these apply to all I/O. SG_IO is just a vector here as a particularly badly designed userspace interface that happens to get decent test coverage. There unfortunately are plenty of other ways to expose this issue as well. We're thinking of building interfaces to avoid the overhead for well designed interfaces, but it will take some time.
On Sat, Mar 5, 2022 at 11:58 PM Christoph Hellwig <hch@infradead.org> wrote:
All of these apply to all I/O. SG_IO is just a vector here as a particularly badly designed userspace interface that happens to get decent test coverage.
I phrased that badly. I didn't mean SG_IO in particular, I meant any of the "generate special commands" interfaces.
In particular, I meant it as "not *regular* read/write commands".
It seems extremely wasteful to do a completely unnecessary bounce buffer operation for READ/WRITE commands.
Because honestly, if *those* are broken on some DMA device and they don't fill the buffer completely - despite claiming they do - then that device is so terminally broken that it's not even worth worrying about.
So I would expect that
(a) READ/WRITE actually fills the whole buffer
(b) READ/WRITE are the only ones where we care about performance at a bounce-buffer level
so it boils down to "do we still do this horrible memcpy even for regular IO commands"? Because that would, in my opinion, just be stupid.
Linus
On Sun, Mar 06, 2022 at 12:30:55PM -0800, Linus Torvalds wrote:
So I would expect that
(a) READ/WRITE actually fills the whole buffer
(b) READ/WRITE are the only ones where we care about performance at a bounce-buffer level
so it boils down to "do we still do this horrible memcpy even for regular IO commands"? Because that would, in my opinion, just be stupid.
For one thing this is not just for block I/O, but for all DMA. Second, "normal" I/O can always fail, including after partial transfers. SCSI even considers that normal. Network devices consider it normal not to fill the entire receive buffer, etc.
In short: anything that operates directly on user memory is a trivial reproducer here. The CVE uses SG_IO, but staying in block land direct I/O will work just the same because swiotlb will copy back the uninitialized data to user memory after an I/O failure.
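The rough shape of that from userspace is something like the sketch below (illustrative only: the device path is made up, and you still need a way to make the transfer actually fail, e.g. error injection):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		void *buf;
		/* hypothetical device; O_DIRECT forces DMA to user memory */
		int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);

		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;
		/*
		 * On an affected kernel a failed direct read can still copy
		 * the never-overwritten bounce buffer back into buf.
		 */
		if (read(fd, buf, 4096) < 0)
			fwrite(buf, 1, 4096, stdout);	/* inspect leaked bytes */
		return 0;
	}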
What we've been thinking of is a version of the dma map calls where the unmap gets passed how much data actually was transferred and only copies that out. That seems like the only sane interface.
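Sketched as an interface it could look like this (the name and signature are hypothetical, nothing like it exists yet):

	/*
	 * Hypothetical variant of dma_unmap_single(): @transferred tells the
	 * DMA layer how many bytes the device actually wrote, so a bounce
	 * buffering implementation only copies that much back out.
	 */
	void dma_unmap_single_transferred(struct device *dev, dma_addr_t addr,
			size_t size, size_t transferred,
			enum dma_data_direction dir);

	/* e.g. a block driver completing a short READ: */
	dma_unmap_single_transferred(dev, handle, 4096, bytes_done,
			DMA_FROM_DEVICE);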
Now IFF we knew that the buffer is never looked at on I/O failure or short I/O, we could do away with all this. But without adding new interfaces where the caller guarantees that, we can't know. For userspace memory it is guaranteed to be not true. For kernel memory it most likely is true, but there are some amazingly awful pieces of code that probably still get it wrong.