Hello,
This patch series is a continuation of my work on implementing generic IOMMU support in the DMA mapping framework for the ARM architecture. This time I have focused on the DMA mapping framework itself. It turned out that adding support for the common dma_map_ops structure was not as hard as I initially thought. After some modifications, most of the code fits the generic dma_map_ops methods really well.
The only change required to dma_map_ops is a new alloc function. During the discussion at the Linaro Memory Management meeting in Budapest we came up with the idea of having a single alloc/free/mmap function with an additional attributes argument. This way all the different kinds of architecture-specific buffer mappings can be hidden behind the attributes, without the need to create several versions of the dma_alloc_ function. I also noticed that dma_alloc_noncoherent() can be implemented this way with a DMA_ATTRIB_NON_COHERENT attribute; systems that simply define dma_alloc_noncoherent as dma_alloc_coherent will just ignore such an attribute.
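As a rough sketch (dma_alloc_attrs() and DMA_ATTRIB_NON_COHERENT are names proposed here, not existing v3.0 interfaces; DEFINE_DMA_ATTRS()/dma_set_attr() are the existing linux/dma-attrs.h helpers), dma_alloc_noncoherent() could then collapse to:

/* Sketch only: dma_alloc_attrs() and DMA_ATTRIB_NON_COHERENT are part of
 * this proposal, not existing kernel symbols in v3.0-rc3. */
static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
					  dma_addr_t *handle, gfp_t gfp)
{
	DEFINE_DMA_ATTRS(attrs);

	dma_set_attr(DMA_ATTRIB_NON_COHERENT, &attrs);
	/* One alloc entry point; the attribute selects the mapping type.
	 * Architectures that treat noncoherent as coherent simply ignore
	 * the attribute. */
	return dma_alloc_attrs(dev, size, handle, gfp, &attrs);
}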
Another good use case for an alloc method with attributes is the possibility of allocating a buffer without a valid kernel mapping. There are a number of drivers (mainly V4L2 and ALSA ones) that only export their DMA buffers to user space and never touch the buffer data at all. For such buffers we can avoid creating a mapping in the kernel virtual address space, saving precious vmalloc area. Such buffers could be allocated once a new DMA_ATTRIB_NO_KERNEL_MAPPING attribute is introduced.
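A driver of that kind might then allocate its buffer roughly like this (again just a sketch; the attribute and dma_alloc_attrs() follow the proposal above, and the helper itself is hypothetical):

/* Hypothetical helper; DMA_ATTRIB_NO_KERNEL_MAPPING and dma_alloc_attrs()
 * are proposed interfaces, not existing v3.0 ones. */
static int example_alloc_userspace_only_buffer(struct device *dev, size_t size,
					       dma_addr_t *dma_handle,
					       void **cookie)
{
	DEFINE_DMA_ATTRS(attrs);

	dma_set_attr(DMA_ATTRIB_NO_KERNEL_MAPPING, &attrs);

	/* No kernel virtual address is created: the returned cookie is
	 * opaque and may only be passed back to the matching free/mmap
	 * helpers, never dereferenced by the driver. */
	*cookie = dma_alloc_attrs(dev, size, dma_handle, GFP_KERNEL, &attrs);
	return *cookie ? 0 : -ENOMEM;
}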
All the changes introduced in this patch series are intended to prepare the ground for the upcoming integration of generic IOMMU support into the DMA mapping framework on the ARM architecture.
For more information about the proof-of-concept IOMMU implementation in the DMA mapping framework, please refer to my previous set of patches: http://www.spinics.net/lists/linux-mm/msg19856.html
I've tried to split the redesign into a set of single-step changes for easier review and understanding. If there is anything that needs further clarification, please don't hesitate to ask.
The patches are prepared on top of Linux Kernel v3.0-rc3.
The proposed changes have been tested on the Samsung Exynos4 platform. I've also tested the dmabounce code (by manually registering DMA bounce support for some of the devices available on my board), although my hardware has no such strict requirements. It would be great if someone could test my patches on other ARM platforms to check that I didn't break anything.
Best regards
This patch removes the need for the offset parameter in the dma bounce functions. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 arch/arm/common/dmabounce.c        |   13 ++++++--
 arch/arm/include/asm/dma-mapping.h |   63 +++++++++++++++++------------------
 arch/arm/mm/dma-mapping.c          |    4 +-
 3 files changed, 43 insertions(+), 37 deletions(-)
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c index e568163..f7b330f 100644 --- a/arch/arm/common/dmabounce.c +++ b/arch/arm/common/dmabounce.c @@ -171,7 +171,8 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_ read_lock_irqsave(&device_info->lock, flags);
list_for_each_entry(b, &device_info->safe_buffers, node) - if (b->safe_dma_addr == safe_dma_addr) { + if (b->safe_dma_addr <= safe_dma_addr && + b->safe_dma_addr + b->size > safe_dma_addr) { rb = b; break; } @@ -391,9 +392,10 @@ void __dma_unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size, EXPORT_SYMBOL(__dma_unmap_page);
int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, - unsigned long off, size_t sz, enum dma_data_direction dir) + size_t sz, enum dma_data_direction dir) { struct safe_buffer *buf; + unsigned long off;
dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n", __func__, addr, off, sz, dir); @@ -402,6 +404,8 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, if (!buf) return 1;
+ off = addr - buf->safe_dma_addr; + BUG_ON(buf->direction != dir);
dev_dbg(dev, "%s: unsafe buffer %p (dma=%#x) mapped to %p (dma=%#x)\n", @@ -420,9 +424,10 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, EXPORT_SYMBOL(dmabounce_sync_for_cpu);
int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr, - unsigned long off, size_t sz, enum dma_data_direction dir) + size_t sz, enum dma_data_direction dir) { struct safe_buffer *buf; + unsigned long off;
dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n", __func__, addr, off, sz, dir); @@ -431,6 +436,8 @@ int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr, if (!buf) return 1;
+ off = addr - buf->safe_dma_addr; + BUG_ON(buf->direction != dir);
dev_dbg(dev, "%s: unsafe buffer %p (dma=%#x) mapped to %p (dma=%#x)\n", diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 4fff837..ca920aa 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -310,10 +310,8 @@ extern void __dma_unmap_page(struct device *, dma_addr_t, size_t, /* * Private functions */ -int dmabounce_sync_for_cpu(struct device *, dma_addr_t, unsigned long, - size_t, enum dma_data_direction); -int dmabounce_sync_for_device(struct device *, dma_addr_t, unsigned long, - size_t, enum dma_data_direction); +int dmabounce_sync_for_cpu(struct device *, dma_addr_t, size_t, enum dma_data_direction); +int dmabounce_sync_for_device(struct device *, dma_addr_t, size_t, enum dma_data_direction); #else static inline int dmabounce_sync_for_cpu(struct device *d, dma_addr_t addr, unsigned long offset, size_t size, enum dma_data_direction dir) @@ -454,6 +452,33 @@ static inline void dma_unmap_page(struct device *dev, dma_addr_t handle, __dma_unmap_page(dev, handle, size, dir); }
+ +static inline void dma_sync_single_for_cpu(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + BUG_ON(!valid_dma_direction(dir)); + + debug_dma_sync_single_for_cpu(dev, handle, size, dir); + + if (!dmabounce_sync_for_cpu(dev, handle, size, dir)) + return; + + __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); +} + +static inline void dma_sync_single_for_device(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + BUG_ON(!valid_dma_direction(dir)); + + debug_dma_sync_single_for_device(dev, handle, size, dir); + + if (!dmabounce_sync_for_device(dev, handle, size, dir)) + return; + + __dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir); +} + /** * dma_sync_single_range_for_cpu * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices @@ -476,40 +501,14 @@ static inline void dma_sync_single_range_for_cpu(struct device *dev, dma_addr_t handle, unsigned long offset, size_t size, enum dma_data_direction dir) { - BUG_ON(!valid_dma_direction(dir)); - - debug_dma_sync_single_for_cpu(dev, handle + offset, size, dir); - - if (!dmabounce_sync_for_cpu(dev, handle, offset, size, dir)) - return; - - __dma_single_dev_to_cpu(dma_to_virt(dev, handle) + offset, size, dir); + dma_sync_single_for_cpu(dev, handle + offset, size, dir); }
static inline void dma_sync_single_range_for_device(struct device *dev, dma_addr_t handle, unsigned long offset, size_t size, enum dma_data_direction dir) { - BUG_ON(!valid_dma_direction(dir)); - - debug_dma_sync_single_for_device(dev, handle + offset, size, dir); - - if (!dmabounce_sync_for_device(dev, handle, offset, size, dir)) - return; - - __dma_single_cpu_to_dev(dma_to_virt(dev, handle) + offset, size, dir); -} - -static inline void dma_sync_single_for_cpu(struct device *dev, - dma_addr_t handle, size_t size, enum dma_data_direction dir) -{ - dma_sync_single_range_for_cpu(dev, handle, 0, size, dir); -} - -static inline void dma_sync_single_for_device(struct device *dev, - dma_addr_t handle, size_t size, enum dma_data_direction dir) -{ - dma_sync_single_range_for_device(dev, handle, 0, size, dir); + dma_sync_single_for_device(dev, handle + offset, size, dir); }
/* diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 82a093c..c11f234 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -619,7 +619,7 @@ void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int i;
for_each_sg(sg, s, nents, i) { - if (!dmabounce_sync_for_cpu(dev, sg_dma_address(s), 0, + if (!dmabounce_sync_for_cpu(dev, sg_dma_address(s), sg_dma_len(s), dir)) continue;
@@ -645,7 +645,7 @@ void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int i;
for_each_sg(sg, s, nents, i) { - if (!dmabounce_sync_for_device(dev, sg_dma_address(s), 0, + if (!dmabounce_sync_for_device(dev, sg_dma_address(s), sg_dma_len(s), dir)) continue;
On Mon, 20 Jun 2011 09:50:06 +0200, Marek Szyprowski m.szyprowski@samsung.com wrote:
+static inline void dma_sync_single_for_cpu(struct device *dev,
I wouldn't really put inline here or in the function below.
+	dma_addr_t handle, size_t size, enum dma_data_direction dir)
+{
+	BUG_ON(!valid_dma_direction(dir));
+
+	debug_dma_sync_single_for_cpu(dev, handle, size, dir);
+
+	if (!dmabounce_sync_for_cpu(dev, handle, size, dir))
+		return;
+
+	__dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir);
I know it is just copy'n'paste but how about:
if (dmabounce_sync_for_cpu(dev, handle, size, dir))
	__dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir);
+}
+static inline void dma_sync_single_for_device(struct device *dev,
+	dma_addr_t handle, size_t size, enum dma_data_direction dir)
+{
+	BUG_ON(!valid_dma_direction(dir));
+
+	debug_dma_sync_single_for_device(dev, handle, size, dir);
+
+	if (!dmabounce_sync_for_device(dev, handle, size, dir))
+		return;
+
+	__dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir);
Same as above.
+}
/**
 * dma_sync_single_range_for_cpu
 * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
Hello,
On Monday, June 20, 2011 10:36 AM Michal Nazarewicz wrote:
On Mon, 20 Jun 2011 09:50:06 +0200, Marek Szyprowski m.szyprowski@samsung.com wrote:
+static inline void dma_sync_single_for_cpu(struct device *dev,
I wouldn't really put inline here or in the function below.
+	dma_addr_t handle, size_t size, enum dma_data_direction dir)
+{
+	BUG_ON(!valid_dma_direction(dir));
+
+	debug_dma_sync_single_for_cpu(dev, handle, size, dir);
+
+	if (!dmabounce_sync_for_cpu(dev, handle, size, dir))
+		return;
+
+	__dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir);
I know it is just copy'n'paste but how about:
This patch is just about moving code between the files, and I wanted to show what is being changed and how. There is a final cleanup in a separate patch anyway. All these patches are meant to start a discussion about how dma mapping can be redesigned for further extensions with generic iommu support.
if (dmabounce_sync_for_cpu(dev, handle, size, dir))
	__dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir);
The above lines will be removed by the next patches in this series, so I really see no point in changing this.
(snipped)
Best regards
On Mon, Jun 20, 2011 at 09:50:06AM +0200, Marek Szyprowski wrote:
This patch removes the need for the offset parameter in the dma bounce functions. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
I really don't like this. Really as in hate. Why? I've said in the past that the whole idea of getting rid of the sub-range functions is idiotic.
If you have to deal with cache coherence, what you _really_ want is an API which tells you the size of the original buffer and the section of that buffer which you want to handle - because the edges of the buffer need special handling.
Let's say that you have a buffer which is 256 bytes long, misaligned to half a cache line. Let's first look at the sequence for the whole buffer:

1. You map it for DMA from the device. This means you write back the first and last cache lines to preserve any data shared in the overlapping cache line. The remainder you can just invalidate.
2. You want to access the buffer, so you use the sync_for_cpu function. If your CPU doesn't do any speculative prefetching, then you don't need to do anything. If you do, you have to invalidate the buffer, but you must preserve the overlapping cache lines which again must be written back.
3. You transfer ownership back to the device using sync_for_device. As you may have caused cache lines to be read in, again you need to invalidate, and the overlapping cache lines must be written back.
Now, if you ask for a sub-section of the buffer to be sync'd, you can actually eliminate those writebacks which are potentially troublesome, and which could corrupt neighbouring data.
If you get rid of the sub-buffer functions and start using the whole buffer functions for that purpose, you no longer know whether the partial cache lines are part of the buffer or not, so you have to write those back every time too.
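To make that concrete, here is a rough sketch of what a range-aware sync-for-device could do with the extra information; CACHE_LINE_SIZE and the cache primitives below are placeholders, not the actual ARM cache ops:

/* Illustration only: writeback_cache_line() and invalidate_cache_range()
 * stand in for the real per-CPU cache operations. */
static void sync_range_for_device(unsigned long buf, size_t buf_len,
				  unsigned long off, size_t len)
{
	unsigned long start = buf + off, end = start + len;

	/* Leading edge: the partial cache line holds *neighbouring* data
	 * only when the sub-range starts at the (misaligned) buffer start,
	 * so only then is a writeback needed before invalidating. */
	if ((start & (CACHE_LINE_SIZE - 1)) && off == 0)
		writeback_cache_line(start & ~(CACHE_LINE_SIZE - 1));

	/* Trailing edge: same reasoning at the end of the buffer. */
	if ((end & (CACHE_LINE_SIZE - 1)) && off + len == buf_len)
		writeback_cache_line(end & ~(CACHE_LINE_SIZE - 1));

	/* Everything else lies wholly inside the buffer: invalidate only. */
	invalidate_cache_range(start, end);
}

With only the whole-buffer call available for a sub-range, both edge writebacks would have to happen on every call, which is exactly the corruption risk described above.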
So far, we haven't had any reports of corruption of this type (maybe folk using the sync functions are rare on ARM - thankfully) but getting rid of the range sync functions means that solving this becomes a lot more difficult because we've lost the information to make the decision.
So I've always believed - and continue to do so - that those who want to get rid of the range sync functions are misguided and are storing up problems for the future.
Hello,
On Sunday, July 03, 2011 5:28 PM Russell King wrote:
On Mon, Jun 20, 2011 at 09:50:06AM +0200, Marek Szyprowski wrote:
This patch removes the need for the offset parameter in the dma bounce functions. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
I really don't like this. Really as in hate. Why? I've said in the past that the whole idea of getting rid of the sub-range functions is idiotic.
If you have to deal with cache coherence, what you _really_ want is an API which tells you the size of the original buffer and the section of that buffer which you want to handle - because the edges of the buffer need special handling.
Let's say that you have a buffer which is 256 bytes long, misaligned to half a cache line. Let's first look at the sequence for the whole buffer:

You map it for DMA from the device. This means you write back the first and last cache lines to preserve any data shared in the overlapping cache line. The remainder you can just invalidate.
You want to access the buffer, so you use the sync_for_cpu function. If your CPU doesn't do any speculative prefetching, then you don't need to do anything. If you do, you have to invalidate the buffer, but you must preserve the overlapping cache lines which again must be written back.
You transfer ownership back to the device using sync_for_device. As you may have caused cache lines to be read in, again you need to invalidate, and the overlapping cache lines must be written back.
Now, if you ask for a sub-section of the buffer to be sync'd, you can actually eliminate those writebacks which are potentially troublesome, and which could corrupt neighbouring data.
If you get rid of the sub-buffer functions and start using the whole buffer functions for that purpose, you no longer know whether the partial cache lines are part of the buffer or not, so you have to write those back every time too.
So far, we haven't had any reports of corruption of this type (maybe folk using the sync functions are rare on ARM - thankfully) but getting rid of the range sync functions means that solving this becomes a lot more difficult because we've lost the information to make the decision.
Well, right now I haven't heard of anyone who wants to remove dma_sync_single_range_for_{cpu,device}. All this is about the internal implementation and dma_map_ops, which use the simplified calls; none of it is exposed to the drivers or any public API.
I also see no reason why we lose the information. All drivers are still required to call dma_map_{single,page} to acquire a dma address first. This way the DMA mapping subsystem knows perfectly well that the range from the returned dma_addr to dma_addr+size has been used for dma operations. All dma_sync_single_* operations take a dma_addr as one of their arguments, so there is no problem checking which dma range a particular sync operation falls into.
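Condensed from the dmabounce hunks above (the wrapper name below is made up for illustration), the lookup itself recovers everything the old off parameter used to carry:

/* Sketch of the logic in this patch: find_safe_buffer() now matches any
 * address inside a mapped range, so the sync path can recover the offset
 * itself instead of being handed it by the caller. */
static int example_bounce_sync(struct dmabounce_device_info *device_info,
			       dma_addr_t addr, size_t sz)
{
	struct safe_buffer *buf;
	unsigned long off;

	buf = find_safe_buffer(device_info, addr);
	if (!buf)
		return 1;			/* not a bounced mapping */

	off = addr - buf->safe_dma_addr;	/* offset within the mapping */
	/* ... copy/sync buf->ptr + off for sz bytes, as before ... */
	return 0;
}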
In my patch I have shown that it is perfectly possible to use the common dma_map_ops structure on ARM and unify the dma mapping implementation a bit with other architectures.

IMHO this is the right way to go. There is a need for custom dma mapping implementations (mainly related to taking advantage of the IOMMU controllers available on newer SoCs), and I would really like to avoid another set of ifdefs or sequences of "if (iommu_supported())" all over the dma-mapping code. Even now all this code is hard to understand on a first read (due to the coherent/non-coherent sub-architectures and the dmabounce code mixed in).
So I've always believed - and continue to do so - that those who want to get rid of the range sync functions are misguided and are storing up problems for the future.
I never said that I want to remove these operations from the drivers' API.
Best regards
This patch consolidates the dma_map_single and dma_map_page calls. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 arch/arm/common/dmabounce.c        |   28 ----------
 arch/arm/include/asm/dma-mapping.h |  100 +++++++++++++----------------------
 2 files changed, 37 insertions(+), 91 deletions(-)
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c index f7b330f..9eb161e 100644 --- a/arch/arm/common/dmabounce.c +++ b/arch/arm/common/dmabounce.c @@ -329,34 +329,6 @@ static inline void unmap_single(struct device *dev, dma_addr_t dma_addr, * substitute the safe buffer for the unsafe one. * (basically move the buffer from an unsafe area to a safe one) */ -dma_addr_t __dma_map_single(struct device *dev, void *ptr, size_t size, - enum dma_data_direction dir) -{ - dev_dbg(dev, "%s(ptr=%p,size=%d,dir=%x)\n", - __func__, ptr, size, dir); - - BUG_ON(!valid_dma_direction(dir)); - - return map_single(dev, ptr, size, dir); -} -EXPORT_SYMBOL(__dma_map_single); - -/* - * see if a mapped address was really a "safe" buffer and if so, copy - * the data from the safe buffer back to the unsafe buffer and free up - * the safe buffer. (basically return things back to the way they - * should be) - */ -void __dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, - enum dma_data_direction dir) -{ - dev_dbg(dev, "%s(ptr=%p,size=%d,dir=%x)\n", - __func__, (void *) dma_addr, size, dir); - - unmap_single(dev, dma_addr, size, dir); -} -EXPORT_SYMBOL(__dma_unmap_single); - dma_addr_t __dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir) { diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index ca920aa..799669d 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -298,10 +298,6 @@ extern int dma_needs_bounce(struct device*, dma_addr_t, size_t); /* * The DMA API, implemented by dmabounce.c. See below for descriptions. */ -extern dma_addr_t __dma_map_single(struct device *, void *, size_t, - enum dma_data_direction); -extern void __dma_unmap_single(struct device *, dma_addr_t, size_t, - enum dma_data_direction); extern dma_addr_t __dma_map_page(struct device *, struct page *, unsigned long, size_t, enum dma_data_direction); extern void __dma_unmap_page(struct device *, dma_addr_t, size_t, @@ -325,14 +321,6 @@ static inline int dmabounce_sync_for_device(struct device *d, dma_addr_t addr, return 1; }
- -static inline dma_addr_t __dma_map_single(struct device *dev, void *cpu_addr, - size_t size, enum dma_data_direction dir) -{ - __dma_single_cpu_to_dev(cpu_addr, size, dir); - return virt_to_dma(dev, cpu_addr); -} - static inline dma_addr_t __dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir) { @@ -340,12 +328,6 @@ static inline dma_addr_t __dma_map_page(struct device *dev, struct page *page, return pfn_to_dma(dev, page_to_pfn(page)) + offset; }
-static inline void __dma_unmap_single(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir) -{ - __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); -} - static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { @@ -354,34 +336,6 @@ static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, } #endif /* CONFIG_DMABOUNCE */
-/** - * dma_map_single - map a single buffer for streaming DMA - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @cpu_addr: CPU direct mapped address of buffer - * @size: size of buffer to map - * @dir: DMA transfer direction - * - * Ensure that any data held in the cache is appropriately discarded - * or written back. - * - * The device owns this memory once this call has completed. The CPU - * can regain ownership by calling dma_unmap_single() or - * dma_sync_single_for_cpu(). - */ -static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, - size_t size, enum dma_data_direction dir) -{ - dma_addr_t addr; - - BUG_ON(!valid_dma_direction(dir)); - - addr = __dma_map_single(dev, cpu_addr, size, dir); - debug_dma_map_page(dev, virt_to_page(cpu_addr), - (unsigned long)cpu_addr & ~PAGE_MASK, size, - dir, addr, true); - - return addr; -}
/** * dma_map_page - map a portion of a page for streaming DMA @@ -411,48 +365,68 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, }
/** - * dma_unmap_single - unmap a single buffer previously mapped + * dma_unmap_page - unmap a buffer previously mapped through dma_map_page() * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @handle: DMA address of buffer - * @size: size of buffer (same as passed to dma_map_single) - * @dir: DMA transfer direction (same as passed to dma_map_single) + * @size: size of buffer (same as passed to dma_map_page) + * @dir: DMA transfer direction (same as passed to dma_map_page) * - * Unmap a single streaming mode DMA translation. The handle and size - * must match what was provided in the previous dma_map_single() call. + * Unmap a page streaming mode DMA translation. The handle and size + * must match what was provided in the previous dma_map_page() call. * All other usages are undefined. * * After this call, reads by the CPU to the buffer are guaranteed to see * whatever the device wrote there. */ -static inline void dma_unmap_single(struct device *dev, dma_addr_t handle, + +static inline void dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - debug_dma_unmap_page(dev, handle, size, dir, true); - __dma_unmap_single(dev, handle, size, dir); + debug_dma_unmap_page(dev, handle, size, dir, false); + __dma_unmap_page(dev, handle, size, dir); }
/** - * dma_unmap_page - unmap a buffer previously mapped through dma_map_page() + * dma_map_single - map a single buffer for streaming DMA + * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices + * @cpu_addr: CPU direct mapped address of buffer + * @size: size of buffer to map + * @dir: DMA transfer direction + * + * Ensure that any data held in the cache is appropriately discarded + * or written back. + * + * The device owns this memory once this call has completed. The CPU + * can regain ownership by calling dma_unmap_single() or + * dma_sync_single_for_cpu(). + */ +static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, + size_t size, enum dma_data_direction dir) +{ + return dma_map_page(dev, virt_to_page(cpu_addr), + (unsigned long)cpu_addr & ~PAGE_MASK, size, dir); +} + +/** + * dma_unmap_single - unmap a single buffer previously mapped * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @handle: DMA address of buffer - * @size: size of buffer (same as passed to dma_map_page) - * @dir: DMA transfer direction (same as passed to dma_map_page) + * @size: size of buffer (same as passed to dma_map_single) + * @dir: DMA transfer direction (same as passed to dma_map_single) * - * Unmap a page streaming mode DMA translation. The handle and size - * must match what was provided in the previous dma_map_page() call. + * Unmap a single streaming mode DMA translation. The handle and size + * must match what was provided in the previous dma_map_single() call. * All other usages are undefined. * * After this call, reads by the CPU to the buffer are guaranteed to see * whatever the device wrote there. */ -static inline void dma_unmap_page(struct device *dev, dma_addr_t handle, +static inline void dma_unmap_single(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - debug_dma_unmap_page(dev, handle, size, dir, false); - __dma_unmap_page(dev, handle, size, dir); + dma_unmap_page(dev, handle, size, dir); }
- static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) {
On Mon, Jun 20, 2011 at 09:50:07AM +0200, Marek Szyprowski wrote:
This patch consolidates the dma_map_single and dma_map_page calls. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
This breaks DMA API debugging, which requires that dma_map_page and dma_unmap_page are paired separately from dma_map_single and dma_unmap_single().
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
Hello,
On Monday, June 20, 2011 4:39 PM Russell King - ARM Linux wrote:
On Mon, Jun 20, 2011 at 09:50:07AM +0200, Marek Szyprowski wrote:
This patch consolidates the dma_map_single and dma_map_page calls. This is required to let the dma-mapping framework on the ARM architecture use the common, generic dma-mapping helpers.
This breaks DMA API debugging, which requires that dma_map_page and dma_unmap_page are paired separately from dma_map_single and dma_unmap_single().
Ok, right. This can be fixed by creating appropriate static inline functions in dma-mapping.h and moving the dma_debug_* calls there. These functions will later be removed by the switch to dma_map_ops and the include/asm-generic/dma-mapping-common.h inlines, which do all the dma_debug_* calls correctly anyway.
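One possible shape of that fix, sketched against the helpers already present in this series (this is not the code that was eventually merged):

/* Sketch: keep a dedicated inline for dma_map_single() so dma-debug still
 * sees it as a single mapping.  The final 'true' tells dma-debug that this
 * mapping must be released with dma_unmap_single(), not dma_unmap_page(). */
static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
		size_t size, enum dma_data_direction dir)
{
	dma_addr_t addr;

	BUG_ON(!valid_dma_direction(dir));

	addr = __dma_map_page(dev, virt_to_page(cpu_addr),
			      (unsigned long)cpu_addr & ~PAGE_MASK, size, dir);
	debug_dma_map_page(dev, virt_to_page(cpu_addr),
			   (unsigned long)cpu_addr & ~PAGE_MASK, size,
			   dir, addr, true);
	return addr;
}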
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
I'm really not sure how this change will break the dma bounce code.

Does it mean that it is allowed to call dma_map_single() on a kmapped highmem page?
Best regards
On Monday 20 June 2011, Marek Szyprowski wrote:
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
I'm really not sure how this change will break the dma bounce code.

Does it mean that it is allowed to call dma_map_single() on a kmapped highmem page?
dma_map_single on a kmapped page already doesn't work, the argument needs to be inside of the linear mapping in order for virt_to_page to work.
Arnd
Hello,
On Friday, June 24, 2011 5:24 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
I'm really not sure how this change will break the dma bounce code.

Does it mean that it is allowed to call dma_map_single() on a kmapped highmem page?
dma_map_single on a kmapped page already doesn't work, the argument needs to be inside of the linear mapping in order for virt_to_page to work.
Then I got really confused.
Documentation/DMA-mapping.txt says that dma_map_single() can be used only with the kernel linear mapping, while dma_map_page() can also be called on HIGHMEM pages.
Now, let's go to the arch/arm/common/dmabounce.c code:

dma_addr_t __dma_map_page(struct device *dev, struct page *page,
		unsigned long offset, size_t size, enum dma_data_direction dir)
{
	dev_dbg(dev, "%s(page=%p,off=%#lx,size=%zx,dir=%x)\n",
		__func__, page, offset, size, dir);

	BUG_ON(!valid_dma_direction(dir));

	if (PageHighMem(page)) {
		dev_err(dev, "DMA buffer bouncing of HIGHMEM pages "
			"is not supported\n");
		return ~0;
	}

	return map_single(dev, page_address(page) + offset, size, dir);
}
EXPORT_SYMBOL(__dma_map_page);
Am I right that there is something mixed up here? I really don't get why there is a highmem check in the dma_map_page implementation. dma_map_single doesn't perform such a check and works with kmapped highmem pages...

Russell also pointed out that my patch broke dma bounce with highmem enabled.
Best regards
On Monday 27 June 2011, Marek Szyprowski wrote:
On Friday, June 24, 2011 5:24 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
I'm really not sure how this change will break the dma bounce code.

Does it mean that it is allowed to call dma_map_single() on a kmapped highmem page?
dma_map_single on a kmapped page already doesn't work, the argument needs to be inside of the linear mapping in order for virt_to_page to work.
Then I got really confused.
Documentation/DMA-mapping.txt says that dma_map_single() can be used only with the kernel linear mapping, while dma_map_page() can also be called on HIGHMEM pages.
Right, this is true in general.
Now, let's go to the arch/arm/common/dmabounce.c code:

dma_addr_t __dma_map_page(struct device *dev, struct page *page,
		unsigned long offset, size_t size, enum dma_data_direction dir)
{
	dev_dbg(dev, "%s(page=%p,off=%#lx,size=%zx,dir=%x)\n",
		__func__, page, offset, size, dir);

	BUG_ON(!valid_dma_direction(dir));

	if (PageHighMem(page)) {
		dev_err(dev, "DMA buffer bouncing of HIGHMEM pages "
			"is not supported\n");
		return ~0;
	}

	return map_single(dev, page_address(page) + offset, size, dir);
}
EXPORT_SYMBOL(__dma_map_page);

Am I right that there is something mixed up here? I really don't get why there is a highmem check in the dma_map_page implementation. dma_map_single doesn't perform such a check and works with kmapped highmem pages...

Russell also pointed out that my patch broke dma bounce with highmem enabled.
The version of __dma_map_page that you cited is the one used with dmabounce enabled; when CONFIG_DMABOUNCE is disabled, the following version is used:

static inline dma_addr_t __dma_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir)
{
	__dma_page_cpu_to_dev(page, offset, size, dir);
	return pfn_to_dma(dev, page_to_pfn(page)) + offset;
}
This does not have the check, because the kernel does not need to touch the kernel mapping in that case.
If you pass a kmapped page into dma_map_single, it should also not work because of the BUG_ON in ___dma_single_cpu_to_dev -- it warns you that you would end up flushing the cache for the wrong page (if any).
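To illustrate the distinction (a driver-side sketch only, not part of this series; allocation and unmapping are elided):

/* kmalloc() memory sits in the linear mapping, so dma_map_single() is
 * fine; a page that may come from highmem must use dma_map_page(). */
static void example_map(struct device *dev, void *kmalloc_buf, size_t size,
			struct page *maybe_highmem_page)
{
	dma_addr_t d1, d2;

	d1 = dma_map_single(dev, kmalloc_buf, size, DMA_TO_DEVICE);

	/* dma_map_single(dev, kmap(page), ...) would be wrong here: the
	 * kmap address is outside the linear mapping, so virt_to_page()
	 * would pick the wrong page. */
	d2 = dma_map_page(dev, maybe_highmem_page, 0, PAGE_SIZE, DMA_TO_DEVICE);

	/* ... program the device with d1/d2, later release them with
	 * dma_unmap_single()/dma_unmap_page() respectively ... */
}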
Arnd
Hello,
On Monday, June 27, 2011 4:54 PM Arnd Bergmann wrote:
On Monday 27 June 2011, Marek Szyprowski wrote:
On Friday, June 24, 2011 5:24 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
This also breaks dmabounce when used with a highmem-enabled system - dmabounce refuses the dma_map_page() API but allows the dma_map_single() API.
I'm really not sure how this change will break the dma bounce code.

Does it mean that it is allowed to call dma_map_single() on a kmapped highmem page?
dma_map_single on a kmapped page already doesn't work, the argument needs to be inside of the linear mapping in order for virt_to_page to work.
Then I got really confused.
Documentation/DMA-mapping.txt says that dma_map_single() can be used only with the kernel linear mapping, while dma_map_page() can also be called on HIGHMEM pages.
Right, this is true in general.
Ok, so I see no reason for not implementing dma_map_single() on top of dma_map_page(), as has been done in asm-generic/dma-mapping-common.h.
Now, let's go to the arch/arm/common/dmabounce.c code:

dma_addr_t __dma_map_page(struct device *dev, struct page *page,
		unsigned long offset, size_t size, enum dma_data_direction dir)
{
	dev_dbg(dev, "%s(page=%p,off=%#lx,size=%zx,dir=%x)\n",
		__func__, page, offset, size, dir);

	BUG_ON(!valid_dma_direction(dir));

	if (PageHighMem(page)) {
		dev_err(dev, "DMA buffer bouncing of HIGHMEM pages "
			"is not supported\n");
		return ~0;
	}

	return map_single(dev, page_address(page) + offset, size, dir);
}
EXPORT_SYMBOL(__dma_map_page);
Am I right that there is something mixed up here? I really don't get why there is a highmem check in the dma_map_page implementation. dma_map_single doesn't perform such a check and works with kmapped highmem pages...

Russell also pointed out that my patch broke dma bounce with highmem enabled.
The version of __dma_map_page that you cited is the one used with dmabounce enabled; when CONFIG_DMABOUNCE is disabled, the following version is used:

static inline dma_addr_t __dma_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir)
{
	__dma_page_cpu_to_dev(page, offset, size, dir);
	return pfn_to_dma(dev, page_to_pfn(page)) + offset;
}
This does not have the check, because the kernel does not need to touch the kernel mapping in that case.
If you pass a kmapped page into dma_map_single, it should also not work because of the BUG_ON in ___dma_single_cpu_to_dev -- it warns you that you would end up flushing the cache for the wrong page (if any).
Yes, I know that the flow is different when dma bounce is not used. The non-dmabounce version will still work correctly after my patch.

However, I still don't get how my patch broke the dma bounce code with HIGHMEM, which is what Russell pointed out...
Best regards
This patch modifies the dma-mapping implementation on the ARM architecture to use the common dma_map_ops structure and the asm-generic/dma-mapping-common.h helpers.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
---
 arch/arm/Kconfig                   |    1 +
 arch/arm/include/asm/device.h      |    1 +
 arch/arm/include/asm/dma-mapping.h |  201 +++++------------------------------
 arch/arm/mm/dma-mapping.c          |  117 +++++++++++++++++----
 4 files changed, 127 insertions(+), 193 deletions(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 9adc278..0b834c1 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -3,6 +3,7 @@ config ARM default y select HAVE_AOUT select HAVE_DMA_API_DEBUG + select HAVE_DMA_ATTRS select HAVE_IDE select HAVE_MEMBLOCK select RTC_LIB diff --git a/arch/arm/include/asm/device.h b/arch/arm/include/asm/device.h index 9f390ce..d3b35d8 100644 --- a/arch/arm/include/asm/device.h +++ b/arch/arm/include/asm/device.h @@ -7,6 +7,7 @@ #define ASMARM_DEVICE_H
struct dev_archdata { + struct dma_map_ops *dma_ops; #ifdef CONFIG_DMABOUNCE struct dmabounce_device_info *dmabounce; #endif diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 799669d..f4e4968 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -10,6 +10,27 @@ #include <asm-generic/dma-coherent.h> #include <asm/memory.h>
+extern struct dma_map_ops dma_ops; + +static inline struct dma_map_ops *get_dma_ops(struct device *dev) +{ + if (dev->archdata.dma_ops) + return dev->archdata.dma_ops; + return &dma_ops; +} + +static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops) +{ + dev->archdata.dma_ops = ops; +} + +#include <asm-generic/dma-mapping-common.h> + +static inline int dma_set_mask(struct device *dev, u64 mask) +{ + return get_dma_ops(dev)->set_dma_mask(dev, mask); +} + #ifdef __arch_page_to_dma #error Please update to __arch_pfn_to_dma #endif @@ -131,24 +152,6 @@ static inline int dma_supported(struct device *dev, u64 mask) return 1; }
-static inline int dma_set_mask(struct device *dev, u64 dma_mask) -{ -#ifdef CONFIG_DMABOUNCE - if (dev->archdata.dmabounce) { - if (dma_mask >= ISA_DMA_THRESHOLD) - return 0; - else - return -EIO; - } -#endif - if (!dev->dma_mask || !dma_supported(dev, dma_mask)) - return -EIO; - - *dev->dma_mask = dma_mask; - - return 0; -} - /* * DMA errors are defined by all-bits-set in the DMA address. */ @@ -336,167 +339,17 @@ static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, } #endif /* CONFIG_DMABOUNCE */
- -/** - * dma_map_page - map a portion of a page for streaming DMA - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @page: page that buffer resides in - * @offset: offset into page for start of buffer - * @size: size of buffer to map - * @dir: DMA transfer direction - * - * Ensure that any data held in the cache is appropriately discarded - * or written back. - * - * The device owns this memory once this call has completed. The CPU - * can regain ownership by calling dma_unmap_page(). - */ -static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir) -{ - dma_addr_t addr; - - BUG_ON(!valid_dma_direction(dir)); - - addr = __dma_map_page(dev, page, offset, size, dir); - debug_dma_map_page(dev, page, offset, size, dir, addr, false); - - return addr; -} - -/** - * dma_unmap_page - unmap a buffer previously mapped through dma_map_page() - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @handle: DMA address of buffer - * @size: size of buffer (same as passed to dma_map_page) - * @dir: DMA transfer direction (same as passed to dma_map_page) - * - * Unmap a page streaming mode DMA translation. The handle and size - * must match what was provided in the previous dma_map_page() call. - * All other usages are undefined. - * - * After this call, reads by the CPU to the buffer are guaranteed to see - * whatever the device wrote there. - */ - -static inline void dma_unmap_page(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir) -{ - debug_dma_unmap_page(dev, handle, size, dir, false); - __dma_unmap_page(dev, handle, size, dir); -} - -/** - * dma_map_single - map a single buffer for streaming DMA - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @cpu_addr: CPU direct mapped address of buffer - * @size: size of buffer to map - * @dir: DMA transfer direction - * - * Ensure that any data held in the cache is appropriately discarded - * or written back. - * - * The device owns this memory once this call has completed. The CPU - * can regain ownership by calling dma_unmap_single() or - * dma_sync_single_for_cpu(). - */ -static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, - size_t size, enum dma_data_direction dir) -{ - return dma_map_page(dev, virt_to_page(cpu_addr), - (unsigned long)cpu_addr & ~PAGE_MASK, size, dir); -} - -/** - * dma_unmap_single - unmap a single buffer previously mapped - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @handle: DMA address of buffer - * @size: size of buffer (same as passed to dma_map_single) - * @dir: DMA transfer direction (same as passed to dma_map_single) - * - * Unmap a single streaming mode DMA translation. The handle and size - * must match what was provided in the previous dma_map_single() call. - * All other usages are undefined. - * - * After this call, reads by the CPU to the buffer are guaranteed to see - * whatever the device wrote there. 
- */ -static inline void dma_unmap_single(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir) -{ - dma_unmap_page(dev, handle, size, dir); -} - -static inline void dma_sync_single_for_cpu(struct device *dev, - dma_addr_t handle, size_t size, enum dma_data_direction dir) -{ - BUG_ON(!valid_dma_direction(dir)); - - debug_dma_sync_single_for_cpu(dev, handle, size, dir); - - if (!dmabounce_sync_for_cpu(dev, handle, size, dir)) - return; - - __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); -} - -static inline void dma_sync_single_for_device(struct device *dev, - dma_addr_t handle, size_t size, enum dma_data_direction dir) -{ - BUG_ON(!valid_dma_direction(dir)); - - debug_dma_sync_single_for_device(dev, handle, size, dir); - - if (!dmabounce_sync_for_device(dev, handle, size, dir)) - return; - - __dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir); -} - -/** - * dma_sync_single_range_for_cpu - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @handle: DMA address of buffer - * @offset: offset of region to start sync - * @size: size of region to sync - * @dir: DMA transfer direction (same as passed to dma_map_single) - * - * Make physical memory consistent for a single streaming mode DMA - * translation after a transfer. - * - * If you perform a dma_map_single() but wish to interrogate the - * buffer using the cpu, yet do not wish to teardown the PCI dma - * mapping, you must call this function before doing so. At the - * next point you give the PCI dma address back to the card, you - * must first the perform a dma_sync_for_device, and then the - * device again owns the buffer. - */ -static inline void dma_sync_single_range_for_cpu(struct device *dev, - dma_addr_t handle, unsigned long offset, size_t size, - enum dma_data_direction dir) -{ - dma_sync_single_for_cpu(dev, handle + offset, size, dir); -} - -static inline void dma_sync_single_range_for_device(struct device *dev, - dma_addr_t handle, unsigned long offset, size_t size, - enum dma_data_direction dir) -{ - dma_sync_single_for_device(dev, handle + offset, size, dir); -} - /* * The scatter list versions of the above methods. */ -extern int dma_map_sg(struct device *, struct scatterlist *, int, - enum dma_data_direction); -extern void dma_unmap_sg(struct device *, struct scatterlist *, int, +extern int arm_dma_map_sg(struct device *, struct scatterlist *, int, + enum dma_data_direction, struct dma_attrs *attrs); +extern void arm_dma_unmap_sg(struct device *, struct scatterlist *, int, + enum dma_data_direction, struct dma_attrs *attrs); +extern void arm_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int, enum dma_data_direction); -extern void dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int, +extern void arm_dma_sync_sg_for_device(struct device *, struct scatterlist *, int, enum dma_data_direction); -extern void dma_sync_sg_for_device(struct device *, struct scatterlist *, int, - enum dma_data_direction); -
#endif /* __KERNEL__ */ #endif diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index c11f234..5264552 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -25,6 +25,98 @@ #include <asm/tlbflush.h> #include <asm/sizes.h>
+/** + * dma_map_page - map a portion of a page for streaming DMA + * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices + * @page: page that buffer resides in + * @offset: offset into page for start of buffer + * @size: size of buffer to map + * @dir: DMA transfer direction + * + * Ensure that any data held in the cache is appropriately discarded + * or written back. + * + * The device owns this memory once this call has completed. The CPU + * can regain ownership by calling dma_unmap_page(). + */ +static inline dma_addr_t arm_dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + return __dma_map_page(dev, page, offset, size, dir); +} + +/** + * dma_unmap_page - unmap a buffer previously mapped through dma_map_page() + * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices + * @handle: DMA address of buffer + * @size: size of buffer (same as passed to dma_map_page) + * @dir: DMA transfer direction (same as passed to dma_map_page) + * + * Unmap a page streaming mode DMA translation. The handle and size + * must match what was provided in the previous dma_map_page() call. + * All other usages are undefined. + * + * After this call, reads by the CPU to the buffer are guaranteed to see + * whatever the device wrote there. + */ + +static inline void arm_dma_unmap_page(struct device *dev, dma_addr_t handle, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + __dma_unmap_page(dev, handle, size, dir); +} + +static inline void arm_dma_sync_single_for_cpu(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + if (!dmabounce_sync_for_cpu(dev, handle, size, dir)) + return; + + __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); +} + +static inline void arm_dma_sync_single_for_device(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + if (!dmabounce_sync_for_device(dev, handle, size, dir)) + return; + + __dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir); +} + +static int arm_dma_set_mask(struct device *dev, u64 dma_mask) +{ +#ifdef CONFIG_DMABOUNCE + if (dev->archdata.dmabounce) { + if (dma_mask >= ISA_DMA_THRESHOLD) + return 0; + else + return -EIO; + } +#endif + if (!dev->dma_mask || !dma_supported(dev, dma_mask)) + return -EIO; + + *dev->dma_mask = dma_mask; + + return 0; +} + +struct dma_map_ops dma_ops = { + .map_page = arm_dma_map_page, + .unmap_page = arm_dma_unmap_page, + .map_sg = arm_dma_map_sg, + .unmap_sg = arm_dma_unmap_sg, + .sync_single_for_cpu = arm_dma_sync_single_for_cpu, + .sync_single_for_device = arm_dma_sync_single_for_device, + .sync_sg_for_cpu = arm_dma_sync_sg_for_cpu, + .sync_sg_for_device = arm_dma_sync_sg_for_device, + .set_dma_mask = arm_dma_set_mask, +}; +EXPORT_SYMBOL(dma_ops); + static u64 get_coherent_dma_mask(struct device *dev) { u64 mask = ISA_DMA_THRESHOLD; @@ -558,21 +650,18 @@ EXPORT_SYMBOL(___dma_page_dev_to_cpu); * Device ownership issues as mentioned for dma_map_single are the same * here. */ -int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction dir) +int arm_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *s; int i, j;
- BUG_ON(!valid_dma_direction(dir)); - for_each_sg(sg, s, nents, i) { s->dma_address = __dma_map_page(dev, sg_page(s), s->offset, s->length, dir); if (dma_mapping_error(dev, s->dma_address)) goto bad_mapping; } - debug_dma_map_sg(dev, sg, nents, nents, dir); return nents;
bad_mapping: @@ -580,7 +669,6 @@ int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, __dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir); return 0; } -EXPORT_SYMBOL(dma_map_sg);
/** * dma_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg @@ -592,18 +680,15 @@ EXPORT_SYMBOL(dma_map_sg); * Unmap a set of streaming mode DMA translations. Again, CPU access * rules concerning calls here are the same as for dma_unmap_single(). */ -void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction dir) +void arm_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *s; int i;
- debug_dma_unmap_sg(dev, sg, nents, dir); - for_each_sg(sg, s, nents, i) __dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir); } -EXPORT_SYMBOL(dma_unmap_sg);
/** * dma_sync_sg_for_cpu @@ -612,7 +697,7 @@ EXPORT_SYMBOL(dma_unmap_sg); * @nents: number of buffers to map (returned from dma_map_sg) * @dir: DMA transfer direction (same as was passed to dma_map_sg) */ -void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, +void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir) { struct scatterlist *s; @@ -626,10 +711,7 @@ void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, __dma_page_dev_to_cpu(sg_page(s), s->offset, s->length, dir); } - - debug_dma_sync_sg_for_cpu(dev, sg, nents, dir); } -EXPORT_SYMBOL(dma_sync_sg_for_cpu);
/** * dma_sync_sg_for_device @@ -638,7 +720,7 @@ EXPORT_SYMBOL(dma_sync_sg_for_cpu); * @nents: number of buffers to map (returned from dma_map_sg) * @dir: DMA transfer direction (same as was passed to dma_map_sg) */ -void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, +void arm_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir) { struct scatterlist *s; @@ -652,10 +734,7 @@ void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, __dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, dir); } - - debug_dma_sync_sg_for_device(dev, sg, nents, dir); } -EXPORT_SYMBOL(dma_sync_sg_for_device);
#define PREALLOC_DMA_DEBUG_ENTRIES 4096
Hi.
Great job.
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
+static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
+{
+	dev->archdata.dma_ops = ops;
+}
Who calls set_dma_ops()? In the mach. initialization part? What if a device driver does not want to use the arch's dma_map_ops that the machine init procedure set? Even though many archs define their dma_map_ops in the archdata of the device structure, I think it is not a good idea for the device structure to contain a pointer to a dma_map_ops that may not be common to all devices on a board.
I also think that it is better to attach and detach dma_map_ops dynamically. Moreover, a mapping is not permanent on our Exynos platform, because a System MMU may be turned off at runtime.

The DMA API must come together with the IOMMU API in order to initialize the IOMMU at runtime.
Regards, Cho KyongHo.
Hello,
On Monday, June 20, 2011 4:33 PM KyongHo Cho wrote:
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
+static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
+{
+	dev->archdata.dma_ops = ops;
+}
Who calls set_dma_ops()? In the mach. initialization part?
Yes, some board, machine or device bus initialization code is supposed to call this function. Just 'git grep set_dma_ops' and you will see. In my patch series, one of the clients of the set_dma_ops() function is the dmabounce framework (it is called from the dmabounce_register_dev() function).
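A rough sketch of such a caller; the device and ops names are placeholders, only the set_dma_ops()/get_dma_ops() pair comes from this series:

extern struct platform_device my_pdev;		/* placeholder device */
extern struct dma_map_ops my_iommu_dma_ops;	/* placeholder ops    */

static int __init my_board_dma_init(void)
{
	/* Route this device through an IOMMU-aware implementation; devices
	 * left untouched keep the default ops via get_dma_ops()'s fallback. */
	set_dma_ops(&my_pdev.dev, &my_iommu_dma_ops);
	return 0;
}
arch_initcall(my_board_dma_init);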
What if a device driver does not want to use the arch's dma_map_ops that the machine init procedure set?
Could you elaborate on this case? The whole point of the dma-mapping framework is to hide the implementation of DMA mapping operations from the driver. The driver should never fiddle with dma map ops directly.
Even though many archs define their dma_map_ops in the archdata of the device structure, I think it is not a good idea for the device structure to contain a pointer to a dma_map_ops that may not be common to all devices on a board.
It is up to the board/bus startup code to set dma ops correctly.
I also think that it is better to attach and detach dma_map_ops dynamically.
What's the point of such operations? Why do you want to change the dma mapping methods at runtime?
Moreover, a mapping is not permanent on our Exynos platform, because a System MMU may be turned off at runtime.
This is theoretically possible. The System MMU (Samsung IOMMU controller) driver can change dma_map_ops back to NULL on removal, moving the client device back to the generic ARM dma mapping implementation.
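In terms of the interface added here, that detach path could be as simple as the following sketch (the unbind hook itself is hypothetical):

/* Clearing the pointer makes get_dma_ops() fall back to the default
 * linear dma_ops for this device again. */
static void my_sysmmu_unbind(struct device *client)
{
	set_dma_ops(client, NULL);
}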
The DMA API must come together with the IOMMU API in order to initialize the IOMMU at runtime.
I don't understand what's the problem here.
Best regards
On Tue, Jun 21, 2011 at 01:47:03PM +0200, Marek Szyprowski wrote:
I also think that it is better to attach and to detach dma_map_ops dynamically.
What's the point of such operations? Why do you want to change the dma mapping methods at runtime?
That is dangerous. You have to make sure that there are no mappings granted to the device driver before changing the dma_ops of a device at runtime. Otherwise existing mappings for a device may disappear and confuse the driver and the device.
Joerg
On Monday 20 June 2011, Marek Szyprowski wrote:
This patch modifies the dma-mapping implementation on the ARM architecture to use the common dma_map_ops structure and the asm-generic/dma-mapping-common.h helpers.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
This is a good idea in general, but I have a few concerns about details:
First of all, should we only allow using dma_map_ops on ARM, or do we also want to support a case where these are all inlined as before?
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 799669d..f4e4968 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -10,6 +10,27 @@
 #include <asm-generic/dma-coherent.h>
 #include <asm/memory.h>
+extern struct dma_map_ops dma_ops;
+
+static inline struct dma_map_ops *get_dma_ops(struct device *dev)
+{
+	if (dev->archdata.dma_ops)
+		return dev->archdata.dma_ops;
+	return &dma_ops;
+}
I would not name the global structure just 'dma_ops', the identifier could too easily conflict with a local variable in some driver. How about arm_dma_ops or linear_dma_ops instead?
 /*
  * The scatter list versions of the above methods.
  */
-extern int dma_map_sg(struct device *, struct scatterlist *, int,
-		enum dma_data_direction);
-extern void dma_unmap_sg(struct device *, struct scatterlist *, int,
+extern int arm_dma_map_sg(struct device *, struct scatterlist *, int,
+		enum dma_data_direction, struct dma_attrs *attrs);
+extern void arm_dma_unmap_sg(struct device *, struct scatterlist *, int,
+		enum dma_data_direction, struct dma_attrs *attrs);
+extern void arm_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
-extern void dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
+extern void arm_dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
-extern void dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
-		enum dma_data_direction);
You should not need to make these symbols visible in the header file any more unless they are used outside of the main file later.
Arnd
Hello,
On Friday, June 24, 2011 5:37 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
This patch modifies the dma-mapping implementation on the ARM architecture to use the common dma_map_ops structure and the asm-generic/dma-mapping-common.h helpers.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
This is a good idea in general, but I have a few concerns about details:
First of all, should we only allow using dma_map_ops on ARM, or do we also want to support a case where these are all inlined as before?
I really wonder whether it is possible to have a clean implementation of a dma_map_ops-based and an inlined linear dma-mapping framework living together. Theoretically it should be possible, but it would end up with a lot of #ifdef hackery which is really hard to follow and understand for anyone but the authors.
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
Even in the coherent case, the overhead caused by the additional function call should have a really negligible impact on driver performance.
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 799669d..f4e4968 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -10,6 +10,27 @@
 #include <asm-generic/dma-coherent.h>
 #include <asm/memory.h>
+extern struct dma_map_ops dma_ops;
+
+static inline struct dma_map_ops *get_dma_ops(struct device *dev)
+{
+	if (dev->archdata.dma_ops)
+		return dev->archdata.dma_ops;
+	return &dma_ops;
+}
I would not name the global structure just 'dma_ops', the identifier could too easily conflict with a local variable in some driver. How about arm_dma_ops or linear_dma_ops instead?
I'm fine with both of them. Even arm_linear_dma_ops makes some sense.
 /*
  * The scatter list versions of the above methods.
  */
-extern int dma_map_sg(struct device *, struct scatterlist *, int,
-		enum dma_data_direction);
-extern void dma_unmap_sg(struct device *, struct scatterlist *, int,
+extern int arm_dma_map_sg(struct device *, struct scatterlist *, int,
+		enum dma_data_direction, struct dma_attrs *attrs);
+extern void arm_dma_unmap_sg(struct device *, struct scatterlist *, int,
+		enum dma_data_direction, struct dma_attrs *attrs);
+extern void arm_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
-extern void dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
+extern void arm_dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
-extern void dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
-		enum dma_data_direction);
You should not need to make these symbols visible in the header file any more unless they are used outside of the main file later.
They are used by the dma bounce code once it is converted to the dma_map_ops framework.
Best regards
On Monday 27 June 2011, Marek Szyprowski wrote:
On Friday, June 24, 2011 5:37 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
This patch modifies dma-mapping implementation on ARM architecture to use common dma_map_ops structure and asm-generic/dma-mapping-common.h helpers.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
This is a good idea in general, but I have a few concerns about details:
First of all, should we only allow using dma_map_ops on ARM, or do we also want to support a case where these are all inlined as before?
I really wonder if it is possible to have a clean implementation that supports both the dma_map_ops-based framework and the old inlined linear dma-mapping at the same time. Theoretically it should be possible, but it would end up with a lot of #ifdef hackery which is really hard to follow and understand for anyone but the authors.
Right. It's probably not worth unless there is a significant overhead in terms of code size or performance in the coherent linear case.
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
Even in the coherent case, the overhead caused by the additional function call should have a really negligible impact on driver performance.
What about object code size? I guess since ixp23xx is the only platform that announces itself as coherent, we probably don't need to worry about it too much either. Lennert?
On everything else, we only replace a direct function call with an indirect one.
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 799669d..f4e4968 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -10,6 +10,27 @@
 #include <asm-generic/dma-coherent.h>
 #include <asm/memory.h>
+extern struct dma_map_ops dma_ops;
+static inline struct dma_map_ops *get_dma_ops(struct device *dev)
+{
+	if (dev->archdata.dma_ops)
+		return dev->archdata.dma_ops;
+	return &dma_ops;
+}
I would not name the global structure just 'dma_ops', the identifier could too easily conflict with a local variable in some driver. How about arm_dma_ops or linear_dma_ops instead?
I'm fine with both of them. Even arm_linear_dma_ops makes some sense.
Ok, just pick one then if nobody has a strong opinion either way.
You should not need to make these symbols visible in the header file any more unless they are used outside of the main file later.
They are used by the dma bounce code once it is converted to the dma_map_ops framework.
Ok, I see.
Arnd
On Mon, Jun 27, 2011 at 03:19:43PM +0200, Arnd Bergmann wrote:
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
FWIW, when I was hacking on ARM access point routing performance some time ago, turning the L1/L2 cache maintenance operations into inline functions (inlined into the ethernet driver) gave me a significant and measurable performance boost.
Such things can remain product-specific hacks, though.
Even in coherent case, the overhead caused by additional function call should have really negligible impact on drivers performance.
What about object code size? I guess since ixp23xx is the only platform that announces itself as coherent, we probably don't need to worry about it too much either. Lennert?
I don't think so. ixp23xx isn't a very popular platform anymore either, having been discontinued some time ago.
thanks, Lennert
On Thu, Jul 07, 2011 at 02:09:18PM +0200, Lennert Buytenhek wrote:
On Mon, Jun 27, 2011 at 03:19:43PM +0200, Arnd Bergmann wrote:
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
FWIW, when I was hacking on ARM access point routing performance some time ago, turning the L1/L2 cache maintenance operations into inline functions (inlined into the ethernet driver) gave me a significant and measurable performance boost.
On what architecture? Can you show what you did to gain that?
On Thu, Jul 07, 2011 at 01:38:25PM +0100, Russell King - ARM Linux wrote:
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
FWIW, when I was hacking on ARM access point routing performance some time ago, turning the L1/L2 cache maintenance operations into inline functions (inlined into the ethernet driver) gave me a significant and measurable performance boost.
On what architecture? Can you show what you did to gain that?
Patch is attached below. It's an ugly product-specific hack, not suitable for upstreaming in this form, etc etc, but IIRC it gave me a ~5% improvement on packet routing.
From 4e9ab8b1e5fd3a5d7abb3253b653a2990b377f97 Mon Sep 17 00:00:00 2001
From: Lennert Buytenhek buytenh@wantstofly.org Date: Thu, 9 Apr 2009 02:28:54 +0200 Subject: [PATCH] Inline dma_cache_maint()
Signed-off-by: Lennert Buytenhek buytenh@marvell.com --- arch/arm/include/asm/cacheflush.h | 174 ++++++++++++++++++++++++++++++++++++ arch/arm/include/asm/dma-mapping.h | 24 +++++- arch/arm/mm/Kconfig | 1 - arch/arm/mm/cache-feroceon-l2.c | 10 ++- arch/arm/mm/dma-mapping.c | 35 ------- 5 files changed, 205 insertions(+), 39 deletions(-)
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 6cbd8fd..7cc28eb 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -228,9 +228,105 @@ extern struct cpu_cache_fns cpu_cache; * is visible to DMA, or data written by DMA to system memory is * visible to the CPU. */ +#if defined(CONFIG_ARCH_KIRKWOOD) || defined(CONFIG_ARCH_MV78XX0) +#define CACHE_LINE_SIZE 32 + +static inline void l1d_flush_mva(unsigned long addr) +{ + __asm__("mcr p15, 0, %0, c7, c14, 1" : : "r" (addr)); +} + +static inline void l1d_inv_mva_range(unsigned long start, unsigned long end) +{ + unsigned long flags; + + raw_local_irq_save(flags); + __asm__("mcr p15, 5, %0, c15, c14, 0\n\t" + "mcr p15, 5, %1, c15, c14, 1" + : : "r" (start), "r" (end)); + raw_local_irq_restore(flags); +} + +static inline void l1d_clean_mva_range(unsigned long start, unsigned long end) +{ + unsigned long flags; + + raw_local_irq_save(flags); + __asm__("mcr p15, 5, %0, c15, c13, 0\n\t" + "mcr p15, 5, %1, c15, c13, 1" + : : "r" (start), "r" (end)); + raw_local_irq_restore(flags); +} + +static inline void l1d_flush_mva_range(unsigned long start, unsigned long end) +{ + unsigned long flags; + + raw_local_irq_save(flags); + __asm__("mcr p15, 5, %0, c15, c15, 0\n\t" + "mcr p15, 5, %1, c15, c15, 1" + : : "r" (start), "r" (end)); + raw_local_irq_restore(flags); +} + +static inline void dmac_inv_range(const void *_start, const void *_end) +{ + unsigned long start = (unsigned long)_start; + unsigned long end = (unsigned long)_end; + + /* + * Clean and invalidate partial first cache line. + */ + if (start & (CACHE_LINE_SIZE - 1)) { + l1d_flush_mva(start & ~(CACHE_LINE_SIZE - 1)); + start = (start | (CACHE_LINE_SIZE - 1)) + 1; + } + + /* + * Clean and invalidate partial last cache line. + */ + if (start < end && end & (CACHE_LINE_SIZE - 1)) { + l1d_flush_mva(end & ~(CACHE_LINE_SIZE - 1)); + end &= ~(CACHE_LINE_SIZE - 1); + } + + /* + * Invalidate all full cache lines between 'start' and 'end'. + */ + if (start < end) + l1d_inv_mva_range(start, end - CACHE_LINE_SIZE); + + dsb(); +} + +static inline void dmac_clean_range(const void *_start, const void *_end) +{ + unsigned long start = (unsigned long)_start; + unsigned long end = (unsigned long)_end; + + start &= ~(CACHE_LINE_SIZE - 1); + end = (end + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1); + l1d_clean_mva_range(start, end - CACHE_LINE_SIZE); + + dsb(); +} + +static inline void dmac_flush_range(const void *_start, const void *_end) +{ + unsigned long start = (unsigned long)_start; + unsigned long end = (unsigned long)_end; + + start &= ~(CACHE_LINE_SIZE - 1); + end = (end + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1); + l1d_flush_mva_range(start, end - CACHE_LINE_SIZE); + + dsb(); +} +#else #define dmac_inv_range cpu_cache.dma_inv_range #define dmac_clean_range cpu_cache.dma_clean_range #define dmac_flush_range cpu_cache.dma_flush_range +#endif
#else
@@ -286,12 +382,90 @@ static inline void outer_flush_range(unsigned long start, unsigned long end)
#else
+#if defined(CONFIG_ARCH_KIRKWOOD) || defined(CONFIG_ARCH_MV78XX0) +static inline void l2_clean_pa_range(unsigned long start, unsigned long end) +{ + unsigned long flags; + + raw_local_irq_save(flags); + __asm__("mcr p15, 1, %0, c15, c9, 4\n\t" + "mcr p15, 1, %1, c15, c9, 5" + : : "r" (__phys_to_virt(start)), "r" (__phys_to_virt(end))); + raw_local_irq_restore(flags); +} + +static inline void l2_clean_inv_pa(unsigned long addr) +{ + __asm__("mcr p15, 1, %0, c15, c10, 3" : : "r" (addr)); +} + +static inline void l2_inv_pa_range(unsigned long start, unsigned long end) +{ + unsigned long flags; + + raw_local_irq_save(flags); + __asm__("mcr p15, 1, %0, c15, c11, 4\n\t" + "mcr p15, 1, %1, c15, c11, 5" + : : "r" (__phys_to_virt(start)), "r" (__phys_to_virt(end))); + raw_local_irq_restore(flags); +} + +static inline void outer_inv_range(unsigned long start, unsigned long end) +{ + /* + * Clean and invalidate partial first cache line. + */ + if (start & (CACHE_LINE_SIZE - 1)) { + l2_clean_inv_pa(start & ~(CACHE_LINE_SIZE - 1)); + start = (start | (CACHE_LINE_SIZE - 1)) + 1; + } + + /* + * Clean and invalidate partial last cache line. + */ + if (start < end && end & (CACHE_LINE_SIZE - 1)) { + l2_clean_inv_pa(end & ~(CACHE_LINE_SIZE - 1)); + end &= ~(CACHE_LINE_SIZE - 1); + } + + /* + * Invalidate all full cache lines between 'start' and 'end'. + */ + if (start < end) + l2_inv_pa_range(start, end - CACHE_LINE_SIZE); + + dsb(); +} + +static inline void outer_clean_range(unsigned long start, unsigned long end) +{ + start &= ~(CACHE_LINE_SIZE - 1); + end = (end + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1); + if (start != end) + l2_clean_pa_range(start, end - CACHE_LINE_SIZE); + + dsb(); +} + +static inline void outer_flush_range(unsigned long start, unsigned long end) +{ + start &= ~(CACHE_LINE_SIZE - 1); + end = (end + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1); + if (start != end) { + l2_clean_pa_range(start, end - CACHE_LINE_SIZE); + l2_inv_pa_range(start, end - CACHE_LINE_SIZE); + } + + dsb(); +} +#else static inline void outer_inv_range(unsigned long start, unsigned long end) { } static inline void outer_clean_range(unsigned long start, unsigned long end) { } static inline void outer_flush_range(unsigned long start, unsigned long end) { } +#endif
#endif
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 22cb14e..10b517c 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -6,6 +6,7 @@ #include <linux/mm_types.h> #include <linux/scatterlist.h>
+#include <asm/cacheflush.h> #include <asm-generic/dma-coherent.h> #include <asm/memory.h>
@@ -56,7 +57,28 @@ static inline dma_addr_t virt_to_dma(struct device *dev, void *addr) * platforms with CONFIG_DMABOUNCE. * Use the driver DMA support - see dma-mapping.h (dma_sync_*) */ -extern void dma_cache_maint(const void *kaddr, size_t size, int rw); +static inline void +dma_cache_maint(const void *start, size_t size, int direction) +{ +// BUG_ON(!virt_addr_valid(start) || !virt_addr_valid(end - 1)); + + switch (direction) { + case DMA_FROM_DEVICE: /* invalidate only */ + dmac_inv_range(start, start + size); + outer_inv_range(__pa(start), __pa(start) + size); + break; + case DMA_TO_DEVICE: /* writeback only */ + dmac_clean_range(start, start + size); + outer_clean_range(__pa(start), __pa(start) + size); + break; + case DMA_BIDIRECTIONAL: /* writeback and invalidate */ + dmac_flush_range(start, start + size); + outer_flush_range(__pa(start), __pa(start) + size); + break; +// default: +// BUG(); + } +}
/* * Return whether the given device DMA address mask can be supported diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index d490f37..3e4c526 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -690,7 +690,6 @@ config CACHE_FEROCEON_L2 bool "Enable the Feroceon L2 cache controller" depends on ARCH_KIRKWOOD || ARCH_MV78XX0 default y - select OUTER_CACHE help This option enables the Feroceon L2 cache controller.
diff --git a/arch/arm/mm/cache-feroceon-l2.c b/arch/arm/mm/cache-feroceon-l2.c index 355c2a1..f84e34f 100644 --- a/arch/arm/mm/cache-feroceon-l2.c +++ b/arch/arm/mm/cache-feroceon-l2.c @@ -17,6 +17,9 @@ #include <plat/cache-feroceon-l2.h>
+static int l2_wt_override; + +#if 0 /* * Low-level cache maintenance operations. * @@ -94,12 +97,14 @@ static inline void l2_inv_pa_range(unsigned long start, unsigned long end) { l2_inv_mva_range(__phys_to_virt(start), __phys_to_virt(end)); } +#endif
static inline void l2_inv_all(void) { __asm__("mcr p15, 1, %0, c15, c11, 0" : : "r" (0)); }
+#if 0 /* * Linux primitives. * @@ -110,8 +115,6 @@ static inline void l2_inv_all(void) #define CACHE_LINE_SIZE 32 #define MAX_RANGE_SIZE 1024
-static int l2_wt_override; - static unsigned long calc_range_end(unsigned long start, unsigned long end) { unsigned long range_end; @@ -204,6 +207,7 @@ static void feroceon_l2_flush_range(unsigned long start, unsigned long end)
dsb(); } +#endif
/* @@ -318,9 +322,11 @@ void __init feroceon_l2_init(int __l2_wt_override)
disable_l2_prefetch();
+#if 0 outer_cache.inv_range = feroceon_l2_inv_range; outer_cache.clean_range = feroceon_l2_clean_range; outer_cache.flush_range = feroceon_l2_flush_range; +#endif
enable_l2();
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index f1ef561..d866150 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -482,41 +482,6 @@ static int __init consistent_init(void)
core_initcall(consistent_init);
-/* - * Make an area consistent for devices. - * Note: Drivers should NOT use this function directly, as it will break - * platforms with CONFIG_DMABOUNCE. - * Use the driver DMA support - see dma-mapping.h (dma_sync_*) - */ -void dma_cache_maint(const void *start, size_t size, int direction) -{ - void (*inner_op)(const void *, const void *); - void (*outer_op)(unsigned long, unsigned long); - - BUG_ON(!virt_addr_valid(start) || !virt_addr_valid(start + size - 1)); - - switch (direction) { - case DMA_FROM_DEVICE: /* invalidate only */ - inner_op = dmac_inv_range; - outer_op = outer_inv_range; - break; - case DMA_TO_DEVICE: /* writeback only */ - inner_op = dmac_clean_range; - outer_op = outer_clean_range; - break; - case DMA_BIDIRECTIONAL: /* writeback and invalidate */ - inner_op = dmac_flush_range; - outer_op = outer_flush_range; - break; - default: - BUG(); - } - - inner_op(start, start + size); - outer_op(__pa(start), __pa(start) + size); -} -EXPORT_SYMBOL(dma_cache_maint); - /** * dma_map_sg - map a set of SG buffers for streaming mode DMA * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
On Fri, Jul 15, 2011 at 02:10:21AM +0200, Lennert Buytenhek wrote:
On Thu, Jul 07, 2011 at 01:38:25PM +0100, Russell King - ARM Linux wrote:
I suppose for the majority of the cases, the overhead of the indirect function call is near-zero, compared to the overhead of the cache management operation, so it would only make a difference for coherent systems without an IOMMU. Do we care about micro-optimizing those?
FWIW, when I was hacking on ARM access point routing performance some time ago, turning the L1/L2 cache maintenance operations into inline functions (inlined into the ethernet driver) gave me a significant and measurable performance boost.
On what architecture? Can you show what you did to gain that?
Patch is attached below. It's an ugly product-specific hack, not suitable for upstreaming in this form, etc etc, but IIRC it gave me a ~5% improvement on packet routing.
Do you know how much is contributed from each change - L1, L2, moving dma_cache_maint() inline, removing the virt_addr_valid() etc?
On Fri, Jul 15, 2011 at 10:27:17AM +0100, Russell King - ARM Linux wrote:
> I suppose for the majority of the cases, the overhead of the indirect > function call is near-zero, compared to the overhead of the cache > management operation, so it would only make a difference for coherent > systems without an IOMMU. Do we care about micro-optimizing those?
FWIW, when I was hacking on ARM access point routing performance some time ago, turning the L1/L2 cache maintenance operations into inline functions (inlined into the ethernet driver) gave me a significant and measurable performance boost.
On what architecture? Can you show what you did to gain that?
Patch is attached below. It's an ugly product-specific hack, not suitable for upstreaming in this form, etc etc, but IIRC it gave me a ~5% improvement on packet routing.
Do you know how much is contributed from each change - L1, L2, moving dma_cache_maint() inline, removing the virt_addr_valid() etc?
Sorry, I'm not sure -- I never tested it to that granularity, and I don't have access to the hardware anymore now.
This patch converts all dma_sg methods to be generic (independent of the current DMA mapping implementation for ARM architecture). All dma sg operations are now implemented on top of respective dma_map_page/dma_sync_single_for* operations from dma_map_ops structure.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- arch/arm/include/asm/dma-mapping.h | 10 +++--- arch/arm/mm/dma-mapping.c | 59 ++++++++++++++++------------------- 2 files changed, 32 insertions(+), 37 deletions(-)
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index f4e4968..fa73efc 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -340,15 +340,15 @@ static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, #endif /* CONFIG_DMABOUNCE */
/* - * The scatter list versions of the above methods. + * The generic scatter list versions of dma methods. */ -extern int arm_dma_map_sg(struct device *, struct scatterlist *, int, +extern int generic_dma_map_sg(struct device *, struct scatterlist *, int, enum dma_data_direction, struct dma_attrs *attrs); -extern void arm_dma_unmap_sg(struct device *, struct scatterlist *, int, +extern void generic_dma_unmap_sg(struct device *, struct scatterlist *, int, enum dma_data_direction, struct dma_attrs *attrs); -extern void arm_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int, +extern void generic_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int, enum dma_data_direction); -extern void arm_dma_sync_sg_for_device(struct device *, struct scatterlist *, int, +extern void generic_dma_sync_sg_for_device(struct device *, struct scatterlist *, int, enum dma_data_direction);
#endif /* __KERNEL__ */ diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 5264552..ebbd76c 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -107,12 +107,12 @@ static int arm_dma_set_mask(struct device *dev, u64 dma_mask) struct dma_map_ops dma_ops = { .map_page = arm_dma_map_page, .unmap_page = arm_dma_unmap_page, - .map_sg = arm_dma_map_sg, - .unmap_sg = arm_dma_unmap_sg, .sync_single_for_cpu = arm_dma_sync_single_for_cpu, .sync_single_for_device = arm_dma_sync_single_for_device, - .sync_sg_for_cpu = arm_dma_sync_sg_for_cpu, - .sync_sg_for_device = arm_dma_sync_sg_for_device, + .map_sg = generic_dma_map_sg, + .unmap_sg = generic_dma_unmap_sg, + .sync_sg_for_cpu = generic_dma_sync_sg_for_cpu, + .sync_sg_for_device = generic_dma_sync_sg_for_device, .set_dma_mask = arm_dma_set_mask, }; EXPORT_SYMBOL(dma_ops); @@ -635,7 +635,7 @@ void ___dma_page_dev_to_cpu(struct page *page, unsigned long off, EXPORT_SYMBOL(___dma_page_dev_to_cpu);
/** - * dma_map_sg - map a set of SG buffers for streaming mode DMA + * generic_map_sg - map a set of SG buffers for streaming mode DMA * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @sg: list of buffers * @nents: number of buffers to map @@ -650,15 +650,16 @@ EXPORT_SYMBOL(___dma_page_dev_to_cpu); * Device ownership issues as mentioned for dma_map_single are the same * here. */ -int arm_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, +int generic_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, struct dma_attrs *attrs) { + struct dma_map_ops *ops = get_dma_ops(dev); struct scatterlist *s; int i, j;
for_each_sg(sg, s, nents, i) { - s->dma_address = __dma_map_page(dev, sg_page(s), s->offset, - s->length, dir); + s->dma_address = ops->map_page(dev, sg_page(s), s->offset, + s->length, dir, attrs); if (dma_mapping_error(dev, s->dma_address)) goto bad_mapping; } @@ -666,12 +667,12 @@ int arm_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
bad_mapping: for_each_sg(sg, s, i, j) - __dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir); + ops->unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); return 0; }
/** - * dma_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg + * generic_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @sg: list of buffers * @nents: number of buffers to unmap (same as was passed to dma_map_sg) @@ -680,60 +681,54 @@ int arm_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, * Unmap a set of streaming mode DMA translations. Again, CPU access * rules concerning calls here are the same as for dma_unmap_single(). */ -void arm_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, +void generic_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, struct dma_attrs *attrs) { + struct dma_map_ops *ops = get_dma_ops(dev); struct scatterlist *s; + int i;
for_each_sg(sg, s, nents, i) - __dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir); + ops->unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); }
/** - * dma_sync_sg_for_cpu + * generic_sync_sg_for_cpu * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @sg: list of buffers * @nents: number of buffers to map (returned from dma_map_sg) * @dir: DMA transfer direction (same as was passed to dma_map_sg) */ -void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, +void generic_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir) { + struct dma_map_ops *ops = get_dma_ops(dev); struct scatterlist *s; int i;
- for_each_sg(sg, s, nents, i) { - if (!dmabounce_sync_for_cpu(dev, sg_dma_address(s), - sg_dma_len(s), dir)) - continue; - - __dma_page_dev_to_cpu(sg_page(s), s->offset, - s->length, dir); - } + for_each_sg(sg, s, nents, i) + ops->sync_single_for_cpu(dev, sg_dma_address(s) + s->offset, + s->length, dir); }
/** - * dma_sync_sg_for_device + * generic_sync_sg_for_device * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @sg: list of buffers * @nents: number of buffers to map (returned from dma_map_sg) * @dir: DMA transfer direction (same as was passed to dma_map_sg) */ -void arm_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, +void generic_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir) { + struct dma_map_ops *ops = get_dma_ops(dev); struct scatterlist *s; int i;
- for_each_sg(sg, s, nents, i) { - if (!dmabounce_sync_for_device(dev, sg_dma_address(s), - sg_dma_len(s), dir)) - continue; - - __dma_page_cpu_to_dev(sg_page(s), s->offset, - s->length, dir); - } + for_each_sg(sg, s, nents, i) + ops->sync_single_for_device(dev, sg_dma_address(s) + s->offset, + s->length, dir); }
#define PREALLOC_DMA_DEBUG_ENTRIES 4096
Hi.
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
-extern int arm_dma_map_sg(struct device *, struct scatterlist *, int,
+extern int generic_dma_map_sg(struct device *, struct scatterlist *, int,
 		enum dma_data_direction, struct dma_attrs *attrs);
-extern void arm_dma_unmap_sg(struct device *, struct scatterlist *, int,
+extern void generic_dma_unmap_sg(struct device *, struct scatterlist *, int,
 		enum dma_data_direction, struct dma_attrs *attrs);
-extern void arm_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
+extern void generic_dma_sync_sg_for_cpu(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
-extern void arm_dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
+extern void generic_dma_sync_sg_for_device(struct device *, struct scatterlist *, int,
 		enum dma_data_direction);
I don't understand why you renamed arm_dma_*() to generic_dma_*() even though their functionality is not changed and they are still specific to ARM. The names make them look as if they were generic kernel code.
Regards, Cho KyongHo.
On Mon, Jun 20, 2011 at 09:50:09AM +0200, Marek Szyprowski wrote:
This patch converts all dma_sg methods to be generic (independent of the current DMA mapping implementation for ARM architecture). All dma sg operations are now implemented on top of respective dma_map_page/dma_sync_single_for* operations from dma_map_ops structure.
No. We really don't want to do this. If we want to move the dsb() out of the mapping functions (which I have a patch for) to avoid doing a dsb() on each and every sg segment, then we must not use the generic stuff.
Hello,
On Monday, June 20, 2011 4:40 PM Russell King - ARM Linux wrote:
On Mon, Jun 20, 2011 at 09:50:09AM +0200, Marek Szyprowski wrote:
This patch converts all dma_sg methods to be generic (independent of the current DMA mapping implementation for ARM architecture). All dma sg operations are now implemented on top of respective dma_map_page/dma_sync_single_for* operations from dma_map_ops structure.
No. We really don't want to do this.
I assume you want to keep the current design for performance reasons?
It's really not a problem for me. I can change my patches to keep the arm_dma_*_sg_* functions and create some stubs for the dmabounce version.
If we want to move the dsb() out of the mapping functions (which I have a patch for) to avoid doing a dsb() on each and every sg segment, then we must not use the generic stuff.
Ok, specialized (optimized) versions of the dma_*_sg_* operations are definitely better. The current version just calls the respective dma_single_* operations in a loop, so in my patches I decided to create a generic version just to simplify the code.
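As a rough sketch of the trade-off being discussed (this is only an illustration, not Russell's actual patch; the __dma_map_page_nobarrier() helper is hypothetical and error handling is omitted): the generic helper loops over the per-page operation, which performs its own barrier for every segment, while a specialized arm implementation could defer the barrier and issue a single dsb() for the whole scatterlist.

```c
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Generic version: every ops->map_page() call ends with its own barrier. */
static int generic_map_sg_sketch(struct device *dev, struct scatterlist *sg,
				 int nents, enum dma_data_direction dir,
				 struct dma_attrs *attrs)
{
	struct dma_map_ops *ops = get_dma_ops(dev);
	struct scatterlist *s;
	int i;

	for_each_sg(sg, s, nents, i)
		s->dma_address = ops->map_page(dev, sg_page(s), s->offset,
					       s->length, dir, attrs);
	return nents;
}

/* Specialized version: per-segment cache maintenance without a barrier
 * (hypothetical __dma_map_page_nobarrier() helper), then one dsb() once
 * the whole list has been processed. */
static int arm_map_sg_sketch(struct device *dev, struct scatterlist *sg,
			     int nents, enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i;

	for_each_sg(sg, s, nents, i)
		s->dma_address = __dma_map_page_nobarrier(dev, sg_page(s),
						s->offset, s->length, dir);
	dsb();	/* single barrier covers all segments */
	return nents;
}
```

The second form is what a dedicated arm_dma_map_sg() could look like once the barrier is moved out of the per-page cache maintenance.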
Best regards
This patch removes dma bounce hooks from the common dma mapping implementation on ARM architecture and creates a separate set of dma_map_ops for dma bounce devices.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- arch/arm/common/dmabounce.c | 68 ++++++++++++++++++++------ arch/arm/include/asm/dma-mapping.h | 96 ------------------------------------ arch/arm/mm/dma-mapping.c | 83 ++++++++++++++++++++++++++----- 3 files changed, 122 insertions(+), 125 deletions(-)
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c index 9eb161e..5b411ef 100644 --- a/arch/arm/common/dmabounce.c +++ b/arch/arm/common/dmabounce.c @@ -256,7 +256,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size, if (buf == 0) { dev_err(dev, "%s: unable to map unsafe buffer %p!\n", __func__, ptr); - return 0; + return ~0; }
dev_dbg(dev, @@ -278,7 +278,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size, * We don't need to sync the DMA buffer since * it was allocated via the coherent allocators. */ - __dma_single_cpu_to_dev(ptr, size, dir); + dma_ops.sync_single_for_device(dev, dma_addr, size, dir); }
return dma_addr; @@ -317,7 +317,7 @@ static inline void unmap_single(struct device *dev, dma_addr_t dma_addr, } free_safe_buffer(dev->archdata.dmabounce, buf); } else { - __dma_single_dev_to_cpu(dma_to_virt(dev, dma_addr), size, dir); + dma_ops.sync_single_for_cpu(dev, dma_addr, size, dir); } }
@@ -329,14 +329,13 @@ static inline void unmap_single(struct device *dev, dma_addr_t dma_addr, * substitute the safe buffer for the unsafe one. * (basically move the buffer from an unsafe area to a safe one) */ -dma_addr_t __dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir) +static dma_addr_t dmabounce_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { dev_dbg(dev, "%s(page=%p,off=%#lx,size=%zx,dir=%x)\n", __func__, page, offset, size, dir);
- BUG_ON(!valid_dma_direction(dir)); - if (PageHighMem(page)) { dev_err(dev, "DMA buffer bouncing of HIGHMEM pages " "is not supported\n"); @@ -345,7 +344,6 @@ dma_addr_t __dma_map_page(struct device *dev, struct page *page,
return map_single(dev, page_address(page) + offset, size, dir); } -EXPORT_SYMBOL(__dma_map_page);
/* * see if a mapped address was really a "safe" buffer and if so, copy @@ -353,17 +351,16 @@ EXPORT_SYMBOL(__dma_map_page); * the safe buffer. (basically return things back to the way they * should be) */ -void __dma_unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size, - enum dma_data_direction dir) +static void dmabounce_unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size, + enum dma_data_direction dir, struct dma_attrs *attrs) { dev_dbg(dev, "%s(ptr=%p,size=%d,dir=%x)\n", __func__, (void *) dma_addr, size, dir);
unmap_single(dev, dma_addr, size, dir); } -EXPORT_SYMBOL(__dma_unmap_page);
-int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, +static int __dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, size_t sz, enum dma_data_direction dir) { struct safe_buffer *buf; @@ -393,9 +390,17 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr, } return 0; } -EXPORT_SYMBOL(dmabounce_sync_for_cpu);
-int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr, +static void dmabounce_sync_for_cpu(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + if (!__dmabounce_sync_for_cpu(dev, handle, size, dir)) + return; + + dma_ops.sync_single_for_cpu(dev, handle, size, dir); +} + +static int __dmabounce_sync_for_device(struct device *dev, dma_addr_t addr, size_t sz, enum dma_data_direction dir) { struct safe_buffer *buf; @@ -425,7 +430,38 @@ int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr, } return 0; } -EXPORT_SYMBOL(dmabounce_sync_for_device); + +static void dmabounce_sync_for_device(struct device *dev, + dma_addr_t handle, size_t size, enum dma_data_direction dir) +{ + if (!__dmabounce_sync_for_device(dev, handle, size, dir)) + return; + + dma_ops.sync_single_for_device(dev, handle, size, dir); +} + +static int dmabounce_set_mask(struct device *dev, u64 dma_mask) +{ + if (dev->archdata.dmabounce) { + if (dma_mask >= ISA_DMA_THRESHOLD) + return 0; + else + return -EIO; + } + return dma_ops.set_dma_mask(dev, dma_mask); +} + +static struct dma_map_ops dmabounce_ops = { + .map_page = dmabounce_map_page, + .unmap_page = dmabounce_unmap_page, + .sync_single_for_cpu = dmabounce_sync_for_cpu, + .sync_single_for_device = dmabounce_sync_for_device, + .map_sg = generic_dma_map_sg, + .unmap_sg = generic_dma_unmap_sg, + .sync_sg_for_cpu = generic_dma_sync_sg_for_cpu, + .sync_sg_for_device = generic_dma_sync_sg_for_device, + .set_dma_mask = dmabounce_set_mask, +};
static int dmabounce_init_pool(struct dmabounce_pool *pool, struct device *dev, const char *name, unsigned long size) @@ -485,6 +521,7 @@ int dmabounce_register_dev(struct device *dev, unsigned long small_buffer_size, #endif
dev->archdata.dmabounce = device_info; + set_dma_ops(dev, &dmabounce_ops);
dev_info(dev, "dmabounce: registered device\n");
@@ -503,6 +540,7 @@ void dmabounce_unregister_dev(struct device *dev) struct dmabounce_device_info *device_info = dev->archdata.dmabounce;
dev->archdata.dmabounce = NULL; + set_dma_ops(dev, NULL);
if (!device_info) { dev_warn(dev, diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index fa73efc..7ceec8f 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -83,60 +83,6 @@ static inline dma_addr_t virt_to_dma(struct device *dev, void *addr) #endif
/* - * The DMA API is built upon the notion of "buffer ownership". A buffer - * is either exclusively owned by the CPU (and therefore may be accessed - * by it) or exclusively owned by the DMA device. These helper functions - * represent the transitions between these two ownership states. - * - * Note, however, that on later ARMs, this notion does not work due to - * speculative prefetches. We model our approach on the assumption that - * the CPU does do speculative prefetches, which means we clean caches - * before transfers and delay cache invalidation until transfer completion. - * - * Private support functions: these are not part of the API and are - * liable to change. Drivers must not use these. - */ -static inline void __dma_single_cpu_to_dev(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - extern void ___dma_single_cpu_to_dev(const void *, size_t, - enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_single_cpu_to_dev(kaddr, size, dir); -} - -static inline void __dma_single_dev_to_cpu(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - extern void ___dma_single_dev_to_cpu(const void *, size_t, - enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_single_dev_to_cpu(kaddr, size, dir); -} - -static inline void __dma_page_cpu_to_dev(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) -{ - extern void ___dma_page_cpu_to_dev(struct page *, unsigned long, - size_t, enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_page_cpu_to_dev(page, off, size, dir); -} - -static inline void __dma_page_dev_to_cpu(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) -{ - extern void ___dma_page_dev_to_cpu(struct page *, unsigned long, - size_t, enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_page_dev_to_cpu(page, off, size, dir); -} - -/* * Return whether the given device DMA address mask can be supported * properly. For example, if your device can only drive the low 24-bits * during bus mastering, then you would pass 0x00ffffff as the mask @@ -240,7 +186,6 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *, void *, dma_addr_t, size_t);
-#ifdef CONFIG_DMABOUNCE /* * For SA-1111, IXP425, and ADI systems the dma-mapping functions are "magic" * and utilize bounce buffers as needed to work around limited DMA windows. @@ -299,47 +244,6 @@ extern void dmabounce_unregister_dev(struct device *); extern int dma_needs_bounce(struct device*, dma_addr_t, size_t);
/* - * The DMA API, implemented by dmabounce.c. See below for descriptions. - */ -extern dma_addr_t __dma_map_page(struct device *, struct page *, - unsigned long, size_t, enum dma_data_direction); -extern void __dma_unmap_page(struct device *, dma_addr_t, size_t, - enum dma_data_direction); - -/* - * Private functions - */ -int dmabounce_sync_for_cpu(struct device *, dma_addr_t, size_t, enum dma_data_direction); -int dmabounce_sync_for_device(struct device *, dma_addr_t, size_t, enum dma_data_direction); -#else -static inline int dmabounce_sync_for_cpu(struct device *d, dma_addr_t addr, - unsigned long offset, size_t size, enum dma_data_direction dir) -{ - return 1; -} - -static inline int dmabounce_sync_for_device(struct device *d, dma_addr_t addr, - unsigned long offset, size_t size, enum dma_data_direction dir) -{ - return 1; -} - -static inline dma_addr_t __dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir) -{ - __dma_page_cpu_to_dev(page, offset, size, dir); - return pfn_to_dma(dev, page_to_pfn(page)) + offset; -} - -static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir) -{ - __dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)), - handle & ~PAGE_MASK, size, dir); -} -#endif /* CONFIG_DMABOUNCE */ - -/* * The generic scatter list versions of dma methods. */ extern int generic_dma_map_sg(struct device *, struct scatterlist *, int, diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index ebbd76c..9536481 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -25,6 +25,75 @@ #include <asm/tlbflush.h> #include <asm/sizes.h>
+/* + * The DMA API is built upon the notion of "buffer ownership". A buffer + * is either exclusively owned by the CPU (and therefore may be accessed + * by it) or exclusively owned by the DMA device. These helper functions + * represent the transitions between these two ownership states. + * + * Note, however, that on later ARMs, this notion does not work due to + * speculative prefetches. We model our approach on the assumption that + * the CPU does do speculative prefetches, which means we clean caches + * before transfers and delay cache invalidation until transfer completion. + * + * Private support functions: these are not part of the API and are + * liable to change. Drivers must not use these. + */ +static inline void __dma_single_cpu_to_dev(const void *kaddr, size_t size, + enum dma_data_direction dir) +{ + extern void ___dma_single_cpu_to_dev(const void *, size_t, + enum dma_data_direction); + + if (!arch_is_coherent()) + ___dma_single_cpu_to_dev(kaddr, size, dir); +} + +static inline void __dma_single_dev_to_cpu(const void *kaddr, size_t size, + enum dma_data_direction dir) +{ + extern void ___dma_single_dev_to_cpu(const void *, size_t, + enum dma_data_direction); + + if (!arch_is_coherent()) + ___dma_single_dev_to_cpu(kaddr, size, dir); +} + +static inline void __dma_page_cpu_to_dev(struct page *page, unsigned long off, + size_t size, enum dma_data_direction dir) +{ + extern void ___dma_page_cpu_to_dev(struct page *, unsigned long, + size_t, enum dma_data_direction); + + if (!arch_is_coherent()) + ___dma_page_cpu_to_dev(page, off, size, dir); +} + +static inline void __dma_page_dev_to_cpu(struct page *page, unsigned long off, + size_t size, enum dma_data_direction dir) +{ + extern void ___dma_page_dev_to_cpu(struct page *, unsigned long, + size_t, enum dma_data_direction); + + if (!arch_is_coherent()) + ___dma_page_dev_to_cpu(page, off, size, dir); +} + + +static inline dma_addr_t __dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir) +{ + __dma_page_cpu_to_dev(page, offset, size, dir); + return pfn_to_dma(dev, page_to_pfn(page)) + offset; +} + +static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, + size_t size, enum dma_data_direction dir) +{ + __dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)), + handle & ~PAGE_MASK, size, dir); +} + /** * dma_map_page - map a portion of a page for streaming DMA * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices @@ -71,31 +140,17 @@ static inline void arm_dma_unmap_page(struct device *dev, dma_addr_t handle, static inline void arm_dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - if (!dmabounce_sync_for_cpu(dev, handle, size, dir)) - return; - __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); }
static inline void arm_dma_sync_single_for_device(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - if (!dmabounce_sync_for_device(dev, handle, size, dir)) - return; - __dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir); }
static int arm_dma_set_mask(struct device *dev, u64 dma_mask) { -#ifdef CONFIG_DMABOUNCE - if (dev->archdata.dmabounce) { - if (dma_mask >= ISA_DMA_THRESHOLD) - return 0; - else - return -EIO; - } -#endif if (!dev->dma_mask || !dma_supported(dev, dma_mask)) return -EIO;
On Mon, Jun 20, 2011 at 09:50:10AM +0200, Marek Szyprowski wrote:
This patch removes dma bounce hooks from the common dma mapping implementation on ARM architecture and creates a separate set of dma_map_ops for dma bounce devices.
Why all this additional indirection for no gain?
@@ -278,7 +278,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
 		 * We don't need to sync the DMA buffer since
 		 * it was allocated via the coherent allocators.
 		 */
-		__dma_single_cpu_to_dev(ptr, size, dir);
+		dma_ops.sync_single_for_device(dev, dma_addr, size, dir);
 	}

 	return dma_addr;
@@ -317,7 +317,7 @@ static inline void unmap_single(struct device *dev, dma_addr_t dma_addr,
 		}
 		free_safe_buffer(dev->archdata.dmabounce, buf);
 	} else {
-		__dma_single_dev_to_cpu(dma_to_virt(dev, dma_addr), size, dir);
+		dma_ops.sync_single_for_cpu(dev, dma_addr, size, dir);
 	}
 }
Hello,
On Monday, June 20, 2011 4:43 PM Russell King - ARM Linux wrote:
On Mon, Jun 20, 2011 at 09:50:10AM +0200, Marek Szyprowski wrote:
This patch removes dma bounce hooks from the common dma mapping implementation on ARM architecture and creates a separate set of dma_map_ops for dma bounce devices.
Why all this additional indirection for no gain?
I did it to really separate the dmabounce code and let it be completely independent of the particular internal functions of the main generic dma-mapping code.
dmabounce is just one of the possible dma-mapping implementations and it is really convenient to have it hidden behind a common interface (dma_map_ops) rather than having it spread around and hardcoded behind some #ifdefs in the generic ARM dma-mapping code.
There will also be other dma-mapping implementations in the future - I'm thinking mainly of some IOMMU-capable versions.
In terms of speed I really doubt that these changes have any impact on system performance, but they significantly improve the code readability (see the next patch with the cleanup of dma-mapping.c).
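To illustrate the common-interface point with a minimal sketch (simplified from the asm-generic/dma-mapping-common.h style wrappers, with the debug hooks and attrs handling left out): drivers keep calling the usual API, and the per-device ops decide whether the linear or the dmabounce implementation runs.

```c
#include <linux/dma-mapping.h>

/* Minimal sketch of the dispatch: whichever dma_map_ops was registered
 * for the device (the linear ops by default, dmabounce_ops after
 * dmabounce_register_dev()) handles the call - no #ifdef in the caller. */
static inline dma_addr_t sketch_dma_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir)
{
	struct dma_map_ops *ops = get_dma_ops(dev);

	return ops->map_page(dev, page, offset, size, dir, NULL);
}
```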
@@ -278,7 +278,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
 		 * We don't need to sync the DMA buffer since
 		 * it was allocated via the coherent allocators.
 		 */
-		__dma_single_cpu_to_dev(ptr, size, dir);
+		dma_ops.sync_single_for_device(dev, dma_addr, size, dir);
 	}

 	return dma_addr;
@@ -317,7 +317,7 @@ static inline void unmap_single(struct device *dev, dma_addr_t dma_addr,
 		}
 		free_safe_buffer(dev->archdata.dmabounce, buf);
 	} else {
-		__dma_single_dev_to_cpu(dma_to_virt(dev, dma_addr), size, dir);
+		dma_ops.sync_single_for_cpu(dev, dma_addr, size, dir);
 	}
 }
Best regards
On Monday 20 June 2011, Marek Szyprowski wrote:
On Monday, June 20, 2011 4:43 PM Russell King - ARM Linux wrote:
On Mon, Jun 20, 2011 at 09:50:10AM +0200, Marek Szyprowski wrote:
This patch removes dma bounce hooks from the common dma mapping implementation on ARM architecture and creates a separate set of dma_map_ops for dma bounce devices.
Why all this additional indirection for no gain?
I did it to really separate the dmabounce code and let it be completely independent of the particular internal functions of the main generic dma-mapping code.
dmabounce is just one of the possible dma-mapping implementations and it is really convenient to have it hidden behind a common interface (dma_map_ops) rather than having it spread around and hardcoded behind some #ifdefs in the generic ARM dma-mapping code.
There will also be other dma-mapping implementations in the future - I'm thinking mainly of some IOMMU-capable versions.
In terms of speed I really doubt that these changes have any impact on system performance, but they significantly improve the code readability (see the next patch with the cleanup of dma-mapping.c).
Yes. I believe the main effect of splitting out dmabounce into its own set of operations is improved readability for people who are not familiar with the existing code (which excludes Russell ;-) ), by separating the two codepaths and losing various #ifdefs.
The simplification becomes more obvious when you look at patch 6, which removes a lot of the code that becomes redundant after this one.
Still, patches 5 and 6 are certainly not essential; nothing depends on them, and if Russell still doesn't like them, they can easily be dropped.
Arnd
Hello,
On Friday, June 24, 2011 5:47 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
On Monday, June 20, 2011 4:43 PM Russell King - ARM Linux wrote:
On Mon, Jun 20, 2011 at 09:50:10AM +0200, Marek Szyprowski wrote:
This patch removes dma bounce hooks from the common dma mapping implementation on ARM architecture and creates a separate set of dma_map_ops for dma bounce devices.
Why all this additional indirection for no gain?
I did it to really separate the dmabounce code and let it be completely independent of the particular internal functions of the main generic dma-mapping code.
dmabounce is just one of the possible dma-mapping implementations and it is really convenient to have it hidden behind a common interface (dma_map_ops) rather than having it spread around and hardcoded behind some #ifdefs in the generic ARM dma-mapping code.
There will also be other dma-mapping implementations in the future - I'm thinking mainly of some IOMMU-capable versions.
In terms of speed I really doubt that these changes have any impact on system performance, but they significantly improve the code readability (see the next patch with the cleanup of dma-mapping.c).
Yes. I believe the main effect of splitting out dmabounce into its own set of operations is improved readability for people who are not familiar with the existing code (which excludes Russell ;-) ), by separating the two codepaths and losing various #ifdefs.
The simplification becomes more obvious when you look at patch 6, which removes a lot of the code that becomes redundant after this one.
This separation might also help in the future with code consolidation across different architectures. It was suggested that the ARM DMA bounce code has a lot in common with the SWIOTLB (if I'm right) dma-mapping implementation.
The separation will also help in integrating the IOMMU-based dma-mapping, which will probably again share some code with the linear dma-mapping code. Having these 3 implementations mixed together would not help code readability.
Best regards
This patch just performs a global cleanup in DMA mapping implementation for ARM architecture. Some of the tiny helper functions have been moved to the caller code, some have been merged together.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- arch/arm/mm/dma-mapping.c | 133 +++++++++------------------------------------ 1 files changed, 26 insertions(+), 107 deletions(-)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 9536481..7c62e60 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -36,64 +36,12 @@ * the CPU does do speculative prefetches, which means we clean caches * before transfers and delay cache invalidation until transfer completion. * - * Private support functions: these are not part of the API and are - * liable to change. Drivers must not use these. */ -static inline void __dma_single_cpu_to_dev(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - extern void ___dma_single_cpu_to_dev(const void *, size_t, - enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_single_cpu_to_dev(kaddr, size, dir); -} - -static inline void __dma_single_dev_to_cpu(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - extern void ___dma_single_dev_to_cpu(const void *, size_t, - enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_single_dev_to_cpu(kaddr, size, dir); -} - -static inline void __dma_page_cpu_to_dev(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) -{ - extern void ___dma_page_cpu_to_dev(struct page *, unsigned long, +static void __dma_page_cpu_to_dev(struct page *, unsigned long, size_t, enum dma_data_direction); - - if (!arch_is_coherent()) - ___dma_page_cpu_to_dev(page, off, size, dir); -} - -static inline void __dma_page_dev_to_cpu(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) -{ - extern void ___dma_page_dev_to_cpu(struct page *, unsigned long, +static void __dma_page_dev_to_cpu(struct page *, unsigned long, size_t, enum dma_data_direction);
- if (!arch_is_coherent()) - ___dma_page_dev_to_cpu(page, off, size, dir); -} - - -static inline dma_addr_t __dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir) -{ - __dma_page_cpu_to_dev(page, offset, size, dir); - return pfn_to_dma(dev, page_to_pfn(page)) + offset; -} - -static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir) -{ - __dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)), - handle & ~PAGE_MASK, size, dir); -} - /** * dma_map_page - map a portion of a page for streaming DMA * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices @@ -108,11 +56,13 @@ static inline void __dma_unmap_page(struct device *dev, dma_addr_t handle, * The device owns this memory once this call has completed. The CPU * can regain ownership by calling dma_unmap_page(). */ -static inline dma_addr_t arm_dma_map_page(struct device *dev, struct page *page, +static dma_addr_t arm_dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs) { - return __dma_map_page(dev, page, offset, size, dir); + if (!arch_is_coherent()) + __dma_page_cpu_to_dev(page, offset, size, dir); + return pfn_to_dma(dev, page_to_pfn(page)) + offset; }
/** @@ -130,23 +80,29 @@ static inline dma_addr_t arm_dma_map_page(struct device *dev, struct page *page, * whatever the device wrote there. */
-static inline void arm_dma_unmap_page(struct device *dev, dma_addr_t handle, +static void arm_dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs) { - __dma_unmap_page(dev, handle, size, dir); + if (!arch_is_coherent()) + __dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)), + handle & ~PAGE_MASK, size, dir); }
-static inline void arm_dma_sync_single_for_cpu(struct device *dev, +static void arm_dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - __dma_single_dev_to_cpu(dma_to_virt(dev, handle), size, dir); + if (!arch_is_coherent()) + __dma_page_dev_to_cpu(virt_to_page(dma_to_virt(dev, handle)), + handle & ~PAGE_MASK, size, dir); }
-static inline void arm_dma_sync_single_for_device(struct device *dev, +static void arm_dma_sync_single_for_device(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { - __dma_single_cpu_to_dev(dma_to_virt(dev, handle), size, dir); + if (!arch_is_coherent()) + __dma_page_cpu_to_dev(virt_to_page(dma_to_virt(dev, handle)), + handle & ~PAGE_MASK, size, dir); }
static int arm_dma_set_mask(struct device *dev, u64 dma_mask) @@ -567,47 +523,6 @@ void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr } EXPORT_SYMBOL(dma_free_coherent);
-/* - * Make an area consistent for devices. - * Note: Drivers should NOT use this function directly, as it will break - * platforms with CONFIG_DMABOUNCE. - * Use the driver DMA support - see dma-mapping.h (dma_sync_*) - */ -void ___dma_single_cpu_to_dev(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - unsigned long paddr; - - BUG_ON(!virt_addr_valid(kaddr) || !virt_addr_valid(kaddr + size - 1)); - - dmac_map_area(kaddr, size, dir); - - paddr = __pa(kaddr); - if (dir == DMA_FROM_DEVICE) { - outer_inv_range(paddr, paddr + size); - } else { - outer_clean_range(paddr, paddr + size); - } - /* FIXME: non-speculating: flush on bidirectional mappings? */ -} -EXPORT_SYMBOL(___dma_single_cpu_to_dev); - -void ___dma_single_dev_to_cpu(const void *kaddr, size_t size, - enum dma_data_direction dir) -{ - BUG_ON(!virt_addr_valid(kaddr) || !virt_addr_valid(kaddr + size - 1)); - - /* FIXME: non-speculating: not required */ - /* don't bother invalidating if DMA to device */ - if (dir != DMA_TO_DEVICE) { - unsigned long paddr = __pa(kaddr); - outer_inv_range(paddr, paddr + size); - } - - dmac_unmap_area(kaddr, size, dir); -} -EXPORT_SYMBOL(___dma_single_dev_to_cpu); - static void dma_cache_maint_page(struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, void (*op)(const void *, size_t, int)) @@ -652,7 +567,13 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset, } while (left); }
-void ___dma_page_cpu_to_dev(struct page *page, unsigned long off, +/* + * Make an area consistent for devices. + * Note: Drivers should NOT use this function directly, as it will break + * platforms with CONFIG_DMABOUNCE. + * Use the driver DMA support - see dma-mapping.h (dma_sync_*) + */ +static void __dma_page_cpu_to_dev(struct page *page, unsigned long off, size_t size, enum dma_data_direction dir) { unsigned long paddr; @@ -667,9 +588,8 @@ void ___dma_page_cpu_to_dev(struct page *page, unsigned long off, } /* FIXME: non-speculating: flush on bidirectional mappings? */ } -EXPORT_SYMBOL(___dma_page_cpu_to_dev);
-void ___dma_page_dev_to_cpu(struct page *page, unsigned long off, +static void __dma_page_dev_to_cpu(struct page *page, unsigned long off, size_t size, enum dma_data_direction dir) { unsigned long paddr = page_to_phys(page) + off; @@ -687,7 +607,6 @@ void ___dma_page_dev_to_cpu(struct page *page, unsigned long off, if (dir != DMA_TO_DEVICE && off == 0 && size >= PAGE_SIZE) set_bit(PG_dcache_clean, &page->flags); } -EXPORT_SYMBOL(___dma_page_dev_to_cpu);
/** * generic_map_sg - map a set of SG buffers for streaming mode DMA
Introduce new alloc/free/mmap methods that take an attributes argument. alloc/free_coherent can be implemented on top of the new alloc/free calls with NULL attributes. dma_alloc_noncoherent can be implemented using the DMA_ATTR_NONCOHERENT attribute, and dma_alloc_writecombine can use a separate DMA_ATTR_WRITECOMBINE attribute. This way the drivers will get a more generic, platform-independent way of allocating dma memory buffers with specific parameters.
One more attribute can be useful: DMA_ATTR_NOKERNELVADDR. Buffers with this attribute will not have a valid kernel virtual address. They might be useful for drivers that only export the DMA buffers to userspace (like for example V4L2 or ALSA).
The mmap method is introduced to let drivers create a user space mapping for a DMA buffer in a generic, architecture-independent way.
TODO: update all dma_map_ops clients for all architectures
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- include/linux/dma-mapping.h | 13 +++++++++---- 1 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ba8319a..08a6fa8 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -16,10 +16,15 @@ enum dma_data_direction {
 };

 struct dma_map_ops {
-	void* (*alloc_coherent)(struct device *dev, size_t size,
-				dma_addr_t *dma_handle, gfp_t gfp);
-	void (*free_coherent)(struct device *dev, size_t size,
-			      void *vaddr, dma_addr_t dma_handle);
+	void* (*alloc)(struct device *dev, size_t size,
+		       dma_addr_t *dma_handle, gfp_t gfp,
+		       struct dma_attrs *attrs);
+	void (*free)(struct device *dev, size_t size,
+		     void *vaddr, dma_addr_t dma_handle,
+		     struct dma_attrs *attrs);
+	int (*mmap)(struct device *, struct vm_area_struct *,
+		    void *, dma_addr_t, size_t, struct dma_attrs *attrs);
+
 	dma_addr_t (*map_page)(struct device *dev, struct page *page,
 			       unsigned long offset, size_t size,
 			       enum dma_data_direction dir,
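For illustration, here is a rough sketch of how the existing allocators could be layered on top of the new op (a sketch only, assuming the dma_alloc_attrs() entry point and the DMA_ATTR_WRITE_COMBINE attribute proposed in this series; DEFINE_DMA_ATTRS() and dma_set_attr() come from linux/dma-attrs.h):

```c
#include <linux/dma-attrs.h>
#include <linux/dma-mapping.h>

/* Sketch: one entry point that forwards the attributes to the new op. */
static inline void *dma_alloc_attrs(struct device *dev, size_t size,
				    dma_addr_t *dma_handle, gfp_t gfp,
				    struct dma_attrs *attrs)
{
	struct dma_map_ops *ops = get_dma_ops(dev);

	return ops->alloc(dev, size, dma_handle, gfp, attrs);
}

/* dma_alloc_coherent() simply passes NULL attributes ... */
static inline void *sketch_alloc_coherent(struct device *dev, size_t size,
					  dma_addr_t *dma_handle, gfp_t gfp)
{
	return dma_alloc_attrs(dev, size, dma_handle, gfp, NULL);
}

/* ... while dma_alloc_writecombine() sets the attribute proposed here. */
static inline void *sketch_alloc_writecombine(struct device *dev, size_t size,
					      dma_addr_t *dma_handle, gfp_t gfp)
{
	DEFINE_DMA_ATTRS(attrs);

	dma_set_attr(DMA_ATTR_WRITE_COMBINE, &attrs); /* proposed attribute */
	return dma_alloc_attrs(dev, size, dma_handle, gfp, &attrs);
}
```

Existing callers keep their current prototypes; only the backing implementation changes.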
Hi.
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
struct dma_map_ops {
- void* (*alloc_coherent)(struct device *dev, size_t size,
- dma_addr_t *dma_handle, gfp_t gfp);
- void (*free_coherent)(struct device *dev, size_t size,
- void *vaddr, dma_addr_t dma_handle);
+ void* (*alloc)(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t gfp,
+ struct dma_attrs *attrs);
+ void (*free)(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle,
+ struct dma_attrs *attrs);
+ int (*mmap)(struct device *, struct vm_area_struct *,
+ void *, dma_addr_t, size_t, struct dma_attrs *attrs);
dma_addr_t (*map_page)(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir,
I still don't agree with your idea of replacing alloc_coherent() with alloc(). As I said before, we actually do not need dma_alloc_writecombine() anymore because it is not different from dma_alloc_coherent() on ARM. Most other architectures do not have dma_alloc_writecombine(). If you want dma_alloc_coherent() to provide a user virtual address, I believe that is also achievable with the mmap() you introduced.
Regards, Cho KyongHo.
On Mon, Jun 20, 2011 at 11:45:41PM +0900, KyongHo Cho wrote:
I still don't agree with your idea that change alloc_coherent() with alloc(). As I said before, we actually do not need dma_alloc_writecombine() anymore because it is not different from dma_alloc_coherent() in ARM.
Wrong - there is a difference. For pre-ARMv6 CPUs, it returns memory with different attributes from DMA coherent memory.
And we're not going to sweep away pre-ARMv6 CPUs any time soon. So you can't ignore dma_alloc_writecombine() which must remain to sanely support framebuffers.
On Tue, Jun 21, 2011 at 12:06 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
Wrong - there is a difference. For pre-ARMv6 CPUs, it returns memory with different attributes from DMA coherent memory.
And we're not going to sweep away pre-ARMv6 CPUs any time soon. So you can't ignore dma_alloc_writecombine() which must remain to sanely support framebuffers.
OK. Thanks.
Then I think we can implement dma_alloc_writecombine() outside of dma_map_ops. IMHO, the devices that use dma_alloc_writecombine() can get by with the default dma_map_ops. Removing a member from dma_map_ops is too heavy a change.
Hello,
On Monday, June 20, 2011 4:46 PM KyongHo Cho wrote:
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
struct dma_map_ops {
- void* (*alloc_coherent)(struct device *dev, size_t size,
- dma_addr_t *dma_handle, gfp_t gfp);
- void (*free_coherent)(struct device *dev, size_t size,
- void *vaddr, dma_addr_t dma_handle);
+ void* (*alloc)(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t gfp,
+ struct dma_attrs *attrs);
+ void (*free)(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle,
+ struct dma_attrs *attrs);
+ int (*mmap)(struct device *, struct vm_area_struct *,
+ void *, dma_addr_t, size_t, struct dma_attrs *attrs);
dma_addr_t (*map_page)(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir,
I still don't agree with your idea that change alloc_coherent() with alloc().
As I said before, we actually do not need dma_alloc_writecombine() anymore because it is not different from dma_alloc_coherent() in ARM.
You already got a reply that dropping dma_alloc_writecombine() is no go on ARM.
Most of other architectures do not have dma_alloc_writecombine().
That's not a problem. Once we agree on dma_alloc_attrs(), the drivers can be changed to use the DMA_ATTR_WRITE_COMBINE attribute. If the platform doesn't support the attribute, it is just ignored. That's the whole point of the attributes extension. Once a driver is converted to dma_alloc_attrs(), it can be used without any changes either on platforms that support some specific attributes or on ones that don't implement support for any of them.
If you want dma_alloc_coherent() to allocate user virtual address, I believe that it is also available with mmap() you introduced.
Allocation is a separate operation from mapping to userspace. The mmap operation should just map the buffer (represented by a cookie of type dma_addr_t) into the user address space.
Note that some drivers (framebuffer drivers, for example) also need both types of mapping - one for user space and one for kernel virtual space.
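As an illustration only (the structure and field names below are hypothetical, not from the patches): with the generic mmap helper, a driver's fops->mmap handler reduces to forwarding the dma_addr_t cookie, independently of how the buffer was allocated:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct my_drv_buffer {                  /* hypothetical driver bookkeeping */
        struct device *dev;
        void *cpu_addr;
        dma_addr_t dma_handle;
        size_t size;
};

static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct my_drv_buffer *buf = file->private_data;

        /* Map the whole DMA buffer into the calling process */
        return dma_mmap_coherent(buf->dev, vma, buf->cpu_addr,
                                 buf->dma_handle, buf->size);
}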
Best regards
2011/6/21 Marek Szyprowski m.szyprowski@samsung.com:
You already got a reply that dropping dma_alloc_writecombine() is no go on ARM.
Yes.
That's not a problem. Once we agree on dma_alloc_attrs(), the drivers can be changed to use DMA_ATTR_WRITE_COMBINE attribute. If the platform doesn't support the attribute, it is just ignored. That's the whole point of the attributes extension. Once a driver is converted to dma_alloc_attrs(), it can be used without any changes either on platforms that supports some specific attributes or the one that doesn't implement support for any of them.
I'm just worried that removing an existing construct is not a simple job. You introduced 3 attributes: DMA_ATTR_WRITE_COMBINE, ***COHERENT and ***NO_KERNEL_VADDR
As I replied earlier, I also indicated that a write-combining option for dma_alloc_*() is required. But that does not mean dma_map_ops must support it. I think dma_alloc_writecombine() can be implemented independently of dma_map_ops.
Allocation is a separate operation from mapping to userspace. Mmap operation should just map the buffer (represented by a cookie of type dma_addr_t) to user address space.
Note that some drivers (like framebuffer drivers for example) also needs to have both types of mapping - one for user space and one for kernel virtual space.
I meant that I think DMA_ATTR_NOKERNELVADDR is not required because you also introduced the mmap callback.
Hello,
On Wednesday, June 22, 2011 2:01 AM KyongHo Cho wrote:
2011/6/21 Marek Szyprowski m.szyprowski@samsung.com:
You already got a reply that dropping dma_alloc_writecombine() is no go on ARM.
Yes.
That's not a problem. Once we agree on dma_alloc_attrs(), the drivers can be changed to use DMA_ATTR_WRITE_COMBINE attribute. If the platform doesn't support the attribute, it is just ignored. That's the whole point of the attributes extension. Once a driver is converted to dma_alloc_attrs(), it can be used without any changes either on platforms that supports some specific attributes or the one that doesn't implement support for any of them.
I just worried that removing an existing construct is not a simple job. You introduced 3 attributes: DMA_ATTR_WRITE_COMBINE, ***COHERENT and ***NO_KERNEL_VADDR
COHERENT is the default behavior when no attribute is provided. I also don't want to remove existing calls to dma_alloc_coherent() and dma_alloc_writecombine() from the drivers. This can be done later, once the dma_alloc_attrs() API stabilizes.
I replied earlier, I also indicated that write combining option for dma_alloc_*() is required. But it does not mean dma_map_ops must support that. I think dma_alloc_writecombine() can be implemented regardless of dma_map_ops.
The need for dma_alloc_attrs() was discussed at the Memory Management Summit at the Linaro Meeting in Budapest. It simplifies the API and provides full backward compatibility for existing drivers.
Allocation is a separate operation from mapping to userspace. Mmap operation should just map the buffer (represented by a cookie of type dma_addr_t) to user address space.
Note that some drivers (like framebuffer drivers for example) also needs to have both types of mapping - one for user space and one for kernel virtual space.
I meant that I think DMA_ATTR_NOKERNELVADDR is not required because you also introduced mmap callback.
I've already said that the mmap callback is not related to whether the buffer has been allocated with or without a kernel virtual space mapping. I really don't get what you mean here.
Best regards
On Monday 20 June 2011, Marek Szyprowski wrote:
Introduce new alloc/free/mmap methods that take attributes argument. alloc/free_coherent can be implemented on top of the new alloc/free calls with NULL attributes. dma_alloc_non_coherent can be implemented using DMA_ATTR_NONCOHERENT attribute, dma_alloc_writecombine can also use separate DMA_ATTR_WRITECOMBINE attribute. This way the drivers will get more generic, platform independent way of allocating dma memory buffers with specific parameters.
One more attribute can be usefull: DMA_ATTR_NOKERNELVADDR. Buffers with such attribute will not have valid kernel virtual address. They might be usefull for drivers that only exports the DMA buffers to userspace (like for example V4L2 or ALSA).
mmap method is introduced to let the drivers create a user space mapping for a DMA buffer in generic, architecture independent way.
TODO: update all dma_map_ops clients for all architectures
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
Yes, I think that is good, but the change needs to be done atomically across all architectures. This should be easy enough as I believe all other architectures that use dma_map_ops don't even require dma_alloc_noncoherent but just define it to dma_alloc_coherent because they have only coherent memory in regular device drivers.
On a related note, do you plan to make the CMA work use this transparently, or do you want to have a DMA_ATTR_LARGE or DMA_ATTR_CONTIGUOUS for CMA?
Arnd
On Fri, 2011-06-24 at 17:51 +0200, Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
Introduce new alloc/free/mmap methods that take attributes argument. alloc/free_coherent can be implemented on top of the new alloc/free calls with NULL attributes. dma_alloc_non_coherent can be implemented using DMA_ATTR_NONCOHERENT attribute, dma_alloc_writecombine can also use separate DMA_ATTR_WRITECOMBINE attribute. This way the drivers will get more generic, platform independent way of allocating dma memory buffers with specific parameters.
One more attribute can be usefull: DMA_ATTR_NOKERNELVADDR. Buffers with such attribute will not have valid kernel virtual address. They might be usefull for drivers that only exports the DMA buffers to userspace (like for example V4L2 or ALSA).
mmap method is introduced to let the drivers create a user space mapping for a DMA buffer in generic, architecture independent way.
TODO: update all dma_map_ops clients for all architectures
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
Yes, I think that is good, but the change needs to be done atomically across all architectures. This should be easy enough as I believe all other architectures that use dma_map_ops don't even require dma_alloc_noncoherent
This statement is definitely not true of parisc, and also, I believe, not true of sh, so that would have to figure in the conversion work too.
James
but just define it to dma_alloc_coherent because they have only coherent memory in regular device drivers.
On a related note, do you plan to make the CMA work use this transparently, or do you want to have a DMA_ATTR_LARGE or DMA_ATTR_CONTIGUOUS for CMA?
Arnd
On Friday 24 June 2011, James Bottomley wrote:
On Fri, 2011-06-24 at 17:51 +0200, Arnd Bergmann wrote:
Yes, I think that is good, but the change needs to be done atomically across all architectures. This should be easy enough as I believe all other architectures that use dma_map_ops don't even require dma_alloc_noncoherent
This statement is definitely not true of parisc, and also, I believe, not true of sh, so that would have to figure in the conversion work too.
As far as I can tell, parisc uses its own hppa_dma_ops, not dma_map_ops, and arch/sh/include/asm/dma-mapping.h contains an unconditional
#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
If you want to change parisc to use dma_map_ops then I would suggest adding another attribute for alloc_noncoherent.
Arnd
Hello,
On Friday, June 24, 2011 5:52 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
Introduce new alloc/free/mmap methods that take attributes argument. alloc/free_coherent can be implemented on top of the new alloc/free calls with NULL attributes. dma_alloc_non_coherent can be implemented using DMA_ATTR_NONCOHERENT attribute, dma_alloc_writecombine can also use separate DMA_ATTR_WRITECOMBINE attribute. This way the drivers will get more generic, platform independent way of allocating dma memory buffers with specific parameters.
One more attribute can be usefull: DMA_ATTR_NOKERNELVADDR. Buffers with such attribute will not have valid kernel virtual address. They might be usefull for drivers that only exports the DMA buffers to userspace (like for example V4L2 or ALSA).
mmap method is introduced to let the drivers create a user space mapping for a DMA buffer in generic, architecture independent way.
TODO: update all dma_map_ops clients for all architectures
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
Yes, I think that is good, but the change needs to be done atomically across all architectures.
Yes, I'm aware of this and I will include such changes in the next version of my patches.
This should be easy enough as I believe all other architectures that use dma_map_ops don't even require dma_alloc_noncoherent but just define it to dma_alloc_coherent because they have only coherent memory in regular device drivers.
Right, this should be quite simple. I will also add a DMA_ATTR_NON_COHERENT attribute for implementing the dma_alloc_noncoherent() call.
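A rough sketch of what that could look like (DMA_ATTR_NON_COHERENT does not exist yet in this series, so the name is an assumption based on this reply):

#include <linux/dma-mapping.h>
#include <linux/dma-attrs.h>

/* Hypothetical: dma_alloc_noncoherent() expressed via the attrs API */
static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
                                          dma_addr_t *handle, gfp_t gfp)
{
        DEFINE_DMA_ATTRS(attrs);

        dma_set_attr(DMA_ATTR_NON_COHERENT, &attrs);
        return dma_alloc_attrs(dev, size, handle, gfp, &attrs);
}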
On a related note, do you plan to make the CMA work use this transparently, or do you want to have a DMA_ATTR_LARGE or DMA_ATTR_CONTIGUOUS for CMA?
IMHO it will be better to hide CMA from the drivers. Memory allocated from CMA doesn't really differ from the memory allocated by dma_alloc_coherent() (which internally uses alloc_pages()), so I really see no reason to add an additional attribute for it.
Best regards
On Monday 27 June 2011, Marek Szyprowski wrote:
On a related note, do you plan to make the CMA work use this transparently, or do you want to have a DMA_ATTR_LARGE or DMA_ATTR_CONTIGUOUS for CMA?
IMHO it will be better to hide the CMA from the drivers. Memory allocated from CMA doesn't really differ from the one allocated by dma_alloc_coherent() (which internally use alloc_pages()), so I really see no reason for adding additional attribute for it.
Ok, fair enough. On a semi-related topic, IIRC we still need to make sure that dma_alloc_coherent() pages are unmapped from the linear mapping. I hope this is independent of both CMA and this patch.
Arnd
Hello,
On Monday, June 27, 2011 3:22 PM Arnd Bergmann wrote:
On Monday 27 June 2011, Marek Szyprowski wrote:
On a related note, do you plan to make the CMA work use this transparently, or do you want to have a DMA_ATTR_LARGE or DMA_ATTR_CONTIGUOUS for CMA?
IMHO it will be better to hide CMA from the drivers. Memory allocated from CMA doesn't really differ from the memory allocated by dma_alloc_coherent() (which internally uses alloc_pages()), so I really see no reason to add an additional attribute for it.
Ok, fair enough. On a semi-related topic, IIRC we still need to make sure that dma_alloc_coherent() pages are unmapped from the linear mapping. I hope this is independent of both CMA and this patch.
Right, that's one more big item that is still on the TODO list.
Best regards
On Monday 20 June 2011, Marek Szyprowski wrote:
mmap method is introduced to let the drivers create a user space mapping for a DMA buffer in generic, architecture independent way.
One more thing: please split out the mmap change into a separate patch. I sense that there might be some objections to that, and it's better to let people know about it early on than having them complain later when that has already been merged.
Arnd
Hello,
On Friday, June 24, 2011 5:54 PM Arnd Bergmann wrote:
On Monday 20 June 2011, Marek Szyprowski wrote:
mmap method is introduced to let the drivers create a user space mapping for a DMA buffer in generic, architecture independent way.
One more thing: please split out the mmap change into a separate patch.
Ok, no problem.
I sense that there might be some objections to that, and it's better to let people know about it early on than having them complain later when that has already been merged.
Ok, I will also prepare a much more detailed description for the mmap patch.
Best regards
This patch converts the dma_alloc/free/mmap_{coherent,writecombine} functions to use the generic alloc/free/mmap methods from the dma_map_ops structure. A new DMA_ATTR_WRITE_COMBINE DMA attribute has been introduced to implement the writecombine variants.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- arch/arm/common/dmabounce.c | 3 + arch/arm/include/asm/dma-mapping.h | 105 ++++++++++++++++++++++++++---------- arch/arm/mm/dma-mapping.c | 54 +++++++------------ include/linux/dma-attrs.h | 1 + 4 files changed, 99 insertions(+), 64 deletions(-)
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c index 5b411ef..c8338e3 100644 --- a/arch/arm/common/dmabounce.c +++ b/arch/arm/common/dmabounce.c @@ -452,6 +452,9 @@ static int dmabounce_set_mask(struct device *dev, u64 dma_mask) }
static struct dma_map_ops dmabounce_ops = { + .alloc = arm_dma_alloc, + .free = arm_dma_free, + .mmap = arm_dma_mmap, .map_page = dmabounce_map_page, .unmap_page = dmabounce_unmap_page, .sync_single_for_cpu = dmabounce_sync_for_cpu, diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 7ceec8f..09518d9 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -5,6 +5,7 @@
#include <linux/mm_types.h> #include <linux/scatterlist.h> +#include <linux/dma-attrs.h> #include <linux/dma-debug.h>
#include <asm-generic/dma-coherent.h> @@ -122,69 +123,115 @@ static inline void dma_free_noncoherent(struct device *dev, size_t size, }
/** - * dma_alloc_coherent - allocate consistent memory for DMA + * arm_dma_alloc - allocate consistent memory for DMA * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @size: required memory size * @handle: bus-specific DMA address + * @attrs: optinal attributes that specific mapping properties * - * Allocate some uncached, unbuffered memory for a device for - * performing DMA. This function allocates pages, and will - * return the CPU-viewed address, and sets @handle to be the - * device-viewed address. + * Allocate some memory for a device for performing DMA. This function + * allocates pages, and will return the CPU-viewed address, and sets @handle + * to be the device-viewed address. */ -extern void *dma_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t); +extern void *arm_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, + gfp_t gfp, struct dma_attrs *attrs); + +#define dma_alloc_coherent(d,s,h,f) dma_alloc_attrs(d,s,h,f,NULL) + +static inline void *dma_alloc_attrs(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag, + struct dma_attrs *attrs) +{ + struct dma_map_ops *ops = get_dma_ops(dev); + void *cpu_addr; + BUG_ON(!ops); + + cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs); + debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr); + return cpu_addr; +}
/** - * dma_free_coherent - free memory allocated by dma_alloc_coherent + * arm_dma_free - free memory allocated by arm_dma_alloc * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @size: size of memory originally requested in dma_alloc_coherent * @cpu_addr: CPU-view address returned from dma_alloc_coherent * @handle: device-view address returned from dma_alloc_coherent + * @attrs: optinal attributes that specific mapping properties * * Free (and unmap) a DMA buffer previously allocated by - * dma_alloc_coherent(). + * arm_dma_alloc(). * * References to memory and mappings associated with cpu_addr/handle * during and after this call executing are illegal. */ -extern void dma_free_coherent(struct device *, size_t, void *, dma_addr_t); +extern void arm_dma_free(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t handle, struct dma_attrs *attrs); + +#define dma_free_coherent(d,s,c,h) dma_free_attrs(d,s,c,h,NULL) + +static inline void dma_free_attrs(struct device *dev, size_t size, + void *cpu_addr, dma_addr_t dma_handle, + struct dma_attrs *attrs) +{ + struct dma_map_ops *ops = get_dma_ops(dev); + BUG_ON(!ops); + + debug_dma_free_coherent(dev, size, cpu_addr, dma_handle); + ops->free(dev, size, cpu_addr, dma_handle, attrs); +}
/** - * dma_mmap_coherent - map a coherent DMA allocation into user space + * arm_dma_mmap - map a coherent DMA allocation into user space * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices * @vma: vm_area_struct describing requested user mapping * @cpu_addr: kernel CPU-view address returned from dma_alloc_coherent * @handle: device-view address returned from dma_alloc_coherent * @size: size of memory originally requested in dma_alloc_coherent + * @attrs: optinal attributes that specific mapping properties * * Map a coherent DMA buffer previously allocated by dma_alloc_coherent * into user space. The coherent DMA buffer must not be freed by the * driver until the user space mapping has been released. */ -int dma_mmap_coherent(struct device *, struct vm_area_struct *, - void *, dma_addr_t, size_t); +extern int arm_dma_mmap(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + struct dma_attrs *attrs);
+#define dma_mmap_coherent(d,v,c,h,s) dma_mmap_attrs(d,v,c,h,s,NULL)
-/** - * dma_alloc_writecombine - allocate writecombining memory for DMA - * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices - * @size: required memory size - * @handle: bus-specific DMA address - * - * Allocate some uncached, buffered memory for a device for - * performing DMA. This function allocates pages, and will - * return the CPU-viewed address, and sets @handle to be the - * device-viewed address. - */ -extern void *dma_alloc_writecombine(struct device *, size_t, dma_addr_t *, - gfp_t); +static inline int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, + size_t size, struct dma_attrs *attrs) +{ + struct dma_map_ops *ops = get_dma_ops(dev); + BUG_ON(!ops); + return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs); +}
-#define dma_free_writecombine(dev,size,cpu_addr,handle) \ - dma_free_coherent(dev,size,cpu_addr,handle) +static inline void *dma_alloc_writecombine(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) +{ + DEFINE_DMA_ATTRS(attrs); + dma_set_attr(DMA_ATTR_WRITE_COMBINE, &attrs); + return dma_alloc_attrs(dev, size, dma_handle, flag, &attrs); +}
-int dma_mmap_writecombine(struct device *, struct vm_area_struct *, - void *, dma_addr_t, size_t); +static inline void dma_free_writecombine(struct device *dev, size_t size, + void *cpu_addr, dma_addr_t dma_handle) +{ + DEFINE_DMA_ATTRS(attrs); + dma_set_attr(DMA_ATTR_WRITE_COMBINE, &attrs); + return dma_free_attrs(dev, size, cpu_addr, dma_handle, &attrs); +}
+static inline int dma_mmap_writecombine(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size) +{ + DEFINE_DMA_ATTRS(attrs); + dma_set_attr(DMA_ATTR_WRITE_COMBINE, &attrs); + return dma_mmap_attrs(dev, vma, cpu_addr, dma_addr, size, &attrs); +}
/* * For SA-1111, IXP425, and ADI systems the dma-mapping functions are "magic" diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 7c62e60..253e0b4 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -116,6 +116,9 @@ static int arm_dma_set_mask(struct device *dev, u64 dma_mask) }
struct dma_map_ops dma_ops = { + .alloc = arm_dma_alloc, + .free = arm_dma_free, + .mmap = arm_dma_mmap, .map_page = arm_dma_map_page, .unmap_page = arm_dma_unmap_page, .sync_single_for_cpu = arm_dma_sync_single_for_cpu, @@ -433,39 +436,36 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp, * Allocate DMA-coherent memory space and return both the kernel remapped * virtual and bus address for that space. */ -void * -dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp) +void *arm_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, + gfp_t gfp, struct dma_attrs *attrs) { + pgprot_t prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ? + pgprot_writecombine(pgprot_kernel) : + pgprot_dmacoherent(pgprot_kernel); void *memory;
if (dma_alloc_from_coherent(dev, size, handle, &memory)) return memory;
- return __dma_alloc(dev, size, handle, gfp, - pgprot_dmacoherent(pgprot_kernel)); + return __dma_alloc(dev, size, handle, gfp, prot); } -EXPORT_SYMBOL(dma_alloc_coherent);
/* - * Allocate a writecombining region, in much the same way as - * dma_alloc_coherent above. + * Create userspace mapping for the DMA-coherent memory. */ -void * -dma_alloc_writecombine(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp) -{ - return __dma_alloc(dev, size, handle, gfp, - pgprot_writecombine(pgprot_kernel)); -} -EXPORT_SYMBOL(dma_alloc_writecombine); - -static int dma_mmap(struct device *dev, struct vm_area_struct *vma, - void *cpu_addr, dma_addr_t dma_addr, size_t size) +int arm_dma_mmap(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + struct dma_attrs *attrs) { int ret = -ENXIO; #ifdef CONFIG_MMU unsigned long user_size, kern_size; struct arm_vmregion *c;
+ vma->vm_page_prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ? + pgprot_writecombine(vma->vm_page_prot) : + pgprot_dmacoherent(vma->vm_page_prot); + user_size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
c = arm_vmregion_find(&consistent_head, (unsigned long)cpu_addr); @@ -487,27 +487,12 @@ static int dma_mmap(struct device *dev, struct vm_area_struct *vma, return ret; }
-int dma_mmap_coherent(struct device *dev, struct vm_area_struct *vma, - void *cpu_addr, dma_addr_t dma_addr, size_t size) -{ - vma->vm_page_prot = pgprot_dmacoherent(vma->vm_page_prot); - return dma_mmap(dev, vma, cpu_addr, dma_addr, size); -} -EXPORT_SYMBOL(dma_mmap_coherent); - -int dma_mmap_writecombine(struct device *dev, struct vm_area_struct *vma, - void *cpu_addr, dma_addr_t dma_addr, size_t size) -{ - vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); - return dma_mmap(dev, vma, cpu_addr, dma_addr, size); -} -EXPORT_SYMBOL(dma_mmap_writecombine); - /* * free a page as defined by the above mapping. * Must not be called with IRQs disabled. */ -void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle) +void arm_dma_free(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t handle, struct dma_attrs *attrs) { WARN_ON(irqs_disabled());
@@ -521,7 +506,6 @@ void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr
__dma_free_buffer(pfn_to_page(dma_to_pfn(dev, handle)), size); } -EXPORT_SYMBOL(dma_free_coherent);
static void dma_cache_maint_page(struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h index 71ad34e..ada61e1 100644 --- a/include/linux/dma-attrs.h +++ b/include/linux/dma-attrs.h @@ -13,6 +13,7 @@ enum dma_attr { DMA_ATTR_WRITE_BARRIER, DMA_ATTR_WEAK_ORDERING, + DMA_ATTR_WRITE_COMBINE, DMA_ATTR_MAX, };
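A short usage sketch of the attribute added above, combining the allocation and the userspace mapping with the same attrs (the function name is made up for illustration; error handling is trimmed):

#include <linux/dma-mapping.h>
#include <linux/dma-attrs.h>
#include <linux/mm.h>

static int wc_alloc_and_mmap(struct device *dev, struct vm_area_struct *vma,
                             size_t size)
{
        DEFINE_DMA_ATTRS(attrs);
        dma_addr_t dma;
        void *cpu;

        dma_set_attr(DMA_ATTR_WRITE_COMBINE, &attrs);

        /* Allocate a write-combined buffer... */
        cpu = dma_alloc_attrs(dev, size, &dma, GFP_KERNEL, &attrs);
        if (!cpu)
                return -ENOMEM;

        /* ...and map it to user space with the same memory type */
        return dma_mmap_attrs(dev, vma, cpu, dma, size, &attrs);
}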
Hi.
On Mon, Jun 20, 2011 at 4:50 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
-extern void *dma_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t); +extern void *arm_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
+ gfp_t gfp, struct dma_attrs *attrs);
+#define dma_alloc_coherent(d,s,h,f) dma_alloc_attrs(d,s,h,f,NULL)
+static inline void *dma_alloc_attrs(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag,
+ struct dma_attrs *attrs)
+{
+ struct dma_map_ops *ops = get_dma_ops(dev);
+ void *cpu_addr;
+ BUG_ON(!ops);
+ cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs);
+ debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr);
+ return cpu_addr;
+}
Apart from the necessity of alloc with attrs, I would like the generic wrappers around these callbacks to check whether a function pointer is NULL. Suppose that a system wants to use the default ARM implementation of dma_alloc_*() while it uses its own implementations of dma_map_*().
With your suggestion, we have only one option:
void *my_alloc(...) { return dma_alloc_coherent(NULL, ...); }
struct dma_map_ops ops = { .alloc_coherent = &my_alloc, ... };
I think the following method is simpler:
struct dma_map_ops ops = { .alloc_coherent = NULL, ... };
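For clarity, a sketch of the fallback being asked for here (this is the reviewer's suggestion, not what the posted patch does; the patch calls ops->alloc unconditionally):

static inline void *dma_alloc_attrs(struct device *dev, size_t size,
                                    dma_addr_t *dma_handle, gfp_t flag,
                                    struct dma_attrs *attrs)
{
        struct dma_map_ops *ops = get_dma_ops(dev);

        /* Fall back to the default ARM implementation when .alloc is unset */
        if (ops && ops->alloc)
                return ops->alloc(dev, size, dma_handle, flag, attrs);
        return arm_dma_alloc(dev, size, dma_handle, flag, attrs);
}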
Hi Marek,
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I initally thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There are a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here that the buffer would be allocated in user space, and we just use it to map it to the device in the DMA+IOMMU framework?
All the changes introduced in this patch series are intended to prepare a good ground for upcoming generic IOMMU integration to DMA mapping framework on ARM architecture.
For more information about proof-of-concept IOMMU implementation in DMA mapping framework, please refer to my previous set of patches: http://www.spinics.net/lists/linux-mm/msg19856.html
I've tried to split the redesign into a set of single-step changes for easier review and understanding. If there is anything that needs further clarification, please don't hesitate to ask.
The patches are prepared on top of Linux Kernel v3.0-rc3.
The proposed changes have been tested on Samsung Exynos4 platform. I've also tested dmabounce code (by manually registering support for DMA bounce for some of the devices available on my board), although my hardware have no such strict requirements. Would be great if one could test my patches on different ARM architectures to check if I didn't break anything.
Best regards
Regards, Subash SISO-SLG
Hello,
On Wednesday, June 22, 2011 6:53 AM Subash Patel wrote:
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I initally thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There are a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here, that the buffer would be allocated in the user space, and we just use it to map it to the device in DMA+IOMMU framework?
Nope. I proposed an extension which would allow you to allocate a buffer without creating the kernel mapping for it. Right now dma_alloc_coherent() performs 3 operations:
1. allocates memory for the buffer
2. creates a coherent kernel mapping for the buffer
3. translates the physical buffer address to a DMA address that can be used by the hardware.
dma_mmap_coherent() makes an additional mapping for the buffer in the user process virtual address space.
I want to make step 2 in dma_alloc_coherent() optional to save virtual address space: it is a really limited resource. I really want to avoid wasting it on mapping 128MiB buffers just to create a full-HD processing hardware pipeline where no driver will use the kernel mapping at all.
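A hypothetical usage sketch of the proposed attribute (spelled DMA_ATTRIB_NO_KERNEL_MAPPING in the cover letter; it is not implemented in this series, so the name used below is an assumption): a capture driver allocating a large buffer that the CPU never touches.

#include <linux/dma-mapping.h>
#include <linux/dma-attrs.h>

static void *alloc_capture_buffer(struct device *dev, size_t size,
                                  dma_addr_t *dma)
{
        DEFINE_DMA_ATTRS(attrs);

        /* Proposed attribute: skip step 2, no kernel virtual mapping */
        dma_set_attr(DMA_ATTR_NO_KERNEL_MAPPING, &attrs);
        return dma_alloc_attrs(dev, size, dma, GFP_KERNEL, &attrs);
}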
Best regards
On 06/22/2011 12:29 PM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 6:53 AM Subash Patel wrote:
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I initally thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There are a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here, that the buffer would be allocated in the user space, and we just use it to map it to the device in DMA+IOMMU framework?
Nope. I proposed an extension which would allow you to allocate a buffer without creating the kernel mapping for it. Right now dma_alloc_coherent() performs 3 operations:
- allocates memory for the buffer
- creates coherent kernel mapping for the buffer
- translates physical buffer address to DMA address that can be used by
the hardware.
dma_mmap_coherent makes additional mapping for the buffer in user process virtual address space.
I want make the step 2 in dma_alloc_coherent() optional to save virtual address space: it is really limited resource. I really want to avoid wasting it for mapping 128MiB buffers just to create full-HD processing hardware pipeline, where no drivers will use kernel mapping at all.
I think by (2) above you are referring to __dma_alloc_remap()->arm_vmregion_alloc(), which allocates the kernel virtual address for the driver's use. That makes sense now.
I have a query along similar lines, but related to user virtual address space. Is it feasible to extend these DMA interfaces (and IOMMU) to map a user-allocated buffer into the hardware?
Best regards
Regards, Subash SISO-SLG
Hello,
On Wednesday, June 22, 2011 10:53 AM Subash Patel wrote:
On 06/22/2011 12:29 PM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 6:53 AM Subash Patel wrote:
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I
initally
thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There
are
a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here, that the buffer would be allocated in the user space, and we just use it to map it to the device in DMA+IOMMU framework?
Nope. I proposed an extension which would allow you to allocate a buffer without creating the kernel mapping for it. Right now
dma_alloc_coherent()
performs 3 operations:
- allocates memory for the buffer
- creates coherent kernel mapping for the buffer
- translates physical buffer address to DMA address that can be used by
the hardware.
dma_mmap_coherent makes additional mapping for the buffer in user process virtual address space.
I want make the step 2 in dma_alloc_coherent() optional to save virtual address space: it is really limited resource. I really want to avoid wasting it for mapping 128MiB buffers just to create full-HD processing hardware pipeline, where no drivers will use kernel mapping at all.
I think by (2) above, you are referring to __dma_alloc_remap()->arm_vmregion_alloc() to allocate the kernel virtual address for the drivers use. That makes sense now.
Well, this is the particular implementation used on ARM. Other architectures might implement it differently; that's why I used a generic description and didn't point to any particular function.
I have a query in similar lines, but related to user virtual address space. Is it feasible to extend these DMA interfaces(and IOMMU), to map a user allocated buffer into the hardware?
This can be done with the current API, although it may not look so straightforward. You just need to create a scatterlist of the user pages (these can be gathered with the get_user_pages() function) and use the dma_map_sg() function. If the dma-mapping implementation supports an IOMMU, it can map all these pages into a single contiguous buffer in the device (DMA) address space.
Some additional 'magic' might be required to get access to pages that are mapped with a pure PFN (VM_PFNMAP flag), but IMHO it can still be done.
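A rough sketch of that approach, assuming a page-aligned user buffer and omitting the error unwinding and the VM_PFNMAP handling mentioned above (get_user_pages() is shown with its v3.0-era signature):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int map_user_buffer(struct device *dev, unsigned long uaddr,
                           int nr_pages, struct page **pages,
                           struct scatterlist *sgl)
{
        int i, got;

        /* Pin the user pages backing the buffer */
        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr, nr_pages,
                             1 /* write */, 0 /* force */, pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (got != nr_pages)
                return -EFAULT;

        /* Build a scatterlist and hand it to the DMA (and IOMMU) layer */
        sg_init_table(sgl, nr_pages);
        for (i = 0; i < nr_pages; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        return dma_map_sg(dev, sgl, nr_pages, DMA_FROM_DEVICE);
}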
I will try to implement this feature in the videobuf2-dma-contig allocator together with the next version of my patches for dma-mapping & IOMMU.
Best regards
On 06/22/2011 03:27 AM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 10:53 AM Subash Patel wrote:
On 06/22/2011 12:29 PM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 6:53 AM Subash Patel wrote:
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I
initally
thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There
are
a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here, that the buffer would be allocated in the user space, and we just use it to map it to the device in DMA+IOMMU framework?
Nope. I proposed an extension which would allow you to allocate a buffer without creating the kernel mapping for it. Right now
dma_alloc_coherent()
performs 3 operations:
- allocates memory for the buffer
- creates coherent kernel mapping for the buffer
- translates physical buffer address to DMA address that can be used by
the hardware.
dma_mmap_coherent makes additional mapping for the buffer in user process virtual address space.
I want make the step 2 in dma_alloc_coherent() optional to save virtual address space: it is really limited resource. I really want to avoid wasting it for mapping 128MiB buffers just to create full-HD processing hardware pipeline, where no drivers will use kernel mapping at all.
I think by (2) above, you are referring to __dma_alloc_remap()->arm_vmregion_alloc() to allocate the kernel virtual address for the drivers use. That makes sense now.
Well, this is particular implementation which is used on ARM. Other architectures might implement it differently, that's why I used generic description and didn't point to any particular function.
I have a query in similar lines, but related to user virtual address space. Is it feasible to extend these DMA interfaces(and IOMMU), to map a user allocated buffer into the hardware?
This can be done with the current API, although it may not look so straightforward. You just need to create a scatter list of user pages (these can be gathered with get_user_pages function) and use dma_map_sg() function. If the dma-mapping support iommu, it can map all these pages into a single contiguous buffer on device (DMA) address space.
Some additional 'magic' might be required to get access to pages that are mapped with pure PFN (VM_PFNMAP flag), but imho it still can be done.
I will try to implement this feature in videobuf2-dma-config allocator together with the next version of my patches for dma-mapping&iommu.
With luck DMA_ATTRIB_NO_KERNEL_MAPPING should remove any lingering arguments for trying to map user pages. Given that our ultimate goal here is buffer sharing, user allocated pages have limited value and appeal. If anything, I vote that this be a far lower priority compared to the rest of the win you have here.
Jordan
Hi Jordan,
On 06/22/2011 09:30 PM, Jordan Crouse wrote:
On 06/22/2011 03:27 AM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 10:53 AM Subash Patel wrote:
On 06/22/2011 12:29 PM, Marek Szyprowski wrote:
Hello,
On Wednesday, June 22, 2011 6:53 AM Subash Patel wrote:
On 06/20/2011 01:20 PM, Marek Szyprowski wrote:
Hello,
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I
initally
thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
The only change required to dma_map_ops is a new alloc function. During the discussion on Linaro Memory Management meeting in Budapest we got the idea that we can have only one alloc/free/mmap function with additional attributes argument. This way all different kinds of architecture specific buffer mappings can be hidden behind the attributes without the need of creating several versions of dma_alloc_ function. I also noticed that the dma_alloc_noncoherent() function can be also implemented this way with DMA_ATTRIB_NON_COHERENT attribute. Systems that just defines dma_alloc_noncoherent as dma_alloc_coherent will just ignore such attribute.
Another good use case for alloc methods with attributes is the possibility to allocate buffer without a valid kernel mapping. There
are
a number of drivers (mainly V4L2 and ALSA) that only exports the DMA buffers to user space. Such drivers don't touch the buffer data at all. For such buffers we can avoid the creation of a mapping in kernel virtual address space, saving precious vmalloc area. Such buffers might be allocated once a new attribute DMA_ATTRIB_NO_KERNEL_MAPPING.
Are you trying to say here, that the buffer would be allocated in the user space, and we just use it to map it to the device in DMA+IOMMU framework?
Nope. I proposed an extension which would allow you to allocate a buffer without creating the kernel mapping for it. Right now
dma_alloc_coherent()
performs 3 operations:
- allocates memory for the buffer
- creates coherent kernel mapping for the buffer
- translates physical buffer address to DMA address that can be
used by the hardware.
dma_mmap_coherent makes additional mapping for the buffer in user process virtual address space.
I want make the step 2 in dma_alloc_coherent() optional to save virtual address space: it is really limited resource. I really want to avoid wasting it for mapping 128MiB buffers just to create full-HD processing hardware pipeline, where no drivers will use kernel mapping at all.
I think by (2) above, you are referring to __dma_alloc_remap()->arm_vmregion_alloc() to allocate the kernel virtual address for the drivers use. That makes sense now.
Well, this is particular implementation which is used on ARM. Other architectures might implement it differently, that's why I used generic description and didn't point to any particular function.
I have a query in similar lines, but related to user virtual address space. Is it feasible to extend these DMA interfaces(and IOMMU), to map a user allocated buffer into the hardware?
This can be done with the current API, although it may not look so straightforward. You just need to create a scatter list of user pages (these can be gathered with get_user_pages function) and use dma_map_sg() function. If the dma-mapping support iommu, it can map all these pages into a single contiguous buffer on device (DMA) address space.
Some additional 'magic' might be required to get access to pages that are mapped with pure PFN (VM_PFNMAP flag), but imho it still can be done.
I will try to implement this feature in videobuf2-dma-config allocator together with the next version of my patches for dma-mapping&iommu.
With luck DMA_ATTRIB_NO_KERNEL_MAPPING should remove any lingering arguments for trying to map user pages. Given that our ultimate goal here is buffer sharing, user allocated pages have limited value and appeal. If anything, I vote that this be a far lower priority compared to the rest of the win you have here.
We have some rare cases where requirements like the above also exist, so we need the flexibility to map user-allocated buffers to devices as well.
Please refer to my email (http://lists.linaro.org/pipermail/linaro-mm-sig/2011-June/000273.html)
Regards, Subash
Jordan
On Thu, Jun 23, 2011 at 6:09 AM, Subash Patel subashrp@gmail.com wrote:
We have some rare cases, where requirements like above are also there. So we require to have flexibility to map user allocated buffers to devices as well.
Not so rare, I think. When using the OpenGL back end, Qt routinely allocates buffers to hold image assets (e. g., decompressed JPEGs and the glyph cache) and then uses them as textures. Which, if there's a GPU that doesn't participate in the cache coherency protocol, is a problem. (One which we can reliably trigger on our embedded platform.)
The best workaround we have been able to come up with is for Qt's allocator API, which already has a "flags" parameter, to grow an "allocate for use as texture" flag, which makes the allocation come from a separate pool backed by a write-combining uncacheable mapping. Then we can grovel our way through the highest-frequency use cases, restructuring the code that writes these assets to use the approved write-combining tricks.
In the very near future, some of these assets are likely to come from other hardware blocks, such as a hardware JPEG decoder (Subash's use case), a V4L2 capture device, or a OpenMAX H.264 decoder. Those may add orthogonal allocation requirements, such as page alignment or allocation from tightly coupled memory. The only entity that knows what buffers might be passed where is the userland application (or potentially a userland media framework, like StageFright or GStreamer).
So the solution that I'd like to see is for none of these drivers to do their own allocation of buffers that aren't for strictly internal use. Instead, the userland application should ask each component for a "buffer attributes" structure, and merge the attributes of the components that may touch a given buffer in order to get the allocation attributes for that buffer (or for the hugepage from which it will carve out many like it).
The userland would ask the kernel to do the appropriate allocation -- ideally by passing in the merged allocation attributes and getting back a file descriptor, which can be passed around to other processes (over local domain sockets) and mmap'ed. The buffers themselves would have to be registered with each driver that uses them; i. e., the driver's buffer allocation API is replaced with a buffer registration API. If the driver doesn't like the attributes of the mapping from which the buffer was allocated, the registration fails.
I will try to get around to producing some code that does this soon, at least for the Qt/GPU texture asset allocation/registration use case.
Cheers, - Michael
Jonathan -
I'm inviting you to this conversation (and to linaro-mm-sig, if you'd care to participate!), because I'd really like your commentary on what it takes to make write-combining fully effective on various ARMv7 implementations.
The current threads: http://lists.linaro.org/pipermail/linaro-mm-sig/2011-June/000334.html http://lists.linaro.org/pipermail/linaro-mm-sig/2011-June/000263.html
Archive link for a related discussion: http://lists.linaro.org/pipermail/linaro-mm-sig/2011-April/000003.html
Getting full write-combining performance on Intel architectures involves a somewhat delicate dance: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-fr...
And I expect something similar to be necessary in order to avoid the read-modify-write penalty for write-combining buffers on ARMv7. (NEON store-multiple operations can fill an entire 64-byte entry in the victim buffer in one opcode; I don't know whether this is enough to stop the L3 memory system from reading the data before clobbering it.)
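Purely as an illustration of the kind of copy loop under discussion (not from the thread): a C sketch using NEON intrinsics that issues four 16-byte stores per 64-byte chunk. Whether the compiler merges them into a single store-multiple, and whether that is enough to avoid the read, is exactly the open question above; 64-byte alignment of src, dst and len is assumed.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static void copy_to_wc_64(uint8_t *dst, const uint8_t *src, size_t len)
{
        size_t i;

        for (i = 0; i < len; i += 64) {
                /* Load a whole 64-byte chunk into NEON registers... */
                uint8x16_t q0 = vld1q_u8(src + i);
                uint8x16_t q1 = vld1q_u8(src + i + 16);
                uint8x16_t q2 = vld1q_u8(src + i + 32);
                uint8x16_t q3 = vld1q_u8(src + i + 48);

                /* ...then store it back-to-back into the WC buffer */
                vst1q_u8(dst + i, q0);
                vst1q_u8(dst + i + 16, q1);
                vst1q_u8(dst + i + 32, q2);
                vst1q_u8(dst + i + 48, q3);
        }
}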
Cheers, - Michael
On 24 June 2011 01:09, Michael K. Edwards m.k.edwards@gmail.com wrote:
Jonathan -
I'm inviting you to this conversation (and to linaro-mm-sig, if you'd care to participate!), because I'd really like your commentary on what it takes to make write-combining fully effective on various ARMv7 implementations.
Thanks for the invite. I'm not fully conversant with the kernel-level intricacy, but I do know what application code sees, and I have a fairly good idea of what happens at the SDRAM pinout.
Getting full write-combining performance on Intel architectures involves a somewhat delicate dance: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-fr...
At a high level, that looks like a pretty effective technique (and well explained) that works cross-platform with detail changes. However, that describes *read* combining on an uncached area, rather than write combining.
Write combining is easy - in my experience it Just Works on ARMv7 SoCs in general. In practice, I've found that you can write pretty much anything to uncached memory and the write-combiner will deal with it fairly intelligently. Small writes sufficiently close together in time and contiguous in area will be combined reliably. This assumes that the region is marked write-combinable, which should always be the case for plain SDRAM. So memcpy() *to* an uncached zone works okay, even with an alignment mismatch.
Read combining is much harder as soon as you turn off the cache, which defeats all of the nice auto-prefetching mechanisms that tend to be built into modern caches. Even the newest ARM SoCs disable any and all speculative behaviour for uncached reads - it is then not possible to set up a "streaming read" even explicitly (even though you could reasonably express such using PLD).
There will typically be a full cache-miss latency per instruction (20+ cycles on A8), even if the addresses are exactly sequential. If they are not sequential, or if the memory controller does a read or write to somewhere else in the meantime, you also get the CAS or RAS latencies of about 25ns each, which hurt badly (CAS and RAS have not sped up appreciably, in real terms, since PC133 a decade ago - sense amps are not so good at fulfilling Moore's Law). So on a 1GHz Cortex-A8, you can spend 80 clock cycles waiting for a memory load to complete - that's about 20 for the memory system to figure out it's uncacheable and cause the CPU core to replay the instruction twice, 50 waiting for the SDRAM chip to spin up, and another 10 as a fudge factor and to allow the data to percolate up.
This situation is sufficiently common that I assume (and I tell my colleagues to assume) that this is the case. If a vendor were to turn off write-combining for a memory area, I would complain very loudly to them once I discovered it. So far, though, I can only wish that they would sort out the memory hierarchy to make framebuffer & video reads better.
I *have* found one vendor who appears to put GPU command buffers in cached memory, but this necessitates a manual cache cleaning exercise every time the command buffer is flushed. This is a substantial overhead too, but is perhaps easier to optimise.
IMO this whole problem is a hardware design fault. It's SDRAM directly wired to the chip; there's nothing going on that the memory controller doesn't know about. So why isn't the last cache level part of / attached to the memory controller, so that it can be used transparently by all relevant bus masters? It is, BTW, not only ARM that gets this wrong, but in the Wintel world there is so much horsepower to spare that few people notice.
And I expect something similar to be necessary in order to avoid the read-modify-write penalty for write-combining buffers on ARMv7. (NEON store-multiple operations can fill an entire 64-byte entry in the victim buffer in one opcode; I don't know whether this is enough to stop the L3 memory system from reading the data before clobbering it.)
Ah, now you are talking about store misses to cached memory.
Why 64 bytes? VLD1 does 32 bytes (4x64b) and VLDM can do 128 bytes (16x64b). The latter is, I think, bigger than Intel's fill buffers. Each of these has an exactly equivalent store variant.
The manual for the Cortex-A8 states that store misses in the L1 cache are sent to the L2 cache; the L2 cache then has validity bits for every quadword (i.e. four validity domains per line), so a 16-byte store (if aligned) is sufficient to avoid read traffic. I assume that the A9 and A5 are at least as sophisticated, not sure about Snapdragon.
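In NEON intrinsics terms, a write-only fill of one whole line looks roughly like this (a sketch only, assuming GCC with -mfpu=neon and a 64-byte-aligned destination; the helper name is illustrative):

#include <arm_neon.h>
#include <stdint.h>

/* Fill one 64-byte line with four aligned 16-byte stores.  On an L2
 * that tracks validity per quadword, as described above for the A8,
 * each aligned quadword store can be marked valid without the old
 * line contents being fetched first.                                */
static void fill_line(uint8_t *dst, uint8x16_t value)
{
        vst1q_u8(dst,      value);
        vst1q_u8(dst + 16, value);
        vst1q_u8(dst + 32, value);
        vst1q_u8(dst + 48, value);
}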
- Jonathan Morton
Thanks, Jonathan! I agree with you that it is fundamentally a hardware design defect that on-chip devices like GPUs and video capture/encode/decode/display blocks do not participate in the cache coherency protocol, and thus the buffers passed to, from, and among them all have to be mapped uncacheable. Not, unfortunately, something likely to change soon, with one or two possible exceptions that I've heard rumors about ...
With regard to the use of NEON for data moves, I have appended a snippet of a conversation from the BeagleBoard list that veered off into a related direction. (My response is lightly edited, since I made some stupid errors in the original.) While this is somewhat off-topic from Marek's patch set, I think it's relevant to the question of whether "user-allocated" buffers are an important design consideration for his otherwise DMA-centric API. (And more to the point, buffers allocated suitably for one or more on-chip devices, and also mapped as uncacheable to userland.)
Perhaps you could correct misapprehensions and fill in gaps? And comment on what's likely to be different on other ARMv7-A implementations? And then I'll confine myself to review of Marek's patches, on this thread anyway. ;-)
On Jun 24, 4:50 am, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
2011/6/24 Måns Rullgård m...@mansr.com:
"Edwards, Michael" m.k.edwa...@gmail.com writes:
and do have dedicated lanes to memory for the NEON unit
No core released to date, including the A15, has dedicated memory lanes for NEON. All the Cortex-A* cores have a common load/store unit for all types of instructions. Some can do multiple concurrent accesses, but that's orthogonal to this discussion.
Probably he wanted to say that the NEON unit in the Cortex-A8 can load/store 128 bits of data per cycle when accessing L1 cache *memory*, while ordinary ARM load/store instructions can't handle more than 64 bits per cycle there. This makes sense in the context of this discussion because loading data directly into NEON/VFP registers, without dragging it through the ARM registers, is not a bad idea.
That's close to what I meant. The load/store path *to main memory* is indeed shared. But within the cache hierarchy, at least on the Cortex-A8, ARM and NEON take separate paths. And that's a good thing, because the ARM stalls on an L1 miss, and it would be rather bad if it had to wait for a big NEON transfer to complete before it could fill from L2. Moreover, the only way to get "streaming" performance (back-to-back AXI burst transactions) on uncacheable regions is by using the NEON. That's almost impossible to determine from the TRM, but it's there: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch09s04s03.html . Compare against the LDM/STM section of http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch09s01s02.html .
On the A8, the NEON bypasses the L1 cache, and has a dedicated lane (probably the wrong word, sorry) into the L2 cache -- or for uncacheable mappings, *past* the L2 per se to its AXI scheduler. See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/ch08s02... . In addition, NEON load/store operations can be issued in parallel with integer code, and there can be as many as 12 NEON reads outstanding in L2 -- vs. the maximum of 4 total cache line refills and evictions. So if you are moving data around without doing non-SIMD operations on it, and without branching based on its contents, you can do so without polluting L1 cache, or contending with L1 misses that hit in L2.
There will be some contention between NEON-side loads and ARM-side L2 misses, but even that is negligible if you issue a preload early enough (which you should do anyway for fetches that you suspect will miss L2, because the compiler schedules loads based on the assumption of an L1 hit; an L1 miss stalls the ARM side until it's satisfied). Preloads do not have any effect if you miss in TLB, and they don't force premature evictions from L1 cache (they only load as far as L2). And the contention on the write side is negligible thanks to the write allocation mechanism, except insofar as you may approach saturation of the AXI interface due to the total rate of L2 evictions/linefills and cache-bypassing traffic -- in which case, congratulations! Your code is well tuned and operates at the maximum rate that the path to main memory permits.
If you are fetching data from an uncacheable region, using the NEON to trampoline into a cacheable region should be a *huge* win. Remember, an L1 miss stalls the ARM side, and the only way to get data into L1 is to fetch and miss. If you want it to hit in L2, you have to use the NEON to put it there, by fetching up to 128 bytes at a go from the uncacheable region (e.g., VLDM r1, {d16-d31}) and storing it to a cacheable buffer (i.e., only as far as L2, since you rewrite it over and over and it never needs to be evicted). You want to limit fetches from the ARM side to cacheable regions; otherwise every LDM is a round-trip to AXI.
The store story is similar. You want the equivalent of the x86's magic "fill buffers" -- which avoid the read-modify-write penalty when writing whole cache lines' worth of data through uncacheable write-combining mappings, but only if you use cache-bypassing SSE2 writes. To get it, you need to write from the ARM to cacheable memory, then load that data to NEON registers and store from there. That pushes up to two whole cache lines' worth of data at a time down to the L2 controller, which queues the write without blocking the NEON. (This is the only way to get an AXI burst longer than 4 128-bit transactions without using the preload engine.)
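A rough userspace sketch of both directions, using GCC NEON intrinsics (built with -mfpu=neon; the function names, the 64-byte chunking and the alignment/size assumptions are illustrative, and the compiler may or may not emit the multi-register VLDM/VSTM forms - guaranteeing those takes hand-written assembly):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Pull data in from an uncached source with wide NEON loads, 64 bytes
 * per iteration, into a small cacheable staging buffer, so the ARM
 * side only ever reads cacheable memory.  len is assumed to be a
 * multiple of 64 and both pointers suitably aligned.                */
static void stage_in(uint8_t *cached_dst, const uint8_t *uncached_src,
                     size_t len)
{
        size_t i;

        for (i = 0; i < len; i += 64) {
                uint8x16_t a = vld1q_u8(uncached_src + i);
                uint8x16_t b = vld1q_u8(uncached_src + i + 16);
                uint8x16_t c = vld1q_u8(uncached_src + i + 32);
                uint8x16_t d = vld1q_u8(uncached_src + i + 48);

                vst1q_u8(cached_dst + i,      a);
                vst1q_u8(cached_dst + i + 16, b);
                vst1q_u8(cached_dst + i + 32, c);
                vst1q_u8(cached_dst + i + 48, d);
        }
}

/* The reverse: the ARM fills the cacheable staging buffer, then NEON
 * streams whole 64-byte chunks out to the uncached (write-combining)
 * destination in big bursts.                                        */
static void stage_out(uint8_t *uncached_dst, const uint8_t *cached_src,
                      size_t len)
{
        size_t i;

        for (i = 0; i < len; i += 64) {
                vst1q_u8(uncached_dst + i,      vld1q_u8(cached_src + i));
                vst1q_u8(uncached_dst + i + 16, vld1q_u8(cached_src + i + 16));
                vst1q_u8(uncached_dst + i + 32, vld1q_u8(cached_src + i + 32));
                vst1q_u8(uncached_dst + i + 48, vld1q_u8(cached_src + i + 48));
        }
}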
One more nice thing about doing your bulk data transfers this way, instead of monkeying with some DMA unit (which you probably can't do in userland anyway), is that there are no explicit cache operations to deal with. You don't have to worry about data stalling in L1, because the NEON loads do peek data *from* L1 even though they don't load data *to* L1. (Not unless you turn on the L1NEON bit in the Auxiliary Control Register, which you don't want to do unless you have no L2 cache, in which case you have a whole different set of problems.)
The Cortex-A9 is a whole different animal, with out-of-order issue on the ARM side and two automatic prefetch mechanisms (based on detection of miss patterns at L1 and, in MPCore only, at L2). It also has a far less detailed TRM, so I can't begin to analyze its memory hierarchy. Given that the L2 cache has been hived off to an external unit, and the penalty for transfers between the ARM and NEON units has been greatly decreased, I would guess that the NEON goes through the L1 just like the ARM. That changes the game a little -- the NEON transfers to/from cacheable memory can now cause eviction of the ARM's working set from L1 -- but in practice that's probably a wash. The basic premise (that you want to do your noncacheable transactions in big bursts, feasible only from the NEON side) still holds.
-- the compiler can tighten up the execution of rather a lot of code by trampolining structure fetches and stores through the NEON.
Do you have any numbers to back this up? I don't see how going through NEON registers would be faster than direct LDM/STM on any core.
My understanding is that it's exactly the other way around. Using hardfp makes it possible to avoid going through ARM registers for floating-point data, which otherwise might be needed solely to fulfil ABI requirements in some cases. You are going a bit overboard trying to argue with absolutely everything that Edwards has posted :)
Not just for floating point data, but for SIMD integer data as well, or really anything you want -- as long as you frame it as a "Homogeneous Aggregate of containerized vectors". That's an extra 64 bytes of structure that you can pass in, and let the callee decide whether and when to spill a copy to a cache-line-aligned buffer (so that it can then fetch the lot to the ARM L1 -- which might as well be registers, as far as memory latency is concerned -- in one L1 miss). Or you can do actual float/SIMD operations with the data, and return a healthy chunk in registers, without ever touching memory. (To be precise, per the AAPCS, you can pass in one 64-byte chunk as a "Homogeneous Aggregate with a Base Type of 128-bit containerized vectors with four Elements", and return a similar chunk in the same registers, with either the same or different contents.)
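In C terms it looks something like this (a sketch, assuming GCC with -mfpu=neon and -mfloat-abi=hard; the type and function names are illustrative, and whether everything really stays in q0-q3 depends on the compiler and optimisation level):

#include <arm_neon.h>

/* Four 128-bit containerized vectors: a Homogeneous Aggregate under
 * the VFP-variant AAPCS, so with -mfloat-abi=hard it is passed and
 * returned in q0-q3 rather than in memory or core registers.        */
struct qblock {
        uint8x16_t v[4];
};

/* Illustrative callee: operates on the whole 64-byte block without
 * anything having to cross over to the ARM register file.           */
static struct qblock qblock_xor(struct qblock a, struct qblock b)
{
        struct qblock r;
        int i;

        for (i = 0; i < 4; i++)
                r.v[i] = veorq_u8(a.v[i], b.v[i]);
        return r;
}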
The point is not really to have "more registers"; the integer "registers" are just names anyway, and the L1 cache is almost as close. Nor is it to pass floating point values to and from public function calls cheaply; that's worth almost nothing on system scale. Even in code that uses no floating point or SIMD whatever, there are potentially big gains from:
* preserving an additional 64 bytes of VFP/NEON state across functions that don't need big operands or return values, if you are willing to alter their function signatures to do so (at zero run-time cost, if you're systematic about it); or alternately:
* postponing the transfer of up to 64 bytes of operands from the VFP/NEON bank to the integer side, allowing more time for pending NEON operations (especially structure loads) to complete;
* omitting the transfer from NEON to ARM entirely, if the operands turn out to be unneeded (or simply written elsewhere in memory without needing to be touched by the ARM);
* returning up to 64 bytes of results in the VFP/NEON register bank, possibly from an address that missed in L2, without stalling to wait for a pending load to complete;
* and, if you really do have to move those operands to the ARM, doing so explicitly and efficiently (by spilling the whole block to a cache-line-aligned buffer in L2, fetching it back into L1 with a single load, and filling the delay with some other useful work) instead of in the worst way possible (by transferring them from VFP to ARM registers, 4 bytes at a time, before entering the function).
As for NEON vs. LDM/STM: there is indeed no reason why, for example, a NEON memcpy should be faster than LDM/STM for large memory buffers which do not fit in the caches. But it still is the case on OMAP3, along with some other memory-performance-related WTF questions.
I hope I've clarified this a bit above. But don't take my word for it; these techniques are almost exactly the same as those described in Intel's cheat sheet at http://software.intel.com/en-us/articles/copying-accelerated-video-decode-fr... , except that there is no need for the equivalent of "fill buffers" / "write combining buffers" because VLDM/VSTM can move 128 bytes at a time. (It's probable that the right micro-optimization is to work in 64-byte chunks and pipeline more deeply; I haven't benchmarked yet.)
If, that is, it can schedule them appropriately to account for latencies to and from memory as well as the (reduced but non-zero) latency of VFP<->ARM transfers.
The out of order issue on A9 and later makes most such tricks unnecessary.
The VFP/NEON unit in the A9 is still in-order.
True but mostly irrelevant. If your code is at all tight, and your working set doesn't fit into L2 cache, all the mere arithmetic pipelines should be stalled most of the time. The name of the game is to race as quickly as possible from one fetch from an uncacheable / unpredictable address to the next that depends on it, and to get as high an interleave among such fetch chains as possible. If your working set isn't larger than L2 cache, why bother thinking about performance at all? Your algorithm could be O(n^3) and coded by banana slugs, and it would still get the job done.
Cheers, - Michael
On 25 June 2011 12:55, Michael K. Edwards m.k.edwards@gmail.com wrote:
With regard to the use of NEON for data moves, I have appended a snippet of a conversation from the BeagleBoard list that veered off into a related direction. (My response is lightly edited, since I made some stupid errors in the original.) While this is somewhat off-topic from Marek's patch set, I think it's relevant to the question of whether "user-allocated" buffers are an important design consideration for his otherwise DMA-centric API. (And more to the point, buffers allocated suitably for one or more on-chip devices, and also mapped as uncacheable to userland.)
As far as userspace is concerned, dealing with the memory hierarchy's quirks is already pretty much a black art, and that's *before* you start presenting it with uncached buffers. The best rule of thumb userspace can follow is to keep things in cache if they can, and use the biggest memory-move instructions (and prefetching if available) if they can't. Everything else they have to rely on the hardware to optimise for them. Indeed, when working in C, you barely even get *that* level of control (optimised copy routines have been known to use double simply because it is reliably 64 bits that can be loaded and stored efficiently), and most other languages are worse.
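The 'double' trick alluded to above boils down to something like this (a sketch; 8-byte alignment and a length that is a multiple of 8 are assumed, and whether the compiler actually emits 64-bit or wider loads and stores is entirely up to it):

#include <stddef.h>
#include <stdint.h>

/* Portable-C copy loop that moves 64 bits per iteration, in the hope
 * that the compiler turns each assignment into a single 64-bit load
 * and store.                                                        */
static void copy64(void *dst, const void *src, size_t len)
{
        uint64_t *d = dst;
        const uint64_t *s = src;
        size_t i;

        for (i = 0; i < len / 8; i++)
                d[i] = s[i];
}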
Small wonder that userspace code that knows it has to work with uncached buffers sometimes - such as Pixman - relies heavily on handwritten SIMD assembler.
Video decoders are a particularly fun case, because the correct solution is actually to DMA the output buffer to the GPU (or, better, to map one onto the other so that zero-copy semantics result) so that the CPU doesn't have to touch it. But then you have to find a common format that both VPU and GPU support, and you have to have a free DMA channel and a way to use it. Frankly though, this is a solution from the 20th century (remember MPEG2 decoders sitting beside the SVGA card?).
We *have* had to occasionally deal with hardware where no such common format could be found, although often this has been due to inadequate documentation or driver support (a familiar refrain). In one case I wrote a NEON NV12-to-RGB32 conversion routine which read directly from the video buffer and wrote directly into a texture buffer, both of which were of course uncached. This halved the CPU consumption of the video playback applet, but prefixing it with a routine which copied the video buffer into cached memory (using 32-byte VLD1 instead of 16-byte versions) halved it again. Profiling showed that the vast majority of the time was spent in the prefix copy loop. No doubt if further savings had been required, I'd have tried using VLDM in the copy loop. (There weren't enough registers to widen the load stage of the conversion routine itself.)
The takeaway from this is that if your code has to read from uncached memory at all, that will undoubtedly dominate its performance. A read-modify-write cycle is at least as bad (because the memory has to go through at least one CAS latency and a write-to-read turnaround before the next read can be serviced). A pure write is, however, no problem.
On cached memory, the L2 cache of most modern (even ARM) CPUs has an auto-prefetcher which will help out with sequential transfers. This should get somewhere reasonably close to optimal performance.
- Jonathan Morton
On Wednesday 22 June 2011, Jordan Crouse wrote:
I have a query along similar lines, but related to user virtual address space. Is it feasible to extend these DMA interfaces (and the IOMMU) to map a user-allocated buffer into the hardware?
This can be done with the current API, although it may not look so straightforward. You just need to create a scatterlist of the user pages (these can be gathered with the get_user_pages() function) and use the dma_map_sg() function. If the dma-mapping implementation supports an IOMMU, it can map all these pages into a single contiguous buffer in the device (DMA) address space.
Some additional 'magic' might be required to get access to pages that are mapped with a pure PFN (VM_PFNMAP flag), but IMHO it can still be done.
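In rough outline it would look something like this (a sketch only, not code from the patch set; page offsets, partial lengths and the VM_PFNMAP case are glossed over, and the function name is illustrative):

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/sched.h>

/* Pin nr_pages of a user buffer starting at uaddr, build a
 * scatterlist from the pinned pages and hand it to dma_map_sg(),
 * which may merge them into one contiguous DMA range if an IOMMU
 * is available.  Returns the number of mapped sg entries.          */
static int map_user_buffer(struct device *dev, unsigned long uaddr,
                           int nr_pages, struct page **pages,
                           struct scatterlist *sgl,
                           enum dma_data_direction dir)
{
        int i, pinned;

        down_read(&current->mm->mmap_sem);
        pinned = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                                nr_pages, dir != DMA_TO_DEVICE, 0,
                                pages, NULL);
        up_read(&current->mm->mmap_sem);

        if (pinned != nr_pages) {
                for (i = 0; i < pinned; i++)
                        put_page(pages[i]);
                return -EFAULT;
        }

        sg_init_table(sgl, nr_pages);
        for (i = 0; i < nr_pages; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        return dma_map_sg(dev, sgl, nr_pages, dir);
}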
I will try to implement this feature in the videobuf2-dma-contig allocator together with the next version of my patches for dma-mapping & IOMMU.
With luck DMA_ATTRIB_NO_KERNEL_MAPPING should remove any lingering arguments for trying to map user pages. Given that our ultimate goal here is buffer sharing, user allocated pages have limited value and appeal. If anything, I vote that this be a far lower priority compared to the rest of the win you have here.
I agree. Mapping user-allocated buffers is extremely hard to get right when there are extra constraints. If it doesn't work already for some driver, I wouldn't put too much effort into making it work for more special cases.
Arnd
Hi Marek,
On Mon, Jun 20, 2011 at 09:50:05AM +0200, Marek Szyprowski wrote:
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I initally thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
I appreciate your progress on this generic dma_ops implementation. But for now it looks very ARM specific. Do you have plans to extend it to non-ARM iommu-api implementations too?
Joerg
Hello,
On Friday, June 24, 2011 11:18 AM Joerg Roedel wrote:
On Mon, Jun 20, 2011 at 09:50:05AM +0200, Marek Szyprowski wrote:
This patch series is a continuation of my works on implementing generic IOMMU support in DMA mapping framework for ARM architecture. Now I focused on the DMA mapping framework itself. It turned out that adding support for common dma_map_ops structure was not that hard as I initally thought. After some modification most of the code fits really well to the generic dma_map_ops methods.
I appreciate your progress on this generic dma_ops implementation. But for now it looks very ARM specific. Do you have plans to extend it to non-ARM iommu-api implementations too?
This work is just a first step towards creating an environment for real IOMMU & dma-mapping integration. It is much easier to work on IOMMU integration once the dma-mapping operations can easily be changed for particular devices, and dma_map_ops gives exactly that flexibility. It is also the de facto standard interface on other architectures, so building the IOMMU implementation on top of dma_map_ops was highly desirable.
Of course my patches will be ARM-centric, but I hope to separate the ARM-specific parts from the generic ones, so that the latter can easily be reused on other platforms.
Best regards