Hi,
ADF is an experimental display framework that I designed after experimenting with a KMS-based hardware composer for Android. ADF started as a proof-of-concept implemented from scratch, so right now it's a separate framework rather than a patchstack on top of KMS. If there's community interest, moving forward I'd like to merge its functionality into KMS rather than keep it as a separate thing.
I'm going to talk about ADF at the Android and Graphics session at Linux Plumbers. The documentation's not done, but I wanted to post these patches to give people a heads-up about ADF. If there's interest I can write up some more formal documentation ahead of Plumbers.
I designed ADF to deal with some serious issues I had working with KMS:
1. The API is geared toward updating one object at a time. Android's graphics stack needs the entire screen updated atomically to avoid tearing, and on some SoCs to avoid wedging the display hardware. Rob Clark's atomic modeset patchset worked, but its copy/update/commit design meant the driver had to keep a lot more internal state.
2. Some SoCs don't map well to KMS's CRTC + planes + encoders + connector model. At the time I was dealing with hardware where the blending engines didn't have their own framebuffer (they could only scan out constant colors or mix the output of other blocks), and you needed to gang several mixers together to drive high-res displays.
3. KMS doesn't support custom pixel formats, which a lot of newer SoCs use internally to cut down on bandwidth between hardware blocks.
4. KMS doesn't have a way to exchange sync fences. As a hack I managed to pass sync fences into the kernel as properties of the atomic pageflip, but there was no good way to get them back out of the kernel without a side channel.
ADF represents display devices as collections of overlay engines and interfaces. Overlay engines (struct adf_overlay_engine) scan out images and interfaces (struct adf_interface) display those images. Overlay engines and interfaces can be connected in any n-to-n configuration that the hardware supports.
Clients issue atomic updates to the screen by passing in a list of buffers (struct adf_buffer) consisting of dma-buf handles, sync fences, and basic metadata like format and size. If this involves composing multiple buffers, clients include a block of custom data describing the actual composition (scaling, z-order, blending, etc.) in a driver-specific format.
Drivers provide hooks to validate these custom data blocks and commit the new configuration to hardware. ADF handles importing the dma-bufs and fences, waiting on incoming sync fences before committing, advancing the display's sync timeline, and releasing dma-bufs once they're removed from the screen.
ADF represents pixel formats using DRM-style fourccs, and automatically sanity-checks buffer sizes when using one of the formats listed in drm_fourcc.h. Drivers can support custom fourccs if they provide hooks to validate buffers that use them.
ADF also provides driver hooks for modesetting, managing and reporting hardware events like vsync, and changing DPMS state. These are documented in struct adf_{obj,overlay_engine,interface,device}_ops, and are similar to the equivalent DRM ops.
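To give a rough feel for the driver-facing side, here is an illustrative sketch of the validate/commit flow described above. This is only a sketch under assumptions: struct adf_post, the exact hook signatures, and the example_* names are made up for the example and are not the actual ADF driver API.

/*
 * Illustrative only: adf_post, the hook signatures and the example_*
 * names below are assumptions, not the real ADF interfaces.
 */
static int example_validate(struct adf_device *dev, struct adf_post *cfg,
			    void **driver_state)
{
	const struct example_custom_data *data = cfg->custom_data;

	if (cfg->custom_data_size != sizeof(*data))
		return -EINVAL;

	/* check scaling, z-order and blending against hardware limits */
	return 0;
}

static void example_commit(struct adf_device *dev, struct adf_post *cfg,
			   void *driver_state)
{
	/* atomically latch the new buffer addresses and blending setup */
}

static const struct adf_device_ops example_dev_ops = {
	.validate = example_validate,
	.commit = example_commit,
};

ADF itself takes care of the dma-buf and sync-fence handling around these hooks, as described above.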
Greg Hackmann (3):
video: add atomic display framework
video: adf: add display core helper
video: adf: add memblock helper
Laurent Pinchart (1):
video: Add generic display entity core
drivers/video/Kconfig | 2 +
drivers/video/Makefile | 2 +
drivers/video/adf/Kconfig | 15 +
drivers/video/adf/Makefile | 14 +
drivers/video/adf/adf.c | 1013 ++++++++++++++++++++++++++++++++++
drivers/video/adf/adf.h | 49 ++
drivers/video/adf/adf_client.c | 853 ++++++++++++++++++++++++++++
drivers/video/adf/adf_display.c | 123 +++++
drivers/video/adf/adf_fops.c | 982 ++++++++++++++++++++++++++++++++
drivers/video/adf/adf_fops.h | 37 ++
drivers/video/adf/adf_fops32.c | 217 ++++++++
drivers/video/adf/adf_fops32.h | 78 +++
drivers/video/adf/adf_memblock.c | 150 +++++
drivers/video/adf/adf_sysfs.c | 291 ++++++++++
drivers/video/adf/adf_sysfs.h | 33 ++
drivers/video/adf/adf_trace.h | 93 ++++
drivers/video/display/Kconfig | 4 +
drivers/video/display/Makefile | 1 +
drivers/video/display/display-core.c | 362 ++++++++++++
include/video/adf.h | 743 +++++++++++++++++++++++++
include/video/adf_client.h | 61 ++
include/video/adf_display.h | 31 ++
include/video/adf_format.h | 282 ++++++++++
include/video/adf_memblock.h | 20 +
include/video/display.h | 150 +++++
25 files changed, 5606 insertions(+)
create mode 100644 drivers/video/adf/Kconfig
create mode 100644 drivers/video/adf/Makefile
create mode 100644 drivers/video/adf/adf.c
create mode 100644 drivers/video/adf/adf.h
create mode 100644 drivers/video/adf/adf_client.c
create mode 100644 drivers/video/adf/adf_display.c
create mode 100644 drivers/video/adf/adf_fops.c
create mode 100644 drivers/video/adf/adf_fops.h
create mode 100644 drivers/video/adf/adf_fops32.c
create mode 100644 drivers/video/adf/adf_fops32.h
create mode 100644 drivers/video/adf/adf_memblock.c
create mode 100644 drivers/video/adf/adf_sysfs.c
create mode 100644 drivers/video/adf/adf_sysfs.h
create mode 100644 drivers/video/adf/adf_trace.h
create mode 100644 drivers/video/display/Kconfig
create mode 100644 drivers/video/display/Makefile
create mode 100644 drivers/video/display/display-core.c
create mode 100644 include/video/adf.h
create mode 100644 include/video/adf_client.h
create mode 100644 include/video/adf_display.h
create mode 100644 include/video/adf_format.h
create mode 100644 include/video/adf_memblock.h
create mode 100644 include/video/display.h
--
1.8.0
Hello,
This is the sixth version of my proposal for device tree integration of
reserved memory and the Contiguous Memory Allocator. This time I've fixed
more issues pointed out by Rob Herring and added support for the 'status'
property. For more information, please check the change log at the end
of this mail.
Just a few words for those who see this code for the first time:
The proposed bindings allow defining contiguous memory regions of a
specified base address and size. The defined regions can then be
assigned to the given device(s) by adding a property with a phandle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all memory allocations from the
dma-mapping subsystem will be served from the defined contiguous memory
regions.
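As a rough illustration (the addresses, sizes and node/label names below are made up for the example; the binding document in this series is the authoritative reference), the bindings look something like this:

/memory {
	#address-cells = <1>;
	#size-cells = <1>;
	reg = <0x40000000 0x40000000>;

	/* a contiguous region reserved for multimedia devices */
	multimedia_mem: region@77000000 {
		compatible = "linux,contiguous-memory-region";
		reg = <0x77000000 0x4000000>;
	};
};

/* a device node referencing the region by phandle */
video-codec@12340000 {
	/* ... other properties ... */
	memory-region = <&multimedia_mem>;
};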
The Contiguous Memory Allocator is a framework which makes it possible to
provide large contiguous memory buffers for (usually multimedia) devices.
The contiguous memory is reserved during early boot and then shared with
the kernel, which is allowed to allocate it for movable pages. Then, when
a device driver requests a contiguous buffer, the framework migrates
movable pages out of the contiguous region and gives it to the driver.
When the device driver frees the buffer, it is returned to the kernel
memory pool. For more information, please refer to commit
c64be2bb1c6eb43c838b2c6d57 ("drivers: add Contiguous Memory Allocator")
and d484864dd96e1830e76895 (the CMA merge commit).
Why do we need device tree bindings for CMA at all?
Older ARM kernels used so-called board-file-based initialization. Those
board files contained a definition of all the hardware blocks available
on the target system, together with the particular kernel and driver
software configuration selected by the board maintainer.
In the new approach the board files are removed completely and the
Device Tree is used to describe all the hardware blocks available on the
target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with operating systems other than the Linux kernel.
Reserved memory configuration falls into a grey area. It might depend on
hardware restrictions of the board or modules and on low-level
configuration done by the bootloader. Putting reserved and contiguous
memory regions into the /memory node and having phandles to those regions
in the device nodes nevertheless matches well with the typical
device-tree style of linking devices with other resources like clocks,
interrupts, regulators, power domains, etc. This is the main reason to
use such an approach instead of putting everything into the /chosen node,
as was proposed in v2 and v3. However, during the discussion at Linaro
Connect in Dublin it was decided that reserved memory is a _system_
property. Such system properties are not necessarily hardware properties,
but they are well-known and likely properties of the running system, so
they can be encoded in the device tree. This patch series tries to
provide such an encoding with respect to the common device-tree style of
bindings and documentation.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v6:
- added support for the 'status' property, so memory regions can be disabled
like any other node
- fixed issues pointed out by Rob: removed annotations from function
declarations and replaced macros with static inline functions.
- restored of_scan_flat_dt_by_path() function to simplify reserved memory
scanning function
- the code now uses #size-cells/#address-cells properties from root node
for interpreting 'reg' property in reserved memory regions
- fixed some issues in dt binding documentation
v5: http://thread.gmane.org/gmane.linux.ports.arm.kernel/259278
- renamed "contiguous-memory-region" compatibility string to
"linux,contiguous-memory-region" (this one is really specific to Linux
kernel implementation)
- renamed "dma-memory-region" property to "memory-region" (suggested by
Kumar Gala)
- added support for #address-cells, #size-cells for memory regions
(thanks to Rob Herring for suggestion)
- removed generic code to scan a specific path in the flat device tree (it
cannot be used for fdt one-pass, scan-based initialization of memory regions
with #address-cells and #size-cells parsing)
- replaced dma_contiguous_set_default_area() and dma_contiguous_add_device()
functions with dev_set_cma_area() call
v4: http://thread.gmane.org/gmane.linux.ports.arm.kernel/256491
- corrected Device Tree mailing list address (resend)
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed out by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to /chosen/contiguous-memory/
node to avoid spreading Linux-specific parameters throughout the device tree
definitions
- added support for autoconfigured regions (use zero base)
- fixed minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (4):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add function to scan fdt nodes given by path
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 166 ++++++++++++++++++++++++
arch/arm/include/asm/dma-contiguous.h | 1 -
arch/arm/mm/init.c | 3 +
arch/x86/include/asm/dma-contiguous.h | 1 -
drivers/base/dma-contiguous.c | 119 +++++++-----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 76 +++++++++++
drivers/of/of_reserved_mem.c | 175 ++++++++++++++++++++++++++
drivers/of/platform.c | 5 +
include/asm-generic/dma-contiguous.h | 28 -----
include/linux/dma-contiguous.h | 55 +++++++-
include/linux/of_fdt.h | 3 +
include/linux/of_reserved_mem.h | 14 +++
14 files changed, 546 insertions(+), 107 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
delete mode 100644 include/asm-generic/dma-contiguous.h
create mode 100644 include/linux/of_reserved_mem.h
--
1.7.9.5
A fence can be attached to a buffer which is being filled or consumed
by hw, to allow userspace to pass the buffer to another device without
waiting. For example, userspace can call the page_flip ioctl to display
the next frame of graphics after kicking the GPU, but while the GPU is
still rendering. The display device sharing the buffer with the GPU
would attach a callback to get notified when the GPU's rendering-complete
IRQ fires, to update the scan-out address of the display without having
to wake up userspace.
A driver must allocate a fence context for each execution ring that can
run in parallel. The function for this takes the number of contexts to
allocate as its argument:
+ fence_context_alloc()
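As a minimal sketch (NUM_RINGS, my_driver_init and the context_base bookkeeping are hypothetical driver details, not part of this patch), a driver with several rings could reserve one context per ring at init time and later tag each ring's fences with context_base + ring:

#define NUM_RINGS 2	/* hypothetical: rings that can run in parallel */

static unsigned context_base;

static int my_driver_init(void)
{
	/* one fence context per independent execution ring */
	context_base = fence_context_alloc(NUM_RINGS);
	return 0;
}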
A fence is a transient, one-shot deal. It is allocated and attached
to one or more dma-bufs. When the party that attached it is done with
the pending operation, it can signal the fence:
+ fence_signal()
To get a rough approximation of whether a fence has fired, call:
+ fence_is_signaled()
The dma-buf-mgr handles tracking, and waiting on, the fences associated
with a dma-buf.
A waiter pending on the fence can add an async callback:
+ fence_add_callback()
The callback can optionally be cancelled with:
+ fence_remove_callback()
To wait synchronously, optionally with a timeout:
+ fence_wait()
+ fence_wait_timeout()
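To make the callback path concrete, here is a minimal sketch of a consumer embedding a struct fence_cb in its own state and registering it; my_request, flip_done and queue_flip are hypothetical names, while the fence/fence_cb calls are the ones added by this patch:

struct my_request {
	struct fence_cb cb;
	/* ... driver state for the pending flip ... */
};

static void flip_done(struct fence *fence, struct fence_cb *cb)
{
	struct my_request *req = container_of(cb, struct my_request, cb);

	/* may run in atomic context once the producer calls fence_signal() */
	/* ... latch req's new scan-out address ... */
}

static long queue_flip(struct fence *fence, struct my_request *req)
{
	/* register; -ENOENT means the fence already signaled */
	if (fence_add_callback(fence, &req->cb, flip_done) == -ENOENT)
		flip_done(fence, &req->cb);

	/* or block instead, optionally with a timeout: */
	/* return fence_wait_timeout(fence, true, msecs_to_jiffies(100)); */
	return 0;
}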
A default software-only implementation is provided, which can be used
by drivers attaching a fence to a buffer when they have no other means
for hw sync. But a memory-backed fence is also envisioned, because it
is common that GPUs can write to, or poll on, some memory location for
synchronization. For example:
    fence = custom_get_fence(...);
    if ((seqno_fence = to_seqno_fence(fence)) != NULL) {
        struct dma_buf *fence_buf = seqno_fence->sync_buf;
        get_dma_buf(fence_buf);

        ... tell the hw the memory location to wait on ...
        custom_wait_on(fence_buf, seqno_fence->seqno_ofs, fence->seqno);
    } else {
        /* fall-back to sw sync */
        fence_add_callback(fence, &my_cb, my_cb_func);
    }
On SoC platforms, if some other hw mechanism is provided for synchronizing
between IP blocks, it could be supported as an alternate implementation
with its own fence ops in a similar way.
The enable_signaling callback is used to provide sw signaling in case a
CPU waiter is requested, or no compatible hardware signaling could be used.
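For a fence backed by hardware signaling, a minimal fence_ops sketch could look like the following; the my_hw_* helpers are hypothetical driver functions, while the ops structure and fence_default_wait come from this patch (fence_default_wait is enough as the wait implementation as long as enable_signaling works):

static bool my_fence_enable_signaling(struct fence *fence)
{
	/* called with fence->lock held; arm the completion interrupt */
	return my_hw_enable_completion_irq(fence);	/* hypothetical */
}

static bool my_fence_signaled(struct fence *fence)
{
	/* optional: peek at a hw seqno register or memory location */
	return my_hw_seqno_passed(fence);		/* hypothetical */
}

static const struct fence_ops my_fence_ops = {
	.enable_signaling = my_fence_enable_signaling,
	.signaled = my_fence_signaled,
	.wait = fence_default_wait,
	/* .release left NULL: plain kfree() of the fence is fine */
};

The producer's completion interrupt handler then simply calls fence_signal() on the fence, which runs any registered callbacks and wakes up waiters.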
The intention is to provide a userspace interface (presumably via eventfd)
later, to be used in conjunction with dma-buf's mmap support for sw access
to buffers (or for userspace apps that would prefer to do their own
synchronization).
v1: Original
v2: After discussion w/ danvet and mlankhorst on #dri-devel, we decided
that dma-fence didn't need to care about the sw->hw signaling path
(it can be handled same as sw->sw case), and therefore the fence->ops
can be simplified and more handled in the core. So remove the signal,
add_callback, cancel_callback, and wait ops, and replace with a simple
enable_signaling() op which can be used to inform a fence supporting
hw->hw signaling that one or more devices which do not support hw
signaling are waiting (and therefore it should enable an irq or do
whatever is necessary in order that the CPU is notified when the
fence is passed).
v3: Fix locking fail in attach_fence() and get_fence()
v4: Remove tie-in w/ dma-buf.. after discussion w/ danvet and mlankhorst
we decided that we need to be able to attach one fence to N dma-buf's,
so using the list_head in dma-fence struct would be problematic.
v5: [ Maarten Lankhorst ] Updated for dma-bikeshed-fence and dma-buf-manager.
v6: [ Maarten Lankhorst ] I removed dma_fence_cancel_callback and some comments
about checking if fence fired or not. This is broken by design.
waitqueue_active during destruction is now fatal, since the signaller
should be holding a reference in enable_signalling until it signalled
the fence. Pass the original dma_fence_cb along, and call __remove_wait
in the dma_fence_callback handler, so that no cleanup needs to be
performed.
v7: [ Maarten Lankhorst ] Set cb->func and only enable sw signaling if
fence wasn't signaled yet, for example for hardware fences that may
choose to signal blindly.
v8: [ Maarten Lankhorst ] Tons of tiny fixes, moved __dma_fence_init to
header and fixed include mess. dma-fence.h now includes dma-buf.h
All members are now initialized, so kmalloc can be used for
allocating a dma-fence. More documentation added.
v9: Change compiler bitfields to flags, change return type of
enable_signaling to bool. Rework dma_fence_wait. Added
dma_fence_is_signaled and dma_fence_wait_timeout.
s/dma// and change exports to non GPL. Added fence_is_signaled and
fence_enable_sw_signaling calls, add ability to override default
wait operation.
v10: remove event_queue, use a custom list, export try_to_wake_up from
scheduler. Remove fence lock and use a global spinlock instead,
this should hopefully remove all the locking headaches I was having
on trying to implement this. enable_signaling is called with this
lock held.
v11:
Use atomic ops for flags, lifting the need for some spin_lock_irqsaves.
However I kept the guarantee that after fence_signal returns, it is
guaranteed that enable_signaling has either been called to completion,
or will not be called any more.
Add contexts and seqno to base fence implementation. This allows you
to wait for fewer fences, by testing for seqno + signaled, and then only
wait on the later fence.
Add FENCE_TRACE, FENCE_WARN, and FENCE_ERR. This makes debugging easier.
A CONFIG_DEBUG_FENCE option will be added to turn off the FENCE_TRACE
spam, and another runtime option can turn it off at runtime.
v12:
Add CONFIG_FENCE_TRACE. Add missing documentation for the fence->context
and fence->seqno members.
v13:
Fixup CONFIG_FENCE_TRACE kconfig description.
Move fence_context_alloc to fence.
Simplify fence_later.
Kill priv member to fence_cb.
v14:
Remove priv argument from fence_add_callback, oops!
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
---
Documentation/DocBook/device-drivers.tmpl | 2
drivers/base/Kconfig | 10 +
drivers/base/Makefile | 2
drivers/base/fence.c | 311 ++++++++++++++++++++++++++
include/linux/fence.h | 344 +++++++++++++++++++++++++++++
5 files changed, 668 insertions(+), 1 deletion(-)
create mode 100644 drivers/base/fence.c
create mode 100644 include/linux/fence.h
diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl
index fe397f9..95d0db9 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -126,6 +126,8 @@ X!Edrivers/base/interface.c
</sect1>
<sect1><title>Device Drivers DMA Management</title>
!Edrivers/base/dma-buf.c
+!Edrivers/base/fence.c
+!Iinclude/linux/fence.h
!Edrivers/base/reservation.c
!Iinclude/linux/reservation.h
!Edrivers/base/dma-coherent.c
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 5daa259..2bf0add 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -200,6 +200,16 @@ config DMA_SHARED_BUFFER
APIs extension; the file's descriptor can then be passed on to other
driver.
+config FENCE_TRACE
+ bool "Enable verbose FENCE_TRACE messages"
+ default n
+ depends on DMA_SHARED_BUFFER
+ help
+ Enable the FENCE_TRACE printks. This will add extra
+ spam to the console log, but will make it easier to diagnose
+ lockup related problems for dma-buffers shared across multiple
+ devices.
+
config CMA
bool "Contiguous Memory Allocator"
depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 48029aa..8a55cb9 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_CMA) += dma-contiguous.o
obj-y += power/
obj-$(CONFIG_HAS_DMA) += dma-mapping.o
obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
-obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o reservation.o
+obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o fence.o reservation.o
obj-$(CONFIG_ISA) += isa.o
obj-$(CONFIG_FW_LOADER) += firmware_class.o
obj-$(CONFIG_NUMA) += node.o
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
new file mode 100644
index 0000000..a95df7d
--- /dev/null
+++ b/drivers/base/fence.c
@@ -0,0 +1,311 @@
+/*
+ * Fence mechanism for dma-buf and to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <robdclark(a)gmail.com>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/fence.h>
+
+/**
+ * fence context counter: each execution context should have its own
+ * fence context, this allows checking if fences belong to the same
+ * context or not. One device can have multiple separate contexts,
+ * and they're used if some engine can run independently of another.
+ */
+static atomic_t fence_context_counter = ATOMIC_INIT(0);
+
+/**
+ * fence_context_alloc - allocate an array of fence contexts
+ * @num: [in] amount of contexts to allocate
+ *
+ * This function will return the first index of the number of contexts allocated.
+ * The fence context is used for setting fence->context to a unique number.
+ */
+unsigned fence_context_alloc(unsigned num)
+{
+ BUG_ON(!num);
+ return atomic_add_return(num, &fence_context_counter) - num;
+}
+EXPORT_SYMBOL(fence_context_alloc);
+
+int __fence_signal(struct fence *fence)
+{
+ struct fence_cb *cur, *tmp;
+ int ret = 0;
+
+ if (WARN_ON(!fence))
+ return -EINVAL;
+
+ if (test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ ret = -EINVAL;
+
+ /*
+ * we might have raced with the unlocked fence_signal,
+ * still run through all callbacks
+ */
+ }
+
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__fence_signal);
+
+/**
+ * fence_signal - signal completion of a fence
+ * @fence: the fence to signal
+ *
+ * Signal completion for software callbacks on a fence, this will unblock
+ * fence_wait() calls and run all the callbacks added with
+ * fence_add_callback(). Can be called multiple times, but since a fence
+ * can only go from unsignaled to signaled state, it will only be effective
+ * the first time.
+ */
+int fence_signal(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!fence || test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags)) {
+ struct fence_cb *cur, *tmp;
+
+ spin_lock_irqsave(fence->lock, flags);
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+ return 0;
+}
+EXPORT_SYMBOL(fence_signal);
+
+void release_fence(struct kref *kref)
+{
+ struct fence *fence =
+ container_of(kref, struct fence, refcount);
+
+ BUG_ON(!list_empty(&fence->cb_list));
+
+ if (fence->ops->release)
+ fence->ops->release(fence);
+ else
+ kfree(fence);
+}
+EXPORT_SYMBOL(release_fence);
+
+/**
+ * fence_enable_sw_signaling - enable signaling on fence
+ * @fence: [in] the fence to enable
+ *
+ * this will request sw signaling to be enabled, to make the fence
+ * complete as soon as possible
+ */
+void fence_enable_sw_signaling(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags) &&
+ !test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (!fence->ops->enable_signaling(fence))
+ __fence_signal(fence);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+}
+EXPORT_SYMBOL(fence_enable_sw_signaling);
+
+/**
+ * fence_add_callback - add a callback to be called when the fence
+ * is signaled
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to register
+ * @func: [in] the function to call
+ *
+ * cb will be initialized by fence_add_callback, no initialization
+ * by the caller is required. Any number of callbacks can be registered
+ * to a fence, but a callback can only be registered to one fence at a time.
+ *
+ * Note that the callback can be called from an atomic context. If
+ * fence is already signaled, this function will return -ENOENT (and
+ * *not* call the callback)
+ *
+ * Add a software callback to the fence. Same restrictions apply to
+ * refcount as it does to fence_wait, however the caller doesn't need to
+ * keep a refcount to fence afterwards: when software access is enabled,
+ * the creator of the fence is required to keep the fence alive until
+ * after it signals with fence_signal. The callback itself can be called
+ * from irq context.
+ *
+ */
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func)
+{
+ unsigned long flags;
+ int ret = 0;
+ bool was_set;
+
+ if (WARN_ON(!fence || !func))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -ENOENT;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ ret = -ENOENT;
+ else if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ ret = -ENOENT;
+ }
+
+ if (!ret) {
+ cb->func = func;
+ list_add_tail(&cb->node, &fence->cb_list);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_add_callback);
+
+/**
+ * fence_remove_callback - remove a callback from the signaling list
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to remove
+ *
+ * Remove a previously queued callback from the fence. This function returns
+ * true if the callback is successfully removed, or false if the fence has
+ * already been signaled.
+ *
+ * *WARNING*:
+ * Cancelling a callback should only be done if you really know what you're
+ * doing, since deadlocks and race conditions could occur all too easily. For
+ * this reason, it should only ever be done on hardware lockup recovery,
+ * with a reference held to the fence.
+ */
+bool
+fence_remove_callback(struct fence *fence, struct fence_cb *cb)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ ret = !list_empty(&cb->node);
+ if (ret)
+ list_del_init(&cb->node);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_remove_callback);
+
+struct default_wait_cb {
+ struct fence_cb base;
+ struct task_struct *task;
+};
+
+static void
+fence_default_wait_cb(struct fence *fence, struct fence_cb *cb)
+{
+ struct default_wait_cb *wait =
+ container_of(cb, struct default_wait_cb, base);
+
+ try_to_wake_up(wait->task, TASK_NORMAL, 0);
+}
+
+/**
+ * fence_default_wait - default sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success.
+ */
+long
+fence_default_wait(struct fence *fence, bool intr, signed long timeout)
+{
+ struct default_wait_cb cb;
+ unsigned long flags;
+ long ret = timeout;
+ bool was_set;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return timeout;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (intr && signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ goto out;
+ }
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ goto out;
+
+ if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ goto out;
+ }
+
+ cb.base.func = fence_default_wait_cb;
+ cb.task = current;
+ list_add(&cb.base.node, &fence->cb_list);
+
+ while (!test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags) && ret > 0) {
+ if (intr)
+ __set_current_state(TASK_INTERRUPTIBLE);
+ else
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ ret = schedule_timeout(ret);
+
+ spin_lock_irqsave(fence->lock, flags);
+ if (ret > 0 && intr && signal_pending(current))
+ ret = -ERESTARTSYS;
+ }
+
+ if (!list_empty(&cb.base.node))
+ list_del(&cb.base.node);
+ __set_current_state(TASK_RUNNING);
+
+out:
+ spin_unlock_irqrestore(fence->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(fence_default_wait);
diff --git a/include/linux/fence.h b/include/linux/fence.h
new file mode 100644
index 0000000..3ce2155
--- /dev/null
+++ b/include/linux/fence.h
@@ -0,0 +1,344 @@
+/*
+ * Fence mechanism for dma-buf to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <robdclark(a)gmail.com>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_FENCE_H
+#define __LINUX_FENCE_H
+
+#include <linux/err.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/bitops.h>
+#include <linux/kref.h>
+#include <linux/sched.h>
+#include <linux/printk.h>
+
+struct fence;
+struct fence_ops;
+struct fence_cb;
+
+/**
+ * struct fence - software synchronization primitive
+ * @refcount: refcount for this fence
+ * @ops: fence_ops associated with this fence
+ * @cb_list: list of all callbacks to call
+ * @lock: spin_lock_irqsave used for locking
+ * @context: execution context this fence belongs to, returned by
+ * fence_context_alloc()
+ * @seqno: the sequence number of this fence inside the execution context,
+ * can be compared to decide which fence would be signaled later.
+ * @flags: A mask of FENCE_FLAG_* defined below
+ *
+ * the flags member must be manipulated and read using the appropriate
+ * atomic ops (bit_*), so taking the spinlock will not be needed most
+ * of the time.
+ *
+ * FENCE_FLAG_SIGNALED_BIT - fence is already signaled
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT - enable_signaling might have been called*
+ * FENCE_FLAG_USER_BITS - start of the unused bits, can be used by the
+ * implementer of the fence for its own purposes. Can be used in different
+ * ways by different fence implementers, so do not rely on this.
+ *
+ * *) Since atomic bitops are used, this is not guaranteed to be the case.
+ * Particularly, if the bit was set, but fence_signal was called right
+ * before this bit was set, it would have been able to set the
+ * FENCE_FLAG_SIGNALED_BIT, before enable_signaling was called.
+ * Adding a check for FENCE_FLAG_SIGNALED_BIT after setting
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT closes this race, and makes sure that
+ * after fence_signal was called, any enable_signaling call will have either
+ * been completed, or never called at all.
+ */
+struct fence {
+ struct kref refcount;
+ const struct fence_ops *ops;
+ struct list_head cb_list;
+ spinlock_t *lock;
+ unsigned context, seqno;
+ unsigned long flags;
+};
+
+enum fence_flag_bits {
+ FENCE_FLAG_SIGNALED_BIT,
+ FENCE_FLAG_ENABLE_SIGNAL_BIT,
+ FENCE_FLAG_USER_BITS, /* must always be last member */
+};
+
+typedef void (*fence_func_t)(struct fence *fence, struct fence_cb *cb);
+
+/**
+ * struct fence_cb - callback for fence_add_callback
+ * @node: used by fence_add_callback to append this struct to fence::cb_list
+ * @func: fence_func_t to call
+ *
+ * This struct will be initialized by fence_add_callback, additional
+ * data can be passed along by embedding fence_cb in another struct.
+ */
+struct fence_cb {
+ struct list_head node;
+ fence_func_t func;
+};
+
+/**
+ * struct fence_ops - operations implemented for fence
+ * @enable_signaling: enable software signaling of fence
+ * @signaled: [optional] peek whether the fence is signaled, can be null
+ * @wait: custom wait implementation
+ * @release: [optional] called on destruction of fence, can be null
+ *
+ * Notes on enable_signaling:
+ * For fence implementations that have the capability for hw->hw
+ * signaling, they can implement this op to enable the necessary
+ * irqs, or insert commands into cmdstream, etc. This is called
+ * in the first wait() or add_callback() path to let the fence
+ * implementation know that there is another driver waiting on
+ * the signal (ie. hw->sw case).
+ *
+ * This function can be called from atomic context, but not
+ * from irq context, so normal spinlocks can be used.
+ *
+ * A return value of false indicates the fence already passed,
+ * or some failure occurred that made it impossible to enable
+ * signaling. True indicates successful enabling.
+ *
+ * Calling fence_signal before enable_signaling is called allows
+ * for a tiny race window in which enable_signaling is called during,
+ * before, or after fence_signal. To fight this, it is recommended
+ * that before enable_signaling returns true an extra reference is
+ * taken on the fence, to be released when the fence is signaled.
+ * This will mean fence_signal will still be called twice, but
+ * the second time will be a noop since it was already signaled.
+ *
+ * Notes on wait:
+ * Must not be NULL, set to fence_default_wait for default implementation.
+ * The fence_default_wait implementation should work for any fence, as long
+ * as enable_signaling works correctly.
+ *
+ * Must return -ERESTARTSYS if the wait is intr = true and the wait was
+ * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
+ * timed out. Can also return other error values on custom implementations,
+ * which should be treated as if the fence is signaled. For example a hardware
+ * lockup could be reported like that.
+ *
+ * Notes on release:
+ * Can be NULL, this function allows additional commands to run on
+ * destruction of the fence. Can be called from irq context.
+ * If pointer is set to NULL, kfree will get called instead.
+ */
+
+struct fence_ops {
+ bool (*enable_signaling)(struct fence *fence);
+ bool (*signaled)(struct fence *fence);
+ long (*wait)(struct fence *fence, bool intr, signed long timeout);
+ void (*release)(struct fence *fence);
+};
+
+/**
+ * __fence_init - Initialize a custom fence.
+ * @fence: [in] the fence to initialize
+ * @ops: [in] the fence_ops for operations on this fence
+ * @lock: [in] the irqsafe spinlock to use for locking this fence
+ * @context: [in] the execution context this fence is run on
+ * @seqno: [in] a linear increasing sequence number for this context
+ *
+ * Initializes an allocated fence, the caller doesn't have to keep its
+ * refcount after committing with this fence, but it will need to hold a
+ * refcount again if fence_ops.enable_signaling gets called. This can
+ * be used for implementing other types of fences.
+ *
+ * context and seqno are used for easy comparison between fences, making it
+ * possible to check which fence is later by simply using fence_later.
+ */
+static inline void
+__fence_init(struct fence *fence, const struct fence_ops *ops,
+ spinlock_t *lock, unsigned context, unsigned seqno)
+{
+ BUG_ON(!ops || !lock || !ops->enable_signaling || !ops->wait);
+
+ kref_init(&fence->refcount);
+ fence->ops = ops;
+ INIT_LIST_HEAD(&fence->cb_list);
+ fence->lock = lock;
+ fence->context = context;
+ fence->seqno = seqno;
+ fence->flags = 0UL;
+}
+
+/**
+ * fence_get - increases refcount of the fence
+ * @fence: [in] fence to increase refcount of
+ */
+static inline void fence_get(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_get(&fence->refcount);
+}
+
+extern void release_fence(struct kref *kref);
+
+/**
+ * fence_put - decreases refcount of the fence
+ * @fence: [in] fence to reduce refcount of
+ */
+static inline void fence_put(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_put(&fence->refcount, release_fence);
+}
+
+int fence_signal(struct fence *fence);
+int __fence_signal(struct fence *fence);
+long fence_default_wait(struct fence *fence, bool intr, signed long timeout);
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func);
+bool fence_remove_callback(struct fence *fence, struct fence_cb *cb);
+void fence_enable_sw_signaling(struct fence *fence);
+
+/**
+ * fence_is_signaled - Return an indication if the fence is signaled yet.
+ * @fence: [in] the fence to check
+ *
+ * Returns true if the fence was already signaled, false if not. Since this
+ * function doesn't enable signaling, it is not guaranteed to ever return true
+ * if fence_add_callback, fence_wait or fence_enable_sw_signaling
+ * haven't been called before.
+ *
+ * It's recommended for seqno fences to call fence_signal when the
+ * operation is complete, it makes it possible to prevent issues from
+ * wraparound between time of issue and time of use by checking the return
+ * value of this function before calling hardware-specific wait instructions.
+ */
+static inline bool
+fence_is_signaled(struct fence *fence)
+{
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return true;
+
+ if (fence->ops->signaled && fence->ops->signaled(fence)) {
+ fence_signal(fence);
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * fence_later - return the chronologically later fence
+ * @f1: [in] the first fence from the same context
+ * @f2: [in] the second fence from the same context
+ *
+ * Returns NULL if both fences are signaled, otherwise the fence that would be
+ * signaled last. Both fences must be from the same context, since a seqno is
+ * not re-used across contexts.
+ */
+static inline struct fence *fence_later(struct fence *f1, struct fence *f2)
+{
+ BUG_ON(f1->context != f2->context);
+
+ /*
+ * can't check just FENCE_FLAG_SIGNALED_BIT here, it may never have been
+ * set if enable_signaling wasn't called, and enabling that here is
+ * overkill.
+ */
+ if (f2->seqno - f1->seqno <= INT_MAX)
+ return fence_is_signaled(f2) ? NULL : f2;
+ else
+ return fence_is_signaled(f1) ? NULL : f1;
+}
+
+/**
+ * fence_wait_timeout - sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long
+fence_wait_timeout(struct fence *fence, bool intr, signed long timeout)
+{
+ if (WARN_ON(timeout < 0))
+ return -EINVAL;
+
+ return fence->ops->wait(fence, intr, timeout);
+}
+
+/**
+ * fence_wait - sleep until the fence gets signaled
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ *
+ * This function will return -ERESTARTSYS if interrupted by a signal,
+ * or 0 if the fence was signaled. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long fence_wait(struct fence *fence, bool intr)
+{
+ long ret;
+
+ /* Since fence_wait_timeout cannot timeout with
+ * MAX_SCHEDULE_TIMEOUT, only valid return values are
+ * -ERESTARTSYS and MAX_SCHEDULE_TIMEOUT.
+ */
+ ret = fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT);
+
+ return ret < 0 ? ret : 0;
+}
+
+unsigned fence_context_alloc(unsigned num);
+
+#define FENCE_TRACE(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ if (config_enabled(CONFIG_FENCE_TRACE)) \
+ pr_info("f %u#%u: " fmt, \
+ __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_WARN(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_warn("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_ERR(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_err("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#endif /* __LINUX_FENCE_H */
Hello,
This is the fifth version of my proposal for device tree integration of
reserved memory and the Contiguous Memory Allocator. I've fixed issues
pointed out by Michal Nazarewicz, Rob Herring and Kumar Gala. For more
information, please check the change log at the end of this mail.
In the fourth version of this patch set, after the comments from Grant
Likely, I moved the memory region definitions back to the /memory node
(as they were in the first version of this proposal). I've also extended
the code and made it more generic, adding support for so-called reserved
dma memory (special dma memory regions used to serve dma_alloc_coherent()
allocations exclusively for the given device).
Just a few words for those who see this code for the first time:
The proposed bindings allow defining contiguous memory regions of a
specified base address and size. The defined regions can then be
assigned to the given device(s) by adding a property with a phandle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all memory allocations from the
dma-mapping subsystem will be served from the defined contiguous memory
regions.
The Contiguous Memory Allocator is a framework which makes it possible to
provide large contiguous memory buffers for (usually multimedia) devices.
The contiguous memory is reserved during early boot and then shared with
the kernel, which is allowed to allocate it for movable pages. Then, when
a device driver requests a contiguous buffer, the framework migrates
movable pages out of the contiguous region and gives it to the driver.
When the device driver frees the buffer, it is returned to the kernel
memory pool. For more information, please refer to commit
c64be2bb1c6eb43c838b2c6d57 ("drivers: add Contiguous Memory Allocator")
and d484864dd96e1830e76895 (the CMA merge commit).
Why do we need device tree bindings for CMA at all?
Older ARM kernels used so-called board-file-based initialization. Those
board files contained a definition of all the hardware blocks available
on the target system, together with the particular kernel and driver
software configuration selected by the board maintainer.
In the new approach the board files are removed completely and the
Device Tree is used to describe all the hardware blocks available on the
target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with operating systems other than the Linux kernel.
Reserved memory configuration falls into a grey area. It might depend on
hardware restrictions of the board or modules and on low-level
configuration done by the bootloader. Putting reserved and contiguous
memory regions into the /memory node and having phandles to those regions
in the device nodes nevertheless matches well with the typical
device-tree style of linking devices with other resources like clocks,
interrupts, regulators, power domains, etc. This is the main reason to
use such an approach instead of putting everything into the /chosen node,
as was proposed in v2 and v3.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v5:
- renamed "contiguous-memory-region" compatibility string to
"linux,contiguous-memory-region" (this one is really specific to Linux
kernel implementation)
- renamed "dma-memory-region" property to "memory-region" (suggested by
Kumar Gala)
- added support for #address-cells, #size-cells for memory regions
(thanks to Rob Herring for suggestion)
- removed generic code to scan a specific path in the flat device tree (it
cannot be used for fdt one-pass, scan-based initialization of memory regions
with #address-cells and #size-cells parsing)
- replaced dma_contiguous_set_default_area() and dma_contiguous_add_device()
functions with dev_set_cma_area() call
v4: http://thread.gmane.org/gmane.linux.ports.arm.kernel/256491
- corrected Device Tree mailing list address (resend)
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed out by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to /chosen/contiguous-memory/
node to avoid spreading Linux-specific parameters throughout the device tree
definitions
- added support for autoconfigured regions (use zero base)
- fixed minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (3):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++
arch/arm/include/asm/dma-contiguous.h | 1 -
arch/arm/mm/init.c | 3 +
arch/x86/include/asm/dma-contiguous.h | 1 -
drivers/base/dma-contiguous.c | 119 ++++++----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/of_reserved_mem.c | 197 ++++++++++++++++++++++++++
drivers/of/platform.c | 5 +
include/asm-generic/dma-contiguous.h | 28 ----
include/linux/dma-contiguous.h | 55 ++++++-
include/linux/of_reserved_mem.h | 14 ++
12 files changed, 475 insertions(+), 107 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
delete mode 100644 include/asm-generic/dma-contiguous.h
create mode 100644 include/linux/of_reserved_mem.h
--
1.7.9.5
A fence can be attached to a buffer which is being filled or consumed
by hw, to allow userspace to pass the buffer to another device without
waiting. For example, userspace can call the page_flip ioctl to display
the next frame of graphics after kicking the GPU, but while the GPU is
still rendering. The display device sharing the buffer with the GPU
would attach a callback to get notified when the GPU's rendering-complete
IRQ fires, to update the scan-out address of the display without having
to wake up userspace.
A driver must allocate a fence context for each execution ring that can
run in parallel. The function for this takes the number of contexts to
allocate as its argument:
+ fence_context_alloc()
A fence is a transient, one-shot deal. It is allocated and attached
to one or more dma-bufs. When the party that attached it is done with
the pending operation, it can signal the fence:
+ fence_signal()
To get a rough approximation of whether a fence has fired, call:
+ fence_is_signaled()
The dma-buf-mgr handles tracking, and waiting on, the fences associated
with a dma-buf.
A waiter pending on the fence can add an async callback:
+ fence_add_callback()
The callback can optionally be cancelled with:
+ fence_remove_callback()
To wait synchronously, optionally with a timeout:
+ fence_wait()
+ fence_wait_timeout()
A default software-only implementation is provided, which can be used
by drivers attaching a fence to a buffer when they have no other means
for hw sync. But a memory-backed fence is also envisioned, because it
is common that GPUs can write to, or poll on, some memory location for
synchronization. For example:
    fence = custom_get_fence(...);
    if ((seqno_fence = to_seqno_fence(fence)) != NULL) {
        struct dma_buf *fence_buf = seqno_fence->sync_buf;
        get_dma_buf(fence_buf);

        ... tell the hw the memory location to wait on ...
        custom_wait_on(fence_buf, seqno_fence->seqno_ofs, fence->seqno);
    } else {
        /* fall-back to sw sync */
        fence_add_callback(fence, &my_cb, my_cb_func, NULL);
    }
On SoC platforms, if some other hw mechanism is provided for synchronizing
between IP blocks, it could be supported as an alternate implementation
with its own fence ops in a similar way.
The enable_signaling callback is used to provide sw signaling in case a
CPU waiter is requested, or no compatible hardware signaling could be used.
The intention is to provide a userspace interface (presumably via eventfd)
later, to be used in conjunction with dma-buf's mmap support for sw access
to buffers (or for userspace apps that would prefer to do their own
synchronization).
v1: Original
v2: After discussion w/ danvet and mlankhorst on #dri-devel, we decided
that dma-fence didn't need to care about the sw->hw signaling path
(it can be handled same as sw->sw case), and therefore the fence->ops
can be simplified and more handled in the core. So remove the signal,
add_callback, cancel_callback, and wait ops, and replace with a simple
enable_signaling() op which can be used to inform a fence supporting
hw->hw signaling that one or more devices which do not support hw
signaling are waiting (and therefore it should enable an irq or do
whatever is necessary in order that the CPU is notified when the
fence is passed).
v3: Fix locking fail in attach_fence() and get_fence()
v4: Remove tie-in w/ dma-buf.. after discussion w/ danvet and mlankhorst
we decided that we need to be able to attach one fence to N dma-buf's,
so using the list_head in dma-fence struct would be problematic.
v5: [ Maarten Lankhorst ] Updated for dma-bikeshed-fence and dma-buf-manager.
v6: [ Maarten Lankhorst ] I removed dma_fence_cancel_callback and some comments
about checking if fence fired or not. This is broken by design.
waitqueue_active during destruction is now fatal, since the signaller
should be holding a reference in enable_signalling until it signalled
the fence. Pass the original dma_fence_cb along, and call __remove_wait
in the dma_fence_callback handler, so that no cleanup needs to be
performed.
v7: [ Maarten Lankhorst ] Set cb->func and only enable sw signaling if
fence wasn't signaled yet, for example for hardware fences that may
choose to signal blindly.
v8: [ Maarten Lankhorst ] Tons of tiny fixes, moved __dma_fence_init to
header and fixed include mess. dma-fence.h now includes dma-buf.h
All members are now initialized, so kmalloc can be used for
allocating a dma-fence. More documentation added.
v9: Change compiler bitfields to flags, change return type of
enable_signaling to bool. Rework dma_fence_wait. Added
dma_fence_is_signaled and dma_fence_wait_timeout.
s/dma// and change exports to non GPL. Added fence_is_signaled and
fence_enable_sw_signaling calls, add ability to override default
wait operation.
v10: remove event_queue, use a custom list, export try_to_wake_up from
scheduler. Remove fence lock and use a global spinlock instead,
this should hopefully remove all the locking headaches I was having
on trying to implement this. enable_signaling is called with this
lock held.
v11:
Use atomic ops for flags, lifting the need for some spin_lock_irqsaves.
However I kept the guarantee that after fence_signal returns, it is
guaranteed that enable_signaling has either been called to completion,
or will not be called any more.
Add contexts and seqno to base fence implementation. This allows you
to wait for fewer fences, by testing for seqno + signaled, and then only
wait on the later fence.
Add FENCE_TRACE, FENCE_WARN, and FENCE_ERR. This makes debugging easier.
A CONFIG_DEBUG_FENCE option will be added to turn off the FENCE_TRACE
spam, and another runtime option can turn it off at runtime.
v12:
Add CONFIG_FENCE_TRACE. Add missing documentation for the fence->context
and fence->seqno members.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
---
Documentation/DocBook/device-drivers.tmpl | 2
drivers/base/Kconfig | 10 +
drivers/base/Makefile | 2
drivers/base/fence.c | 286 +++++++++++++++++++++++
include/linux/fence.h | 365 +++++++++++++++++++++++++++++
5 files changed, 664 insertions(+), 1 deletion(-)
create mode 100644 drivers/base/fence.c
create mode 100644 include/linux/fence.h
diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl
index cbfdf54..241f4c5 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -126,6 +126,8 @@ X!Edrivers/base/interface.c
</sect1>
<sect1><title>Device Drivers DMA Management</title>
!Edrivers/base/dma-buf.c
+!Edrivers/base/fence.c
+!Iinclude/linux/fence.h
!Edrivers/base/reservation.c
!Iinclude/linux/reservation.h
!Edrivers/base/dma-coherent.c
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 5daa259..0ad35df 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -200,6 +200,16 @@ config DMA_SHARED_BUFFER
APIs extension; the file's descriptor can then be passed on to other
driver.
+config FENCE_TRACE
+ bool "Enable verbose FENCE_TRACE messages"
+ default n
+ depends on DMA_SHARED_BUFFER
+ help
+ Enable the FENCE_TRACE printks. This will add extra
+	  spam to the console log, but will make it easier to diagnose
+ lockup related problems for dma-buffers shared across multiple
+ devices.
+
config CMA
bool "Contiguous Memory Allocator"
depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 48029aa..8a55cb9 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_CMA) += dma-contiguous.o
obj-y += power/
obj-$(CONFIG_HAS_DMA) += dma-mapping.o
obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
-obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o reservation.o
+obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o fence.o reservation.o
obj-$(CONFIG_ISA) += isa.o
obj-$(CONFIG_FW_LOADER) += firmware_class.o
obj-$(CONFIG_NUMA) += node.o
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
new file mode 100644
index 0000000..28e5ffd
--- /dev/null
+++ b/drivers/base/fence.c
@@ -0,0 +1,286 @@
+/*
+ * Fence mechanism for dma-buf and to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <rob.clark(a)linaro.org>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/fence.h>
+
+atomic_t fence_context_counter = ATOMIC_INIT(0);
+EXPORT_SYMBOL(fence_context_counter);
+
+int __fence_signal(struct fence *fence)
+{
+ struct fence_cb *cur, *tmp;
+ int ret = 0;
+
+ if (WARN_ON(!fence))
+ return -EINVAL;
+
+ if (test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ ret = -EINVAL;
+
+ /*
+ * we might have raced with the unlocked fence_signal,
+ * still run through all callbacks
+ */
+ }
+
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur, cur->priv);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__fence_signal);
+
+/**
+ * fence_signal - signal completion of a fence
+ * @fence: the fence to signal
+ *
+ * Signal completion for software callbacks on a fence, this will unblock
+ * fence_wait() calls and run all the callbacks added with
+ * fence_add_callback(). Can be called multiple times, but since a fence
+ * can only go from unsignaled to signaled state, it will only be effective
+ * the first time.
+ */
+int fence_signal(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!fence || test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags)) {
+ struct fence_cb *cur, *tmp;
+
+ spin_lock_irqsave(fence->lock, flags);
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur, cur->priv);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+ return 0;
+}
+EXPORT_SYMBOL(fence_signal);
+
+void release_fence(struct kref *kref)
+{
+ struct fence *fence =
+ container_of(kref, struct fence, refcount);
+
+ BUG_ON(!list_empty(&fence->cb_list));
+
+ if (fence->ops->release)
+ fence->ops->release(fence);
+ else
+ kfree(fence);
+}
+EXPORT_SYMBOL(release_fence);
+
+/**
+ * fence_enable_sw_signaling - enable signaling on fence
+ * @fence: [in] the fence to enable
+ *
+ * This will request that sw signaling be enabled, to make the fence
+ * complete as soon as possible.
+ */
+void fence_enable_sw_signaling(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags) &&
+ !test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (!fence->ops->enable_signaling(fence))
+ __fence_signal(fence);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+}
+EXPORT_SYMBOL(fence_enable_sw_signaling);
+
+/**
+ * fence_add_callback - add a callback to be called when the fence
+ * is signaled
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to register
+ * @func: [in] the function to call
+ * @priv: [in] the argument to pass to function
+ *
+ * cb will be initialized by fence_add_callback, no initialization
+ * by the caller is required. Any number of callbacks can be registered
+ * to a fence, but a callback can only be registered to one fence at a time.
+ *
+ * Note that the callback can be called from an atomic context. If the
+ * fence is already signaled, this function will return -ENOENT (and
+ * *not* call the callback).
+ *
+ * Add a software callback to the fence. The same refcount restrictions
+ * apply as for fence_wait; however, the caller doesn't need to
+ * keep a refcount to the fence afterwards: when software access is enabled,
+ * the creator of the fence is required to keep the fence alive until
+ * after it signals with fence_signal. The callback itself can be called
+ * from irq context.
+ *
+ */
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func, void *priv)
+{
+ unsigned long flags;
+ int ret = 0;
+ bool was_set;
+
+ if (WARN_ON(!fence || !func))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -ENOENT;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ ret = -ENOENT;
+ else if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ ret = -ENOENT;
+ }
+
+ if (!ret) {
+ cb->func = func;
+ cb->priv = priv;
+ list_add_tail(&cb->node, &fence->cb_list);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_add_callback);
+
+/**
+ * fence_remove_callback - remove a callback from the signaling list
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to remove
+ *
+ * Remove a previously queued callback from the fence. This function returns
+ * true if the callback is successfully removed, or false if the fence has
+ * already been signaled.
+ *
+ * *WARNING*:
+ * Cancelling a callback should only be done if you really know what you're
+ * doing, since deadlocks and race conditions could occur all too easily. For
+ * this reason, it should only ever be done on hardware lockup recovery,
+ * with a reference held to the fence.
+ */
+bool
+fence_remove_callback(struct fence *fence, struct fence_cb *cb)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ ret = !list_empty(&cb->node);
+ if (ret)
+ list_del_init(&cb->node);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_remove_callback);
+
+static void
+fence_default_wait_cb(struct fence *fence, struct fence_cb *cb, void *priv)
+{
+ try_to_wake_up(priv, TASK_NORMAL, 0);
+}
+
+/**
+ * fence_default_wait - default sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success.
+ */
+long
+fence_default_wait(struct fence *fence, bool intr, signed long timeout)
+{
+ struct fence_cb cb;
+ unsigned long flags;
+ long ret = timeout;
+ bool was_set;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return timeout;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (intr && signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ goto out;
+ }
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ goto out;
+
+ if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ goto out;
+ }
+
+ cb.func = fence_default_wait_cb;
+ cb.priv = current;
+ list_add(&cb.node, &fence->cb_list);
+
+ while (!test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags) && ret > 0) {
+ if (intr)
+ __set_current_state(TASK_INTERRUPTIBLE);
+ else
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ ret = schedule_timeout(ret);
+
+ spin_lock_irqsave(fence->lock, flags);
+ if (ret > 0 && intr && signal_pending(current))
+ ret = -ERESTARTSYS;
+ }
+
+ if (!list_empty(&cb.node))
+ list_del(&cb.node);
+ __set_current_state(TASK_RUNNING);
+
+out:
+ spin_unlock_irqrestore(fence->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(fence_default_wait);
diff --git a/include/linux/fence.h b/include/linux/fence.h
new file mode 100644
index 0000000..8c7a126
--- /dev/null
+++ b/include/linux/fence.h
@@ -0,0 +1,365 @@
+/*
+ * Fence mechanism for dma-buf to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <rob.clark(a)linaro.org>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_FENCE_H
+#define __LINUX_FENCE_H
+
+#include <linux/err.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/bitops.h>
+#include <linux/kref.h>
+#include <linux/sched.h>
+#include <linux/printk.h>
+
+struct fence;
+struct fence_ops;
+struct fence_cb;
+
+/**
+ * struct fence - software synchronization primitive
+ * @refcount: refcount for this fence
+ * @ops: fence_ops associated with this fence
+ * @cb_list: list of all callbacks to call
+ * @lock: spin_lock_irqsave used for locking
+ * @context: execution context this fence belongs to, returned by
+ * fence_context_alloc()
+ * @seqno: the sequence number of this fence inside the execution context,
+ * can be compared to decide which fence would be signaled later.
+ * @flags: A mask of FENCE_FLAG_* defined below
+ *
+ * the flags member must be manipulated and read using the appropriate
+ * atomic ops (bit_*), so taking the spinlock will not be needed most
+ * of the time.
+ *
+ * FENCE_FLAG_SIGNALED_BIT - fence is already signaled
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT - enable_signaling might have been called*
+ * FENCE_FLAG_USER_BITS - start of the unused bits, can be used by the
+ * implementer of the fence for its own purposes. Can be used in different
+ * ways by different fence implementers, so do not rely on this.
+ *
+ * *) Since atomic bitops are used, this is not guaranteed to be the case.
+ * Particularly, if the bit was set, but fence_signal was called right
+ * before this bit was set, it would have been able to set the
+ * FENCE_FLAG_SIGNALED_BIT, before enable_signaling was called.
+ * Adding a check for FENCE_FLAG_SIGNALED_BIT after setting
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT closes this race, and makes sure that
+ * after fence_signal was called, any enable_signaling call will have either
+ * been completed, or never called at all.
+ */
+struct fence {
+ struct kref refcount;
+ const struct fence_ops *ops;
+ struct list_head cb_list;
+ spinlock_t *lock;
+ unsigned context, seqno;
+ unsigned long flags;
+};
+
+enum fence_flag_bits {
+ FENCE_FLAG_SIGNALED_BIT,
+ FENCE_FLAG_ENABLE_SIGNAL_BIT,
+ FENCE_FLAG_USER_BITS, /* must always be last member */
+};
+
+typedef void (*fence_func_t)(struct fence *fence, struct fence_cb *cb, void *priv);
+
+/**
+ * struct fence_cb - callback for fence_add_callback
+ * @node: used by fence_add_callback to append this struct to fence::cb_list
+ * @func: fence_func_t to call
+ * @priv: value of priv to pass to function
+ *
+ * This struct will be initialized by fence_add_callback, additional
+ * data can be passed along by embedding fence_cb in another struct.
+ */
+struct fence_cb {
+ struct list_head node;
+ fence_func_t func;
+ void *priv;
+};
+
+/**
+ * struct fence_ops - operations implemented for fence
+ * @enable_signaling: enable software signaling of fence
+ * @signaled: [optional] peek whether the fence is signaled, can be null
+ * @wait: custom wait implementation
+ * @release: [optional] called on destruction of fence, can be null
+ *
+ * Notes on enable_signaling:
+ * For fence implementations that have the capability for hw->hw
+ * signaling, they can implement this op to enable the necessary
+ * irqs, or insert commands into cmdstream, etc. This is called
+ * in the first wait() or add_callback() path to let the fence
+ * implementation know that there is another driver waiting on
+ * the signal (ie. hw->sw case).
+ *
+ * This function can be called from atomic context, but not
+ * from irq context, so normal spinlocks can be used.
+ *
+ * A return value of false indicates the fence already passed,
+ * or some failure occurred that made it impossible to enable
+ * signaling. True indicates successful enabling.
+ *
+ * Calling fence_signal before enable_signaling is called allows
+ * for a tiny race window in which enable_signaling is called during,
+ * before, or after fence_signal. To fight this, it is recommended
+ * that before enable_signaling returns true an extra reference is
+ * taken on the fence, to be released when the fence is signaled.
+ * This will mean fence_signal will still be called twice, but
+ * the second time will be a noop since it was already signaled.
+ *
+ * Notes on wait:
+ * Must not be NULL, set to fence_default_wait for the default implementation.
+ * The fence_default_wait implementation should work for any fence, as long
+ * as enable_signaling works correctly.
+ *
+ * Must return -ERESTARTSYS if the wait was interrupted with intr = true,
+ * the remaining jiffies if the fence has signaled, or 0 if the wait
+ * timed out. Can also return other error values on custom implementations,
+ * which should be treated as if the fence is signaled. For example a hardware
+ * lockup could be reported like that.
+ *
+ * Notes on release:
+ * Can be NULL, this function allows additional commands to run on
+ * destruction of the fence. Can be called from irq context.
+ * If pointer is set to NULL, kfree will get called instead.
+ */
+
+struct fence_ops {
+ bool (*enable_signaling)(struct fence *fence);
+ bool (*signaled)(struct fence *fence);
+ long (*wait)(struct fence *fence, bool intr, signed long timeout);
+ void (*release)(struct fence *fence);
+};
+
+/**
+ * __fence_init - Initialize a custom fence.
+ * @fence: [in] the fence to initialize
+ * @ops: [in] the fence_ops for operations on this fence
+ * @lock: [in] the irqsafe spinlock to use for locking this fence
+ * @context: [in] the execution context this fence is run on
+ * @seqno: [in] a linear increasing sequence number for this context
+ *
+ * Initializes an allocated fence. The caller doesn't have to keep its
+ * refcount after committing with this fence, but it will need to hold a
+ * refcount again if fence_ops.enable_signaling gets called. This can
+ * be used as a base for implementing other types of fences.
+ *
+ * context and seqno are used for easy comparison between fences, allowing
+ * one to check which fence is later by simply using fence_later.
+ */
+static inline void
+__fence_init(struct fence *fence, const struct fence_ops *ops,
+ spinlock_t *lock, unsigned context, unsigned seqno)
+{
+ BUG_ON(!ops || !lock || !ops->enable_signaling || !ops->wait);
+
+ kref_init(&fence->refcount);
+ fence->ops = ops;
+ INIT_LIST_HEAD(&fence->cb_list);
+ fence->lock = lock;
+ fence->context = context;
+ fence->seqno = seqno;
+ fence->flags = 0UL;
+}
+
+/**
+ * fence_get - increases refcount of the fence
+ * @fence: [in] fence to increase refcount of
+ */
+static inline void fence_get(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_get(&fence->refcount);
+}
+
+extern void release_fence(struct kref *kref);
+
+/**
+ * fence_put - decreases refcount of the fence
+ * @fence: [in] fence to reduce refcount of
+ */
+static inline void fence_put(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_put(&fence->refcount, release_fence);
+}
+
+int fence_signal(struct fence *fence);
+int __fence_signal(struct fence *fence);
+long fence_default_wait(struct fence *fence, bool intr, signed long timeout);
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func, void *priv);
+bool fence_remove_callback(struct fence *fence, struct fence_cb *cb);
+void fence_enable_sw_signaling(struct fence *fence);
+
+/**
+ * fence_is_signaled - Return an indication if the fence is signaled yet.
+ * @fence: [in] the fence to check
+ *
+ * Returns true if the fence was already signaled, false if not. Since this
+ * function doesn't enable signaling, it is not guaranteed to ever return true
+ * if fence_add_callback, fence_wait or fence_enable_sw_signaling
+ * haven't been called before.
+ *
+ * It's recommended for seqno fences to call fence_signal when the
+ * operation is complete, it makes it possible to prevent issues from
+ * wraparound between time of issue and time of use by checking the return
+ * value of this function before calling hardware-specific wait instructions.
+ */
+static inline bool
+fence_is_signaled(struct fence *fence)
+{
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return true;
+
+ if (fence->ops->signaled && fence->ops->signaled(fence)) {
+ fence_signal(fence);
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * fence_later - return the chronologically later fence
+ * @f1: [in] the first fence from the same context
+ * @f2: [in] the second fence from the same context
+ *
+ * Returns NULL if both fences are signaled, otherwise the fence that would be
+ * signaled last. Both fences must be from the same context, since a seqno is
+ * not re-used across contexts.
+ */
+static inline struct fence *fence_later(struct fence *f1, struct fence *f2)
+{
+ bool sig1, sig2;
+
+ /*
+ * can't check just FENCE_FLAG_SIGNALED_BIT here, it may never have been
+ * set if enable_signaling wasn't called, and enabling that here is
+ * overkill.
+ */
+ sig1 = fence_is_signaled(f1);
+ sig2 = fence_is_signaled(f2);
+
+ if (sig1 && sig2)
+ return NULL;
+
+ BUG_ON(f1->context != f2->context);
+
+ if (sig1 || f2->seqno - f1->seqno <= INT_MAX)
+ return f2;
+ else
+ return f1;
+}
+
+/**
+ * fence_wait_timeout - sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long
+fence_wait_timeout(struct fence *fence, bool intr, signed long timeout)
+{
+ if (WARN_ON(timeout < 0))
+ return -EINVAL;
+
+ return fence->ops->wait(fence, intr, timeout);
+}
+
+/**
+ * fence_wait - sleep until the fence gets signaled
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ *
+ * This function will return -ERESTARTSYS if interrupted by a signal,
+ * or 0 if the fence was signaled. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long fence_wait(struct fence *fence, bool intr)
+{
+ long ret;
+
+ /* Since fence_wait_timeout cannot timeout with
+ * MAX_SCHEDULE_TIMEOUT, only valid return values are
+ * -ERESTARTSYS and MAX_SCHEDULE_TIMEOUT.
+ */
+ ret = fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT);
+
+ return ret < 0 ? ret : 0;
+}
+
+/**
+ * fence context counter: each execution context should have its own
+ * fence context, this allows checking if fences belong to the same
+ * context or not. One device can have multiple separate contexts,
+ * and they're used if some engine can run independently of another.
+ */
+extern atomic_t fence_context_counter;
+
+static inline unsigned fence_context_alloc(unsigned num)
+{
+ BUG_ON(!num);
+ return atomic_add_return(num, &fence_context_counter) - num;
+}
+
+#define FENCE_TRACE(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ if (config_enabled(CONFIG_FENCE_TRACE)) \
+ pr_info("f %u#%u: " fmt, \
+ __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_WARN(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_warn("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_ERR(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_err("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#endif /* __LINUX_FENCE_H */
A fence can be attached to a buffer which is being filled or consumed
by hw, to allow userspace to pass the buffer to another device without
waiting. For example, userspace can call the page_flip ioctl to display the
next frame of graphics after kicking the GPU but while the GPU is still
rendering. The display device sharing the buffer with the GPU would
attach a callback to get notified when the GPU's rendering-complete IRQ
fires, to update the scan-out address of the display, without having to
wake up userspace.
A driver must allocate a fence context for each execution ring that can
run in parallel. The function for this takes the number of contexts to
allocate as its argument:
+ fence_context_alloc()
A fence is a transient, one-shot deal. It is allocated and attached
to one or more dma-bufs. When the one that attached it is done with
the pending operation, it can signal the fence:
+ fence_signal()
To get a rough indication of whether a fence has fired, call:
+ fence_is_signaled()
The dma-buf-mgr handles tracking, and waiting on, the fences associated
with a dma-buf.
The one pending on the fence can add an async callback:
+ fence_add_callback()
The callback can optionally be cancelled with:
+ fence_remove_callback()
To wait synchronously, optionally with a timeout:
+ fence_wait()
+ fence_wait_timeout()
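Putting those calls together, here is a minimal sketch of the intended
producer/consumer flow. Everything prefixed my_ (the ring structure, the
irq-completion hook, the seqno bookkeeping) is hypothetical and only
illustrates the call sequence; the fence_* calls are the ones introduced
by this patch:

    #include <linux/fence.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_fence_lock);

    /* called (with my_fence_lock held) on the first sw wait/callback */
    static bool my_fence_enable_signaling(struct fence *fence)
    {
        /* e.g. enable the completion irq; return false if hw already finished */
        return true;
    }

    static const struct fence_ops my_fence_ops = {
        .enable_signaling = my_fence_enable_signaling,
        .wait = fence_default_wait,
    };

    struct my_ring {
        unsigned context;      /* one context per ring that runs in parallel */
        unsigned seqno;
        struct fence *last_fence;
    };

    static void my_ring_init(struct my_ring *ring)
    {
        ring->context = fence_context_alloc(1);
        ring->seqno = 0;
    }

    /* producer: allocate a one-shot fence for the work just queued */
    static struct fence *my_ring_queue_work(struct my_ring *ring)
    {
        struct fence *fence = kzalloc(sizeof(*fence), GFP_KERNEL);

        if (!fence)
            return NULL;
        __fence_init(fence, &my_fence_ops, &my_fence_lock,
                     ring->context, ++ring->seqno);
        ring->last_fence = fence;
        /* ... kick the hw ... */
        return fence;
    }

    /* producer completion path: unblock waiters and run callbacks */
    static void my_ring_complete(struct my_ring *ring)
    {
        fence_signal(ring->last_fence);
    }

    /* consumer: block (interruptibly) until the producer signals */
    static long my_consumer_sync(struct fence *fence)
    {
        return fence_wait(fence, true);
    }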
A default software-only implementation is provided, which can be used
by drivers attaching a fence to a buffer when they have no other means
for hw sync. But a memory-backed fence is also envisioned, because it
is common that GPUs can write to, or poll on, some memory location for
synchronization. For example:
fence = custom_get_fence(...);
if ((seqno_fence = to_seqno_fence(fence)) != NULL) {
    struct dma_buf *fence_buf = seqno_fence->sync_buf;
    get_dma_buf(fence_buf);
    ... tell the hw the memory location to wait on ...
    custom_wait_on(fence_buf, seqno_fence->seqno_ofs, fence->seqno);
} else {
    /* fall back to sw sync */
    fence_add_callback(fence, my_cb);
}
On SoC platforms, if some other hw mechanism is provided for synchronizing
between IP blocks, it could be supported as an alternate implementation
with its own fence ops in a similar way.
The enable_signaling callback is used to provide sw signaling in case a cpu
waiter is requested or no compatible hardware signaling could be used.
The intention is to provide a userspace interface (presumably via eventfd)
later, to be used in conjunction with dma-buf's mmap support for sw access
to buffers (or for userspace apps that would prefer to do their own
synchronization).
v1: Original
v2: After discussion w/ danvet and mlankhorst on #dri-devel, we decided
that dma-fence didn't need to care about the sw->hw signaling path
(it can be handled same as sw->sw case), and therefore the fence->ops
can be simplified and more handled in the core. So remove the signal,
add_callback, cancel_callback, and wait ops, and replace with a simple
enable_signaling() op which can be used to inform a fence supporting
hw->hw signaling that one or more devices which do not support hw
signaling are waiting (and therefore it should enable an irq or do
whatever is necessary in order that the CPU is notified when the
fence is passed).
v3: Fix locking fail in attach_fence() and get_fence()
v4: Remove tie-in w/ dma-buf.. after discussion w/ danvet and mlankhorst
we decided that we need to be able to attach one fence to N dma-buf's,
so using the list_head in dma-fence struct would be problematic.
v5: [ Maarten Lankhorst ] Updated for dma-bikeshed-fence and dma-buf-manager.
v6: [ Maarten Lankhorst ] I removed dma_fence_cancel_callback and some comments
about checking if fence fired or not. This is broken by design.
waitqueue_active during destruction is now fatal, since the signaller
should be holding a reference in enable_signalling until it signalled
the fence. Pass the original dma_fence_cb along, and call __remove_wait
in the dma_fence_callback handler, so that no cleanup needs to be
performed.
v7: [ Maarten Lankhorst ] Set cb->func and only enable sw signaling if
fence wasn't signaled yet, for example for hardware fences that may
choose to signal blindly.
v8: [ Maarten Lankhorst ] Tons of tiny fixes, moved __dma_fence_init to
header and fixed include mess. dma-fence.h now includes dma-buf.h.
All members are now initialized, so kmalloc can be used for
allocating a dma-fence. More documentation added.
v9: Change compiler bitfields to flags, change return type of
enable_signaling to bool. Rework dma_fence_wait. Added
dma_fence_is_signaled and dma_fence_wait_timeout.
s/dma// and change exports to non GPL. Added fence_is_signaled and
fence_enable_sw_signaling calls, add ability to override default
wait operation.
v10: remove event_queue, use a custom list, export try_to_wake_up from
scheduler. Remove fence lock and use a global spinlock instead,
this should hopefully remove all the locking headaches I was having
on trying to implement this. enable_signaling is called with this
lock held.
v11:
Use atomic ops for flags, lifting the need for some spin_lock_irqsaves.
However I kept the guarantee that after fence_signal returns, it is
guaranteed that enable_signaling has either been called to completion,
or will not be called any more.
Add contexts and seqno to base fence implementation. This allows you
to wait for less fences, by testing for seqno + signaled, and then only
wait on the later fence.
Add FENCE_TRACE, FENCE_WARN, and FENCE_ERR. This makes debugging easier.
A CONFIG_DEBUG_FENCE option will be added to turn off the FENCE_TRACE
spam, and another runtime option can turn it off at runtime.
v12:
Add CONFIG_FENCE_TRACE. Add missing documentation for the fence->context
and fence->seqno members.
v13:
Fixup CONFIG_FENCE_TRACE kconfig description.
Move fence_context_alloc to fence.
Simplify fence_later.
Kill priv member to fence_cb.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
---
Documentation/DocBook/device-drivers.tmpl | 2
drivers/base/Kconfig | 10 +
drivers/base/Makefile | 2
drivers/base/fence.c | 312 ++++++++++++++++++++++++++
include/linux/fence.h | 344 +++++++++++++++++++++++++++++
5 files changed, 669 insertions(+), 1 deletion(-)
create mode 100644 drivers/base/fence.c
create mode 100644 include/linux/fence.h
diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl
index fe397f9..95d0db9 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -126,6 +126,8 @@ X!Edrivers/base/interface.c
</sect1>
<sect1><title>Device Drivers DMA Management</title>
!Edrivers/base/dma-buf.c
+!Edrivers/base/fence.c
+!Iinclude/linux/fence.h
!Edrivers/base/reservation.c
!Iinclude/linux/reservation.h
!Edrivers/base/dma-coherent.c
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 5daa259..2bf0add 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -200,6 +200,16 @@ config DMA_SHARED_BUFFER
APIs extension; the file's descriptor can then be passed on to other
driver.
+config FENCE_TRACE
+ bool "Enable verbose FENCE_TRACE messages"
+ default n
+ depends on DMA_SHARED_BUFFER
+ help
+ Enable the FENCE_TRACE printks. This will add extra
+ spam to the console log, but will make it easier to diagnose
+ lockup related problems for dma-buffers shared across multiple
+ devices.
+
config CMA
bool "Contiguous Memory Allocator"
depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 48029aa..8a55cb9 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_CMA) += dma-contiguous.o
obj-y += power/
obj-$(CONFIG_HAS_DMA) += dma-mapping.o
obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
-obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o reservation.o
+obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o fence.o reservation.o
obj-$(CONFIG_ISA) += isa.o
obj-$(CONFIG_FW_LOADER) += firmware_class.o
obj-$(CONFIG_NUMA) += node.o
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
new file mode 100644
index 0000000..a0b1357
--- /dev/null
+++ b/drivers/base/fence.c
@@ -0,0 +1,312 @@
+/*
+ * Fence mechanism for dma-buf and to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <robdclark(a)gmail.com>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/fence.h>
+
+/**
+ * fence context counter: each execution context should have its own
+ * fence context, this allows checking if fences belong to the same
+ * context or not. One device can have multiple separate contexts,
+ * and they're used if some engine can run independently of another.
+ */
+static atomic_t fence_context_counter = ATOMIC_INIT(0);
+
+/**
+ * fence_context_alloc - allocate an array of fence contexts
+ * @num: [in] amount of contexts to allocate
+ *
+ * This function will return the first index of the number of fences allocated.
+ * The fence context is used for setting fence->context to a unique number.
+ */
+unsigned fence_context_alloc(unsigned num)
+{
+ BUG_ON(!num);
+ return atomic_add_return(num, &fence_context_counter) - num;
+}
+EXPORT_SYMBOL(fence_context_alloc);
+
+int __fence_signal(struct fence *fence)
+{
+ struct fence_cb *cur, *tmp;
+ int ret = 0;
+
+ if (WARN_ON(!fence))
+ return -EINVAL;
+
+ if (test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ ret = -EINVAL;
+
+ /*
+ * we might have raced with the unlocked fence_signal,
+ * still run through all callbacks
+ */
+ }
+
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__fence_signal);
+
+/**
+ * fence_signal - signal completion of a fence
+ * @fence: the fence to signal
+ *
+ * Signal completion for software callbacks on a fence, this will unblock
+ * fence_wait() calls and run all the callbacks added with
+ * fence_add_callback(). Can be called multiple times, but since a fence
+ * can only go from unsignaled to signaled state, it will only be effective
+ * the first time.
+ */
+int fence_signal(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!fence || test_and_set_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags)) {
+ struct fence_cb *cur, *tmp;
+
+ spin_lock_irqsave(fence->lock, flags);
+ list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
+ list_del_init(&cur->node);
+ cur->func(fence, cur);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+ return 0;
+}
+EXPORT_SYMBOL(fence_signal);
+
+void release_fence(struct kref *kref)
+{
+ struct fence *fence =
+ container_of(kref, struct fence, refcount);
+
+ BUG_ON(!list_empty(&fence->cb_list));
+
+ if (fence->ops->release)
+ fence->ops->release(fence);
+ else
+ kfree(fence);
+}
+EXPORT_SYMBOL(release_fence);
+
+/**
+ * fence_enable_sw_signaling - enable signaling on fence
+ * @fence: [in] the fence to enable
+ *
+ * This will request that sw signaling be enabled, to make the fence
+ * complete as soon as possible.
+ */
+void fence_enable_sw_signaling(struct fence *fence)
+{
+ unsigned long flags;
+
+ if (!test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags) &&
+ !test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (!fence->ops->enable_signaling(fence))
+ __fence_signal(fence);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+ }
+}
+EXPORT_SYMBOL(fence_enable_sw_signaling);
+
+/**
+ * fence_add_callback - add a callback to be called when the fence
+ * is signaled
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to register
+ * @func: [in] the function to call
+ * @priv: [in] the argument to pass to function
+ *
+ * cb will be initialized by fence_add_callback, no initialization
+ * by the caller is required. Any number of callbacks can be registered
+ * to a fence, but a callback can only be registered to one fence at a time.
+ *
+ * Note that the callback can be called from an atomic context. If the
+ * fence is already signaled, this function will return -ENOENT (and
+ * *not* call the callback).
+ *
+ * Add a software callback to the fence. The same refcount restrictions
+ * apply as for fence_wait; however, the caller doesn't need to
+ * keep a refcount to the fence afterwards: when software access is enabled,
+ * the creator of the fence is required to keep the fence alive until
+ * after it signals with fence_signal. The callback itself can be called
+ * from irq context.
+ *
+ */
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func, void *priv)
+{
+ unsigned long flags;
+ int ret = 0;
+ bool was_set;
+
+ if (WARN_ON(!fence || !func))
+ return -EINVAL;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return -ENOENT;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ ret = -ENOENT;
+ else if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ ret = -ENOENT;
+ }
+
+ if (!ret) {
+ cb->func = func;
+ list_add_tail(&cb->node, &fence->cb_list);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_add_callback);
+
+/**
+ * fence_remove_callback - remove a callback from the signaling list
+ * @fence: [in] the fence to wait on
+ * @cb: [in] the callback to remove
+ *
+ * Remove a previously queued callback from the fence. This function returns
+ * true if the callback is successfully removed, or false if the fence has
+ * already been signaled.
+ *
+ * *WARNING*:
+ * Cancelling a callback should only be done if you really know what you're
+ * doing, since deadlocks and race conditions could occur all too easily. For
+ * this reason, it should only ever be done on hardware lockup recovery,
+ * with a reference held to the fence.
+ */
+bool
+fence_remove_callback(struct fence *fence, struct fence_cb *cb)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ ret = !list_empty(&cb->node);
+ if (ret)
+ list_del_init(&cb->node);
+
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(fence_remove_callback);
+
+struct default_wait_cb {
+ struct fence_cb base;
+ struct task_struct *task;
+};
+
+static void
+fence_default_wait_cb(struct fence *fence, struct fence_cb *cb)
+{
+ struct default_wait_cb *wait =
+ container_of(cb, struct default_wait_cb, base);
+
+ try_to_wake_up(wait->task, TASK_NORMAL, 0);
+}
+
+/**
+ * fence_default_wait - default sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success.
+ */
+long
+fence_default_wait(struct fence *fence, bool intr, signed long timeout)
+{
+ struct default_wait_cb cb;
+ unsigned long flags;
+ long ret = timeout;
+ bool was_set;
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return timeout;
+
+ spin_lock_irqsave(fence->lock, flags);
+
+ if (intr && signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ goto out;
+ }
+
+ was_set = test_and_set_bit(FENCE_FLAG_ENABLE_SIGNAL_BIT, &fence->flags);
+
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ goto out;
+
+ if (!was_set && !fence->ops->enable_signaling(fence)) {
+ __fence_signal(fence);
+ goto out;
+ }
+
+ cb.base.func = fence_default_wait_cb;
+ cb.task = current;
+ list_add(&cb.base.node, &fence->cb_list);
+
+ while (!test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags) && ret > 0) {
+ if (intr)
+ __set_current_state(TASK_INTERRUPTIBLE);
+ else
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ ret = schedule_timeout(ret);
+
+ spin_lock_irqsave(fence->lock, flags);
+ if (ret > 0 && intr && signal_pending(current))
+ ret = -ERESTARTSYS;
+ }
+
+ if (!list_empty(&cb.base.node))
+ list_del(&cb.base.node);
+ __set_current_state(TASK_RUNNING);
+
+out:
+ spin_unlock_irqrestore(fence->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(fence_default_wait);
diff --git a/include/linux/fence.h b/include/linux/fence.h
new file mode 100644
index 0000000..0319c59
--- /dev/null
+++ b/include/linux/fence.h
@@ -0,0 +1,344 @@
+/*
+ * Fence mechanism for dma-buf to allow for asynchronous dma access
+ *
+ * Copyright (C) 2012 Canonical Ltd
+ * Copyright (C) 2012 Texas Instruments
+ *
+ * Authors:
+ * Rob Clark <robdclark(a)gmail.com>
+ * Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_FENCE_H
+#define __LINUX_FENCE_H
+
+#include <linux/err.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/bitops.h>
+#include <linux/kref.h>
+#include <linux/sched.h>
+#include <linux/printk.h>
+
+struct fence;
+struct fence_ops;
+struct fence_cb;
+
+/**
+ * struct fence - software synchronization primitive
+ * @refcount: refcount for this fence
+ * @ops: fence_ops associated with this fence
+ * @cb_list: list of all callbacks to call
+ * @lock: spin_lock_irqsave used for locking
+ * @context: execution context this fence belongs to, returned by
+ * fence_context_alloc()
+ * @seqno: the sequence number of this fence inside the execution context,
+ * can be compared to decide which fence would be signaled later.
+ * @flags: A mask of FENCE_FLAG_* defined below
+ *
+ * the flags member must be manipulated and read using the appropriate
+ * atomic ops (bit_*), so taking the spinlock will not be needed most
+ * of the time.
+ *
+ * FENCE_FLAG_SIGNALED_BIT - fence is already signaled
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT - enable_signaling might have been called*
+ * FENCE_FLAG_USER_BITS - start of the unused bits, can be used by the
+ * implementer of the fence for its own purposes. Can be used in different
+ * ways by different fence implementers, so do not rely on this.
+ *
+ * *) Since atomic bitops are used, this is not guaranteed to be the case.
+ * Particularly, if the bit was set, but fence_signal was called right
+ * before this bit was set, it would have been able to set the
+ * FENCE_FLAG_SIGNALED_BIT, before enable_signaling was called.
+ * Adding a check for FENCE_FLAG_SIGNALED_BIT after setting
+ * FENCE_FLAG_ENABLE_SIGNAL_BIT closes this race, and makes sure that
+ * after fence_signal was called, any enable_signaling call will have either
+ * been completed, or never called at all.
+ */
+struct fence {
+ struct kref refcount;
+ const struct fence_ops *ops;
+ struct list_head cb_list;
+ spinlock_t *lock;
+ unsigned context, seqno;
+ unsigned long flags;
+};
+
+enum fence_flag_bits {
+ FENCE_FLAG_SIGNALED_BIT,
+ FENCE_FLAG_ENABLE_SIGNAL_BIT,
+ FENCE_FLAG_USER_BITS, /* must always be last member */
+};
+
+typedef void (*fence_func_t)(struct fence *fence, struct fence_cb *cb);
+
+/**
+ * struct fence_cb - callback for fence_add_callback
+ * @node: used by fence_add_callback to append this struct to fence::cb_list
+ * @func: fence_func_t to call
+ * @priv: value of priv to pass to function
+ *
+ * This struct will be initialized by fence_add_callback, additional
+ * data can be passed along by embedding fence_cb in another struct.
+ */
+struct fence_cb {
+ struct list_head node;
+ fence_func_t func;
+};
+
+/**
+ * struct fence_ops - operations implemented for fence
+ * @enable_signaling: enable software signaling of fence
+ * @signaled: [optional] peek whether the fence is signaled, can be null
+ * @wait: custom wait implementation
+ * @release: [optional] called on destruction of fence, can be null
+ *
+ * Notes on enable_signaling:
+ * For fence implementations that have the capability for hw->hw
+ * signaling, they can implement this op to enable the necessary
+ * irqs, or insert commands into cmdstream, etc. This is called
+ * in the first wait() or add_callback() path to let the fence
+ * implementation know that there is another driver waiting on
+ * the signal (ie. hw->sw case).
+ *
+ * This function can be called from atomic context, but not
+ * from irq context, so normal spinlocks can be used.
+ *
+ * A return value of false indicates the fence already passed,
+ * or some failure occurred that made it impossible to enable
+ * signaling. True indicates successful enabling.
+ *
+ * Calling fence_signal before enable_signaling is called allows
+ * for a tiny race window in which enable_signaling is called during,
+ * before, or after fence_signal. To fight this, it is recommended
+ * that before enable_signaling returns true an extra reference is
+ * taken on the fence, to be released when the fence is signaled.
+ * This will mean fence_signal will still be called twice, but
+ * the second time will be a noop since it was already signaled.
+ *
+ * Notes on wait:
+ * Must not be NULL, set to fence_default_wait for the default implementation.
+ * The fence_default_wait implementation should work for any fence, as long
+ * as enable_signaling works correctly.
+ *
+ * Must return -ERESTARTSYS if the wait was interrupted with intr = true,
+ * the remaining jiffies if the fence has signaled, or 0 if the wait
+ * timed out. Can also return other error values on custom implementations,
+ * which should be treated as if the fence is signaled. For example a hardware
+ * lockup could be reported like that.
+ *
+ * Notes on release:
+ * Can be NULL, this function allows additional commands to run on
+ * destruction of the fence. Can be called from irq context.
+ * If pointer is set to NULL, kfree will get called instead.
+ */
+
+struct fence_ops {
+ bool (*enable_signaling)(struct fence *fence);
+ bool (*signaled)(struct fence *fence);
+ long (*wait)(struct fence *fence, bool intr, signed long timeout);
+ void (*release)(struct fence *fence);
+};
+
+/**
+ * __fence_init - Initialize a custom fence.
+ * @fence: [in] the fence to initialize
+ * @ops: [in] the fence_ops for operations on this fence
+ * @lock: [in] the irqsafe spinlock to use for locking this fence
+ * @context: [in] the execution context this fence is run on
+ * @seqno: [in] a linear increasing sequence number for this context
+ *
+ * Initializes an allocated fence. The caller doesn't have to keep its
+ * refcount after committing with this fence, but it will need to hold a
+ * refcount again if fence_ops.enable_signaling gets called. This can
+ * be used as a base for implementing other types of fences.
+ *
+ * context and seqno are used for easy comparison between fences, allowing
+ * one to check which fence is later by simply using fence_later.
+ */
+static inline void
+__fence_init(struct fence *fence, const struct fence_ops *ops,
+ spinlock_t *lock, unsigned context, unsigned seqno)
+{
+ BUG_ON(!ops || !lock || !ops->enable_signaling || !ops->wait);
+
+ kref_init(&fence->refcount);
+ fence->ops = ops;
+ INIT_LIST_HEAD(&fence->cb_list);
+ fence->lock = lock;
+ fence->context = context;
+ fence->seqno = seqno;
+ fence->flags = 0UL;
+}
+
+/**
+ * fence_get - increases refcount of the fence
+ * @fence: [in] fence to increase refcount of
+ */
+static inline void fence_get(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_get(&fence->refcount);
+}
+
+extern void release_fence(struct kref *kref);
+
+/**
+ * fence_put - decreases refcount of the fence
+ * @fence: [in] fence to reduce refcount of
+ */
+static inline void fence_put(struct fence *fence)
+{
+ if (WARN_ON(!fence))
+ return;
+ kref_put(&fence->refcount, release_fence);
+}
+
+int fence_signal(struct fence *fence);
+int __fence_signal(struct fence *fence);
+long fence_default_wait(struct fence *fence, bool intr, signed long timeout);
+int fence_add_callback(struct fence *fence, struct fence_cb *cb,
+ fence_func_t func, void *priv);
+bool fence_remove_callback(struct fence *fence, struct fence_cb *cb);
+void fence_enable_sw_signaling(struct fence *fence);
+
+/**
+ * fence_is_signaled - Return an indication if the fence is signaled yet.
+ * @fence: [in] the fence to check
+ *
+ * Returns true if the fence was already signaled, false if not. Since this
+ * function doesn't enable signaling, it is not guaranteed to ever return true
+ * if fence_add_callback, fence_wait or fence_enable_sw_signaling
+ * haven't been called before.
+ *
+ * It's recommended for seqno fences to call fence_signal when the
+ * operation is complete, it makes it possible to prevent issues from
+ * wraparound between time of issue and time of use by checking the return
+ * value of this function before calling hardware-specific wait instructions.
+ */
+static inline bool
+fence_is_signaled(struct fence *fence)
+{
+ if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
+ return true;
+
+ if (fence->ops->signaled && fence->ops->signaled(fence)) {
+ fence_signal(fence);
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * fence_later - return the chronologically later fence
+ * @f1: [in] the first fence from the same context
+ * @f2: [in] the second fence from the same context
+ *
+ * Returns NULL if both fences are signaled, otherwise the fence that would be
+ * signaled last. Both fences must be from the same context, since a seqno is
+ * not re-used across contexts.
+ */
+static inline struct fence *fence_later(struct fence *f1, struct fence *f2)
+{
+ BUG_ON(f1->context != f2->context);
+
+ /*
+ * can't check just FENCE_FLAG_SIGNALED_BIT here, it may never have been
+ * set if enable_signaling wasn't called, and enabling that here is
+ * overkill.
+ */
+ if (f2->seqno - f1->seqno <= INT_MAX)
+ return fence_is_signaled(f2) ? NULL : f2;
+ else
+ return fence_is_signaled(f1) ? NULL : f1;
+}
+
+/**
+ * fence_wait_timeout - sleep until the fence gets signaled
+ * or until timeout elapses
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ * @timeout: [in] timeout value in jiffies, or MAX_SCHEDULE_TIMEOUT
+ *
+ * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or the
+ * remaining timeout in jiffies on success. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long
+fence_wait_timeout(struct fence *fence, bool intr, signed long timeout)
+{
+ if (WARN_ON(timeout < 0))
+ return -EINVAL;
+
+ return fence->ops->wait(fence, intr, timeout);
+}
+
+/**
+ * fence_wait - sleep until the fence gets signaled
+ * @fence: [in] the fence to wait on
+ * @intr: [in] if true, do an interruptible wait
+ *
+ * This function will return -ERESTARTSYS if interrupted by a signal,
+ * or 0 if the fence was signaled. Other error values may be
+ * returned on custom implementations.
+ *
+ * Performs a synchronous wait on this fence. It is assumed the caller
+ * directly or indirectly (buf-mgr between reservation and committing)
+ * holds a reference to the fence, otherwise the fence might be
+ * freed before return, resulting in undefined behavior.
+ */
+static inline long fence_wait(struct fence *fence, bool intr)
+{
+ long ret;
+
+ /* Since fence_wait_timeout cannot timeout with
+ * MAX_SCHEDULE_TIMEOUT, only valid return values are
+ * -ERESTARTSYS and MAX_SCHEDULE_TIMEOUT.
+ */
+ ret = fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT);
+
+ return ret < 0 ? ret : 0;
+}
+
+unsigned fence_context_alloc(unsigned num);
+
+#define FENCE_TRACE(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ if (config_enabled(CONFIG_FENCE_TRACE)) \
+ pr_info("f %u#%u: " fmt, \
+ __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_WARN(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_warn("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#define FENCE_ERR(f, fmt, args...) \
+ do { \
+ struct fence *__ff = (f); \
+ pr_err("f %u#%u: " fmt, __ff->context, __ff->seqno, ##args); \
+ } while (0)
+
+#endif /* __LINUX_FENCE_H */
Each dma-buf has an associated size and it's reasonable for userspace
to want to know what it is.
Since userspace already has an fd, expose the size using the
size = lseek(fd, 0, SEEK_END); lseek(fd, 0, SEEK_CUR);
idiom.
Signed-off-by: Christopher James Halse Rogers <christopher.halse.rogers(a)canonical.com>
---
I've run into a point in the radeon DRM userspace where I need the
size of a dma-buf. I could add a radeon-specific mechanism to get that,
but this seems like something that would be more generally useful.
I'm not entirely sure about supporting both SEEK_END and SEEK_CUR; this
is somewhat of an abuse of lseek, as seeking obviously doesn't make sense.
It's the obvious idiom for getting the size of what's on the other end of a
file descriptor, though.
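For reference, a minimal userspace sketch of the idiom this enables
(assuming fd is a dma-buf file descriptor obtained from some exporting
driver; with this patch only SEEK_END and SEEK_SET are accepted):

    #include <unistd.h>
    #include <sys/types.h>

    /* returns the size of the dma-buf behind fd, or -1 on error */
    static off_t dma_buf_get_size(int fd)
    {
        off_t size = lseek(fd, 0, SEEK_END);

        if (size < 0)
            return -1;
        /* rewind so later users of the fd start at offset 0 */
        if (lseek(fd, 0, SEEK_SET) < 0)
            return -1;
        return size;
    }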
I didn't notice anywhere to document this; Documentation/dma-buf-api didn't
seem like the right place. Is there somewhere I've overlooked?
drivers/base/dma-buf.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 6687ba7..c33a857 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -77,9 +77,36 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
return dmabuf->ops->mmap(dmabuf, vma);
}
+static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
+{
+ struct dma_buf *dmabuf;
+ loff_t base;
+
+ if (!is_dma_buf_file(file))
+ return -EBADF;
+
+ dmabuf = file->private_data;
+
+ /* only support discovering the end of the buffer,
+ but also allow SEEK_SET to maintain the idiomatic
+ SEEK_END(0), SEEK_SET(0) pattern */
+ if (whence == SEEK_END)
+ base = dmabuf->size;
+ else if (whence == SEEK_SET)
+ base = 0;
+ else
+ return -EINVAL;
+
+ if (offset != 0)
+ return -EINVAL;
+
+ return base + offset;
+}
+
static const struct file_operations dma_buf_fops = {
.release = dma_buf_release,
.mmap = dma_buf_mmap_internal,
+ .llseek = dma_buf_llseek,
};
/*
@@ -133,6 +160,7 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops,
dmabuf->exp_name = exp_name;
file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf, flags);
+ file->f_mode |= FMODE_LSEEK;
dmabuf->file = file;
--
1.8.3.2
On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > Turning to DRM/KMS, it seems the supported formats of a plane can be
>> > queried using drm_mode_get_plane. However, there doesn't seem to be a
>> > way to query the supported formats of a crtc? If display HW only
>> > supports scanning out from a single buffer (like pl111 does), I think
>> > it won't have any planes and a fb can only be set on the crtc. In
>> > which case, how should user-space query which pixel formats that crtc
>> > supports?
>>
>> it is exposed for drm plane's. What is missing is to expose the
>> primary-plane associated with the crtc.
>
> Cool - so a patch which adds a way to query the what formats a crtc
> supports would be welcome?
well, I kinda think we want something that exposes the "primary plane"
of the crtc.. I'm thinking something roughly like:
---------
diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
index 53db7ce..c7ffca8 100644
--- a/include/uapi/drm/drm_mode.h
+++ b/include/uapi/drm/drm_mode.h
@@ -157,6 +157,12 @@ struct drm_mode_get_plane {
struct drm_mode_get_plane_res {
__u64 plane_id_ptr;
__u32 count_planes;
+ /* The primary planes are in matching order to crtc_id_ptr in
+ * drm_mode_card_res (and same length). For crtc_id[n], its
+ * primary plane is given by primary_plane_id[n].
+ */
+ __u32 count_primary_planes;
+ __u64 primary_plane_id_ptr;
};
#define DRM_MODE_ENCODER_NONE 0
---------
then use the existing GETPLANE ioctl to query the capabilities
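For what it's worth, enumerating the formats of the planes that are
already exposed looks roughly like this with the existing libdrm
wrappers; the primary-plane lookup sketched above would just add more
entries to the same list (this is a userspace sketch, not part of any
patch here):

    #include <stdio.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* print the fourcc formats supported by each plane the driver exposes */
    static void print_plane_formats(int fd)
    {
        drmModePlaneRes *res = drmModeGetPlaneResources(fd);
        uint32_t i, j;

        if (!res)
            return;
        for (i = 0; i < res->count_planes; i++) {
            drmModePlane *plane = drmModeGetPlane(fd, res->planes[i]);

            if (!plane)
                continue;
            printf("plane %u:", plane->plane_id);
            for (j = 0; j < plane->count_formats; j++)
                printf(" %.4s", (char *)&plane->formats[j]);
            printf("\n");
            drmModeFreePlane(plane);
        }
        drmModeFreePlaneResources(res);
    }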
> What about a way to query the stride alignment constraints?
>
> Presumably using the drm_mode_get_property mechanism would be the
> right way to implement that?
>
I suppose you could try that.. typically that is in userspace,
however. It seems like get_property would get messy quickly (ie. is
it a pitch alignment constraint, or stride alignment? What if this is
different for different formats (in particular tiled)? etc)
>
>> > As with v4l2, DRM doesn't appear to have a way to query the stride
>> > constraints? Assuming there is a way to query the stride constraints,
>> > there also isn't a way to specify them when creating a buffer with
>> > DRM, though perhaps the existing pitch parameter of
>> > drm_mode_create_dumb could be used to allow user-space to pass in a
>> > minimum stride as well as receive the allocated stride?
>> >
>>
>> well, you really shouldn't be using create_dumb.. you should have a
>> userspace piece that is specific to the drm driver, and knows how to
>> use that driver's gem allocate ioctl.
>
> Sorry, why does this need a driver-specific allocation function? It's
> just a display controller driver and I just want to allocate a scan-
> out buffer - all I'm asking is for the display controller driver to
> use a minimum stride alignment so I can export the buffer and use
> another device to fill it with data.
Sure.. but userspace has more information readily available to make a
better choice. For example, for omapdrm I'd do things differently
depending on whether I wanted to scan out that buffer (or a portion of
it) rotated. This is something I know in the DDX driver, but not in
the kernel. And it is quite likely something that is driver specific.
Sure, we could add that to a generic "allocate me a buffer" ioctl.
But that doesn't really seem better, and it becomes a problem as soon
as we come across some hw that needs to know something different. In
userspace, you have a lot more flexibility, since you don't really
need to commit to an API for life.
And to bring back the GStreamer argument (since that seems a fitting
example when you start talking about sharing buffers between many
devices, for example camera+codec+display), it would already be
negotiating format between v4l2src + fooencoder + displaysink.. the
pitch/stride is part of that format information. If the display isn't
the one with the strictest requirements, we don't want the display
driver deciding what pitch to use.
> The whole point is to be able to allocate the buffer in such a way
> that another device can access it. So the driver _can't_ use a
> special, device specific format, nor can it allocate it from a
> private memory pool because doing so would preclude it from being
> shared with another device.
>
> That other device doesn't need to be a GPU either, it could just as
> easily be a camera/ISP or video decoder.
>
>
>
>> >> > So presumably you're talking about a GPU driver being the exporter
>> >> > here? If so, how could the GPU driver do these kind of tricks on
>> >> > memory shared with another device?
>> >>
>> >> Yes, that is gpu-as-exporter. If someone else is allocating
>> >> buffers, it is up to them to do these tricks or not. Probably
>> >> there is a pretty good chance that if you aren't a GPU you don't
>> >> need those sort of tricks for fast allocation of transient upload
>> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't
>> >> really think a v4l camera or video decoder would benefit from that
>> >> sort of optimization.
>> >
>> > Right - but none of those are really buffers you'd want to export
>>
>> > with dma_buf to share with another device are they? In which case,
>> > why not just have dma_buf figure out the constraints and allocate
>> > the memory?
>>
>> maybe not.. but (a) you don't necessarily know at creation time if it
>> is going to be exported (maybe you know if it is definitely not going
>> to be exported, but the converse is not true),
>
> I can't actually think of an example where you would not know if a
> buffer was going to be exported or not at allocation time? Do you have
> a case in mind?
yeah, dri2.. when the front buffer is allocated it is just a regular
pixmap. If you swap/flip it becomes the back buffer and now shared
;-)
And pixmaps are allocated w/ enough frequency that it is the sort of
thing you might want to optimize.
And even when you know it will be shared, you don't know with who.
> Regardless, you'd certainly have to know if a buffer will be exported
> pretty quickly, before it's used so that you can import it into
> whatever devices are going to access it. Otherwise if it gets
> allocated before you export it, the allocation won't satisfy the
> constraints of the other devices which will need to access it and
> importing will fail. Assuming of course deferred allocation of the
> backing pages as discussed earlier in the thread.
>
>
>
>> and (b) there isn't
>> really any reason to special case the allocation in the driver because
>> it is going to be exported.
>
> Not sure I follow you here? Surely you absolutely have to special-case
> the allocation if the buffer is to be exported because you have to
> take the other devices' constraints into account when you allocate? Or
> do you mean you don't need to special-case the GEM buffer object
> creation, only the allocation of the backing pages? Though I'm not
> sure how that distinction is useful - at the end of the day, you need
> to special-case allocation of the backing pages.
>
well, you need to consider separately what is (a) in the pages, and
(b) where the pages come from.
By moving the allocation into dmabuf you restrict (b). For sharing
buffers, (a) may be restricted, but there are at least some examples of
hardware where (b) would not otherwise be restricted by sharing.
>
>> helpers that can be used by simple drivers, yes. Forcing the way the
>> buffer is allocated, for sure not. Currently, for example, there is
>> no issue to export a buffer allocated from stolen-mem.
>
> Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk
> of memory reserved at boot for the GPU which the OS can't touch? I
> guess I view such memory as accessible to all media devices on the
> system and as such, needs to be managed by a central allocator which
> dma_buf can use to allocate from.
think carve-out created by bios. In all the cases I am aware of, the
drm driver handles allocation of buffer(s) from the carveout.
> I guess if that stolen-mem is managed by a single device then in
> essence that device becomes the central allocator you have to use to
> be able to allocate from that stolen mem?
>
>
>> > If a driver needs to allocate memory in a special way for a
>> > particular device, I can't really imagine how it would be able
>> > to share that buffer with another device using dma_buf? I guess
>> > a driver is likely to need some magic voodoo to configure access
>> > to the buffer for its device, but surely that would be done by
>> > the dma_mapping framework when dma_buf_map happens?
>> >
>>
>> if, what it has to configure actually manages to fit in the
>> dma-mapping framework
>
> But if it doesn't, surely that's an issue which needs to be addressed
> in the dma_mapping framework or else you won't be able to import
> buffers for use by that device anyway?
>
I'm not sure if we have to fit everything in the dma-mapping framework, at
least in cases where you have something that is specific to one
platform.
Currently dma-buf provides enough flexibility for other drivers to be
able to import these buffers.
>
>> anyways, where the pages come from has nothing to do with whether a
>> buffer can be shared or not
>
> Sure, but where they are located in physical memory really does
> matter.
>
s/does/can/
it doesn't always matter. And in cases where it does matter, as long
as we can express the restrictions in dma_parms (which we can already
for the case of range-of-memory restrictions) we are covered
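As a sketch of that kernel side, a hypothetical driver ("foo" is an
illustrative name, not from this thread) would express its placement
restrictions through dma_parms and the DMA mask helpers, roughly like this:

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

/* illustrative per-device dma_parms for a hypothetical "foo" device */
static struct device_dma_parameters foo_dma_parms;

static int foo_probe(struct platform_device *pdev)
{
	struct device *dev = &pdev->dev;

	dev->dma_parms = &foo_dma_parms;
	dma_set_max_seg_size(dev, SZ_4M);	/* no sg segment larger than 4 MiB */
	dma_set_seg_boundary(dev, SZ_128M - 1);	/* segments must not cross 128 MiB */

	/* range-of-memory restriction: device can only address 32 bits */
	return dma_set_coherent_mask(dev, DMA_BIT_MASK(32));
}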
>
>> >> At any rate, for both xorg and wayland/gbm, you know when a buffer
>> >> is going to be a scanout buffer. What I'd recommend is define a
>> >> small userspace API that your customers (the SoC vendors) implement
>> >> to allocate a scanout buffer and hand you back a dmabuf fd. That
>> >> could be used both for x11 and for gbm. Inputs should be requested
>> >> width/height and format. And outputs pitch plus dmabuf fd.
>> >>
>> >> (Actually you might even just want to use gbm as your starting
>> >> point. You could probably just use gbm from xf86-video-armsoc for
>> >> allocation, to have one thing that works for both wayland and x11.
>> >> Scanout and cursor buffers should go to vendor/SoC specific fxn,
>> >> rest can be allocated from mali kernel driver.)
>> >
>> > What does that buy us over just using drm_mode_create_dumb on the
>> > display's DRM driver?
>>
>> well, for example, if there was actually some hw w/ omap's dss + mali,
>> you could actually have mali render transparently to tiled buffers
>> which could be scanned out rotated. Which would not be possible w/
>> dumb buffers.
>
> Why not? As you said earlier, the format is defined when you setup the
> fb with drm_mode_fb_cmd2. If you wanted to share the buffer between
> devices, you have to be explicit about what format that buffer is in,
> so you'd have to add an entry to drm_fourcc.h for the tiled format.
no, that doesn't really work
in this case, the format to any device (or userspace) accessing the
buffer is not tiled. (Ie. it would look like normal NV12 or
whatever). But there are some different requirements on stride. And
there are cases where you would prefer not to use tiled buffers, but
the kernel doesn't know enough in the dumb-buffer alloc ioctl to make
the correct decision.
> So userspace queries what formats the GPU DRM supports and what
> formats the OMAP DSS DRM supports, selects the tiled format and then
> uses drm_mode_create_dumb to allocate a buffer of the correct size and
> sets the appropriate drm_fourcc.h enum value when creating an fb for
> that buffer. Or have I missed something?
>
>
>
>> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> >> (where phys contig is not required for scanout), but causes CMA
>> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> >> It just knows that it wants to be able to scanout that particular
>> >> >> buffer.
>> >> >
>> >> > I think that's the idea? The omap3's allocator driver would use
>> >> > contiguous memory when it detects the SCANOUT flag whereas the
>> >> > omap4 allocator driver wouldn't have to. No complex negotiation
>> >> > of constraints - it just "knows".
>> >> >
>> >>
>> >> well, it is the same allocating driver in both cases (although maybe
>> >> that is unimportant). The "it" that just knows it wants to scanout
>> >> is userspace. The "it" that just knows that scanout translates to
>> >> contiguous (or not) is the kernel. Perhaps we are saying the same
>> >> thing ;-)
>> >
>> > Yeah - I think we are... so what's the issue with having a per-SoC
>> > allocation driver again?
>> >
>>
>> In a way the display driver is a per-SoC allocator. But not
>> necessarily the *central* allocator for everything. Ie. no need for
>> display driver to allocate vertex buffers for a separate gpu driver,
>> and that sort of thing.
>
> Again, I'm only talking about allocating buffers which will be shared
> between different devices. At no point have I mentioned the allocation
> of buffers which aren't to be shared between devices. Sorry if that's
> not been clear.
ok, I guess we were talking about slightly different things ;-)
> So for buffers which are to be shared between devices, your suggesting
> that the display driver is the per-SoC allocator? But as I say, and
> how this thread got started, the same display driver can be used on
> different SoCs, so having _it_ be the central allocator isn't ideal.
> Though this is our current solution and why we're "abusing" the dumb
> buffer allocation functions. :-)
>
which is why you want to let userspace figure out the pitch and then
tell the display driver what size it wants, rather than using dumb
buffer ioctl ;-)
Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or
property or what have you.. but I think that would be hard to get
right for all cases, and most people don't really care about that
because they already need a gpu/display specific xorg driver and/or
gl/egl talking to their kernel driver. You are in a slightly special
case, since you are providing GL driver independently of the display
driver. But I think that is easier to handle by just telling your
customers "here, fill out this function(s) to allocate buffer for
scanout" (and, well, I guess you'd need one to query for
pitch/stride), rather than trying to cram everything into the kernel.
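A rough sketch of the shape such a vendor-implemented hook could take
(every name below is invented for illustration; this is not an existing
API):

#include <stdint.h>

/* inputs/outputs as described above: width/height/format in,
 * pitch plus dma-buf fd out; implemented per SoC by the vendor,
 * e.g. on top of the display DRM driver's own GEM ioctls */
struct scanout_alloc_request {
	uint32_t width;
	uint32_t height;
	uint32_t fourcc;	/* DRM fourcc the buffer will be interpreted as */
};

struct scanout_alloc_result {
	int      dmabuf_fd;	/* importable into the GPU and display drivers */
	uint32_t pitch;		/* stride chosen by the vendor allocator */
};

int scanout_buffer_alloc(const struct scanout_alloc_request *req,
			 struct scanout_alloc_result *out);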
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the devices sit behind?
>
perhaps yes. But the device is associated w/ the iommu it is attached
to, so this shouldn't be a problem
>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be device-
> specific or opaque, it has to be explicit. I think drm_fourcc.h is
> what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
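For reference, importing such a dma-buf through that extension is a single
eglCreateImageKHR call; a sketch for a one-plane XRGB8888 buffer (the fd,
width, height and stride are assumed to come from the exporting driver, and
the extension tokens are assumed to be present in eglext.h):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm_fourcc.h>

/* wrap eglCreateImageKHR for a single-plane dma-buf; the caller supplies
 * the fd and stride it negotiated with the exporter */
static EGLImageKHR import_dmabuf(EGLDisplay dpy, int fd, int width,
				 int height, int stride)
{
	PFNEGLCREATEIMAGEKHRPROC create_image =
		(PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
	const EGLint attribs[] = {
		EGL_WIDTH, width,
		EGL_HEIGHT, height,
		EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_XRGB8888,
		EGL_DMA_BUF_PLANE0_FD_EXT, fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
		EGL_NONE
	};

	/* the EGL_LINUX_DMA_BUF_EXT target takes no client buffer and no context */
	return create_image(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT,
			    NULL, attribs);
}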
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?
it is exposed for drm planes. What is missing is to expose the
primary-plane associated with the crtc.
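For completeness, the existing query path through libdrm looks roughly like
this (overlay planes only, since the primary plane behind the crtc is not
exposed):

#include <stdint.h>
#include <stdio.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* print the fourccs each (overlay) plane advertises */
static void dump_plane_formats(int fd)
{
	drmModePlaneResPtr res = drmModeGetPlaneResources(fd);
	uint32_t i, j;

	if (!res)
		return;
	for (i = 0; i < res->count_planes; i++) {
		drmModePlanePtr p = drmModeGetPlane(fd, res->planes[i]);

		if (!p)
			continue;
		for (j = 0; j < p->count_formats; j++)
			/* fourcc printed as four chars (little-endian) */
			printf("plane %u: format %.4s\n", p->plane_id,
			       (char *)&p->formats[j]);
		drmModeFreePlane(p);
	}
	drmModeFreePlaneResources(res);
}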
> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?
Right, the gem object has no inherent format, it is just some bytes.
The format/width/height/pitch are all attributes of the fb.
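A minimal libdrm sketch of that split, assuming libdrm headers and an
XRGB8888 buffer: the dumb allocation only sees width/height/bpp, and the
fourcc is attached when the fb is created:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* allocate a dumb buffer (just bytes), then give it a format via AddFB2 */
static int create_xrgb8888_fb(int fd, uint32_t w, uint32_t h, uint32_t *fb_id)
{
	struct drm_mode_create_dumb creq;
	uint32_t handles[4] = {0}, pitches[4] = {0}, offsets[4] = {0};

	memset(&creq, 0, sizeof(creq));
	creq.width = w;
	creq.height = h;
	creq.bpp = 32;				/* no format here, just bpp */
	if (drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq))
		return -1;

	handles[0] = creq.handle;
	pitches[0] = creq.pitch;		/* pitch chosen by the kernel */
	/* the fourcc is what gives those bytes a format */
	return drmModeAddFB2(fd, w, h, DRM_FORMAT_XRGB8888,
			     handles, pitches, offsets, fb_id, 0);
}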
> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>
well, you really shouldn't be using create_dumb.. you should have a
userspace piece that is specific to the drm driver, and knows how to
use that driver's gem allocate ioctl.
>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine all the constraints (assuming that
>> >> > can be done) and allocate pages such that the lowest-common-
>> >> > denominator is satisfied. Perhaps rather than duplicating that
>> >> > logic in every driver, we could instead move allocation of the
>> >> > backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?
maybe not.. but (a) you don't necessarily know at creation time if it
is going to be exported (maybe you know if it is definitely not going
to be exported, but the converse is not true), and (b) there isn't
really any reason to special case the allocation in the driver because
it is going to be exported.
helpers that can be used by simple drivers, yes. Forcing the way the
buffer is allocated, for sure not. Currently, for example, there is
no issue to export a buffer allocated from stolen-mem. If we put the
page allocation in dma-buf, this would not be possible. That is just
one quick example off the top of my head, I'm sure there are plenty
more. But we definitely do not want the allocate in dma_buf itself.
> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf? I guess a driver is likely
> to need some magic voodoo to configure access to the buffer for its
> device, but surely that would be done by the dma_mapping framework
> when dma_buf_map happens?
>
if, what it has to configure actually manages to fit in the
dma-mapping framework
anyways, where the pages come from has nothing to do with whether a
buffer can be shared or not
>
>
>> >> You probably want to get out of the SoC mindset, otherwise you are
>> >> going to make bad assumptions that come back to bite you later on.
>> >
>> > Sure - there are always going to be PC-like devices where the
>> > hardware configuration isn't fixed like it is on a traditional SoC.
>> > But I'd rather have a simple solution which works on traditional SoCs
>> > than no solution at all. Today our solution is to over-load the dumb
>> > buffer alloc functions of the display's DRM driver - For now I'm just
>> > looking for the next step up from that! ;-)
>>
>> True.. the original intention, which is perhaps a bit desktop-centric,
>> really was for there to be a userspace component talking to the drm
>> driver for allocation, ie. xf86-video-foo and/or
>> src/gallium/drivers/foo (for example) ;-)
>>
>> Which means for x11 having a SoC vendor specific xf86-video-foo for
>> x11.. or vendor specific gbm implementation for wayland. (Although
>> at least in the latter case it is a pretty small piece of code.) But
>> that is probably what you are trying to avoid.
>
> I've been trying to get my head around how PRIME relates to DDX
> drivers. As I understand it (which is likely wrong), you have a laptop
> with both an Intel & an nVidia GPU. You have both the i915 & nouveau
> kernel drivers loaded. What I'm not sure about is which GPU's display
> controller is actually hooked up to the physical connector? Perhaps
> there is a MUX like there is on Versatile Express?
afaiu it can be a, b, or c (ie. either gpu can have the display or
there can be a mux)..
> What I also don't understand is what DDX driver is loaded? Is it
> xf86-video-intel, xf86-video-nouveau or both? I get the impression
> that there's a "master" DDX which implements 2D operations but can
> import buffers using PRIME from the other driver and draw to them.
> Or is it more that it's able to export rendered buffers to the
> "slave" DRM for scanout? Either way, it's pretty similar to an ARM
> SoC setup which has the GPU and the display as two totally
> independent devices.
>
>
>
>> At any rate, for both xorg and wayland/gbm, you know when a buffer is
>> going to be a scanout buffer. What I'd recommend is define a small
>> userspace API that your customers (the SoC vendors) implement to
>> allocate a scanout buffer and hand you back a dmabuf fd. That could
>> be used both for x11 and for gbm. Inputs should be requested
>> width/height and format. And outputs pitch plus dmabuf fd.
>>
>> (Actually you might even just want to use gbm as your starting point.
>> You could probably just use gbm from xf86-video-armsoc for allocation,
>> to have one thing that works for both wayland and x11. Scanout and
>> cursor buffers should go to vendor/SoC specific fxn, rest can be
>> allocated from mali kernel driver.)
>
> What does that buy us over just using drm_mode_create_dumb on the
> display's DRM driver?
>
well, for example, if there was actually some hw w/ omap's dss + mali,
you could actually have mali render transparently to tiled buffers
which could be scanned out rotated. Which would not be possible w/
dumb buffers.
>
>
>> >> > wouldn't need a way to programmatically describe the constraints
>> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would
>> >> > just "know" that on this SoC, that buffer needs to be physically
>> >> > contiguous for example.
>> >>
>> >> not really.. it just knows it wants to scanout the buffer, and tells
>> >> this as a hint to the kernel.
>> >>
>> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> >> (where phys contig is not required for scanout), but causes CMA
>> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care.
>> >> It just knows that it wants to be able to scanout that particular
>> >> buffer.
>> >
>> > I think that's the idea? The omap3's allocator driver would use
>> > contiguous memory when it detects the SCANOUT flag whereas the omap4
>> > allocator driver wouldn't have to. No complex negotiation of
>> > constraints - it just "knows".
>> >
>>
>> well, it is the same allocating driver in both cases (although maybe that
>> is unimportant). The "it" that just knows it wants to scanout is
>> userspace. The "it" that just knows that scanout translates to
>> contiguous (or not) is the kernel. Perhaps we are saying the same
>> thing ;-)
>
> Yeah - I think we are... so what's the issue with having a per-SoC
> allocation driver again?
>
In a way the display driver is a per-SoC allocator. But not
necessarily the *central* allocator for everything. Ie. no need for
display driver to allocate vertex buffers for a separate gpu driver,
and that sort of thing.
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> > Didn't you say that programmatically describing device placement
>> >> > constraints was an unbounded problem? I guess we would have to
>> >> > accept that it's not possible to describe all possible constraints
>> >> > and instead find a way to describe the common ones?
>> >>
>> >> well, the point I'm trying to make, is by dividing your constraints
>> >> into two groups, one that impacts and is handled by userspace, and
>> >> one that is in the kernel (ie. where the pages go), you cut down
>> >> the number of permutations that the kernel has to care about
>> >> considerably. And kernel already cares about, for example, what
>> >> range of addresses that a device can dma to/from. I think really
>> >> the only thing missing is the max # of sglist entries (contiguous
>> >> or not)
>> >
>> > I think it's more than physically contiguous or not.
>> >
>> > For example, it can be more efficient to use large page sizes on
>> > devices with IOMMUs to reduce TLB traffic. I think the size and even
>> > the availability of large pages varies between different IOMMUs.
>>
>> sure.. but I suppose if we can spiff out dma_params to express "I need
>> contiguous", perhaps we can add some way to express "I prefer
>> as-contiguous-as-possible".. either way, this is about where the pages
>> are placed, and not about the layout of pixels within the page, so
>> should be in kernel. It's something that is missing, but I believe
>> that it belongs in dma_params and hidden behind dma_alloc_*() for
>> simple drivers.
>
> Thinking about it, isn't this more a property of the IOMMU? I mean,
> are there any cases where an IOMMU had a large page mode but you
> wouldn't want to use it? So when allocating the memory, you'd have to
> take into account not just the constraints of the devices themselves,
> but also of any IOMMUs any of the devices sit behind?
>
>
>> > There's also the issue of buffer stride alignment. As I say, if the
>> > buffer is to be written by a tile-based GPU like Mali, it's more
>> > efficient if the buffer's stride is aligned to the max AXI bus burst
>> > length. Though I guess a buffer stride only makes sense as a concept
>> > when interpreting the data as a linear-layout 2D image, so perhaps
>> > belongs in user-space along with format negotiation?
>> >
>>
>> Yeah.. this isn't about where the pages go, but about the arrangement
>> within a page.
>>
>> And, well, except for hw that supports the same tiling (or
>> compressed-fb) in display+gpu, you probably aren't sharing tiled
>> buffers.
>
> You'd only want to share a buffer between devices if those devices can
> understand the same pixel format. That pixel format can't be device-
> specific or opaque, it has to be explicit. I think drm_fourcc.h is
> what defines all the possible pixel formats. This is the enum I used
> in EGL_EXT_image_dma_buf_import at least. So if we get to the point
> where multiple devices can understand a tiled or compressed format, I
> assume we could just add that format to drm_fourcc.h and possibly
> v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
>
> For user-space to negotiate a common pixel format and now stride
> alignment, I guess it will obviously need a way to query what pixel
> formats a device supports and what its stride alignment requirements
> are.
>
> I don't know v4l2 very well, but it certainly seems the pixel format
> can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
> a particular format. I couldn't however find a way to retrieve a list
> of supported formats - it seems the mechanism is to try out each
> format in turn to determine if it is supported. Is that right?
>
> There doesn't however seem a way to query what stride constraints a
> V4l2 device might have. Does HW abstracted by v4l2 typically have
> such constraints? If so, how can we query them such that a buffer
> allocated by a DRM driver can be imported into v4l2 and used with
> that HW?
>
> Turning to DRM/KMS, it seems the supported formats of a plane can be
> queried using drm_mode_get_plane. However, there doesn't seem to be a
> way to query the supported formats of a crtc? If display HW only
> supports scanning out from a single buffer (like pl111 does), I think
> it won't have any planes and a fb can only be set on the crtc. In
> which case, how should user-space query which pixel formats that crtc
> supports?
>
> Assuming user-space can query the supported formats and find a common
> one, it will need to allocate a buffer. Looks like
> drm_mode_create_dumb can do that, but it only takes a bpp parameter,
> there's no format parameter. I assume then that user-space defines
> the format and tells the DRM driver which format the buffer is in
> when creating the fb with drm_mode_fb_cmd2, which does take a format
> parameter? Is that right?
>
> As with v4l2, DRM doesn't appear to have a way to query the stride
> constraints? Assuming there is a way to query the stride constraints,
> there also isn't a way to specify them when creating a buffer with
> DRM, though perhaps the existing pitch parameter of
> drm_mode_create_dumb could be used to allow user-space to pass in a
> minimum stride as well as receive the allocated stride?
>
>
>> >> > One problem with this is it duplicates a lot of logic in each
>> >> > driver which can export a dma_buf buffer. Each exporter will
>> >> > need to do pretty much the same thing: iterate over all the
>> >> > attachments, determine all the constraints (assuming that
>> >> > can be done) and allocate pages such that the lowest-common-
>> >> > denominator is satisfied. Perhaps rather than duplicating that
>> >> > logic in every driver, we could instead move allocation of the
>> >> > backing pages into dma_buf itself?
>> >>
>> >> I tend to think it is better to add helpers as we see common
>> >> patterns emerge, which drivers can opt-in to using. I don't
>> >> think that we should move allocation into dma_buf itself, but
>> >> it would perhaps be useful to have dma_alloc_*() variants that
>> >> could allocate for multiple devices.
>> >
>> > A helper could work I guess, though I quite like the idea of
>> > having dma_alloc_*() variants which take a list of devices to
>> > allocate memory for.
>> >
>> >
>> >> That would help for simple stuff, although I'd suspect
>> >> eventually a GPU driver will move away from that. (Since you
>> >> probably want to play tricks w/ pools of pages that are
>> >> pre-zero'd and in the correct cache state, use spare cycles on
>> >> the gpu or dma engine to pre-zero uncached pages, and games
>> >> like that.)
>> >
>> > So presumably you're talking about a GPU driver being the exporter
>> > here? If so, how could the GPU driver do these kind of tricks on
>> > memory shared with another device?
>>
>> Yes, that is gpu-as-exporter. If someone else is allocating buffers,
>> it is up to them to do these tricks or not. Probably there is a
>> pretty good chance that if you aren't a GPU you don't need those sort
>> of tricks for fast allocation of transient upload buffers, staging
>> textures, temporary pixmaps, etc. Ie. I don't really think a v4l
>> camera or video decoder would benefit from that sort of optimization.
>
> Right - but none of those are really buffers you'd want to export with
> dma_buf to share with another device are they? In which case, why not
> just have dma_buf figure out the constraints and allocate the memory?
>
> If a driver needs to allocate memory in a special way for a particular
> device, I can't really imagine how it would be able to share that
> buffer with another device using dma_buf? I guess a driver is likely
> to need some magic voodoo to configure access to the buffer for its
> device, but surely that would be done by the dma_mapping framework
> when dma_buf_map happens?
>
>
>
>> >> You probably want to get out of the SoC mindset, otherwise you are
>> >> going to make bad assumptions that come back to bite you later on.
>> >
>> > Sure - there are always going to be PC-like devices where the
>> > hardware configuration isn't fixed like it is on a traditional SoC.
>> > But I'd rather have a simple solution which works on traditional SoCs
>> > than no solution at all. Today our solution is to over-load the dumb
>> > buffer alloc functions of the display's DRM driver - For now I'm just
>> > looking for the next step up from that! ;-)
>>
>> True.. the original intention, which is perhaps a bit desktop-centric,
>> really was for there to be a userspace component talking to the drm
>> driver for allocation, ie. xf86-video-foo and/or
>> src/gallium/drivers/foo (for example) ;-)
>>
>> Which means for x11 having a SoC vendor specific xf86-video-foo for
>> x11.. or vendor specific gbm implementation for wayland. (Although
>> at least in the latter case it is a pretty small piece of code.) But
>> that is probably what you are trying to avoid.
>
> I've been trying to get my head around how PRIME relates to DDX
> drivers. As I understand it (which is likely wrong), you have a laptop
> with both an Intel & an nVidia GPU. You have both the i915 & nouveau
> kernel drivers loaded. What I'm not sure about is which GPU's display
> controller is actually hooked up to the physical connector? Perhaps
> there is a MUX like there is on Versatile Express?
>
> What I also don't understand is what DDX driver is loaded? Is it
> xf86-video-intel, xf86-video-nouveau or both? I get the impression
> that there's a "master" DDX which implements 2D operations but can
> import buffers using PRIME from the other driver and draw to them.
> Or is it more that it's able to export rendered buffers to the
> "slave" DRM for scanout? Either way, it's pretty similar to an ARM
> SoC setup which has the GPU and the display as two totally
> independent devices.
>
In the early days, there was a MUX to switch the displays between the
two GPUs, but most modern systems are MUX-less and the dGPU is either
connected to no displays or in some cases the local panel is attached
to the integrated GPU and the external displays are connected to the
dGPU.
In the MUX-less case, the dGPU can be used to render, and then the
result is copied to the integrated GPU for display.
Alex
On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> >> ... This is the purpose of the attach step,
>> >> so you know all the devices involved in sharing up front before
>> >> allocating the backing pages. (Or in the worst case, if you have a
>> >> "late attacher" you at least know when no device is doing dma access
>> >> to a buffer and can reallocate and move the buffer.) A long time
>> >> back, I had a patch that added a field or two to 'struct
>> >> device_dma_parameters' so that it could be known if a device
>> >> required contiguous buffers.. looks like that never got merged, so
>> >> I'd need to dig that back up and resend it. But the idea was to
>> >> have the 'struct device' encapsulate all the information that would
>> >> be needed to do-the-right-thing when it comes to placement.
>> >
>> > As I understand it, it's up to the exporting device to allocate the
>> > memory backing the dma_buf buffer. I guess the latest possible point
>> > you can allocate the backing pages is when map_dma_buf is first
>> > called? At that point the exporter can iterate over the current set
>> > of attachments, programmatically determine all the constraints of
>> > all the attached drivers and attempt to allocate the backing pages
>> > in such a way as to satisfy all those constraints?
>>
>> yes, this is the idea.. possibly some room for some helpers to help
>> out with this, but that is all under the hood from userspace
>> perspective
>>
>> > Didn't you say that programmatically describing device placement
>> > constraints was an unbounded problem? I guess we would have to
>> > accept that it's not possible to describe all possible constraints
>> > and instead find a way to describe the common ones?
>>
>> well, the point I'm trying to make, is by dividing your constraints
>> into two groups, one that impacts and is handled by userspace, and one
>> that is in the kernel (ie. where the pages go), you cut down the
>> number of permutations that the kernel has to care about considerably.
>> And kernel already cares about, for example, what range of addresses
>> that a device can dma to/from. I think really the only thing missing
>> is the max # of sglist entries (contiguous or not)
>
> I think it's more than physically contiguous or not.
>
> For example, it can be more efficient to use large page sizes on
> devices with IOMMUs to reduce TLB traffic. I think the size and even
> the availability of large pages varies between different IOMMUs.
sure.. but I suppose if we can spiff out dma_params to express "I need
contiguous", perhaps we can add some way to express "I prefer
as-contiguous-as-possible".. either way, this is about where the pages
are placed, and not about the layout of pixels within the page, so
should be in kernel. It's something that is missing, but I believe
that it belongs in dma_params and hidden behind dma_alloc_*() for
simple drivers.
> There's also the issue of buffer stride alignment. As I say, if the
> buffer is to be written by a tile-based GPU like Mali, it's more
> efficient if the buffer's stride is aligned to the max AXI bus burst
> length. Though I guess a buffer stride only makes sense as a concept
> when interpreting the data as a linear-layout 2D image, so perhaps
> belongs in user-space along with format negotiation?
>
Yeah.. this isn't about where the pages go, but about the arrangement
within a page.
And, well, except for hw that supports the same tiling (or
compressed-fb) in display+gpu, you probably aren't sharing tiled
buffers.
>
>> > One problem with this is it duplicates a lot of logic in each
>> > driver which can export a dma_buf buffer. Each exporter will need to
>> > do pretty much the same thing: iterate over all the attachments,
>> > determine all the constraints (assuming that can be done) and
>> > allocate pages such that the lowest-common-denominator is satisfied.
>> >
>> > Perhaps rather than duplicating that logic in every driver, we could
>> > instead move allocation of the backing pages into dma_buf itself?
>> >
>>
>> I tend to think it is better to add helpers as we see common patterns
>> emerge, which drivers can opt-in to using. I don't think that we
>> should move allocation into dma_buf itself, but it would perhaps be
>> useful to have dma_alloc_*() variants that could allocate for multiple
>> devices.
>
> A helper could work I guess, though I quite like the idea of having
> dma_alloc_*() variants which take a list of devices to allocate memory
> for.
>
>
>> That would help for simple stuff, although I'd suspect
>> eventually a GPU driver will move away from that. (Since you probably
>> want to play tricks w/ pools of pages that are pre-zero'd and in the
>> correct cache state, use spare cycles on the gpu or dma engine to
>> pre-zero uncached pages, and games like that.)
>
> So presumably you're talking about a GPU driver being the exporter
> here? If so, how could the GPU driver do these kind of tricks on
> memory shared with another device?
>
Yes, that is gpu-as-exporter. If someone else is allocating buffers,
it is up to them to do these tricks or not. Probably there is a
pretty good chance that if you aren't a GPU you don't need those sort
of tricks for fast allocation of transient upload buffers, staging
textures, temporary pixmaps, etc. Ie. I don't really think a v4l
camera or video decoder would benefit from that sort of optimization.
>
>> >> > Anyway, assuming user-space can figure out how a buffer should be
>> >> > stored in memory, how does it indicate this to a kernel driver and
>> >> > actually allocate it? Which ioctl on which device does user-space
>> >> > call, with what parameters? Are you suggesting using something
>> >> > like ION which exposes the low-level details of how buffers are
>> >> > laid out in physical memory to userspace? If not, what?
>> >> >
>> >>
>> >> no, userspace should not need to know this. And having a central
>> >> driver that knows this for all the other drivers in the system
>> >> doesn't really solve anything and isn't really scalable. At best
>> >> you might want, in some cases, a flag you can pass when allocating.
>> >> For example, some of the drivers have a 'SCANOUT' flag that can be
>> >> passed when allocating a GEM buffer, as a hint to the kernel that
>> >> 'if this hw requires contig memory for scanout, allocate this
>> >> buffer contig'. But really, when it comes to sharing buffers
>> >> between devices, we want this sort of information in dev->dma_params
>> >> of the importing device(s).
>> >
>> > If you had a single driver which knew the constraints of all devices
>> > on that particular SoC and the interface allowed user-space to
>> > specify which devices a buffer is intended to be used with, I guess
>> > it could pretty trivially allocate pages which satisfy those
>> > constraints?
>>
>> keep in mind, even a number of SoCs come with PCIe these days. You
>> already have things like
>>
>> https://developer.nvidia.com/content/kayla-platform
>>
>> You probably want to get out of the SoC mindset, otherwise you are
>> going to make bad assumptions that come back to bite you later on.
>
> Sure - there are always going to be PC-like devices where the
> hardware configuration isn't fixed like it is on a traditional SoC.
> But I'd rather have a simple solution which works on traditional SoCs
> than no solution at all. Today our solution is to over-load the dumb
> buffer alloc functions of the display's DRM driver - For now I'm just
> looking for the next step up from that! ;-)
>
True.. the original intention, which is perhaps a bit desktop-centric,
really was for there to be a userspace component talking to the drm
driver for allocation, ie. xf86-video-foo and/or
src/gallium/drivers/foo (for example) ;-)
Which means for x11 having a SoC vendor specific xf86-video-foo for
x11.. or vendor specific gbm implementation for wayland. (Although
at least in the latter case it is a pretty small piece of code.) But
that is probably what you are trying to avoid.
At any rate, for both xorg and wayland/gbm, you know when a buffer is
going to be a scanout buffer. What I'd recommend is define a small
userspace API that your customers (the SoC vendors) implement to
allocate a scanout buffer and hand you back a dmabuf fd. That could
be used both for x11 and for gbm. Inputs should be requested
width/height and format. And outputs pitch plus dmabuf fd.
(Actually you might even just want to use gbm as your starting point.
You could probably just use gbm from xf86-video-armsoc for allocation,
to have one thing that works for both wayland and x11. Scanout and
cursor buffers should go to vendor/SoC specific fxn, rest can be
allocated from mali kernel driver.)
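For illustration, the gbm side of that is only a handful of calls; a sketch
for an XRGB8888 scanout-capable buffer (device/bo teardown omitted):

#include <stdint.h>
#include <gbm.h>

/* allocate a scanout-capable buffer through gbm; 'fd' is the opened drm device */
static struct gbm_bo *alloc_scanout_bo(int fd, uint32_t w, uint32_t h,
				       uint32_t *pitch)
{
	struct gbm_device *gbm = gbm_create_device(fd);
	struct gbm_bo *bo;

	if (!gbm)
		return NULL;
	bo = gbm_bo_create(gbm, w, h, GBM_FORMAT_XRGB8888,
			   GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
	if (bo)
		*pitch = gbm_bo_get_stride(bo);	/* stride picked by the allocator */
	return bo;
}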
>
>> > wouldn't need a way to programmatically describe the constraints
>> > either: As you say, if userspace sets the "SCANOUT" flag, it would
>> > just "know" that on this SoC, that buffer needs to be physically
>> > contiguous for example.
>>
>> not really.. it just knows it wants to scanout the buffer, and tells
>> this as a hint to the kernel.
>>
>> For example, on omapdrm, the SCANOUT flag does nothing on omap4+
>> (where phys contig is not required for scanout), but causes CMA
>> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just
>> knows that it wants to be able to scanout that particular buffer.
>
> I think that's the idea? The omap3's allocator driver would use
> contiguous memory when it detects the SCANOUT flag whereas the omap4
> allocator driver wouldn't have to. No complex negotiation of
> constraints - it just "knows".
>
well, it is the same allocating driver in both cases (although maybe that
is unimportant). The "it" that just knows it wants to scanout is
userspace. The "it" that just knows that scanout translates to
contiguous (or not) is the kernel. Perhaps we are saying the same
thing ;-)
BR,
-R
>
> Cheers,
>
> Tom
>
>
>
>
Hello,
This is the fourth version of my proposal for device tree integration for
reserved memory and the Contiguous Memory Allocator. After the comments from
Grant Likely I moved the memory region definitions back to the /memory node
(as it was in the first version of this proposal). I've also extended
the code and made it more generic, and added support for so-called reserved
DMA memory (special DMA memory regions for exclusive use by
dma_alloc_coherent() allocations for the given device).
Just a few words for those who see this code for the first time:
The proposed bindings allow defining contiguous memory regions with a
specified base address and size. The defined regions can then be
assigned to the given device(s) by adding a property with a phandle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are in place, all memory allocations made
through the dma-mapping subsystem will be served from the defined
contiguous memory regions.
The Contiguous Memory Allocator is a framework which provides large
contiguous memory buffers for (usually multimedia) devices. The
contiguous memory is reserved during early boot and then shared with the
kernel, which is allowed to use it for movable pages. When a device
driver requests a contiguous buffer, the framework migrates movable
pages out of the contiguous region and hands it to the driver. When the
device driver frees the buffer, it is added back to the kernel memory pool.
For more information, please refer to commit c64be2bb1c6eb43c838b2c6d57
("drivers: add Contiguous Memory Allocator") and d484864dd96e1830e76895
(CMA merge commit).
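To illustrate the driver-facing side: once a region has been assigned to a
device through these bindings, the driver keeps using the ordinary coherent
DMA API and the allocation is simply served from that region. A minimal
sketch (the "foo" names are illustrative, not part of this series):

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/platform_device.h>

/* ordinary coherent allocation; with a region assigned to this device,
 * the backing memory comes from that reserved/CMA region */
static int foo_alloc_frame(struct platform_device *pdev, size_t size)
{
	dma_addr_t dma;
	void *cpu = dma_alloc_coherent(&pdev->dev, size, &dma, GFP_KERNEL);

	if (!cpu)
		return -ENOMEM;
	/* ... program 'dma' into the hardware, access 'cpu' from the CPU ... */
	dma_free_coherent(&pdev->dev, size, cpu, dma);
	return 0;
}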
Why do we need device tree bindings for CMA at all?
Older ARM kernels used so-called board-based initialization. Those board
files contained a definition of all hardware blocks available on the
target system, together with the particular kernel and driver software
configuration selected by the board maintainer.
In the new approach the board files will be removed completely and the
Device Tree approach is used to describe all hardware blocks available
on the target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with operating systems other than the Linux kernel.
Reserved memory configuration belongs to a grey area. It might depend
on hardware restrictions of the board or modules and on low-level
configuration done by the bootloader. Putting reserved and contiguous
memory regions in the /memory node and having phandles to those regions
in the device nodes, however, matches well with the typical device tree
style of linking devices with other resources like clocks, interrupts,
regulators, power domains, etc. This is the main reason to use such an
approach instead of putting everything in the /chosen node as was
proposed in v2 and v3.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v4:
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to the /chosen/contiguous-memory/
node to avoid spreading Linux-specific parameters all over the device tree
definitions
- added support for autoconfigured regions (use zero base)
- fixed minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (4):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add function to scan fdt nodes given by path
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++++
arch/arm/mm/init.c | 3 +
drivers/base/dma-contiguous.c | 147 +++++++++++-----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 76 +++++++++++
drivers/of/of_reserved_mem.c | 175 ++++++++++++++++++++++++++
include/asm-generic/dma-coherent.h | 6 +
include/asm-generic/dma-contiguous.h | 2 -
include/linux/dma-contiguous.h | 49 +++++++-
include/linux/of_fdt.h | 3 +
11 files changed, 541 insertions(+), 79 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
--
1.7.9.5
On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz(a)linaro.org> wrote:
> On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark(a)gmail.com> wrote:
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace. This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>
> One concern I know the Android folks have expressed previously (and
> correct me if its no longer an objection), is that this attach time
> in-kernel constraint solving / moving or reallocating buffers is
> likely to hurt determinism. If I understood, their perspective was
> that userland knows the device path the buffers will travel through,
> so why not leverage that knowledge, rather than having the kernel have
> to sort it out for itself after the fact.
If you know the device path, then attach the buffer at all the devices
before you start using it. Problem solved.. kernel knows all devices
before pages need be allocated ;-)
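In kernel terms that is just the usual attach-then-map sequence, done for
each device on the known path before first use; a minimal sketch (error
unwinding and detach omitted):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

/* attach 'dev' to the buffer and map it; if every device on the path does
 * this before any of them uses the buffer, the exporter knows the full set
 * of constraints before it has to commit to backing pages */
static struct sg_table *attach_and_map(struct dma_buf *dmabuf,
				       struct device *dev)
{
	struct dma_buf_attachment *att = dma_buf_attach(dmabuf, dev);

	if (IS_ERR(att))
		return ERR_CAST(att);
	/* backing pages may be allocated as late as this first map */
	return dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
}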
BR,
-R
On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> Hi Rob,
>
>> >> > We may also then have additional constraints when sharing buffers
>> >> > between the display HW and video decode or even camera ISP HW.
>> >> > Programmatically describing buffer allocation constraints is very
>> >> > difficult and I'm not sure you can actually do it - there's some
>> >> > pretty complex constraints out there! E.g. I believe there's a
>> >> > platform where Y and UV planes of the reference frame need to be
>> >> > in separate DRAM banks for real-time 1080p decode, or something
>> >> > like that?
>> >>
>> >> yes, this was discussed. This is different from pitch/format/size
>> >> constraints.. it is really just a placement constraint (ie. where
>> >> do the physical pages go). IIRC the conclusion was to use a dummy
>> >> device with its own CMA pool for attaching the Y vs UV buffers.
>> >>
>> >> > Anyway, I guess my point is that even if we solve how to allocate
>> >> > buffers which will be shared between the GPU and display HW such
>> >> > that both sets of constraints are satisfied, that may not be the
>> >> > end of the story.
>> >> >
>> >>
>> >> that was part of the reason to punt this problem to userspace ;-)
>> >>
>> >> In practice, the kernel drivers doesn't usually know too much about
>> >> the dimensions/format/etc.. that is really userspace level
>> >> knowledge. There are a few exceptions when the kernel needs to know
>> >> how to setup GTT/etc for tiled buffers, but normally this sort of
>> >> information is up at the next level up (userspace, and
>> >> drm_framebuffer in case of scanout). Userspace media frameworks
>> >> like GStreamer already have a concept of format/caps negotiation.
>> >> For non-display<->gpu sharing, I think this is probably where this
>> >> sort of constraint negotiation should be handled.
>> >
>> > I agree that user-space will know which devices will access the
>> > buffer and thus can figure out at least a common pixel format.
>> > Though I'm not so sure userspace can figure out more low-level
>> > details like alignment and placement in physical memory, etc.
>> >
>>
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>
> Absolutely. Pixel format has to be negotiated by user-space as in
> most cases, user-space can map the buffer and thus will need to
> know how to interpret the data.
>
>
>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace.
>
> Seems sensible to me.
>
>
>> ... This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>> back, I had a patch that added a field or two to 'struct
>> device_dma_parameters' so that it could be known if a device required
>> contiguous buffers.. looks like that never got merged, so I'd need to
>> dig that back up and resend it. But the idea was to have the 'struct
>> device' encapsulate all the information that would be needed to
>> do-the-right-thing when it comes to placement.
>
> As I understand it, it's up to the exporting device to allocate the
> memory backing the dma_buf buffer. I guess the latest possible point
> you can allocate the backing pages is when map_dma_buf is first
> called? At that point the exporter can iterate over the current set
> of attachments, programmatically determine all the constraints of
> all the attached drivers and attempt to allocate the backing pages
> in such a way as to satisfy all those constraints?
yes, this is the idea.. possibly some room for some helpers to help
out with this, but that is all under the hood from userspace
perspective
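A rough sketch of what that attachment walk could look like inside an
exporter, using only what dma_parms expresses today (the per-segment size
limit); attachment-list locking, the allocation itself, and the
still-missing "max number of sg entries" field are left out:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* reduce the attached devices' dma_parms to a lowest common denominator;
 * an exporter could call this at first map, before picking backing pages */
static unsigned int foo_common_max_seg_size(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *att;
	unsigned int max_seg = UINT_MAX;

	list_for_each_entry(att, &dmabuf->attachments, node)
		max_seg = min(max_seg, dma_get_max_seg_size(att->dev));

	/* the smallest per-segment limit wins; a device that needs fully
	 * contiguous memory would additionally need the "max sg entries"
	 * field discussed in this thread, which dma_parms does not carry yet */
	return max_seg;
}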
> Didn't you say that programmatically describing device placement
> constraints was an unbounded problem? I guess we would have to
> accept that it's not possible to describe all possible constraints
> and instead find a way to describe the common ones?
well, the point I'm trying to make, is by dividing your constraints
into two groups, one that impacts and is handled by userspace, and one
that is in the kernel (ie. where the pages go), you cut down the
number of permutations that the kernel has to care about considerably.
And kernel already cares about, for example, what range of addresses
that a device can dma to/from. I think really the only thing missing
is the max # of sglist entries (contiguous or not)
> One problem with this is it duplicates a lot of logic in each
> driver which can export a dma_buf buffer. Each exporter will need to
> do pretty much the same thing: iterate over all the attachments,
> determine all the constraints (assuming that can be done) and
> allocate pages such that the lowest-common-denominator is satisfied.
>
> Perhaps rather than duplicating that logic in every driver, we could
> instead move allocation of the backing pages into dma_buf itself?
>
I tend to think it is better to add helpers as we see common patterns
emerge, which drivers can opt-in to using. I don't think that we
should move allocation into dma_buf itself, but it would perhaps be
useful to have dma_alloc_*() variants that could allocate for multiple
devices. That would help for simple stuff, although I'd suspect
eventually a GPU driver will move away from that. (Since you probably
want to play tricks w/ pools of pages that are pre-zero'd and in the
correct cache state, use spare cycles on the gpu or dma engine to
pre-zero uncached pages, and games like that.)
>
>> > Anyway, assuming user-space can figure out how a buffer should be
>> > stored in memory, how does it indicate this to a kernel driver and
>> > actually allocate it? Which ioctl on which device does user-space
>> > call, with what parameters? Are you suggesting using something like
>> > ION which exposes the low-level details of how buffers are laid out
>> > in
>> > physical memory to userspace? If not, what?
>>
>> no, userspace should not need to know this. And having a central
>> driver that knows this for all the other drivers in the system doesn't
>> really solve anything and isn't really scalable. At best you might
>> want, in some cases, a flag you can pass when allocating. For
>> example, some of the drivers have a 'SCANOUT' flag that can be passed
>> when allocating a GEM buffer, as a hint to the kernel that 'if this hw
>> requires contig memory for scanout, allocate this buffer contig'. But
>> really, when it comes to sharing buffers between devices, we want this
>> sort of information in dev->dma_params of the importing device(s).
>
> If you had a single driver which knew the constraints of all devices
> on that particular SoC and the interface allowed user-space to specify
> which devices a buffer is intended to be used with, I guess it could
> pretty trivially allocate pages which satisfy those constraints? It
keep in mind, even a number of SoCs come with PCIe these days. You
already have things like
https://developer.nvidia.com/content/kayla-platform
You probably want to get out of the SoC mindset, otherwise you are
going to make bad assumptions that come back to bite you later on.
> wouldn't need a way to programmatically describe the constraints
> either: As you say, if userspace sets the "SCANOUT" flag, it would
> just "know" that on this SoC, that buffer needs to be physically
> contiguous for example.
not really.. it just knows it wants to scanout the buffer, and tells
this as a hint to the kernel.
For example, on omapdrm, the SCANOUT flag does nothing on omap4+
(where phys contig is not required for scanout), but causes CMA
(dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just
knows that it wants to be able to scanout that particular buffer.
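i.e. something along these lines on the kernel side (hypothetical driver
and flag names, just to illustrate the hint-only semantics):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/types.h>

/* Hypothetical buffer-object flag and struct, for illustration only. */
#define MY_BO_SCANOUT	0x1

struct my_bo {
	void		*vaddr;
	dma_addr_t	 paddr;
	bool		 contig;
};

/*
 * The SCANOUT flag is only a hint: on hardware whose scanout engine sits
 * behind an IOMMU it changes nothing, otherwise the backing memory is
 * taken from the CMA/coherent pool so it can actually be scanned out.
 */
static int my_bo_alloc_backing(struct device *dev, struct my_bo *bo,
			       size_t size, u32 flags, bool has_iommu)
{
	if ((flags & MY_BO_SCANOUT) && !has_iommu) {
		bo->vaddr = dma_alloc_coherent(dev, size, &bo->paddr,
					       GFP_KERNEL);
		if (!bo->vaddr)
			return -ENOMEM;
		bo->contig = true;
		return 0;
	}

	/* Non-contiguous path (e.g. shmem-backed pages) would go here. */
	bo->contig = false;
	return 0;
}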
> Though it would effectively mean you'd need an "allocation" driver per
> SoC, which as you say may not be scalable?
Right.. and not actually even possible in the general sense (see SoC +
external pcie gfx card)
BR,
-R
>
>
> Cheers,
>
> Tom
>
>
>
>
>
On Tuesday, 06.08.2013 at 12:31 +0100, Tom Cooksey wrote:
> Hi Rob,
>
> +lkml
>
> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
> > >> wrote:
> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to
> > >> >> > also allocate buffers for the GPU. Still not sure how to
> > >> >> > resolve this as we don't use DRM for our GPU driver.
> > >> >>
> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term
> > >> >> (esp. once the dma-fence stuff is in place), we'd have
> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display
> > >> >> specific drm/kms driver, using prime/dmabuf to share between
> > >> >> the two.
> > >> >
> > >> > The "extra" buffers we were allocating from armsoc DDX were really
> > >> > being allocated through DRM/GEM so we could get a flink name
> > >> > for them and pass a reference to them back to our GPU driver on
> > >> > the client side. If it weren't for our need to access those
> > >> > extra off-screen buffers with the GPU we wouldn't need to
> > >> > allocate them with DRM at all. So, given they are really "GPU"
> > >> > buffers, it does absolutely make sense to allocate them in a
> > >> > different driver to the display driver.
> > >> >
> > >> > However, to avoid unnecessary memcpys & related cache
> > >> > maintenance ops, we'd also like the GPU to render into buffers
> > >> > which are scanned out by the display controller. So let's say
> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> > >> > out buffers with the display's DRM driver but a custom ioctl
> > >> > on the GPU's DRM driver to allocate non scanout, off-screen
> > >> > buffers. Sounds great, but I don't think that really works
> > >> > with DRI2. If we used two drivers to allocate buffers, which
> > >> > of those drivers do we return in DRI2ConnectReply? Even if we
> > >> > solve that somehow, GEM flink names are name-spaced to a
> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> > >> > how does the EGL in the client know which DRM device owns GEM
> > >> > flink name "1234"? We'd need some pretty dirty hacks.
> > >>
> > >> You would return the name of the display driver allocating the
> > >> buffers. On the client side you can use generic ioctls to go from
> > >> flink -> handle -> dmabuf. So the client side would end up opening
> > >> both the display drm device and the gpu, but without needing to know
> > >> too much about the display.
> > >
> > > I think the bit I was missing was that a GEM bo for a buffer imported
> > > using dma_buf/PRIME can still be flink'd. So the display controller's
> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate
> > > ioctl. Those scan-out buffers can then be exported from the
> > > display's DRM driver and imported into the GPU's DRM driver using
> > > PRIME. Once imported into the GPU's driver, we can use flink to get a
> > > name for that buffer within the GPU DRM driver's name-space to return
> > > to the DRI2 client. That same namespace is also what DRI2 back-
> > > buffers are allocated from, so I think that could work... Except...
> >
> > (and.. the general direction is that things will move more to just use
> > dmabuf directly, ie. wayland or dri3)
>
> I agree, DRI2 is the only reason why we need a system-wide ID. I also
> prefer buffers to be passed around by dma_buf fd, but we still need to
> support DRI2 and will do for some time I expect.
>
>
>
> > >> > Anyway, that latter case also gets quite difficult. The "GPU"
> > >> > DRM driver would need to know the constraints of the display
> > >> > controller when allocating buffers intended to be scanned out.
> > >> > For example, pl111 typically isn't behind an IOMMU and so
> > >> > requires physically contiguous memory. We'd have to teach the
> > >> > GPU's DRM driver about the constraints of the display HW. Not
> > >> > exactly a clean driver model. :-(
> > >> >
> > >> > I'm still a little stuck on how to proceed, so any ideas
> > >> > would be greatly appreciated! My current train of thought is
> > >> > having a kind of SoC-specific DRM driver which allocates
> > >> > buffers for both display and GPU within a single GEM
> > >> > namespace. That SoC-specific DRM driver could then know the
> > >> > constraints of both the GPU and the display HW. We could then
> > >> > use PRIME to export buffers allocated with the SoC DRM driver
> > >> > and import them into the GPU and/or display DRM driver.
> > >>
> > >> Usually if the display drm driver is allocating the buffers that
> > >> might be scanned out, it just needs to have minimal knowledge of
> > >> the GPU (pitch alignment constraints). I don't think we need a
> > >> 3rd device just to allocate buffers.
> > >
> > > While Mali can render to pretty much any buffer, there is a mild
> > > performance improvement to be had if the buffer stride is aligned to
> > > the AXI bus's max burst length when drawing to the buffer.
> >
> > I suspect the display controllers might frequently benefit if the
> > pitch is aligned to AXI burst length too..
>
> If the display controller is going to be reading from linear memory
> I don't think it will make much difference - you'll just get an extra
> 1-2 bus transactions per scanline. With a tile-based GPU like Mali,
> you get those extra transactions per _tile_ scan-line and as such,
> the overhead is more pronounced.
>
>
>
> > > So in some respects, there is a constraint on how buffers which will
> > > be drawn to using the GPU are allocated. I don't really like the idea
> > > of teaching the display controller DRM driver about the GPU buffer
> > > constraints, even if they are fairly trivial like this. If the same
> > > display HW IP is being used on several SoCs, it seems wrong somehow
> > > to enforce those GPU constraints if some of those SoCs don't have a
> > > GPU.
> >
> > Well, I suppose you could get min_pitch_alignment from devicetree, or
> > something like this..
> >
> > In the end, the easy solution is just to make the display allocate to
> > the worst-case pitch alignment. In the early days of dma-buf
> > discussions, we kicked around the idea of negotiating or
> > programmatically describing the constraints, but that didn't really
> > seem like a bounded problem.
>
> Yeah - I was around for some of those discussions and agree it's not
> really an easy problem to solve.
>
>
>
> > > We may also then have additional constraints when sharing buffers
> > > between the display HW and video decode or even camera ISP HW.
> > > Programmatically describing buffer allocation constraints is very
> > > difficult and I'm not sure you can actually do it - there's some
> > > pretty complex constraints out there! E.g. I believe there's a
> > > platform where Y and UV planes of the reference frame need to be in
> > > separate DRAM banks for real-time 1080p decode, or something like
> > > that?
> >
> > yes, this was discussed. This is different from pitch/format/size
> > constraints.. it is really just a placement constraint (ie. where do
> > the physical pages go). IIRC the conclusion was to use a dummy
> > device with its own CMA pool for attaching the Y vs UV buffers.
> >
> > > Anyway, I guess my point is that even if we solve how to allocate
> > > buffers which will be shared between the GPU and display HW such that
> > > both sets of constraints are satisfied, that may not be the end of
> > > the story.
> > >
> >
> > that was part of the reason to punt this problem to userspace ;-)
> >
> > In practice, the kernel drivers don't usually know too much about
> > the dimensions/format/etc.. that is really userspace level knowledge.
> > There are a few exceptions when the kernel needs to know how to setup
> > GTT/etc for tiled buffers, but normally this sort of information is
> > at the next level up (userspace, and drm_framebuffer in case of
> > scanout). Userspace media frameworks like GStreamer already have a
> > concept of format/caps negotiation. For non-display<->gpu sharing, I
> > think this is probably where this sort of constraint negotiation
> > should be handled.
>
> I agree that user-space will know which devices will access the buffer
> and thus can figure out at least a common pixel format. Though I'm not
> so sure userspace can figure out more low-level details like alignment
> and placement in physical memory, etc.
>
> Anyway, assuming user-space can figure out how a buffer should be
> stored in memory, how does it indicate this to a kernel driver and
> actually allocate it? Which ioctl on which device does user-space
> call, with what parameters? Are you suggesting using something like
> ION which exposes the low-level details of how buffers are laid out in
> physical memory to userspace? If not, what?
>
I strongly disagree with exposing low-level hardware details like tiling
to userspace. If we have to do the negotiation of those things in
userspace we will end up having to pipe that information through
things like the wayland protocol. I don't see how this could ever be
considered a good idea.
I would rather see kernel drivers negotiating those things at dmabuf
attach time in a way invisible to userspace. I agree that this negotiation
thing isn't easy to get right for the plethora of different hardware
constraints we see today, but I would rather see this in-kernel, where
we have the chance to fix things up if needed, than in a fixed userspace
interface.
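As a very rough sketch of what I mean (hypothetical exporter code,
locking and error handling mostly omitted): the exporter defers the
actual page allocation until the first map and, at that point, walks the
attached devices to find the common limits.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* Hypothetical exporter-private buffer, for illustration only. */
struct my_export_buf {
	struct dma_buf	*dmabuf;
	unsigned int	 common_max_seg;
	bool		 pages_allocated;
};

/*
 * Called from the exporter's map_dma_buf() before the first allocation:
 * intersect the per-device limits of everyone who attached so far.
 * (dmabuf->lock must be held to walk the attachment list.)
 */
static void my_export_buf_negotiate(struct my_export_buf *buf)
{
	struct dma_buf_attachment *att;
	unsigned int seg = UINT_MAX;

	list_for_each_entry(att, &buf->dmabuf->attachments, node)
		seg = min(seg, dma_get_max_seg_size(att->dev));

	buf->common_max_seg = seg;	/* honour this when allocating pages */
}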
Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
>
>> > So in some respects, there is a constraint on how buffers which will
>> > be drawn to using the GPU are allocated. I don't really like the idea
>> > of teaching the display controller DRM driver about the GPU buffer
>> > constraints, even if they are fairly trivial like this. If the same
>> > display HW IP is being used on several SoCs, it seems wrong somehow
>> > to enforce those GPU constraints if some of those SoCs don't have a
>> > GPU.
>>
>> Well, I suppose you could get min_pitch_alignment from devicetree, or
>> something like this..
>>
>> In the end, the easy solution is just to make the display allocate to
>> the worst-case pitch alignment. In the early days of dma-buf
>> discussions, we kicked around the idea of negotiating or
>> programmatically describing the constraints, but that didn't really
>> seem like a bounded problem.
>
> Yeah - I was around for some of those discussions and agree it's not
> really an easy problem to solve.
>
>
>
>> > We may also then have additional constraints when sharing buffers
>> > between the display HW and video decode or even camera ISP HW.
>> > Programmatically describing buffer allocation constraints is very
>> > difficult and I'm not sure you can actually do it - there's some
>> > pretty complex constraints out there! E.g. I believe there's a
>> > platform where Y and UV planes of the reference frame need to be in
>> > separate DRAM banks for real-time 1080p decode, or something like
>> > that?
>>
>> yes, this was discussed. This is different from pitch/format/size
>> constraints.. it is really just a placement constraint (ie. where do
>> the physical pages go). IIRC the conclusion was to use a dummy
>> device with its own CMA pool for attaching the Y vs UV buffers.
>>
>> > Anyway, I guess my point is that even if we solve how to allocate
>> > buffers which will be shared between the GPU and display HW such that
>> > both sets of constraints are satisfied, that may not be the end of
>> > the story.
>> >
>>
>> that was part of the reason to punt this problem to userspace ;-)
>>
>> In practice, the kernel drivers don't usually know too much about
>> the dimensions/format/etc.. that is really userspace level knowledge.
>> There are a few exceptions when the kernel needs to know how to setup
>> GTT/etc for tiled buffers, but normally this sort of information is
>> at the next level up (userspace, and drm_framebuffer in case of
>> scanout). Userspace media frameworks like GStreamer already have a
>> concept of format/caps negotiation. For non-display<->gpu sharing, I
>> think this is probably where this sort of constraint negotiation
>> should be handled.
>
> I agree that user-space will know which devices will access the buffer
> and thus can figure out at least a common pixel format. Though I'm not
> so sure userspace can figure out more low-level details like alignment
> and placement in physical memory, etc.
well, let's divide things up into two categories:
1) the arrangement and format of pixels.. ie. what userspace would
need to know if it mmap's a buffer. This includes pixel format,
stride, etc. This should be negotiated in userspace; it would be
crazy to try to do this in the kernel.
2) the physical placement of the pages. Ie. whether it is contiguous
or not. Which bank the pages in the buffer are placed in, etc. This
is not visible to userspace. This is the purpose of the attach step,
so you know all the devices involved in sharing up front before
allocating the backing pages. (Or in the worst case, if you have a
"late attacher" you at least know when no device is doing dma access
to a buffer and can reallocate and move the buffer.) A long time
back, I had a patch that added a field or two to 'struct
device_dma_parameters' so that it could be known if a device required
contiguous buffers.. looks like that never got merged, so I'd need to
dig that back up and resend it. But the idea was to have the 'struct
device' encapsulate all the information that would be needed to
do-the-right-thing when it comes to placement.
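(purely for illustration, roughly what that unmerged patch was aiming
at; the "requires contiguous" flag and the helpers below are
hypothetical, nothing like this is in mainline:)

#include <linux/device.h>
#include <linux/dma-buf.h>

/*
 * Hypothetical: a flag in dev->dma_parms saying "this device needs
 * physically contiguous buffers".  The helper below is a stub standing
 * in for reading that flag.
 */
static bool device_requires_contig(struct device *dev)
{
	return false;	/* would look at a new dev->dma_parms field */
}

/* An exporter could then decide on placement before allocating pages: */
static bool dmabuf_needs_contig_pages(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *att;

	list_for_each_entry(att, &dmabuf->attachments, node)
		if (device_requires_contig(att->dev))
			return true;

	return false;
}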
> Anyway, assuming user-space can figure out how a buffer should be
> stored in memory, how does it indicate this to a kernel driver and
> actually allocate it? Which ioctl on which device does user-space
> call, with what parameters? Are you suggesting using something like
> ION which exposes the low-level details of how buffers are laid out in
> physical memory to userspace? If not, what?
no, userspace should not need to know this. And having a central
driver that knows this for all the other drivers in the system doesn't
really solve anything and isn't really scalable. At best you might
want, in some cases, a flag you can pass when allocating. For
example, some of the drivers have a 'SCANOUT' flag that can be passed
when allocating a GEM buffer, as a hint to the kernel that 'if this hw
requires contig memory for scanout, allocate this buffer contig'. But
really, when it comes to sharing buffers between devices, we want this
sort of information in dev->dma_params of the importing device(s).
BR,
-R
On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey(a)arm.com> wrote:
> Hi Rob,
>
> +linux-media, +linaro-mm-sig for discussion of video/camera
> buffer constraints...
>
>
>> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
>> wrote:
>> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
>> >> > allocate buffers for the GPU. Still not sure how to resolve
>> >> > this as we don't use DRM for our GPU driver.
>> >>
>> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp.
>> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
>> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
>> >> using prime/dmabuf to share between the two.
>> >
>> > The "extra" buffers we were allocating from armsoc DDX were really
>> > being allocated through DRM/GEM so we could get a flink name
>> > for them and pass a reference to them back to our GPU driver on
>> > the client side. If it weren't for our need to access those
>> > extra off-screen buffers with the GPU we wouldn't need to
>> > allocate them with DRM at all. So, given they are really "GPU"
>> > buffers, it does absolutely make sense to allocate them in a
>> > different driver to the display driver.
>> >
>> > However, to avoid unnecessary memcpys & related cache
>> > maintenance ops, we'd also like the GPU to render into buffers
>> > which are scanned out by the display controller. So let's say
>> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
>> > out buffers with the display's DRM driver but a custom ioctl
>> > on the GPU's DRM driver to allocate non scanout, off-screen
>> > buffers. Sounds great, but I don't think that really works
>> > with DRI2. If we used two drivers to allocate buffers, which
>> > of those drivers do we return in DRI2ConnectReply? Even if we
>> > solve that somehow, GEM flink names are name-spaced to a
>> > single device node (AFAIK). So when we do a DRI2GetBuffers,
>> > how does the EGL in the client know which DRM device owns GEM
>> > flink name "1234"? We'd need some pretty dirty hacks.
>>
>> You would return the name of the display driver allocating the
>> buffers. On the client side you can use generic ioctls to go from
>> flink -> handle -> dmabuf. So the client side would end up opening
>> both the display drm device and the gpu, but without needing to know
>> too much about the display.
>
> I think the bit I was missing was that a GEM bo for a buffer imported
> using dma_buf/PRIME can still be flink'd. So the display controller's
> DRM driver allocates scan-out buffers via the DUMB buffer allocate
> ioctl. Those scan-out buffers can then be exported from the
> display's DRM driver and imported into the GPU's DRM driver using
> PRIME. Once imported into the GPU's driver, we can use flink to get a
> name for that buffer within the GPU DRM driver's name-space to return
> to the DRI2 client. That same namespace is also what DRI2 back-buffers
> are allocated from, so I think that could work... Except...
>
(and.. the general direction is that things will move more to just use
dmabuf directly, ie. wayland or dri3)
>
>> > Anyway, that latter case also gets quite difficult. The "GPU"
>> > DRM driver would need to know the constraints of the display
>> > controller when allocating buffers intended to be scanned out.
>> > For example, pl111 typically isn't behind an IOMMU and so
>> > requires physically contiguous memory. We'd have to teach the
>> > GPU's DRM driver about the constraints of the display HW. Not
>> > exactly a clean driver model. :-(
>> >
>> > I'm still a little stuck on how to proceed, so any ideas
>> > would be greatly appreciated! My current train of thought is
>> > having a kind of SoC-specific DRM driver which allocates
>> > buffers for both display and GPU within a single GEM
>> > namespace. That SoC-specific DRM driver could then know the
>> > constraints of both the GPU and the display HW. We could then
>> > use PRIME to export buffers allocated with the SoC DRM driver
>> > and import them into the GPU and/or display DRM driver.
>>
>> Usually if the display drm driver is allocating the buffers that might
>> be scanned out, it just needs to have minimal knowledge of the GPU
>> (pitch alignment constraints). I don't think we need a 3rd device
>> just to allocate buffers.
>
> While Mali can render to pretty much any buffer, there is a mild
> performance improvement to be had if the buffer stride is aligned to
> the AXI bus's max burst length when drawing to the buffer.
I suspect the display controllers might frequently benefit if the
pitch is aligned to AXI burst length too..
> So in some respects, there is a constraint on how buffers which will
> be drawn to using the GPU are allocated. I don't really like the idea
> of teaching the display controller DRM driver about the GPU buffer
> constraints, even if they are fairly trivial like this. If the same
> display HW IP is being used on several SoCs, it seems wrong somehow
> to enforce those GPU constraints if some of those SoCs don't have a
> GPU.
Well, I suppose you could get min_pitch_alignment from devicetree, or
something like this..
In the end, the easy solution is just to make the display allocate to
the worst-case pitch alignment. In the early days of dma-buf
discussions, we kicked around the idea of negotiating or
programmatically describing the constraints, but that didn't really
seem like a bounded problem.
> We may also then have additional constraints when sharing buffers
> between the display HW and video decode or even camera ISP HW.
> Programmatically describing buffer allocation constraints is very
> difficult and I'm not sure you can actually do it - there's some
> pretty complex constraints out there! E.g. I believe there's a
> platform where Y and UV planes of the reference frame need to be in
> separate DRAM banks for real-time 1080p decode, or something like
> that?
yes, this was discussed. This is different from pitch/format/size
constraints.. it is really just a placement constraint (ie. where do
the physical pages go). IIRC the conclusion was to use a dummy
device with its own CMA pool for attaching the Y vs UV buffers.
> Anyway, I guess my point is that even if we solve how to allocate
> buffers which will be shared between the GPU and display HW such that
> both sets of constraints are satisfied, that may not be the end of
> the story.
>
that was part of the reason to punt this problem to userspace ;-)
In practice, the kernel drivers don't usually know too much about
the dimensions/format/etc.. that is really userspace level knowledge.
There are a few exceptions when the kernel needs to know how to setup
GTT/etc for tiled buffers, but normally this sort of information is
at the next level up (userspace, and drm_framebuffer in case of
scanout). Userspace media frameworks like GStreamer already have a
concept of format/caps negotiation. For non-display<->gpu sharing, I
think this is probably where this sort of constraint negotiation
should be handled.
BR,
-R
>
> Cheers,
>
> Tom
>
>
>
>
>
Hi Rob,
+linux-media, +linaro-mm-sig for discussion of video/camera
buffer constraints...
> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey(a)arm.com>
> wrote:
> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
> >> > allocate buffers for the GPU. Still not sure how to resolve
> >> > this as we don't use DRM for our GPU driver.
> >>
> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp.
> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
> >> using prime/dmabuf to share between the two.
> >
> > The "extra" buffers we were allocating from armsoc DDX were really
> > being allocated through DRM/GEM so we could get a flink name
> > for them and pass a reference to them back to our GPU driver on
> > the client side. If it weren't for our need to access those
> > extra off-screen buffers with the GPU we wouldn't need to
> > allocate them with DRM at all. So, given they are really "GPU"
> > buffers, it does absolutely make sense to allocate them in a
> > different driver to the display driver.
> >
> > However, to avoid unnecessary memcpys & related cache
> > maintenance ops, we'd also like the GPU to render into buffers
> > which are scanned out by the display controller. So let's say
> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> > out buffers with the display's DRM driver but a custom ioctl
> > on the GPU's DRM driver to allocate non scanout, off-screen
> > buffers. Sounds great, but I don't think that really works
> > with DRI2. If we used two drivers to allocate buffers, which
> > of those drivers do we return in DRI2ConnectReply? Even if we
> > solve that somehow, GEM flink names are name-spaced to a
> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> > how does the EGL in the client know which DRM device owns GEM
> > flink name "1234"? We'd need some pretty dirty hacks.
>
> You would return the name of the display driver allocating the
> buffers. On the client side you can use generic ioctls to go from
> flink -> handle -> dmabuf. So the client side would end up opening
> both the display drm device and the gpu, but without needing to know
> too much about the display.
I think the bit I was missing was that a GEM bo for a buffer imported
using dma_buf/PRIME can still be flink'd. So the display controller's
DRM driver allocates scan-out buffers via the DUMB buffer allocate
ioctl. Those scan-out buffers can then be exported from the
display's DRM driver and imported into the GPU's DRM driver using
PRIME. Once imported into the GPU's driver, we can use flink to get a
name for that buffer within the GPU DRM driver's name-space to return
to the DRI2 client. That same namespace is also what DRI2 back-buffers
are allocated from, so I think that could work... Except...
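In userspace that flow would look roughly like this (a sketch using the
generic libdrm calls; error handling trimmed and the prime fd should be
closed once the GPU handle exists):

#include <fcntl.h>
#include <stdint.h>
#include <xf86drm.h>

/*
 * Sketch of the flow described above: allocate a dumb buffer on the
 * display device, export it as a dma-buf fd with PRIME, import it into
 * the GPU device, then flink the imported handle so a name can be
 * returned to the DRI2 client.
 */
static int share_scanout_buffer(int display_fd, int gpu_fd,
				uint32_t width, uint32_t height,
				uint32_t *out_flink_name)
{
	struct drm_mode_create_dumb create = {
		.width = width, .height = height, .bpp = 32,
	};
	struct drm_gem_flink flink = { 0 };
	uint32_t gpu_handle;
	int prime_fd, ret;

	ret = drmIoctl(display_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create);
	if (ret)
		return ret;

	ret = drmPrimeHandleToFD(display_fd, create.handle,
				 DRM_CLOEXEC, &prime_fd);
	if (ret)
		return ret;

	ret = drmPrimeFDToHandle(gpu_fd, prime_fd, &gpu_handle);
	if (ret)
		return ret;

	flink.handle = gpu_handle;
	ret = drmIoctl(gpu_fd, DRM_IOCTL_GEM_FLINK, &flink);
	if (ret)
		return ret;

	*out_flink_name = flink.name;	/* hand this back via DRI2 */
	return 0;
}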
> > Anyway, that latter case also gets quite difficult. The "GPU"
> > DRM driver would need to know the constraints of the display
> > controller when allocating buffers intended to be scanned out.
> > For example, pl111 typically isn't behind an IOMMU and so
> > requires physically contiguous memory. We'd have to teach the
> > GPU's DRM driver about the constraints of the display HW. Not
> > exactly a clean driver model. :-(
> >
> > I'm still a little stuck on how to proceed, so any ideas
> > would be greatly appreciated! My current train of thought is
> > having a kind of SoC-specific DRM driver which allocates
> > buffers for both display and GPU within a single GEM
> > namespace. That SoC-specific DRM driver could then know the
> > constraints of both the GPU and the display HW. We could then
> > use PRIME to export buffers allocated with the SoC DRM driver
> > and import them into the GPU and/or display DRM driver.
>
> Usually if the display drm driver is allocating the buffers that might
> be scanned out, it just needs to have minimal knowledge of the GPU
> (pitch alignment constraints). I don't think we need a 3rd device
> just to allocate buffers.
While Mali can render to pretty much any buffer, there is a mild
performance improvement to be had if the buffer stride is aligned to
the AXI bus's max burst length when drawing to the buffer.
So in some respects, there is a constraint on how buffers which will
be drawn to using the GPU are allocated. I don't really like the idea
of teaching the display controller DRM driver about the GPU buffer
constraints, even if they are fairly trivial like this. If the same
display HW IP is being used on several SoCs, it seems wrong somehow
to enforce those GPU constraints if some of those SoCs don't have a
GPU.
We may also then have additional constraints when sharing buffers
between the display HW and video decode or even camera ISP HW.
Programmatically describing buffer allocation constraints is very
difficult and I'm not sure you can actually do it - there's some
pretty complex constraints out there! E.g. I believe there's a
platform where Y and UV planes of the reference frame need to be in
separate DRAM banks for real-time 1080p decode, or something like
that?
Anyway, I guess my point is that even if we solve how to allocate
buffers which will be shared between the GPU and display HW such that
both sets of constraints are satisfied, that may not be the end of
the story.
Cheers,
Tom
Hello,
This is the fourth version of my proposal for device tree integration of
reserved memory and the Contiguous Memory Allocator. After the comments
from Grant Likely I moved the memory region definitions back to the
/memory node (as it was in the first version of this proposal). I've also
extended the code and made it more generic, and added support for
so-called reserved dma memory (special dma memory regions reserved for
exclusive use by dma_alloc_coherent() allocations for the given device).
Just a few words for those who see this code for the first time:
The proposed bindings allow defining contiguous memory regions of a
specified base address and size. The defined regions can then be
assigned to the given device(s) by adding a property with a phandle to
the defined contiguous memory region. From the device tree perspective
that's all. Once the bindings are added, all memory allocations from the
dma-mapping subsystem will be served from the defined contiguous memory
regions.
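From the driver's point of view nothing changes; a minimal sketch of a
client driver (hypothetical probe function) is just:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

/*
 * Sketch: once a contiguous region has been assigned to the device via
 * the proposed bindings, a plain dma_alloc_coherent() is served from
 * that region; no new driver-side API is involved.
 */
static int my_probe(struct platform_device *pdev)
{
	dma_addr_t dma;
	void *vaddr;

	vaddr = dma_alloc_coherent(&pdev->dev, SZ_4M, &dma, GFP_KERNEL);
	if (!vaddr)
		return -ENOMEM;

	/* ... program 'dma' into the hardware ... */

	dma_free_coherent(&pdev->dev, SZ_4M, vaddr, dma);
	return 0;
}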
The Contiguous Memory Allocator is a framework which provides large
contiguous memory buffers for (usually multimedia) devices. The
contiguous memory is reserved during early boot and then shared with the
kernel, which is allowed to allocate it for movable pages. Then, when a
device driver requests a contiguous buffer, the framework migrates
movable pages out of the contiguous region and gives it to the driver.
When the device driver frees the buffer, it is added to the kernel memory
pool again.
For more information, please refer to commit c64be2bb1c6eb43c838b2c6d57
("drivers: add Contiguous Memory Allocator") and d484864dd96e1830e76895
(the CMA merge commit).
Why do we need device tree bindings for CMA at all?
Older ARM kernels used so-called board-based initialization. Those board
files contained a definition of all hardware blocks available on the
target system and the particular kernel and driver software configuration
selected by the board maintainer.
In the new approach the board files will be removed completely and the
Device Tree approach is used to describe all hardware blocks available
on the target system. By definition, the bindings should be software
independent, so at least in theory it should be possible to use those
bindings with operating systems other than the Linux kernel.
Reserved memory configuration belongs to a grey area. It might depend
on hardware restrictions of the board or modules and on low-level
configuration done by the bootloader. However, putting reserved and
contiguous memory regions in the /memory node and having phandles to
those regions in the device nodes matches well with the typical
device-tree style of linking devices with other resources like clocks,
interrupts, regulators, power domains, etc. This is the main reason to
use such an approach instead of putting everything in the /chosen node
as was proposed in v2 and v3.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v4:
- corrected Device Tree mailing list address (resend)
- moved back contiguous-memory bindings from /chosen/contiguous-memory
to /memory nodes as suggested by Grant (see
http://article.gmane.org/gmane.linux.drivers.devicetree/41030
for more details)
- added support for DMA reserved memory with dma_declare_coherent()
- moved code to drivers/of/of_reserved_mem.c
- added generic code to scan specific path in flat device tree
v3: http://thread.gmane.org/gmane.linux.drivers.devicetree/40013/
- fixed issues pointed by Laura and updated documentation
v2: http://thread.gmane.org/gmane.linux.drivers.devicetree/34075
- moved contiguous-memory bindings from /memory to /chosen/contiguous-memory/
node to avoid spreading Linux specific parameters over the whole device
tree definitions
- added support for autoconfigured regions (use zero base)
- fixed minor bugs
v1: http://thread.gmane.org/gmane.linux.drivers.devicetree/30111/
- initial proposal
Patch summary:
Marek Szyprowski (4):
drivers: dma-contiguous: clean source code and prepare for device
tree
drivers: of: add function to scan fdt nodes given by path
drivers: of: add initialization code for dma reserved memory
ARM: init: add support for reserved memory defined by device tree
Documentation/devicetree/bindings/memory.txt | 152 ++++++++++++++++++++++
arch/arm/mm/init.c | 3 +
drivers/base/dma-contiguous.c | 147 +++++++++++-----------
drivers/of/Kconfig | 6 +
drivers/of/Makefile | 1 +
drivers/of/fdt.c | 76 +++++++++++
drivers/of/of_reserved_mem.c | 175 ++++++++++++++++++++++++++
include/asm-generic/dma-coherent.h | 6 +
include/asm-generic/dma-contiguous.h | 2 -
include/linux/dma-contiguous.h | 49 +++++++-
include/linux/of_fdt.h | 3 +
11 files changed, 541 insertions(+), 79 deletions(-)
create mode 100644 Documentation/devicetree/bindings/memory.txt
create mode 100644 drivers/of/of_reserved_mem.c
--
1.7.9.5
From: "Luis R. Rodriguez" <mcgrof(a)do-not-panic.com>
This backports the kernel's wound/wait style locks (commit 040a0a371),
using linux-stable v3.11-rc2 as a base for development.
Given the complexity of supporting debugging mutexes, this backport
implementation is simplified by only making this feature available
if you have DEBUG_MUTEXES and DEBUG_LOCK_ALLOC disabled.
Given that ww mutex is required for DRM, this also means we must
update the kconfig for DRM so that you cannot build
DRM if you have either of these options enabled. Support for
DEBUG_MUTEXES and DEBUG_LOCK_ALLOC can be added later by anyone
daring. This uses the new dependencies file kconfig language
extension to specify the backport feature build restrictions
for DRM.
Part of the ww mutex addition to the kernel required modifying
the fast path mutex locking scheme by requiring you to deal
with the slow path alternatives on your own (refer to a41b56ef).
The reason for this change was that the mutex fastpath implementation
assumed the slowpath alternative could only be passed one argument,
while the addition of ww mutexes requires passing a context to the
slow path.
It'd be painful to backport all the asm for an optimized fastpath
implementation, so we penalize the backport ww mutex fast path
by using the generic atomic_dec_return().
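In other words the backported fastpath ends up with roughly this generic
shape (a sketch of the approach, not the literal backport code):

#include <linux/atomic.h>
#include <linux/compiler.h>

/*
 * Generic, non-asm fastpath: if the count drops below zero the lock is
 * contended and the caller must take the slowpath, which can now be
 * handed the ww acquire context directly.
 */
static inline int backport_mutex_fastpath_lock_retval(atomic_t *count)
{
	if (unlikely(atomic_dec_return(count) < 0))
		return -1;	/* caller runs the slowpath */
	return 0;
}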
To backport a clean mutex_lock_common() of our own with the least
amount of changes against upstream, commits 2bd2c92c and 41fcb9f2
also needed to be backported. Commit 2bd2c92c dealt with adding
support for queueing mutex spinners with an MCS lock; since this
cannot be backported for older kernels we provide empty inlines.
Commit 41fcb9f2 just removed SCHED_FEAT_OWNER_SPIN as it was an
early hack; the only thing required to backport this commit was
to provide an alternative declaration for mutex_spin_on_owner(),
as it was declared non-inline for older kernels.
Finally, c5491ea7 required backporting schedule_preempt_disabled()
as well, but that just consisted of carrying over the original
implementation. Since it's not exported, we need to reimplement
it to make it available to our internal core ww mutex port.
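The reimplementation is essentially just the following (a sketch; the
upstream helper uses the sched-internal no-resched variant):

#include <linux/preempt.h>
#include <linux/sched.h>

/*
 * Carried-over schedule_preempt_disabled(): re-enable preemption
 * without rescheduling, sleep, then disable preemption again before
 * returning to the caller.
 */
static void backport_schedule_preempt_disabled(void)
{
	preempt_enable_no_resched();
	schedule();
	preempt_disable();
}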
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 040a0a371
v3.11-rc1~147^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains a41b56ef
v3.11-rc1~147^2~6
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 2bd2c92c
v3.10-rc1~200^2~3
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 41fcb9f2
v3.10-rc1~200^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains c5491ea7
v3.4-rc1~3^2~27
commit 040a0a37100563754bb1fee6ff6427420bcfa609
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Mon Jun 24 10:30:04 2013 +0200
mutex: Add support for wound/wait style locks
Wound/wait mutexes are used when multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older task waits until it can acquire
the contended lock. The younger task needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.
For full documentation please read Documentation/ww-mutex-design.txt.
References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Rob Clark <robdclark(a)gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
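For reference, a minimal usage sketch of the wait/wound API added by the
commit above (the two-object 'struct obj' is hypothetical):

#include <linux/kernel.h>	/* swap() */
#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(demo_ww_class);

struct obj {
	struct ww_mutex lock;
	/* data protected by @lock */
};

/*
 * Grab two locks in arbitrary order and back off with the _slow
 * variant whenever we are wounded (-EDEADLK).  At "retry:" we always
 * hold a->lock and not b->lock, so -EALREADY cannot occur.
 */
static void lock_both(struct obj *a, struct obj *b)
{
	struct ww_acquire_ctx ctx;
	int ret;

	ww_acquire_init(&ctx, &demo_ww_class);

	/* The first lock in a fresh context only ever waits. */
	ww_mutex_lock(&a->lock, &ctx);
retry:
	ret = ww_mutex_lock(&b->lock, &ctx);
	if (ret == -EDEADLK) {
		/* Wounded: drop what we hold, wait for the winner, retry. */
		ww_mutex_unlock(&a->lock);
		ww_mutex_lock_slow(&b->lock, &ctx);
		swap(a, b);
		goto retry;
	}
	ww_acquire_done(&ctx);

	/* ... use both objects ... */

	ww_mutex_unlock(&a->lock);
	ww_mutex_unlock(&b->lock);
	ww_acquire_fini(&ctx);
}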
commit a41b56efa70e060f650aeb54740aaf52044a1ead
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Thu Jun 20 13:31:05 2013 +0200
arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.
Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.
This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: robclark(a)gmail.com
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:13 2013 -0400
mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
<-- snip -->
commit 41fcb9f230bf773656d1768b73000ef720bf00c3
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:11 2013 -0400
mutex: Move mutex spinning code from sched/core.c back to mutex.c
<-- snip -->
commit c5491ea779793f977d282754db478157cc409d82
Author: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon Mar 21 12:09:35 2011 +0100
sched/rt: Add schedule_preempt_disabled()
<-- snip -->
Cc: maarten.lankhorst(a)canonical.com
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Rob Clark <robdclark(a)gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
---
backport/backport-include/linux/ww_mutex.h | 333 ++++++++++++++
backport/compat/Kconfig | 11 +
backport/compat/Makefile | 1 +
backport/compat/kernel/ww_mutex.c | 667 ++++++++++++++++++++++++++++
dependencies | 6 +
5 files changed, 1018 insertions(+)
create mode 100644 backport/backport-include/linux/ww_mutex.h
create mode 100644 backport/compat/kernel/ww_mutex.c
diff --git a/backport/backport-include/linux/ww_mutex.h b/backport/backport-include/linux/ww_mutex.h
new file mode 100644
index 0000000..0953939
--- /dev/null
+++ b/backport/backport-include/linux/ww_mutex.h
@@ -0,0 +1,333 @@
+#ifndef __BACKPORT_LINUX_WW_MUTEX_H
+#define __BACKPORT_LINUX_WW_MUTEX_H
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+#include_next <linux/ww_mutex.h>
+#else
+#ifdef CPTCFG_BACKPORT_BUILD_WW_MUTEX
+/*
+ * Wound/Wait Mutexes: blocking mutual exclusion locks with deadlock avoidance
+ *
+ * Original mutex implementation started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Wound/wait implementation:
+ * Copyright (C) 2013 Canonical Ltd.
+ *
+ * This file contains the main data structure and API definitions.
+ */
+
+#include <linux/mutex.h>
+
+struct ww_class {
+ atomic_long_t stamp;
+ struct lock_class_key acquire_key;
+ struct lock_class_key mutex_key;
+ const char *acquire_name;
+ const char *mutex_name;
+};
+
+struct ww_acquire_ctx {
+ struct task_struct *task;
+ unsigned long stamp;
+ unsigned acquired;
+};
+
+struct ww_mutex {
+ struct mutex base;
+ struct ww_acquire_ctx *ctx;
+};
+
+# define __WW_CLASS_MUTEX_INITIALIZER(lockname, ww_class)
+
+#define __WW_CLASS_INITIALIZER(ww_class) \
+ { .stamp = ATOMIC_LONG_INIT(0) \
+ , .acquire_name = #ww_class "_acquire" \
+ , .mutex_name = #ww_class "_mutex" }
+
+#define __WW_MUTEX_INITIALIZER(lockname, class) \
+ { .base = { \__MUTEX_INITIALIZER(lockname) } \
+ __WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
+
+#define DEFINE_WW_CLASS(classname) \
+ struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+
+#define DEFINE_WW_MUTEX(mutexname, ww_class) \
+ struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
+
+/**
+ * ww_mutex_init - initialize the w/w mutex
+ * @lock: the mutex to be initialized
+ * @ww_class: the w/w class the mutex should belong to
+ *
+ * Initialize the w/w mutex to unlocked state and associate it with the given
+ * class.
+ *
+ * It is not allowed to initialize an already locked mutex.
+ */
+#define ww_mutex_init LINUX_BACKPORT(ww_mutex_init)
+static inline void ww_mutex_init(struct ww_mutex *lock,
+ struct ww_class *ww_class)
+{
+ __mutex_init(&lock->base, ww_class->mutex_name, &ww_class->mutex_key);
+ lock->ctx = NULL;
+}
+
+/**
+ * ww_acquire_init - initialize a w/w acquire context
+ * @ctx: w/w acquire context to initialize
+ * @ww_class: w/w class of the context
+ *
+ * Initializes an context to acquire multiple mutexes of the given w/w class.
+ *
+ * Context-based w/w mutex acquiring can be done in any order whatsoever within
+ * a given lock class. Deadlocks will be detected and handled with the
+ * wait/wound logic.
+ *
+ * Mixing of context-based w/w mutex acquiring and single w/w mutex locking can
+ * result in undetected deadlocks and is so forbidden. Mixing different contexts
+ * for the same w/w class when acquiring mutexes can also result in undetected
+ * deadlocks, and is hence also forbidden. Both types of abuse will be caught by
+ * enabling CONFIG_PROVE_LOCKING.
+ *
+ * Nesting of acquire contexts for _different_ w/w classes is possible, subject
+ * to the usual locking rules between different lock classes.
+ *
+ * An acquire context must be released with ww_acquire_fini by the same task
+ * before the memory is freed. It is recommended to allocate the context itself
+ * on the stack.
+ */
+#define ww_acquire_init LINUX_BACKPORT(ww_acquire_init)
+static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
+ struct ww_class *ww_class)
+{
+ ctx->task = current;
+ ctx->stamp = atomic_long_inc_return(&ww_class->stamp);
+ ctx->acquired = 0;
+}
+
+/**
+ * ww_acquire_done - marks the end of the acquire phase
+ * @ctx: the acquire context
+ *
+ * Marks the end of the acquire phase, any further w/w mutex lock calls using
+ * this context are forbidden.
+ *
+ * Calling this function is optional, it is just useful to document w/w mutex
+ * code and clearly designated the acquire phase from actually using the locked
+ * data structures.
+ */
+#define ww_acquire_done LINUX_BACKPORT(ww_acquire_done)
+static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
+{
+}
+
+/**
+ * ww_acquire_fini - releases a w/w acquire context
+ * @ctx: the acquire context to free
+ *
+ * Releases a w/w acquire context. This must be called _after_ all acquired w/w
+ * mutexes have been released with ww_mutex_unlock.
+ */
+#define ww_acquire_fini LINUX_BACKPORT(ww_acquire_fini)
+static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
+{
+}
+
+#define __ww_mutex_lock LINUX_BACKPORT(__ww_mutex_lock)
+extern int __must_check __ww_mutex_lock(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+#define __ww_mutex_lock_interruptible LINUX_BACKPORT(__ww_mutex_lock_interruptible)
+extern int __must_check __ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+
+/**
+ * ww_mutex_lock - acquire the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context, or NULL to acquire only a single lock.
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow. Alternatively callers can opt to not acquire this
+ * lock and proceed with trying to acquire further w/w mutexes (e.g. when
+ * scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock LINUX_BACKPORT(ww_mutex_lock)
+static inline int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock(lock, ctx);
+
+ mutex_lock(&lock->base);
+ return 0;
+}
+
+/**
+ * ww_mutex_lock_interruptible - acquire the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately avaiable this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired. If a
+ * signal arrives while waiting for the lock then this function returns -EINTR.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow_interruptible. Alternatively callers can opt to
+ * not acquire this lock and proceed with trying to acquire further w/w mutexes
+ * (e.g. when scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock_interruptible LINUX_BACKPORT(ww_mutex_lock_interruptible)
+static inline int __must_check ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock_interruptible(lock, ctx);
+ else
+ return mutex_lock_interruptible(&lock->base);
+}
+
+/**
+ * ww_mutex_lock_slow - slowpath acquiring of the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the context held. It is forbidden to call this on anything else than the
+ * contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock directly. This function here is simply to help w/w mutex
+ * locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow LINUX_BACKPORT(ww_mutex_lock_slow)
+static inline void
+ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+ ret = ww_mutex_lock(lock, ctx);
+ (void)ret;
+}
+
+/**
+ * ww_mutex_lock_slow_interruptible - slowpath acquiring of the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available and returns 0 when the lock has
+ * been acquired. If a signal arrives while waiting for the lock then this
+ * function returns -EINTR.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the given context held. It is forbidden to call this on anything else
+ * than the contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock_interruptible directly. This function here is simply to help
+ * w/w mutex locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow_interruptible LINUX_BACKPORT(ww_mutex_lock_slow_interruptible)
+static inline int __must_check
+ww_mutex_lock_slow_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return ww_mutex_lock_interruptible(lock, ctx);
+}
+
+#define ww_mutex_unlock LINUX_BACKPORT(ww_mutex_unlock)
+extern void ww_mutex_unlock(struct ww_mutex *lock);
+
+/**
+ * ww_mutex_trylock - tries to acquire the w/w mutex without acquire context
+ * @lock: mutex to lock
+ *
+ * Trylocks a mutex without acquire context, so no deadlock detection is
+ * possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
+ */
+#define ww_mutex_trylock LINUX_BACKPORT(ww_mutex_trylock)
+static inline int __must_check ww_mutex_trylock(struct ww_mutex *lock)
+{
+ return mutex_trylock(&lock->base);
+}
+
+/***
+ * ww_mutex_destroy - mark a w/w mutex unusable
+ * @lock: the mutex to be destroyed
+ *
+ * This function marks the mutex uninitialized, and any subsequent
+ * use of the mutex is forbidden. The mutex must not be locked when
+ * this function is called.
+ */
+#define ww_mutex_destroy LINUX_BACKPORT(ww_mutex_destroy)
+static inline void ww_mutex_destroy(struct ww_mutex *lock)
+{
+ mutex_destroy(&lock->base);
+}
+
+/**
+ * ww_mutex_is_locked - is the w/w mutex locked
+ * @lock: the mutex to be queried
+ *
+ * Returns 1 if the mutex is locked, 0 if unlocked.
+ */
+#define ww_mutex_is_locked LINUX_BACKPORT(ww_mutex_is_locked)
+static inline bool ww_mutex_is_locked(struct ww_mutex *lock)
+{
+ return mutex_is_locked(&lock->base);
+}
+
+#endif /* CPTCFG_BACKPORT_BUILD_WW_MUTEX */
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+#endif /* __BACKPORT_LINUX_WW_MUTEX_H */
diff --git a/backport/compat/Kconfig b/backport/compat/Kconfig
index e2f0cdd..f3c1ab3 100644
--- a/backport/compat/Kconfig
+++ b/backport/compat/Kconfig
@@ -185,6 +185,17 @@ config BACKPORT_LEDS_CLASS
config BACKPORT_LEDS_TRIGGERS
bool
+config BACKPORT_BUILD_WW_MUTEX
+ bool
+ # Build only if on kernels < 3.11
+ # For now only DRM drivers use ww mutexes.
+ depends on DRM && BACKPORT_KERNEL_3_11
+ default y if BACKPORT_USERSEL_BUILD_ALL
+ # probably a bad idea if you have these options given we
+ # ripped those options out.
+ depends on !DEBUG_MUTEXES
+ depends on !DEBUG_LOCK_ALLOC
+
config BACKPORT_BUILD_RADIX_HELPERS
bool
# You have selected to build backported DRM drivers
diff --git a/backport/compat/Makefile b/backport/compat/Makefile
index 252290e..fec01c4 100644
--- a/backport/compat/Makefile
+++ b/backport/compat/Makefile
@@ -41,3 +41,4 @@ compat-$(CPTCFG_BACKPORT_BUILD_KFIFO) += kfifo.o
compat-$(CPTCFG_BACKPORT_BUILD_GENERIC_ATOMIC64) += compat_atomic.o
compat-$(CPTCFG_BACKPORT_BUILD_DMA_SHARED_HELPERS) += dma-shared-helpers.o
compat-$(CPTCFG_BACKPORT_BUILD_RADIX_HELPERS) += lib-radix-tree-helpers.o
+compat-$(CPTCFG_BACKPORT_BUILD_WW_MUTEX) += kernel/ww_mutex.o
diff --git a/backport/compat/kernel/ww_mutex.c b/backport/compat/kernel/ww_mutex.c
new file mode 100644
index 0000000..257c2a4
--- /dev/null
+++ b/backport/compat/kernel/ww_mutex.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2013 Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
+ *
+ * Backport ww mutex for older kernels. This is not supported when
+ * DEBUG_MUTEXES or DEBUG_LOCK_ALLOC is enabled.
+ *
+ * Taken from: kernel/mutex.c - via linux-stable v3.11-rc2
+ *
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
+ * David Howells for suggestions and improvements.
+ *
+ * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ * from the -rt tree, where it was originally implemented for rtmutexes
+ * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ * and Sven Dietrich.
+ *
+ * Also see Documentation/mutex-design.txt.
+ */
+
+#include <linux/mutex.h>
+#include <linux/ww_mutex.h>
+#include <asm/mutex.h>
+#include <linux/sched.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0)
+#include <linux/sched/rt.h>
+#endif
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+#include <linux/version.h>
+
+/*
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
+ */
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+
+#define spin_lock_mutex(lock, flags) \
+ do { spin_lock(lock); (void)(flags); } while (0)
+#define spin_unlock_mutex(lock, flags) \
+ do { spin_unlock(lock); (void)(flags); } while (0)
+#define mutex_remove_waiter(lock, waiter, ti) \
+ __list_del((waiter)->list.prev, (waiter)->list.next)
+
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock)
+{
+ lock->owner = current;
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+ lock->owner = NULL;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock)
+{
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+}
+#endif
+
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) /* 2bd2c92c and 41fcb9f2 */
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
+static noinline
+#endif
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
+#else /* Backport 2bd2c92c: help keep backport_mutex_lock_common() clean */
+
+struct mspin_node {
+};
+#define MLOCK(mutex) NULL
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ return 1;
+}
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) */
+#endif /* CONFIG_MUTEX_SPIN_ON_OWNER */
+
+/*
+ * Release the lock, slowpath:
+ */
+static inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
+{
+ struct mutex *lock = container_of(lock_count, struct mutex, count);
+ unsigned long flags;
+
+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ /* debug_mutex_unlock(lock); */
+
+ /*
+ * some architectures leave the lock unlocked in the fastpath failure
+ * case, others need to leave it locked. In the latter case we have to
+ * unlock it here
+ */
+ if (__mutex_slowpath_needs_to_unlock())
+ atomic_set(&lock->count, 1);
+
+ if (!list_empty(&lock->wait_list)) {
+ /* get the first entry from the wait-list: */
+ struct mutex_waiter *waiter =
+ list_entry(lock->wait_list.next,
+ struct mutex_waiter, list);
+
+ /* debug_mutex_wake_waiter(lock, waiter); */
+
+ wake_up_process(waiter->task);
+ }
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+}
+
+/*
+ * Release the lock, slowpath:
+ */
+static __used noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+/**
+ * ww_mutex_unlock - release the w/w mutex
+ * @lock: the mutex to be released
+ *
+ * Unlock a mutex that has been locked by this task previously with any of the
+ * ww_mutex_lock* functions (with or without an acquire context). It is
+ * forbidden to release the locks after releasing the acquire context.
+ *
+ * This function must not be used in interrupt context. Unlocking
+ * of an unlocked mutex is not allowed.
+ */
+void __sched ww_mutex_unlock(struct ww_mutex *lock)
+{
+ /*
+ * The unlocking fastpath is the 0->1 transition from 'locked'
+ * into 'unlocked' state:
+ */
+ if (lock->ctx) {
+ if (lock->ctx->acquired > 0)
+ lock->ctx->acquired--;
+ lock->ctx = NULL;
+ }
+
+ __mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+}
+EXPORT_SYMBOL_GPL(ww_mutex_unlock);
+
+static inline int __sched
+__mutex_lock_check_stamp(struct mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+ struct ww_acquire_ctx *hold_ctx = ACCESS_ONCE(ww->ctx);
+
+ if (!hold_ctx)
+ return 0;
+
+ if (unlikely(ctx == hold_ctx))
+ return -EALREADY;
+
+ if (ctx->stamp - hold_ctx->stamp <= LONG_MAX &&
+ (ctx->stamp != hold_ctx->stamp || ctx > hold_ctx)) {
+ return -EDEADLK;
+ }
+
+ return 0;
+}
+
+static __always_inline void ww_mutex_lock_acquired(struct ww_mutex *ww,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ ww_ctx->acquired++;
+}
+
+/*
+ * after acquiring lock with fastpath or when we lost out in contested
+ * slowpath, set ctx and wake up any waiters so they can recheck.
+ *
+ * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
+ * as the fastpath and opportunistic spinning are disabled in that case.
+ */
+static __always_inline void
+ww_mutex_set_context_fastpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ unsigned long flags;
+ struct mutex_waiter *cur;
+
+ ww_mutex_lock_acquired(lock, ctx);
+
+ lock->ctx = ctx;
+
+ /*
+ * The lock->ctx update should be visible on all cores before
+ * the atomic read is done, otherwise contended waiters might be
+ * missed. The contended waiters will either see ww_ctx == NULL
+ * and keep spinning, or it will acquire wait_lock, add itself
+ * to waiter list and sleep.
+ */
+ smp_mb(); /* ^^^ */
+
+ /*
+ * Check if lock is contended, if not there is nobody to wake up
+ */
+ if (likely(atomic_read(&lock->base.count) == 0))
+ return;
+
+ /*
+ * Uh oh, we raced in fastpath, wake up everyone in this case,
+ * so they can see the new lock->ctx.
+ */
+ spin_lock_mutex(&lock->base.wait_lock, flags);
+ list_for_each_entry(cur, &lock->base.wait_list, list) {
+ /* debug_mutex_wake_waiter(&lock->base, cur); */
+ wake_up_process(cur->task);
+ }
+ spin_unlock_mutex(&lock->base.wait_lock, flags);
+}
+
+/**
+ * backport_schedule_preempt_disabled - called with preemption disabled
+ *
+ * Backports c5491ea7. This is not exported so we leave it
+ * here as this is the only current core user on backports.
+ * Although available on >= 3.4 its only for in-kernel code so
+ * we provide our own.
+ *
+ * Returns with preemption disabled. Note: preempt_count must be 1
+ */
+static void __sched backport_schedule_preempt_disabled(void)
+{
+ preempt_enable_no_resched();
+ schedule();
+ preempt_disable();
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static __always_inline int __sched
+__backport_mutex_lock_common(struct mutex *lock, long state,
+ unsigned int subclass,
+ struct lockdep_map *nest_lock, unsigned long ip,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ struct task_struct *task = current;
+ struct mutex_waiter waiter;
+ unsigned long flags;
+ int ret;
+
+ preempt_disable();
+ mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ /*
+ * Optimistic spinning.
+ *
+ * We try to spin for acquisition when we find that there are no
+ * pending waiters and the lock owner is currently running on a
+ * (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
+ * doesn't track the owner atomically in the lock field, we need to
+ * track it non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
+ * to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
+ */
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;
+
+ for (;;) {
+ struct task_struct *owner;
+ struct mspin_node node;
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ struct ww_mutex *ww;
+
+ ww = container_of(lock, struct ww_mutex, base);
+ /*
+ * If ww->ctx is set the contents are undefined, only
+ * by acquiring wait_lock there is a guarantee that
+ * they are not invalid when reading.
+ *
+ * As such, when deadlock detection needs to be
+ * performed the optimistic spinning cannot be done.
+ */
+ if (ACCESS_ONCE(ww->ctx))
+ break;
+ }
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ mspin_lock(MLOCK(lock), &node);
+ owner = ACCESS_ONCE(lock->owner);
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
+ break;
+ }
+
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
+ lock_acquired(&lock->dep_map, ip);
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww;
+ ww = container_of(lock, struct ww_mutex, base);
+
+ ww_mutex_set_context_fastpath(ww, ww_ctx);
+ }
+
+ mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
+ preempt_enable();
+ return 0;
+ }
+ mspin_unlock(MLOCK(lock), &node);
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+ }
+slowpath:
+#endif
+ spin_lock_mutex(&lock->wait_lock, flags);
+
+ /* We don't support DEBUG_MUTEXES on the backport */
+ /* debug_mutex_lock_common(lock, &waiter); */
+ /* debug_mutex_add_waiter(lock, &waiter, task_thread_info(task)); */
+
+ /* add waiting tasks to the end of the waitqueue (FIFO): */
+ list_add_tail(&waiter.list, &lock->wait_list);
+ waiter.task = task;
+
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
+ goto done;
+
+ lock_contended(&lock->dep_map, ip);
+
+ for (;;) {
+ /*
+ * Lets try to take the lock again - this is needed even if
+ * we get here for the first time (shortly after failing to
+ * acquire the lock), to make sure that we get a wakeup once
+ * it's unlocked. Later on, if we sleep, this is the
+ * operation that gives us the lock. We xchg it to -1, so
+ * that when we release the lock, we properly wake up the
+ * other waiters:
+ */
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
+ break;
+
+ /*
+ * got a signal? (This code gets eliminated in the
+ * TASK_UNINTERRUPTIBLE case.)
+ */
+ if (unlikely(signal_pending_state(state, task))) {
+ ret = -EINTR;
+ goto err;
+ }
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ ret = __mutex_lock_check_stamp(lock, ww_ctx);
+ if (ret)
+ goto err;
+ }
+
+ __set_task_state(task, state);
+
+ /* didn't get the lock, go to sleep: */
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ backport_schedule_preempt_disabled();
+ spin_lock_mutex(&lock->wait_lock, flags);
+ }
+
+done:
+ lock_acquired(&lock->dep_map, ip);
+ /* got the lock - rejoice! */
+ mutex_remove_waiter(lock, &waiter, current_thread_info());
+ mutex_set_owner(lock);
+
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww = container_of(lock,
+ struct ww_mutex,
+ base);
+ struct mutex_waiter *cur;
+
+ /*
+ * This branch gets optimized out for the common case,
+ * and is only important for ww_mutex_lock.
+ */
+
+ ww_mutex_lock_acquired(ww, ww_ctx);
+ ww->ctx = ww_ctx;
+
+ /*
+ * Give any possible sleeping processes the chance to wake up,
+ * so they can recheck if they have to back off.
+ */
+ list_for_each_entry(cur, &lock->wait_list, list) {
+ /* debug_mutex_wake_waiter(lock, cur); */
+ wake_up_process(cur->task);
+ }
+ }
+
+ /* set it to 0 if there are no waiters left: */
+ if (likely(list_empty(&lock->wait_list)))
+ atomic_set(&lock->count, 0);
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+
+ /* debug_mutex_free_waiter(&waiter); */
+ preempt_enable();
+
+ return 0;
+
+err:
+ mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ /* debug_mutex_free_waiter(&waiter); */
+ mutex_release(&lock->dep_map, 1, ip);
+ preempt_enable();
+ return ret;
+}
+
+static noinline int __sched
+__ww_mutex_lock_slowpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+static noinline int __sched
+__ww_mutex_lock_interruptible_slowpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+/**
+ * __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ * from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ *
+ * For backporting purposes we can't use the older kernel's
+ * __mutex_fastpath_lock_retval() since upon failure of a fastpath
+ * lock we want to call our failure routine with more than one argument, in
+ * this case the context for ww mutexes. Refer to commit a41b56ef for the
+ * argument increase. It'd be painful to backport all asm code for the
+ * supported architectures so instead let's penalize the backport ww mutex
+ * fastpath lock with the not so efficient generic atomic_dec_return()
+ * implementation.
+ *
+ * Change the count from 1 to a value lower than 1. This function returns 0
+ * if the fastpath succeeds, or -1 otherwise.
+ */
+static inline int
+__backport_mutex_fastpath_lock_retval(atomic_t *count)
+{
+ if (unlikely(atomic_dec_return(count) < 0))
+ return -1;
+ return 0;
+}
+
+int __sched
+__ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock);
+
+int __sched
+__ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
diff --git a/dependencies b/dependencies
index 6841613..5c50571 100644
--- a/dependencies
+++ b/dependencies
@@ -48,6 +48,12 @@ MWIFIEX 2.6.27
# DRM stuff
HDMI 3.2
DRM 3.2
+# As of 3.11 DRM depends on the new ww_mutex which is
+# backported via BACKPORT_BUILD_WW_MUTEX. This backported
+# feature however does not yet have support for
+# DEBUG_MUTEXES and DEBUG_LOCK_ALLOC.
+DRM kconfig: !BACKPORT_KERNEL_3_11 || !DEBUG_MUTEXES
+DRM kconfig: !BACKPORT_KERNEL_3_11 || !DEBUG_LOCK_ALLOC
DRM_TTM 3.2
# See e2bdb933, this was added on v3.3, in order to
# support DRM_QXL on 3.2 you'd have to backport 78c1d7848
--
1.7.10.4
From: "Luis R. Rodriguez" <mcgrof(a)do-not-panic.com>
This backports the kernel's wound/wait style locks, commit
040a0a371, using linux-stable v3.11-rc2 as a base for development.
Given the complexity of supporting the mutex debugging options,
this backport is simplified by only making the feature available
when DEBUG_MUTEXES and DEBUG_LOCK_ALLOC are disabled. Since ww
mutexes are required for DRM, this also means the DRM kconfig
entry must be updated so that DRM cannot be built if either of
these options is enabled. Support for DEBUG_MUTEXES and
DEBUG_LOCK_ALLOC can be added later by anyone daring.
Part of the ww mutex addition to the kernel required modifying
the mutex fast path locking scheme so that callers deal with the
slow path alternatives on their own (refer to a41b56ef). The
reason for this change was that the mutex fastpath implementation
assumed the slowpath alternative could only be passed one
argument, while the addition of ww mutexes requires a slow path
that is also passed a context.
It'd be painful to backport all the asm for an optimized fastpath
implementation, so we penalize the backported ww mutex fast path
by using the generic atomic_dec_return().
To carry our own clean mutex_lock_common() with the least amount
of changes against upstream, commits 2bd2c92c and 41fcb9f2 also
needed to be backported. Commit 2bd2c92c added support for
queueing mutex spinners with an MCS lock; since this cannot be
backported for older kernels we provide empty inlines. Commit
41fcb9f2 just removed SCHED_FEAT_OWNER_SPIN as it was an early
hack; the only thing required to backport it was to provide an
alternative declaration for mutex_spin_on_owner(), as it is
declared non-inline for older kernels.
Finally, c5491ea7 required backporting schedule_preempt_disabled()
as well, but that just consisted of carrying over the original
implementation. Since it's not exported, we reimplement it to make
it available to our internal ww mutex port.
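For reference, the backported API is used exactly like the upstream one.
Below is a minimal illustrative sketch of the wait/wound acquire pattern
over an array of locks, following the flow described in
Documentation/ww-mutex-design.txt. It is not part of this patch;
my_class, with_all_locked() and the locks array are made-up names.

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(my_class);	/* every lock below belongs to this class */

/* Acquire all n locks in any order, handling the wound/backoff case. */
static void with_all_locked(struct ww_mutex *locks[], int n)
{
	struct ww_acquire_ctx ctx;
	struct ww_mutex *slow = NULL;	/* lock re-taken via ww_mutex_lock_slow() */
	int i, contended, err;

	ww_acquire_init(&ctx, &my_class);
retry:
	for (i = 0; i < n; i++) {
		if (locks[i] == slow) {
			slow = NULL;	/* already held from the slow path */
			continue;
		}
		err = ww_mutex_lock(locks[i], &ctx);
		if (err == -EDEADLK) {
			contended = i;
			/* wounded: release everything held so far ... */
			while (i--)
				ww_mutex_unlock(locks[i]);
			if (slow)
				ww_mutex_unlock(slow);
			/* ... sleep on the lock that wounded us, then retry */
			ww_mutex_lock_slow(locks[contended], &ctx);
			slow = locks[contended];
			goto retry;
		}
	}
	ww_acquire_done(&ctx);

	/* ... touch the data protected by all the locks ... */

	for (i = 0; i < n; i++)
		ww_mutex_unlock(locks[i]);
	ww_acquire_fini(&ctx);
}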
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 040a0a371
v3.11-rc1~147^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains a41b56ef
v3.11-rc1~147^2~6
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 2bd2c92c
v3.10-rc1~200^2~3
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 41fcb9f2
v3.10-rc1~200^2~5
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains c5491ea7
v3.4-rc1~3^2~27
commit 040a0a37100563754bb1fee6ff6427420bcfa609
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Mon Jun 24 10:30:04 2013 +0200
mutex: Add support for wound/wait style locks
Wound/wait mutexes are used when other multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older tasks waits until it can acquire
the contended lock. The younger tasks needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.
For full documentation please read Documentation/ww-mutex-design.txt.
References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Rob Clark <robdclark(a)gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit a41b56efa70e060f650aeb54740aaf52044a1ead
Author: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Date: Thu Jun 20 13:31:05 2013 +0200
arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.
Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.
This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: robclark(a)gmail.com
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
commit 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:13 2013 -0400
mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
<-- snip -->
commit 41fcb9f230bf773656d1768b73000ef720bf00c3
Author: Waiman Long <Waiman.Long(a)hp.com>
Date: Wed Apr 17 15:23:11 2013 -0400
mutex: Move mutex spinning code from sched/core.c back to mutex.c
<-- snip -->
commit c5491ea779793f977d282754db478157cc409d82
Author: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon Mar 21 12:09:35 2011 +0100
sched/rt: Add schedule_preempt_disabled()
<-- snip -->
Cc: maarten.lankhorst(a)canonical.com
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Rob Clark <robdclark(a)gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: rostedt(a)goodmis.org
Cc: daniel(a)ffwll.ch
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
---
backport/backport-include/linux/ww_mutex.h | 333 ++++++++++++++
backport/compat/Kconfig | 11 +
backport/compat/Makefile | 1 +
backport/compat/kernel/ww_mutex.c | 667 ++++++++++++++++++++++++++++
4 files changed, 1012 insertions(+)
create mode 100644 backport/backport-include/linux/ww_mutex.h
create mode 100644 backport/compat/kernel/ww_mutex.c
diff --git a/backport/backport-include/linux/ww_mutex.h b/backport/backport-include/linux/ww_mutex.h
new file mode 100644
index 0000000..0953939
--- /dev/null
+++ b/backport/backport-include/linux/ww_mutex.h
@@ -0,0 +1,333 @@
+#ifndef __BACKPORT_LINUX_WW_MUTEX_H
+#define __BACKPORT_LINUX_WW_MUTEX_H
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+#include_next <linux/ww_mutex.h>
+#else
+#ifdef CPTCFG_BACKPORT_BUILD_WW_MUTEX
+/*
+ * Wound/Wait Mutexes: blocking mutual exclusion locks with deadlock avoidance
+ *
+ * Original mutex implementation started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Wound/wait implementation:
+ * Copyright (C) 2013 Canonical Ltd.
+ *
+ * This file contains the main data structure and API definitions.
+ */
+
+#include <linux/mutex.h>
+
+struct ww_class {
+ atomic_long_t stamp;
+ struct lock_class_key acquire_key;
+ struct lock_class_key mutex_key;
+ const char *acquire_name;
+ const char *mutex_name;
+};
+
+struct ww_acquire_ctx {
+ struct task_struct *task;
+ unsigned long stamp;
+ unsigned acquired;
+};
+
+struct ww_mutex {
+ struct mutex base;
+ struct ww_acquire_ctx *ctx;
+};
+
+# define __WW_CLASS_MUTEX_INITIALIZER(lockname, ww_class)
+
+#define __WW_CLASS_INITIALIZER(ww_class) \
+ { .stamp = ATOMIC_LONG_INIT(0) \
+ , .acquire_name = #ww_class "_acquire" \
+ , .mutex_name = #ww_class "_mutex" }
+
+#define __WW_MUTEX_INITIALIZER(lockname, class) \
+ { .base = { \__MUTEX_INITIALIZER(lockname) } \
+ __WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
+
+#define DEFINE_WW_CLASS(classname) \
+ struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+
+#define DEFINE_WW_MUTEX(mutexname, ww_class) \
+ struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
+
+/**
+ * ww_mutex_init - initialize the w/w mutex
+ * @lock: the mutex to be initialized
+ * @ww_class: the w/w class the mutex should belong to
+ *
+ * Initialize the w/w mutex to unlocked state and associate it with the given
+ * class.
+ *
+ * It is not allowed to initialize an already locked mutex.
+ */
+#define ww_mutex_init LINUX_BACKPORT(ww_mutex_init)
+static inline void ww_mutex_init(struct ww_mutex *lock,
+ struct ww_class *ww_class)
+{
+ __mutex_init(&lock->base, ww_class->mutex_name, &ww_class->mutex_key);
+ lock->ctx = NULL;
+}
+
+/**
+ * ww_acquire_init - initialize a w/w acquire context
+ * @ctx: w/w acquire context to initialize
+ * @ww_class: w/w class of the context
+ *
+ * Initializes a context to acquire multiple mutexes of the given w/w class.
+ *
+ * Context-based w/w mutex acquiring can be done in any order whatsoever within
+ * a given lock class. Deadlocks will be detected and handled with the
+ * wait/wound logic.
+ *
+ * Mixing of context-based w/w mutex acquiring and single w/w mutex locking can
+ * result in undetected deadlocks and is so forbidden. Mixing different contexts
+ * for the same w/w class when acquiring mutexes can also result in undetected
+ * deadlocks, and is hence also forbidden. Both types of abuse will be caught by
+ * enabling CONFIG_PROVE_LOCKING.
+ *
+ * Nesting of acquire contexts for _different_ w/w classes is possible, subject
+ * to the usual locking rules between different lock classes.
+ *
+ * An acquire context must be released with ww_acquire_fini by the same task
+ * before the memory is freed. It is recommended to allocate the context itself
+ * on the stack.
+ */
+#define ww_acquire_init LINUX_BACKPORT(ww_acquire_init)
+static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
+ struct ww_class *ww_class)
+{
+ ctx->task = current;
+ ctx->stamp = atomic_long_inc_return(&ww_class->stamp);
+ ctx->acquired = 0;
+}
+
+/**
+ * ww_acquire_done - marks the end of the acquire phase
+ * @ctx: the acquire context
+ *
+ * Marks the end of the acquire phase, any further w/w mutex lock calls using
+ * this context are forbidden.
+ *
+ * Calling this function is optional, it is just useful to document w/w mutex
+ * code and clearly designated the acquire phase from actually using the locked
+ * data structures.
+ */
+#define ww_acquire_done LINUX_BACKPORT(ww_acquire_done)
+static inline void ww_acquire_done(struct ww_acquire_ctx *ctx)
+{
+}
+
+/**
+ * ww_acquire_fini - releases a w/w acquire context
+ * @ctx: the acquire context to free
+ *
+ * Releases a w/w acquire context. This must be called _after_ all acquired w/w
+ * mutexes have been released with ww_mutex_unlock.
+ */
+#define ww_acquire_fini LINUX_BACKPORT(ww_acquire_fini)
+static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx)
+{
+}
+
+#define __ww_mutex_lock LINUX_BACKPORT(__ww_mutex_lock)
+extern int __must_check __ww_mutex_lock(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+#define __ww_mutex_lock_interruptible LINUX_BACKPORT(__ww_mutex_lock_interruptible)
+extern int __must_check __ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx);
+
+/**
+ * ww_mutex_lock - acquire the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context, or NULL to acquire only a single lock.
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately available this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow. Alternatively callers can opt to not acquire this
+ * lock and proceed with trying to acquire further w/w mutexes (e.g. when
+ * scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock LINUX_BACKPORT(ww_mutex_lock)
+static inline int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock(lock, ctx);
+
+ mutex_lock(&lock->base);
+ return 0;
+}
+
+/**
+ * ww_mutex_lock_interruptible - acquire the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Lock the w/w mutex exclusively for this task.
+ *
+ * Deadlocks within a given w/w class of locks are detected and handled with the
+ * wait/wound algorithm. If the lock isn't immediately available this function
+ * will either sleep until it is (wait case). Or it selects the current context
+ * for backing off by returning -EDEADLK (wound case). Trying to acquire the
+ * same lock with the same context twice is also detected and signalled by
+ * returning -EALREADY. Returns 0 if the mutex was successfully acquired. If a
+ * signal arrives while waiting for the lock then this function returns -EINTR.
+ *
+ * In the wound case the caller must release all currently held w/w mutexes for
+ * the given context and then wait for this contending lock to be available by
+ * calling ww_mutex_lock_slow_interruptible. Alternatively callers can opt to
+ * not acquire this lock and proceed with trying to acquire further w/w mutexes
+ * (e.g. when scanning through lru lists trying to free resources).
+ *
+ * The mutex must later on be released by the same task that
+ * acquired it. The task may not exit without first unlocking the mutex. Also,
+ * kernel memory where the mutex resides must not be freed with the mutex still
+ * locked. The mutex must first be initialized (or statically defined) before it
+ * can be locked. memset()-ing the mutex to 0 is not allowed. The mutex must be
+ * of the same w/w lock class as was used to initialize the acquire context.
+ *
+ * A mutex acquired with this function must be released with ww_mutex_unlock.
+ */
+#define ww_mutex_lock_interruptible LINUX_BACKPORT(ww_mutex_lock_interruptible)
+static inline int __must_check ww_mutex_lock_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ if (ctx)
+ return __ww_mutex_lock_interruptible(lock, ctx);
+ else
+ return mutex_lock_interruptible(&lock->base);
+}
+
+/**
+ * ww_mutex_lock_slow - slowpath acquiring of the w/w mutex
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the context held. It is forbidden to call this on anything else than the
+ * contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock directly. This function here is simply to help w/w mutex
+ * locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow LINUX_BACKPORT(ww_mutex_lock_slow)
+static inline void
+ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+ ret = ww_mutex_lock(lock, ctx);
+ (void)ret;
+}
+
+/**
+ * ww_mutex_lock_slow_interruptible - slowpath acquiring of the w/w mutex, interruptible
+ * @lock: the mutex to be acquired
+ * @ctx: w/w acquire context
+ *
+ * Acquires a w/w mutex with the given context after a wound case. This function
+ * will sleep until the lock becomes available and returns 0 when the lock has
+ * been acquired. If a signal arrives while waiting for the lock then this
+ * function returns -EINTR.
+ *
+ * The caller must have released all w/w mutexes already acquired with the
+ * context and then call this function on the contended lock.
+ *
+ * Afterwards the caller may continue to (re)acquire the other w/w mutexes it
+ * needs with ww_mutex_lock. Note that the -EALREADY return code from
+ * ww_mutex_lock can be used to avoid locking this contended mutex twice.
+ *
+ * It is forbidden to call this function with any other w/w mutexes associated
+ * with the given context held. It is forbidden to call this on anything else
+ * than the contending mutex.
+ *
+ * Note that the slowpath lock acquiring can also be done by calling
+ * ww_mutex_lock_interruptible directly. This function here is simply to help
+ * w/w mutex locking code readability by clearly denoting the slowpath.
+ */
+#define ww_mutex_lock_slow_interruptible LINUX_BACKPORT(ww_mutex_lock_slow_interruptible)
+static inline int __must_check
+ww_mutex_lock_slow_interruptible(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return ww_mutex_lock_interruptible(lock, ctx);
+}
+
+#define ww_mutex_unlock LINUX_BACKPORT(ww_mutex_unlock)
+extern void ww_mutex_unlock(struct ww_mutex *lock);
+
+/**
+ * ww_mutex_trylock - tries to acquire the w/w mutex without acquire context
+ * @lock: mutex to lock
+ *
+ * Trylocks a mutex without acquire context, so no deadlock detection is
+ * possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
+ */
+#define ww_mutex_trylock LINUX_BACKPORT(ww_mutex_trylock)
+static inline int __must_check ww_mutex_trylock(struct ww_mutex *lock)
+{
+ return mutex_trylock(&lock->base);
+}
+
+/***
+ * ww_mutex_destroy - mark a w/w mutex unusable
+ * @lock: the mutex to be destroyed
+ *
+ * This function marks the mutex uninitialized, and any subsequent
+ * use of the mutex is forbidden. The mutex must not be locked when
+ * this function is called.
+ */
+#define ww_mutex_destroy LINUX_BACKPORT(ww_mutex_destroy)
+static inline void ww_mutex_destroy(struct ww_mutex *lock)
+{
+ mutex_destroy(&lock->base);
+}
+
+/**
+ * ww_mutex_is_locked - is the w/w mutex locked
+ * @lock: the mutex to be queried
+ *
+ * Returns 1 if the mutex is locked, 0 if unlocked.
+ */
+#define ww_mutex_is_locked LINUX_BACKPORT(ww_mutex_is_locked)
+static inline bool ww_mutex_is_locked(struct ww_mutex *lock)
+{
+ return mutex_is_locked(&lock->base);
+}
+
+#endif /* CPTCFG_BACKPORT_BUILD_WW_MUTEX */
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+#endif /* __BACKPORT_LINUX_WW_MUTEX_H */
diff --git a/backport/compat/Kconfig b/backport/compat/Kconfig
index e2f0cdd..f3c1ab3 100644
--- a/backport/compat/Kconfig
+++ b/backport/compat/Kconfig
@@ -185,6 +185,17 @@ config BACKPORT_LEDS_CLASS
config BACKPORT_LEDS_TRIGGERS
bool
+config BACKPORT_BUILD_WW_MUTEX
+ bool
+ # Build only if on kernels < 3.11
+ # For now only DRM drivers use ww mutexes.
+ depends on DRM && BACKPORT_KERNEL_3_11
+ default y if BACKPORT_USERSEL_BUILD_ALL
+ # probably a bad idea if you have these options given we
+ # ripped those options out.
+ depends on !DEBUG_MUTEXES
+ depends on !DEBUG_LOCK_ALLOC
+
config BACKPORT_BUILD_RADIX_HELPERS
bool
# You have selected to build backported DRM drivers
diff --git a/backport/compat/Makefile b/backport/compat/Makefile
index 252290e..fec01c4 100644
--- a/backport/compat/Makefile
+++ b/backport/compat/Makefile
@@ -41,3 +41,4 @@ compat-$(CPTCFG_BACKPORT_BUILD_KFIFO) += kfifo.o
compat-$(CPTCFG_BACKPORT_BUILD_GENERIC_ATOMIC64) += compat_atomic.o
compat-$(CPTCFG_BACKPORT_BUILD_DMA_SHARED_HELPERS) += dma-shared-helpers.o
compat-$(CPTCFG_BACKPORT_BUILD_RADIX_HELPERS) += lib-radix-tree-helpers.o
+compat-$(CPTCFG_BACKPORT_BUILD_WW_MUTEX) += kernel/ww_mutex.o
diff --git a/backport/compat/kernel/ww_mutex.c b/backport/compat/kernel/ww_mutex.c
new file mode 100644
index 0000000..257c2a4
--- /dev/null
+++ b/backport/compat/kernel/ww_mutex.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2013 Luis R. Rodriguez <mcgrof(a)do-not-panic.com>
+ *
+ * Backport ww mutex for older kernels. This is not supported when
+ * DEBUG_MUTEXES or DEBUG_LOCK_ALLOC is enabled.
+ *
+ * Taken from: kernel/mutex.c - via linux-stable v3.11-rc2
+ *
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo(a)redhat.com>
+ *
+ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
+ * David Howells for suggestions and improvements.
+ *
+ * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ * from the -rt tree, where it was originally implemented for rtmutexes
+ * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ * and Sven Dietrich.
+ *
+ * Also see Documentation/mutex-design.txt.
+ */
+
+#include <linux/mutex.h>
+#include <linux/ww_mutex.h>
+#include <asm/mutex.h>
+#include <linux/sched.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0)
+#include <linux/sched/rt.h>
+#endif
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+#include <linux/version.h>
+
+/*
+ * A negative mutex count indicates that waiters are sleeping waiting for the
+ * mutex.
+ */
+#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+
+#define spin_lock_mutex(lock, flags) \
+ do { spin_lock(lock); (void)(flags); } while (0)
+#define spin_unlock_mutex(lock, flags) \
+ do { spin_unlock(lock); (void)(flags); } while (0)
+#define mutex_remove_waiter(lock, waiter, ti) \
+ __list_del((waiter)->list.prev, (waiter)->list.next)
+
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock)
+{
+ lock->owner = current;
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+ lock->owner = NULL;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock)
+{
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+}
+#endif
+
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) /* 2bd2c92c and 41fcb9f2 */
+/*
+ * In order to avoid a stampede of mutex spinners from acquiring the mutex
+ * more or less simultaneously, the spinners need to acquire a MCS lock
+ * first before spinning on the owner field.
+ *
+ * We don't inline mspin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+struct mspin_node {
+ struct mspin_node *next ;
+ int locked; /* 1 if lock acquired */
+};
+#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *prev;
+
+ /* Init node */
+ node->locked = 0;
+ node->next = NULL;
+
+ prev = xchg(lock, node);
+ if (likely(prev == NULL)) {
+ /* Lock acquired */
+ node->locked = 1;
+ return;
+ }
+ ACCESS_ONCE(prev->next) = node;
+ smp_wmb();
+ /* Wait until the lock holder passes the lock down */
+ while (!ACCESS_ONCE(node->locked))
+ arch_mutex_cpu_relax();
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+ struct mspin_node *next = ACCESS_ONCE(node->next);
+
+ if (likely(!next)) {
+ /*
+ * Release the lock by setting it to NULL
+ */
+ if (cmpxchg(lock, node, NULL) == node)
+ return;
+ /* Wait until the next pointer is set */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ }
+ ACCESS_ONCE(next->locked) = 1;
+ smp_wmb();
+}
+
+/*
+ * Mutex spinning code migrated from kernel/sched/core.c
+ */
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+ if (lock->owner != owner)
+ return false;
+
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_ checking
+ * lock->owner still matches owner, if that fails, owner might
+ * point to free()d memory, if it still matches, the rcu_read_lock()
+ * ensures the memory stays valid.
+ */
+ barrier();
+
+ return owner->on_cpu;
+}
+
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
+static noinline
+#endif
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+ rcu_read_lock();
+ while (owner_running(lock, owner)) {
+ if (need_resched())
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * We break out the loop above on need_resched() and when the
+ * owner changed, which is a sign for heavy contention. Return
+ * success only when lock->owner is NULL.
+ */
+ return lock->owner == NULL;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ int retval = 1;
+
+ rcu_read_lock();
+ if (lock->owner)
+ retval = lock->owner->on_cpu;
+ rcu_read_unlock();
+ /*
+ * if lock->owner is not set, the mutex owner may have just acquired
+ * it and not set the owner yet or the mutex has been released.
+ */
+ return retval;
+}
+#else /* Backport 2bd2c92c: help keep backport_mutex_lock_common() clean */
+
+struct mspin_node {
+};
+#define MLOCK(mutex) NULL
+
+static noinline
+void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
+{
+}
+
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+}
+
+static inline int mutex_can_spin_on_owner(struct mutex *lock)
+{
+ return 1;
+}
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0) */
+#endif /* CONFIG_MUTEX_SPIN_ON_OWNER */
+
+/*
+ * Release the lock, slowpath:
+ */
+static inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
+{
+ struct mutex *lock = container_of(lock_count, struct mutex, count);
+ unsigned long flags;
+
+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ /* debug_mutex_unlock(lock); */
+
+ /*
+ * some architectures leave the lock unlocked in the fastpath failure
+ * case, others need to leave it locked. In the latter case we have to
+ * unlock it here
+ */
+ if (__mutex_slowpath_needs_to_unlock())
+ atomic_set(&lock->count, 1);
+
+ if (!list_empty(&lock->wait_list)) {
+ /* get the first entry from the wait-list: */
+ struct mutex_waiter *waiter =
+ list_entry(lock->wait_list.next,
+ struct mutex_waiter, list);
+
+ /* debug_mutex_wake_waiter(lock, waiter); */
+
+ wake_up_process(waiter->task);
+ }
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+}
+
+/*
+ * Release the lock, slowpath:
+ */
+static __used noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+ __mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+/**
+ * ww_mutex_unlock - release the w/w mutex
+ * @lock: the mutex to be released
+ *
+ * Unlock a mutex that has been locked by this task previously with any of the
+ * ww_mutex_lock* functions (with or without an acquire context). It is
+ * forbidden to release the locks after releasing the acquire context.
+ *
+ * This function must not be used in interrupt context. Unlocking
+ * of an unlocked mutex is not allowed.
+ */
+void __sched ww_mutex_unlock(struct ww_mutex *lock)
+{
+ /*
+ * The unlocking fastpath is the 0->1 transition from 'locked'
+ * into 'unlocked' state:
+ */
+ if (lock->ctx) {
+ if (lock->ctx->acquired > 0)
+ lock->ctx->acquired--;
+ lock->ctx = NULL;
+ }
+
+ __mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+}
+EXPORT_SYMBOL_GPL(ww_mutex_unlock);
+
+static inline int __sched
+__mutex_lock_check_stamp(struct mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+ struct ww_acquire_ctx *hold_ctx = ACCESS_ONCE(ww->ctx);
+
+ if (!hold_ctx)
+ return 0;
+
+ if (unlikely(ctx == hold_ctx))
+ return -EALREADY;
+
+ if (ctx->stamp - hold_ctx->stamp <= LONG_MAX &&
+ (ctx->stamp != hold_ctx->stamp || ctx > hold_ctx)) {
+ return -EDEADLK;
+ }
+
+ return 0;
+}
+
+static __always_inline void ww_mutex_lock_acquired(struct ww_mutex *ww,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ ww_ctx->acquired++;
+}
+
+/*
+ * after acquiring lock with fastpath or when we lost out in contested
+ * slowpath, set ctx and wake up any waiters so they can recheck.
+ *
+ * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
+ * as the fastpath and opportunistic spinning are disabled in that case.
+ */
+static __always_inline void
+ww_mutex_set_context_fastpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ unsigned long flags;
+ struct mutex_waiter *cur;
+
+ ww_mutex_lock_acquired(lock, ctx);
+
+ lock->ctx = ctx;
+
+ /*
+ * The lock->ctx update should be visible on all cores before
+ * the atomic read is done, otherwise contended waiters might be
+ * missed. The contended waiters will either see ww_ctx == NULL
+ * and keep spinning, or it will acquire wait_lock, add itself
+ * to waiter list and sleep.
+ */
+ smp_mb(); /* ^^^ */
+
+ /*
+ * Check if lock is contended, if not there is nobody to wake up
+ */
+ if (likely(atomic_read(&lock->base.count) == 0))
+ return;
+
+ /*
+ * Uh oh, we raced in fastpath, wake up everyone in this case,
+ * so they can see the new lock->ctx.
+ */
+ spin_lock_mutex(&lock->base.wait_lock, flags);
+ list_for_each_entry(cur, &lock->base.wait_list, list) {
+ /* debug_mutex_wake_waiter(&lock->base, cur); */
+ wake_up_process(cur->task);
+ }
+ spin_unlock_mutex(&lock->base.wait_lock, flags);
+}
+
+/**
+ * backport_schedule_preempt_disabled - called with preemption disabled
+ *
+ * Backports c5491ea7. This is not exported so we leave it
+ * here as this is the only current core user on backports.
+ * Although available on >= 3.4 its only for in-kernel code so
+ * we provide our own.
+ *
+ * Returns with preemption disabled. Note: preempt_count must be 1
+ */
+static void __sched backport_schedule_preempt_disabled(void)
+{
+ preempt_enable_no_resched();
+ schedule();
+ preempt_disable();
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static __always_inline int __sched
+__backport_mutex_lock_common(struct mutex *lock, long state,
+ unsigned int subclass,
+ struct lockdep_map *nest_lock, unsigned long ip,
+ struct ww_acquire_ctx *ww_ctx)
+{
+ struct task_struct *task = current;
+ struct mutex_waiter waiter;
+ unsigned long flags;
+ int ret;
+
+ preempt_disable();
+ mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+ /*
+ * Optimistic spinning.
+ *
+ * We try to spin for acquisition when we find that there are no
+ * pending waiters and the lock owner is currently running on a
+ * (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
+ * doesn't track the owner atomically in the lock field, we need to
+ * track it non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
+ * to serialize everything.
+ *
+ * The mutex spinners are queued up using MCS lock so that only one
+ * spinner can compete for the mutex. However, if mutex spinning isn't
+ * going to happen, there is no point in going through the lock/unlock
+ * overhead.
+ */
+ if (!mutex_can_spin_on_owner(lock))
+ goto slowpath;
+
+ for (;;) {
+ struct task_struct *owner;
+ struct mspin_node node;
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ struct ww_mutex *ww;
+
+ ww = container_of(lock, struct ww_mutex, base);
+ /*
+ * If ww->ctx is set the contents are undefined, only
+ * by acquiring wait_lock there is a guarantee that
+ * they are not invalid when reading.
+ *
+ * As such, when deadlock detection needs to be
+ * performed the optimistic spinning cannot be done.
+ */
+ if (ACCESS_ONCE(ww->ctx))
+ break;
+ }
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ mspin_lock(MLOCK(lock), &node);
+ owner = ACCESS_ONCE(lock->owner);
+ if (owner && !mutex_spin_on_owner(lock, owner)) {
+ mspin_unlock(MLOCK(lock), &node);
+ break;
+ }
+
+ if ((atomic_read(&lock->count) == 1) &&
+ (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
+ lock_acquired(&lock->dep_map, ip);
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww;
+ ww = container_of(lock, struct ww_mutex, base);
+
+ ww_mutex_set_context_fastpath(ww, ww_ctx);
+ }
+
+ mutex_set_owner(lock);
+ mspin_unlock(MLOCK(lock), &node);
+ preempt_enable();
+ return 0;
+ }
+ mspin_unlock(MLOCK(lock), &node);
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!owner && (need_resched() || rt_task(task)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ arch_mutex_cpu_relax();
+ }
+slowpath:
+#endif
+ spin_lock_mutex(&lock->wait_lock, flags);
+
+ /* We don't support DEBUG_MUTEXES on the backport */
+ /* debug_mutex_lock_common(lock, &waiter); */
+ /* debug_mutex_add_waiter(lock, &waiter, task_thread_info(task)); */
+
+ /* add waiting tasks to the end of the waitqueue (FIFO): */
+ list_add_tail(&waiter.list, &lock->wait_list);
+ waiter.task = task;
+
+ if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
+ goto done;
+
+ lock_contended(&lock->dep_map, ip);
+
+ for (;;) {
+ /*
+ * Lets try to take the lock again - this is needed even if
+ * we get here for the first time (shortly after failing to
+ * acquire the lock), to make sure that we get a wakeup once
+ * it's unlocked. Later on, if we sleep, this is the
+ * operation that gives us the lock. We xchg it to -1, so
+ * that when we release the lock, we properly wake up the
+ * other waiters:
+ */
+ if (MUTEX_SHOW_NO_WAITER(lock) &&
+ (atomic_xchg(&lock->count, -1) == 1))
+ break;
+
+ /*
+ * got a signal? (This code gets eliminated in the
+ * TASK_UNINTERRUPTIBLE case.)
+ */
+ if (unlikely(signal_pending_state(state, task))) {
+ ret = -EINTR;
+ goto err;
+ }
+
+ if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
+ ret = __mutex_lock_check_stamp(lock, ww_ctx);
+ if (ret)
+ goto err;
+ }
+
+ __set_task_state(task, state);
+
+ /* didn't get the lock, go to sleep: */
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ backport_schedule_preempt_disabled();
+ spin_lock_mutex(&lock->wait_lock, flags);
+ }
+
+done:
+ lock_acquired(&lock->dep_map, ip);
+ /* got the lock - rejoice! */
+ mutex_remove_waiter(lock, &waiter, current_thread_info());
+ mutex_set_owner(lock);
+
+ if (!__builtin_constant_p(ww_ctx == NULL)) {
+ struct ww_mutex *ww = container_of(lock,
+ struct ww_mutex,
+ base);
+ struct mutex_waiter *cur;
+
+ /*
+ * This branch gets optimized out for the common case,
+ * and is only important for ww_mutex_lock.
+ */
+
+ ww_mutex_lock_acquired(ww, ww_ctx);
+ ww->ctx = ww_ctx;
+
+ /*
+ * Give any possible sleeping processes the chance to wake up,
+ * so they can recheck if they have to back off.
+ */
+ list_for_each_entry(cur, &lock->wait_list, list) {
+ /* debug_mutex_wake_waiter(lock, cur); */
+ wake_up_process(cur->task);
+ }
+ }
+
+ /* set it to 0 if there are no waiters left: */
+ if (likely(list_empty(&lock->wait_list)))
+ atomic_set(&lock->count, 0);
+
+ spin_unlock_mutex(&lock->wait_lock, flags);
+
+ /* debug_mutex_free_waiter(&waiter); */
+ preempt_enable();
+
+ return 0;
+
+err:
+ mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+ spin_unlock_mutex(&lock->wait_lock, flags);
+ /* debug_mutex_free_waiter(&waiter); */
+ mutex_release(&lock->dep_map, 1, ip);
+ preempt_enable();
+ return ret;
+}
+
+static noinline int __sched
+__ww_mutex_lock_slowpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+static noinline int __sched
+__ww_mutex_lock_interruptible_slowpath(struct ww_mutex *lock,
+ struct ww_acquire_ctx *ctx)
+{
+ return __backport_mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+ NULL, _RET_IP_, ctx);
+}
+
+/**
+ * __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ * from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ *
+ * For backporting purposes we can't use the older kernel's
+ * __mutex_fastpath_lock_retval() since upon failure of a fastpath
+ * lock we want to call our failure routine with more than one argument, in
+ * this case the context for ww mutexes. Refer to commit a41b56ef for the
+ * argument increase. It'd be painful to backport all asm code for the
+ * supported architectures so instead let's penalize the backport ww mutex
+ * fastpath lock with the not so efficient generic atomic_dec_return()
+ * implementation.
+ *
+ * Change the count from 1 to a value lower than 1. This function returns 0
+ * if the fastpath succeeds, or -1 otherwise.
+ */
+static inline int
+__backport_mutex_fastpath_lock_retval(atomic_t *count)
+{
+ if (unlikely(atomic_dec_return(count) < 0))
+ return -1;
+ return 0;
+}
+
+int __sched
+__ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock);
+
+int __sched
+__ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
+{
+ int ret;
+
+ might_sleep();
+
+ ret = __backport_mutex_fastpath_lock_retval(&lock->base.count);
+
+ if (likely(!ret)) {
+ ww_mutex_set_context_fastpath(lock, ctx);
+ mutex_set_owner(&lock->base);
+ } else
+ ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
--
1.7.10.4
On Tegra SoCs, the IOMMU can set mapping attributes for each page,
for example READable and WRITEable. We'd now like to use this
feature for robustness as well, so that read-only pages are
protected from being overwritten by a wrong operation. We have been
using the IOMMU via the (ARM) DMA mapping API, which currently
doesn't use the "prot" parameter when calling the IOMMU API. So this
series tries to pass "struct dma_attrs *attrs" via "int prot" in the
IOMMU API. I'm not so sure whether this implementation is right,
because:
- Casting (struct dma_attrs *) to (int) in the DMA API doesn't look nice.
- The IOMMU layer needs to cast (int) back to (struct dma_attrs *)
again, which can be considered a layering violation.
Any alternative implementations or suggestions would be really helpful.
This series doesn't apply cleanly; it is posted to request comments.
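To make the casts in question concrete, the idea boils down to roughly the
sketch below. This is illustrative only, not code from the series:
arm_iommu_map_sketch() and tegra_smmu_writable() are made-up stand-ins for
the ARM dma-mapping and tegra-smmu paths, and DMA_ATTR_READ_ONLY is the
attribute the first patch of this series introduces.

#include <linux/dma-attrs.h>
#include <linux/iommu.h>

/*
 * DMA mapping side: smuggle the dma_attrs pointer through the IOMMU API's
 * "int prot" argument. The (int) cast is exactly the ugliness mentioned
 * above; it only works because this is 32-bit ARM.
 */
static int arm_iommu_map_sketch(struct iommu_domain *domain,
				unsigned long iova, phys_addr_t phys,
				size_t len, struct dma_attrs *attrs)
{
	return iommu_map(domain, iova, phys, len, (int)attrs);
}

/*
 * IOMMU driver side: cast "prot" back to the attrs pointer and honour the
 * proposed DMA_ATTR_READ_ONLY attribute when deciding page writability.
 */
static bool tegra_smmu_writable(int prot)
{
	struct dma_attrs *attrs = (struct dma_attrs *)prot;

	return !dma_get_attr(DMA_ATTR_READ_ONLY, attrs);
}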
Hiroshi Doyu (3):
common: DMA-mapping: add DMA_ATTR_READ_ONLY attribute
ARM: dma-mapping: Pass DMA attrs as IOMMU prot
iommu/tegra: smmu: Support read-only mapping
arch/arm/mm/dma-mapping.c | 34 +++++++++++++++++++++-------------
drivers/iommu/tegra-smmu.c | 41 +++++++++++++++++++++++++++++------------
include/linux/dma-attrs.h | 1 +
3 files changed, 51 insertions(+), 25 deletions(-)
--
1.8.1.5