Hello Everyone,
This is another attempt to finally make the Exynos SYSMMU driver fully
integrated with the DMA-mapping subsystem. The main change from the previous
version is a rebase onto the latest "automatic DMA configuration for IOMMU
masters" patches from Will Deacon.
This patchset demonstrates that Will's proposal works fine and
significantly simplifies the driver code.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changelog:
v3:
- rebased onto "[RFC PATCH v4 0/8] Introduce automatic DMA
configuration for IOMMU masters"
- added some minor fixes for iommu and dma-mapping frameworks
v2: http://thread.gmane.org/gmane.linux.kernel.iommu/6472/
- rebased onto "[RFC PATCH v3 0/7] Introduce automatic DMA
configuration for IOMMU masters" patches:
http://www.spinics.net/lists/arm-kernel/msg362076.html
- changed initialization from bus notifiers to DT related callbacks
- removed support for separate IO address spaces - this will be
discussed separately after the basic support gets merged
- removed support for power domain notifier-based runtime power
management - this also will be discussed separately later
v1: https://lkml.org/lkml/2014/8/5/183
- initial version, feature complete, completely rewrote integration
approach
Patch summary:
Marek Szyprowski (19):
iommu: fix const qualifier in of_iommu_set_ops
iommu: fix initialization without 'add_device' callback
arm: dma-mapping: add missing check for iommu
drm: exynos: detach from default dma-mapping domain on init
arm: exynos: pm_domains: add support for devices registered before
arch_initcall
ARM: dts: exynos4: add sysmmu nodes
iommu: exynos: don't read version register on every tlb operation
iommu: exynos: remove unused functions
iommu: exynos: remove useless spinlock
iommu: exynos: refactor function parameters to simplify code
iommu: exynos: remove unused functions, part 2
iommu: exynos: remove useless device_add/remove callbacks
iommu: exynos: add support for binding more than one sysmmu to master
device
iommu: exynos: add support for runtime_pm
iommu: exynos: rename variables to reflect their purpose
iommu: exynos: document internal structures
iommu: exynos: remove excessive includes and sort others
alphabetically
iommu: exynos: init from dt-specific callback instead of initcall
iommu: exynos: add callback for initializing devices from device tree
arch/arm/boot/dts/exynos4.dtsi | 117 +++++++
arch/arm/boot/dts/exynos4210.dtsi | 23 ++
arch/arm/boot/dts/exynos4x12.dtsi | 82 +++++
arch/arm/mach-exynos/pm_domains.c | 9 +-
arch/arm/mm/dma-mapping.c | 2 +-
drivers/gpu/drm/exynos/exynos_drm_iommu.c | 3 +
drivers/iommu/exynos-iommu.c | 490 ++++++++++++++----------------
drivers/iommu/iommu.c | 2 +-
include/linux/of_iommu.h | 4 +-
9 files changed, 459 insertions(+), 273 deletions(-)
--
1.9.2
Hello Dave,
This patch enables the last big hardware feature of my driver: the
connector for panels.
Like HDMI and HDA, Digital Video Out (DVO) creates bridge, encoder
and connector drm objects.
The following changes since commit 4e0cd68115620bc3236ff4e58e4c073948629b41:
drm: sti: fix module compilation issue (2014-12-15 17:07:57 +1000)
are available in the git repository at:
http://git.linaro.org/people/benjamin.gaignard/kernel.git drm-sti-next-add-dvo
for you to fetch changes up to f32c4c506f9b197f24d4be4ee7283bd549e3a30f:
drm: sti: add DVO output connector (2014-12-30 15:08:16 +0100)
----------------------------------------------------------------
Benjamin Gaignard (1):
drm: sti: add DVO output connector
.../devicetree/bindings/gpu/st,stih4xx.txt | 29 ++
drivers/gpu/drm/sti/Makefile | 4 +
drivers/gpu/drm/sti/sti_awg_utils.c | 184 +++++++
drivers/gpu/drm/sti/sti_awg_utils.h | 34 ++
drivers/gpu/drm/sti/sti_dvo.c | 551 +++++++++++++++++++++
drivers/gpu/drm/sti/sti_tvout.c | 118 +++++
6 files changed, 920 insertions(+)
create mode 100644 drivers/gpu/drm/sti/sti_awg_utils.c
create mode 100644 drivers/gpu/drm/sti/sti_awg_utils.h
create mode 100644 drivers/gpu/drm/sti/sti_dvo.c
Hi,
Why:
====
While sharing buffers using dma-buf, currently there's no mechanism to let
devices share their memory access constraints with each other to allow for
delayed allocation of backing storage.
This RFC attempts to introduce the idea of memory constraints of a device,
and how these constraints can be shared and used to help allocate buffers that
can satisfy requirements of all devices attached to a particular dma-buf.
How:
====
A constraints_mask is added to dma_parms of the device, and at the time of
each device attachment to a dma-buf, the dma-buf uses this constraints_mask
to calculate the access_mask for the dma-buf.
Allocators can be defined for each of these constraints_masks, and then helper
functions can be used to allocate the backing storage from the matching
allocator satisfying the constraints of all devices interested.
A new miscdevice, /dev/cenalloc [1] is created, which acts as the dma-buf
exporter to make this transparent to the devices.
More details in the patch description of "cenalloc: Constraint-Enabled
Allocation helpers for dma-buf".
At present, the constraints_mask is only a bitmask, but it should be possible to
change it to a struct and adapt the constraints_mask calculation accordingly,
based on discussion.
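To make the attach-time calculation concrete, here is a minimal userspace
sketch. It is not the code from the patches: it assumes that each bit in a
constraints_mask names an allocator type the device can work with, and that
the dma-buf's access_mask is the intersection (bitwise AND) of the masks of
all attached devices. The flag names, struct shared_buf and the helper
functions are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical constraint bits; names are illustrative, not from the patches. */
#define CONS_SYSTEM (1u << 0) /* any system pages are acceptable */
#define CONS_CONTIG (1u << 1) /* physically contiguous memory is acceptable */

struct shared_buf {
	uint32_t access_mask; /* allocator types acceptable to every attached device */
	int allocator;        /* bit index of the chosen allocator, -1 until first map */
};

static void buf_init(struct shared_buf *buf)
{
	buf->access_mask = UINT32_MAX; /* nothing attached yet, so no restriction */
	buf->allocator = -1;
}

/* Fold one device's constraints_mask into the buffer. Returns 0 or -1. */
static int buf_attach(struct shared_buf *buf, uint32_t constraints_mask)
{
	uint32_t merged;

	/* Once storage exists, a late device must accept what was chosen. */
	if (buf->allocator >= 0 && !(constraints_mask & (1u << buf->allocator)))
		return -1;

	merged = buf->access_mask & constraints_mask;
	if (merged == 0)
		return -1; /* no allocator type satisfies all devices */

	buf->access_mask = merged;
	return 0;
}

/* Delayed allocation: the allocator is picked on the first map only. */
static int buf_map(struct shared_buf *buf)
{
	if (buf->allocator < 0) {
		for (int bit = 0; bit < 32; bit++) {
			if (buf->access_mask & (1u << bit)) {
				buf->allocator = bit; /* lowest acceptable bit wins */
				break;
			}
		}
	}
	return buf->allocator;
}
```

In this model, a device that attaches after the first buf_map() can no longer
influence the choice of allocator, which is why the ordering requirement
described in the next section matters.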
Important requirement:
======================
Of course, delayed allocation can only work if all participating devices
will wait for other devices to have 'attached' before mapping the buffer
for the first time.
As of now, users of dma-buf (drm prime, v4l2 etc.) call attach() and then
map_attachment() almost immediately after it. This would need to be changed if
they were to benefit from constraints.
What 'cenalloc' is not:
=======================
- not 'general' allocator helpers - useful only for constraints-enabled
devices that share buffers with others using dma-buf.
- not a replacement for existing allocation mechanisms inside various
subsystems; merely a possible alternative.
- no page-migration - it would be very complementary to the delayed allocation
suggested here.
TODOs:
======
- demonstration test cases
- vma helpers for allocators
- more sample allocators
- userspace ioctl (It should be a simple one, and we have one ready, but wanted
to agree on the kernel side of things first)
May the brickbats begin, please! :)
Best regards,
~Sumit.
[1]: 'C'onstraints 'EN'abled 'ALLOC'ation helpers = cenalloc: it might not be a
very appealing name, so suggestions are very welcome!
Benjamin Gaignard (1):
cenalloc: a sample allocator for contiguous page allocation
Sumit Semwal (3):
dma-buf: Add constraints sharing information
cenalloc: Constraint-Enabled Allocation helpers for dma-buf
cenalloc: Build files for constraint-enabled allocation helpers
MAINTAINERS | 1 +
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/cenalloc/Kconfig | 8 +
drivers/cenalloc/Makefile | 3 +
drivers/cenalloc/cenalloc.c | 597 ++++++++++++++++++++++++++++++
drivers/cenalloc/cenalloc.h | 99 +++++
drivers/cenalloc/cenalloc_priv.h | 188 ++++++++++
drivers/cenalloc/cenalloc_system_contig.c | 225 +++++++++++
drivers/dma-buf/dma-buf.c | 50 ++-
include/linux/device.h | 7 +-
include/linux/dma-buf.h | 14 +
12 files changed, 1189 insertions(+), 6 deletions(-)
create mode 100644 drivers/cenalloc/Kconfig
create mode 100644 drivers/cenalloc/Makefile
create mode 100644 drivers/cenalloc/cenalloc.c
create mode 100644 drivers/cenalloc/cenalloc.h
create mode 100644 drivers/cenalloc/cenalloc_priv.h
create mode 100644 drivers/cenalloc/cenalloc_system_contig.c
--
1.9.1
Hi,
I need some pointers on how to write an IOMMU driver on arm64 for a new SoC.
The iommu interface is currently not being provided in dma-mapping.c (the DMA API is SWIOTLB-based).
What is the alternative?
thanks
Jorge
Hi,
I'm adding more people to CC who have commented on Wayland dmabuf
attempts before and might be interested.
If you want to keep updated without monitoring wayland-devel@, you
could subscribe to
https://bugs.freedesktop.org/show_bug.cgi?id=83881
Thanks,
pq
On Fri, 12 Dec 2014 16:51:01 -0500
Louis-Francis Ratté-Boulianne <lfrb(a)collabora.com> wrote:
> This series of patches by Pekka and George contains an experimental
> implementation of "zlinux_dmabuf" protocol.
>
> See these links for more information about the design and failed previous
> attempts:
>
> http://lists.freedesktop.org/archives/wayland-devel/2014-June/015362.html
> https://bugs.freedesktop.org/show_bug.cgi?id=83881
>
> This protocol allows clients to wrap a dmabuf into a wl_buffer, and push that
> into the compositor for display. The compositor then uses GBM to import that
> wl_buffer as a bo for compositing with GL (via EGLImage) or direct scanout as a
> DRM FB object.
>
> Note that a round-trip is needed when creating the buffer because there is no
> way to communicate all dmabuf constraints from the compositor to a client
> before-hand. They are simply too complicated and changing to be described in
> a Wayland protocol extension. In fact, there are existing APIs like EGL (the
> dmabuf import extension) that are based on trial-and-error rather than knowing
> the constraints before-hand.
>
> However, the protocol design is far, far from complete. Even the kernel
> developers are still discussing how cross-device support for dmabufs should
> work, and what the proper information split between the kernel and user
> space is. How do you communicate, or even describe, things like tiling formats?
>
> But, we should start from somewhere, and this patch is pushing the user space
> part forward a bit.
>
> The extension is especially useful for video players and this pipeline has been
> demonstrated to work with GStreamer:
>
> http://cgit.collabora.com/git/user/gkiagia/gst-plugins-bad.git/commit/?h=in…
>
>
> George Kiagiadakis (1):
> clients: add simple-dmabuf client
>
> Pekka Paalanen (6):
> protocol: add linux_dmabuf extension RFCv1
> dmabuf: implement linux_dmabuf extension
> gl-renderer: add dmabuf import
> compositor-x11: init linux_dmabuf support
> compositor-drm: init linux_dmabuf support
> compositor-drm: dmabuf GBM import
>
> .gitignore | 1 +
> Makefile.am | 24 +-
> clients/simple-dmabuf.c | 578 ++++++++++++++++++++++++++++++++++++++++++++++
> configure.ac | 9 +
> protocol/linux-dmabuf.xml | 224 ++++++++++++++++++
> src/compositor-drm.c | 31 ++-
> src/compositor-x11.c | 5 +
> src/gl-renderer.c | 103 +++++++++
> src/linux-dmabuf.c | 322 ++++++++++++++++++++++++++
> src/linux-dmabuf.h | 45 ++++
> 10 files changed, 1336 insertions(+), 6 deletions(-)
> create mode 100644 clients/simple-dmabuf.c
> create mode 100644 protocol/linux-dmabuf.xml
> create mode 100644 src/linux-dmabuf.c
> create mode 100644 src/linux-dmabuf.h
>
This series of patches fixes various issues in the STI drm driver.
The HDMI i2c adapter can now be selected in the device tree,
and plug detection doesn't use a gpio anymore.
I also had to fix some signal timing problems after testing the driver
on more hardware.
The remaining patches attempt to simplify the code and prepare
for the next evolutions, like DVO and auxiliary CRTC support.
The changes can be fetched here:
http://git.linaro.org/people/benjamin.gaignard/kernel.git
on the drm-sti-fixes-2014-12-04 branch
Benjamin Gaignard (9):
drm: sti: allow to change hdmi ddc i2c adapter
drm: sti: remove gpio for HDMI hot plug detection
drm: sti: clear all mixer control
drm: sti: simplify gdp code
drm: sti: remove event lock while disabling vblank
drm: sti: fix hdmi avi infoframe
drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
drm: sti: prepare sti_tvout to support auxiliary crtc
drm: sti: fix delay in VTG programming
.../devicetree/bindings/gpu/st,stih4xx.txt | 3 +-
drivers/gpu/drm/sti/sti_drm_crtc.c | 10 +--
drivers/gpu/drm/sti/sti_gdp.c | 39 ++++-----
drivers/gpu/drm/sti/sti_hdmi.c | 84 +++++++++++---------
drivers/gpu/drm/sti/sti_hdmi.h | 6 +-
drivers/gpu/drm/sti/sti_mixer.c | 9 +++
drivers/gpu/drm/sti/sti_mixer.h | 1 +
drivers/gpu/drm/sti/sti_tvout.c | 92 ++++++++++++----------
drivers/gpu/drm/sti/sti_vtg.c | 25 +++++-
9 files changed, 160 insertions(+), 109 deletions(-)
--
1.9.1
Hello Everyone,
This is yet another attempt to finally make the Exynos SYSMMU driver fully
integrated with the DMA-mapping subsystem.
Previous approach is available here: https://lkml.org/lkml/2014/8/5/183
In the meantime, there has been a discussion about the way the iommu driver
should be integrated with the dma-mapping subsystem, which resulted in the "[RFC
PATCH v3 0/7] Introduce automatic DMA configuration for IOMMU masters"
patches prepared by Will Deacon:
http://www.spinics.net/lists/arm-kernel/msg362076.html
Those patches removed the need to use bus-specific notifiers for
initialization.
Main changes since previous version of my patches:
1. rebased onto "[RFC PATCH v3 0/7] Introduce automatic DMA
configuration for IOMMU masters" patches, changed initialization from
bus notifiers to DT related callbacks
2. removed support for separate IO address spaces - this will be
discussed separately after the basic support gets merged
3. removed support for power domain notifier-based runtime power
management - this also will be discussed separately later
I hope that the driver with the above changes will be easier to get merged
into v3.18.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Patch summary:
Marek Szyprowski (18):
arm: dma-mapping: arm_iommu_attach_device: automatically set
max_seg_size
arm: exynos: bind power domains earlier, on device creation
drm: exynos: detach from default dma-mapping domain on init
clk: exynos: add missing smmu_g2d clock and update comments
ARM: DTS: Exynos4: add System MMU nodes
iommu: exynos: don't read version register on every tlb operation
iommu: exynos: remove unused functions
iommu: exynos: remove useless spinlock
iommu: exynos: refactor function parameters to simplify code
iommu: exynos: remove unused functions, part 2
iommu: exynos: remove useless device_add/remove callbacks
iommu: exynos: add support for binding more than one sysmmu to master
device
iommu: exynos: add support for runtime_pm
iommu: exynos: rename variables to reflect their purpose
iommu: exynos: document internal structures
iommu: exynos: remove excessive includes and sort others
alphabetically
iommu: exynos: init from dt-specific callback instead of initcall
iommu: exynos: add callback for initializing devices from device tree
arch/arm/boot/dts/exynos4.dtsi | 117 +++++++
arch/arm/boot/dts/exynos4210.dtsi | 23 ++
arch/arm/boot/dts/exynos4x12.dtsi | 82 +++++
arch/arm/mach-exynos/pm_domains.c | 12 +-
arch/arm/mm/dma-mapping.c | 16 +
drivers/clk/samsung/clk-exynos4.c | 1 +
drivers/gpu/drm/exynos/exynos_drm_iommu.c | 3 +
drivers/iommu/exynos-iommu.c | 494 ++++++++++++++----------------
include/dt-bindings/clock/exynos4.h | 10 +-
9 files changed, 483 insertions(+), 275 deletions(-)
--
1.9.2
Hello,
This is another approach to finish support for reserved memory regions
defined in device tree. Previous attempts
(http://lists.linaro.org/pipermail/linaro-mm-sig/2014-February/003738.html
and https://lkml.org/lkml/2014/7/14/108) ended in merging parts of the
code and documentation. The merged patches allow memory to be reserved, but
there are still no reserved memory drivers, nor any code that actually
uses reserved memory regions.
The final conclusion from the above mentioned threads is that there is
no automated reserved memory initialization. All drivers that want to
use reserved memory should initialize it on their own.
This patch series provides two drivers for reserved memory regions (one
based on CMA and one based on the dma_coherent allocator). The main
improvement compared to the previous version is the removal of automated
reserved memory for every device and support for named memory regions.
These patches are for merging, rebased on top of a recent linux-next tree.
Best regards
Marek Szyprowski
Samsung R&D Institute Poland
Changes since v1 (https://lkml.org/lkml/2014/8/26/339):
- removed patches for named reserved regions - they will be discussed
separately
- added a check for 'no-map' property to dma coherent allocator
(suggested by Laura Abbott)
- removed example code for s5p-mfc driver
Changes since '[PATCH v2 RESEND 0/4] CMA & device tree, once again' version:
(https://lkml.org/lkml/2014/7/14/108)
- added return error value to of_reserved_mem_device_init()
- added support for named memory regions (so more than one region can be
defined per device)
- added usage example - converted custom reserved memory code used by
s5p-mfc driver to the generic reserved memory handling code
Patch summary:
Marek Szyprowski (3):
drivers: of: add return value to of_reserved_mem_device_init
drivers: dma-coherent: add initialization from device tree
drivers: dma-contiguous: add initialization from device tree
drivers/base/dma-coherent.c | 145 ++++++++++++++++++++++++++++++++++------
drivers/base/dma-contiguous.c | 71 ++++++++++++++++++++
drivers/of/of_reserved_mem.c | 3 +-
include/linux/cma.h | 3 +
include/linux/of_reserved_mem.h | 9 ++-
mm/cma.c | 62 ++++++++++++++---
6 files changed, 259 insertions(+), 34 deletions(-)
--
1.9.2
Hi All,
I lost the only Android tablet that I had to hack around on, and I am in the market
for a new Android tablet. It needs to be something I can buy personally, so Juno /
Vexpress is out the window.
What would you recommend I buy? Below are specific use cases
i. Work on NEON optimizations for Audio libraries (Vorbis, FLAC etc) in
**Android** context.
ii. So, would obviously like to have a reference Android baseline build
that is actively used in Linaro.
Is anyone in LMG using builds available for Nexus10
<http://releases.linaro.org/14.09/android/nexus10> and Nexus7
<http://releases.linaro.org/14.09/android/nexus7-2013> actively right now?
How official are these builds?
Regards,
Vish (Viswanath Puttagunta)
Cell: 972-342-0205
Technical Program Manager
Member Services, Linaro
Hi All,
Does anyone have instructions to install Ubuntu natively on a Samsung
Chromebook2 that worked for you?
Regards,
Vish (Viswanath Puttagunta)
Cell: 972-342-0205
Technical Program Manager
Member Services, Linaro
Hi,
I wanted to know about the impact of changing PAGE_ALLOC_COSTLY_ORDER value from 3 to 2.
This macro is defined in include/linux/mmzone.h
#define PAGE_ALLOC_COSTLY_ORDER 3
As far as I know, this value should never be changed irrespective of the type of the system.
Is it good to change this value for RAM size: 512MB, 256MB or 128MB?
If anybody has changed this value and experienced any kind of problem or benefit, please let us know.
We noticed that for one of the Android products with 512MB RAM, the PAGE_ALLOC_COSTLY_ORDER was set to 2.
We could not figure out why this value was decreased from 3 to 2.
As per my analysis, I observed that kmalloc fails a little earlier if we change this value to 2.
This is also visible from the _slowpath_ in page_alloc.c
Apart from this we could not find any other impact.
If anybody is aware of any other impact, please let us know.
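For reference, the "fails earlier" effect can be illustrated with a small
userspace model of the retry decision in the allocator slow path. This is a
simplified sketch written from memory of 3.x-era kernels, not a copy of
mm/page_alloc.c: struct alloc_req and should_retry() are hypothetical names,
and the three booleans only stand in for __GFP_NORETRY, __GFP_NOFAIL and
__GFP_REPEAT. Please check it against the actual source of the kernel in use.

```c
#include <stdbool.h>

/* Simplified allocation request; the booleans stand in for GFP flags. */
struct alloc_req {
	unsigned int order; /* allocation size is 2^order pages */
	bool noretry;       /* stand-in for __GFP_NORETRY */
	bool nofail;        /* stand-in for __GFP_NOFAIL */
	bool repeat;        /* stand-in for __GFP_REPEAT */
};

/*
 * Decide whether the slow path keeps looping after reclaim made no
 * usable progress. costly_order plays the role of PAGE_ALLOC_COSTLY_ORDER.
 */
static bool should_retry(const struct alloc_req *req,
			 unsigned int costly_order,
			 unsigned long pages_reclaimed)
{
	if (req->noretry)
		return false;
	if (req->nofail)
		return true;
	/* Orders up to the costly order are retried without a limit. */
	if (req->order <= costly_order)
		return true;
	/* Costlier orders retry only while reclaim has freed too little. */
	if (req->repeat && pages_reclaimed < (1UL << req->order))
		return true;
	return false;
}
```

In this model, an order-3 request (32KB with 4KB pages) is retried without
limit when the threshold is 3, but gives up after the first failed pass when
the threshold is 2, unless the caller asked for repeated attempts.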
Thank you!
Regards,
Pintu Kumar
On Sun, Sep 14, 2014 at 12:36:43PM +0200, Christian König wrote:
> Yeah, right. Providing the fd to reassign to a fence would indeed reduce the
> create/close overhead.
>
> But it would still be more overhead than for example a simple on demand
> growing ring buffer which then uses 64bit sequence numbers in userspace to
> refer to a fence in the kernel.
>
> Apart from that I'm pretty sure that when we do the syncing completely in
> userspace we need more fences open at the same time than fds are available
> by default.
If you do the syncing completely in userspace you don't need kernel fences
at all. Kernel fences are only required if you sync with a different
process (where the pure userspace syncing might not work out) or with
different devices.
tbh I don't see any use-case at all where you'd need 10k such fences. That
means your driver gets to deal with 2 kinds of fences, but so be it. Since
not using fds for cross-device or cross-process syncing imo just doesn't
make sense, so that one pretty much will have to stick.
> As long as our internal handle or sequence based fence are easily
> convertible to a fence fd I actually don't really see a problem with that.
> Going to hack that approach into my prototype and then we can see how bad
> the code looks after all.
My plan for i915 is to start out with fd fences only, and once we have
some clarity on the exact requirements probably add some pure
userspace-controlled fences for tightly coupled stuff. Those might be
fully internal to the opencl userspace driver though and never get out of
there, ever.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Fri, 12 Sep 2014 18:08:23 +0200
Christian König <christian.koenig(a)amd.com> wrote:
> > As Daniel said using fd is most likely the way we want to do it but this
> > remains vague.
> Separating the discussion if it should be an fd or not. Using an fd
> sounds fine to me in general, but I have some concerns as well.
>
> For example what was the maximum number of opened FDs per process again?
> Could that become a problem? etc...
You can check out the i915 patches I posted if you want to see
examples. Max fds may be an issue if userspace doesn't clean up its
fences. The implementation is pretty easy with the stuff Maarten has
done recently.
The changes I still need to make to mine:
- sit on top of Chris's request/seqno changes (driver internals
really)
- switch over to execbuf as the main API on the render side (like
you're doing)
- add support for display and other timelines
As far as compat goes, I don't think it should be too hard. Even with
GPU scheduling, a given context's buffers should all be in-order with
respect to one another, so we ought to be able to mix & match clients
using explicit fencing and implicit fencing. Though in Mesa I still
haven't looked at how to handle server vs client side arb_sync with the
scheduler and explicit fencing in place; might need some extra work
there...
--
Jesse Barnes, Intel Open Source Technology Center
On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
> >On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
> >>Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
> >>>On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
> >>>>On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse(a)gmail.com> wrote:
> >>>>>On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> >>>>>>On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel(a)ffwll.ch> wrote:
> >>>>>>>On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >>>>>>>>Hello everyone,
> >>>>>>>>
> >>>>>>>>to allow concurrent buffer access by different engines beyond the multiple
> >>>>>>>>readers/single writer model that we currently use in radeon and other
> >>>>>>>>drivers we need some kind of synchronization object exposed to userspace.
> >>>>>>>>
> >>>>>>>>My initial patch set for this used (or rather abused) zero sized GEM buffers
> >>>>>>>>as fence handles. This obviously isn't the best way of doing this (too
> >>>>>>>>much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >>>>>>>>
> >>>>>>>>So what should a driver expose instead? Android sync points? Something else?
> >>>>>>>I think actually exposing the struct fence objects as a fd, using android
> >>>>>>>syncpts (or at least something compatible to it) is the way to go. Problem
> >>>>>>>is that it's super-hard to get the android guys out of hiding for this :(
> >>>>>>>
> >>>>>>>Adding a bunch of people in the hopes that something sticks.
> >>>>>>More people.
> >>>>>Just to re-iterate, exposing such thing while still using command stream
> >>>>>ioctl that use implicit synchronization is a waste and you can only get
> >>>>>the lowest common denominator which is implicit synchronization. So i do
> >>>>>not see the point of such api if you are not also adding a new cs ioctl
> >>>>>with explicit contract that it does not do any kind of synchronization
> >>>>>(it could be almost the exact same code modulo the do not wait for
> >>>>>previous cmd to complete).
> >>>>Our thinking was to allow explicit sync from a single process, but
> >>>>implicitly sync between processes.
> >>>This is a BIG NAK if you are using the same ioctl as it would mean you are
> >>>changing userspace API, well at least userspace expectation. Adding a new
> >>>cs flag might do the trick but it should not be about inter-process, or any
> >>>thing special, it's just implicit sync or no synchronization. Converting
> >>>userspace is not that much of a big deal either, it can be broken into
> >>>several step. Like mesa use explicit synchronization all time but ddx use
> >>>implicit.
> >>The thinking here is that we need to be backward compatible for DRI2/3 and
> >>support all kind of different use cases like old DDX and new Mesa, or old
> >>Mesa and new DDX etc...
> >>
> >>So for my prototype if the kernel sees any access of a BO from two different
> >>clients it falls back to the old behavior of implicit synchronization of
> >>access to the same buffer object. That might not be the fastest approach,
> >>but is as far as I can see conservative and so should work under all
> >>conditions.
> >>
> >>Apart from that the planning so far was that we just hide this feature
> >>behind a couple of command submission flags and new chunks.
> >Just to reproduce IRC discussion, i think it's a lot simpler and not that
> >complex. For explicit cs ioctl you do not wait for any previous fence of
> >any of the buffer referenced in the cs ioctl, but you still associate a
> >new fence with all the buffer object referenced in the cs ioctl. So if the
> >next ioctl is an implicit sync ioctl it will wait properly and synchronize
> >properly with previous explicit cs ioctl. Hence you can easily have a mix
> >in userspace thing is you only get benefit once enough of your userspace
> >is using explicit.
>
> Yes, that's exactly what my patches currently implement.
>
> The only difference is that by current planning I implemented it as a per BO
> flag for the command submission, but that was just for testing. Having a
> single flag to switch between implicit and explicit synchronization for
> whole CS IOCTL would do equally well.
Doing it per BO sounds bogus to me. But otherwise yes we are in agreement.
As Daniel said using fd is most likely the way we want to do it but this
remains vague.
>
> >Note that you still need a way to have explicit cs ioctl to wait on a
> >previous "explicit" fence so you need some api to expose fence per cs
> >submission.
>
> Exactly, that's what this mail thread is all about.
>
> As Daniel correctly noted you need something like a functionality to get a
> fence as the result of a command submission as well as pass in a list of
> fences to wait for before beginning a command submission.
>
> At least it looks like we are all on the same general line here, its just
> nobody has a good idea how the details should look like.
>
> Regards,
> Christian.
>
> >
> >Cheers,
> >Jérôme
> >
> >>Regards,
> >>Christian.
> >>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>Alex
> >>>>
> >>>>>Also one thing that the Android sync point does not have, AFAICT, is a
> >>>>>way to schedule synchronization as part of a cs ioctl so cpu never have
> >>>>>to be involve for cmd stream that deal only one gpu (assuming the driver
> >>>>>and hw can do such trick).
> >>>>>
> >>>>>Cheers,
> >>>>>Jérôme
> >>>>>
> >>>>>>-Daniel
> >>>>>>--
> >>>>>>Daniel Vetter
> >>>>>>Software Engineer, Intel Corporation
> >>>>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>>>_______________________________________________
> >>>>>dri-devel mailing list
> >>>>>dri-devel(a)lists.freedesktop.org
> >>>>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
>
On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
> >On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
> >>On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse(a)gmail.com> wrote:
> >>>On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> >>>>On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel(a)ffwll.ch> wrote:
> >>>>>On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >>>>>>Hello everyone,
> >>>>>>
> >>>>>>to allow concurrent buffer access by different engines beyond the multiple
> >>>>>>readers/single writer model that we currently use in radeon and other
> >>>>>>drivers we need some kind of synchronization object exposed to userspace.
> >>>>>>
> >>>>>>My initial patch set for this used (or rather abused) zero sized GEM buffers
> >>>>>>as fence handles. This obviously isn't the best way of doing this (too
> >>>>>>much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >>>>>>
> >>>>>>So what should a driver expose instead? Android sync points? Something else?
> >>>>>I think actually exposing the struct fence objects as a fd, using android
> >>>>>syncpts (or at least something compatible to it) is the way to go. Problem
> >>>>>is that it's super-hard to get the android guys out of hiding for this :(
> >>>>>
> >>>>>Adding a bunch of people in the hopes that something sticks.
> >>>>More people.
> >>>Just to re-iterate, exposing such thing while still using command stream
> >>>ioctl that use implicit synchronization is a waste and you can only get
> >>>the lowest common denominator which is implicit synchronization. So i do
> >>>not see the point of such api if you are not also adding a new cs ioctl
> >>>with explicit contract that it does not do any kind of synchronization
> >>>(it could be almost the exact same code modulo the do not wait for
> >>>previous cmd to complete).
> >>Our thinking was to allow explicit sync from a single process, but
> >>implicitly sync between processes.
> >This is a BIG NAK if you are using the same ioctl as it would mean you are
> >changing userspace API, well at least userspace expectation. Adding a new
> >cs flag might do the trick but it should not be about inter-process, or any
> >thing special, it's just implicit sync or no synchronization. Converting
> >userspace is not that much of a big deal either, it can be broken into
> >several step. Like mesa use explicit synchronization all time but ddx use
> >implicit.
>
> The thinking here is that we need to be backward compatible for DRI2/3 and
> support all kind of different use cases like old DDX and new Mesa, or old
> Mesa and new DDX etc...
>
> So for my prototype if the kernel sees any access of a BO from two different
> clients it falls back to the old behavior of implicit synchronization of
> access to the same buffer object. That might not be the fastest approach,
> but is as far as I can see conservative and so should work under all
> conditions.
>
> Apart from that the planning so far was that we just hide this feature
> behind a couple of command submission flags and new chunks.
Just to reproduce the IRC discussion, i think it's a lot simpler and not that
complex. For an explicit cs ioctl you do not wait for any previous fence of
any of the buffers referenced in the cs ioctl, but you still associate a
new fence with all the buffer objects referenced in the cs ioctl. So if the
next ioctl is an implicit sync ioctl it will wait properly and synchronize
properly with the previous explicit cs ioctl. Hence you can easily have a mix
in userspace; the thing is, you only get the benefit once enough of your
userspace is using explicit sync.
Note that you still need a way to have an explicit cs ioctl wait on a
previous "explicit" fence, so you need some api to expose a fence per cs
submission.
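The scheme above can be sketched as a small userspace model. This is only an
illustration under stated assumptions, not driver code: it assumes a single
in-order ring, so that a fence is just a 64-bit sequence number and waiting
for the highest one covers all earlier ones. struct bo, struct ring and
cs_submit() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bo {
	uint64_t last_fence; /* seqno of the last submission using this BO, 0 = none */
};

struct ring {
	uint64_t next_seqno; /* seqno handed to the next submission */
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Model of one command submission. Returns the new fence and stores in
 * *wait_for the seqno that must complete before execution (0 = none).
 */
static uint64_t cs_submit(struct ring *ring, struct bo **bos, size_t n_bos,
			  const uint64_t *in_fences, size_t n_in,
			  bool explicit_sync, uint64_t *wait_for)
{
	uint64_t fence = ring->next_seqno++;
	uint64_t wait = 0;

	/* Fences passed in by userspace are always honoured. */
	for (size_t i = 0; i < n_in; i++)
		wait = max_u64(wait, in_fences[i]);

	for (size_t i = 0; i < n_bos; i++) {
		/* Only implicit sync waits on what the BO was last used for. */
		if (!explicit_sync)
			wait = max_u64(wait, bos[i]->last_fence);
		/* Both modes attach the new fence to every referenced BO. */
		bos[i]->last_fence = fence;
	}

	*wait_for = wait;
	return fence;
}
```

Because an explicit submission still updates last_fence, a later implicit
submission on the same buffers picks up the right dependency without any
extra bookkeeping, which is what allows old and new userspace to be mixed.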
Cheers,
Jérôme
>
> Regards,
> Christian.
>
> >
> >Cheers,
> >Jérôme
> >
> >>Alex
> >>
> >>>Also one thing that the Android sync point does not have, AFAICT, is a
> >>>way to schedule synchronization as part of a cs ioctl so cpu never have
> >>>to be involve for cmd stream that deal only one gpu (assuming the driver
> >>>and hw can do such trick).
> >>>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>-Daniel
> >>>>--
> >>>>Daniel Vetter
> >>>>Software Engineer, Intel Corporation
> >>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>_______________________________________________
> >>>dri-devel mailing list
> >>>dri-devel(a)lists.freedesktop.org
> >>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
>
On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> Hello everyone,
>
> to allow concurrent buffer access by different engines beyond the multiple
> readers/single writer model that we currently use in radeon and other
> drivers we need some kind of synchronization object exposed to userspace.
>
> My initial patch set for this used (or rather abused) zero sized GEM buffers
> as fence handles. This obviously isn't the best way of doing this (too
> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>
> So what should a driver expose instead? Android sync points? Something else?
I think actually exposing the struct fence objects as an fd, using android
syncpts (or at least something compatible with it), is the way to go. Problem
is that it's super-hard to get the android guys out of hiding for this :(
Adding a bunch of people in the hopes that something sticks.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hello Everyone,
A lot has happened in the area of improving the Exynos IOMMU
driver and in the discussion about generic IOMMU bindings, which finally
motivated me to get back to IOMMU related tasks. As a reminder, here
are the 2 important threads:
1. [PATCH v13 00/19] iommu/exynos: Fixes and Enhancements of System MMU
driver with DT: https://lkml.org/lkml/2014/5/12/34
2. [PATCH v4] devicetree: Add generic IOMMU device tree bindings:
https://lkml.org/lkml/2014/7/4/349
As a follow-up to those discussions I've decided to finish our internal
code, which adapts the Exynos SYSMMU driver to meet the generic IOMMU
bindings requirements and implements all the glue code needed to finally
demonstrate seamless integration of the IOMMU controller with the
DMA-mapping subsystem for the drivers available on Exynos SoCs.
1. Introduction - a few words for those who are not fully aware of the
Exynos SoC hardware
An Exynos SoC consists of various devices integrated directly into the SoC.
Most of them are multimedia devices, which usually process large
buffers. Some of them (e.g. MFC, a multimedia codec, or FIMD, a
multi-window framebuffer device & lcd panel controller) are equipped
with more than one memory interface for higher processing performance.
There are also really complex subsystems (like ISP, the camera sensor
interface & processor), which consist of many sub-blocks, each having
its own memory interface/channel/bus (different names are used for the
same thing).
Each such memory interface might be equipped with a SYSMMU device, which
acts as the IOMMU controller for the parent device (called the master
device, i.e. the device which that memory interface belongs to). Each
SYSMMU controller has its own register set and clock, and belongs to the
same power domain as the master device. There is also an indirect
dependency on the master device's gate clock: SYSMMU registers can be
accessed only when the master's gate clock is enabled.
Basically we have the following dependencies between hardware and drivers:
- each multimedia device might have 1 or more SYSMMU controllers
- each SYSMMU controller belongs to only 1 master device
- all SYSMMU controllers are independent of each other, there is no
global hardware ID that must be assigned to enable a given SYSMMU
controller
- multimedia devices are usually modeled by a separate node in the device
tree, each with its own compatible string and a separate driver
- sub-blocks of complex devices are currently not modeled by separate
device tree nodes, but this might change in the future
- some multimedia devices have a limited address space per memory
controller/channel (e.g. a codec might access buffers only in a 256MiB
window for each of its memory channels)
- some drivers for independent devices are used together to provide a
more complex subsystem, e.g. FIMD, HDMI-mixer and others together form the
Exynos DRM subsystem; it is highly desirable to let them operate in
the same, shared DMA address space to simplify buffer sharing
2. Introduction part 2 - a short summary of the discussions about
generic IOMMU DT bindings
There have been a lot of discussions on the method of modeling IOMMU
controllers in device tree. The approach which has been selected as the
generic IOMMU binding candidate has been described in the '[PATCH v4]
devicetree: Add generic IOMMU device tree bindings' thread.
Those bindings describe how to link an IOMMU controller with its master
device. Basically an 'iommus' property placed in the master device's
node has been introduced. This property contains a phandle to the IOMMU
controller node. Optional parameters of the particular binding can also
be specified after the phandle, provided that the IOMMU controller node
contains an '#iommu-cells' property, which defines the number of cells
used for those parameters. The parameters are then interpreted by the
particular IOMMU controller driver. They might be a hw channel id
required for correct hardware setup, a base address and size pair for a
limited IO address space window, or other hardware-dependent properties.
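The layout of an 'iommus' property (a phandle followed by as many cells as
the referenced controller's '#iommu-cells' says) can be sketched with a small
userspace C parser. Everything here is an assumption made for illustration:
the property is modeled as a flat array of 32-bit cells, and parse_iommus(),
struct iommu_spec, MAX_ARGS and the example_cells() lookup table are
hypothetical names, not kernel or OF APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS 4  /* arbitrary limit chosen for this sketch */

struct iommu_spec {
    uint32_t phandle;           /* which IOMMU controller node */
    uint32_t args[MAX_ARGS];    /* controller-specific parameters */
    size_t nargs;
};

/* Returns the '#iommu-cells' value of the node behind a phandle, or -1. */
typedef int (*iommu_cells_fn)(uint32_t phandle);

/* Made-up lookup: phandle 1 takes no parameters, phandle 2 takes two. */
static int example_cells(uint32_t phandle)
{
    switch (phandle) {
    case 1: return 0;
    case 2: return 2;   /* e.g. base address and size of an IO window */
    default: return -1;
    }
}

/* Returns the number of entries parsed, or -1 if the property is malformed. */
static int parse_iommus(const uint32_t *cells, size_t ncells,
                        iommu_cells_fn get_cells,
                        struct iommu_spec *out, size_t max_out)
{
    size_t pos = 0;
    int count = 0;

    assert(get_cells != NULL);

    while (pos < ncells) {
        uint32_t phandle = cells[pos++];
        int nargs = get_cells(phandle);

        if (nargs < 0 || nargs > MAX_ARGS)
            return -1;              /* unknown controller or too many cells */
        if ((size_t)nargs > ncells - pos)
            return -1;              /* property ends in the middle of an entry */
        if ((size_t)count >= max_out)
            return -1;              /* caller's output array is full */

        out[count].phandle = phandle;
        out[count].nargs = (size_t)nargs;
        for (int i = 0; i < nargs; i++)
            out[count].args[i] = cells[pos++];
        count++;
    }
    return count;
}
```

The point of the sketch is that the master's node alone does not determine
how the property is split into entries; the cell count always comes from
the controller node that each phandle refers to.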
3. IOMMU integration to DMA-mapping subsystem
By default we assume that each master device which has been equipped
with an IOMMU controller gets its own DMA (IO) address space. This is
created automatically and transparently without any changes in the
device driver code. All DMA-mapping functions are replaced with the
IOMMU aware versions. This has to be done somewhere by the architecture
or SoC startup code, so that when the master driver's probe() function is
called, everything is in place.
However some device drivers might need (for various reasons) to manually
manage the DMA (IO) address space. In this case a driver needs to notify
the kernel about that and manage the DMA address space on its own.
This has been achieved by introducing the DRIVER_HAS_OWN_IOMMU_MANAGER flag,
which can be set in struct device_driver. This way the startup code can
easily determine whether creating the default per-device separate DMA
address space is required for a given driver, without any unneeded
alloc/free call sequences.
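The decision made by the startup code can be sketched as follows. This is
only an illustration under assumptions: the flag name is taken from the text
above, but its bit value, struct driver_model (a stand-in for struct
device_driver) and needs_default_mapping() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag name from the cover letter; the bit value is made up for this sketch. */
#define DRIVER_HAS_OWN_IOMMU_MANAGER (1u << 1)

/* Stand-in for struct device_driver, reduced to the flags field. */
struct driver_model {
    unsigned flags;
};

/*
 * True when the startup code should create the default per-device
 * DMA (IO) address space before the driver's probe() runs.
 */
static bool needs_default_mapping(const struct driver_model *drv, bool has_iommu)
{
    assert(drv != NULL);

    if (!has_iommu)
        return false;   /* nothing to set up without an IOMMU */
    return !(drv->flags & DRIVER_HAS_OWN_IOMMU_MANAGER);
}
```

Because the flag is checked before anything is allocated, a driver that
manages its own address spaces never has a default mapping created and then
immediately torn down.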
4. Linux DMA-mapping subsystem and more than one DMA address space
The DMA-mapping subsystem assumes that there is only one DMA (IO) address
space associated with a given struct device entity. Usually a struct
device maps one-to-one to the node describing the given
device in the device tree. To let a driver access other DMA (IO) address
spaces, a sub-device has been introduced. This approach is already
used by the s5p-mfc driver (drivers/media/platform/s5p-mfc/s5p_mfc.c). The
only question is how and when sub-devices are created.
In the proposed approach, such additional address spaces are named after
the respective IOMMU controllers (iommu-names property in the
master's DT node). To let a driver access an address space, a
sub-device named 'parent_device_name:address_space_name' needs to be
created and added as a child of the master's struct device. A good example
is the codec device, which on the Exynos4412 SoC is instantiated as the
'13400000.codec' device. It has 2 memory interfaces ('left' and
'right'), so sub-devices called '13400000.codec:left' and
'13400000.codec:right' must be created by the driver and added as children
of the '13400000.codec' device. The driver is then allowed to allocate
2 separate dma-mapping address spaces by calling the
arm_iommu_create_mapping() and arm_iommu_attach_device() functions or the
newly introduced helper arm_iommu_create_default_mapping(). For more
details, please refer to the last patch in this series.
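The naming convention for sub-devices can be illustrated with a few lines of
plain C. This is only a sketch of the 'parent_device_name:address_space_name'
rule; subdev_name() is a hypothetical helper and not a function from the
series, which creates real struct device children instead of strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Builds "<parent>:<space>" into buf.
 * Returns 0 on success, -1 if the name does not fit into len bytes.
 */
static int subdev_name(char *buf, size_t len,
                       const char *parent, const char *space)
{
    int n;

    assert(buf != NULL && parent != NULL && space != NULL);

    n = snprintf(buf, len, "%s:%s", parent, space);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the codec example, calling subdev_name() with "13400000.codec" and
"left" or "right" yields the two sub-device names quoted above.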
Exactly the same approach is planned for memory regions and
DMA-mapping implemented on top of the CMA or DMA-coherent memory allocators.
When a driver doesn't declare that it wants to manage its own DMA (IO)
address spaces, a default DMA (IO) address space will be created and all
SYSMMU controllers will be bound to it, so this space will be shared across
the master device's IO channels / memory interfaces. This way IOMMU support
can be added only to the drivers which really benefit from having a
separate IO address space per memory interface, without the need to alter
the other drivers.
Why might a driver need to manage the IO address space on its own? Once
again the codec device on the Exynos4 series is a good example. The memory
interfaces found in that codec device are limited and can each
address only a 256MiB window. If we bind both interfaces to a common address
space, the driver is only able to access memory buffers which fit into a
single 256MiB window. If we use a separate space for each memory interface,
the codec device will be able to access buffers totalling 2*256MiB=512MiB,
which is a significant advantage over the default case of a shared address
space.
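The arithmetic behind this comparison, written out as a tiny C sketch.
The 256MiB window size comes from the text above; usable_space() and the
macro names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ull * 1024ull)
#define WINDOW_SIZE (256 * MiB)     /* per-interface limit of the codec */

/*
 * Total IO address space reachable by a device with the given number of
 * memory interfaces, each limited to one WINDOW_SIZE window.
 */
static uint64_t usable_space(unsigned interfaces, bool separate_spaces)
{
    if (interfaces == 0)
        return 0;
    /* A shared space must fit into one window, whatever the interface count. */
    return separate_spaces ? interfaces * WINDOW_SIZE : WINDOW_SIZE;
}
```

With two interfaces the shared case gives 256MiB and the separate case
gives 512MiB, matching the figures above.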
5. Power management (runtime)
Runtime power management is the trickiest part of the proposed
solution. I assumed it is a sane requirement that, from the
master device driver's point of view, operation without an IOMMU and with
an IOMMU (with the default, per-device mapping) should be exactly the same.
Runtime power management, which is now mainly limited to enabling and
disabling hardware power domains, is done by the master device's driver.
However, from the hardware perspective, there is also a need to save the
SYSMMU context before switching the pm domain off and to restore it after
switching the pm domain on.
To achieve this mode of SYSMMU operation, notifiers for power domains
have been introduced. With such an approach no changes are needed in the
master device's driver, and the SYSMMU driver seamlessly integrates with
the master device's runtime pm operations.
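The save/restore flow driven by power domain notifications can be modeled
in userspace C as below. This is a sketch under assumptions: the event
names, struct sysmmu_model and all functions are hypothetical, the SYSMMU
context is reduced to a single page table base value, and the real event
names from the series are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event names used only in this model. */
enum pd_event {
    PD_EVENT_POWER_OFF_PREPARE,     /* sent before the domain loses power */
    PD_EVENT_POWER_ON_DONE          /* sent after the domain is powered again */
};

struct sysmmu_model {
    uint32_t pgtable_base;  /* stands in for a hardware register */
    uint32_t saved_base;    /* software copy kept while the domain is off */
};

/* Notifier callback: the only place where the SYSMMU context is handled. */
static void sysmmu_pd_notify(struct sysmmu_model *mmu, enum pd_event ev)
{
    assert(mmu != NULL);

    switch (ev) {
    case PD_EVENT_POWER_OFF_PREPARE:
        mmu->saved_base = mmu->pgtable_base;
        break;
    case PD_EVENT_POWER_ON_DONE:
        mmu->pgtable_base = mmu->saved_base;
        break;
    }
}

/* Power domain side: the master's driver only asks for on/off. */
static void domain_power_off(struct sysmmu_model *mmu)
{
    sysmmu_pd_notify(mmu, PD_EVENT_POWER_OFF_PREPARE);
    mmu->pgtable_base = 0;  /* register content is lost with the power */
}

static void domain_power_on(struct sysmmu_model *mmu)
{
    /* The register still holds its reset value until the notifier runs. */
    sysmmu_pd_notify(mmu, PD_EVENT_POWER_ON_DONE);
}
```

The master's driver calls only the power on/off entry points, so it needs
no knowledge of the SYSMMU; the context handling lives entirely in the
notifier callback.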
6. Proposed patches and changes
Patch 0001 "pm: Add PM domain notifications" adds support for power
domain notifiers (see chapter 5 above).
Patch 0002 "ARM: Exynos: bind power domains earlier, on device creation"
changes the time when Exynos power domains are bound to the device. Now
this happens on the DEVICE_ADD event instead of DRIVER_BIND, so when the
SYSMMU driver is being initialized, the power domains are already bound and
notifiers can be added.
Patch 0003 "clk: exynos: add missing smmu_g2d clock and update comments"
simply adds missing sysmmu related entities to the Exynos clock
driver.
Patch 0004 "drivers: base: add notifier for failed driver bind" adds an
event for a failed driver bind, so things prepared in the DRIVER_BIND event
can be cleaned up, similar to DRIVER_UNBOUND.
Patch 0005 "drivers: convert suppress_bind_attrs parameter into flags"
is preparation for adding new flags to struct device_driver.
Patch 0006 "drivers: iommu: add notify about failed bind" adds support
for recently introduced failed driver bind event to IOMMU subsystem.
Patch 0007 "ARM: dma-mapping: arm_iommu_attach_device: automatically set
max_seg_size" moves the common operation of setting dma max_seg_size
directly into the arm_iommu_attach_device function.
Patch 0008 "ARM: dma-mapping: add helpers for managing default
per-device dma mappings" adds convenient helpers for the most common
case of setting up per-device, separate DMA (IO) address space.
Patch 0009 "ARM: dma-mapping: provide stubs if no ARM_DMA_USE_IOMMU has
been selected" fixes usage of IOMMU related ARM DMA-mapping functions in
common code.
Patch 0010 "drivers: add DRIVER_HAS_OWN_IOMMU_MANAGER flag" adds a flag
described in chapter 3.
Patch 0011 "DRM: exynos: add DRIVER_HAS_OWN_IOMMU_MANAGER flag to all
sub-drivers" marks all Exynos DRM sub-drivers with a flag notifying that
they manage their DMA (IO) address space on their own. All the code to
set up dma-mapping and attach all devices is already there.
Patch 0012 "DRM: Exynos: fix window clear code" is a simple bugfix of
broken init code, which triggers issues when used with IOMMU (a page fault
happens on systems where the bootloader has left the framebuffer enabled).
Patch 0013 "temporary: media: s5p-mfc: remove DT hacks & initialization
custom memory init code" removes all custom memory region handling, so
that a later patch can demonstrate how to use separate DMA (IO) address
spaces from the master device's driver.
Patch 0014 "devicetree: Update Exynos SYSMMU device tree bindings" adds
a few words about proposed solution to SYSMMU device tree bindings
documentation.
Patch 0015 "ARM: DTS: Exynos4: add System MMU nodes" adds device tree
nodes for all SYSMMU controllers found in Exynos 4210 and 4x12 SoCs and
the respective properties to their master devices.
Patches 0016-0021 are simple bugfixes and code refactoring to simplify
the driver:
"iommu: exynos: make driver multiarch friendly",
"iommu: exynos: don't read version register on every tlb",
"iommu: exynos: remove unused functions",
"iommu: exynos: remove useless spinlock",
"iommu: exynos: refactor function parameters to simplify code",
"iommu: exynos: remove unused functions, part 2".
Patch 0022 "iommu: exynos: add support for binding more than one sysmmu
to master device" adds support for storing a list of SYSMMU controllers
in the master's iommu arch data structure.
Patch 0023 "iommu: exynos: init iommu controllers from device tree"
finally implements the bindings described in patch 0014 and access to a
particular DMA address space managed by a SYSMMU controller via a
sub-device of predefined name (see chapter 4).
Patch 0024 "iommu: exynos: create default iommu-based dma-mapping for
master devices" does what the patch title says.
Patch 0025 "iommu: exynos: add support for runtime_pm" implements the power
management scheme described in chapter 5.
Patches 0026-0028 are cleanup and refactoring to make the code easier to
understand:
"iommu: exynos: rename variables to reflect their purpose",
"iommu: exynos: document internal structures",
"iommu: exynos: remove excessive includes and sort others
alphabetically".
Patch 0029 "temporary: media: s5p-mfc: add support for IOMMU"
demonstrates how to use sub-devices to get access to separate DMA (IO)
address spaces. The driver is able to work both with and without this
patch. Without this patch a common shared address space is created for
both SYSMMU controllers (so only 256MiB of total address space is
available, see end of chapter 4).
7. Summary
All the development of those patches has been done on the Exynos4412-based
OdroidU3 board and the Exynos4210-based UniversalC210, on top of the v3.16
kernel with some additional patches to enable HDMI support on the Odroid
board. This version is available in the following GIT repository:
http://git.linaro.org/git/people/marek.szyprowski/linux-dma-mapping.git
on branch v3.16-odroid-iommu.
However, the version posted here has been rebased on top of the linux-next
kernel (next-20140804 tag), to make merging easier once v3.17-rc1 is
out.
8. Diffstat
.../devicetree/bindings/iommu/samsung,sysmmu.txt | 93 ++-
Documentation/power/notifiers.txt | 14 +
arch/arm/boot/dts/exynos4.dtsi | 118 ++++
arch/arm/boot/dts/exynos4210.dtsi | 23 +
arch/arm/boot/dts/exynos4x12.dtsi | 82 +++
arch/arm/include/asm/dma-iommu.h | 36 ++
arch/arm/mach-exynos/pm_domains.c | 12 +-
arch/arm/mach-integrator/impd1.c | 2 +-
arch/arm/mm/dma-mapping.c | 47 ++
drivers/base/bus.c | 4 +-
drivers/base/dd.c | 10 +-
drivers/base/platform.c | 2 +-
drivers/base/power/domain.c | 70 ++-
drivers/clk/samsung/clk-exynos4.c | 1 +
drivers/gpu/drm/exynos/exynos_drm_fimc.c | 1 +
drivers/gpu/drm/exynos/exynos_drm_fimd.c | 26 +-
drivers/gpu/drm/exynos/exynos_drm_g2d.c | 1 +
drivers/gpu/drm/exynos/exynos_drm_gsc.c | 1 +
drivers/gpu/drm/exynos/exynos_drm_rotator.c | 1 +
drivers/gpu/drm/exynos/exynos_mixer.c | 1 +
drivers/iommu/exynos-iommu.c | 663 +++++++++++++--------
drivers/iommu/iommu.c | 3 +
drivers/media/platform/s5p-mfc/s5p_mfc.c | 107 ++--
drivers/pci/host/pci-mvebu.c | 2 +-
drivers/pci/host/pci-rcar-gen2.c | 2 +-
drivers/pci/host/pci-tegra.c | 2 +-
drivers/pci/host/pcie-rcar.c | 2 +-
drivers/soc/tegra/pmc.c | 2 +-
include/dt-bindings/clock/exynos4.h | 10 +-
include/linux/device.h | 12 +-
include/linux/iommu.h | 1 +
include/linux/pm.h | 2 +
include/linux/pm_domain.h | 19 +
33 files changed, 1016 insertions(+), 356 deletions(-)
Best regards
Marek Szyprowski
Samsung R&D Institute Poland