Hi Maarten!
Broadening the audience a bit..
On 9/14/12 9:12 AM, Maarten Lankhorst wrote:
> Op 13-09-12 23:00, Thomas Hellstrom schreef:
>> On 09/13/2012 07:13 PM, Maarten Lankhorst wrote:
>>> Hey
>>>
>>> Op 13-09-12 18:41, Thomas Hellstrom schreef:
>>>> On 09/13/2012 05:19 PM, Maarten Lankhorst wrote:
>>>>> Hey,
>>>>>
>>>>> Op 12-09-12 15:28, Thomas Hellstrom schreef:
>>>>>> On 09/12/2012 02:48 PM, Maarten Lankhorst wrote:
>>>>>>> Hey Thomas,
>>>>>>>
>>>>>>> I'm playing around with moving reservations from ttm to global, but how
>>>>>>> ttm is handling reservations is getting in the way. The code wants to remove
>>>>>>> the bo from the lru list at the same time a reservation is made, but that
>>>>>>> seems to be slightly too strict. It would really help me if that guarantee
>>>>>>> were removed.
>>>>>> Hi, Maarten.
>>>>>>
>>>>>> Removing that restriction is not really possible at the moment.
>>>>>> Also the memory accounting code depends on this, and may cause reservations
>>>>>> in the most awkward places. Since these reservations don't have a ticket
>>>>>> they may and will cause deadlocks. So in short the restriction is there
>>>>>> to avoid deadlocks caused by ticketless reservations.
>>>>> I have finished the lockdep annotations now which seems to catch almost
>>>>> all abuse I threw at it, so I'm feeling slightly more confident about moving
>>>>> the locking order and reservations around.
>>>> Maarten, moving reservations in TTM out of the lru lock is incorrect as the code is
>>>> written now. If we want to move it out, we need something for ticketless reservations.
>>>>
>>>> I've been thinking of having a global hash table of tickets with the task struct pointer as the key,
>>>> but even then, we'd need to be able to handle EBUSY errors on every operation that might try to
>>>> reserve a buffer.
>>>>
>>>> The fact that lockdep doesn't complain isn't enough. There *will* be deadlock use-cases when TTM is handed
>>>> the right data-set.
>>>>
>>>> Isn't there a way that a subsystem can register a callback to be performed to remove stuff from LRU and
>>>> to take a pre-reservation lock?
>>> What if multiple subsystems need those? You will end up with a deadlock again.
>>>
>>> I think it would be easier to change the code in ttm_bo.c to not assume the first
>>> item on the lru list is really the least recently used, and assume the first item
>>> that can be reserved without blocking IS the least recently used instead.
>> So what would happen then is that we'd spin on the first item on the LRU list, since
>> when reserving we must release the LRU lock, and if reserving fails, we thus
>> need to restart LRU traversal. Typically after a schedule(). That's bad.
>>
>> So let's take a step back and analyze why the LRU lock has become a problem.
>> From what I can tell, it's because you want to use per-object lock when reserving instead of a
>> global reservation lock (that TTM could use as the LRU lock). Is that correct?
>> and in that case, in what situation do you envision such a global lock being contended
>> to the extent that it hurts performance?
>>
>>>>> Lockdep WILL complain about trying to use multiple tickets, doing ticketed
>>>>> and unticketed blocking reservations mixed, etc.
>>>>>
>>>>> I want to remove the global fence_lock and make it a per buffer lock, with some
>>>>> lockdep annotations it's perfectly legal to grab obj->fence_lock and obj2->fence_lock
>>>>> if you have a reservation, but it should complain loudly about trying to take 2 fence_locks
>>>>> at the same time without a reservation.
>>>> Yes, TTM was previously using per buffer fence locks, and that works fine from a deadlock perspective, but
>>>> it hurts performance. Fencing 200 buffers in a command submission (google-earth for example) will mean
>>>> 198 unnecessary locks, each one flushing the processor pipeline. Locking is a *slow* operation, particularly
>>>> on systems with many processors, and I don't think it's a good idea to change that back without analyzing
>>>> the performance impact. There are reasons people are writing stuff like RCU to avoid locking...
>>> So why don't we simply use RCU for fence pointers and get rid of the fence locking? :D
>>> danvet originally suggested it as a joke but if you think about it, it would make a lot of sense for this usecase.
>> I thought of that before, but the problem is you'd still need a spinlock to change the buffer's fence pointer,
>> even if reading it becomes quick.
> Actually, I changed the lockdep annotations a bit to distinguish between the
> cases where ttm_bo_wait is called without a reservation and where it
> is called with one. As far as I can see there are only 2 places that do it without,
> at least if I converted my git tree properly...
>
> http://cgit.freedesktop.org/~mlankhorst/linux/log/?h=v10-wip
>
> The first one is nouveau_bo_vma_del; this can be fixed easily.
> The second one is ttm_bo_cleanup_refs and ttm_bo_cleanup_refs_or_queue:
> if the reservation is done first, before ttm_bo_wait, the fence_lock could be
> dropped entirely by adding smp_mb() in reserve and unreserve; functionally
> there would be no difference. So if you can verify that my lockdep annotations are
> correct in the most recent commit wrt what's using ttm_bo_wait without reservation,
> we could remove the fence_lock entirely.
>
> ~Maarten
Being able to wait for buffer idle or get the fence pointer without
reserving is a fundamental property of TTM. Reservation is a long-term
lock. The fence lock is a very short term lock. If I were to choose, I'd
rather accept per-object fence locks than removing this property, but
see below.
Likewise, to be able to guarantee that a reserved object is not on any
LRU list is also an important property. Removing that property will, in
addition to the spin wait we've already discussed, make understanding TTM
locking even more difficult, and I'd really like to avoid it.
If this were a real performance problem we were trying to solve it would
be easier to motivate changes in this area, but if it's just trying to
avoid a global reservation lock and a global fence lock that will rarely
if ever see any contention, I can't see the point. On the contrary,
having per-object locks will be very costly when reserving / fencing
many objects. As mentioned before, in the fence lock case it's been
tried and removed, so I'd like to know the reasoning behind introducing
it again, and in what situations you think the global locks will be
contended.
/Thomas
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior; however, this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratable for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indications to
prevent it from being dropped.
Signed-off-by: Laura Abbott <lauraa(a)codeaurora.org>
---
fs/buffer.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index ad5938c..daa0c3d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1399,12 +1399,49 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return 0;
}
+static void __evict_bh_lru(void *arg)
+{
+ struct bh_lru *b = &get_cpu_var(bh_lrus);
+ struct buffer_head *bh = arg;
+ int i;
+
+ for (i = 0; i < BH_LRU_SIZE; i++) {
+ if (b->bhs[i] == bh) {
+ brelse(b->bhs[i]);
+ b->bhs[i] = NULL;
+ goto out;
+ }
+ }
+out:
+ put_cpu_var(bh_lrus);
+}
+
+static bool bh_exists_in_lru(int cpu, void *arg)
+{
+ struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+ struct buffer_head *bh = arg;
+ int i;
+
+ for (i = 0; i < BH_LRU_SIZE; i++) {
+ if (b->bhs[i] == bh)
+ return 1;
+ }
+
+ return 0;
+
+}
void invalidate_bh_lrus(void)
{
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
+void evict_bh_lrus(struct buffer_head *bh)
+{
+ on_each_cpu_cond(bh_exists_in_lru, __evict_bh_lru, bh, 1, GFP_ATOMIC);
+}
+EXPORT_SYMBOL_GPL(evict_bh_lrus);
+
void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
{
@@ -3052,6 +3089,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
bh = head;
do {
+ evict_bh_lrus(bh);
if (buffer_write_io_error(bh) && page->mapping)
set_bit(AS_EIO, &page->mapping->flags);
if (buffer_busy(bh))
--
1.7.11.3
Hi Linus,
I would like to ask for pulling one more patch for the ARM dma-mapping
subsystem into the Linux v3.6 kernel tree. This patch fixes a very subtle bug
(a typical off-by-one error) which might appear in very rare
circumstances.
The following changes since commit 55d512e245bc7699a8800e23df1a24195dd08217:
Linux 3.6-rc5 (2012-09-08 16:43:45 -0700)
are available in the git repository at:
git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git fixes-for-3.6
for you to fetch changes up to f3d87524975f01b885fc3d009c6ab6afd0d00746:
arm: mm: fix DMA pool affiliation check (2012-09-10 16:15:48 +0200)
Thanks!
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Patch summary:
----------------------------------------------------------------
Thomas Petazzoni (1):
arm: mm: fix DMA pool affiliation check
arch/arm/mm/dma-mapping.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
On Wed, Sep 5, 2012 at 5:08 AM, Tomi Valkeinen <tomi.valkeinen(a)ti.com> wrote:
> Hi,
>
> OMAP has a custom video ram allocator, which I'd like to remove and use
> the standard dma allocation functions.
>
> There are two problems for which I'd like to hear suggestions or
> comments:
>
> First one is that the dma_alloc_* functions map the allocated memory for
> cpu use. In many cases with OMAP DSS (display subsystem) this is not
> needed: the memory may be written only by the SGX or the DSP, and it's
> only read by the DSS, so it's never touched by the CPU.
see dma_alloc_attrs() and DMA_ATTR_NO_KERNEL_MAPPING
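A rough sketch of how that can look with the 3.6-era dma_attrs API (the
device pointer, size and helper name are placeholders, not the actual DSS
code):

    #include <linux/dma-mapping.h>
    #include <linux/dma-attrs.h>

    /* Hypothetical helper: allocate a buffer for the display hardware
     * without creating a kernel mapping.  With DMA_ATTR_NO_KERNEL_MAPPING
     * the return value is only an opaque cookie to pass back to
     * dma_free_attrs(), not a CPU-usable virtual address. */
    static void *dss_alloc_nomap(struct device *dev, size_t size,
                                 dma_addr_t *paddr, struct dma_attrs *attrs)
    {
        init_dma_attrs(attrs);
        dma_set_attr(DMA_ATTR_NO_KERNEL_MAPPING, attrs);
        return dma_alloc_attrs(dev, size, paddr, GFP_KERNEL, attrs);
    }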
> This is even more true when using VRFB on omap3 (and probably TILER on
> omap4) for rotation, as VRFB hides the actual memory and offers rotated
> views. In this case the backend memory is never accessed by anyone else
> than VRFB.
just fwiw, we don't actually need contiguous memory on o4/tiler :-)
(well, at least if you ignore things like secure playback)
> Is there a way to allocate the memory without creating a mapping? While
> it won't break anything as such, the allocated areas can be quite large
> thus causing large areas of the kernel's memory space to be needlessly
> reserved.
>
> The second case is passing a framebuffer address from the bootloader to
> the kernel. Often with mobile devices the bootloader will initialize the
> display hardware, showing a company logo or such. To keep the image on
> the screen when kernel starts we need to reserve the same physical
> memory area early at boot, and use that for the framebuffer.
with a bit of handwaving, this is possible. You can pass a base
address to dma_declare_contiguous() when you set up your device's CMA
pool. Although that doesn't really guarantee your allocation from
that pool is at offset zero, I suppose.
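Roughly, from the machine's early/.reserve() init code, something like the
sketch below; the device, base address and size are illustrative and would
really come from the bootloader:

    #include <linux/dma-contiguous.h>
    #include <linux/sizes.h>

    /* Hypothetical: give the display device a private CMA area placed at
     * the physical address the bootloader used for its framebuffer, so
     * later dma_alloc_* calls for this device allocate from that region. */
    static int __init dss_reserve_bootloader_fb(struct device *dev)
    {
        phys_addr_t fb_base = 0x9d000000;   /* placeholder */
        phys_addr_t fb_size = SZ_8M;        /* placeholder */

        return dma_declare_contiguous(dev, fb_size, fb_base, 0);
    }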
> I'm not sure if there's any actual problem with this one, presuming
> there is a solution for the first case. Somehow the memory is reserved
> at early boot time, and this is passed to the fb driver. But can the
> memory be managed the same way as in normal case (for example freeing
> it), or does it need to be handled as a special case?
special-casing it might be better.. although possibly a dma attr could
be added for this to tell dma_alloc_from_contiguous() that we need a
particular address within the CMA pool. It seems a bit like a hack,
but OTOH I guess pretty much every consumer device would need a hack
like this.
BR,
-R
> Tomi
>
v2->v3
Split oom killer patch only.
Based on Nishanth's patch, which changes ion_debug_heap_total to use id.
1. Add heap_found.
2. Solve the issue of several ids sharing one type:
use ion_debug_heap_total(client, heap->id) instead of ion_debug_heap_total(client, heap->type),
since id is unique while type can be shared.
Fortunately Nishanth has updated his patch, so this is rebased on it.
v1->v2
Sync to Aug 30 common.git
v0->v1:
1. move ion_shrink out of mutex, suggested by Nishanth
2. check error flag of ERR_PTR(-ENOMEM)
3. add msleep to allow schedule out.
Base on common.git, android-3.4 branch
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to all tasks referring to the buffer, in descending oom_score_adj order.
Nishanth Peethambaran (1):
gpu: ion: Update debugfs to show for each id
Zhangfei Gao (1):
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 131 +++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 121 insertions(+), 10 deletions(-)
v1->v2
Sync to Aug 30 common.git
v0->v1:
1. Change gen_pool_create(12, -1) to gen_pool_create(PAGE_SHIFT, -1), suggested by Haojian
2. move ion_shrink out of mutex, suggested by Nishanth
3. check error flag of ERR_PTR(-ENOMEM)
4. add msleep to allow schedule out.
Base on common.git, android-3.4 branch
Patch 2:
Add support for page-wise cache flush for the carveout heap.
There is only one nents for the carveout heap, as well as only one dirty bit.
As a result, a cache flush only takes effect on the whole carveout heap.
Patch 3:
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to all tasks referring to the buffer, in descending oom_score_adj order.
Zhangfei Gao (3):
gpu: ion: update carveout_heap_ops
gpu: ion: carveout_heap page wised cache flush
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 118 +++++++++++++++++++++++++++++++++-
drivers/gpu/ion/ion_carveout_heap.c | 25 ++++++--
2 files changed, 133 insertions(+), 10 deletions(-)
So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider
object interface, so that Optimus-based laptops can use our driver to drive the
discrete GPU and display on the integrated GPU. The good news is that I've got
a proof of concept working.
During a review of the current code, we came up with a few concerns:
1. The output source is responsible for allocating the shared memory
Right now, the X server calls CreatePixmap on the output source screen and then
expects the output sink screen to be able to display from whatever memory the
source allocates. Right now, the source has no mechanism for asking the sink
what its requirements are for the surface. I'm using our own internal pitch
alignment requirements and that seems to be good enough for the Intel device to
scan out, but that could be pure luck.
Does it make sense to add a mechanism for drivers to negotiate this with each
other, or is it sufficient to just define a lowest common denominator format and
if your hardware can't deal with that format, you just don't get to share
buffers?
One of my coworkers brought to my attention the fact that Tegra requires a
specific pitch alignment, and cannot accommodate larger pitches. If other SoC
designs have similar restrictions, we might need to add a handshake mechanism.
2. There's no fallback mechanism if sharing can't be negotiated
If RandR fails to share a pixmap with the output sink screen, the whole modeset
fails. This means you'll end up not seeing anything on the screen and you'll
probably think your computer locked up. Should there be some sort of software
copy fallback to ensure that something at least shows up on the display?
3. How should the memory be allocated?
In the prototype I threw together, I'm allocating the shared memory using
shm_open and then exporting that as a dma-buf file descriptor using an ioctl I
added to the kernel, and then importing that memory back into our driver through
dma_buf_attach & dma_buf_map_attachment. Does it make sense for user-space
programs to be able to export shmfs files like that? Should that interface go
in DRM / GEM / PRIME instead? Something else? I'm pretty unfamiliar with this
kernel code so any suggestions would be appreciated.
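For context, the kernel-side import half of that flow follows the usual
dma-buf pattern, roughly like this sketch (error handling omitted; "dev" is
whatever struct device the importing driver hands to dma_buf_attach):

    #include <linux/dma-buf.h>

    /* Import a dma_buf fd and obtain an sg_table describing the backing
     * memory for this device. */
    static struct sg_table *import_dmabuf_fd(struct device *dev, int fd,
                                             struct dma_buf **buf,
                                             struct dma_buf_attachment **att)
    {
        *buf = dma_buf_get(fd);
        *att = dma_buf_attach(*buf, dev);
        return dma_buf_map_attachment(*att, DMA_BIDIRECTIONAL);
    }

    /* Matching teardown. */
    static void release_dmabuf_import(struct dma_buf *buf,
                                      struct dma_buf_attachment *att,
                                      struct sg_table *sgt)
    {
        dma_buf_unmap_attachment(att, sgt, DMA_BIDIRECTIONAL);
        dma_buf_detach(buf, att);
        dma_buf_put(buf);
    }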
-- Aaron
P.S. for those unfamiliar with PRIME:
Dave Airlie added new support to the X Resize and Rotate extension version 1.4
to support offloading display and rendering to different drivers. PRIME is the
DRM implementation in the kernel, layered on top of DMA-BUF, that implements the
actual sharing of buffers between drivers.
http://cgit.freedesktop.org/xorg/proto/randrproto/tree/randrproto.txt?id=ra…
http://airlied.livejournal.com/75555.html - update on hotplug server
http://airlied.livejournal.com/76078.html - randr 1.5 demo videos
Base on common.git, android-3.4 branch
Patch 2:
Add support for page-wise cache flush for the carveout heap.
There is only one nents for the carveout heap, as well as only one dirty bit.
As a result, a cache flush only takes effect on the whole carveout heap.
Patch 3:
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to all tasks referring to the buffer, in descending oom_score_adj order.
Zhangfei Gao (3):
gpu: ion: update carveout_heap_ops
gpu: ion: carveout_heap page wised cache flush
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 112 ++++++++++++++++++++++++++++++++++-
drivers/gpu/ion/ion_carveout_heap.c | 23 ++++++--
2 files changed, 127 insertions(+), 8 deletions(-)
v0->v1:
1. Change gen_pool_create(12, -1) to gen_pool_create(PAGE_SHIFT, -1), suggested by Haojian
2. move ion_shrink out of mutex, suggested by Nishanth
3. check error flag of ERR_PTR(-ENOMEM)
4. add msleep to allow schedule out.
Base on common.git, android-3.4 branch
Patch 2:
Add support for page-wise cache flush for the carveout heap.
There is only one nents for the carveout heap, as well as only one dirty bit.
As a result, a cache flush only takes effect on the whole carveout heap.
Patch 3:
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to all tasks referring to the buffer, in descending oom_score_adj order.
Zhangfei Gao (3):
gpu: ion: update carveout_heap_ops
gpu: ion: carveout_heap page wised cache flush
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 118 +++++++++++++++++++++++++++++++++-
drivers/gpu/ion/ion_carveout_heap.c | 25 ++++++--
2 files changed, 133 insertions(+), 10 deletions(-)
Hi Linus,
I would like to ask for pulling another set of fixes for the ARM dma-mapping
subsystem. Commit e9da6e9905e6 replaced custom consistent buffer
remapping code with generic vmalloc areas. It however introduced some
regressions caused by limited support for allocations in atomic context.
This series contains fixes for those regressions. For some subplatforms
the default, pre-allocated pool for atomic allocations turned out to be
too small, so a function for setting its size has been added. Another
set of patches adds support for atomic allocations to IOMMU-aware
DMA-mapping implementation. The last part of this pull request contains
two fixes for Contiguous Memory Allocator, which relax too strict
requirements.
The following changes since commit fea7a08acb13524b47711625eebea40a0ede69a0:
Linux 3.6-rc3 (2012-08-22 13:29:06 -0700)
are available in the git repository at:
fixes-for-3.6
for you to fetch changes up to 479ed93a4b98eef03fd8260f7ddc00019221c450:
ARM: dma-mapping: IOMMU allocates pages from atomic_pool with GFP_ATOMIC (2012-08-28 21:01:07 +0200)
Thanks!
Best regards
Marek Szyprowski
Samsung Poland R&D Center
----------------------------------------------------------------
Patch summary:
Hiroshi Doyu (4):
ARM: dma-mapping: atomic_pool with struct page **pages
ARM: dma-mapping: Refactor out to introduce __in_atomic_pool
ARM: dma-mapping: Introduce __atomic_get_pages() for __iommu_get_pages()
ARM: dma-mapping: IOMMU allocates pages from atomic_pool with GFP_ATOMIC
Marek Szyprowski (5):
mm: cma: fix alignment requirements for contiguous regions
ARM: relax conditions required for enabling Contiguous Memory Allocator
ARM: DMA-Mapping: add function for setting coherent pool size from platform code
ARM: DMA-Mapping: print warning when atomic coherent allocation fails
ARM: Kirkwood: increase atomic coherent pool size
arch/arm/Kconfig | 2 +-
arch/arm/include/asm/dma-mapping.h | 7 ++
arch/arm/mach-kirkwood/common.c | 7 ++
arch/arm/mm/dma-mapping.c | 114 ++++++++++++++++++++++++++++++++---
drivers/base/dma-contiguous.c | 2 +-
5 files changed, 120 insertions(+), 12 deletions(-)
Hi,
I am trying to export an ion_buffer allocated from kernel space to
multiple user-space clients. E.g.: allow multiple processes to mmap a
framebuffer allocated using ion by the fb driver.
The following is the pseudo-code for that. Is this fine? Is there a
cleaner way to do it?
Or is it expected to share buffers across processes only by user space
sharing fds using sockets/binder, and not directly in the kernel?
fb driver init/probe: (init process context)
-------------------------------------------------
/* Create an ion client and allocate framebuffer */
init_client = ion_client_create(idev,...);
init_hdl = ion_alloc(init_client,...);
/* Create a global dma_buf instance for the buffer */
fd = ion_share_dma_buf(init_client, init_hdl);
// - Inc refcount of ion_buffer
// - Create a dma_buf and anon-file for the ion buffer
// - Get a free fd and install to anon file
g_dma_buf = dma_buf_get(fd);
// - Get the dma_buf pointer and inc refcount of anon_file
dma_buf_put(g_dma_buf);
// - Dec extra refcount of anon_file which happened in prev command
put_unused_fd(fd);
// - Free up the fd as fd is not exported to user-space here.
fb driver exit: (init process context)
------------------------------------------
/* Free the dma_buf reference */
dma_buf_put(g_dma_buf);
// - Dec refcount of anon_file. Free the dma_buf and dec refcount of
ion_buffer if anon_file refcount = 0
/* Free the framebuffer and destroy the ion client created for init process */
ion_free(init_client, init_hdl);
ion_client_destroy(init_client);
fb device open: (user process context)
-----------------------------------------------
/* Create an ion client for the user process */
p_client = ion_client_create(idev,...);
fb device ioctl to import ion handle for the fb: (user process context)
-----------------------------------------------------------------------------------
/* Import a ion_handle from the global dma_buf */
fd = dma_buf_fd(g_dmabuf, O_CLOEXEC);
// - Get ref to anon file
// - Get a free fd and install to anon file
p_hdl = ion_import_dma_buf(p_client, fd);
// - Inc refcount of ion_buffer
// - create a ion_handle for the buffer for this process/client
dma_buf_put(g_dmabuf);
// - Free the anon file reference taken in first step
put_unused_fd(fd);
// - Free up the fd as fd is not exported to user-space here.
fb device release: (user process context)
---------------------------------------------------
/* Destroy the client created */
ion_client_destroy(p_client);
- Nishanth Peethambaran
Hi,
I've been observing a high rate of failures with CMA allocations on my
ARM system. I've set up a test case with a 56MB CMA region that
essentially does the following:
total_failures = 0;
loop forever:
    loop_failure = 0;
    for (i = 0; i < 56; i++)
        chunk[i] = dma_allocate(&cma_dev, 1MB)
        if (!chunk[i])
            loop_failure = 1
    if (loop_failure)
        total_failures++
    loop_failure = 0
    for (i = 0; i < 56; i++)
        dma_free(&cma_dev, chunk[i], 1MB)
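Spelled out as kernel code, one pass of that loop is roughly the sketch
below (cma_dev and the plain dma_alloc_coherent/dma_free_coherent calls
stand in for the shorthand above):

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    #define NCHUNKS 56

    /* Try to carve the whole 56MB CMA region into 1MB chunks, note whether
     * any allocation failed, then free everything again. */
    static int cma_stress_one_pass(struct device *cma_dev)
    {
        void *chunk[NCHUNKS];
        dma_addr_t handle[NCHUNKS];
        int i, loop_failure = 0;

        for (i = 0; i < NCHUNKS; i++) {
            chunk[i] = dma_alloc_coherent(cma_dev, SZ_1M, &handle[i],
                                          GFP_KERNEL);
            if (!chunk[i])
                loop_failure = 1;
        }

        for (i = 0; i < NCHUNKS; i++)
            if (chunk[i])
                dma_free_coherent(cma_dev, SZ_1M, chunk[i], handle[i]);

        return loop_failure;    /* caller bumps total_failures if set */
    }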
In the background, I also have a process doing some amount of filesystem
activity (adb push/pull since this is an android system). During the
course of my investigations I generally get ~8500 loops total and ~450
total failures (i.e. one or more buffers could not be allocated). This
is unacceptably high for our use cases.
In every case the allocation failure was ultimately due to a migration
failure; the pages contained buffers which could not be dropped because
the buffers were busy (move_to_new_page -> fallback_migrate_page ->
try_to_release_page -> try_to_free_buffers -> drop_buffers ->
buffer_busy). In every case, the b_count on the buffer head was always 1.
The problem arises because of the LRU lists for buffer heads:
__getblk
__getblk_slow
grow_buffers
grow_dev_page
find_or_create_page -- create a possibly movable page
__find_get_block
__find_get_block_slow
find_get_page -- return the movable page
bh_lru_install
get_bh -- buffer head now has a reference
The reference taken in bh_lru_install won't be dropped until the bh is
evicted from the lru. This means the page cannot be migrated as long as
the buffer exists on an LRU list. The real issue is that unless the
buffer gets evicted quickly the page can remain non-migratable for long
periods of time. This makes CMA regions unusable for long periods of
time, given that we generally don't want to size CMA regions any larger
than necessary, so any failure will cause a problem.
My quick and dirty workaround for testing is to remove the __GFP_MOVABLE
flag from find_or_create_page, but this seems significantly less than
optimal. Ideally, it seems like the buffers should be evicted from the
LRU when trying to drop them (extend invalidate_bh_lrus?), but I'm not familiar
enough with the code path to know if this is a good approach.
Any suggestions/feedback is appreciated. Thanks.
Laura
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
Hi,
I see that lowmemorykiller.c has been changed to use oom_score_adj instead
of oom_adj.
Does this mean I cannot use an ICS system image with a 3.4 kernel?
Or is there a workaround?
- Nishanth Peethambaran
looping-in the linaro-mm-sig ML.
On Thu, Aug 30, 2012 at 4:47 PM, Aubertin, Guillaume <g-aubertin(a)ti.com>wrote:
> hi guys,
>
> I've been working for a few days on getting a proper rmmod with the
> remoteproc/rpmsg modules, and I stumbled upon an interesting issue.
>
> when doing successive memory allocation and release in the CMA
> reservation (by loading/unloading the firmware several times), the
> following message shows up:
>
> [ 119.908477] cma: dma_alloc_from_contiguous(cma ed10ad00, count 256,
> align 8)
> [ 119.908843] cma: dma_alloc_from_contiguous(): memory range at c0dfb000
> is busy, retrying
> [ 119.909698] cma: dma_alloc_from_contiguous(): returned c0dfd000
>
> dma_alloc_from_contiguous() tries to allocate the following range,
> 0xc0dfd000, successfully this time.
>
> In some cases, the allocation fails after trying several ranges:
>
> [ 119.912231] cma: dma_alloc_from_contiguous(cma ed10ad00, count 768,
> align 8)
> [ 119.912719] cma: dma_alloc_from_contiguous(): memory range at c0dff000
> is busy, retrying
> [ 119.913055] cma: dma_alloc_from_contiguous(): memory range at c0e01000
> is busy, retrying
> [ 119.913055] rproc remoteproc0: dma_alloc_coherent failed: 3145728
>
> Here is my understanding so far:
>
> First, even if we made a CMA reservation, the kernel can still allocate
> pages in this area, but these pages must be movable (user process pages, for
> example).
>
> When dma_alloc_from_contiguous() is called to allocate X pages, it looks
> for the next X contiguous free pages in its CMA bitmap (with respect to
> the memory alignment). Then, alloc_contig_range() is called to allocate the
> given range of pages. Alloc_contig_range() analyses the pages we want to
> allocate, and if a page is already used, it is migrated to a new page
> outside the page array we want to reserve. this is done using
> isolate_migratepages_range() to list the pages to migrate, and
> migrate_pages() to try to migrate the pages, and that's where it fails.
> Below is a list of next function calls :
>
> fallback_migrate_page() --> migrate_page() --> try_to_release_page()
> --> try_to_free_buffer() --> drop_buffers() --> buffer_busy()
>
> I understand here that the page contains used buffers that can't be
> dropped, and so the page can't be migrated. Well, I must admit that once
> here, I'm feeling a little lost in this ocean of memory management code ;).
> After some research, I found the following thread on the
> linux-arm-kernel ML talking about the same issue:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/102844.html with
> the following patch:
>
>  mm/page_alloc.c | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e1c6f5..c9a6483 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1310,7 +1310,8 @@ void free_hot_cold_page(struct page *page, int cold)
>  	 * excessively into the page allocator
>  	 */
>  	if (migratetype >= MIGRATE_PCPTYPES) {
> -		if (unlikely(migratetype == MIGRATE_ISOLATE)) {
> +		if (unlikely(migratetype == MIGRATE_ISOLATE)
> +		    || is_migrate_cma(migratetype)) {
>  			free_one_page(zone, page, 0, migratetype);
>  			goto out;
>  		}
>
> I tried the patch, and it seems to work (I didn't have any "memory range
> busy" in 5000+ tests), but I'm afraid that this could have some nasty side
> effects.
>
> Any idea ?
>
> Thanks in advance,
> Guillaume
>
>
> --
> Texas Instruments France SA, 821 Avenue Jack Kilby, 06270 Villeneuve
> Loubet. 036 420 040 R.C.S Antibes. Capital de EUR 753.920
>
--
Texas Instruments France SA, 821 Avenue Jack Kilby, 06270 Villeneuve
Loubet. 036 420 040 R.C.S Antibes. Capital de EUR 753.920
Hi All,
Over the last few months I've been working on & off with a few people from
Linaro on a new EGL extension. The extension allows constructing an EGLImage
from a (set of) dma_buf file descriptors, including support for multi-plane
YUV. I envisage the primary use-case of this extension to be importing video
frames from v4l2 into the EGL/GLES graphics driver to texture from.
Originally the intent was to develop this as a Khronos-ratified extension.
However, this is a little too platform-specific to be an officially
sanctioned Khronos extension. It also goes against the general "EGLStream"
direction the EGL working group is going in. As such, the general feeling
was to make this an EXT "multi-vendor" extension with no official stamp of
approval from Khronos. As this is no-longer intended to be a Khronos
extension, I've re-written it to be a lot more Linux & dma_buf specific. It
also allows me to circulate the extension more widely (I.e. To those outside
Khronos membership).
ARM are implementing this extension for at least our Mali-T6xx driver and
likely earlier drivers too. I am sending this e-mail to solicit feedback,
both from other vendors who might implement this extension (Mesa3D?) and
from potential users of the extension. However, any feedback is welcome.
Please find the extension text as it currently stands below. There are several
open issues which I've proposed solutions for, but I'm not really happy with
those proposals and hoped others could chip-in with better ideas. There are
likely other issues I've not thought about which also need to be added and
addressed.
Once there's a general consensus or if no-one's interested, I'll update the
spec, move it out of Draft status and get it added to the Khronos registry,
which includes assigning values for the new symbols.
Cheers,
Tom
---------8<---------
Name
EXT_image_dma_buf_import
Name Strings
EGL_EXT_image_dma_buf_import
Contributors
Jesse Barker
Rob Clark
Tom Cooksey
Contacts
Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)
Status
DRAFT
Version
Version 3, August 16, 2012
Number
EGL Extension ???
Dependencies
EGL 1.2 is required.
EGL_KHR_image_base is required.
The EGL implementation must be running on a Linux kernel supporting the
dma_buf buffer sharing mechanism.
This extension is written against the wording of the EGL 1.2
Specification.
Overview
This extension allows creating an EGLImage from a Linux dma_buf file
descriptor or multiple file descriptors in the case of multi-plane YUV
images.
New Types
None
New Procedures and Functions
None
New Tokens
Accepted by the <target> parameter of eglCreateImageKHR:
EGL_LINUX_DMA_BUF_EXT
Accepted as an attribute in the <attrib_list> parameter of
eglCreateImageKHR:
EGL_LINUX_DRM_FOURCC_EXT
EGL_DMA_BUF_PLANE0_FD_EXT
EGL_DMA_BUF_PLANE0_OFFSET_EXT
EGL_DMA_BUF_PLANE0_PITCH_EXT
EGL_DMA_BUF_PLANE1_FD_EXT
EGL_DMA_BUF_PLANE1_OFFSET_EXT
EGL_DMA_BUF_PLANE1_PITCH_EXT
EGL_DMA_BUF_PLANE2_FD_EXT
EGL_DMA_BUF_PLANE2_OFFSET_EXT
EGL_DMA_BUF_PLANE2_PITCH_EXT
Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)
Add to section 2.5.1 "EGLImage Specification" (as defined by the
EGL_KHR_image_base specification), in the description of
eglCreateImageKHR:
"Values accepted for <target> are listed in Table aaa, below.
+-------------------------+--------------------------------------------+
| <target>                | Notes                                      |
+-------------------------+--------------------------------------------+
| EGL_LINUX_DMA_BUF_EXT   | Used for EGLImages imported from Linux     |
|                         | dma_buf file descriptors                   |
+-------------------------+--------------------------------------------+
Table aaa. Legal values for eglCreateImageKHR <target> parameter
...
If <target> is EGL_LINUX_DMA_BUF_EXT, <dpy> must be a valid display, <ctx>
must be EGL_NO_CONTEXT, and <buffer> must be NULL, cast into the type
EGLClientBuffer. The details of the image are specified by the attributes
passed into eglCreateImageKHR. Required attributes and their values are as
follows:

* EGL_WIDTH & EGL_HEIGHT: The logical dimensions of the buffer in pixels

* EGL_LINUX_DRM_FOURCC_EXT: The pixel format of the buffer, as specified
  by drm_fourcc.h and used as the pixel_format parameter of the
  drm_mode_fb_cmd2 ioctl.

* EGL_DMA_BUF_PLANE0_FD_EXT: The dma_buf file descriptor of plane 0 of
  the image.

* EGL_DMA_BUF_PLANE0_OFFSET_EXT: The offset from the start of the
  dma_buf of the first sample in plane 0, in bytes.

* EGL_DMA_BUF_PLANE0_PITCH_EXT: The number of bytes between the start of
  subsequent rows of samples in plane 0. May have special meaning for
  non-linear formats.
For images in an RGB color-space or those using a single-plane YUV format,
only the first plane's file descriptor, offset & pitch should be specified.
For semi-planar YUV formats, the chroma samples are stored in plane 1 and
for fully planar formats, U-samples are stored in plane 1 and V-samples are
stored in plane 2. Planes 1 & 2 are specified by the following attributes,
which have the same meanings as defined above for plane 0:
* EGL_DMA_BUF_PLANE1_FD_EXT
* EGL_DMA_BUF_PLANE1_OFFSET_EXT
* EGL_DMA_BUF_PLANE1_PITCH_EXT
* EGL_DMA_BUF_PLANE2_FD_EXT
* EGL_DMA_BUF_PLANE2_OFFSET_EXT
* EGL_DMA_BUF_PLANE2_PITCH_EXT
If eglCreateImageKHR is successful for an EGL_LINUX_DMA_BUF_EXT target, the
EGL takes ownership of the file descriptor and is responsible for closing
it, which it may do at any time while the EGLDisplay is initialized."
Add to the list of error conditions for eglCreateImageKHR:
"* If <target> is EGL_LINUX_DMA_BUF_EXT and <buffer> is not NULL, the
   error EGL_BAD_PARAMETER is generated.

 * If <target> is EGL_LINUX_DMA_BUF_EXT, and the list of attributes is
   incomplete, EGL_BAD_PARAMETER is generated.

 * If <target> is EGL_LINUX_DMA_BUF_EXT, and the EGL_LINUX_DRM_FOURCC_EXT
   attribute is set to a format not supported by the EGL, EGL_BAD_MATCH
   is generated.

 * If <target> is EGL_LINUX_DMA_BUF_EXT, and the EGL_LINUX_DRM_FOURCC_EXT
   attribute indicates a single-plane format, EGL_BAD_ATTRIBUTE is
   generated if any of the EGL_DMA_BUF_PLANE1_* or EGL_DMA_BUF_PLANE2_*
   attributes are specified.
Issues
1. Should this be a KHR or EXT extension?

   ANSWER: EXT. The Khronos EGL working group is not keen on this extension
   as it is seen as contradicting the EGLStream direction the specification
   is going in. The working group recommends creating additional specs to
   allow an EGLStream producer/consumer connected to v4l2/DRM or any other
   Linux interface.

2. Should this be a generic any-platform extension, or a Linux-only
   extension which explicitly states the handles are dma_buf fds?

   ANSWER: There's currently no intention to port this extension to any OS
   not based on the Linux kernel. Consequently, this spec can be explicitly
   written against Linux and the dma_buf API.

3. Does ownership of the file descriptor pass to the EGL library?

   PROPOSAL: If eglCreateImageKHR is successful, EGL assumes ownership of
   the file descriptors and is responsible for closing them.

4. How are the different YUV color spaces handled (BT.709/BT.601)?

   Open issue, still TBD. Doesn't seem to be specified by either the v4l2
   or DRM APIs. PROPOSAL: Undefined and implementation/format dependent.

5. What chroma-siting is used for sub-sampled YUV formats?

   Open issue, still TBD. Doesn't seem to be specified by either the v4l2
   or DRM APIs. PROPOSAL: Undefined and implementation/format dependent.

6. How can an application query which formats the EGL implementation
   supports?

   PROPOSAL: Don't provide a query mechanism but instead add an error
   condition that EGL_BAD_MATCH is raised if the EGL implementation doesn't
   support that particular format.

7. Which image formats should be supported and how is the format specified?

   Open issue, still TBD. There seem to be two options: 1) specify a new
   enum in this specification and enumerate all possible formats, or 2) use
   an existing enum already in Linux, either v4l2_mbus_pixelcode and/or
   those formats listed in drm_fourcc.h.

   PROPOSAL: Go for option 2) and just use values defined in drm_fourcc.h.
Revision History
#3 (Tom Cooksey, August 16, 2012)
- Changed name from EGL_EXT_image_external and re-written language to
explicitly state this for use with Linux & dma_buf.
- Added a list of issues, including some still open ones.
#2 (Jesse Barker, May 30, 2012)
- Revision to split eglCreateImageKHR functionality from export
  functionality.
- Update definition of EGLNativeBufferType to be a struct containing a list
  of handles to support multi-buffer/multi-planar formats.
#1 (Jesse Barker, March 20, 2012)
- Initial draft.
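For illustration only (not part of the spec text), a minimal usage sketch
for a single-plane XRGB8888 import could look like the following. The EGL_*
tokens are placeholders until the registry assigns values, and in real code
eglCreateImageKHR would be obtained via eglGetProcAddress:

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <drm_fourcc.h>

    /* Wrap a single-plane XRGB8888 dma_buf fd in an EGLImage.  On success
     * the fd is owned (and eventually closed) by the EGL implementation. */
    static EGLImageKHR import_dma_buf(EGLDisplay dpy, int fd,
                                      EGLint width, EGLint height,
                                      EGLint pitch)
    {
        const EGLint attribs[] = {
            EGL_WIDTH,                     width,
            EGL_HEIGHT,                    height,
            EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_XRGB8888,
            EGL_DMA_BUF_PLANE0_FD_EXT,     fd,
            EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
            EGL_DMA_BUF_PLANE0_PITCH_EXT,  pitch,
            EGL_NONE
        };

        return eglCreateImageKHR(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT,
                                 (EGLClientBuffer)NULL, attribs);
    }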
Hi!
Aaro Koskinen and Josh Coombs reported that commit e9da6e9905e639 ("ARM:
dma-mapping: remove custom consistent dma region") introduced a
regression. It turned out that the default 256KiB for the atomic coherent
pool might not be enough. After that patch, some Kirkwood systems run
out of atomic coherent memory and fail without any meaningful message.
This patch series is an attempt to fix those issues by adding function
for setting coherent pool size from platform initialization code and
increasing the size of the pool for Kirkwood systems.
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Patch summary:
Marek Szyprowski (3):
ARM: DMA-Mapping: add function for setting coherent pool size from
platform code
ARM: DMA-Mapping: print warning when atomic coherent allocation fails
ARM: Kirkwood: increase atomic coherent pool size
arch/arm/include/asm/dma-mapping.h | 7 +++++++
arch/arm/mach-kirkwood/common.c | 7 +++++++
arch/arm/mm/dma-mapping.c | 22 +++++++++++++++++++++-
3 files changed, 35 insertions(+), 1 deletions(-)
--
1.7.1.569.g6f426
Hi, All
We ran into a problem with dmac_map_area & dmac_flush_range on a user addr:
the mcr instruction would not return on an armv7 processor.
The existing ion carveout heap does not support partial cache flush;
the whole cache is flushed every time.
There is only one dirty bit for the carveout heap, and sg_table->nents is 1.
drivers/gpu/ion/ion_carveout_heap.c
ion_carveout_heap_map_dma -> sg_alloc_table(table, 1, GFP_KERNEL);
ion_buffer_alloc_dirty -> pages = buffer->sg_table->nents;
We want to support partial cache flush,
aligned to a cache line instead of PAGE_SIZE, for efficiency.
We have considered extending the dirty bits, but they only seem to align to PAGE_SIZE.
As an experiment we modified the ION_IOC_SYNC ioctl on armv7
to directly use dmac_map_area & dmac_flush_range with an addr from user space.
However, we find dmac_map_area cannot work with this addr from user space.
In fact, it is the mcr that cannot work with a user-space addr; it hangs.
Also, ion_vm_fault happens twice.
The first time is from __dabt_usr, when we access the mmapped buffer; that is fine.
The second is from __dabt_svc and is caused by the mcr, which is strange.
ION malloc from the carveout heap
addr = user mmap
user accesses addr, ion_vm_fault (__dabt_usr) builds the page table and does
vm_insert_page.
dmac_map_area & dmac_flush_range with addr -> ion_vm_fault (__dabt_svc)
mcr hangs.
We don't understand why ion_vm_fault happens twice when the page table has already been built,
or why mcr hangs with an addr from user space.
Besides, there is no problem with ION on 3.0, which does not use ion_vm_fault.
Any suggestions?
Thanks
Hi,
ION debugfs currently shows/groups output based on type.
But, it is possible to have multiple heaps of the same type - for CMA
and carveout types.
It is more useful to get usage information for individual heaps.
- Nishanth Peethambaran
From fa819b42fb69321a8e5db260ba9fd8ce7a2f16d2 Mon Sep 17 00:00:00 2001
From: Nishanth Peethambaran <nishanth(a)broadcom.com>
Date: Tue, 28 Aug 2012 07:57:37 +0530
Subject: [PATCH] gpu: ion: Update debugfs to show for each id
Update the debugfs read of client and heap to show
based on 'id' instead of 'type'.
Multiple heaps of the same type can be present, but
id is unique.
---
drivers/gpu/ion/ion.c | 14 +++++++-------
1 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/ion/ion.c b/drivers/gpu/ion/ion.c
index 34c12df..65cedee 100644
--- a/drivers/gpu/ion/ion.c
+++ b/drivers/gpu/ion/ion.c
@@ -547,11 +547,11 @@ static int ion_debug_client_show(struct seq_file *s, void *unused)
for (n = rb_first(&client->handles); n; n = rb_next(n)) {
struct ion_handle *handle = rb_entry(n, struct ion_handle,
node);
- enum ion_heap_type type = handle->buffer->heap->type;
+ int id = handle->buffer->heap->id;
- if (!names[type])
- names[type] = handle->buffer->heap->name;
- sizes[type] += handle->buffer->size;
+ if (!names[id])
+ names[id] = handle->buffer->heap->name;
+ sizes[id] += handle->buffer->size;
}
mutex_unlock(&client->lock);
@@ -1121,7 +1121,7 @@ static const struct file_operations ion_fops = {
};
static size_t ion_debug_heap_total(struct ion_client *client,
- enum ion_heap_type type)
+ int id)
{
size_t size = 0;
struct rb_node *n;
@@ -1131,7 +1131,7 @@ static size_t ion_debug_heap_total(struct ion_client *client,
struct ion_handle *handle = rb_entry(n,
struct ion_handle,
node);
- if (handle->buffer->heap->type == type)
+ if (handle->buffer->heap->id == id)
size += handle->buffer->size;
}
mutex_unlock(&client->lock);
@@ -1149,7 +1149,7 @@ static int ion_debug_heap_show(struct seq_file *s, void *unused)
for (n = rb_first(&dev->clients); n; n = rb_next(n)) {
struct ion_client *client = rb_entry(n, struct ion_client,
node);
- size_t size = ion_debug_heap_total(client, heap->type);
+ size_t size = ion_debug_heap_total(client, heap->id);
if (!size)
continue;
if (client->task) {
--
1.7.0.4
How do we share ion buffers from user-space with other processes if
they are exported/shared after fork?
The ION_IOC_SHARE ioctl creates an fd for process-1. In 3.0 kernel,
the ION_IOC_IMPORT ioctl from process-2 calls ion_import_fd, which
calls fget(fd) which fails to find the file for the fd shared by
process-1.
In 3.4 kernel, dma_buf_get does the fget(fd) to get struct file which
also fails for the same reason - fget searches in current->files.
- Nishanth Peethambaran
Hi all,
I think that we have a memory mapping issue on the ION carveout heap for
v3.4+ kernels from android.
The scenario is a user app + a kernel driver (cpu) + a kernel driver (dma), where
all three clients access the memory, and the memory is cacheable.
The .map_kernel() of the carveout heap remaps the allocated memory buffer
with ioremap().
In arm_ioremap(), we don't allow system memory to be mapped. In order to make
.map_kernel() work, we need to use memblock_alloc() &
memblock_remove() to move the heap memory from system to reserved
area. So the linear address of the memory buffer is removed from the page table,
and the new virtual address comes from .map_kernel() when a kernel driver
wants to access the buffer.
But ION uses dma_sync_sg_for_device() to flush the cache, which means it
uses the linear address of the page. So it is using a virtual address that
no longer exists, because it was removed by memblock_remove().
Solution #1.
.map_kernel() only returns the linear address. One limitation of this
solution is that the heap must always lie in low memory. In return, we don't need
ioremap() or memblock_remove() at all.
Solution #2.
Use vmap() in .map_kernel().
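A rough sketch of what #2 could look like, assuming the carveout region is
only memblock_reserve()'d (not removed) so its pages still have struct page
entries; the function and names are illustrative:

    #include <linux/vmalloc.h>
    #include <linux/slab.h>
    #include <linux/mm.h>
    #include <linux/pfn.h>

    /* Build a page array for the physically contiguous carveout buffer and
     * vmap() it with PAGE_KERNEL (cacheable), giving .map_kernel() a valid
     * kernel virtual address without using ioremap(). */
    static void *carveout_map_kernel(phys_addr_t base, size_t size)
    {
        unsigned int i, npages = PAGE_ALIGN(size) >> PAGE_SHIFT;
        struct page **pages;
        void *vaddr;

        pages = kmalloc(npages * sizeof(*pages), GFP_KERNEL);
        if (!pages)
            return NULL;

        for (i = 0; i < npages; i++)
            pages[i] = pfn_to_page(PFN_DOWN(base) + i);

        vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
        kfree(pages);
        return vaddr;
    }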
What do you think about these two solutions?
Regards
Haojian