Hi Maarten!
Broadening the audience a bit..
On 9/14/12 9:12 AM, Maarten Lankhorst wrote:
> Op 13-09-12 23:00, Thomas Hellstrom schreef:
>> On 09/13/2012 07:13 PM, Maarten Lankhorst wrote:
>>> Hey
>>>
>>> Op 13-09-12 18:41, Thomas Hellstrom schreef:
>>>> On 09/13/2012 05:19 PM, Maarten Lankhorst wrote:
>>>>> Hey,
>>>>>
>>>>> Op 12-09-12 15:28, Thomas Hellstrom schreef:
>>>>>> On 09/12/2012 02:48 PM, Maarten Lankhorst wrote:
>>>>>>> Hey Thomas,
>>>>>>>
>>>>>>> I'm playing around with moving reservations from ttm to global, but the way
>>>>>>> ttm is handling reservations is getting in the way. The code wants to take the
>>>>>>> bo off the lru list at the same time a reservation is made, but that
>>>>>>> seems to be slightly too strict. It would really help me if that guarantee
>>>>>>> were removed.
>>>>>> Hi, Maarten.
>>>>>>
>>>>>> Removing that restriction is not really possible at the moment.
>>>>>> Also the memory accounting code depends on this, and may cause reservations
>>>>>> in the most awkward places. Since these reservations don't have a ticket
>>>>>> they can and will cause deadlocks. So, in short, the restriction is there
>>>>>> to avoid deadlocks caused by ticketless reservations.
>>>>> I have finished the lockdep annotations now which seems to catch almost
>>>>> all abuse I threw at it, so I'm feeling slightly more confident about moving
>>>>> the locking order and reservations around.
>>>> Maarten, moving reservations in TTM out of the lru lock is incorrect as the code is
>>>> written now. If we want to move it out, we need something for ticketless reservations.
>>>>
>>>> I've been thinking of having a global hash table of tickets with the task struct pointer as the key,
>>>> but even then, we'd need to be able to handle EBUSY errors on every operation that might try to
>>>> reserve a buffer.
>>>>
>>>> The fact that lockdep doesn't complain isn't enough. There *will* be deadlock use-cases when TTM is handed
>>>> the right data-set.
>>>>
>>>> Isn't there a way that a subsystem can register a callback to be performed to remove stuff from LRU and
>>>> to take a pre-reservation lock?
>>> What if multiple subsystems need those? You will end up with a deadlock again.
>>>
>>> I think it would be easier to change the code in ttm_bo.c to not assume the first
>>> item on the lru list is really the least recently used, and assume the first item
>>> that can be reserved without blocking IS the least recently used instead.
>> So what would happen then is that we'd spin on the first item on the LRU list, since
>> when reserving we must release the LRU lock, and if reserving fails, we thus
>> need to restart LRU traversal. Typically after a schedule(). That's bad.
>>
>> So let's take a step back and analyze why the LRU lock has become a problem.
>> From what I can tell, it's because you want to use a per-object lock when reserving instead of a
>> global reservation lock (that TTM could use as the LRU lock). Is that correct?
>> And in that case, in what situation do you envision such a global lock being contended
>> to the extent that it hurts performance?
>>
>>>>> Lockdep WILL complain about trying to use multiple tickets, doing ticketed
>>>>> and unticketed blocking reservations mixed, etc.
>>>>>
>>>>> I want to remove the global fence_lock and make it a per-buffer lock. With some
>>>>> lockdep annotations it's perfectly legal to grab obj->fence_lock and obj2->fence_lock
>>>>> if you have a reservation, but it should complain loudly about trying to take 2 fence_locks
>>>>> at the same time without a reservation.
>>>> Yes, TTM was previously using per-buffer fence locks, and that works fine from a deadlock perspective, but
>>>> it hurts performance. Fencing 200 buffers in a command submission (google-earth for example) will mean
>>>> 198 unnecessary locks, each one stalling the processor pipeline. Locking is a *slow* operation, particularly
>>>> on systems with many processors, and I don't think it's a good idea to change that back without analyzing
>>>> the performance impact. There are reasons people are writing stuff like RCU to avoid locking...
>>> So why don't we simply use RCU for fence pointers and get rid of the fence locking? :D
>>> danvet originally suggested it as a joke but if you think about it, it would make a lot of sense for this usecase.
>> I thought of that before, but the problem is you'd still need a spinlock to change the buffer's fence pointer,
>> even if reading it becomes quick.
> Actually, I changed the lockdep annotations a bit to distinguish between the
> cases where ttm_bo_wait is called without a reservation and where it is
> called with one. As far as I can see there are only 2 places that do it without,
> at least if I converted my git tree properly.
>
> http://cgit.freedesktop.org/~mlankhorst/linux/log/?h=v10-wip
>
> The first one is nouveau_bo_vma_del; this can be fixed easily.
> The second one is ttm_bo_cleanup_refs and ttm_bo_cleanup_refs_or_queue.
> If the reservation is done before ttm_bo_wait, the fence_lock could be
> dropped entirely by adding smp_mb() in reserve and unreserve; functionally
> there would be no difference. So if you can verify that my lockdep annotations are
> correct in the most recent commit wrt what's using ttm_bo_wait without reservation,
> we could remove the fence_lock entirely.
>
> ~Maarten
Being able to wait for buffer idle or get the fence pointer without
reserving is a fundamental property of TTM. Reservation is a long-term
lock. The fence lock is a very short-term lock. If I were to choose, I'd
rather accept per-object fence locks than remove this property, but
see below.
Likewise, being able to guarantee that a reserved object is not on any
LRU list is an important property. Removing that property will, in
addition to the spin wait we've already discussed, make understanding TTM
locking even more difficult, and I'd really like to avoid it.
If this were a real performance problem we were trying to solve it would
be easier to motivate changes in this area, but if it's just trying to
avoid a global reservation lock and a global fence lock that will rarely
if ever see any contention, I can't see the point. On the contrary,
having per-object locks will be very costly when reserving / fencing
many objects. As mentioned before, in the fence lock case it's been
tried and removed, so I'd like to know the reasoning behind introducing
it again, and in what situations you think the global locks will be
contended.
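To make the cost argument concrete, fencing a whole validation list under the
single per-device fence_lock is one lock round-trip, roughly like this (a
simplified sketch along the lines of ttm_eu_fence_buffer_objects(); the
lru_lock interaction and error paths are omitted):

#include <drm/ttm/ttm_bo_driver.h>
#include <drm/ttm/ttm_execbuf_util.h>

/* Simplified sketch only: attach one fence to every bo on a validation
 * list with a single lock/unlock of bdev->fence_lock.  With per-object
 * fence locks this would become one lock/unlock per bo. */
static void fence_validation_list(struct ttm_bo_device *bdev,
				  struct list_head *list, void *sync_obj)
{
	struct ttm_bo_driver *driver = bdev->driver;
	struct ttm_validate_buffer *entry;

	spin_lock(&bdev->fence_lock);
	list_for_each_entry(entry, list, head)
		entry->bo->sync_obj = driver->sync_obj_ref(sync_obj);
	spin_unlock(&bdev->fence_lock);
}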
/Thomas
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, but this LRU reference means the buffer can't actually
be dropped until it is selected for eviction. There's no bound on the
time spent on the LRU list, which means that the buffer may be
undroppable for very long periods of time. Given that migration involves
dropping buffers, the associated page is now unmigratable for long
periods of time as well. CMA relies on being able to migrate a specific
range of pages, so these types of failures make CMA significantly less
reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out the
buffer, explicitly remove the buffer from the LRU list when trying to
drop it. There is still the possibility that the buffer could be added
back to the list, but that indicates the buffer is still in use and
would probably have other 'in use' indicators preventing it from being
dropped.
Signed-off-by: Laura Abbott <lauraa(a)codeaurora.org>
---
fs/buffer.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index ad5938c..daa0c3d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1399,12 +1399,49 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return 0;
}
+static void __evict_bh_lru(void *arg)
+{
+ struct bh_lru *b = &get_cpu_var(bh_lrus);
+ struct buffer_head *bh = arg;
+ int i;
+
+ for (i = 0; i < BH_LRU_SIZE; i++) {
+ if (b->bhs[i] == bh) {
+ brelse(b->bhs[i]);
+ b->bhs[i] = NULL;
+ goto out;
+ }
+ }
+out:
+ put_cpu_var(bh_lrus);
+}
+
+static bool bh_exists_in_lru(int cpu, void *arg)
+{
+ struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+ struct buffer_head *bh = arg;
+ int i;
+
+ for (i = 0; i < BH_LRU_SIZE; i++) {
+ if (b->bhs[i] == bh)
+ return 1;
+ }
+
+ return 0;
+
+}
void invalidate_bh_lrus(void)
{
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
+void evict_bh_lrus(struct buffer_head *bh)
+{
+ on_each_cpu_cond(bh_exists_in_lru, __evict_bh_lru, bh, 1, GFP_ATOMIC);
+}
+EXPORT_SYMBOL_GPL(evict_bh_lrus);
+
void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
{
@@ -3052,6 +3089,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
bh = head;
do {
+ evict_bh_lrus(bh);
if (buffer_write_io_error(bh) && page->mapping)
set_bit(AS_EIO, &page->mapping->flags);
if (buffer_busy(bh))
--
1.7.11.3
Hi Linus,
I would like to ask you to pull one more patch for the ARM dma-mapping
subsystem into the Linux v3.6 kernel tree. This patch fixes a very subtle
bug (a typical off-by-one error) which might appear in very rare
circumstances.
The following changes since commit 55d512e245bc7699a8800e23df1a24195dd08217:
Linux 3.6-rc5 (2012-09-08 16:43:45 -0700)
are available in the git repository at:
git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git fixes-for-3.6
for you to fetch changes up to f3d87524975f01b885fc3d009c6ab6afd0d00746:
arm: mm: fix DMA pool affiliation check (2012-09-10 16:15:48 +0200)
Thanks!
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Patch summary:
----------------------------------------------------------------
Thomas Petazzoni (1):
arm: mm: fix DMA pool affiliation check
arch/arm/mm/dma-mapping.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
On Wed, Sep 5, 2012 at 5:08 AM, Tomi Valkeinen <tomi.valkeinen(a)ti.com> wrote:
> Hi,
>
> OMAP has a custom video ram allocator, which I'd like to remove and use
> the standard dma allocation functions.
>
> There are two problems for which I'd like to hear suggestions or
> comments:
>
> First one is that the dma_alloc_* functions map the allocated memory for
> cpu use. In many cases with OMAP DSS (display subsystem) this is not
> needed: the memory may be written only by the SGX or the DSP, and it's
> only read by the DSS, so it's never touched by the CPU.
see dma_alloc_attrs() and DMA_ATTR_NO_KERNEL_MAPPING
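i.e. roughly something like this (untested sketch; with
DMA_ATTR_NO_KERNEL_MAPPING you get back an opaque cookie rather than a CPU
virtual address):

#include <linux/dma-mapping.h>
#include <linux/dma-attrs.h>

/* Untested sketch: allocate DSS scanout memory without creating a
 * kernel mapping.  The return value is an opaque cookie to be used
 * with dma_free_attrs()/dma_mmap_attrs(), not a pointer the CPU can
 * dereference. */
static void *alloc_scanout_buffer(struct device *dev, size_t size,
				  dma_addr_t *dma_handle)
{
	DEFINE_DMA_ATTRS(attrs);

	dma_set_attr(DMA_ATTR_NO_KERNEL_MAPPING, &attrs);
	return dma_alloc_attrs(dev, size, dma_handle, GFP_KERNEL, &attrs);
}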
> This is even more true when using VRFB on omap3 (and probably TILER on
> omap4) for rotation, as VRFB hides the actual memory and offers rotated
> views. In this case the backend memory is never accessed by anyone else
> than VRFB.
just fwiw, we don't actually need contiguous memory on o4/tiler :-)
(well, at least if you ignore things like secure playback)
> Is there a way to allocate the memory without creating a mapping? While
> it won't break anything as such, the allocated areas can be quite large
> thus causing large areas of the kernel's memory space to be needlessly
> reserved.
>
> The second case is passing a framebuffer address from the bootloader to
> the kernel. Often with mobile devices the bootloader will initialize the
> display hardware, showing a company logo or such. To keep the image on
> the screen when kernel starts we need to reserve the same physical
> memory area early at boot, and use that for the framebuffer.
with a bit of handwaving, this is possible. You can pass a base
address to dma_declare_contiguous() when you set up your device's CMA
pool. Although that doesn't really guarantee your allocation from
that pool is at offset zero, I suppose.
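Hand-wavy sketch of that (board code, run very early in boot before CMA is
initialized; the device pointer, FB_BASE and FB_SIZE are made-up placeholders
for whatever the bootloader actually programmed):

#include <linux/dma-contiguous.h>

/* Sketch: carve out a device-private CMA region at the physical address
 * the bootloader used for its framebuffer.  omapdss_fb_dev, FB_BASE and
 * FB_SIZE are placeholders, not real symbols. */
static int __init reserve_bootloader_fb(void)
{
	int ret;

	ret = dma_declare_contiguous(&omapdss_fb_dev, FB_SIZE, FB_BASE, 0);
	if (ret)
		pr_err("omapdss: failed to reserve bootloader fb: %d\n", ret);
	return ret;
}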
> I'm not sure if there's any actual problem with this one, presuming
> there is a solution for the first case. Somehow the memory is reserved
> at early boot time, and this is passed to the fb driver. But can the
> memory be managed the same way as in normal case (for example freeing
> it), or does it need to be handled as a special case?
special-casing it might be better.. although possibly a dma attr could
be added for this to tell dma_alloc_from_contiguous() that we need a
particular address within the CMA pool. It seems a bit like a hack,
but OTOH I guess pretty much every consumer device would need a hack
like this.
BR,
-R
> Tomi
>
v2->v3
Split out the oom killer patch only.
Based on Nishanth's patch, which changes ion_debug_heap_total to use the heap id.
1. Add heap_found.
2. Solve the issue of several ids sharing one type:
use ion_debug_heap_total(client, heap->id) instead of ion_debug_heap_total(client, heap->type),
since the id is unique while the type can be shared.
Fortunately Nishanth has updated his patch, so this is rebased on top of it.
v1->v2
Sync to Aug 30 common.git.
v0->v1:
1. Move ion_shrink out of the mutex, suggested by Nishanth.
2. Check the error flag with ERR_PTR(-ENOMEM).
3. Add msleep() to allow scheduling out.
Based on common.git, android-3.4 branch.
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to the tasks that refer to the buffer, in descending oom_score_adj order.
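The victim selection is conceptually something like this (simplified sketch
only, not the actual ion_shrink code; the real version only considers tasks
holding a reference to an ion buffer):

#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/oom.h>

/* Simplified sketch (not the patch itself): kill the task with the
 * highest oom_score_adj.  The real code restricts this to tasks that
 * refer to an ion buffer and repeats until the allocation can succeed. */
static int kill_highest_oom_score_adj(void)
{
	struct task_struct *p, *victim = NULL;
	int max_adj = OOM_SCORE_ADJ_MIN - 1;

	rcu_read_lock();
	for_each_process(p) {
		if (p->signal->oom_score_adj > max_adj) {
			max_adj = p->signal->oom_score_adj;
			victim = p;
		}
	}
	if (victim)
		send_sig(SIGKILL, victim, 0);
	rcu_read_unlock();

	return victim ? 0 : -ESRCH;
}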
Nishanth Peethambaran (1):
gpu: ion: Update debugfs to show for each id
Zhangfei Gao (1):
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 131 +++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 121 insertions(+), 10 deletions(-)
v1->v2
Sync to Aug 30 common.git.
v0->v1:
1. Change gen_pool_create(12, -1) to gen_pool_create(PAGE_SHIFT, -1), suggested by Haojian.
2. Move ion_shrink out of the mutex, suggested by Nishanth.
3. Check the error flag with ERR_PTR(-ENOMEM).
4. Add msleep() to allow scheduling out.
Based on common.git, android-3.4 branch.
Patch 2:
Add support for page-wise cache flush for the carveout heap.
There is only one nents for the carveout heap, as well as one dirty bit,
so a cache flush currently always affects the whole carveout heap.
Patch 3:
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to the tasks that refer to the buffer, in descending oom_score_adj order.
Zhangfei Gao (3):
gpu: ion: update carveout_heap_ops
gpu: ion: carveout_heap page wised cache flush
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 118 +++++++++++++++++++++++++++++++++-
drivers/gpu/ion/ion_carveout_heap.c | 25 ++++++--
2 files changed, 133 insertions(+), 10 deletions(-)
So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider
object interface, so that Optimus-based laptops can use our driver to drive the
discrete GPU and display on the integrated GPU. The good news is that I've got
a proof of concept working.
During a review of the current code, we came up with a few concerns:
1. The output source is responsible for allocating the shared memory
Right now, the X server calls CreatePixmap on the output source screen and then
expects the output sink screen to be able to display from whatever memory the
source allocates. Right now, the source has no mechanism for asking the sink
what its requirements are for the surface. I'm using our own internal pitch
alignment requirements and that seems to be good enough for the Intel device to
scan out, but that could be pure luck.
Does it make sense to add a mechanism for drivers to negotiate this with each
other, or is it sufficient to just define a lowest common denominator format and
if your hardware can't deal with that format, you just don't get to share
buffers?
One of my coworkers brought to my attention the fact that Tegra requires a
specific pitch alignment, and cannot accommodate larger pitches. If other SoC
designs have similar restrictions, we might need to add a handshake mechanism.
2. There's no fallback mechanism if sharing can't be negotiated
If RandR fails to share a pixmap with the output sink screen, the whole modeset
fails. This means you'll end up not seeing anything on the screen and you'll
probably think your computer locked up. Should there be some sort of software
copy fallback to ensure that something at least shows up on the display?
3. How should the memory be allocated?
In the prototype I threw together, I'm allocating the shared memory using
shm_open and then exporting that as a dma-buf file descriptor using an ioctl I
added to the kernel, and then importing that memory back into our driver through
dma_buf_attach & dma_buf_map_attachment. Does it make sense for user-space
programs to be able to export shmfs files like that? Should that interface go
in DRM / GEM / PRIME instead? Something else? I'm pretty unfamiliar with this
kernel code so any suggestions would be appreciated.
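For context, once we have the dma-buf fd the in-kernel import half is just the
standard dma-buf attach flow, roughly (sketch, error handling omitted; the
shmfs-export ioctl is the new part and isn't shown):

#include <linux/dma-buf.h>

/* Sketch: import a dma-buf fd handed in from userspace and get an
 * sg_table the device can map.  Error handling omitted for brevity. */
static struct sg_table *import_dmabuf(struct device *dev, int fd,
				      struct dma_buf_attachment **attach_out)
{
	struct dma_buf *dmabuf = dma_buf_get(fd);
	struct dma_buf_attachment *attach = dma_buf_attach(dmabuf, dev);

	*attach_out = attach;
	return dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
}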
-- Aaron
P.S. for those unfamiliar with PRIME:
Dave Airlie added new support to the X Resize and Rotate extension version 1.4
to support offloading display and rendering to different drivers. PRIME is the
DRM implementation in the kernel, layered on top of DMA-BUF, that implements the
actual sharing of buffers between drivers.
http://cgit.freedesktop.org/xorg/proto/randrproto/tree/randrproto.txt?id=ra…
http://airlied.livejournal.com/75555.html - update on hotplug server
http://airlied.livejournal.com/76078.html - randr 1.5 demo videos
Based on common.git, android-3.4 branch.
Patch 2:
Add support for page-wise cache flush for the carveout heap.
There is only one nents for the carveout heap, as well as one dirty bit,
so a cache flush currently always affects the whole carveout heap.
Patch 3:
Add an oom killer.
Once the heap is used up,
SIGKILL is sent to the tasks that refer to the buffer, in descending oom_score_adj order.
Zhangfei Gao (3):
gpu: ion: update carveout_heap_ops
gpu: ion: carveout_heap page wised cache flush
gpu: ion: oom killer
drivers/gpu/ion/ion.c | 112 ++++++++++++++++++++++++++++++++++-
drivers/gpu/ion/ion_carveout_heap.c | 23 ++++++--
2 files changed, 127 insertions(+), 8 deletions(-)