[PATCH 1/2] drm/i915/gem: Replace reloc chain with terminator on error unwind

Chris Wilson

19 Aug 2020

10:39 a.m.

If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.

Reported-by: Pavel Machek <pavel@ucw.cz>
Fixes: 964a9b0f611e ("drm/i915/gem: Use chained reloc batches")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: <stable@vger.kernel.org> # v5.8+
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 31 ++++++++++---------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 24a1486d2dc5..a09f04eee417 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -972,21 +972,6 @@ static int reloc_gpu_chain(struct reloc_cache *cache)
 	if (err)
 		goto out_pool;

-	GEM_BUG_ON(cache->rq_size + RELOC_TAIL > PAGE_SIZE / sizeof(u32));
-	cmd = cache->rq_cmd + cache->rq_size;
-	*cmd++ = MI_ARB_CHECK;
-	if (cache->gen >= 8)
-		*cmd++ = MI_BATCH_BUFFER_START_GEN8;
-	else if (cache->gen >= 6)
-		*cmd++ = MI_BATCH_BUFFER_START;
-	else
-		*cmd++ = MI_BATCH_BUFFER_START | MI_BATCH_GTT;
-	*cmd++ = lower_32_bits(batch->node.start);
-	*cmd++ = upper_32_bits(batch->node.start); /* Always 0 for gen<8 */
-	i915_gem_object_flush_map(cache->rq_vma->obj);
-	i915_gem_object_unpin_map(cache->rq_vma->obj);
-	cache->rq_vma = NULL;
-
 	err = intel_gt_buffer_pool_mark_active(pool, rq);
 	if (err == 0) {
 		i915_vma_lock(batch);
@@ -999,15 +984,31 @@ static int reloc_gpu_chain(struct reloc_cache *cache)
 	if (err)
 		goto out_pool;

+	GEM_BUG_ON(cache->rq_size + RELOC_TAIL > PAGE_SIZE / sizeof(u32));
+	cmd = cache->rq_cmd + cache->rq_size;
+	*cmd++ = MI_ARB_CHECK;
+	if (cache->gen >= 8)
+		*cmd++ = MI_BATCH_BUFFER_START_GEN8;
+	else if (cache->gen >= 6)
+		*cmd++ = MI_BATCH_BUFFER_START;
+	else
+		*cmd++ = MI_BATCH_BUFFER_START | MI_BATCH_GTT;
+	*cmd++ = lower_32_bits(batch->node.start);
+	*cmd++ = upper_32_bits(batch->node.start); /* Always 0 for gen<8 */
+
 	cmd = i915_gem_object_pin_map(batch->obj,
 				      cache->has_llc ?
 				      I915_MAP_FORCE_WB :
 				      I915_MAP_FORCE_WC);
 	if (IS_ERR(cmd)) {
+		/* We will replace the BBS with BBE upon flushing the rq */
 		err = PTR_ERR(cmd);
 		goto out_pool;
 	}

+	i915_gem_object_flush_map(cache->rq_vma->obj);
+	i915_gem_object_unpin_map(cache->rq_vma->obj);
+
 	/* Return with batch mapping (cmd) still pinned */
 	cache->rq_cmd = cmd;
 	cache->rq_size = 0;

-- 2.20.1
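
The error-unwind idea in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the opcodes, `struct toy_batch`, `emit_chain()`, `flush_batch()`, `execute()` and `run_scenario()` are hypothetical names invented for illustration and do not correspond to the i915 command encodings or driver functions. It shows a chain written at the tail of a batch, and a flush that overwrites that chain with a terminator when the next batch could not be set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy command stream (not the real MI_* encodings). */
enum { OP_NOOP = 0, OP_WORK = 1, OP_CHAIN = 2, OP_END = 3 };

#define BATCH_DWORDS 8u
#define CHAIN_TAIL   2u	/* OP_CHAIN plus the index of the next batch */

struct toy_batch {
	uint32_t cmd[BATCH_DWORDS];
	size_t used;		/* payload dwords emitted so far */
};

/* Write the chain at the tail; 'used' is left alone so a later flush can overwrite it. */
static void emit_chain(struct toy_batch *b, uint32_t next)
{
	assert(b->used + CHAIN_TAIL <= BATCH_DWORDS);
	b->cmd[b->used] = OP_CHAIN;
	b->cmd[b->used + 1] = next;
}

/* Terminate the batch at 'used', replacing any chain that was written there. */
static void flush_batch(struct toy_batch *b)
{
	assert(b->used < BATCH_DWORDS);
	b->cmd[b->used] = OP_END;
}

/* Returns the number of OP_WORK commands run, or -1 if no OP_END is reached. */
static int execute(const struct toy_batch *batches, size_t count)
{
	size_t cur = 0, pc = 0;
	int work = 0;

	for (int steps = 0; steps < 1000; steps++) {
		if (cur >= count || pc >= BATCH_DWORDS)
			return -1;	/* ran off the end: the toy version of a hang */
		switch (batches[cur].cmd[pc]) {
		case OP_END:
			return work;
		case OP_CHAIN:
			if (pc + 1 >= BATCH_DWORDS)
				return -1;
			cur = batches[cur].cmd[pc + 1];
			pc = 0;
			break;
		case OP_WORK:
			work++;
			pc++;
			break;
		default:
			pc++;
			break;
		}
	}
	return -1;
}

/* Build two work items plus a chain, then either finish or unwind the next batch. */
int run_scenario(bool next_batch_ok, bool terminate_on_error)
{
	struct toy_batch b[2] = { 0 };

	b[0].cmd[0] = OP_WORK;
	b[0].cmd[1] = OP_WORK;
	b[0].used = 2;
	emit_chain(&b[0], 1);

	if (next_batch_ok) {
		b[1].cmd[0] = OP_WORK;
		b[1].used = 1;
		flush_batch(&b[1]);
	} else if (terminate_on_error) {
		flush_batch(&b[0]);	/* chain replaced by the terminator */
	}
	return execute(b, 2);
}
```

In this model, leaving the chain in place after a failure sends the executor into a batch that never terminates, while flushing at the unchanged `used` offset stops execution after the work emitted so far.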

Chris Wilson

19 Aug

10:39 a.m.

New subject: [PATCH 2/2] drm/i915/gem: Fallback to using a plain kmap if reloc address space is limited

Since the processor may not support vmap with WC, or the system may be limited in virtual address space and so may fail to create such a vmap, fall back to using a plain kmap of the system pages and flush the buffer on completion.

Reported-by: Pavel Machek <pavel@ucw.cz>
Fixes: 964a9b0f611e ("drm/i915/gem: Use chained reloc batches")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: <stable@vger.kernel.org> # v5.8+
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 25 +++++++++++++------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index a09f04eee417..44df98d85b38 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -950,6 +950,21 @@ static void reloc_cache_init(struct reloc_cache *cache,

 #define RELOC_TAIL 4

+static u32 *__reloc_gpu_map(struct reloc_cache *cache,
+			    struct intel_gt_buffer_pool_node *pool)
+{
+	u32 *map;
+
+	map = i915_gem_object_pin_map(pool->obj,
+				      cache->has_llc ?
+				      I915_MAP_FORCE_WB :
+				      I915_MAP_FORCE_WC);
+	if (IS_ERR(map)) /* try a plain kmap (and flush) if no WC maps */
+		map = i915_gem_object_pin_map(pool->obj, I915_MAP_FORCE_WB);
+
+	return map;
+}
+
 static int reloc_gpu_chain(struct reloc_cache *cache)
 {
 	struct intel_gt_buffer_pool_node *pool;
@@ -996,10 +1011,7 @@ static int reloc_gpu_chain(struct reloc_cache *cache)
 	*cmd++ = lower_32_bits(batch->node.start);
 	*cmd++ = upper_32_bits(batch->node.start); /* Always 0 for gen<8 */

-	cmd = i915_gem_object_pin_map(batch->obj,
-				      cache->has_llc ?
-				      I915_MAP_FORCE_WB :
-				      I915_MAP_FORCE_WC);
+	cmd = __reloc_gpu_map(cache, pool);
 	if (IS_ERR(cmd)) {
 		/* We will replace the BBS with BBE upon flushing the rq */
 		err = PTR_ERR(cmd);
@@ -1096,10 +1108,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (IS_ERR(pool))
 		return PTR_ERR(pool);

-	cmd = i915_gem_object_pin_map(pool->obj,
-				      cache->has_llc ?
-				      I915_MAP_FORCE_WB :
-				      I915_MAP_FORCE_WC);
+	cmd = __reloc_gpu_map(cache, pool);
 	if (IS_ERR(cmd)) {
 		err = PTR_ERR(cmd);
 		goto out_pool;

-- 2.20.1
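
The fallback in the patch above (try the preferred mapping type, retry with a plain write-back mapping if that fails) can be sketched in userspace as follows. Everything here is an assumption made for illustration: the `toy_*` helpers only imitate the kernel's `ERR_PTR()`/`IS_ERR()`/`PTR_ERR()` convention, and `struct toy_object`, `toy_pin_map()`, `toy_reloc_map()` and `toy_try_map()` are hypothetical stand-ins, not i915 APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace imitations of the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO 4095

static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-TOY_MAX_ERRNO);
}

static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

enum toy_map_type { TOY_MAP_WB, TOY_MAP_WC };

/* Hypothetical object whose mapping attempts can be made to fail per type. */
struct toy_object {
	uint32_t backing[4];
	bool wc_works;
	bool wb_works;
	enum toy_map_type mapped_as;
};

static uint32_t *toy_pin_map(struct toy_object *obj, enum toy_map_type type)
{
	bool ok = (type == TOY_MAP_WC) ? obj->wc_works : obj->wb_works;

	if (!ok)
		return toy_err_ptr(-ENOMEM);
	obj->mapped_as = type;
	return obj->backing;
}

/* Same shape as the helper in the patch: preferred type first, then plain WB. */
static uint32_t *toy_reloc_map(struct toy_object *obj, bool has_llc)
{
	uint32_t *map = toy_pin_map(obj, has_llc ? TOY_MAP_WB : TOY_MAP_WC);

	if (toy_is_err(map))	/* retry with a write-back mapping */
		map = toy_pin_map(obj, TOY_MAP_WB);
	return map;
}

/* Returns the mapping type that was used, or a negative errno on failure. */
int toy_try_map(bool wc_works, bool wb_works, bool has_llc)
{
	struct toy_object obj = { .wc_works = wc_works, .wb_works = wb_works };
	uint32_t *map = toy_reloc_map(&obj, has_llc);

	if (toy_is_err(map))
		return (int)toy_ptr_err(map);
	map[0] = 0;		/* the mapping is only touched after the check */
	return (int)obj.mapped_as;
}
```

The caller still has to check the final result, since the fallback itself can fail; the sketch returns the negative errno in that case rather than writing through the error pointer.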

Pavel Machek

5:23 p.m.

Hi!

> If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.

Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.

I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).

If the machine still works tomorrow, I'll assume problem is solved.

Best regards, Pavel

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Chris Wilson

5:36 p.m.

Quoting Pavel Machek (2020-08-19 18:23:31)

> Hi!
>
> > If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.
>
> Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.
>
> I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).
>
> If the machine still works tomorrow, I'll assume problem is solved.

Aye, best wait until we have to start competing with Chromium for memory... The suspicion is that it was the resource allocation failure path.
-Chris

Pavel Machek

7:33 p.m.

Hi!

> > > If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.
> >
> > Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.
> >
> > I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).
> >
> > If the machine still works tomorrow, I'll assume problem is solved.
>
> Aye, best wait until we have to start competing with Chromium for memory... The suspicion is that it was the resource allocation failure path.

Yep, my machines are low on memory.

But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.

Best regards, Pavel

[ 5604.909393] ACPI: EC: event unblocked
[ 5604.913590] usb usb2: root hub lost power or was reset
[ 5604.913812] usb usb3: root hub lost power or was reset
[ 5604.914046] usb usb4: root hub lost power or was reset
[ 5604.918812] ata6: port disabled--ignoring
[ 5604.925353] sd 0:0:0:0: [sda] Starting disk
[ 5605.150042] thinkpad_acpi: ACPI backlight control delay disabled
[ 5605.204955] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 5605.205931] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 5605.205941] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 5605.205949] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 5605.207748] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 5605.207757] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 5605.207765] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 5605.208227] ata1.00: configured for UDMA/133
[ 5605.281913] usb 5-2: reset full-speed USB device number 3 using uhci_hcd
[ 5605.569752] usb 5-1: reset full-speed USB device number 2 using uhci_hcd
[ 5609.082771] PM: resume devices took 4.192 seconds
[ 5609.083380] OOM killer enabled.
[ 5609.083387] Restarting tasks ... done.
[ 5609.103164] video LNXVIDEO:00: Restoring backlight state
[ 5609.150144] PM: suspend exit
[ 5609.190535] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.239495] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.287144] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.344497] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5611.426855] wlan0: authenticate with 5c:f4:ab:10:d2:bb
[ 5611.430609] wlan0: send auth to 5c:f4:ab:10:d2:bb (try 1/3)
[ 5611.432552] wlan0: authenticated
[ 5611.433705] wlan0: associate with 5c:f4:ab:10:d2:bb (try 1/3)
[ 5611.436440] wlan0: RX AssocResp from 5c:f4:ab:10:d2:bb (capab=0x411 status=0 aid=1)
[ 5611.439083] wlan0: associated
[ 7744.718473] BUG: unable to handle page fault for address: f8c00000
[ 7744.718484] #PF: supervisor write access in kernel mode
[ 7744.718487] #PF: error_code(0x0002) - not-present page
[ 7744.718491] *pdpt = 0000000031b0b001 *pde = 0000000000000000
[ 7744.718500] Oops: 0002 [#1] PREEMPT SMP PTI
[ 7744.718506] CPU: 0 PID: 3004 Comm: Xorg Not tainted 5.9.0-rc1-next-20200819+ #134
[ 7744.718509] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
[ 7744.718518] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.718523] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.718527] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.718531] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.718535] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.718539] CR0: 80050033 CR2: f8c00000 CR3: 31ac2000 CR4: 000006b0
[ 7744.718543] Call Trace:
[ 7744.718553] ? shmem_read_mapping_page_gfp+0x32/0x70
[ 7744.718560] ? eb_lookup_vmas+0x272/0x9f0
[ 7744.718565] i915_gem_do_execbuffer+0xa7b/0x2730
[ 7744.718573] ? intel_runtime_pm_put_unchecked+0xd/0x10
[ 7744.718578] ? i915_gem_gtt_pwrite_fast+0xf6/0x520
[ 7744.718586] ? __lock_acquire.isra.0+0x223/0x500
[ 7744.718592] ? cache_alloc_debugcheck_after+0x151/0x180
[ 7744.718596] ? kvmalloc_node+0x69/0x80
[ 7744.718600] ? __kmalloc+0x92/0x120
[ 7744.718604] ? kvmalloc_node+0x69/0x80
[ 7744.718608] i915_gem_execbuffer2_ioctl+0xdd/0x350
[ 7744.718613] ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718619] drm_ioctl_kernel+0x91/0xe0
[ 7744.718623] ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718627] drm_ioctl+0x1fd/0x371
[ 7744.718631] ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718639] ? posix_get_monotonic_timespec+0x1d/0x80
[ 7744.718645] ? __sys_recvmsg+0x37/0x80
[ 7744.718649] ? drm_ioctl_kernel+0xe0/0xe0
[ 7744.718654] __ia32_sys_ioctl+0x14b/0x7c6
[ 7744.718661] ? exit_to_user_mode_prepare+0x53/0x100
[ 7744.718667] do_int80_syscall_32+0x2c/0x40
[ 7744.718674] entry_INT80_32+0x111/0x111
[ 7744.718678] EIP: 0xb7fd3092
[ 7744.718683] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 7744.718687] EAX: ffffffda EBX: 0000000a ECX: c0406469 EDX: bfe67abc
[ 7744.718691] ESI: b73c1000 EDI: c0406469 EBP: 0000000a ESP: bfe67a34
[ 7744.718695] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 7744.718700] ? asm_exc_nmi+0xcc/0x2bc
[ 7744.718703] Modules linked in:
[ 7744.718709] CR2: 00000000f8c00000
[ 7744.718714] ---[ end trace 121f748dd4d0d6ec ]---
[ 7744.718719] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.718723] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.718727] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.718731] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.718735] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.718739] CR0: 80050033 CR2: f8c00000 CR3: 31ac2000 CR4: 000006b0
[ 7744.723687] BUG: unable to handle page fault for address: f8c02038
[ 7744.723695] #PF: supervisor write access in kernel mode
[ 7744.723699] #PF: error_code(0x0002) - not-present page
[ 7744.723702] *pdpt = 0000000031866001 *pde = 0000000000000000
[ 7744.723711] Oops: 0002 [#2] PREEMPT SMP PTI
[ 7744.723717] CPU: 1 PID: 3004 Comm: Xorg Tainted: G D 5.9.0-rc1-next-20200819+ #134
[ 7744.723720] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
[ 7744.723728] EIP: n_tty_open+0x26/0x80
[ 7744.723733] Code: 00 00 00 90 55 89 e5 56 53 89 c3 b8 f0 22 00 00 e8 4f 39 cb ff 85 c0 74 62 89 c6 a1 00 2d 27 c5 b9 e8 2a 77 c5 ba 85 83 12 c5 <89> 46 38 8d 86 58 22 00 00 e8 8c 12 c0 ff 8d 86 a4 22 00 00 b9 e0
[ 7744.723738] EAX: 001c65c0 EBX: f2339000 ECX: c5772ae8 EDX: c5128385
[ 7744.723741] ESI: f8c02000 EDI: 00000000 EBP: f1ac1ee4 ESP: f1ac1edc
[ 7744.723745] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210286
[ 7744.723751] CR0: 80050033 CR2: f8c02038 CR3: 31864000 CR4: 000006b0
[ 7744.723755] Call Trace:
[ 7744.723763] tty_ldisc_open.isra.0+0x23/0x40
[ 7744.723768] tty_ldisc_reinit+0x99/0xe0
[ 7744.723772] tty_ldisc_hangup+0xc4/0x1e0
[ 7744.723776] __tty_hangup.part.0+0x13f/0x250
[ 7744.723781] tty_vhangup_session+0x11/0x20
[ 7744.723786] disassociate_ctty.part.0+0x34/0x230
[ 7744.723790] disassociate_ctty+0x28/0x30
[ 7744.723797] do_exit+0x456/0x960
[ 7744.723803] ? exit_to_user_mode_prepare+0x53/0x100
[ 7744.723808] rewind_stack_do_exit+0x11/0x13
[ 7744.723812] EIP: 0xb7fd3092
[ 7744.723815] Code: Bad RIP value.
[ 7744.723819] EAX: ffffffda EBX: 0000000a ECX: c0406469 EDX: bfe67abc
[ 7744.723823] ESI: b73c1000 EDI: c0406469 EBP: 0000000a ESP: bfe67a34
[ 7744.723827] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 7744.723837] ? asm_exc_nmi+0xcc/0x2bc
[ 7744.723839] Modules linked in:
[ 7744.723845] CR2: 00000000f8c02038
[ 7744.723851] ---[ end trace 121f748dd4d0d6ed ]---
[ 7744.723857] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.723861] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.723865] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.723869] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.723873] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.723877] CR0: 80050033 CR2: f8c02038 CR3: 31864000 CR4: 000006b0
[ 7744.723880] Fixing recursive fault but reboot is needed!
[ 7749.589011] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7749.589024] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 7749.589030] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 7749.589036] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 7749.589041] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 7749.589047] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 7749.589053] GPU crash dump saved to /sys/class/drm/card0/error
[ 7749.909841] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7756.504232] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7756.817879] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7763.672921] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7763.985882] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7770.580999] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7770.897884] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7777.497036] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7777.825882] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7784.664999] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
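
One way to read the first oops above is to decode the bytes at the `<c7>` marker in its "Code:" line. The sketch below is not part of the thread and makes two loudly stated assumptions: that `c7 03 imm32` is the x86 `MOV r/m32, imm32` form with a register-indirect operand (ModRM `03` selecting EBX), and that the `MI_INSTR`, `MI_STORE_DWORD_IMM` and `MI_MEM_VIRTUAL` definitions copied here match the i915 headers of that kernel. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the i915 command macros; verify against the kernel headers. */
#define MI_INSTR(opcode, flags)	(((uint32_t)(opcode) << 23) | (flags))
#define MI_STORE_DWORD_IMM	MI_INSTR(0x20, 1)
#define MI_MEM_VIRTUAL		(1u << 22)

/* Bytes starting at the <c7> marker of the first "Code:" line in the log. */
static const uint8_t faulting_insn[] = { 0xc7, 0x03, 0x01, 0x00, 0x40, 0x10 };

/*
 * Decode "c7 /0 id" (MOV r/m32, imm32) for the simple register-indirect
 * case only (ModRM mod=00, no SIB, no displacement). Returns 0 on success.
 */
int decode_mov_imm32(const uint8_t *insn, size_t len,
		     unsigned *base_reg, uint32_t *imm)
{
	uint8_t modrm;

	if (len < 6 || insn[0] != 0xc7)
		return -1;
	modrm = insn[1];
	if ((modrm >> 6) != 0 || ((modrm >> 3) & 7) != 0)
		return -1;
	if ((modrm & 7) == 4 || (modrm & 7) == 5)
		return -1;	/* SIB and disp32 forms are not handled here */

	*base_reg = modrm & 7;	/* 0=EAX, 1=ECX, 2=EDX, 3=EBX */
	*imm = (uint32_t)insn[2] |
	       ((uint32_t)insn[3] << 8) |
	       ((uint32_t)insn[4] << 16) |
	       ((uint32_t)insn[5] << 24);	/* little-endian immediate */
	return 0;
}

/* Value the faulting instruction tried to store, or 0 if decoding fails. */
uint32_t oops_store_value(void)
{
	unsigned reg;
	uint32_t imm;

	if (decode_mov_imm32(faulting_insn, sizeof(faulting_insn), &reg, &imm))
		return 0;
	return imm;
}

/* Base register of the faulting store, or -1 if decoding fails. */
int oops_store_base_reg(void)
{
	unsigned reg;
	uint32_t imm;

	if (decode_mov_imm32(faulting_insn, sizeof(faulting_insn), &reg, &imm))
		return -1;
	return (int)reg;
}
```

If those assumptions hold, the faulting store writes `0x10400001` through EBX, which the log reports as `f8c00000`, and that value equals `MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL`. That would be consistent with a relocation command being emitted at an unmapped, page-aligned address, but it is an interpretation rather than something the thread states.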

Chris Wilson

7:40 p.m.

Quoting Pavel Machek (2020-08-19 20:33:26)

> Hi!
>
> > > > If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.
> > >
> > > Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.
> > >
> > > I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).
> > >
> > > If the machine still works tomorrow, I'll assume problem is solved.
> >
> > Aye, best wait until we have to start competing with Chromium for memory... The suspicion is that it was the resource allocation failure path.
>
> Yep, my machines are low on memory.
>
> But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.
>
> [ 7744.718473] BUG: unable to handle page fault for address: f8c00000
> [ 7744.718484] #PF: supervisor write access in kernel mode
> [ 7744.718487] #PF: error_code(0x0002) - not-present page
> [ 7744.718491] *pdpt = 0000000031b0b001 *pde = 0000000000000000
> [ 7744.718500] Oops: 0002 [#1] PREEMPT SMP PTI
> [ 7744.718506] CPU: 0 PID: 3004 Comm: Xorg Not tainted 5.9.0-rc1-next-20200819+ #134
> [ 7744.718509] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> [ 7744.718518] EIP: eb_relocate_vma+0xdbf/0xf20

To save me guessing, paste the above location into ./scripts/decode_stacktrace.sh ./vmlinux . ./drivers/gpu/drm/i915

The f8c00000 is something running off the end of a kmap, but I didn't spot a path where we would ignore an error and keep on writing. Nevertheless it must exist.
-Chris
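
The "running off the end of a kmap" failure mode discussed above is what the `GEM_BUG_ON(cache->rq_size + RELOC_TAIL > PAGE_SIZE / sizeof(u32))` check in patch 1/2 guards against. A minimal userspace sketch of that kind of bounds check follows, assuming a 4096-byte page and a 4-dword tail reservation; `struct toy_cache`, `toy_reserve()` and `toy_fill()` are hypothetical names, and the real driver chains into a new batch where this sketch simply refuses the write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_PAGE_DWORDS	(TOY_PAGE_SIZE / sizeof(uint32_t))
#define TOY_RELOC_TAIL	4u	/* dwords kept free for a chain or terminator */

struct toy_cache {
	uint32_t page[TOY_PAGE_DWORDS];	/* stands in for the mapped page */
	size_t size;			/* dwords emitted so far */
};

/*
 * Reserve 'len' dwords, always leaving TOY_RELOC_TAIL dwords free.
 * Returns NULL instead of handing out space past the end of the page.
 */
uint32_t *toy_reserve(struct toy_cache *c, size_t len)
{
	uint32_t *cmd;

	assert(c->size + TOY_RELOC_TAIL <= TOY_PAGE_DWORDS);
	if (len > TOY_PAGE_DWORDS - TOY_RELOC_TAIL - c->size)
		return NULL;

	cmd = c->page + c->size;
	c->size += len;
	return cmd;
}

/* Emit 'len'-dword commands until the page is full; returns how many fit. */
size_t toy_fill(size_t len)
{
	static struct toy_cache cache;
	size_t n = 0;
	uint32_t *cmd;

	if (len == 0)
		return 0;
	cache.size = 0;
	while ((cmd = toy_reserve(&cache, len)) != NULL) {
		for (size_t i = 0; i < len; i++)
			cmd[i] = 0;
		n++;
	}
	return n;
}
```

With these numbers a page holds 1024 dwords, 1020 of which are usable, so three-dword commands fit 340 times before the reservation fails.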

Pavel Machek

7:47 p.m.

Hi!

> > Yep, my machines are low on memory.
> >
> > But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.
> >
> > [ 7744.718473] BUG: unable to handle page fault for address: f8c00000
> > [ 7744.718484] #PF: supervisor write access in kernel mode
> > [ 7744.718487] #PF: error_code(0x0002) - not-present page
> > [ 7744.718491] *pdpt = 0000000031b0b001 *pde = 0000000000000000
> > [ 7744.718500] Oops: 0002 [#1] PREEMPT SMP PTI
> > [ 7744.718506] CPU: 0 PID: 3004 Comm: Xorg Not tainted 5.9.0-rc1-next-20200819+ #134
> > [ 7744.718509] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > [ 7744.718518] EIP: eb_relocate_vma+0xdbf/0xf20
>
> To save me guessing, paste the above location into ./scripts/decode_stacktrace.sh ./vmlinux . ./drivers/gpu/drm/i915
>
> The f8c00000 is something running off the end of a kmap, but I didn't spot a path where we would ignore an error and keep on writing. Nevertheless it must exist.

Like this?

$ ./scripts/decode_stacktrace.sh ./vmlinux . ./drivers/gpu/drm/i915
f8c00000
f8c00000
eb_relocate_vma+0xdbf/0xf20
eb_relocate_vma (i915_gem_execbuffer.c:?)

Pavel

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Chris Wilson

7:52 p.m.

New subject: [Intel-gfx] [PATCH 1/2] drm/i915/gem: Replace reloc chain with terminator on error unwind

Quoting Pavel Machek (2020-08-19 20:47:23)

> Hi!
>
> > > Yep, my machines are low on memory.
> > >
> > > But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.
> > >
> > > [ 7744.718473] BUG: unable to handle page fault for address: f8c00000
> > > [ 7744.718484] #PF: supervisor write access in kernel mode
> > > [ 7744.718487] #PF: error_code(0x0002) - not-present page
> > > [ 7744.718491] *pdpt = 0000000031b0b001 *pde = 0000000000000000
> > > [ 7744.718500] Oops: 0002 [#1] PREEMPT SMP PTI
> > > [ 7744.718506] CPU: 0 PID: 3004 Comm: Xorg Not tainted 5.9.0-rc1-next-20200819+ #134
> > > [ 7744.718509] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > > [ 7744.718518] EIP: eb_relocate_vma+0xdbf/0xf20
> >
> > To save me guessing, paste the above location into ./scripts/decode_stacktrace.sh ./vmlinux . ./drivers/gpu/drm/i915
> >
> > The f8c00000 is something running off the end of a kmap, but I didn't spot a path where we would ignore an error and keep on writing. Nevertheless it must exist.
>
> Like this?
>
> $ ./scripts/decode_stacktrace.sh ./vmlinux . ./drivers/gpu/drm/i915
> f8c00000
> f8c00000
> eb_relocate_vma+0xdbf/0xf20
> eb_relocate_vma (i915_gem_execbuffer.c:?)

Ok, that didn't work as well as I'm used to. Thanks,
-Chris

Chris Wilson

20 Aug

7:36 a.m.

Quoting Pavel Machek (2020-08-19 20:33:26)

> Hi!
>
> > > > If we hit an error during construction of the reloc chain, we need to replace the chain into the next batch with the terminator so that upon flushing the relocations so far, we do not execute a hanging batch.
> > >
> > > Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.
> > >
> > > I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).
> > >
> > > If the machine still works tomorrow, I'll assume problem is solved.
> >
> > Aye, best wait until we have to start competing with Chromium for memory... The suspicion is that it was the resource allocation failure path.
>
> Yep, my machines are low on memory.
>
> But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.

Well my last remaining 32bit gen3 device is currently pushing up the daisies, so could you try removing the attempt to use WC? Something like

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 44df98d85b38..b26f7de913c3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -955,10 +955,7 @@ static u32 *__reloc_gpu_map(struct reloc_cache *cache,
 {
 	u32 *map;

-	map = i915_gem_object_pin_map(pool->obj,
-				      cache->has_llc ?
-				      I915_MAP_FORCE_WB :
-				      I915_MAP_FORCE_WC);
+	map = i915_gem_object_pin_map(pool->obj, I915_MAP_FORCE_WB);

on top of the previous patch. Fault injection didn't turn up anything in eb_relocate_vma, so we need to dig deeper.
-Chris
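
For readers unfamiliar with the fault-injection testing mentioned above, the general technique can be sketched in userspace: make the Nth allocation fail, run the operation, and check that every failure point unwinds cleanly. This is a generic illustration, not the kernel's fault-injection framework or the i915 selftests; `toy_alloc()`, `build_chain()` and `sweep_failures()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical harness state: fail the allocation whose index equals fail_at. */
static int fail_at = -1;	/* -1 means never inject a failure */
static int alloc_calls;

static void *toy_alloc(size_t size)
{
	if (alloc_calls++ == fail_at)
		return NULL;	/* injected failure */
	return malloc(size);
}

/* Operation under test: needs three allocations and must unwind on any failure. */
static int build_chain(void)
{
	void *a = NULL, *b = NULL, *c = NULL;
	int err = -1;

	a = toy_alloc(16);
	if (!a)
		goto out;
	b = toy_alloc(16);
	if (!b)
		goto out;
	c = toy_alloc(16);
	if (!c)
		goto out;
	err = 0;
out:
	free(c);
	free(b);
	free(a);
	return err;
}

/*
 * Inject a failure at allocation 0, 1, 2, ... until the operation completes
 * without reaching the injection point. Returns the number of injection
 * points that produced a clean error return.
 */
int sweep_failures(void)
{
	int clean = 0;

	for (int n = 0; ; n++) {
		int err;

		fail_at = n;
		alloc_calls = 0;
		err = build_chain();
		if (alloc_calls <= n) {	/* injection point was never reached */
			assert(err == 0);
			break;
		}
		if (err != 0)
			clean++;
	}
	fail_at = -1;
	return clean;
}
```

A sweep like this only exercises the failure points routed through the instrumented allocator, which is one reason such testing can miss a path, as the message above reports.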

Pavel Machek

8 Sep

10:23 p.m.

Hi!

> > > > Thanks for the patches. I assume this should fix problem from "5.9-rc1: graphics regression moved from -next to mainline" thread.
> > > >
> > > > I have applied them over current -next, and my machine seems to be working so far (but uptime is less than 30 minutes).
> > > >
> > > > If the machine still works tomorrow, I'll assume problem is solved.
> > >
> > > Aye, best wait until we have to start competing with Chromium for memory... The suspicion is that it was the resource allocation failure path.
> >
> > Yep, my machines are low on memory.
> >
> > But ... test did not work that well. I have dead X and blinking screen. Machine still works reasonably well over ssh, so I guess that's an improvement.
>
> Well my last remaining 32bit gen3 device is currently pushing up the daisies, so could you try removing the attempt to use WC? Something like
>
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -955,10 +955,7 @@ static u32 *__reloc_gpu_map(struct reloc_cache *cache,
>  {
>  	u32 *map;
>
> -	map = i915_gem_object_pin_map(pool->obj,
> -				      cache->has_llc ?
> -				      I915_MAP_FORCE_WB :
> -				      I915_MAP_FORCE_WC);
> +	map = i915_gem_object_pin_map(pool->obj, I915_MAP_FORCE_WB);
>
> on top of the previous patch. Fault injection didn't turn up anything in eb_relocate_vma, so we need to dig deeper.

With this on top of other patches, it works.

Tested-by: Pavel Machek pavel@ucw.cz

Best regards, Pavel

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

linux-stable-mirror@lists.linaro.org
