Hi folks,
I just ran some regression tests on stable 4.9.122 with LTP. madvise05 triggers the below kernel panic:
[ 6785.089994] BUG: unable to handle kernel paging request at ffffeaff44488020
[ 6785.097952] IP: [] page_remove_rmap+0x27/0x580
[ 6785.104810] PGD 0
[ 6785.106859]
[ 6785.108526] Oops: 0000 [#1] SMP
[ 6785.112029] Modules linked in: mptctl(E) mptbase(E) tun(E) fuse(E) vfat(E) fat(E) btrfs(E) xor(E) raid6_pq(E) xfs(E)
[ 6785.123905] CPU: 14 PID: 77983 Comm: madvise05 Tainted: G E 4.9.122-001.ali3000_nightly_cov_20180820_193.test.alios7.x86_64 #1
[ 6785.137880] Hardware name: Dell Inc. PowerEdge R720xd/0X6FFV, BIOS 1.3.6 09/11/2012
[ 6785.146425] task: ffff882daeb78000 task.stack: ffffc9001b438000
[ 6785.153031] RIP: 0010:[] [] page_remove_rmap+0x27/0x580
[ 6785.162461] RSP: 0018:ffffc9001b43bc50 EFLAGS: 00010246
[ 6785.168388] RAX: 0000000000000000 RBX: ffffeaff44488000 RCX: 0000000000000080
[ 6785.176351] RDX: ffffeaff44488000 RSI: 0000000000000001 RDI: ffffeaff44488000
[ 6785.184315] RBP: ffffc9001b43bc80 R08: ffff882f9d4a6540 R09: ffffffff84bcdf60
[ 6785.192277] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 6785.200241] R13: ffff882db6996910 R14: ffffea00b6da65b0 R15: 00003fffffe00000
[ 6785.208205] FS: 00002ae5a3dfbb80(0000) GS:ffff882fbf180000(0000) knlGS:0000000000000000
[ 6785.217234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6785.223646] CR2: ffffeaff44488020 CR3: 0000002f9d51c000 CR4: 00000000000606f0
[ 6785.231610] Stack:
[ 6785.233851] 0000000000000000 00003ffffd6b3320 0434f9415ac47f2c ffffeaff44488000
[ 6785.242143] ffffc9001b43bdd8 ffff882db6996910 ffffc9001b43bcc0 ffffffff81355aaa
[ 6785.250438] ffff882f9d4a6540 00002ae5a4400000 00002ae5a4600000 ffffffff84bcd7a0
[ 6785.258728] Call Trace:
[ 6785.261460] [] zap_huge_pmd+0x28a/0x640
[ 6785.267585] [] unmap_page_range+0x532/0x630
[ 6785.274087] [] unmap_single_vma+0xa9/0x160
[ 6785.280501] [] unmap_vmas+0x5f/0xe0
[ 6785.286236] [] unmap_region+0xe4/0x1e0
[ 6785.292263] [] ? blk_finish_plug+0x3c/0x60
[ 6785.298669] [] ? SYSC_madvise+0x69b/0xed0
[ 6785.304985] [] do_munmap+0x39b/0x5b0
[ 6785.310818] [] SyS_munmap+0x78/0xb0
[ 6785.316552] [] do_syscall_64+0xf4/0x350
[ 6785.322676] [] entry_SYSCALL_64_after_swapgs+0x58/0xca
[ 6785.330244] Code: 00 00 00 00 66 66 66 66 90 55 <48> 8b 57 20 48 89 f8 f6 c2 01 0f
[ 6785.339011] RIP [] page_remove_rmap+0x27/0x580
[ 6785.345825] RSP
[ 6785.349715] CR2: ffffeaff44488020
The same test case works well on both 4.9.119 and Linus's latest tree. So it looks like it is caused by the L1TF patches on the stable tree.
The madvise05 test case can be simplified to the test program below:
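As a rough aid for reading the oops, the faulting struct page pointer can be turned back into a page frame number. The sketch below is a user-space model, not kernel code. It assumes the default x86_64 vmemmap base of 0xffffea0000000000 (no memory-layout randomization) and a 64-byte struct page; neither is stated in the report, and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: default x86_64 vmemmap base with 4-level paging. */
#define VMEMMAP_BASE_ASSUMED     0xffffea0000000000ULL
/* Assumption: sizeof(struct page) is 64 bytes on this build. */
#define STRUCT_PAGE_SIZE_ASSUMED 64ULL
#define PAGE_SHIFT_4K            12

/* Hypothetical helper: model of page_to_pfn() for a vmemmap layout. */
static uint64_t page_addr_to_pfn_model(uint64_t page_addr)
{
	return (page_addr - VMEMMAP_BASE_ASSUMED) / STRUCT_PAGE_SIZE_ASSUMED;
}

/* Hypothetical helper: physical address of the first byte of a PFN. */
static uint64_t pfn_to_phys_model(uint64_t pfn)
{
	return pfn << PAGE_SHIFT_4K;
}
```

Under those assumptions, the RDI/RBX value ffffeaff44488000 corresponds to PFN 0x3fd112200, i.e. a physical address of 0x3fd112200000 (roughly 64 TiB). That is far beyond what a machine of this class would normally have, which would be consistent with the struct page pointer having been computed from a bogus PFN.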
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	void *addr;
	int err;

	/* Populate a 32 MB private anonymous mapping. */
	addr = mmap(NULL, 32 * 1024 * 1024, PROT_READ,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (addr == MAP_FAILED) {
		printf("mmap failed\n");
		return 1;
	}

	/* Turn the populated mapping into PROT_NONE. */
	err = mprotect(addr, 32 * 1024 * 1024, PROT_NONE);
	if (err < 0) {
		printf("mprotect failed\n");
		return 1;
	}

	/* Unmapping the PROT_NONE range is where the oops triggers. */
	munmap(addr, 32 * 1024 * 1024);
	return 0;
}
You may already be aware of this problem. If not, any hint is appreciated.
Thanks,
Yang
On Tue, 2018-08-21 at 11:37 -0700, Yang Shi wrote:
> I just ran some regression tests on stable 4.9.122 with LTP. madvise05 triggers the below kernel panic:
Please could you try 4.9.123-rc1, specifically this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/c...
On 8/21/18 11:43 AM, David Woodhouse wrote:
> On Tue, 2018-08-21 at 11:37 -0700, Yang Shi wrote:
>> I just ran some regression tests on stable 4.9.122 with LTP. madvise05 triggers the below kernel panic:
Thanks, David. It works. A silly question: I don't get why this commit solves the issue, since it looks like just a code refactor. Is it only because it changes how the PFN is obtained from page table entries? And does that mean 4.9 stable without it has some mismatch?
> Please could you try 4.9.123-rc1, specifically this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/c...
On Tue, Aug 21, 2018 at 01:30:20PM -0700, yang.shi@linux.alibaba.com wrote:
> On 8/21/18 11:43 AM, David Woodhouse wrote:
>> On Tue, 2018-08-21 at 11:37 -0700, Yang Shi wrote:
>>> I just ran some regression tests on stable 4.9.122 with LTP. madvise05 triggers the below kernel panic:
> Thanks, David. It works. A silly question: I don't get why this commit solves the issue, since it looks like just a code refactor. Is it only because it changes how the PFN is obtained from page table entries? And does that mean 4.9 stable without it has some mismatch?
With the L1TF patches, open-coded use of pte_val() to get the PFN can cause problems, because it doesn't apply the inversion used for PROT_NONE mappings.
The cleanup changes the open-coded versions to use p*_pfn(), which always works correctly.
-Andi
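To make the point about inversion concrete, here is a small user-space model of the idea. It is a sketch only and not the kernel's code: the constant and function names are hypothetical, the bit positions (present at bit 0, PROT_NONE at bit 8, PFN in bits 12..51) are assumptions for illustration, and the rule "invert the PFN bits of any non-zero, non-present entry" is a simplification of what the L1TF patches do.

```c
#include <stdint.h>

/* Assumed bit layout for this model only. */
#define MODEL_PAGE_SHIFT   12
#define MODEL_PRESENT      0x001ULL
#define MODEL_PROTNONE     0x100ULL
#define MODEL_PFN_MASK     0x000ffffffffff000ULL

/* All-ones when the entry is non-zero and not present, else zero. */
static uint64_t model_invert_mask(uint64_t val)
{
	return (val != 0 && !(val & MODEL_PRESENT)) ? ~0ULL : 0ULL;
}

/* Build an entry; the PFN bits are stored inverted for non-present entries. */
static uint64_t model_make_pte(uint64_t pfn, uint64_t flags)
{
	uint64_t phys = pfn << MODEL_PAGE_SHIFT;

	phys ^= model_invert_mask(flags);
	return (phys & MODEL_PFN_MASK) | flags;
}

/* Helper-style accessor: undoes the inversion before extracting the PFN. */
static uint64_t model_pte_pfn(uint64_t pte)
{
	uint64_t val = pte ^ model_invert_mask(pte);

	return (val & MODEL_PFN_MASK) >> MODEL_PAGE_SHIFT;
}

/* Open-coded accessor: masks and shifts the raw value, skipping the inversion. */
static uint64_t model_pte_pfn_open_coded(uint64_t pte)
{
	return (pte & MODEL_PFN_MASK) >> MODEL_PAGE_SHIFT;
}
```

In this model both accessors agree for a present entry, but for a PROT_NONE entry only the helper-style accessor returns the original PFN. The open-coded one returns the inverted bits, which would then be turned into a struct page pointer for a page that does not exist.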
On 8/21/18 1:36 PM, Andi Kleen wrote:
> On Tue, Aug 21, 2018 at 01:30:20PM -0700, yang.shi@linux.alibaba.com wrote:
>> On 8/21/18 11:43 AM, David Woodhouse wrote:
>>> On Tue, 2018-08-21 at 11:37 -0700, Yang Shi wrote:
>>>> I just ran some regression tests on stable 4.9.122 with LTP. madvise05 triggers the below kernel panic:
>> Thanks, David. It works. A silly question: I don't get why this commit solves the issue, since it looks like just a code refactor. Is it only because it changes how the PFN is obtained from page table entries? And does that mean 4.9 stable without it has some mismatch?
> With the L1TF patches, open-coded use of pte_val() to get the PFN can cause problems, because it doesn't apply the inversion used for PROT_NONE mappings.
> The cleanup changes the open-coded versions to use p*_pfn(), which always works correctly.
Thanks. Got it.
> -Andi
linux-stable-mirror@lists.linaro.org