As of 4.13.11 (and also with 4.14-rc) we have an issue where when serving nfs4 sometimes we get the following BUG. When this bug happens, it usually also causes the motherboard to no longer POST until we externally re-flash the BIOS (using the BMC web interface). If a motherboard does not have an external way to flash the BIOS, this would brick the hardware.
The issue was introduced somewhere between 4.13.8 and 4.13.11 in the stable series 4.13 kernels. It seems to be much easier to trigger on 4.14 kernels than 4.13 kernels.
We are working on bisecting it, but it is slow going since it often takes several reboots to trigger the issue.
The taint is caused by the "gkuart" an out-of-kernel driver which is a fork of the cp210x driver with GPIO lines added to it, we can provide the source for this if needed.
When the BIOS is gets broke, we see these messages in the shutdown logs:
[ 2206.698884] kvm: exiting hardware virtualization [ 2206.700160] e1000e: EEE TX LPI TIMER: 00t [ 2206.743126] ACPI MEMORY or I/O RESET_REG.
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0 [ 58.964597] PGD 0 P4D 0 [ 58.965208] Oops: 0000 [#1] SMP [ 58.965847] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat gkuart(O) usbserial x86_pkg_temp_thermal ipmi_ssif tpm_tis tpm_tis_core ie31200_edac ext4 mbcache jbd2 e1000e crc32c_intel [ 58.969163] CPU: 0 PID: 3970 Comm: nfsd Tainted: G O 4.14.0-rc8-git-kratos-1-00012-gd6a2cf07f0c9 #1 [ 58.970693] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 58.971685] task: ffff88040b286200 task.stack: ffffc90002c94000 [ 58.972576] RIP: 0010:vfs_statfs+0x73/0xb0 [ 58.973329] RSP: 0018:ffffc90002c97b30 EFLAGS: 00010202 [ 58.974188] RAX: 0000000000000000 RBX: ffffc90002c97bf8 RCX: 0000000000001c00 [ 58.975253] RDX: 0000000000000c00 RSI: 0000000000000020 RDI: 0000000000000000 [ 58.976213] RBP: ffffc90002c97bc8 R08: 0000000000000000 R09: 00000000000000ff [ 58.977161] R10: 000000000038be3a R11: ffff88040ec440c8 R12: ffff88040c5ba000 [ 58.978107] R13: ffff88040a86e000 R14: ffff88040c5c1000 R15: ffffc90002c97bf8 [ 58.979051] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 58.980448] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 58.981419] CR2: 0000000000000230 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 58.982483] Call Trace: [ 58.983108] nfsd4_encode_fattr+0x1f3/0x2070 [ 58.983873] ? find_inode_fast+0x52/0x90 [ 58.984587] ? get_acl+0x17/0xf0 [ 58.985258] ? generic_permission+0x122/0x1a0 [ 58.986019] nfsd4_encode_getattr+0x25/0x30 [ 58.986746] nfsd4_encode_operation+0x98/0x1a0 [ 58.987485] nfsd4_proc_compound+0x3eb/0x5c0 [ 58.988206] nfsd_dispatch+0xa8/0x230 [ 58.988891] svc_process_common+0x347/0x640 [ 58.989619] svc_process+0x100/0x1b0 [ 58.990334] nfsd+0xe3/0x150 [ 58.990988] kthread+0xfc/0x130 [ 58.991651] ? nfsd_destroy+0x60/0x60 [ 58.992364] ? kthread_create_on_node+0x40/0x40 [ 58.993153] ret_from_fork+0x25/0x30 [ 58.993858] Code: d1 83 c9 08 40 f6 c6 04 0f 45 d1 89 d1 80 cd 04 40 f6 c6 08 0f 45 d1 89 d1 80 cd 08 40 f6 c6 10 0f 45 d1 89 d1 80 cd 10 83 e6 20 <48> 8b b7 30 02 00 00 0f 45 d1 83 ca 20 89 f1 83 e1 10 89 cf 83 [ 58.996592] RIP: vfs_statfs+0x73/0xb0 RSP: ffffc90002c97b30 [ 58.997474] CR2: 0000000000000230 [ 58.998147] ---[ end trace c3a6e976d53aaa00 ]--- [ 107.669217] random: crng init done [ 210.170059] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 210.176363] IP: vfs_statfs+0x73/0xb0 [ 210.177032] PGD 0 P4D 0 [ 210.177633] Oops: 0000 [#2] SMP [ 210.178286] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat gkuart(O) usbserial x86_pkg_temp_thermal ipmi_ssif tpm_tis tpm_tis_core ie31200_edac ext4 mbcache jbd2 e1000e crc32c_intel [ 210.192120] CPU: 0 PID: 3969 Comm: nfsd Tainted: G D O 4.14.0-rc8-git-kratos-1-00012-gd6a2cf07f0c9 #1 [ 210.203168] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 210.204140] task: ffff880409a7aa00 task.stack: ffffc90002c8c000 [ 210.205168] RIP: 0010:vfs_statfs+0x73/0xb0 [ 210.205893] RSP: 0018:ffffc90002c8fb30 EFLAGS: 00010202 [ 210.206708] RAX: 0000000000000000 RBX: ffffc90002c8fbf8 RCX: 0000000000001c00 [ 210.218314] RDX: 0000000000000c00 RSI: 0000000000000020 RDI: 0000000000000000 [ 210.219364] RBP: ffffc90002c8fbc8 R08: 0000000000000000 R09: 00000000000000ff [ 210.220426] R10: 000000000038be3a R11: ffff88040ec440c8 R12: ffff88040c5b8000 [ 210.221455] R13: ffff88040a86e000 R14: ffff88040c5c4000 R15: ffffc90002c8fbf8 [ 210.222484] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 210.223894] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 210.224938] CR2: 0000000000000230 CR3: 0000000001e0a003 CR4: 00000000001606f0 [ 210.226020] Call Trace: [ 210.226615] nfsd4_encode_fattr+0x1f3/0x2070 [ 210.227348] ? find_inode_fast+0x52/0x90 [ 210.238225] ? get_acl+0x17/0xf0 [ 210.238890] ? generic_permission+0x122/0x1a0 [ 210.239637] nfsd4_encode_getattr+0x25/0x30 [ 210.240365] nfsd4_encode_operation+0x98/0x1a0 [ 210.241127] nfsd4_proc_compound+0x3eb/0x5c0 [ 210.241868] nfsd_dispatch+0xa8/0x230 [ 210.242564] svc_process_common+0x347/0x640 [ 210.243294] svc_process+0x100/0x1b0 [ 210.243969] nfsd+0xe3/0x150 [ 210.244582] kthread+0xfc/0x130 [ 210.255467] ? nfsd_destroy+0x60/0x60 [ 210.256153] ? kthread_create_on_node+0x40/0x40 [ 210.256892] ret_from_fork+0x25/0x30 [ 210.257570] Code: d1 83 c9 08 40 f6 c6 04 0f 45 d1 89 d1 80 cd 04 40 f6 c6 08 0f 45 d1 89 d1 80 cd 08 40 f6 c6 10 0f 45 d1 89 d1 80 cd 10 83 e6 20 <48> 8b b7 30 02 00 00 0f 45 d1 83 ca 20 89 f1 83 e1 10 89 cf 83 [ 210.260340] RIP: vfs_statfs+0x73/0xb0 RSP: ffffc90002c8fb30 [ 210.261157] CR2: 0000000000000230 [ 210.261810] ---[ end trace c3a6e976d53aaa01 ]---
On Wed, Nov 8, 2017 at 4:43 PM, Patrick McLean chutzpah@gentoo.org wrote:
As of 4.13.11 (and also with 4.14-rc) we have an issue where when serving nfs4 sometimes we get the following BUG. When this bug happens, it usually also causes the motherboard to no longer POST until we externally re-flash the BIOS (using the BMC web interface). If a motherboard does not have an external way to flash the BIOS, this would brick the hardware.
That sounds like your BIOS is just broken.
The kernel oops is probably just a trigger for that - possibly because you reboot with a particular state that breaks the BIOS.
Also, are you sure you really need to reflash the BIOS? It's actually fairly hard to overwrite the BIOS itself, but crashing with bad hardware state (where "bad" can just mean "unexpected by the BIOS") can cause the BIOS to not properly re-initialize things, and hang at boot.
So not booting cleanly from a warm reset is a reasonably common BIOS failure.
And yes, reflashing tends to force a full initialization and thus "fixes" things, but it may be a big hammer when a cold boot or just a "reset BIOS to safe defaults" might be sufficient.
In pretty much all cases this is a sign of a nasty BIOS problem, though, and you may want to look into a firmware update from the vendor for that.
But on to the kernel side:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
0: 83 c9 08 or $0x8,%ecx 3: 40 f6 c6 04 test $0x4,%sil 7: 0f 45 d1 cmovne %ecx,%edx a: 89 d1 mov %edx,%ecx c: 80 cd 04 or $0x4,%ch f: 40 f6 c6 08 test $0x8,%sil 13: 0f 45 d1 cmovne %ecx,%edx 16: 89 d1 mov %edx,%ecx 18: 80 cd 08 or $0x8,%ch 1b: 40 f6 c6 10 test $0x10,%sil 1f: 0f 45 d1 cmovne %ecx,%edx 22: 89 d1 mov %edx,%ecx 24: 80 cd 10 or $0x10,%ch 27: 83 e6 20 and $0x20,%esi 2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction 31: 0f 45 d1 cmovne %ecx,%edx 34: 83 ca 20 or $0x20,%edx 37: 89 f1 mov %esi,%ecx 39: 83 e1 10 and $0x10,%ecx 3c: 89 cf mov %ecx,%edi
and all those odd cmovne and bit-ops are just the bit selection code in flags_by_mnt(), which is inlined through calculate_f_flags (which is _also_ inlined) into vfs_statfs().
Sadly, gcc makes a mess of it and actually generates code that looks like the original C. I would have hoped that gcc could have turned
if (x & BIT) y |= OTHER_BIT;
into
y |= (x & BIT) shifted-by-the-bit-difference-between BIT/OTHER_BIT;
but that doesn't happen. We actually do it by hand in some other more critical places, but it's painful to do by hand (because the shift direction/amount is not trivial to do in C).
Anyway, that cmovne noise makes it a bit hard to see the actual part that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
Now, afaik, mnt->mnt_sb should never be NULL in the first place for a proper path. And the vfs_statfs() code itself hasn't changed in a while.
Which does seem to implicate nfsd as having passed in a bad path to vfs_statfs(). But I'm not seeing any changes in nfsd either.
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
But bringing in Al Viro and Bruce Fields explicitly in case they see something. And Darrick, just in case it might be xfs.
Linus
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
0: 83 c9 08 or $0x8,%ecx 3: 40 f6 c6 04 test $0x4,%sil 7: 0f 45 d1 cmovne %ecx,%edx a: 89 d1 mov %edx,%ecx c: 80 cd 04 or $0x4,%ch f: 40 f6 c6 08 test $0x8,%sil 13: 0f 45 d1 cmovne %ecx,%edx 16: 89 d1 mov %edx,%ecx 18: 80 cd 08 or $0x8,%ch 1b: 40 f6 c6 10 test $0x10,%sil 1f: 0f 45 d1 cmovne %ecx,%edx 22: 89 d1 mov %edx,%ecx 24: 80 cd 10 or $0x10,%ch 27: 83 e6 20 and $0x20,%esi 2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction 31: 0f 45 d1 cmovne %ecx,%edx 34: 83 ca 20 or $0x20,%edx 37: 89 f1 mov %esi,%ecx 39: 83 e1 10 and $0x10,%ecx 3c: 89 cf mov %ecx,%edi
and all those odd cmovne and bit-ops are just the bit selection code in flags_by_mnt(), which is inlined through calculate_f_flags (which is _also_ inlined) into vfs_statfs().
Sadly, gcc makes a mess of it and actually generates code that looks like the original C. I would have hoped that gcc could have turned
if (x & BIT) y |= OTHER_BIT;
into
y |= (x & BIT) shifted-by-the-bit-difference-between BIT/OTHER_BIT;
but that doesn't happen. We actually do it by hand in some other more critical places, but it's painful to do by hand (because the shift direction/amount is not trivial to do in C).
Anyway, that cmovne noise makes it a bit hard to see the actual part that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is
Interesting...
struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ unsigned char s_blocksize_bits; unsigned long s_blocksize; loff_t s_maxbytes; /* Max file size */ struct file_system_type *s_type; const struct super_operations *s_op; const struct dquot_operations *dq_op; const struct quotactl_ops *s_qcop; const struct export_operations *s_export_op; unsigned long s_flags; ...
s_flags is preceded list_head, u32, unsigned char, 2 u64 and 5 pointers. IOW, 10 64bit words. And sure enough, amd64 builds here have mov 0x50(%rdi),%rsi in the corresponding place. What config and toolchain had produced that?
I would definitely start with turning the randomize crap off, just to exclude the compiler weirdness. Incidentally, randomizing anything that contains a hash chain and key... super_block is not the worst here - struct dentry is clear "winner". Anything in struct dentry { /* RCU lookup touched fields */ unsigned int d_flags; /* protected by d_lock */ seqcount_t d_seq; /* per dentry seqlock */ struct hlist_bl_node d_hash; /* lookup hash list */ struct dentry *d_parent; /* parent directory */ struct qstr d_name; struct inode *d_inode; /* Where the name belongs to - NULL is * negative */ moving into a separate cache line and we've just doubled cache footprint of hash chain traversal.
How much reordering does that gcc misfeature do and why do we enable that in the first place?
On 2017-11-08 06:40 PM, Linus Torvalds wrote:
On Wed, Nov 8, 2017 at 4:43 PM, Patrick McLean chutzpah@gentoo.org wrote:
As of 4.13.11 (and also with 4.14-rc) we have an issue where when serving nfs4 sometimes we get the following BUG. When this bug happens, it usually also causes the motherboard to no longer POST until we externally re-flash the BIOS (using the BMC web interface). If a motherboard does not have an external way to flash the BIOS, this would brick the hardware.
That sounds like your BIOS is just broken.
All the dead boards were from the same vendor. We are going to try some boards from another vendor today.
The kernel oops is probably just a trigger for that - possibly because you reboot with a particular state that breaks the BIOS.
Also, are you sure you really need to reflash the BIOS? It's actually fairly hard to overwrite the BIOS itself, but crashing with bad hardware state (where "bad" can just mean "unexpected by the BIOS") can cause the BIOS to not properly re-initialize things, and hang at boot.
So not booting cleanly from a warm reset is a reasonably common BIOS failure.
And yes, reflashing tends to force a full initialization and thus "fixes" things, but it may be a big hammer when a cold boot or just a "reset BIOS to safe defaults" might be sufficient.
In pretty much all cases this is a sign of a nasty BIOS problem, though, and you may want to look into a firmware update from the vendor for that.
We tried a cold power off (physically unplugging the machine from power) and a CMOS reset, and neither helped. The only thing that actually restored one of the dead boards was a reflash. I did the reflash with the latest code when I reflashed it.
But on to the kernel side:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
0: 83 c9 08 or $0x8,%ecx 3: 40 f6 c6 04 test $0x4,%sil 7: 0f 45 d1 cmovne %ecx,%edx a: 89 d1 mov %edx,%ecx c: 80 cd 04 or $0x4,%ch f: 40 f6 c6 08 test $0x8,%sil 13: 0f 45 d1 cmovne %ecx,%edx 16: 89 d1 mov %edx,%ecx 18: 80 cd 08 or $0x8,%ch 1b: 40 f6 c6 10 test $0x10,%sil 1f: 0f 45 d1 cmovne %ecx,%edx 22: 89 d1 mov %edx,%ecx 24: 80 cd 10 or $0x10,%ch 27: 83 e6 20 and $0x20,%esi 2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction 31: 0f 45 d1 cmovne %ecx,%edx 34: 83 ca 20 or $0x20,%edx 37: 89 f1 mov %esi,%ecx 39: 83 e1 10 and $0x10,%ecx 3c: 89 cf mov %ecx,%edi
and all those odd cmovne and bit-ops are just the bit selection code in flags_by_mnt(), which is inlined through calculate_f_flags (which is _also_ inlined) into vfs_statfs().
Sadly, gcc makes a mess of it and actually generates code that looks like the original C. I would have hoped that gcc could have turned
if (x & BIT) y |= OTHER_BIT;
into
y |= (x & BIT) shifted-by-the-bit-difference-between BIT/OTHER_BIT;
but that doesn't happen. We actually do it by hand in some other more critical places, but it's painful to do by hand (because the shift direction/amount is not trivial to do in C).
Anyway, that cmovne noise makes it a bit hard to see the actual part that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
Now, afaik, mnt->mnt_sb should never be NULL in the first place for a proper path. And the vfs_statfs() code itself hasn't changed in a while.
Which does seem to implicate nfsd as having passed in a bad path to vfs_statfs(). But I'm not seeing any changes in nfsd either.
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
It's an ext4 filesystem.
But bringing in Al Viro and Bruce Fields explicitly in case they see something. And Darrick, just in case it might be xfs.
Thanks
On Thu, Nov 09, 2017 at 11:34:19AM -0800, Patrick McLean wrote:
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
It's an ext4 filesystem.
Had there been toolchain changes around the same period?
On 2017-11-09 11:38 AM, Al Viro wrote:
On Thu, Nov 09, 2017 at 11:34:19AM -0800, Patrick McLean wrote:
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
It's an ext4 filesystem.
Had there been toolchain changes around the same period?
No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1.
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction
that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
Now, afaik, mnt->mnt_sb should never be NULL in the first place for a proper path. And the vfs_statfs() code itself hasn't changed in a while.
Which does seem to implicate nfsd as having passed in a bad path to vfs_statfs(). But I'm not seeing any changes in nfsd either.
It definitely is NULL mnt->mnt_sb and that should never happen. All struct mount instances are allocated by alloc_vfsmnt(). Its callers are * vfs_kern_mount(). Assigns ->mnt_sb to root->d_sb before anyone else sees the address of that object. * clone_mnt(). Assigns ->mnt_sb to that of preexisting instance before anyone else sees the address of that object.
No other callers exist and no other places ever modify the value of that field.
All instances of struct dentry are created by __d_alloc()[*], which assigns ->d_sb (never to be modified afterwards) *and* dereferences the pointer it has stored in ->d_sb before the created struct dentry becomes visible to anyone else. No struct dentry should ever be observed with NULL ->d_sb; the only way to get that is memory corruption or looking at freed instance after its memory has been reused for something else and zeroed.
In other words, we should never observe a struct mount with NULL ->mnt.mnt_sb - not without memory corruption or looking at freed instance.
The pointer in that case should've come from exp->ex_path.mnt, exp being the argument of nfsd4_encode_fattr(). Sure, it might have been a dangling reference. However, it looks a lot more like a memory corruptor *OR* miscompiled kernel.
What kind of load do the reproducer boxen have and how fast does that bug trigger? Would it be possible to slap something like if (unlikely(!exp->exp_path.mnt->mnt_sb)) { struct mount *m = real_mount(exp->exp_path.mnt); printk(KERN_ERR "mnt: %p\n", exp->exp_path.mnt); printk(KERN_ERR "name: [%s]\n", m->mnt_devname); printk(KERN_ERR "ns: [%p]\n", m->mnt_ns); printk(KERN_ERR "parent: [%p]\n", m->mnt_parent); WARN_ON(1); err = -EINVAL; goto out_nfserr; } in the beginning of nfsd4_encode_fattr() (with include of ../mount.h added in fs/nfsd/nfs4xdr.c) and see what will it catch?
Both with and without randomized structs, if possible - I might be barking at the wrong tree, but IMO the very first step in localizing that crap is to find out whether it's toolchain-related or not.
[*] strictly speaking, there is one exception - lib/test_printf.c has four static struct dentry instances. No chance of those being returned by any ->mount() instance, though.
On 2017-11-09 11:37 AM, Al Viro wrote:
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction
that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
All instances of struct dentry are created by __d_alloc()[*], which assigns ->d_sb (never to be modified afterwards) *and* dereferences the pointer it has stored in ->d_sb before the created struct dentry becomes visible to anyone else. No struct dentry should ever be observed with NULL ->d_sb; the only way to get that is memory corruption or looking at freed instance after its memory has been reused for something else and zeroed.
In other words, we should never observe a struct mount with NULL ->mnt.mnt_sb - not without memory corruption or looking at freed instance.
The pointer in that case should've come from exp->ex_path.mnt, exp being the argument of nfsd4_encode_fattr(). Sure, it might have been a dangling reference. However, it looks a lot more like a memory corruptor *OR* miscompiled kernel.
What kind of load do the reproducer boxen have and how fast does that bug trigger? Would it be possible to slap something like if (unlikely(!exp->exp_path.mnt->mnt_sb)) { struct mount *m = real_mount(exp->exp_path.mnt); printk(KERN_ERR "mnt: %p\n", exp->exp_path.mnt); printk(KERN_ERR "name: [%s]\n", m->mnt_devname); printk(KERN_ERR "ns: [%p]\n", m->mnt_ns); printk(KERN_ERR "parent: [%p]\n", m->mnt_parent); WARN_ON(1); err = -EINVAL; goto out_nfserr; } in the beginning of nfsd4_encode_fattr() (with include of ../mount.h added in fs/nfsd/nfs4xdr.c) and see what will it catch?
Both with and without randomized structs, if possible - I might be barking at the wrong tree, but IMO the very first step in localizing that crap is to find out whether it's toolchain-related or not.
The reproducer boxen are not under particularly heavy load, they are serving NFS to 1 or 2 clients (which are essentially embedded devices). When the bug triggers, it usually triggers pretty fast and reliably, but it seems to only trigger on some subset of bootups. Once it fails to trigger, we seem to have to reboot to get it to trigger.
I should be able to have some results with that added in a few hours. It's weirdly unreliable to reproduce this.
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean chutzpah@gentoo.org wrote:
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
It might be worth just verifying without RANDSTRUCT in particular.
That case has probably not gotten a huge amount of testing. As Al points out, it can cause absolutely horrendous cache access pattern changes, but it might also be triggering some corruption in case there's a problem with the plugin, or with some piece of kernel code that gets confused by it.
And most obviously: if there is some module or part of the kernel that got compiled with a different seed for the randstruct hashing, that will break in nasty nasty ways. Your out-of-kernel module is the obvious suspect for something like that, but honestly, it could be some missing build dependency, or simply a missing special case in the plugin itself a missing __no_randomize_layout or any number of things.
We've hit gcc bugs many times before - and the plugins are just new opportunities to hit cases that have gotten a lot less testing than the "normal" code flow has.
The structleak plugin is much less likely to be a problem (simply because it's a much simpler plugin), but hey, something being NULL when it shouldn't possibly be might be a stray "leak initialization".
So since you seem to be able to reproduce this _reasonably_ easily, it's definitely worth checking that it still reproduces even without the gcc plugins.
Just to narrow it down a bit.
Linus
On Thu, Nov 09, 2017 at 12:04:19PM -0800, Linus Torvalds wrote:
That case has probably not gotten a huge amount of testing. As Al points out, it can cause absolutely horrendous cache access pattern changes, but it might also be triggering some corruption in case there's a problem with the plugin, or with some piece of kernel code that gets confused by it.
I suspect that it might be an effect of randomize shite done both to struct mount *AND* to struct vfsmount embedded into it. With pointers to embedded struct vfsmount kept around a lot, and container_of() used to get from them to corresponding struct mount.
That smells like a combination of idiocy that might have never occured to the authors of said gcc plugin.
On the other hand, triggered gcc bugs certainly do add randomness, so good luck explaining to the security community that it's not a good idea...
On 2017-11-09 12:04 PM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean chutzpah@gentoo.org wrote:
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
It might be worth just verifying without RANDSTRUCT in particular.
And most obviously: if there is some module or part of the kernel that got compiled with a different seed for the randstruct hashing, that will break in nasty nasty ways. Your out-of-kernel module is the obvious suspect for something like that, but honestly, it could be some missing build dependency, or simply a missing special case in the plugin itself a missing __no_randomize_layout or any number of things.
We will check our fork against the in-kernel cp201x driver to make sure we didn't miss anything, but it seems odd we would be hitting the issue so consistently in the NFS code path, rather than somewhere in USB, serial, or GPIO paths.
So since you seem to be able to reproduce this _reasonably_ easily, it's definitely worth checking that it still reproduces even without the gcc plugins.
I haven't been able to reproduce it with RANDSTRUCT disabled (and structleak enabled). I will keep trying for a little while more, but evidence seems to be pointing to that.
Something must have changed since 4.13.8 to trigger this though. This did not crop up at all until we tried 4.13.11, where it we saw it pretty quickly. We have a pretty large number of machines running 4.13.6 with RANDSTRUCT enabled and running a the same workload with many more clients, and have not seen this bug at all.
On Fri, Nov 10, 2017 at 2:58 AM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-09 12:04 PM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean chutzpah@gentoo.org wrote:
We will check our fork against the in-kernel cp201x driver to make sure we didn't miss anything, but it seems odd we would be hitting the issue so consistently in the NFS code path, rather than somewhere in USB, serial, or GPIO paths.
So since you seem to be able to reproduce this _reasonably_ easily, it's definitely worth checking that it still reproduces even without the gcc plugins.
I haven't been able to reproduce it with RANDSTRUCT disabled (and structleak enabled). I will keep trying for a little while more, but evidence seems to be pointing to that.
Something must have changed since 4.13.8 to trigger this though. This did not crop up at all until we tried 4.13.11, where it we saw it pretty quickly. We have a pretty large number of machines running 4.13.6 with RANDSTRUCT enabled and running a the same workload with many more clients, and have not seen this bug at all.
I couldn't find anything overly suspicious between 4.13.8 and 4.13.11, see the full list of commits since 3.14.6 at https://pastebin.com/AcxBZR7H
The ones I couldn't immediately rule out (but no smoking gun either) would be:
9970679f497a x86/cpu/AMD: Apply the Erratum 688 fix when the BIOS doesn't ca6711747c5a assoc_array: Fix a buggy node-splitting case 2fbb8bf749b5 xfs: move two more RT specific functions into CONFIG_XFS_RT 1e1427356d8d xfs: trim writepage mapping to within eof 9df9b634f637 xfs: cancel dirty pages on invalidation cd3f0bee1b94 xfs: handle error if xfs_btree_get_bufs fails 58cfca25f540 xfs: reinit btree pointer on attr tree inactivation walk 659a9989b68b xfs: don't change inode mode if ACL update fails 88ccd3b6884a xfs: move more RT specific code under CONFIG_XFS_RT 5733ebee586c xfs: Don't log uninitialised fields in inode structures 199a7448c097 xfs: handle racy AIO in xfs_reflink_end_cow ee5d69c908a1 xfs: always swap the cow forks when swapping extents 2888145444f1 xfs: Capture state of the right inode in xfs_iflush_done d0fa252b207f xfs: perag initialization should only touch m_ag_max_usable for AG 0 8da6f7fbe43c xfs: update i_size after unwritten conversion in dio completion a9eac76e958b xfs: report zeroed or not correctly in xfs_zero_range() 67d51bdcc9f4 fs/xfs: Use %pS printk format for direct addresses 2bf3122f2130 xfs: evict CoW fork extents when performing finsert/fcollapse a58a0826656d xfs: don't unconditionally clear the reflink flag on zero-block files c61e905e0ee2 iomap_dio_rw: Allocate AIO completion queue before submitting dio 7610595830bb pkcs7: Prevent NULL pointer dereference, since sinfo is not always set. 24a33a0c96f3 KEYS: don't let add_key() update an uninstantiated key ad4aa448c9b2 FS-Cache: fix dereference of NULL user_key_payload f45b8fe12221 KEYS: Fix race between updating and finding a negative key e56be12012c2 ecryptfs: fix dereference of NULL user_key_payload 363ce0b01fe0 fscrypt: fix dereference of NULL user_key_payload cc757d55c903 lib/digsig: fix dereference of NULL user_key_payload f5e97214207f x86/microcode/intel: Disable late loading on model 79 7b5e405b7878 Revert "tools/power turbostat: stop migrating, unless '-m'" 8b1e10789c84 KEYS: encrypted: fix dereference of NULL user_key_payload a258a35a9930 mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lock e47a56cbf519 usb: xhci: Handle error condition in xhci_stop_device() d53911e63388 usb: xhci: Reset halted endpoint if trb is noop d1120fe38b3f xhci: Cleanup current_cmd in xhci_cleanup_command_queue() 301d332138d2 xhci: Identify USB 3.1 capable hosts by their port protocol capability 015e94ead900 usb: hub: Allow reset retry for USB2 devices on connect bounce 1916547b28bd usb: quirks: add quirk for WORLDE MINI MIDI keyboard e3a038930502 usb: cdc_acm: Add quirk for Elatec TWN3 c2110c8dea7a USB: serial: metro-usb: add MS7820 device id 775462fd5c53 USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor() a9fdf6354267 USB: devio: Revert "USB: devio: Don't corrupt user memory"
However, you mentioned cp210x, and I noticed related changes in 4.13.8:
e21045a22395 USB: serial: console: fix use-after-free after failed setup 6c7cb458405e USB: serial: console: fix use-after-free on disconnect 4b3e3c7282d6 USB: serial: qcserial: add Dell DW5818, DW5819 c796da1d110f USB: serial: option: add support for TP-Link LTE module e7e0b4b39663 USB: serial: cp210x: add support for ELV TFD500 1ae2c690f967 USB: serial: cp210x: fix partnum regression 78a02c93648e USB: serial: ftdi_sio: add id for Cypress WICED dev board
You could try reverting those seven, this could point to your forked driver if it makes a difference.
Arnd
On Thu, Nov 9, 2017 at 5:58 PM, Patrick McLean chutzpah@gentoo.org wrote:
Something must have changed since 4.13.8 to trigger this though.
Well, yes and no.
Obviously something changed, but it doesn't necessarily have to be anything particular.
Almost every time we've seen compiler bugs, it's been an innocuous change that just happened to trigger a latent issue. Pretty much by definition compiler bugs tend to be about rare situations, so it's some odd special case that triggers.
Since it's apparently fairly repeatable for you, a bisection between 4.13.8 and 4.13.11 would be very interesting, and shouldn't take all that long. There's only 142 commits in that range, so even just a partial bisection of say four of five rounds should narrow it down to just a couple of commits. And even a full bisection should only take something like 8 build/test cycles.
Arnd pointed to some commits that might be relevant for the cp210x module, but those are all already in 4.13.8, so if 4.13.8 really is rock solid for you, I don't think that's it.
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
Linus
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 5:58 PM, Patrick McLean chutzpah@gentoo.org wrote:
Something must have changed since 4.13.8 to trigger this though.
Arnd pointed to some commits that might be relevant for the cp210x module, but those are all already in 4.13.8, so if 4.13.8 really is rock solid for you, I don't think that's it.
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories. Any given kernel will always see the same issue, but after a "make mrproper" and recompile (with the same .config), the issue will often change. I suspect that all 3 of these problems are actually the same issue manifesting itself in different ways depending on what seed the RANDSTRUCT gcc plugin is using.
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
On 2017-11-10 03:26 PM, Patrick McLean wrote:
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 5:58 PM, Patrick McLean chutzpah@gentoo.org wrote:
Something must have changed since 4.13.8 to trigger this though.
Arnd pointed to some commits that might be relevant for the cp210x module, but those are all already in 4.13.8, so if 4.13.8 really is rock solid for you, I don't think that's it.
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories. Any given kernel will always see the same issue, but after a "make mrproper" and recompile (with the same .config), the issue will often change. I suspect that all 3 of these problems are actually the same issue manifesting itself in different ways depending on what seed the RANDSTRUCT gcc plugin is using.
Further update on this, using the same seed for RANDSTRUCT, I have reproduced this issue on v4.13.0, so it does not seem to be recently introduced. The older kernel apparently only worked for us because we were lucky. Generally we always compile new kernels from a fresh tree, so they are never using the same seed.
In case someone wants to play with this, here are some interesting seeds (in include/generated/randomize_layout_hash.h):
Produce a NULL pointer dereference (though I am not sure what the client does to produce this). 5970d6494d0f4236ec57147a46e700f4f501536236d96f6f68ea223e06a258bc
All files for nfsd4 clients appear as directories owned as root, no matter the real owner (this happens for all clients we have tested): 3f158cd1014800ce5eb6c1f532ac64f2357fdb9a684096557d2fbb1d281f325e
This is the seed that was breaking motherboards (make sure you have a way to flash the BIOS with this one): 3e32f2d1b4a3dde9f2fd95151386cd1d5bd6167597a0b868f6273aabfc5712dd
Finally, here is a seed that produces a kernel that does not exhibit any problems we are aware of: e8698c12137fcd1dcbff6d1ed97e5d766128447a27ce9f9d61e0cb8c05ad4d3b
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
[ Bringing in the gcc plugin people and the kernel hardening list, since it now is no longer even remotely looking like a nfsd, vfs or filesystem issue any more ]
Kees, Emese, the whole thread is on lkml, but there's clearly something horribly wrong with RANDSTRUCT, and it's not new even though it looked that way for a while.
Patrick seems to trigger it with nfsd, so it might be specific to that.
Alternatively, it might just be that very few people run RANDSTRUCT-built kernels, or just have been lucky with the seeding.
Sorry for top-posting, but there's not really anything in the email itself to reply to, other than saying thanks to Patrick for narrowing it down like this.
It would have been very interesting if it had actually bisected to something, but it seems that the real issue is just the choice of seeding for RANDSTRUCT.
Linus
On Fri, Nov 10, 2017 at 4:27 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-10 03:26 PM, Patrick McLean wrote:
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories. Any given kernel will always see the same issue, but after a "make mrproper" and recompile (with the same .config), the issue will often change. I suspect that all 3 of these problems are actually the same issue manifesting itself in different ways depending on what seed the RANDSTRUCT gcc plugin is using.
Further update on this, using the same seed for RANDSTRUCT, I have reproduced this issue on v4.13.0, so it does not seem to be recently introduced. The older kernel apparently only worked for us because we were lucky. Generally we always compile new kernels from a fresh tree, so they are never using the same seed.
In case someone wants to play with this, here are some interesting seeds (in include/generated/randomize_layout_hash.h):
Produce a NULL pointer dereference (though I am not sure what the client does to produce this). 5970d6494d0f4236ec57147a46e700f4f501536236d96f6f68ea223e06a258bc
All files for nfsd4 clients appear as directories owned as root, no matter the real owner (this happens for all clients we have tested): 3f158cd1014800ce5eb6c1f532ac64f2357fdb9a684096557d2fbb1d281f325e
This is the seed that was breaking motherboards (make sure you have a way to flash the BIOS with this one): 3e32f2d1b4a3dde9f2fd95151386cd1d5bd6167597a0b868f6273aabfc5712dd
Finally, here is a seed that produces a kernel that does not exhibit any problems we are aware of: e8698c12137fcd1dcbff6d1ed97e5d766128447a27ce9f9d61e0cb8c05ad4d3b
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
On Fri, Nov 10, 2017 at 6:36 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
[ Bringing in the gcc plugin people and the kernel hardening list, since it now is no longer even remotely looking like a nfsd, vfs or filesystem issue any more ]
Kees, Emese, the whole thread is on lkml, but there's clearly something horribly wrong with RANDSTRUCT, and it's not new even though it looked that way for a while.
It wouldn't be the first issue we've seen; it's (obviously) a pretty aggressive change to the resulting build.
Patrick seems to trigger it with nfsd, so it might be specific to that.
Alternatively, it might just be that very few people run RANDSTRUCT-built kernels, or just have been lucky with the seeding.
Given its potential cache-line abuse, I'm not surprised that its usage is more limited than other features.
Sorry for top-posting, but there's not really anything in the email itself to reply to, other than saying thanks to Patrick for narrowing it down like this.
Agreed; thanks Patrick! :) Given that the issue is non-deterministic, I wonder if the bug is related to some kind of missing RCU or barrier that goes unnoticed in normal struct layouts.
It would have been very interesting if it had actually bisected to something, but it seems that the real issue is just the choice of seeding for RANDSTRUCT.
That's where we've seen bugs in the past: some pathological ordering of a struct uncovers a corner case. In the past it's been much more deterministic: doesn't build, or immediately crashes on boot, etc.
I'll take a closer look at this and see if I can provide something to narrow it down.
-Kees
Linus
On Fri, Nov 10, 2017 at 4:27 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-10 03:26 PM, Patrick McLean wrote:
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories. Any given kernel will always see the same issue, but after a "make mrproper" and recompile (with the same .config), the issue will often change. I suspect that all 3 of these problems are actually the same issue manifesting itself in different ways depending on what seed the RANDSTRUCT gcc plugin is using.
Further update on this, using the same seed for RANDSTRUCT, I have reproduced this issue on v4.13.0, so it does not seem to be recently introduced. The older kernel apparently only worked for us because we were lucky. Generally we always compile new kernels from a fresh tree, so they are never using the same seed.
In case someone wants to play with this, here are some interesting seeds (in include/generated/randomize_layout_hash.h):
Produce a NULL pointer dereference (though I am not sure what the client does to produce this). 5970d6494d0f4236ec57147a46e700f4f501536236d96f6f68ea223e06a258bc
All files for nfsd4 clients appear as directories owned as root, no matter the real owner (this happens for all clients we have tested): 3f158cd1014800ce5eb6c1f532ac64f2357fdb9a684096557d2fbb1d281f325e
This is the seed that was breaking motherboards (make sure you have a way to flash the BIOS with this one): 3e32f2d1b4a3dde9f2fd95151386cd1d5bd6167597a0b868f6273aabfc5712dd
Finally, here is a seed that produces a kernel that does not exhibit any problems we are aware of: e8698c12137fcd1dcbff6d1ed97e5d766128447a27ce9f9d61e0cb8c05ad4d3b
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
Linus
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
On 2017-11-11 09:31 AM, Linus Torvalds wrote:
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
I just tested it with gcc 7.2, and was able to reproduce the NULL pointer dereference, the backtrace looks slightly different this time.
I will also test with binutils 2.29, though I doubt that will make any difference.
[ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.166563] IP: vfs_statfs+0x7c/0xc0 [ 56.167249] PGD 0 P4D 0 [ 56.167860] Oops: 0000 [#1] SMP [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1 [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000 [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202 [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020 [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000 [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000 [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0 [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 56.193898] Call Trace: [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90 [ 56.195267] ? generic_permission+0x12c/0x1a0 [ 56.196025] nfsd4_encode_getattr+0x25/0x30 [ 56.196753] nfsd4_encode_operation+0x98/0x1b0 [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.198268] nfsd_dispatch+0xe8/0x220 [ 56.198968] svc_process_common+0x475/0x640 [ 56.199696] ? nfsd_destroy+0x60/0x60 [ 56.200404] svc_process+0xf2/0x1a0 [ 56.201079] nfsd+0xe3/0x150 [ 56.201706] kthread+0x117/0x130 [ 56.202354] ? kthread_create_on_node+0x40/0x40 [ 56.203100] ret_from_fork+0x25/0x30 [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28 [ 56.207110] CR2: 0000000000000560 [ 56.207763] ---[ end trace d452986a80f64aaa ]---
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
On Mon, Nov 13, 2017 at 2:48 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-11 09:31 AM, Linus Torvalds wrote:
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
I just tested it with gcc 7.2, and was able to reproduce the NULL pointer dereference, the backtrace looks slightly different this time.
I will also test with binutils 2.29, though I doubt that will make any difference.
[ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.166563] IP: vfs_statfs+0x7c/0xc0 [ 56.167249] PGD 0 P4D 0 [ 56.167860] Oops: 0000 [#1] SMP [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1 [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000 [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202 [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020 [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000 [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000 [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0 [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 56.193898] Call Trace: [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90 [ 56.195267] ? generic_permission+0x12c/0x1a0 [ 56.196025] nfsd4_encode_getattr+0x25/0x30 [ 56.196753] nfsd4_encode_operation+0x98/0x1b0 [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.198268] nfsd_dispatch+0xe8/0x220 [ 56.198968] svc_process_common+0x475/0x640 [ 56.199696] ? nfsd_destroy+0x60/0x60 [ 56.200404] svc_process+0xf2/0x1a0 [ 56.201079] nfsd+0xe3/0x150 [ 56.201706] kthread+0x117/0x130 [ 56.202354] ? kthread_create_on_node+0x40/0x40 [ 56.203100] ret_from_fork+0x25/0x30 [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28 [ 56.207110] CR2: 0000000000000560 [ 56.207763] ---[ end trace d452986a80f64aaa ]---
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
How reliable is this crash? The best idea I have to isolate it would be to bisect the additions of the __randomize_layout markings on various structures. I would start with the ones Al is most upset to see randomized. ;)
All that said, I'd like to better understand the BIOS side of this a little better. In the first email in this thread, you showed two BUGs separated by a little time, which implies to me that the NULL deref and the BIOS no longer POSTing are separate (though seemingly related) issues. Have you had machines survive the BUG without blowing up the BIOS?
I'm still trying to wrap my head around how the BIOS could be blowing up. I assume there's some magic memory address that is getting poked as a result of some struct randomization bug, so tracking that down should be possible assuming you can stand reflashing your BIOS across the bisects.
For the first step, I'd try a revert of 9225331b310821760f39ba55b00b8973602adbb5, which enables a large portion of struct randomization. If that doesn't change things, I can provide a series that reverts 3859a271a003aba01e45b85c9d8b355eb7bf25f9 and then re-applies __randomize_layout one structure per patch, and you could bisect that?
-Kees
On 2017-11-16 04:54 PM, Kees Cook wrote:
On Mon, Nov 13, 2017 at 2:48 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-11 09:31 AM, Linus Torvalds wrote:
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
I just tested it with gcc 7.2, and was able to reproduce the NULL pointer dereference, the backtrace looks slightly different this time.
I will also test with binutils 2.29, though I doubt that will make any difference.
[ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.166563] IP: vfs_statfs+0x7c/0xc0 [ 56.167249] PGD 0 P4D 0 [ 56.167860] Oops: 0000 [#1] SMP [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1 [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000 [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202 [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020 [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000 [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000 [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0 [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 56.193898] Call Trace: [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90 [ 56.195267] ? generic_permission+0x12c/0x1a0 [ 56.196025] nfsd4_encode_getattr+0x25/0x30 [ 56.196753] nfsd4_encode_operation+0x98/0x1b0 [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.198268] nfsd_dispatch+0xe8/0x220 [ 56.198968] svc_process_common+0x475/0x640 [ 56.199696] ? nfsd_destroy+0x60/0x60 [ 56.200404] svc_process+0xf2/0x1a0 [ 56.201079] nfsd+0xe3/0x150 [ 56.201706] kthread+0x117/0x130 [ 56.202354] ? kthread_create_on_node+0x40/0x40 [ 56.203100] ret_from_fork+0x25/0x30 [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28 [ 56.207110] CR2: 0000000000000560 [ 56.207763] ---[ end trace d452986a80f64aaa ]---
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
How reliable is this crash? The best idea I have to isolate it would be to bisect the additions of the __randomize_layout markings on various structures. I would start with the ones Al is most upset to see randomized. ;)
It's pretty reliable, once I get a bad seed I can reproduce the crash pretty quickly.
All that said, I'd like to better understand the BIOS side of this a little better. In the first email in this thread, you showed two BUGs separated by a little time, which implies to me that the NULL deref and the BIOS no longer POSTing are separate (though seemingly related) issues. Have you had machines survive the BUG without blowing up the BIOS?
We had 3 machines die due to the BIOS issue (all of them pretty quickly with the bad-seed kernel). All the dead machines had the same motherboard model. I have not managed to reproduce the issue again on the machine I restored via the IPMI interface, I suspect that it may be a bug in the BIOS that was fixed in a more recent version.
I'm still trying to wrap my head around how the BIOS could be blowing up. I assume there's some magic memory address that is getting poked as a result of some struct randomization bug, so tracking that down should be possible assuming you can stand reflashing your BIOS across the bisects.
That is our theory, some magic memory address that caused an overwrite of the flash where the BIOS code is stored. We are working under the assumption that it was fixed in a more recent BIOS update, since I have not managed to reproduce the issue on the resurrected machine.
For the first step, I'd try a revert of 9225331b310821760f39ba55b00b8973602adbb5, which enables a large portion of struct randomization. If that doesn't change things, I can provide a series that reverts 3859a271a003aba01e45b85c9d8b355eb7bf25f9 and then re-applies __randomize_layout one structure per patch, and you could bisect that?
Sure, I can bisect that.
On Fri, Nov 17, 2017 at 11:03 AM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-16 04:54 PM, Kees Cook wrote:
On Mon, Nov 13, 2017 at 2:48 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-11 09:31 AM, Linus Torvalds wrote:
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
I just tested it with gcc 7.2, and was able to reproduce the NULL pointer dereference, the backtrace looks slightly different this time.
I will also test with binutils 2.29, though I doubt that will make any difference.
[ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.166563] IP: vfs_statfs+0x7c/0xc0 [ 56.167249] PGD 0 P4D 0 [ 56.167860] Oops: 0000 [#1] SMP [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1 [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000 [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202 [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020 [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000 [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000 [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0 [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 56.193898] Call Trace: [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90 [ 56.195267] ? generic_permission+0x12c/0x1a0 [ 56.196025] nfsd4_encode_getattr+0x25/0x30 [ 56.196753] nfsd4_encode_operation+0x98/0x1b0 [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.198268] nfsd_dispatch+0xe8/0x220 [ 56.198968] svc_process_common+0x475/0x640 [ 56.199696] ? nfsd_destroy+0x60/0x60 [ 56.200404] svc_process+0xf2/0x1a0 [ 56.201079] nfsd+0xe3/0x150 [ 56.201706] kthread+0x117/0x130 [ 56.202354] ? kthread_create_on_node+0x40/0x40 [ 56.203100] ret_from_fork+0x25/0x30 [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28 [ 56.207110] CR2: 0000000000000560 [ 56.207763] ---[ end trace d452986a80f64aaa ]---
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
How reliable is this crash? The best idea I have to isolate it would be to bisect the additions of the __randomize_layout markings on various structures. I would start with the ones Al is most upset to see randomized. ;)
It's pretty reliable, once I get a bad seed I can reproduce the crash pretty quickly.
All that said, I'd like to better understand the BIOS side of this a little better. In the first email in this thread, you showed two BUGs separated by a little time, which implies to me that the NULL deref and the BIOS no longer POSTing are separate (though seemingly related) issues. Have you had machines survive the BUG without blowing up the BIOS?
We had 3 machines die due to the BIOS issue (all of them pretty quickly with the bad-seed kernel). All the dead machines had the same motherboard model. I have not managed to reproduce the issue again on the machine I restored via the IPMI interface, I suspect that it may be a bug in the BIOS that was fixed in a more recent version.
I'm still trying to wrap my head around how the BIOS could be blowing up. I assume there's some magic memory address that is getting poked as a result of some struct randomization bug, so tracking that down should be possible assuming you can stand reflashing your BIOS across the bisects.
That is our theory, some magic memory address that caused an overwrite of the flash where the BIOS code is stored. We are working under the assumption that it was fixed in a more recent BIOS update, since I have not managed to reproduce the issue on the resurrected machine.
Okay, well that's certainly better than having to reflash at every bisection step! :)
For the first step, I'd try a revert of 9225331b310821760f39ba55b00b8973602adbb5, which enables a large portion of struct randomization. If that doesn't change things, I can provide a series that reverts 3859a271a003aba01e45b85c9d8b355eb7bf25f9 and then re-applies __randomize_layout one structure per patch, and you could bisect that?
Sure, I can bisect that.
Okay, that should at least let us know if this is a specific struct that is not expecting to get randomized, or if there is some deeper flaw. Here's the tree, based on 4.14: https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/r...
With commit d9e12200852d, all randomization selections are reverted. I would expect this to be a "good" kernel for the bisect.
The very end of the series (commit d893c17b3146), everything is back to being randomized. I would expect this to be a "bad" kernel.
Each step between those two commits adds randomization to a single struct (with the filesystem stuff near the front).
Here's hoping it'll be something obvious. :) Thanks for taking the time to debug this!
-Kees
On 2017-11-17 01:26 PM, Kees Cook wrote:
On Fri, Nov 17, 2017 at 11:03 AM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-16 04:54 PM, Kees Cook wrote:
On Mon, Nov 13, 2017 at 2:48 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-11 09:31 AM, Linus Torvalds wrote:
Boris Lukashev points out that Patrick should probably check a newer version of gcc.
I looked around, and in one of the emails, Patrick said:
"No changes, both the working and broken kernels were built with distro-provided gcc 5.4.0 and binutils 2.28.1"
and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but it's a bug-fix release to a pretty old branch that is not exactly new.
It would probably be good to check if the problems persist with gcc 6.x or 7.x.. I have no idea which gcc version the randstruct people tend to use themselves.
I just tested it with gcc 7.2, and was able to reproduce the NULL pointer dereference, the backtrace looks slightly different this time.
I will also test with binutils 2.29, though I doubt that will make any difference.
[ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.166563] IP: vfs_statfs+0x7c/0xc0 [ 56.167249] PGD 0 P4D 0 [ 56.167860] Oops: 0000 [#1] SMP [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1 [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000 [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202 [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020 [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000 [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000 [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0 [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0 [ 56.193898] Call Trace: [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90 [ 56.195267] ? generic_permission+0x12c/0x1a0 [ 56.196025] nfsd4_encode_getattr+0x25/0x30 [ 56.196753] nfsd4_encode_operation+0x98/0x1b0 [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.198268] nfsd_dispatch+0xe8/0x220 [ 56.198968] svc_process_common+0x475/0x640 [ 56.199696] ? nfsd_destroy+0x60/0x60 [ 56.200404] svc_process+0xf2/0x1a0 [ 56.201079] nfsd+0xe3/0x150 [ 56.201706] kthread+0x117/0x130 [ 56.202354] ? kthread_create_on_node+0x40/0x40 [ 56.203100] ret_from_fork+0x25/0x30 [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28 [ 56.207110] CR2: 0000000000000560 [ 56.207763] ---[ end trace d452986a80f64aaa ]---
On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook keescook@chromium.org wrote:
I'll take a closer look at this and see if I can provide something to narrow it down.
How reliable is this crash? The best idea I have to isolate it would be to bisect the additions of the __randomize_layout markings on various structures. I would start with the ones Al is most upset to see randomized. ;)
It's pretty reliable, once I get a bad seed I can reproduce the crash pretty quickly.
For the first step, I'd try a revert of 9225331b310821760f39ba55b00b8973602adbb5, which enables a large portion of struct randomization. If that doesn't change things, I can provide a series that reverts 3859a271a003aba01e45b85c9d8b355eb7bf25f9 and then re-applies __randomize_layout one structure per patch, and you could bisect that?
Sure, I can bisect that.
Okay, that should at least let us know if this is a specific struct that is not expecting to get randomized, or if there is some deeper flaw. Here's the tree, based on 4.14: https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/r...
With commit d9e12200852d, all randomization selections are reverted. I would expect this to be a "good" kernel for the bisect.
I am still getting the crash at d9e12200852d, I figured I would double-check the "good" and "bad" kernels before starting a full bisect.
I guess it must be something somewhere else? I am happy to test or bisect more patches.
Here is the BUG message for reference:
[ 56.495987] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560 [ 56.497404] IP: vfs_statfs+0x7c/0xc0 [ 56.498092] PGD 0 P4D 0 [ 56.498716] Oops: 0000 [#1] SMP [ 56.499366] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat gkuart(O) usbserial x86_pkg_temp_thermal tpm_tis ipmi_ssif tpm_tis_core ie31200_edac ext4 mbcache jbd2 e1000e crc32c_intel [ 56.502653] CPU: 0 PID: 3975 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1-00061-gd893c17b3146 #3 [ 56.504071] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 56.504957] task: ffff88040cba7000 task.stack: ffffc90002c08000 [ 56.505843] RIP: 0010:vfs_statfs+0x7c/0xc0 [ 56.506571] RSP: 0018:ffffc90002c0bb28 EFLAGS: 00010202 [ 56.507383] RAX: 0000000000000000 RBX: ffffc90002c0bbf0 RCX: 0000000000000020 [ 56.508354] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 56.509545] RBP: ffffc90002c0bbc0 R08: ffffffffffffff00 R09: 00000000000000ff [ 56.510622] R10: 000000000038be3a R11: ffff8804087563e8 R12: 0000000000000000 [ 56.511693] R13: ffff88040c68d000 R14: ffff88040c4df000 R15: ffffc90002c0bbf0 [ 56.512764] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000 [ 56.514216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 56.515199] CR2: 0000000000000560 CR3: 0000000001e0a005 CR4: 00000000001606f0 [ 56.516268] Call Trace: [ 56.516903] nfsd4_encode_fattr+0x201/0x1f90 [ 56.517686] ? generic_permission+0x12c/0x1a0 [ 56.518467] nfsd4_encode_getattr+0x25/0x30 [ 56.519220] nfsd4_encode_operation+0x98/0x1b0 [ 56.519991] nfsd4_proc_compound+0x2a0/0x5e0 [ 56.520758] nfsd_dispatch+0xe8/0x220 [ 56.521476] svc_process_common+0x475/0x640 [ 56.522221] ? nfsd_destroy+0x60/0x60 [ 56.522923] svc_process+0xf2/0x1a0 [ 56.523611] nfsd+0xe3/0x150 [ 56.524241] kthread+0x117/0x130 [ 56.524896] ? kthread_create_on_node+0x40/0x40 [ 56.525630] ret_from_fork+0x25/0x30 [ 56.526306] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce 00 10 00 00 83 e1 20 0f 45 d6 <48> 8b b7 60 05 00 00 bf 10 00 00 00 83 ca 20 89 f1 83 e1 10 0f [ 56.528885] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c0bb28 [ 56.529772] CR2: 0000000000000560 [ 56.530464] ---[ end trace e6cf48f1f8c0ee4e ]---
The very end of the series (commit d893c17b3146), everything is back to being randomized. I would expect this to be a "bad" kernel.
Each step between those two commits adds randomization to a single struct (with the filesystem stuff near the front).
Here's hoping it'll be something obvious. :) Thanks for taking the time to debug this!
On Fri, Nov 17, 2017 at 4:27 PM, Patrick McLean chutzpah@gentoo.org wrote:
I am still getting the crash at d9e12200852d, I figured I would double-check the "good" and "bad" kernels before starting a full bisect.
.. but without GCC_PLUGIN_RANDSTRUCT it's solid?
Kees removed even the baseline "randomize pure function pointer structures", so at that commit, nothing should be randomized.
But maybe the plugin code itself ends up confusing gcc somehow?
Even when it doesn't actually do that "relayout_struct()" on the structure, it always does those TYPE_ATTRIBUTES() games.
Linus
On 2017-11-17 04:55 PM, Linus Torvalds wrote:
On Fri, Nov 17, 2017 at 4:27 PM, Patrick McLean chutzpah@gentoo.org wrote:
I am still getting the crash at d9e12200852d, I figured I would double-check the "good" and "bad" kernels before starting a full bisect.
.. but without GCC_PLUGIN_RANDSTRUCT it's solid?
Yes, without GCC_PLUGIN_RANDSTRUCT it's solid.
Kees removed even the baseline "randomize pure function pointer structures", so at that commit, nothing should be randomized.
But maybe the plugin code itself ends up confusing gcc somehow?
Even when it doesn't actually do that "relayout_struct()" on the structure, it always does those TYPE_ATTRIBUTES() games.
On Fri, Nov 17, 2017 at 5:54 PM, Patrick McLean chutzpah@gentoo.org wrote:
On 2017-11-17 04:55 PM, Linus Torvalds wrote:
On Fri, Nov 17, 2017 at 4:27 PM, Patrick McLean chutzpah@gentoo.org wrote:
I am still getting the crash at d9e12200852d, I figured I would double-check the "good" and "bad" kernels before starting a full bisect.
.. but without GCC_PLUGIN_RANDSTRUCT it's solid?
Yes, without GCC_PLUGIN_RANDSTRUCT it's solid.
That's strange. With d9e12200852d the shuffle_seed variables won't ever actually get used. (i.e. I wouldn't expect the seed to change any behavior.)
Can you confirm with something like this:
diff --git a/scripts/gcc-plugins/randomize_layout_plugin.c b/scripts/gcc-plugins/randomize_layout_plugin.c index cdaac8c66734..aac570a57d7d 100644 --- a/scripts/gcc-plugins/randomize_layout_plugin.c +++ b/scripts/gcc-plugins/randomize_layout_plugin.c @@ -267,12 +267,10 @@ static void shuffle(const_tree type, tree *newtree, unsigned long length)
structname = ORIG_TYPE_NAME(type);
-#ifdef __DEBUG_PLUGIN fprintf(stderr, "Shuffling struct %s %p\n", (const char *)structname, type); #ifdef __DEBUG_VERBOSE debug_tree((tree)type); #endif -#endif
for (i = 0; i < 4; i++) { seed[i] = shuffle_seed[i];
You should see no reports of "Shuffling struct ..."
And if it reports nothing, and you're on d9e12200852d, can you confirm that switching to a "good" seed fixes it? (If it _does_, then I suspect a build artifact being left behind or something odd like that.)
Kees removed even the baseline "randomize pure function pointer structures", so at that commit, nothing should be randomized.
But maybe the plugin code itself ends up confusing gcc somehow?
Even when it doesn't actually do that "relayout_struct()" on the structure, it always does those TYPE_ATTRIBUTES() games.
FWIW, myself doing a build at d9e12200852d with and without GCC_PLUGIN_RANDSTRUCT _appears_ to produce identical objdump output where I did spot-checks.
Also, do you have any other plugins enabled? (Can you send your .config?)
-Kees
On Fri, Nov 17, 2017 at 9:14 PM, Kees Cook keescook@chromium.org wrote:
FWIW, myself doing a build at d9e12200852d with and without GCC_PLUGIN_RANDSTRUCT _appears_ to produce identical objdump output where I did spot-checks.
That would probably be a good thing to check anyway - check the difference between GCC_PLUGIN_RANDSTRUCT on and off at that commit.
Just do
objdump --disassemble vmlinux > file
and compare the two files for where the differences start occurring.
Linus
On Fri, Nov 17, 2017 at 9:29 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Nov 17, 2017 at 9:14 PM, Kees Cook keescook@chromium.org wrote:
FWIW, myself doing a build at d9e12200852d with and without GCC_PLUGIN_RANDSTRUCT _appears_ to produce identical objdump output where I did spot-checks.
That would probably be a good thing to check anyway - check the difference between GCC_PLUGIN_RANDSTRUCT on and off at that commit.
Just do
objdump --disassemble vmlinux > file
and compare the two files for where the differences start occurring.
Yeah, I was just doing that now. Looks like there _is_ something getting changed just from having the plugin enabled, but it appears localized. For me, the first non-offset change happens in lookup_user_key and persists for a while.
-ffffffff813893a7: 0f 85 55 03 00 00 jne ffffffff81389702 <lookup_user_key+0x3f2> -ffffffff813893ad: f0 41 ff 06 lock incl (%r14) -ffffffff813893b1: 83 fb 07 cmp $0x7,%ebx -ffffffff813893b4: 4c 89 b5 70 ff ff ff mov %r14,-0x90(%rbp) ... +ffffffff813893a7: 0f 85 35 03 00 00 jne ffffffff813896e2 <lookup_user_key+0x3d2> +ffffffff813893ad: 4d 89 f0 mov %r14,%r8 +ffffffff813893b0: f0 41 ff 06 lock incl (%r14) +ffffffff813893b4: 83 fb 07 cmp $0x7,%ebx +ffffffff813893b7: 4c 89 b5 70 ff ff ff mov %r14,-0x90(%rbp)
And removing the TYPE_ATTRIBUTES() poking makes the register storage differences go away, but there's still a 0x40 byte offset delta.
I'll continue looking at this tomorrow.
-Kees
On Fri, Nov 10, 2017 at 03:26:27PM -0800, Patrick McLean wrote:
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 5:58 PM, Patrick McLean chutzpah@gentoo.org wrote:
Something must have changed since 4.13.8 to trigger this though.
Arnd pointed to some commits that might be relevant for the cp210x module, but those are all already in 4.13.8, so if 4.13.8 really is rock solid for you, I don't think that's it.
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories.
That suggests that stat.uid is 0 and stat.mode & S_IFMT is 0040000 in the stat structure that nfsd passed to vfs_getattr().
No idea what sort of information is useful when tracking down this kind of bug, but you could also run wireshark and take a look at the server's GETATTR replies to see if there's some other corruption.
--b.
Any given kernel will always see the same issue, but after a "make mrproper" and recompile (with the same .config), the issue will often change. I suspect that all 3 of these problems are actually the same issue manifesting itself in different ways depending on what seed the RANDSTRUCT gcc plugin is using.
Because in the end, compiler bugs are very rare. They are particularly annoying when they do happen, though, so they loom big in the mind of people who have had to chase them down.
On Fri, Nov 10, 2017 at 08:13:06PM -0500, J. Bruce Fields wrote:
On Fri, Nov 10, 2017 at 03:26:27PM -0800, Patrick McLean wrote:
On 2017-11-10 10:42 AM, Linus Torvalds wrote:
On Thu, Nov 9, 2017 at 5:58 PM, Patrick McLean chutzpah@gentoo.org wrote:
Something must have changed since 4.13.8 to trigger this though.
Arnd pointed to some commits that might be relevant for the cp210x module, but those are all already in 4.13.8, so if 4.13.8 really is rock solid for you, I don't think that's it.
I really don't see anything that looks even half-way suspicious in that 4.13.8..11 range. But as mentioned, compiler interactions can be _really_ subtle.
And hey, it can be a real kernel bug too, that just happens to be exposed by RANDSTRUCT, so a bisect really would be very nice.
I am working on bisecting the issue now, but I think I have some more evidence pointing to a compiler issue related to RANDSTRUCT. There are actually 3 issues that we have seen. Sometimes we get the null pointer deref in the initial message, sometimes we get the GPF, and sometimes we see an issue where the NFS clients see all files as root-owned directories.
That suggests that stat.uid is 0 and stat.mode & S_IFMT is 0040000 in the stat structure that nfsd passed to vfs_getattr().
No idea what sort of information is useful when tracking down this kind of bug, but you could also run wireshark and take a look at the server's GETATTR replies to see if there's some other corruption.
FWIW, having looked at some of the __bugger_layout users... Compiler bugs aside, * use in struct {dentry,inode,mount,block_device} has to go - cache use patterns at hash lookups are _not_ something to play with like that. * struct file_lock and struct super_block - ditto, only it's not hash lookups that hurt here. struct vm_area_struct, while we are at it. * struct group_info - Cthulhu's pus-leaking warts, what's the point randomizing _that_? No, really - here's the damn thing in all its glory: struct group_info { atomic_t usage; int ngroups; kgid_t gid[0]; } __randomize_layout; I really hope that plugin does *not* try to move the ->gid[] anywhere... Which leaves us a choice between putting ->usage first or second. Sure, every bit helps, but... even for security theatre that looks a bit too pathetic. * struct vfsmount. Wow. All of log2(3!) bits. Congratulations. At least that's better than struct path. Oh, wait - they'd done struct path as well...
What the hell had they been doing? Muscarine old-fashioned way? Looks like a mix of pointless and truly dangerous. And then there are compiler bugs and the charming effect on reproducibility...
On 2017-11-09 11:51 AM, Patrick McLean wrote:
On 2017-11-09 11:37 AM, Al Viro wrote:
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Here is the BUG we are getting:
[ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [ 58.963918] IP: vfs_statfs+0x73/0xb0
The code disassembles to
2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction
that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
All instances of struct dentry are created by __d_alloc()[*], which assigns ->d_sb (never to be modified afterwards) *and* dereferences the pointer it has stored in ->d_sb before the created struct dentry becomes visible to anyone else. No struct dentry should ever be observed with NULL ->d_sb; the only way to get that is memory corruption or looking at freed instance after its memory has been reused for something else and zeroed.
In other words, we should never observe a struct mount with NULL ->mnt.mnt_sb - not without memory corruption or looking at freed instance.
The pointer in that case should've come from exp->ex_path.mnt, exp being the argument of nfsd4_encode_fattr(). Sure, it might have been a dangling reference. However, it looks a lot more like a memory corruptor *OR* miscompiled kernel.
What kind of load do the reproducer boxen have and how fast does that bug trigger? Would it be possible to slap something like if (unlikely(!exp->exp_path.mnt->mnt_sb)) { struct mount *m = real_mount(exp->exp_path.mnt); printk(KERN_ERR "mnt: %p\n", exp->exp_path.mnt); printk(KERN_ERR "name: [%s]\n", m->mnt_devname); printk(KERN_ERR "ns: [%p]\n", m->mnt_ns); printk(KERN_ERR "parent: [%p]\n", m->mnt_parent); WARN_ON(1); err = -EINVAL; goto out_nfserr; } in the beginning of nfsd4_encode_fattr() (with include of ../mount.h added in fs/nfsd/nfs4xdr.c) and see what will it catch?
Both with and without randomized structs, if possible - I might be barking at the wrong tree, but IMO the very first step in localizing that crap is to find out whether it's toolchain-related or not.
That condition did not seem to trigger, and I am getting a slightly different crash message (GPF rather than null pointer dereference). Here is the dump from the latest crash (with CONFIG_GCC_PLUGIN_STRUCTLEAK, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL and CONFIG_GCC_PLUGIN_RANDSTRUCT all enabled).
[ 36.834232] general protection fault: 0000 [#1] SMP [ 36.835168] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat gkuart(O) usbserial x86_pkg_temp_thermal ie31200_edac tpm_tis ipmi_ssif tpm_tis_core ext4 mbcache jbd2 e1000e crc32c_intel [ 36.839120] CPU: 1 PID: 3969 Comm: nfsd Tainted: G O 4.14.0-rc8-git-kratos-1-00053-gd93d4ce103fd-dirty #1 [ 36.840883] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013 [ 36.841892] task: ffff88040a0b1c80 task.stack: ffffc900027bc000 [ 36.842887] RIP: 0010:vfs_statfs+0x73/0xb0 [ 36.843728] RSP: 0018:ffffc900027bfb30 EFLAGS: 00010202 [ 36.844687] RAX: 0000000000000000 RBX: ffffc900027bfbf8 RCX: 000000000000180d [ 36.845891] RDX: 000000000000080d RSI: 0000000000000020 RDI: e2006d6574737973 [ 36.847075] RBP: ffffc900027bfbc8 R08: 0000000000000000 R09: 00000000000000ff [ 36.848175] R10: 000000000038be3a R11: ffff88040b687578 R12: 0000000000000000 [ 36.849260] R13: ffff88040d7dc400 R14: ffff88040d38b000 R15: ffffc900027bfbf8 [ 36.850347] FS: 0000000000000000(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000 [ 36.851891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.852873] CR2: 00007f049228edc0 CR3: 0000000001e0a004 CR4: 00000000001606e0 [ 36.853942] Call Trace: [ 36.854667] nfsd4_encode_fattr+0x34e/0x23b0 [ 36.855578] ? ext4_get_acl+0x1b2/0x260 [ext4] [ 36.856485] ? get_acl+0x7a/0xf0 [ 36.857266] ? generic_permission+0x125/0x1a0 [ 36.858150] nfsd4_encode_getattr+0x25/0x30 [ 36.859002] nfsd4_encode_operation+0x98/0x1a0 [ 36.859889] nfsd4_proc_compound+0x3eb/0x5c0 [ 36.860736] nfsd_dispatch+0xa8/0x230 [ 36.861538] svc_process_common+0x347/0x640 [ 36.862383] svc_process+0x100/0x1b0 [ 36.863204] nfsd+0xe0/0x150 [ 36.863984] kthread+0xfc/0x130 [ 36.864781] ? nfsd_destroy+0x50/0x50 [ 36.865624] ? kthread_create_on_node+0x40/0x40 [ 36.866529] ? do_group_exit+0x3a/0xb0 [ 36.867362] ret_from_fork+0x25/0x30 [ 36.868188] Code: d1 83 c9 08 40 f6 c6 04 0f 45 d1 89 d1 80 cd 04 40 f6 c6 08 0f 45 d1 89 d1 80 cd 08 40 f6 c6 10 0f 45 d1 89 d1 80 cd 10 83 e6 20 <48> 8b b7 b0 05 00 00 0f 45 d1 83 ca 20 89 f1 83 e1 10 89 cf 83 [ 36.871101] RIP: vfs_statfs+0x73/0xb0 RSP: ffffc900027bfb30 [ 36.872059] ---[ end trace 603ac898c4e2d616 ]---
I haven't been able to reproduce it with CONFIG_GCC_PLUGIN_RANDSTRUCT disabled, so it seems like it must be a bug there. It's odd that it just surfaced recently though, we have been using that since it was added.
The reproducer boxen are not under particularly heavy load, they are serving NFS to 1 or 2 clients (which are essentially embedded devices). When the bug triggers, it usually triggers pretty fast and reliably, but it seems to only trigger on some subset of bootups. Once it fails to trigger, we seem to have to reboot to get it to trigger.
I should be able to have some results with that added in a few hours. It's weirdly unreliable to reproduce this.
We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Anyway, that cmovne noise makes it a bit hard to see the actual part that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
Now, afaik, mnt->mnt_sb should never be NULL in the first place for a proper path. And the vfs_statfs() code itself hasn't changed in a while.
Which does seem to implicate nfsd as having passed in a bad path to vfs_statfs(). But I'm not seeing any changes in nfsd either.
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
But bringing in Al Viro and Bruce Fields explicitly in case they see something. And Darrick, just in case it might be xfs.
Looking at https://lkml.org/lkml/2017/11/8/1086 for the actual oops...
It doesn't remind me of any known issue.
And I don't see how we can call vfs_statfs() with a bad path: nfsd4_encode_getattr would have to have been called with nfserr 0 and ga_fhp->fh_export bad.
Looking at nfsd4_proc_compound, I can't see how we could get there in the op->status == 0 case without the fh_verify() in nfsd4_getattr having succeeded and assigned the result to ga_fhp.
So either I'm overlooking something or the bug's elsewhere.
It sounds like you're varying *only* the server version, so there's not much chance that this could be triggered by changes in client behavior?
--b.
On 2017-11-09 12:47 PM, J. Bruce Fields wrote:
On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
Anyway, that cmovne noise makes it a bit hard to see the actual part that matters (and that traps) but I'm almost certain that it's the "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags() when it then does
flags_by_sb(mnt->mnt_sb->s_flags);
and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is NULL, because we wouldn't have gotten this far if it was.
Now, afaik, mnt->mnt_sb should never be NULL in the first place for a proper path. And the vfs_statfs() code itself hasn't changed in a while.
Which does seem to implicate nfsd as having passed in a bad path to vfs_statfs(). But I'm not seeing any changes in nfsd either.
In particular, there are *no* nfsd changes in that 4.13.8..4.13.11 range. There is a bunch of xfs changes, though. What's the underlying filesystem that you are exporting?
But bringing in Al Viro and Bruce Fields explicitly in case they see something. And Darrick, just in case it might be xfs.
Looking at https://lkml.org/lkml/2017/11/8/1086 for the actual oops...
It doesn't remind me of any known issue.
So either I'm overlooking something or the bug's elsewhere.
It sounds like you're varying *only* the server version, so there's not much chance that this could be triggered by changes in client behavior?
We are definitely only varying the kernel on the server, nothing on the client side is changing. The client in this case is essentially an embedded device that we do not have a whole lot of control over.
On Thu, Nov 09 2017, Linus Torvalds torvalds@linux-foundation.org wrote:
The code disassembles to
0: 83 c9 08 or $0x8,%ecx 3: 40 f6 c6 04 test $0x4,%sil 7: 0f 45 d1 cmovne %ecx,%edx a: 89 d1 mov %edx,%ecx c: 80 cd 04 or $0x4,%ch f: 40 f6 c6 08 test $0x8,%sil 13: 0f 45 d1 cmovne %ecx,%edx 16: 89 d1 mov %edx,%ecx 18: 80 cd 08 or $0x8,%ch 1b: 40 f6 c6 10 test $0x10,%sil 1f: 0f 45 d1 cmovne %ecx,%edx 22: 89 d1 mov %edx,%ecx 24: 80 cd 10 or $0x10,%ch 27: 83 e6 20 and $0x20,%esi 2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction 31: 0f 45 d1 cmovne %ecx,%edx 34: 83 ca 20 or $0x20,%edx 37: 89 f1 mov %esi,%ecx 39: 83 e1 10 and $0x10,%ecx 3c: 89 cf mov %ecx,%edi
and all those odd cmovne and bit-ops are just the bit selection code in flags_by_mnt(), which is inlined through calculate_f_flags (which is _also_ inlined) into vfs_statfs().
Sadly, gcc makes a mess of it and actually generates code that looks like the original C. I would have hoped that gcc could have turned
if (x & BIT) y |= OTHER_BIT;
into
y |= (x & BIT) shifted-by-the-bit-difference-between BIT/OTHER_BIT;
but that doesn't happen.
Actually, new enough gcc (7.1, I think) does contain a pattern that does this, but unfortunately only if one spells it
y |= (x & BIT) ? OTHER_BIT : 0;
which is half-way to doing it by hand, I suppose. Doing the
- if (mnt_flags & MNT_READONLY) - flags |= ST_RDONLY; + flags |= (mnt_flags & MNT_READONLY) ? ST_RDONLY : 0;
and pasting into godbolt.org, one can apparently get gcc to compile it to
flags_by_mnt(int): leal (%rdi,%rdi), %edx movl %edi, %eax sarl $6, %eax movl %edx, %ecx andl $1, %eax andl $12, %edx andl $2, %ecx orl %ecx, %eax orl %eax, %edx movl %edi, %eax sall $7, %eax andl $7168, %eax orl %edx, %eax ret
Rasmus
On Mon, Nov 13, 2017 at 2:59 PM, Rasmus Villemoes linux@rasmusvillemoes.dk wrote:
Sadly, gcc makes a mess of it and actually generates code that looks like the original C.[...]
Actually, new enough gcc (7.1, I think) does contain a pattern that does this, but unfortunately only if one spells it
y |= (x & BIT) ? OTHER_BIT : 0;
Ahh, I should have recognized that, I think that's what we ended up doing with the VM_READ -> PROT_READ translation in a few places, exactly because gcc would then recognize it and do the much better code generation.
which is half-way to doing it by hand, I suppose.
Yeah, but it is at least acceptable, and the code is still legible C. The alternatives of doing it _entirely_ by hand tend to be much worse (ie you end up using a macro from hell that checks which of the two bits are bigger and shifting in the right direction by using multiplication or division).
So let's just rewrite that mnt_flags conversion that way, justr to get gcc to generate the obvious code.
It's a bit sad how gcc didn't pick up on the original code, especially since it had already done the much more complicated translation of doing the if-conversion.
Thanks for pointing out the gcc pattern.
Linus
On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
So let's just rewrite that mnt_flags conversion that way, justr to get gcc to generate the obvious code.
Oh wow. I tried to do the same thing in fs/namespace.c where it does the reverse bit translation, and gcc makes a _horrible_ mess of it and actually makes the code much worse because for some reason the pattern doesn't trigger.
So this gcc optimization is apparently pretty damn fragile in general. It triggers for the trivial cases, but then other code around it can confuse it badly.
So I don't think I'll touch this, it seems to not be really reliably something that makes gcc generate what should be the obvious code..
Linus
On Wed, 8 Nov 2017 16:43:17 -0800 Patrick McLean chutzpah@gentoo.org wrote:
As of 4.13.11 (and also with 4.14-rc) we have an issue where when serving nfs4 sometimes we get the following BUG. When this bug happens, it usually also causes the motherboard to no longer POST until we externally re-flash the BIOS (using the BMC web interface). If a motherboard does not have an external way to flash the BIOS, this would brick the hardware.
If that is a production x86 system then you need to raise a large red flag with the vendor because it should not even be possible to splat the BIOS firmware on a modern PC by running even malicious OS code.
Not only that but if it has a flaw, and you bisect down to create a reproducer then it's not going to take the bad guys very long to turn it into an interesting toy to run if they ever exploit a box with that board.
Alan
linux-stable-mirror@lists.linaro.org