I run a qemu/kvm VM with debian and I've started getting segfaults and failing checksums on downloaded files. The failures are nondeterministic and similar to the failures you get with bad RAM. I tried to diagnose the problem with various testing tools and found that "stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors, usually within 60 sec:
stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
Nothing relevant has changed recently in the VM, but the host kernel was upgraded from 4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There is only one KVM-related change in that range, so I tried reverting it.
By reverting commit 4124a4cff344abbf8187775eb643d9827830e715 "x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce the stress-ng error and I have no segfault or other problems with the guest.
The commit was originally introduced in v4.15-rc3 (Nov 14 2017) and was only recently backported to 4.14. The stable kernels before 4.14 didn't get the backport, so this looks like a broken 4.14 backport. The backport also causes problems for other people: https://bugzilla.kernel.org/show_bug.cgi?id=202419
I've rebooted between the different kernels and rebooted the VM enough to be reasonably sure that commit is the problem. Stress-ng never lasts more than 10 min with that commit but works for hours without it.
Steps to reproduce would be to create a qemu/kvm VM with debian stretch, install stress-ng version 0.07.16 and run "stress-ng --verify --cpu 1".
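The reproduction step above could be scripted roughly as follows. This is only a sketch, not something taken from the report: the function names and the 600-second default are made up for illustration, the failure markers are just the two messages quoted earlier (other stress-ng versions may word them differently), and it assumes stress-ng is installed in the guest and accepts a `--timeout` value with an `s` suffix.

```python
import subprocess
from typing import List

# The two failure messages quoted in this report (assumption: other
# stress-ng versions may phrase them differently).
FAILURE_MARKERS = (
    "sqrt not accurate enough",
    "prime error detected",
)


def find_verify_failures(output: str) -> List[str]:
    """Return the output lines that contain one of the known failure markers."""
    return [
        line.strip()
        for line in output.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def run_once(timeout_s: int = 600) -> List[str]:
    """Run the command from the report for timeout_s seconds, collect failures.

    Hypothetical helper: must be run inside the guest with stress-ng installed.
    """
    result = subprocess.run(
        ["stress-ng", "--verify", "--cpu", "1", "--timeout", f"{timeout_s}s"],
        capture_output=True,
        text=True,
        check=False,  # a failed verification is an expected outcome here
    )
    return find_verify_failures(result.stdout + result.stderr)
```

Calling `run_once()` after booting each host kernel and checking whether the returned list is empty would give a simple pass/fail signal per kernel.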
Here is the qemu-3.1.0 commandline generated by libvirt:
/usr/bin/qemu-system-x86_64 \
  -name guest=debian,debug-threads=on \
  -S \
  -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-debian/master-key.aes \
  -machine pc-i440fx-2.4,accel=kvm,usb=off,dump-guest-core=off \
  -cpu Haswell-noTSX \
  -m 2048 \
  -realtime mlock=off \
  -smp 4,sockets=4,cores=1,threads=1 \
  -uuid 0473ded4-d417-4b0e-a4f5-36ba5a2cd675 \
  -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,fd=21,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc,driftfix=slew \
  -global kvm-pit.lost_tick_policy=delay \
  -no-hpet \
  -no-shutdown \
  -global PIIX4_PM.disable_s3=1 \
  -global PIIX4_PM.disable_s4=1 \
  -boot strict=on \
  -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 \
  -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 \
  -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 \
  -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 \
  -drive if=none,id=drive-ide0-0-1,readonly=on \
  -device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1,bootindex=2 \
  -drive file=/mnt/gemini.61rn.3T/Backups/debian.raw,format=raw,if=none,id=drive-virtio-disk0 \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -netdev tap,fd=23,id=hostnet0 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 \
  -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on \
  -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 \
  -device AC97,id=sound0,bus=pci.0,addr=0x7 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 \
  -object rng-random,id=objrng0,filename=/dev/random \
  -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 \
  -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
  -msg timestamp=on
My host kernel .config is big so I put it in a paste: http://sprunge.us/u7YNBt
On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
> I run a qemu/kvm VM with debian and I've started getting segfaults and failing checksums on downloaded files. The failures are nondeterministic and similar to the failures you get with bad RAM. I tried to diagnose the problem with various testing tools and found that "stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors, usually within 60 sec:
> stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
>
> Nothing relevant has changed recently in the VM, but the host kernel was upgraded from 4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There is only one KVM-related change in that range, so I tried reverting it.
>
> By reverting commit 4124a4cff344abbf8187775eb643d9827830e715 "x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce the stress-ng error and I have no segfault or other problems with the guest.
This is the second report of this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=202419
Upon inspection, the commit in question is obviously buggy: kvm_arch_vcpu_ioctl_run() doubles up on kvm_{load,put}_guest_fpu().
The ordering of mainline commits:
f775b13eedee ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
and
5663d8f9bbe4 ("kvm: x86: fix WARN due to uninitialized guest FPU state")
were reversed when backported to 4.14. Commit 5663d8f9bbe4 even explicitly notes that it fixes f775b13eedee. I'll send a patch.
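The ordering problem described above can be illustrated with a small check. This is a hedged sketch, not a stable-tree tool: `Backport` and `misordered_fixes` are hypothetical names, and the only data used is the pair of mainline commit ids and the "fixes" relationship mentioned in this thread.

```python
from typing import List, NamedTuple, Optional, Tuple


class Backport(NamedTuple):
    upstream: str          # abbreviated mainline commit id
    fixes: Optional[str]   # mainline id this commit says it fixes, if any


def misordered_fixes(applied: List[Backport]) -> List[Tuple[str, str]]:
    """Return (fix, target) pairs where the fix was applied before its target."""
    position = {b.upstream: i for i, b in enumerate(applied)}
    bad = []
    for i, b in enumerate(applied):
        # A fix is misordered if the commit it fixes appears later in the list.
        if b.fixes is not None and position.get(b.fixes, -1) > i:
            bad.append((b.upstream, b.fixes))
    return bad


# Order as described in this thread for 4.14: the fix landed first.
stable_4_14 = [
    Backport("5663d8f9bbe4", fixes="f775b13eedee"),
    Backport("f775b13eedee", fixes=None),
]

print(misordered_fixes(stable_4_14))  # [('5663d8f9bbe4', 'f775b13eedee')]
```

With the two entries in mainline order instead, the function returns an empty list.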
Interesting, thank you for the report.
Could you confirm whether this issue reproduces on a newer kernel that has that patch (4.19.18 for example)?
-- Thanks, Sasha
On Mon, Jan 28, 2019 at 03:14:53PM -0500, Sasha Levin wrote:
> Interesting, thank you for the report.
>
> Could you confirm whether this issue reproduces on a newer kernel that has that patch (4.19.18 for example)?
The bug is specific to 4.14: two dependent commits were applied in the wrong order, which introduced the bug. I have a patch and am in the process of typing up the changelog.
linux-stable-mirror@lists.linaro.org