On Thu, 10 Oct 2024 09:47:04 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.
I don't think that one is that bad. Userspace got us here, and it now sees an MMIO exit for something that it is not prepared to handle. Suck it up and die (on a black size M t-shirt, please).
LOL, I'll remember that.
The situation I have in mind is a bit harder to blame on userspace, though. Supposing that the whole VM was set up correctly, multiple vCPUs entering KVM_RUN concurrently could cause this race and have 'unexpected' MMIO exits go out to userspace.
    vcpu-0                          vcpu-1
    ======                          ======
    kvm_vgic_map_resources()
      dist->ready = true
      mutex_unlock(config_lock)
                                    kvm_vgic_map_resources()
                                      if (vgic_ready())
                                        return 0

                                    < enter guest >
                                    typer = writel(0, GICD_CTLR)
                                    < data abort >
                                    kvm_io_bus_write(...)   <= No GICD, out to userspace
      vgic_register_dist_iodev()
A small but stupid window to race with.
Ah, gotcha. I guess getting rid of the early-out in kvm_vgic_map_resources() would plug that one. Want to post a fix for that?
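For illustration only, here is a minimal user-space model of the window described above. It is a sketch, not kernel code: struct model_dist and every model_* name are hypothetical stand-ins, and the "register first, publish ready last" variant is just one ordering that would close the window in this model, not necessarily the fix that gets posted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the distributor state; not the kernel's struct. */
struct model_dist {
	atomic_bool ready;	/* models dist->ready */
	bool iodev_registered;	/* models the GICD device being on the MMIO bus */
};

enum model_outcome {
	MODEL_NOT_READY,		/* caller would take the slow path */
	MODEL_HANDLED_IN_KERNEL,	/* GICD access hits the in-kernel device */
	MODEL_EXIT_TO_USERSPACE,	/* no GICD on the bus, MMIO exit */
};

/* Invariant we would like to hold: ready implies the device is registered. */
static void model_check_invariant(struct model_dist *d)
{
	assert(!atomic_load(&d->ready) || d->iodev_registered);
}

/* Ordering from the diagram, split in two so the window is observable. */
static void model_racy_publish_ready(struct model_dist *d)
{
	atomic_store(&d->ready, true);	/* dist->ready = true */
}

static void model_racy_register_iodev(struct model_dist *d)
{
	d->iodev_registered = true;	/* vgic_register_dist_iodev() */
}

/* Alternative ordering (assumption): register first, publish ready last. */
static void model_ordered_map_resources(struct model_dist *d)
{
	d->iodev_registered = true;
	atomic_store_explicit(&d->ready, true, memory_order_release);
	model_check_invariant(d);
}

/* What the second vCPU does: early-out on ready, then touch the GICD. */
static enum model_outcome model_vcpu_first_mmio(struct model_dist *d)
{
	if (!atomic_load_explicit(&d->ready, memory_order_acquire))
		return MODEL_NOT_READY;
	return d->iodev_registered ? MODEL_HANDLED_IN_KERNEL
				   : MODEL_EXIT_TO_USERSPACE;
}
```

Calling model_vcpu_first_mmio() between the two racy steps yields MODEL_EXIT_TO_USERSPACE, which is the unexpected exit from the diagram. With the ordered variant there is no point at which ready is visible without the device.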
If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.
Pushing more of the initialization complexity into userspace feels like the right thing. Since we clearly have no idea what we're doing :)
KVM APIv2?
The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
I think this is symptomatic of a more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.
Throwing it back at userspace would be nice, but unfortunately for ABI I think we need to block/spin vCPUs in the kernel til the VM is in fully working condition. A fragile userspace could explode for a 'spurious' EAGAIN/EINTR where there wasn't one before.
EINTR needs to be handled already, as this is how you report preemption by a signal. But yeah, overall, I'm not enthralled with much so far...
M.
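
For reference, a rough sketch of the userspace side of the EINTR point above. KVM_RUN failing with errno set to EINTR when a signal is pending is existing behaviour. Treating EAGAIN as "VM still being finalised, try again" is only the hypothetical from this thread. run_once_fn, run_vcpu_retrying() and the fake_* helpers are made-up names, with the callback standing in for ioctl(vcpu_fd, KVM_RUN, 0) so the loop can be exercised without /dev/kvm.

```c
#include <errno.h>

/* Hypothetical callback standing in for ioctl(vcpu_fd, KVM_RUN, 0). */
typedef int (*run_once_fn)(void *ctx);

/*
 * Retry the run call while it fails with EINTR, or with EAGAIN (the
 * hypothetical "VM still being finalised" case). Returns 0 once the call
 * succeeds, -1 on any other error or when max_retries is exhausted.
 */
static int run_vcpu_retrying(run_once_fn run_once, void *ctx, int max_retries)
{
	for (int attempt = 0; attempt <= max_retries; attempt++) {
		if (run_once(ctx) == 0)
			return 0;
		if (errno != EINTR && errno != EAGAIN)
			return -1;
		/* A real VMM would check for pending stop/pause requests here. */
	}
	return -1;
}

/* Fake vCPU for demonstration: fails failures_left times with err. */
struct fake_vcpu {
	int failures_left;
	int err;
	int calls;
};

static int fake_run_once(void *ctx)
{
	struct fake_vcpu *v = ctx;

	v->calls++;
	if (v->failures_left > 0) {
		v->failures_left--;
		errno = v->err;
		return -1;
	}
	return 0;
}
```

A VMM whose loop already looks like this would not notice an extra EINTR. One that treats every non-zero return as fatal is the fragile case mentioned above.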