On Thu, Oct 10, 2024 at 01:47:05PM +0100, Marc Zyngier wrote:
On Thu, 10 Oct 2024 09:47:04 +0100, Oliver Upton oliver.upton@linux.dev wrote:
A small but stupid window to race with.
Ah, gotcha. I guess getting rid of the early-out in kvm_vgic_map_resources() would plug that one. Want to post a fix for that?
Yep, will do.
If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.
Pushing more of the initialization complexity into userspace feels like the right thing. Since we clearly have no idea what we're doing :)
KVM APIv2?
Even better, we can just go straight to v3 and skip all the mistakes we would've made in v2.
The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
I think this is symptomatic of more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.
Throwing it back at userspace would be nice, but unfortunately for ABI I think we need to block/spin vCPUs in the kernel til the VM is in fully working condition. A fragile userspace could explode for a 'spurious' EAGAIN/EINTR where there wasn't one before.
EINTR needs to be handled already, as this is how you report preemption by a signal.
Of course, I'm just assuming userspace is mean and will complain if no signal actually arrives.