On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton oliver.upton@linux.dev wrote:
On Wed, Oct 09, 2024 at 12:36:32PM -0700, Sean Christopherson wrote:
On Wed, Oct 09, 2024, Oliver Upton wrote:
On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
As there is very little ordering in the KVM API, userspace can instanciate a half-baked GIC (missing its memory map, for example) at almost any time.
This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?
Sorry, I sent the patch bombs away and decided to get my life back for the evening... Doesn't help that the commit message isn't very clear (if not wrong in some respects),.
Ah, duh, I see it now. kvm_arch_vcpu_run_pid_change() doesn't serialize on a VM lock, and kvm_vgic_map_resources() has an early return for vgic_ready() letting it blow straight past the config_lock.
That. The problem is not so much with the vgic creation (which doesn't do much) but with the vgic_init() part followed by the map_resources horror.
Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.
I don't think that one is that bad. Userspace got us here, and they now see an MMIO exit for something that it is not prepared to handle. Suck it up and die (on a black size M t-shirt, please).
If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.
I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...
Any chance that fixing bugs where vCPU0 can be accessed (and run!) before its fully online help?
That's an equally gross bug, but kvm_vgic_create() should still be safe w.r.t. vCPU creation since both hold the kvm->lock in the right spot. That is, since kvm_vgic_create() is called under the lock any vCPUs visible to userspace should exist in the vCPU xarray.
The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
I think this is symptomatic of more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.
E.g. if that closes the vCPU0 hole, maybe the vCPU1 case can be handled a bit more gracefully?
I think this is about as graceful as we can be. The sorts of screw-ups that precipitate this error handling may involve stupidity across several KVM ioctls, meaning it is highly unlikely to be attributable / recoverable.
That's my take as well. We're faced with luserspace that's out to get us, and by the time we're in the context of a vcpu, it is too late.
I don't see how to fix this without mandating a UABI change.
Thanks,
M.