On Fri, Feb 02, 2024, Sean Christopherson wrote:
On Fri, Feb 02, 2024, Shaoqin Huang wrote:
v2->v3:
- Rebase to v6.8-rc2.
- Use TEST_ASSERT().
Patch says otherwise.
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg) return; }
- sem_getvalue(&sem_vcpu_stop, &sem_val);
- assert(sem_val == 0);
- sem_getvalue(&sem_vcpu_cont, &sem_val);
- assert(sem_val == 0);
- /*
- We reserve page table for 2 times of extra dirty mem which
- will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg) sync_global_to_guest(vm, iteration); }
- /*
*
* Before we set the host_quit, let the vcpu has time to run, to make
* sure we consume the sem_vcpu_stop and the vcpu consume the
* sem_vcpu_cont, to keep the semaphore balance.
*/
- usleep(p->interval * 1000);
Sorry, I wasn't as explicit as I should have been. When I said I don't have a better solution, I did not mean to imply that I am ok with busy waiting as a hack-a-around.
Against my better judgment, I spent half an hour slogging through this test to figure out what's going wrong. IIUC, the problem is essentially that the test instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets host_quit, which results in the vCPU running one extra (unvalidated) iteration.
For the other modes, which stop if and only if vcpu_sync_stop_requested is set, the extra iteration is a non-issue. But because the dirty ring variant stops after every exit (to purge the ring), it hangs without an extra "continue".
So rather than blindly fire off an extra sem_vcpu_cont that may or may not be consumed, I believe the test can simply set host_quit _before_ the final "continue" so that the vCPU worker doesn't run an extra iteration.
I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and so no issues. I didn't see the assert you hit, but without the fix, I did see this fire within a few loops (less than 5 I think)l
assert(host_log_mode == LOG_MODE_DIRTY_RING || atomic_read(&vcpu_sync_stop_requested) == false);
I'll post this as two patches: one to fix the bug, and a second to have the LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above assert() can be modified as below.
Scratch patch 2, I'm pretty sure the vCPU worker can race with the main task and clear vcpu_sync_stop_requested just before the main task sets it, which would result in a false positive. I didn't see any failures, but I'm 99% certain the race exists. I suspect there are some other warts in the test that would complicate attempts to clean things up, so for now I'l just post the fix for the imbalance bug.