Hi Juergen,
On 3/4/19 4:14 PM, Juergen Gross wrote:
On 01/03/2019 03:35, Dongli Zhang wrote:
This issue is only for stable 4.9.x (e.g., 4.9.160), while the root cause is still in the lasted mainline kernel.
This is obviated by new feature patch set ended with b672592f0221 ("sched/cputime: Remove generic asm headers").
After xen guest is up for long time, once we hotplug new vcpu, the corresponding steal usage might become 100% and the steal time from /proc/stat would increase abnormally.
As we cannot wait for long time to reproduce the issue, here is how I reproduce it on purpose by accounting a large initial steal clock for new vcpu 2 and 3.
- Apply the below patch to guest 4.9.160 to account large initial steal clock
for new vcpu 2 and 3:
diff --git a/drivers/xen/time.c b/drivers/xen/time.c index ac5f23f..3cf629e 100644 --- a/drivers/xen/time.c +++ b/drivers/xen/time.c @@ -85,7 +85,14 @@ u64 xen_steal_clock(int cpu) struct vcpu_runstate_info state; xen_get_runstate_snapshot_cpu(&state, cpu);
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
if (cpu == 2 || cpu == 3)
return state.time[RUNSTATE_runnable]
+ state.time[RUNSTATE_offline]
+ 0x00071e87e677aa12;
else
return state.time[RUNSTATE_runnable]
+ state.time[RUNSTATE_offline];
} void xen_setup_runstate_info(int cpu)
- Boot hvm guest with "vcpus=2" and "maxvcpus=4". By default, VM boot with
vcpu 0 and 1.
- Hotplug vcpu 2 and 3 via "xl vcpu-set <domid> 4" on dom0.
In my env, the steal becomes 100% within 10s after the "xl vcpu-set" command on dom0.
I can reproduce on kvm with similar method. However, as the initial steal clock on kvm guest is always 0, I do not think it is easy to hit this issue on kvm.
The root cause is that the return type of jiffies_to_usecs() is 'unsigned int', but not 'unsigned long'. As a result, the leading 32 bits are discarded.
jiffies_to_usecs() is indirectly triggered by cputime_to_nsecs() at line 264. If guest is already up for long time, the initial steal time for new vcpu might be large and the leading 32 bits of jiffies_to_usecs() would be discarded.
As a result, the steal at line 259 is always large and the this_rq()->prev_steal_time at line 264 is always small. The difference at line 260 is always large during each time steal_account_process_time() is involved. Finally, the steal time in /proc/stat would increase abnormally.
252 static __always_inline cputime_t steal_account_process_time(cputime_t maxtime) 253 { 254 #ifdef CONFIG_PARAVIRT 255 if (static_key_false(¶virt_steal_enabled)) { 256 cputime_t steal_cputime; 257 u64 steal; 258 259 steal = paravirt_steal_clock(smp_processor_id()); 260 steal -= this_rq()->prev_steal_time; 261 262 steal_cputime = min(nsecs_to_cputime(steal), maxtime); 263 account_steal_time(steal_cputime); 264 this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime); 265 266 return steal_cputime; 267 } 268 #endif 269 return 0; 270 }
I have emailed the kernel mailing list about the return type of jiffies_to_usecs() and jiffies_to_msecs():
https://lkml.org/lkml/2019/2/26/899
So far, I have two solutions:
- Change the return type from 'unsigned int' to 'unsigned long' as in above
link and I am afraid it would bring side effect. The return type in latest mainline kernel is still 'unsigned int'.
- Something like below based on stable 4.9.160:
- use jiffies64_to_nsecs() instead of trying to open code it.
Thank you very much for the suggestion!
I have tested that jiffies64_to_nsecs() works well by reproducing the issue in kvm guest on purpose.
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h index 734377a..94aff43 100644 --- a/include/linux/jiffies.h +++ b/include/linux/jiffies.h @@ -287,13 +287,13 @@ extern unsigned long preset_lpj; extern unsigned int jiffies_to_msecs(const unsigned long j); extern unsigned int jiffies_to_usecs(const unsigned long j);
+extern u64 jiffies64_to_nsecs(u64 j); + static inline u64 jiffies_to_nsecs(const unsigned long j) { - return (u64)jiffies_to_usecs(j) * NSEC_PER_USEC; + return jiffies64_to_nsecs(j); }
-extern u64 jiffies64_to_nsecs(u64 j);
Below is the patch used to reproduce on kvm. cpu 2 is added to reproduce on purpose via "device_add qemu64-x86_64-cpu,id=core2,socket-id=2,core-id=0,thread-id=0" in qemu monitor. Guest is boot with qemu cmd "-smp 2,maxcpus=3"
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 77f17cb..54d46e0 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -407,7 +407,10 @@ static u64 kvm_steal_clock(int cpu) rmb(); } while ((version & 1) || (version != src->version));
- return steal; + if (cpu == 0 || cpu == 1) + return steal; + else + return steal + 0x00071e87e677aa12; }
Dongli Zhang