On 1/11/19 4:57 PM, Hans van Kranenburg wrote:
On 1/11/19 3:01 PM, Juergen Gross wrote:
On 11/01/2019 14:12, Hans van Kranenburg wrote:
Hi,
On 1/11/19 1:08 PM, Juergen Gross wrote:
Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface") broke Xen guest time handling across migration:
[ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done. [ 187.251137] OOM killer disabled. [ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [ 187.252299] suspending xenstore... [ 187.266987] xen:grant_table: Grant tables using version 1 layout [18446743811.706476] OOM killer enabled. [18446743811.706478] Restarting tasks ... done. [18446743811.720505] Setting capacity to 16777216
Fix that by setting xen_sched_clock_offset at resume time to ensure a monotonic clock value.
[...]
I'm throwing around a PV domU over a bunch of test servers with live migrate now, and in between the kernel logging, I'm seeing this:
[Fri Jan 11 13:58:42 2019] Freezing user space processes ... (elapsed 0.002 seconds) done. [Fri Jan 11 13:58:42 2019] OOM killer disabled. [Fri Jan 11 13:58:42 2019] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. [Fri Jan 11 13:58:42 2019] suspending xenstore... [Fri Jan 11 13:58:42 2019] ------------[ cut here ]------------ [Fri Jan 11 13:58:42 2019] Current state: 1 [Fri Jan 11 13:58:42 2019] WARNING: CPU: 3 PID: 0 at kernel/time/clockevents.c:133 clockevents_switch_state+0x48/0xe0 [...]
This always happens on every *first* live migrate that I do after starting the domU.
Yeah, its a WARN_ONCE().
Ok, false alarm. It's there, but not caused by this patch.
I changed the WARN_ONCE to WARN for funs, and now I get it a lot more already (v2):
https://paste.debian.net/plainh/d535a379
And you didn't see it with v1 of the patch?
No.
I was wrong. I tried a bit more, and I can also reproduce without v1 or v2 patch at all, and I can reproduce it with v4.19.9. Just sometimes needs a dozen times more live migrating it before it shows up.
I cannot make it happen to show up with the Debian 4.19.9 distro kernel, that's why it was new for me.
So, let's ignore it in this thread now.
On the first glance this might be another bug just being exposed by my patch.
I'm investigating further, but this might take some time. Could you meanwhile verify the same happens with kernel 5.0-rc1? That was the one I tested with and I didn't spot that WARN.
I have Linux 5.0-rc1 with v2 on top now, which gives me this on live migrate:
[...] [ 51.871076] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 51.871091] #PF error: [normal kernel read fault] [ 51.871100] PGD 0 P4D 0 [ 51.871109] Oops: 0000 [#1] SMP NOPTI [ 51.871117] CPU: 0 PID: 36 Comm: xenwatch Not tainted 5.0.0-rc1 #1 [ 51.871132] RIP: e030:blk_mq_map_swqueue+0x103/0x270 [...]
Dunno about all the 5.0-rc1 crashes yet. I can provide more feedback about that if you want, but not in here I presume.
Hans