On Fri, Sep 17, 2021 at 11:11:49AM +0200, Peter Zijlstra wrote:
On Thu, Sep 16, 2021 at 10:07:07AM -0500, Bjorn Helgaas wrote:
This seems to be an ongoing issue, not just a point defect in a single product, and I really hate the onesy-twosy nature of this. Is there really no way to detect this issue automatically or fix whatever Linux bug makes us trip over this? I am no clock expert, so I have absolutely no idea whether this is possible.
X86 is gifted with the grand total of _0_ reliable clocks. Given no accurate time, it is impossible to tell which one of them is broken worst. Although I suppose we could attempt to synchronize against the PMU or MPERF..
We could possibly disable the tsc watchdog for X86_FEATURE_TSC_KNOWN_FREQ && X86_FEATURE_TSC_ADJUST I suppose.
And then have people with 'creative' BIOS get to keep the pieces.
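For illustration only, the condition above could be sketched in plain userspace C roughly as follows. The struct and function names are hypothetical stand-ins; in the kernel the two flags would be tested with something like boot_cpu_has() against X86_FEATURE_TSC_KNOWN_FREQ and X86_FEATURE_TSC_ADJUST, and this is not a patch.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the two CPU feature bits named above.
 * Not a kernel structure; the names exist only for this sketch.
 */
struct cpu_caps_sketch {
	bool tsc_known_freq;	/* mirrors X86_FEATURE_TSC_KNOWN_FREQ */
	bool tsc_adjust;	/* mirrors X86_FEATURE_TSC_ADJUST */
};

/*
 * Returns true when the TSC should still be checked by the watchdog,
 * i.e. unless BOTH features are present.
 */
bool tsc_watchdog_needed_sketch(const struct cpu_caps_sketch *caps)
{
	return !(caps->tsc_known_freq && caps->tsc_adjust);
}
```

Note that the watchdog is only skipped when both bits are set; a machine with just one of them keeps today's behaviour.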
Alternatively, we can change what the TSC watchdog does for X86_FEATURE_TSC_ADJUST machines. Instead of checking time against HPET, it can check whether TSC_ADJUST changes. That should make it more resilient against HPET time itself being off.
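A rough sketch of that idea, as self-contained userspace C rather than kernel code: record each CPU's TSC_ADJUST value once, then have the periodic check compare the current values against that baseline instead of comparing elapsed time against HPET. Everything named *_sketch here is made up for illustration, and the arrays stand in for what would really be per-CPU reads of the IA32_TSC_ADJUST MSR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS 8	/* arbitrary limit for this sketch */

/* Hypothetical baseline of per-CPU TSC_ADJUST values; not a kernel type. */
struct tsc_adjust_watch_sketch {
	int64_t saved[SKETCH_MAX_CPUS];
	size_t ncpus;
};

/* Record the baseline; 'current' stands in for per-CPU MSR reads. */
void tsc_adjust_watch_init_sketch(struct tsc_adjust_watch_sketch *w,
				  const int64_t *current, size_t ncpus)
{
	assert(ncpus <= SKETCH_MAX_CPUS);
	w->ncpus = ncpus;
	for (size_t i = 0; i < ncpus; i++)
		w->saved[i] = current[i];
}

/*
 * Returns the index of the first CPU whose TSC_ADJUST no longer matches
 * the baseline, or -1 if nothing changed.
 */
int tsc_adjust_watch_check_sketch(const struct tsc_adjust_watch_sketch *w,
				  const int64_t *current)
{
	for (size_t i = 0; i < w->ncpus; i++) {
		if (current[i] != w->saved[i])
			return (int)i;
	}
	return -1;
}
```

The point of the design is that the check no longer depends on a second clock at all: a changed TSC_ADJUST means something wrote to the TSC behind our back, regardless of whether HPET is running fast or slow.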