Hi all,
Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on second timeout") causes a regression on several systems. Symptoms are: system reboots automatically after a short period of time if watchdog is enabled (by systemd for example). This has been reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
Unfortunately this commit was backported to all stable kernel branches (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the case, BTW, as there is no Fixes tag and no Cc to stable@vger either. And the fix is not trivial, has apparently not seen enough testing, and addresses a problem that has a known and simple workaround. IMHO it should never have been accepted as a stable patch in the first place. Especially when the previous attempt to fix this issue already ended with a regression and a revert.
Anyway... After a glance at the patch, I see what looks like a nice thinko:
+ if (p->smi_res && + (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
The author most certainly meant inl(SMI_EN(p)) (the register's value) and not SMI_EN(p) (the register's address).
On 03.08.21 16:51, Jean Delvare wrote:
Hi all,
Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on second timeout") causes a regression on several systems. Symptoms are: system reboots automatically after a short period of time if watchdog is enabled (by systemd for example). This has been reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
Unfortunately this commit was backported to all stable kernel branches (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the case, BTW, as there is no Fixes tag and no Cc to stable@vger either. And the fix is not trivial, has apparently not seen enough testing, and addresses a problem that has a known and simple workaround. IMHO it should never have been accepted as a stable patch in the first place. Especially when the previous attempt to fix this issue already ended with a regression and a revert.
Anyway... After a glance at the patch, I see what looks like a nice thinko:
- if (p->smi_res &&
(SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
The author most certainly meant inl(SMI_EN(p)) (the register's value) and not SMI_EN(p) (the register's address).
https://lkml.org/lkml/2021/7/26/349
Jan
On 03.08.21 16:59, Jan Kiszka wrote:
On 03.08.21 16:51, Jean Delvare wrote:
Hi all,
Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on second timeout") causes a regression on several systems. Symptoms are: system reboots automatically after a short period of time if watchdog is enabled (by systemd for example). This has been reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
Unfortunately this commit was backported to all stable kernel branches (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the case, BTW, as there is no Fixes tag and no Cc to stable@vger either. And the fix is not trivial, has apparently not seen enough testing, and addresses a problem that has a known and simple workaround. IMHO it should never have been accepted as a stable patch in the first place. Especially when the previous attempt to fix this issue already ended with a regression and a revert.
Anyway... After a glance at the patch, I see what looks like a nice thinko:
- if (p->smi_res &&
(SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
The author most certainly meant inl(SMI_EN(p)) (the register's value) and not SMI_EN(p) (the register's address).
That's for the fix (in line with your analysis).
I was also wondering if backporting that quickly was needed. Didn't propose it, though.
Jan
On 8/3/21 8:01 AM, Jan Kiszka wrote:
On 03.08.21 16:59, Jan Kiszka wrote:
On 03.08.21 16:51, Jean Delvare wrote:
Hi all,
Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on second timeout") causes a regression on several systems. Symptoms are: system reboots automatically after a short period of time if watchdog is enabled (by systemd for example). This has been reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
Unfortunately this commit was backported to all stable kernel branches (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the case, BTW, as there is no Fixes tag and no Cc to stable@vger either. And the fix is not trivial, has apparently not seen enough testing, and addresses a problem that has a known and simple workaround. IMHO it should never have been accepted as a stable patch in the first place. Especially when the previous attempt to fix this issue already ended with a regression and a revert.
Anyway... After a glance at the patch, I see what looks like a nice thinko:
- if (p->smi_res &&
(SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
The author most certainly meant inl(SMI_EN(p)) (the register's value) and not SMI_EN(p) (the register's address).
Yes, shame on me that I didn't see that.
That's for the fix (in line with your analysis).
I was also wondering if backporting that quickly was needed. Didn't propose it, though.
I'd suggest to discuss that with Greg and Sasha. Backporting is pretty aggressive nowadays.
Guenter
On Tue, 3 Aug 2021 16:59:02 +0200, Jan Kiszka wrote:
For the record, I tested this fix successfully on my system (as in: no more random reboots).
Tested-by: Jean Delvare jdelvare@suse.de
Thanks,
On 8/3/21 10:59 AM, Jan Kiszka wrote:
It fixes the problem for me (the person who opened the Bugzilla report) too, thanks!
Tested-by: Michael Marley michael@michaelmarley.com
On 04.08.21 02:04, Michael Marley wrote:
On 8/3/21 10:59 AM, Jan Kiszka wrote:
It fixes the problem for me (the person who opened the Bugzilla report) too, thanks!
Tested-by: Michael Marley michael@michaelmarley.com
Thanks for the confirmation!
Yeah, sorry, that original mistake is truly mine. In fact, I wrote this code twice [1] but only messed it up here. Unfortunate that it spread so quickly. I'll try discuss this with stable people eventually, if it was a one off or if there are more such cases.
Jan
[1] https://github.com/siemens/efibootguard/commit/aa89fe3cbd883198c23eaec43c444...
On Tue, Aug 03, 2021 at 04:51:08PM +0200, Jean Delvare wrote:
Hi all,
Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on second timeout") causes a regression on several systems. Symptoms are: system reboots automatically after a short period of time if watchdog is enabled (by systemd for example). This has been reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=213809
Unfortunately this commit was backported to all stable kernel branches (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the case, BTW, as there is no Fixes tag and no Cc to stable@vger either. And the fix is not trivial, has apparently not seen enough testing, and addresses a problem that has a known and simple workaround. IMHO it should never have been accepted as a stable patch in the first place. Especially when the previous attempt to fix this issue already ended with a regression and a revert.
Anyway... After a glance at the patch, I see what looks like a nice thinko:
- if (p->smi_res &&
(SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
The author most certainly meant inl(SMI_EN(p)) (the register's value) and not SMI_EN(p) (the register's address).
Let me go revert this from the stable trees now, thanks for the report.
greg k-h
linux-stable-mirror@lists.linaro.org