Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
Thanks, Guenter
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Are either of these being tracked on the regressions list? I have not noticed them being reported there, or on the stable list :(
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
We need to get tests running in kernelci on real hardware, that's going to be much more helpful here.
thanks,
greg k-h
On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Are either of these being tracked on the regressions list? I have not noticed them being reported there, or on the stable list :(
I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the report on the stable list, or at least that was the idea. Should I send separate emails to regressions@ with details?
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
We need to get tests running in kernelci on real hardware, that's going to be much more helpful here.
Yes, I know. Of course it didn't help that our internal testing didn't catch the problem until after the fact either.
Guenter
On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Are either of these being tracked on the regressions list? I have not noticed them being reported there, or on the stable list :(
I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the report on the stable list, or at least that was the idea. Should I send separate emails to regressions@ with details?
For regressions in Linus's tree, yes please do. I have seen many stable regressions also sent there as they mirror regressions in Linus's tree (right now there is at least one ACPI regression that hopefully will show up in Linus's tree soon that has hit stable as well..)
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
We need to get tests running in kernelci on real hardware, that's going to be much more helpful here.
Yes, I know. Of course it didn't help that our internal testing didn't catch the problem until after the fact either.
There will always be issues that can only be caught on real hardware, we are all just human. The goal is to handle them quickly when they are caught.
I'll go revert the above commits now, thanks.
greg k-h
On Thu, Aug 05, 2021 at 07:43:20PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Are either of these being tracked on the regressions list? I have not noticed them being reported there, or on the stable list :(
I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the report on the stable list, or at least that was the idea. Should I send separate emails to regressions@ with details?
For regressions in Linus's tree, yes please do. I have seen many stable regressions also sent there as they mirror regressions in Linus's tree (right now there is at least one ACPI regression that hopefully will show up in Linus's tree soon that has hit stable as well..)
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
We need to get tests running in kernelci on real hardware, that's going to be much more helpful here.
Yes, I know. Of course it didn't help that our internal testing didn't catch the problem until after the fact either.
There will always be issues that can only be caught on real hardware, we are all just human. The goal is to handle them quickly when they are caught.
I'll go revert the above commits now, thanks.
If you do that, it needs to be done all the way to v4.19.y.
Guenter
On Thu, Aug 05, 2021 at 12:44:54PM -0700, Guenter Roeck wrote:
On Thu, Aug 05, 2021 at 07:43:20PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Are either of these being tracked on the regressions list? I have not noticed them being reported there, or on the stable list :(
I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the report on the stable list, or at least that was the idea. Should I send separate emails to regressions@ with details?
For regressions in Linus's tree, yes please do. I have seen many stable regressions also sent there as they mirror regressions in Linus's tree (right now there is at least one ACPI regression that hopefully will show up in Linus's tree soon that has hit stable as well..)
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
We need to get tests running in kernelci on real hardware, that's going to be much more helpful here.
Yes, I know. Of course it didn't help that our internal testing didn't catch the problem until after the fact either.
There will always be issues that can only be caught on real hardware, we are all just human. The goal is to handle them quickly when they are caught.
I'll go revert the above commits now, thanks.
If you do that, it needs to be done all the way to v4.19.y.
4.9.y for the first one, 4.4.y for the second.
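As a rough illustration of how one can tell how far back a backport travelled, and therefore how far back a revert has to go: search each stable branch for the backport's subject, or for the "commit <sha> upstream" line that stable commits carry, and check which mainline release first shipped the original commit. The remote and branch names below are only examples, and <mainline sha> is a placeholder:

    # which stable branches picked up the backport (search by subject):
    git log --oneline --grep='spi: mediatek: fix fifo rx mode' stable/linux-4.9.y

    # the same, using the "commit <mainline sha> upstream" line that
    # stable backports add to their changelog:
    git log --oneline --grep='<mainline sha> upstream' stable/linux-4.4.y

    # and the first mainline release that contained the original commit:
    git describe --contains --match='v*' <mainline sha>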
thanks,
greg k-h
Hi Guenter,
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
The first one is really interesting. The author did everything right by documenting which commit this patch fixed; that commit was indeed present in the stable branches, and given that the change is probably only understood by the driver's maintainer, it is very likely that he acted in good faith after some testing on real hardware. So there is little chance that any extra form of automated testing would have caught this if it worked in at least one place.
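As an aside, that check is mostly mechanical. Assuming the stable trees are fetched under a "stable" remote (the remote name and <fixes sha> below are only illustrative), a backporter can verify whether the commit named in a Fixes: tag is already in a given branch with something like:

    # is the commit named in the Fixes: tag an ancestor of the stable branch?
    git merge-base --is-ancestor <fixes sha> stable/linux-5.10.y && echo "present in 5.10.y"

    # if the fixed commit was itself backported (so it has a different sha in
    # stable), look for the "commit <fixes sha> upstream" changelog line instead:
    git log --oneline --grep='<fixes sha> upstream' stable/linux-5.10.y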
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
One solution might be to further increase the delay between a patch and its integration, but do we think it would increase the likelihood of the bug being detected and reported in *some* environment? And if so, is the overall situation any better with *some* users experiencing a possible rare regression compared to leaving 100% of users exposed to a known bug in stable branches? That's always difficult. As a user of the stable branches, I personally prefer to risk a rare regression and report it rather than get no fixes, because if the risk of regression in a patch is 1%, I'd rather get 99 useful fixes and 1 regression than no fix for a bug that bothers me.
So very likely the most robust solution is to further encourage users to report regressions as soon as they are encountered, so that the faulty commits are spotted and reverted, and their authors are made aware that a corner case was identified. Greg is always very fast to respond to requests for reverts.
Also, for a developer, being aware of a deployment exhibiting an issue is extremely valuable, and the chances of spotting an issue and getting it fixed are much higher if the delay between integration and deployment is shorter. Otherwise it can sometimes take months to years before driver code lands in users' hands, especially with embedded systems where the rule remains "if it's not broken, don't touch it".
So in the end, the more often users upgrade, the better, both for them and for spotting issues. I know it doesn't please everyone, but while nobody likes bugs, someone has to face them at some point in order to report them :-/
In an ideal world we could imagine that postponing sensitive backports to older branches would improve their stability and reduce users' exposure. We're doing this to a certain extent in haproxy and it sort of works. But the cost of keeping fixes queued and postponing them is high, and the risk of failing a backport is much higher this way, because either you prepare all the backports at once and risk that the context has changed between the initial backport and the merge, or you have to do them at the last moment, without remembering any of the analysis that had to be done for the first branches.
Maybe in the end a sweet spot could be to release older branches less often and with more patches each time, giving more recent branches more chances to expose faulty backports first, and affecting super-stable users even less?
Just my two cents on this never-ending debate :-/ Willy
On Thu, Aug 05, 2021 at 06:42:54PM +0200, Willy Tarreau wrote:
Hi Guenter,
On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
Hi folks,
we have (at least) two severe regressions in stable releases right now.
[SHAs are from linux-5.10.y]
2435dcfd16ac spi: mediatek: fix fifo rx mode
    Breaks SPI access on all Mediatek devices for small transactions
    (including all Mediatek-based Chromebooks, since they use small SPI
    transactions for EC communication).
60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
    Breaks Bluetooth on various devices (Mediatek and possibly others).
    Discussion: https://lkml.org/lkml/2021/7/28/569
Unfortunately, it appears that none of our testing covers SPI and Bluetooth.
I understand that upstream is just as broken until fixes are applied there. Still, it shows that our test coverage is far from where it needs to be, and/or that we may be too aggressive with backporting patches to stable releases.
If you have an idea how to improve the situation, please let me know.
The first one is really interesting. The author did everything right by documenting which commit this patch fixed; that commit was indeed present in the stable branches, and given that the change is probably only understood by the driver's maintainer, it is very likely that he acted in good faith after some testing on real hardware. So there is little chance that any extra form of automated testing would have caught this if it worked in at least one place.
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
No. The patch is wrong. It just _looks_ correct at first glance. It claims to fix something that wasn't broken. FIFO rx mode was working just fine, handled in the receive interrupt as one would expect. The patch copies data from the rx FIFO before the transfer is even started. I do not think it was tested on real hardware, or at least the FIFO receive path was not tested.
The patch _does_ fix a problem on the transmit side, but the patch subject doesn't mention that.
Guenter
On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
No. The patch is wrong. It just _looks_ correct at first glance.
So that's the core of the problem. Stable maintainers cannot be tasked with analysing each patch in its finest details to figure out whether a maintainer who is presumably more knowledgeable than them about their driver did something wrong.
Then in my opinion we should encourage people *not* to use "Fixes:" on untested patches (untested patches will always happen, due to hardware not being available or the lack of a reliable reproducer).
What about this to try to improve the situation in this specific case?
Willy
From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 5 Aug 2021 20:24:30 +0200
Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if untested

'Fixes:' usually is taken as authority and often results in a backport. If a patch couldn't be tested although it looks perfectly valid, better not add this tag to leave a final chance for backporters to ask about the relevance of the backport or to check for future tests.

Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 Documentation/process/submitting-patches.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 0852bcf73630..54782b0e2f4c 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -140,6 +140,15 @@ An example call::

         $ git log -1 --pretty=fixes 54a4f0239f2e
         Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")

+Please note that a 'Fixes:' tag will most often result in your patch being
+automatically backported to stable branches. If for any reason you could not
+test that it really fixes the problem (for example, because the bug is not
+reproducible, or because you did not have access to the required hardware
+at the time of writing the patch to verify it does not cause regressions),
+even if you are absolutely certain of your patch's validity, do not include
+a 'Fixes:' tag and instead explain the situation in the commit message in
+plain English.
+
 .. _split_changes:
Separate your changes
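To make the recommendation concrete, a changelog written along those lines might look roughly like the following; the driver name, commit id, and author here are purely hypothetical:

    foo: fix divider rounding in clock setup

    The divider is rounded down, which can overclock the bus on some
    boards. The problem was introduced by commit 123456abcdef ("foo: add
    clock dividers"), but I could not test this change on real hardware,
    so I am deliberately not adding a Fixes: tag; please only consider a
    backport once the fix has been confirmed on an affected board.

    Signed-off-by: J. Developer <jd@example.org>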
On 8/5/21 11:30 AM, Willy Tarreau wrote:
On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
No. The patch is wrong. It just _looks_ correct at first glance.
So that's the core of the problem. Stable maintainers cannot be tasked with analysing each patch in its finest details to figure out whether a maintainer who is presumably more knowledgeable than them about their driver did something wrong.
Then in my opinion we should encourage people *not* to use "Fixes:" on untested patches (untested patches will always happen, due to hardware not being available or the lack of a reliable reproducer).
What about this to try to improve the situation in this specific case?
Willy
From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 5 Aug 2021 20:24:30 +0200
Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if untested

'Fixes:' usually is taken as authority and often results in a backport. If a patch couldn't be tested although it looks perfectly valid, better not add this tag to leave a final chance for backporters to ask about the relevance of the backport or to check for future tests.

Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>

 Documentation/process/submitting-patches.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 0852bcf73630..54782b0e2f4c 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -140,6 +140,15 @@ An example call::

         $ git log -1 --pretty=fixes 54a4f0239f2e
         Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")

+Please note that a 'Fixes:' tag will most often result in your patch being
+automatically backported to stable branches. If for any reason you could not
+test that it really fixes the problem (for example, because the bug is not
+reproducible, or because you did not have access to the required hardware
+at the time of writing the patch to verify it does not cause regressions),
+even if you are absolutely certain of your patch's validity, do not include
+a 'Fixes:' tag and instead explain the situation in the commit message in
+plain English.
I don't think that would be a good idea. First, it would discourage people from using Fixes: tags, and second, it would not really solve the problem either.
While I am sure that the patch in question wasn't really tested (after all, it broke EC communication on all Mediatek Chromebooks using SPI to communicate with the EC), we don't know what the author tested. It may well be (and is quite likely) that the author _did_ test the patch, only not the affected code path. That means the author may still have added a Fixes: tag in the mistaken belief of having tested the fix, even with the above restrictions in place.
The same is true for the Bluetooth patch: This patch was for sure tested and works on many Bluetooth devices, except for those with a specific Mediatek controller and driver. Again, the Fixes: tag would be there (and it would be completely unreasonable to expect that infrastructure patches are tested on all affected hardware).
No, I think all we can do is to improve test coverage.
Thanks, Guenter
On Fri, Aug 06, 2021 at 07:16:32AM -0700, Guenter Roeck wrote:
On 8/5/21 11:30 AM, Willy Tarreau wrote:
On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
No. The patch is wrong. It just _looks_ correct at first glance.
So that's the core of the problem. Stable maintainers cannot be tasked with analysing each patch in its finest details to figure out whether a maintainer who is presumably more knowledgeable than them about their driver did something wrong.
Then in my opinion we should encourage people *not* to use "Fixes:" on untested patches (untested patches will always happen, due to hardware not being available or the lack of a reliable reproducer).
What about this to try to improve the situation in this specific case?
Willy
From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 5 Aug 2021 20:24:30 +0200
Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if untested

'Fixes:' usually is taken as authority and often results in a backport. If a patch couldn't be tested although it looks perfectly valid, better not add this tag to leave a final chance for backporters to ask about the relevance of the backport or to check for future tests.

Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>

 Documentation/process/submitting-patches.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 0852bcf73630..54782b0e2f4c 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -140,6 +140,15 @@ An example call::

         $ git log -1 --pretty=fixes 54a4f0239f2e
         Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")

+Please note that a 'Fixes:' tag will most often result in your patch being
+automatically backported to stable branches. If for any reason you could not
+test that it really fixes the problem (for example, because the bug is not
+reproducible, or because you did not have access to the required hardware
+at the time of writing the patch to verify it does not cause regressions),
+even if you are absolutely certain of your patch's validity, do not include
+a 'Fixes:' tag and instead explain the situation in the commit message in
+plain English.
I don't think that would be a good idea. First, it would discourage people from using Fixes: tags, and second, it would not really solve the problem either.
While I am sure that the patch in question wasn't really tested (after all, it broke EC communication on all Mediatek Chromebooks using SPI to communicate with the EC), we don't know what the author tested. It may well be (and is quite likely) that the author _did_ test the patch, only not the affected code path. That means the author may still have added a Fixes: tag in the mistaken belief of having tested the fix, even with the above restrictions in place.
But this is exactly the principle of a regression, and the first rule that everyone learns in programming is that 100% coverage never exists. That's particularly true in large projects where the skills are distributed over many people. Everyone faithfully tests the faulty case (i.e. "I tested that it does solve the problem I could reproduce"), without knowledge of plenty of other cases. I do cause regressions myself when fixing bugs, despite running plenty of regtests, and I hate this. The only thing I can do in that case is to communicate about it as soon as it is detected and produce another fix as fast as possible.
No, I think all we can do is to improve test coverage.
Improving test coverage is a way to *reduce* the frequency of regressions; it *never* eliminates them. And I'd really like people to start accepting that regressions are part of a normal development process, as careful as it may be, which is precisely why deploying updates from the least sensitive to the most sensitive environments is extremely important, and why bug reports are critical to overall stability. There is no such thing as zero-risk software, nor is there any zero-risk upgrade. It's probably convenient for some users to complain that "the kernel maintainers" broke their system during an update between two stable releases, but that is looking at a single failure in the middle of thousands of fixes, and ignoring the fact that they maybe didn't even prepare the rollback plan that must come with *any* software upgrade. And given that the alternative is to keep all their bugs unfixed, I find this a bit exaggerated sometimes.
I have noticed that, among the people I've seen complaining about regressions in -stable, those who faced more than one in a major branch are extremely rare. Sure, some regressions are more painful than others, but that's still an extremely high success rate.
Now maybe there are *classes* of bugs that could be better detected through automated testing (new platforms, new workloads, etc.). But I strongly doubt it will make whiners complain less, given that their use cases are probably already widely covered and the failures still slip through the rare cracks. And they are likely not the ones reporting problems here in the first place.
Cheers, Willy
On Thu, Aug 05, 2021 at 08:30:55PM +0200, Willy Tarreau wrote:
On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
It looks like a typical "works for me" regression. The best thing that could possibly be done to limit such occurrences would be to wait "long enough" before backporting them, in the hope of catching breakage reports before the backport, but here there were already 3 weeks between when the patch was submitted and when it was backported.
No. The patch is wrong. It just _looks_ correct at first glance.
So that's the core of the problem. Stable maintainers cannot be tasked with analysing each patch in its finest details to figure out whether a maintainer who is presumably more knowledgeable than them about their driver did something wrong.
Then in my opinion we should encourage people *not* to use "Fixes:" on untested patches (untested patches will always happen, due to hardware not being available or the lack of a reliable reproducer).
What about this to try to improve the situation in this specific case?
No, please let's not. If there is no testing story behind a buggy patch then it'll explode either when we go to the next version, or when we pull it into -stable.
If we avoid taking groups of patches into -stable it'll just mean that we end up with a huge amount of issues waiting for us during a version upgrade.
Yes, we may be taking bugs in now, but the regression rate (according to LWN) is pretty low, and the somewhat linear distribution of those bugs throughout our releases makes them manageable (to review when they're sent out, to find during testing, to bisect if we hit the bug).
As Guenter points out, the path forward should be to improve our testing story.
Hi Sasha,
On Fri, Aug 06, 2021 at 10:37:47AM -0400, Sasha Levin wrote:
Then in my opinion we should encourage people *not* to use "Fixes:" on untested patches (untested patches will always happen, due to hardware not being available or the lack of a reliable reproducer).
What about this to try to improve the situation in this specific case?
No, please let's not. If there is no testing story behind a buggy patch then it'll explode either when we go to the next version, or when we pull it into -stable.
If we avoid taking groups of patches into -stable it'll just mean that we end up with a huge amount of issues waiting for us during a version upgrade.
I agree with this, and that was the point I was explaining as well: someone has to detect those bugs, and unfortunately, if they are so hard to see that they cannot be detected before reaching users, they have to hit a user first.
Yes, we may be taking bugs in now, but the regression rate (according to LWN) is pretty low, and the somewhat linear distribution of those bugs throughout our releases makes them manageable (to review when they're sent out, to find during testing, to bisect if we hit the bug).
I totally agree that it is extremely low. I have only faced one in production in 4 or 5 years now, and even then it was not a true one; it was caused by a context change that made one local patch silently apply in the wrong place.
As Guenter points out, the path forward should be to improve our testing story.
Yes, but we also have to make people understand that it only improves the situation a little bit and doesn't magically result in zero regressions. Most whiners seem to say "I've met a regression once, that's unacceptable". This will not change their experience; it will just reduce the number of whiners who complain about their first bug ever. The amount of effort to invest in testing to reduce regressions by just 1% can be huge, and at some point one has to wonder where resources are better assigned.
Again, I tend to think that releasing older branches less often (and with more patches each time) could reassure some users. It used to happen in the past when Paul, Ben and I were in charge of older & slower branches, and it sometimes allowed us to drop one patch and its subsequent revert from a series. That's still one regression saved whenever it happens. And it maintains the principle of "older==extremely stable with slower fixes, newer==very stable with faster fixes".
Cheers, Willy