Re: Userspace regression in LTS and stable kernels


Greg Kroah-Hartman

15 Feb 2019

7 a.m.

On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...

On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

...

I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

Where is Kees's fix? I'll be glad to queue it up, or just revert the above commit, whichever people think is easiest.

thanks,

greg k-h
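
Editorial sketch: the three behaviours discussed above (silent truncation, strict rejection after 8099b047ecc431518, and the relaxed check described by the title of Kees's fix) can be modelled roughly as follows. This is a simplified userspace model, not the kernel's code. The function name `parse_shebang`, the policy names, and the value 128 for BINPRM_BUF_SIZE are assumptions made for illustration.

```python
import errno

# Assumption: the buffer size in the kernels discussed in this thread.
BINPRM_BUF_SIZE = 128

POLICIES = ("truncate", "strict", "allow_arg_truncation")


def parse_shebang(script: bytes, policy: str = "truncate"):
    """Return (interpreter, argument) for a script, or raise OSError(ENOEXEC).

    policy "truncate":             cut the line at the buffer size, silently
    policy "strict":               reject any line that does not fit
    policy "allow_arg_truncation": reject only if the interpreter path
                                   itself might have been cut off
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if not script.startswith(b"#!"):
        raise OSError(errno.ENOEXEC, "not a script")

    # Only the start of the file is visible; the last byte is reserved
    # for a terminator in this model.
    window = script[2:BINPRM_BUF_SIZE - 1]
    newline = window.find(b"\n")
    complete = newline != -1
    line = window[:newline] if complete else window

    if not complete and policy == "strict":
        raise OSError(errno.ENOEXEC, "shebang line too long")

    stripped = line.lstrip(b" \t")
    if not stripped:
        raise OSError(errno.ENOEXEC, "no interpreter")

    # The interpreter path ends at the first space or tab.
    end = next((i for i, b in enumerate(stripped) if b in b" \t"), -1)
    if end == -1:
        if not complete and policy == "allow_arg_truncation":
            # No whitespace seen, so the path may continue past the buffer.
            raise OSError(errno.ENOEXEC, "interpreter path may be truncated")
        return stripped, None

    interpreter = stripped[:end]
    argument = stripped[end:].strip(b" \t") or None
    return interpreter, argument
```

Under this model, a script whose long line consists of a short interpreter path followed by long arguments parses under "truncate" and "allow_arg_truncation" but fails under "strict", which matches the "it used to work, now it doesn't" report.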

Greg Kroah-Hartman

15 Feb 2019

7:13 a.m.

New subject: Userspace regression in LTS and stable kernels

On Fri, Feb 15, 2019 at 08:00:22AM +0100, Greg Kroah-Hartman wrote:

...

On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

...
I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

Where is Kees's fix? I'll be glad to queue it up, or just revert the above commit, which ever people think is easiest.

Ah, I see the fix now, _after_ I just pushed out a bunch of stable releases. I'll go queue it up and push it out with just that fix in it now...

thanks,

greg k-h

Michal Hocko

9:10 a.m.

On Fri 15-02-19 08:00:22, Greg KH wrote:

...

On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just too early. Waiting a few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody complaining.

...

...
I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

I thought we were through this already. Automagic autoselection of patches in the core kernel (or mmotm tree patches in particular) is too dangerous. We try hard to consider each and every patch for stable. Even if something slips through then it is much more preferred to ask for a stable backport in the respective email thread and wait for a conclusion before adding it.

-- Michal Hocko SUSE Labs

Greg Kroah-Hartman

9:20 a.m.

On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...

On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just a too early. Waiting few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody would be complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

...

...
...
I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

I thought we were through this already. Automagic autoselection of patches in the core kernel (or mmotm tree patches in particular) is too dangerous. We try hard to consider each and every patch for stable. Even if something slips through then it is much more preferred to ask for a stable backport in the respective email thread and wait for a conclusion before adding it.

We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Sometimes things will slip through like this, it happens. And really, a 3 day turn-around-time to resolve this is pretty good, don't you think?

It also seems like we need another test to catch this problem from ever happening again :)

thanks,

greg k-h

Michal Hocko

9:42 a.m.

On Fri 15-02-19 10:20:13, Greg KH wrote:

...

On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just a too early. Waiting few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody would be complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

...

...
...
...
I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

I thought we were through this already. Automagic autoselection of patches in the core kernel (or mmotm tree patches in particular) is too dangerous. We try hard to consider each and every patch for stable. Even if something slips through then it is much more preferred to ask for a stable backport in the respective email thread and wait for a conclusion before adding it.

We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But this is one of the areas where things that make sense might still break because it is hard to assume what userspace depends on.

...

Sometimes things will slip through like this, it happens. And really, a 3 day turn-around-time to resolve this is pretty good, don't you think?

Yes, but that doesn't change the fact that this was not marked for stable, and I still think this is not stable material, at least not at this moment.

...

It also seems like we need another test to catch this problem from ever happening again :)

Agreed on this.

-- Michal Hocko SUSE Labs

Sasha Levin

3:19 p.m.

On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:

...

On Fri 15-02-19 10:20:13, Greg KH wrote:

...
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just a too early. Waiting few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody would be complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

You're assuming that if we wouldn't have taken this patch to stable somehow someone else would notice this bug and fix it.

What test do we have that would catch this? Which testsuite tests for long shebang lines? Where is the test added together with this patch that covers this and similar cases?

The fact is that many patches are not tested until they get to stable, whether we add them the same week they went upstream or months later. This is a great case for this: I doubt anyone but NixOS does this crazy thing with shebang lines, so who else would discover the bug?

If this is indeed a case of us jumping the gun and shipping stuff too early before all tests are complete, please point me to the test that we missed and I'll make sure that for any future kernel release it gets run before we ship a stable kernel.

...

...
...
...
...
I don't know if Oleg considered backporting that patch. I certainly did (I always do), and I decided against doing so. Yet there it is.

This came in through Sasha's tools, which give people a week or so to say "hey, this isn't a stable patch!" and it seems everyone ignored that :(

I thought we were through this already. Automagic autoselection of patches in the core kernel (or mmotm tree patches in particular) is too dangerous. We try hard to consider each and every patch for stable. Even if something slips through then it is much more preferred to ask for a stable backport in the respective email thread and wait for a conclusion before adding it.

We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But this is one of the area where things that make sense might still break because it is hard to assume what userspace depends on.

Great, so the solution is to just not take these things into stable at all? The solution should be to add tests to the patches that go in there to verify their correctness and that they don't regress in the future.

If you're really concerned about subsystems being brittle the solution is to improve their testing rather than push stuff in and hope nothing explodes.

On one hand you Ack it saying it looks great to you and should be merged, but on the other hand you're saying that you don't really trust the patch?

Really, if I wouldn't pick this patch now what do you think would have happened? It would just pop up in a few months as we roll our stable kernel forward.

...

...
Sometimes things will slip through like this, it happens. And really, a 3 day turn-around-time to resolve this is pretty good, don't you think?

Yes, but that doesn't make any difference on the fact that this was not marked for stable and I still think this is not a stable material - at least not at this moment.

Hindsight is 20/20 :)

If people were good at understanding the impact and implications their patch has on the kernel we would never introduce new bugs!

I'll happily list a bunch more patches where folks didn't think they're stable material, but turned out to be important fixes.

-- Thanks, Sasha
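
Editorial sketch: a regression check for long shebang lines, of the kind asked about above, might look roughly like this. It is a minimal sketch, not an existing test from any test suite. The helper names, the use of /bin/sh with trailing-space padding, and the value 128 for BINPRM_BUF_SIZE are assumptions. Whether the long case is accepted depends on the running kernel, so no expected output is given.

```python
import os
import subprocess
import tempfile

# Assumption: the buffer size in the kernels discussed in this thread.
BINPRM_BUF_SIZE = 128


def write_padded_script(path, shebang_len):
    """Create an executable script whose first line is shebang_len bytes.

    The interpreter path is short; spaces after it provide the padding,
    so only trailing whitespace can fall outside the kernel's buffer.
    Returns the first line without its newline.
    """
    prefix = b"#!/bin/sh"
    if shebang_len < len(prefix):
        raise ValueError("shebang_len is shorter than the interpreter path")
    first_line = prefix + b" " * (shebang_len - len(prefix))
    with open(path, "wb") as f:
        f.write(first_line + b"\nexit 0\n")
    os.chmod(path, 0o755)
    return first_line


def kernel_accepts(path):
    """True if executing the script directly succeeds with exit status 0."""
    try:
        return subprocess.run([path], check=False).returncode == 0
    except OSError:
        # For example ENOEXEC when the kernel rejects the shebang line.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for length in (BINPRM_BUF_SIZE // 2, BINPRM_BUF_SIZE * 2):
            script = os.path.join(tmp, f"shebang_{length}.sh")
            write_padded_script(script, length)
            verdict = "accepted" if kernel_accepts(script) else "rejected"
            print(length, verdict)
```

The script is executed directly rather than through a shell because a shell may fall back to interpreting the file itself when execve() fails, which would hide the kernel's verdict.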

Michal Hocko

3:52 p.m.

On Fri 15-02-19 10:19:12, Sasha Levin wrote:

...

On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 10:20:13, Greg KH wrote:

...
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just a too early. Waiting few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody would be complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

You're assuming that if we wouldn't have taken this patch to stable somehow someone else would notice this bug and fix it.

What test do we have that would catch this? Which testsuite tests for long shebang lines? Where is the test added together with this patch that covers this and similar cases?

The test is the "users out there". Right now we do not have any specialized test case because we haven't even realized it might break something. The main difference between breaking on the bleeding edge vs. stable tree is that people running on bleeding edge are more likely to expect a breakage while stable users would most likely prefer to not be guinea pigs and have, well stable trees. [...]

...

...
...
We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But this is one of the area where things that make sense might still break because it is hard to assume what userspace depends on.

Great, so the solution is to just not take these things into stable at all?

No, but if the patch author and the maintainer have considered the stable tree and haven't found convincing arguments to mark for stable then it is likely that the patch doesn't need an urgent backporting.

...

The solution should be to add tests to the patches that go in there to verify their correctness and that they don't regress in the future.

If you're really concerned about subsystems being brittle the solution is to improve their testing rather push stuff in and hope nothing explodes.

On one hand you Ack it saying it looks great to you and should be merged, but on the other hand you're saying that you don't really trust the patch?

No. But I didn't consider it stable material. You just do not really need all the patches in stable, right? I have already said that this code has been there for ages and fixing it is good to have for the future, but considering that nobody was really complaining, backporting it just adds risk, and as it turned out that risk was really not zero.

...

Really, if I wouldn't pick this patch now what do you think would have happened? It would just pop up in a few months as we roll our stable kernel forward.

and that would be a different kernel version and people kinda expect bugs with newer versions. This is not the case with the stable update.

But I guess we are just repeating the same discussion over and over. Our expectations about what the stable kernel should be differ a lot. I would like to see fewer but only important fixes while you would like to take as many fixes as possible.

-- Michal Hocko SUSE Labs

Samuel Dionne-Riel

4:18 p.m.

On 15/02/2019, Michal Hocko mhocko@kernel.org wrote:

...

On Fri 15-02-19 10:19:12, Sasha Levin wrote:

...
On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 10:20:13, Greg KH wrote:

...
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just a too early. Waiting few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody would be complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

You're assuming that if we wouldn't have taken this patch to stable somehow someone else would notice this bug and fix it.

What test do we have that would catch this? Which testsuite tests for long shebang lines? Where is the test added together with this patch that covers this and similar cases?

The test is the "users out there". Right now we do not have any specialized test case because we haven't even realized it might break something. The main difference between breaking on the bleeding edge vs. stable tree is that people running on bleeding edge are more likely to expect a breakage while stable users would most likely prefer to not be guinea pigs and have, well stable trees. [...]

...
...
...
We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But this is one of the area where things that make sense might still break because it is hard to assume what userspace depends on.

Great, so the solution is to just not take these things into stable at all?

No, but if the patch author and the maintainer have considered the stable tree and haven't found convincing arguments to mark for stable then it is likely that the patch doesn't need an urgent backporting.

...
The solution should be to add tests to the patches that go in there to verify their correctness and that they don't regress in the future.

If you're really concerned about subsystems being brittle the solution is to improve their testing rather push stuff in and hope nothing explodes.

On one hand you Ack it saying it looks great to you and should be merged, but on the other hand you're saying that you don't really trust the patch?

No. But I didn't consider it a stable material. You just do not really need all the patches in the stable, right? I have already said that this code is there for ages and fixing it is good to have for future but considering that nobody was really complaining then a backporting just adds a risk and as it turned out that risk was really not zero.

I'm sorry to interject here, but the issue was reported on the Kernel.org Bugzilla on February 2nd

- https://bugzilla.kernel.org/show_bug.cgi?id=202497

In the interest of better communication, if the need arises again, how should bugs in the RC kernels be reported so they (1) are spotted by the right maintainers and (2) not backported even though they were reported as causing breaking changes?

...

...
Really, if I wouldn't pick this patch now what do you think would have happened? It would just pop up in a few months as we roll our stable kernel forward.

and that would be a different kernel version and people kinda expect bugs with newer versions. This is not the case with the stable update.

But I guess we are just repeating the same discussion over and over. Our expectations about what the stable kernel should be differs a lot. I would like to see fewer but only important fixes while you would like to take as many fixes as possible.

-- Michal Hocko SUSE Labs

-- Samuel Dionne-Riel

Sasha Levin

6:02 p.m.

On Fri, Feb 15, 2019 at 11:18:30AM -0500, Samuel Dionne-Riel wrote:

...

I'm sorry to interject here, but the issue was reported on the Kernel.org Bugzilla on February 2nd

https://bugzilla.kernel.org/show_bug.cgi?id=202497

In the interest of better communication, if the need arises again, how should bugs in the RC kernels be reported so they (1) are spotted by the right maintainers and (2) not backported even though they were reported as causing breaking changes?

Sadly our bugzilla is rarely used, even though the information in that particular bug report is perfect.

Maybe pinging LKML and stable@vger.kernel.org would be enough, especially if you know that it's a stable commit that caused the regression.

-- Thanks, Sasha

Sasha Levin

6 p.m.

On Fri, Feb 15, 2019 at 04:52:00PM +0100, Michal Hocko wrote:

...

On Fri 15-02-19 10:19:12, Sasha Levin wrote:

...
On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 10:20:13, Greg KH wrote:

...
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:

...
On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:

...
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:

...
On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger richard.weinberger@gmail.com wrote:

...
Your shebang line exceeds BINPRM_BUF_SIZE. Before the said commit the kernel silently truncated the shebang line (and corrupted it), now it tells the user that the line is too long.

It doesn't matter if it "corrupted" things by truncating it. All that matters is "it used to work, now it doesn't"

Yes, maybe it never *should* have worked. And yes, it's sad that people apparently had cases that depended on this odd behavior, but there we are.

I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang string") wasn't marked for backporting. And, presumably as a consequence, Kees's fix "exec: load_script: allow interpreter argument truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

It came in 5.0-rc1, so it fits the "in a Linus released kernel" requirement. If we are to wait until it shows up in a -final, that would be months too late for almost all of these types of patches that are picked up.

rc1 is just too early. Waiting a few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

You're assuming that if we hadn't taken this patch to stable, someone else would somehow have noticed this bug and fixed it.

What test do we have that would catch this? Which testsuite tests for long shebang lines? Where is the test added together with this patch that covers this and similar cases?
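As a rough illustration of what such a test could look like, here is a minimal sketch. It assumes a Linux system with `/bin/echo` and a temporary directory that allows execution; the helper name `run_with_shebang` is hypothetical and not taken from any existing test suite, and the `BINPRM_BUF_SIZE` value of 128 is an assumption about the kernels discussed in this thread. Using `echo` as the interpreter makes the kernel's parsing visible: whatever argument the kernel extracted from the shebang line is printed back.

```python
import errno
import os
import subprocess
import tempfile

# Assumption: size of the kernel's exec buffer in the kernels discussed here.
BINPRM_BUF_SIZE = 128


def run_with_shebang(arg_len, interp="/bin/echo"):
    """Exec a script whose first line is '#!<interp> ' followed by arg_len x's.

    Returns (shebang_length, outcome). The outcome is the interpreter's
    stdout if the kernel accepted the script, or the errno name (for
    example 'ENOEXEC') if the exec itself failed.
    """
    shebang = "#!" + interp + " " + "x" * arg_len
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script")
        with open(path, "w") as f:
            f.write(shebang + "\n")
        os.chmod(path, 0o755)
        try:
            # No shell involved, so an ENOEXEC from the kernel is not
            # papered over by a shell falling back to /bin/sh.
            proc = subprocess.run([path], capture_output=True, text=True)
            outcome = proc.stdout
        except OSError as e:
            outcome = errno.errorcode.get(e.errno, str(e.errno))
    return len(shebang), outcome


if __name__ == "__main__":
    for arg_len in (5, 300):
        length, outcome = run_with_shebang(arg_len)
        print(f"shebang of {length} bytes -> {outcome[:40]!r}")
```

On a kernel that truncates, the long case would print a shortened argument; on one that rejects over-long lines, it would report an exec failure instead. A real regression test would pin down which of those outcomes is expected.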

The test is the "users out there". Right now we do not have any specialized test case because we haven't even realized it might break something. The main difference between breaking on the bleeding edge vs. stable tree is that people running on bleeding edge are more likely to expect breakage while stable users would most likely prefer to not be guinea pigs and have, well, stable trees. [...]

Exactly, and my argument here is that no one really tests Linus's tree. Sure, folks run -rc kernels and report bugs, but no one actually runs these kernels at larger scales.

Most "users out there" wouldn't see this patch until it ends up in a stable kernel.

...

...
...
...
We have a list of blacklisted files/subsystems for people that do not want this to happen to their area of the kernel. The patch seemed to make sense, and it passed all known tests that we currently have.

Yes, the patch makes sense (I wouldn't give my acked-by otherwise). But this is one of the areas where things that make sense might still break because it is hard to assume what userspace depends on.

Great, so the solution is to just not take these things into stable at all?

No, but if the patch author and the maintainer have considered the stable tree and haven't found convincing arguments to mark it for stable, then it is likely that the patch doesn't need urgent backporting.

Are you suggesting that waiting longer would somehow make this "safer"? This goes back to my argument above.

...

...
The solution should be to add tests to the patches that go in there to verify their correctness and that they don't regress in the future.

If you're really concerned about subsystems being brittle, the solution is to improve their testing rather than push stuff in and hope nothing explodes.

On one hand you Ack it saying it looks great to you and should be merged, but on the other hand you're saying that you don't really trust the patch?

No. But I didn't consider it stable material. You just do not really need all the patches in stable, right? I have already said that this code has been there for ages and fixing it is good to have for the future, but considering that nobody was really complaining, backporting just adds risk, and as it turned out that risk was really not zero.

...
Really, if I hadn't picked this patch now, what do you think would have happened? It would just pop up in a few months as we roll our stable kernel forward.

and that would be a different kernel version and people kinda expect bugs with newer versions. This is not the case with the stable update.

But I guess we are just repeating the same discussion over and over. Our expectations about what the stable kernel should be differ a lot. I would like to see fewer but only important fixes while you would like to take as many fixes as possible.

Maybe to clarify here: I don't want to blindly take as many patches as I can. I want to take patches based on testing results: if something looks like a fix and it passes all our tests, there shouldn't be a reason not to take it.

My view is that humans are terrible at writing and understanding code: if folks fully understood the impact of their patches we would never have bugs, right? Assuming we both agree here that we make mistakes and introduce bugs, why do you think that these very same people fully understand whether a patch should go in stable or not?

The approach of manually deciding if a patch needs to go in stable is wrong and it doesn't scale. We need to beef up our testing story and make these decisions based off of that, and not our error-prone brains that introduced these bugs to begin with.

Look at the outcome of this very issue: people sprang into action and fixed this bug quickly, but how many tests were added as a result of this? How do we know it's not going to regress again?

-- Thanks, Sasha

Michal Hocko

18 Feb

12:56 p.m.

New subject: Userspace regression in LTS and stable kernels

On Fri 15-02-19 13:00:26, Sasha Levin wrote:

...

On Fri, Feb 15, 2019 at 04:52:00PM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 10:19:12, Sasha Levin wrote:

...
On Fri, Feb 15, 2019 at 10:42:05AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 10:20:13, Greg KH wrote:

...
On Fri, Feb 15, 2019 at 10:10:00AM +0100, Michal Hocko wrote:

...
On Fri 15-02-19 08:00:22, Greg KH wrote:
> On Thu, Feb 14, 2019 at 12:20:27PM -0800, Andrew Morton wrote:
> > On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
> >
> > > On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
> > > richard.weinberger@gmail.com wrote:
> > > >
> > > > Your shebang line exceeds BINPRM_BUF_SIZE.
> > > > Before the said commit the kernel silently truncated the shebang line
> > > > (and corrupted it),
> > > > now it tells the user that the line is too long.
> > >
> > > It doesn't matter if it "corrupted" things by truncating it. All that
> > > matters is "it used to work, now it doesn't"
> > >
> > > Yes, maybe it never *should* have worked. And yes, it's sad that
> > > people apparently had cases that depended on this odd behavior, but
> > > there we are.
> > >
> > > I see that Kees has a patch to fix it up.
> >
> > Greg, I think we have a problem here.
> >
> > 8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
> > string") wasn't marked for backporting. And, presumably as a
> > consequence, Kees's fix "exec: load_script: allow interpreter argument
> > truncation" was not marked for backporting.
> >
> > 8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
> > it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.
>
> It came in 5.0-rc1, so it fits the "in a Linus released kernel"
> requirement. If we are to wait until it shows up in a -final, that
> would be months too late for almost all of these types of patches that
> are picked up.

rc1 is just too early. Waiting a few more rcs or even a final release for something that people do not see as an issue should be just fine. Consider this particular patch and tell me why it had to be rushed in the first place. The original code was broken for _years_ but I do not remember anybody complaining.

This patch was in 4.20.10, which was released on Feb 12 while 5.0-rc1 came out on Jan 6. Over a month delay.

Obviously not long enough.

You're assuming that if we hadn't taken this patch to stable, someone else would somehow have noticed this bug and fixed it.

What test do we have that would catch this? Which testsuite tests for long shebang lines? Where is the test added together with this patch that covers this and similar cases?

The test is the "users out there". Right now we do not have any specialized test case because we haven't even realized it might break something. The main difference between breaking on the bleeding edge vs. stable tree is that people running on bleeding edge are more likely to expect breakage while stable users would most likely prefer to not be guinea pigs and have, well, stable trees. [...]

Exactly, and my argument here is that no one really tests Linus's tree.

I would beg to disagree. The testing coverage is smaller, of course, because most people are running distribution/stable kernels.

...

Sure, folks run -rc kernels and report bugs, but no one actually runs these kernels at larger scales.

And this just screams that (much) more time has to pass before nice-to-have fixes are passed to the stable tree, assuming of course that they are not fixing something that users of said stable tree are actually seeing.

...

Most "users out there" wouldn't see this patch until it ends up in a stable kernel.

...and this would be on a kernel version upgrade, when some breakage is expected and tolerated more than on a minor stable update.

[...]

...

...
But I guess we are just repeating the same discussion over and over. Our expectations about what the stable kernel should be differ a lot. I would like to see fewer but only important fixes while you would like to take as many fixes as possible.

Maybe to clarify here: I don't want to blindly take as many patches as I can. I want to take patches based on testing results: if something looks like a fix and it passes all our tests, there shouldn't be a reason not to take it.

There are many things we do not have any tests for. E.g. I wasn't even aware that Perl (and others) are dealing with an excessive shebang input by re-reading the input. There are always going to be corner cases like that. The underlying thing is that nobody seemed to be complaining about the original issue addressed by Oleg. So why the heck should we push it to the stable tree and _risk_ a regression?
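For reference, the three shebang-handling behaviours at stake in this thread (silent truncation, strict rejection, and Kees's "allow interpreter argument truncation" fix) can be modelled roughly as follows. This is a simplified sketch, not the kernel's `load_script()` code: the function name, the policy labels, and the example paths are all hypothetical, and the buffer size of 128 is an assumption about the kernels discussed here.

```python
# Assumption: exec buffer size in the kernels under discussion; the last
# byte is treated as reserved for a terminator, leaving 127 usable bytes.
BINPRM_BUF_SIZE = 128
ENOEXEC = "ENOEXEC"


def parse_shebang(line, policy):
    """Model how a script's first line (without newline) might be parsed.

    policy is one of (labels are hypothetical):
      "truncate"             - cut the line at the buffer size, silently
      "strict"               - reject any line that does not fit
      "allow-arg-truncation" - reject only if the interpreter path is cut
    Returns (interpreter, argument_or_None) or ENOEXEC.
    """
    usable = BINPRM_BUF_SIZE - 1
    too_long = len(line) > usable
    if too_long and policy == "strict":
        return ENOEXEC

    # Drop "#!", keep only what fits, skip leading blanks.
    window = line[2:usable].lstrip(" \t")
    if not window:
        return ENOEXEC

    # The interpreter path ends at the first space or tab, if any.
    ends = [i for i in (window.find(" "), window.find("\t")) if i != -1]
    if not ends:
        if too_long and policy == "allow-arg-truncation":
            # No terminator in the buffer: the path itself was cut off.
            return ENOEXEC
        return window, None

    cut = min(ends)
    arg = window[cut:].strip(" \t")
    return window[:cut], (arg or None)


if __name__ == "__main__":
    long_args = "#!/usr/bin/perl " + "-I/some/dir " * 20   # hypothetical
    long_path = "#!/" + "d" * 200                           # hypothetical
    for policy in ("truncate", "strict", "allow-arg-truncation"):
        print(policy, parse_shebang(long_args, policy) != ENOEXEC,
              parse_shebang(long_path, policy) != ENOEXEC)
```

In this model, a script with a long argument list still runs under the first and third policies because the interpreter path is intact, which is the case an interpreter that re-reads its own first line can recover from. Only a truncated interpreter path is unrecoverable, and the "truncate" policy is the only one that would try to execute it.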

...

My view is that humans are terrible at writing and understanding code: if folks fully understood the impact of their patches we would never have bugs, right? Assuming we both agree here that we make mistakes and introduce bugs, why do you think that these very same people fully understand whether a patch should go in stable or not?

I haven't really seen a script that would be more efficient at this evaluation. Without full test coverage I do not see this changing anytime soon.

...

The approach of manually deciding if a patch needs to go in stable is wrong and it doesn't scale. We need to beef up our testing story and make these decisions based off of that, and not our error-prone brains that introduced these bugs to begin with.

Look at the outcome of this very issue: people sprang into action and fixed this bug quickly, but how many tests were added as a result of this? How do we know it's not going to regress again?

Yes, the issue got identified and analyzed quickly. There was no questioning this part. It is the regression in stable that bothers me. You have exposed users of a tree, which is supposed to be stable, to a bug which was totally unnecessary because nobody cared about the parsing behavior for years.

-- Michal Hocko SUSE Labs

linux-stable-mirror@lists.linaro.org

participants (4)

Greg Kroah-Hartman
Michal Hocko
Samuel Dionne-Riel
Sasha Levin