Resend the email using plain text.
I found some kernel performance regressions that might be related to a 4.14.y LTS commit.
4.14.y commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
The issue is observed when "console=" is used as a kernel parameter to disable the kernel console.
I browsed the Android common kernel logs and the upstream stable kernel tree and found some related changes:
printk: handle blank console arguments passed in. (link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...)
Revert "init/console: Use ttynull as a fallback when there is no console" (link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...)
It looks like upstream also noticed the regression introduced by the commit, and the workaround was to use "ttynull" to handle the "console=" case. But the "ttynull" change was reverted for other reasons mentioned in the commit message.
Any insight or recommendation will be appreciated.
Thanks, Yi Fan
On Thu, Nov 04, 2021 at 11:14:55AM -0700, Yi Fan wrote:
> Resend the email using plain text.
> I found some kernel performance regression issues that might be related w/ 4.14.y LTS commit.
> 4.14.y commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
> The issue is observed when "console=" is used as a kernel parameter to disable the kernel console.
What exact "performance issue" are you seeing?
And what kernel version are you seeing it on?
> I browsed android common kernel logs and the upstream stable kernel tree, found some related changes.
> printk: handle blank console arguments passed in. (link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...)
> Revert "init/console: Use ttynull as a fallback when there is no console" (link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...)
> It looks like upstream also noticed the regression introduced by the commit, and the workaround is to use "ttynull" to handle "console=" case. But the "ttynull" was reverted due to some other reasons mentioned in the commit message.
> Any insight or recommendation will be appreciated.
What problem exactly are you now seeing? And does it also happen on 5.15?
thanks,
greg k-h
Reply inline.
On Thu, Nov 4, 2021 at 11:56 AM Greg KH gregkh@linuxfoundation.org wrote:
> What exact "performance issue" are you seeing?
[YF] One kernel thread was randomly blocked for more than ~40 milliseconds, causing a certain task to fail to process in time.
[YF] The issue is highly random on a single device, but it might happen a few times per 24 hours on a certain percentage of devices. The overall percentage of devices that show the issue seems quite stable over a long period of time (somehow the magic number is ~40%).
[YF] Local tests on a pool of devices do not show any correlation with any particular device.
[YF] Local tests pass after reverting the above single commit; no issue is observed.
> And what kernel version are you seeing it on?
[YF] It was first found on some products with kernel version 4.14.210. Through bisection, we located the commit in 4.14.200.
> What problem exactly are you now seeing? And does it also happen on 5.15?
[YF] We have not run any tests on 5.15 yet, so we do not know whether the issue happens there.
On Thu, Nov 04, 2021 at 12:40:32PM -0700, Yi Fan wrote:
> > What exact "performance issue" are you seeing?
> [YF] One kernel thread was randomly blocked for more than ~40 milliseconds, causing a certain task to fail to process in time.
> [YF] The issue is highly random on a single device, but it might happen a few times per 24 hours on a certain percentage of devices. The overall percentage of devices that show the issue seems quite stable over a long period of time (somehow the magic number is ~40%).
> [YF] Local tests on a pool of devices do not show any correlation with any particular device.
> [YF] Local tests pass after reverting the above single commit; no issue is observed.
And what type of device is this?
If you see this thread: https://lore.kernel.org/r/f19c18fd-20b3-b694-5448-7d899966a868@roeck-us.net it looks like chromeos devices have now disabled this change, and there was a long discussion about possible issues and solutions.
Can you try the patch set referenced in that thread to see whether it resolves the issue for you? Given that I have not seen any reports of this being an issue for over a year, odds are it has already been resolved by some change that we probably also need to backport to 4.14.y.
So any help in identifying that change would be appreciated.
> > What problem exactly are you now seeing? And does it also happen on 5.15?
> [YF] We have not run any tests on 5.15 yet, so we do not know whether the issue happens there.
How about any other newer stable kernel version like 5.4.y or 5.10.y?
thanks,
greg k-h
On Mon, Nov 8, 2021 at 12:00 AM Greg KH gregkh@linuxfoundation.org wrote:
> And what type of device is this?
[YF] It happens on multiple devices on the 4.14.y kernel. (Sorry, I cannot disclose the device type here.)
> If you see this thread: https://lore.kernel.org/r/f19c18fd-20b3-b694-5448-7d899966a868@roeck-us.net it looks like chromeos devices have now disabled this change, and there was a long discussion about possible issues and solutions.
> Can you try the patch set referenced in that thread to see if that resolves the issue for you or not? Given that I have not seen any reports of this being an issue since over a year ago, odds are it has been resolved already with some change that we probably also need to backport to 4.14.y.
> So any help in identifying that change would be appreciated.
[YF] Thanks for the context. I have not yet found a clear patch that seems to solve this issue.
[YF] For the time being, reverting the offending commit seems the safest solution for 4.14.y.
> How about any other newer stable kernel version like 5.4.y or 5.10.y?
[YF] So far there is no easy way to reproduce the issue. We have future products on 5.4.y and 5.10.y, and I will keep monitoring whether similar issues are found there.
On Mon, Nov 08, 2021 at 11:17:07AM -0800, Yi Fan wrote:
> > And what type of device is this?
> [YF] It happens on multiple devices on the 4.14.y kernel. (Sorry, I cannot disclose the device type here.)
That's not helpful :(
Can you say "server" or "tiny device you hold in your hand"?
How about architecture type?
> [YF] Thanks for the context. I have not yet found a clear patch that seems to solve this issue.
> [YF] For the time being, reverting the offending commit seems the safest solution for 4.14.y.
What about for the 4.19.y kernel tree? Why is this limited to just 4.14.y?
Can you send a patch that reverts this from 4.14 that explains why it should be removed?
thanks,
greg k-h
On Tue 2021-11-09 07:27:35, Greg KH wrote:
> On Mon, Nov 08, 2021 at 11:17:07AM -0800, Yi Fan wrote:
> > On Mon, Nov 8, 2021 at 12:00 AM Greg KH gregkh@linuxfoundation.org wrote:
> > > On Thu, Nov 04, 2021 at 12:40:32PM -0700, Yi Fan wrote:
> > > > On Thu, Nov 4, 2021 at 11:56 AM Greg KH gregkh@linuxfoundation.org wrote:
> > > > > On Thu, Nov 04, 2021 at 11:14:55AM -0700, Yi Fan wrote:
> > > > > > The issue is observed when "console=" is used as a kernel parameter to
> > > > > > disable the kernel console.
I think that I see the problem. The linux-4.14.y stable branch currently ignores the "console=" parameter. As a result, a console (ttyX) is enabled by default.
> > > > > What exact "performance issue" are you seeing?
> > > > [YF] one kernel thread was randomly blocked for more than ~40 milliseconds, causing a certain task to fail to process in time.
> > > > [YF] the issue is highly random on a single device. But it might happen a few times per 24 hours on a certain percentage of devices. The overall percentage of devices that show the issue seems quite stable over a long period of time (somehow the magic number is ~40%.).
> > > > [YF] local test on a pool of devices does not show any correlation w/ any particular devices.
This might happen when there is a flood of messages to be printed to the console. It does not happen when there is no console.
It has been fixed by the upstream commit 3cffa06aeef7ece30f6b5ac0 ("printk/console: Allow to disable console output by using console="" or console=null")
The fix needs some tweaking for the stable branches because __add_preferred_console() has gained more parameters over time.
It seems that all longterm stable branches are affected. I am going to prepare the backports.
Best Regards, Petr
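The behaviour described in the message above can be modelled in user space. What follows is a minimal sketch under stated assumptions, not the kernel code and not the actual patch: pick_console() and console_output_disabled() are hypothetical helpers, and the sketch assumes that the fix maps an empty "console=" value or "console=null" to the "ttynull" driver, which discards output.

```c
#include <string.h>

/*
 * Hypothetical user-space model, not kernel code: decide which console
 * name a "console=<value>" argument selects. An empty value or "null"
 * is assumed to select "ttynull", a console that discards all output.
 * Any other value is returned unchanged.
 */
static const char *pick_console(const char *value)
{
	if (value[0] == '\0' || strcmp(value, "null") == 0)
		return "ttynull";
	return value;
}

/* Returns 1 if the selected console would discard messages, else 0. */
static int console_output_disabled(const char *value)
{
	return strcmp(pick_console(value), "ttynull") == 0;
}
```

In this model, an empty value and "null" both select "ttynull", so console_output_disabled() returns 1 for them, while a value such as "ttyS0" is passed through and the function returns 0. The regression discussed in this thread corresponds to the empty value being ignored instead, so that a default console stays enabled.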
Thanks a lot, Petr and Greg.
I saw the patches and have just started preparing a test on devices using the 4.14.y tree. I will report the test results later.
@Greg Kroah-Hartman Sorry for not providing the details in the public thread. I can sync w/ you offline.
Thanks, Yi Fan
linux-stable-mirror@lists.linaro.org