[REGRESSION][BISECTED] Boot stall from merge tag 'net-next-6.2'


Sami Korkalainen

26 May 2023

7:17 p.m.

Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop, the 6.1.30 works still fine. The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may boot ok on the first try, but when rebooting, the very same version doesn't boot.

Sometimes, when trying to boot, I get this message repeated forever:
ACPI Error: No handler or method for GPE [XX], disabling event (20221020/evgpe-839)
On newer kernels, the date is 20230331 instead of 20221020. There is also some other error, but I can't read it as it gets overwritten by the other ACPI error; see the image linked at the end.

And sometimes the screen just stays completely blank.

I tried booting with acpi=off, but it does not help.

I bisected and this is the first bad commit 7e68dd7d07a2 "Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next"

As the later kernels had the seemingly random booting behaviour (mentioned above), I retested the last good one 7c4a6309e27f by booting it several times and it boots every time.

I tried getting some boot logs, but the boot process does not go far enough to make any logs.

Kernel .config file: https://0x0.st/Hqt1.txt

Environment (outputs of a working Linux 6.1 build):
Software (output of the ver_linux script): https://0x0.st/Hqte.txt
Processor information (from /proc/cpuinfo): https://0x0.st/Hqt2.txt
Module information (from /proc/modules): https://0x0.st/HqtL.txt
/proc/ioports: https://0x0.st/Hqt9.txt
/proc/iomem: https://0x0.st/Hqtf.txt
PCI information ('lspci -vvv' as root): https://0x0.st/HqtO.txt
SCSI information (from /proc/scsi/scsi):

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: KINGSTON SVP200S Rev: C4
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: hp Model: CDDVDW TS-L633M Rev: 0301
  Type: CD-ROM ANSI SCSI revision: 05

Distribution: Arch Linux
Boot manager: systemd-boot (UEFI)

git bisect log: https://0x0.st/Hqgx.txt
ACPI Error (sorry for the dusty screen): https://0x0.st/HqEk.jpeg

#regzbot ^introduced 7e68dd7d07a2

Best regards,
Sami Korkalainen


Bagas Sanjaya

27 May

1:17 a.m.

On Fri, May 26, 2023 at 07:17:26PM +0000, Sami Korkalainen wrote:

...

Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop, the 6.1.30 works still fine. The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may boot ok on the first try, but when rebooting, the very same version doesn't boot. Some times, when trying to boot, I get this message repeated forever: ACPI Error: No handler or method for GPE [XX], disabling event (20221020/evgpe-839) On newer kernels, the date is 20230331 instead of 20221020. There is also some other error, but I can't read it as it gets overwritten by the other ACPI error, see image linked at the end.

And some times, the screen will just stay completely blank.

I tried booting with acpi=off, but it does not help. I bisected and this is the first bad commit 7e68dd7d07a2 "Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next"

I think networking changes shouldn't cause this ACPI regression, right?

...

   
As the later kernels had the seemingly random booting behaviour (mentioned above), I retested the last good one 7c4a6309e27f by booting it several times and it boots every time.

I tried getting some boot logs, but the boot process does not go far enough to make any logs.

Kernel .config file: https://0x0.st/Hqt1.txt Environment (outputs of a working Linux 6.1 build): Software (output of the ver_linux script): https://0x0.st/Hqte.txt Processor information (from /proc/cpuinfo): https://0x0.st/Hqt2.txt Module information (from /proc/modules): https://0x0.st/HqtL.txt /proc/ioports: https://0x0.st/Hqt9.txt /proc/iomem: https://0x0.st/Hqtf.txt PCI information ('lspci -vvv' as root): https://0x0.st/HqtO.txt SCSI information (from /proc/scsi/scsi)

Where is SCSI info?

...

Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: KINGSTON SVP200S Rev: C4 Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: hp Model: CDDVDW TS-L633M Rev: 0301 Type: CD-ROM ANSI SCSI revision: 05 Distribution: Arch Linux Boot manager: systemd-boot (UEFI)

git bisect log: https://0x0.st/Hqgx.txt ACPI Error (sorry for the dusty screen): https://0x0.st/HqEk.jpeg

#regzbot ^introduced 7e68dd7d07a2

Best regards Sami Korkalainen

Anyway, I'm also Cc'ing the netdev and ACPI lists and maintainers (maybe they have an idea of what's going on here) and fixing up the regzbot entry title:

#regzbot title: Boot stall with ACPI error (no handler/method for GPE) caused by net-next 6.2 pull

Thanks.

-- An old man doll... just what I always wanted! - Clara

Sami Korkalainen

4:07 a.m.

...

Where is SCSI info?

Right there, under the text (It was so short, that I thought to put it in the message. Maybe I should have put that also in pastebin for consistency and clarity):

...

I think networking changes shouldn't cause this ACPI regression, right?

Yeah, beats me, but that's what I got by bisecting. My expertise ends about here.

Bagas Sanjaya

12 Jun

2:07 p.m.

On Sat, May 27, 2023 at 04:07:56AM +0000, Sami Korkalainen wrote:

...

...
Where is SCSI info?

Right there, under the text (It was so short, that I thought to put it in the message. Maybe I should have put that also in pastebin for consistency and clarity):

Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: KINGSTON SVP200S Rev: C4 Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: hp Model: CDDVDW TS-L633M Rev: 0301 Type: CD-ROM ANSI SCSI revision: 05

...
I think networking changes shouldn't cause this ACPI regression, right?

Yeah, beats me, but that's what I got by bisecting. My expertise ends about here.

Hmm, no reply for a while.

Networking people: It looks like your v6.2 PR introduces an unrelated ACPICA regression. Can you explain why?

ACPICA people: Can you figure out why this regression happens?

Sami: Can you try the latest mainline and repeat the bisection as confirmation?

I'm considering removing this from regression tracking if there are no replies in the next several days.

Thanks.

-- An old man doll... just what I always wanted! - Clara

Sami Korkalainen

7:05 p.m.

OK. I will try the latest mainline, and if it does not work I will bisect again, but that should take at least a couple of weeks with this old PC. I can't really compile more than once a day.

Regards
Sami Korkalainen



Andrew Lunn

7:50 p.m.

On Mon, Jun 12, 2023 at 07:05:45PM +0000, Sami Korkalainen wrote:

...

Ok. I will try the latest mainline and if it does not work, I try bisecting again, but it should take at least a couple of weeks with this old PC. Can't really compile more than once a day.

Cross compiling Linux has been possible for at least 20 years. Do the build on something modern and copy the results to the target.

Andrew

Sami Korkalainen

21 Jun

6:07 a.m.

I bisected again. It seems I made some mistake last time, as I got a different result this time. Maybe that is because these problematic kernels sometimes boot fine, as I said before.
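An intermittent failure like this can mislead a bisection: if a bad kernel happens to boot once and is marked good, the search moves past the real culprit. The following is only a toy simulation under stated assumptions (a linear history of 1000 commits, a made-up culprit index, and a made-up 50% chance that a bad kernel boots anyway); none of these numbers come from the report.

```python
import random

def boots_ok(commit, first_bad, rng, flaky_boot_rate):
    """Simulate one boot attempt of the kernel built at `commit`."""
    if commit < first_bad:
        return True                         # good kernels always boot
    return rng.random() < flaky_boot_rate   # bad kernels boot only sometimes

def bisect_first_bad(n_commits, first_bad, rng, flaky_boot_rate, boots_per_commit):
    """Binary search; a commit counts as good only if every boot succeeds."""
    lo, hi = -1, n_commits - 1              # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        good = all(boots_ok(mid, first_bad, rng, flaky_boot_rate)
                   for _ in range(boots_per_commit))
        if good:
            lo = mid                        # may be wrong if a bad kernel got lucky
        else:
            hi = mid                        # a failed boot is always trustworthy here
    return hi

if __name__ == "__main__":
    # Hypothetical numbers: culprit at index 337, bad kernels boot half the time.
    for boots in (1, 5):
        wrong = sum(
            bisect_first_bad(1000, 337, random.Random(seed), 0.5, boots) != 337
            for seed in range(200)
        )
        print(f"{boots} boot(s) per commit: {wrong}/200 bisections missed the culprit")
```

In this model a failed boot is never a false alarm, so an error can only push the result to a later commit than the real one. Booting each candidate several times before marking it good reduces that risk at the cost of more reboots.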

Anyway, first bad commit (makes much more sense this time): e7b813b32a42a3a6281a4fd9ae7700a0257c1d50 efi: random: refresh non-volatile random seed when RNG is initialized

I confirmed that this is the code causing the issue by commenting it out (see the patch file). Without this code, the latest mainline boots fine.

Regards,
Sami Korkalainen

Linux regression tracking (Thorsten Leemhuis)

8:46 a.m.

[added Jason (who authored the culprit) to the list of recipients; moved net people and list to BCC, guess they are not much interested in this anymore then]

On 21.06.23 08:07, Sami Korkalainen wrote:

...

I bisected again. It seems I made some mistake last time, as I got a different result this time. Maybe, because these problematic kernels may boot fine sometimes, like I said before.

Anyway, first bad commit (makes much more sense this time): e7b813b32a42a3a6281a4fd9ae7700a0257c1d50 efi: random: refresh non-volatile random seed when RNG is initialized

I confirmed that this is the code causing the issue by commenting it out (see the patch file). Without this code, the latest mainline boots fine.

Jason, in that case it seems this is something for you. For the initial report, see here:

https://lore.kernel.org/all/GQUnKz2al3yke5mB2i1kp3SzNHjK8vi6KJEh7rnLrOQ24Orl...

Quoting a part of it:

```
Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop, the 6.1.30 works still fine. The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may boot ok on the first try, but when rebooting, the very same version doesn't boot.

And some times, the screen will just stay completely blank.

I tried booting with acpi=off, but it does not help.
```

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot introduced e7b813b32a42a3a6281a4fd9ae7700a0257c1d50

Linus Torvalds

5:56 p.m.

On Wed, 21 Jun 2023 at 01:46, Linux regression tracking (Thorsten Leemhuis) regressions@leemhuis.info wrote:

...

Jason, in that case it seems this is something for you. For the initial report, see here:

I'll just revert it for now. Writing EFI variables has always been fraught with danger - more so than just reading them - and this one just looks horrible anyway.

Calling execute_with_initialized_rng() can end up having the callback done under a spinlock with interrupts disabled, which is probably why it then has that odd double indirection through a one-time work. And in no situation should we start writing to EFI variables during early subsystem initialization, I feel.

It also probably shouldn't use the "set_variable" function at all, but the non-blocking one, and who knows if it should try to do some serialization with efi/vars.c.

I think it would be better off done in user space, but if we can't trust user space to do the right thing, at least do it much much later.

Linus

Linus Torvalds

6:08 p.m.

On Wed, 21 Jun 2023 at 10:56, Linus Torvalds torvalds@linux-foundation.org wrote:

...

I'll just revert it for now.

Btw, Thorsten, is there a good way to refer to the regzbot entry in a commit message some way? I know about the email interface, but I'd love to just be able to link to the regression entry. Now I just linked to the report in this thread.

Maybe you don't keep a long-term stable link around anywhere, and you just pick up on reverts directly, but I suspect it would be nice to be able to just link to any regression entry directly.

Linus

Thorsten Leemhuis

22 Jun

6:34 p.m.

[CCing Konstantin]

On 21.06.23 20:08, Linus Torvalds wrote:

...

On Wed, 21 Jun 2023 at 10:56, Linus Torvalds torvalds@linux-foundation.org wrote:

...
I'll just revert it for now.

Btw, Thorsten, is there a good way to refer to the regzbot entry in a commit message some way? I know about the email interface, but I'd love to just be able to link to the regression entry.

There is a separate page for each tracked regression:

https://linux-regtracking.leemhuis.info/regzbot/regression/lore/GQUnKz2al3yk...

FWIW, such pages existed earlier already, but before sending this reply I wanted to fix a related bug that changed the url slightly. One can find that link by clicking on "activity" in the regzbot webui (I need to find a better place for this link to make it more approachable :-/ ).

And yes, in this case the URL sadly is rather long -- and the long msgid is only partly to blame. If we really want to link there more regularly I could work towards making that url shorter.

That being said: I wonder if we really want to add these links to commit messages regularly. In case of this particular regression...

...

Now I just linked to the report in this thread.

...the thread with the report basically contains nearly everything relevant (except a link to the commit with the revert; but in this case that's where the journey of a curious reader would start).

But yes, for regressions with a more complex history it's different, as there the regzbot webui makes things a bit easier -- among others by directly pointing to patches in the same or other threads that otherwise are hard to find from the original thread, unless you know how to search for them on lore.

I sometimes wonder if the real solution for this kind of problem would be some bot (regzbot? bugbot?) that does something similar to the pr-tracker-bot:

1) bot notices when a patch with a Link: or Closes: tag to a thread with the msgid <foo> is posted or applied to next, mainline, or stable
2) bot posts a reply to <foo> with a short msg like "a patch that links to this thread was (posted|merged); for details see <url>"
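The two steps could be sketched roughly as follows. This is only an illustration, not how regzbot or pr-tracker-bot actually work: the function names are made up, the code assumes links of the form https://lore.kernel.org/<list>/<msgid>/, and sending the reply is left out entirely.

```python
import re
from urllib.parse import unquote

# Step 1: find Link:/Closes: trailers in a patch or commit message.
TAG_RE = re.compile(r"^(Link|Closes):[ \t]*(\S+)[ \t]*$", re.MULTILINE)
# Assumed URL shape: https://lore.kernel.org/<list>/<msgid>/
LORE_RE = re.compile(r"^https://lore\.kernel\.org/[^/]+/([^/]+)/?$")

def linked_msgids(commit_message):
    """Return the message IDs of lore threads referenced by Link:/Closes: tags."""
    msgids = []
    for _tag, url in TAG_RE.findall(commit_message):
        match = LORE_RE.match(url)
        if match:                      # ignore trailers that point elsewhere
            msgids.append(unquote(match.group(1)))
    return msgids

# Step 2: compose the short notice to post as a reply to each thread.
def reply_text(event, details_url):
    """Build the notice; `event` is either 'posted' or 'merged'."""
    if event not in ("posted", "merged"):
        raise ValueError("event must be 'posted' or 'merged'")
    return f"A patch that links to this thread was {event}; for details see {details_url}"
```

A real bot would additionally have to watch the lists and trees, deduplicate notices, and thread its reply under the right message, none of which is shown here.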

That would solve a few things (that might or might not be worth solving):

* bug reporters would become aware of the progress in case the developer forgets to CC them (which happens)

* people that run into an issue and search for existing mailed reports on lore currently have no simple way to find fixes that are already under review or were applied somewhere already

That together with lore is also more likely to be long-term stable than links to the regzbot webui.

Ciao, Thorsten

Jason A. Donenfeld

21 Jun

5:57 p.m.

+Ard - any ideas here?

On Wed, Jun 21, 2023 at 10:46 AM Linux regression tracking (Thorsten Leemhuis) regressions@leemhuis.info wrote:

...

[added Jason (who authored the culprit) to the list of recipients; moved net people and list to BCC, guess they are not much interested in this anymore then]

On 21.06.23 08:07, Sami Korkalainen wrote:

...
I bisected again. It seems I made some mistake last time, as I got a different result this time. Maybe, because these problematic kernels may boot fine sometimes, like I said before.

Anyway, first bad commit (makes much more sense this time): e7b813b32a42a3a6281a4fd9ae7700a0257c1d50 efi: random: refresh non-volatile random seed when RNG is initialized

I confirmed that this is the code causing the issue by commenting it out (see the patch file). Without this code, the latest mainline boots fine.

Jason, in that case it seems this is something for you. For the initial report, see here:

https://lore.kernel.org/all/GQUnKz2al3yke5mB2i1kp3SzNHjK8vi6KJEh7rnLrOQ24Orl...

Quoting a part of it:
Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop,
the 6.1.30 works still fine.
The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may
boot ok on the first try, but when rebooting, the very same version
doesn't boot.

Some times, when trying to boot, I get this message repeated forever:
ACPI Error: No handler or method for GPE [XX], disabling event
(20221020/evgpe-839)
On newer kernels, the date is 20230331 instead of 20221020. There is
also some other error, but I can't read it as it gets overwritten by the
other ACPI error, see image linked at the end.

And some times, the screen will just stay completely blank.

I tried booting with acpi=off, but it does not help.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page.

#regzbot introduced e7b813b32a42a3a6281a4fd9ae7700a0257c1d50

Ard Biesheuvel

23 Jun

1:55 p.m.

On Wed, 21 Jun 2023 at 19:57, Jason A. Donenfeld Jason@zx2c4.com wrote:

...

+Ard - any ideas here?

On Wed, Jun 21, 2023 at 10:46 AM Linux regression tracking (Thorsten Leemhuis) regressions@leemhuis.info wrote:

...
[added Jason (who authored the culprit) to the list of recipients; moved net people and list to BCC, guess they are not much interested in this anymore then]

On 21.06.23 08:07, Sami Korkalainen wrote:

...
I bisected again. It seems I made some mistake last time, as I got a different result this time. Maybe, because these problematic kernels may boot fine sometimes, like I said before.

Anyway, first bad commit (makes much more sense this time): e7b813b32a42a3a6281a4fd9ae7700a0257c1d50 efi: random: refresh non-volatile random seed when RNG is initialized

I confirmed that this is the code causing the issue by commenting it out (see the patch file). Without this code, the latest mainline boots fine.

Jason, in that case it seems this is something for you. For the initial report, see here:

https://lore.kernel.org/all/GQUnKz2al3yke5mB2i1kp3SzNHjK8vi6KJEh7rnLrOQ24Orl...

Quoting a part of it:
Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop,
the 6.1.30 works still fine.
The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may
boot ok on the first try, but when rebooting, the very same version
doesn't boot.

Some times, when trying to boot, I get this message repeated forever:
ACPI Error: No handler or method for GPE [XX], disabling event
(20221020/evgpe-839)
On newer kernels, the date is 20230331 instead of 20221020. There is
also some other error, but I can't read it as it gets overwritten by the
other ACPI error, see image linked at the end.

And some times, the screen will just stay completely blank.

I tried booting with acpi=off, but it does not help.

Catching up with email after my vacation, apologies for the delay.

This ship seems to have sailed in the meantime, but I'll contribute some observations anyway.

The machine in question appears to be a Vista-era Windows laptop, and I am not surprised at all that the firmware is flaky. In those days, firmware testing was limited to boot testing Windows, and nobody bothered testing for EFI compliance beyond that (as it is not needed to get the Windows sticker).

However, the failure mode still strikes me as odd, and I'd be interested in finding out whether booting with efi=noruntime makes a difference at all, as that would prevent the SetVariable() call from taking place, without affecting anything else.

Setting the variable from user space is ultimately a better choice, I think. The reason it was avoided here is so that we don't have to rely on user space to set limited permissions on the efivarfs file entry in order to prevent the seed from being world-readable (which is something, e.g., systemd does today for other 'sensitive' EFI variables, whatever that means). But given that this variable is in its own GUIDed namespace, we could easily fix that in efivarfs itself.

Linus Torvalds

5:29 p.m.

On Fri, 23 Jun 2023 at 06:55, Ard Biesheuvel ardb@kernel.org wrote:

...

Setting the variable from user space is ultimately a better choice, I think.

Doing it from the kernel might still be an option, but I think it was a huge mistake to do it *early*.

Early boot is fragile to begin with when not everything is set up, and *much* harder to debug.

So not only are problems more likely to happen in the first place, when they do happen they are a lot harder to figure out.

Maybe it would make more sense to write a new seed at kernel shutdown. Not only do you presumably have a ton more entropy at that point, but if things go sideways it's also less of a problem to have a dead machine.

Of course, shutdown is another really hard to debug situation, so not optimal either.

Linus

Jason A. Donenfeld

8:30 p.m.

Hi Linus, Ard,

On Fri, Jun 23, 2023 at 7:30 PM Linus Torvalds torvalds@linux-foundation.org wrote:

...

Maybe it would make more sense to write a new seed at kernel shutdown. Not only do you presumably have a ton more entropy at that point, but if things go sideways it's also less of a problem to have a dead machine.

We always have to write when using so that we don't credit the same seed twice, so it's gotta be used at a stage when SetVariable is somewhat working.

...

On Fri, 23 Jun 2023 at 06:55, Ard Biesheuvel ardb@kernel.org wrote:

...
Setting the variable from user space is ultimately a better choice, I think.

Doing it from the kernel might still be an option, but I think it was a huge mistake to do it *early*.

Early boot is fragile to begin with when not everything is set up, and *much* harder to debug.

So not only are problems more likely to happen in the first place, when they do happen they are a lot harder to figure out.

I think it's still worth doing in the kernel - or trying to do, at least.

I wonder why SetVariable is failing on this system, and whether there's a way to workaround it. If we wind up needing to quirk around it somewhat, then I suspect your suggestion of not doing this as early in boot might be wise. Specifically, what if we do this after workqueues are available and do it from one of them? That's still early enough in boot that it makes the feature useful, but the scheduler is alive at that point. Then in the worst case, we just get a wq stall splat, which the user is able to report, and then can figure out what to do from there.
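As a rough userspace analogy of that idea (not kernel code, and not the workqueue API), the possibly hanging firmware call can be moved onto a worker so that the caller only observes and reports a stall instead of blocking. The function name, the string return values, and the timeout are all made up for this sketch.

```python
import threading

def write_with_stall_report(firmware_write, timeout_s):
    """Run a possibly hanging call on a worker thread and report a stall
    instead of blocking the caller (a stand-in for 'boot continues')."""
    outcome = {}
    finished = threading.Event()

    def worker():
        outcome["status"] = firmware_write()   # may never return on bad firmware
        finished.set()

    # Daemon thread: a hung write does not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    if not finished.wait(timeout_s):
        return "stalled"                       # reportable, caller keeps going
    return outcome["status"]
```

The point of the design is only that the failure becomes visible and reportable; it does nothing to make the underlying firmware call succeed.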

Jason

Linus Torvalds

9:52 p.m.

On Fri, 23 Jun 2023 at 13:31, Jason A. Donenfeld Jason@zx2c4.com wrote:

...

We always have to write when using so that we don't credit the same seed twice, so it's gotta be used at a stage when SetVariable is somewhat working.

This code isn't even the code that "uses" the alleged entropy from that EFI variable in the first place. That's the code in efi_random_get_seed() in the EFI boot sequence, and appends it to the bootup randomness buffers.

And that code already seems to clear the EFI variable (or seems to append to it).

So this argument seems to be complete garbage - we absolutely do not have to write it, and your patch already just wrote it in the wrong place anyway.

Don't make excuses. That code caused boot failures, it was all done in the wrong place, and at entirely the wrong time.

Linus

Ard Biesheuvel

10:55 p.m.

On Fri, 23 Jun 2023 at 23:52, Linus Torvalds torvalds@linux-foundation.org wrote:

...

On Fri, 23 Jun 2023 at 13:31, Jason A. Donenfeld Jason@zx2c4.com wrote:

...
We always have to write when using so that we don't credit the same seed twice, so it's gotta be used at a stage when SetVariable is somewhat working.

This code isn't even the code that "uses" the alleged entropy from that EFI variable in the first place. That's the code in efi_random_get_seed() in the EFI boot sequence, and appends it to the bootup randomness buffers.

And that code already seems to clear the EFI variable (or seems to append to it).

It reads the variable twice (once to obtain the size and once to grab the data), and replaces it with a zero-length string, which causes the variable to disappear. (This is typically NOR flash with spare blocks managed by a fault tolerant write layer in software, and so really wiping the seed or overwriting it is not generally possible)
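The read-twice-then-delete sequence described above can be modelled with a mock variable store. This is a simplified sketch, not the EFI stub code: the class, the string status values, and the variable name used as a default are assumptions made for illustration, and real firmware calls also take a vendor GUID and attributes.

```python
class MockVariableStore:
    """Toy stand-in for the firmware variable services."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, name, buf_size):
        """Return (status, required_size, data), loosely modelled on GetVariable()."""
        data = self._vars.get(name)
        if data is None:
            return ("NOT_FOUND", 0, b"")
        if buf_size < len(data):
            return ("BUFFER_TOO_SMALL", len(data), b"")
        return ("SUCCESS", len(data), data)

    def set_variable(self, name, data):
        """Writing a zero-length payload makes the variable disappear."""
        if len(data) == 0:
            self._vars.pop(name, None)
        else:
            self._vars[name] = bytes(data)

def consume_seed(store, name="RandomSeed"):
    """Read the seed (size query, then data) and delete it so it is used once."""
    status, size, _ = store.get_variable(name, 0)       # first call: learn the size
    if status != "BUFFER_TOO_SMALL":
        return None                                      # no seed stored
    status, _, seed = store.get_variable(name, size)     # second call: fetch the data
    if status != "SUCCESS":
        return None
    store.set_variable(name, b"")                        # zero-length write deletes it
    return seed
```

After `consume_seed()` returns a seed once, a second call finds nothing, which matches the point that the stub only consumes and deletes the variable and never creates it.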

Using SetVariable() from boot services to delete a variable is highly unlikely to regress older systems in a similar way.

...

So this argument seems to be complete garbage - we absolutely do not have to write it, and your patch already just wrote it in the wrong place anyway.

Don't make excuses. That code caused boot failures, it was all done in the wrong place, and at entirely the wrong time.

With the revert applied, the kernel/EFI stub only consumes the variable and deletes it, but never creates it by itself, and so the code does nothing if the variable is never created in the first place.

If we leave it up to user space to create it, we won't need any policy or quirks handling in the kernel at all, which I'd prefer. The only thing we should do is special case the variable's scope GUID in efivarfs so the file is not created world-readable like we do for other variables. (This predates my involvement but I think this was an oversight). Using efivarfs will also ensure that the 'storage paranoia' logic is used on x86. (This is something I failed to take into account when I reviewed Jason's patch)
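The special case could look something like the following in spirit. This is a Python illustration of the decision only, not efivarfs code: the GUID below is a placeholder rather than the real seed GUID, and the two file modes are assumptions chosen to show the contrast.

```python
# Placeholder only: NOT the real GUID of the random seed variable.
SEED_GUID = "00000000-0000-0000-0000-000000000000"

def efivarfs_file_mode(entry_name, restricted_guids=(SEED_GUID,)):
    """Pick a file mode for an efivarfs-style 'Name-GUID' entry."""
    guid = entry_name[-36:].lower()    # a textual GUID is 36 characters long
    if guid in restricted_guids:
        return 0o600                   # readable by the owner (root) only
    return 0o644                       # assumed world-readable default
```

For example, `efivarfs_file_mode("RandomSeed-" + SEED_GUID)` yields 0o600, while an entry in any other GUID namespace keeps the default.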

Linus Torvalds

11:02 p.m.

On Fri, 23 Jun 2023 at 15:55, Ard Biesheuvel ardb@kernel.org wrote:

...

With the revert applied, the kernel/EFI stub only consumes the variable and deletes it, but never creates it by itself, and so the code does nothing if the variable is never created in the first place.

Right.

But my *point* was that if we want to create it, we DAMN WELL DO NOT WANT TO DO SO AT BOOT TIME.

Boot time is absolutely the worst possible time to do it.

We'd be much better off doing so at shutdown time, when we at least have (a) maximal entropy and (b) failures are less critical.

Jason's argument against that was pure and utter BS.

Now, there are real arguments against shutdown time: it too is horrible to debug. So shutdown is not exactly great either. It's better than bootup, but it really would be better to do it at a point where we can actually get reasonable results out if something goes wrong. Which it clearly did.

Linus

David Laight

25 Jun

3:36 p.m.

From: Linus Torvalds

...

Sent: 24 June 2023 00:03

On Fri, 23 Jun 2023 at 15:55, Ard Biesheuvel ardb@kernel.org wrote:

...
With the revert applied, the kernel/EFI stub only consumes the variable and deletes it, but never creates it by itself, and so the code does nothing if the variable is never created in the first place.

Right.

But my *point* was that if we want to create it, we DAMN WELL DO NOT WANT TO DO SO AT BOOT TIME.

Boot time is absolutely the worst possible time to do it.

We'd be much better off doing so at shutdown time, when we at least have (a) maximal entropy and (b) failures are less critical.

Or maybe better - especially for embedded systems which don't often get shut down properly (or any where someone can force a system crash and then get no saved entropy) - after the system has been running long enough to get a reasonable amount of entropy.

Also, why delete the entropy during boot? Clearly it is sub-optimal to use it twice, but that has to be better than not using any at all?

David


Jason A. Donenfeld

2:40 p.m.

On Fri, Jun 23, 2023 at 02:52:25PM -0700, Linus Torvalds wrote:

...

On Fri, 23 Jun 2023 at 13:31, Jason A. Donenfeld Jason@zx2c4.com wrote:

...
We always have to write when using so that we don't credit the same seed twice, so it's gotta be used at a stage when SetVariable is somewhat working.

This code isn't even the code that "uses" the alleged entropy from that EFI variable in the first place. That's the code in efi_random_get_seed() in the EFI boot sequence, and appends it to the bootup randomness buffers.

And that code already seems to clear the EFI variable (or seems to append to it).

Oh, doh, yea, you're right. Sorry. My mistake.

So indeed, we can probably get away with just delaying this until much later in boot, and doing this inside of a workqueue or similar, instead of in some special early boot context. Or maybe shutdown? Shutdown seems like it'd better handle potential firmware issues since hanging on shutdown is a lot better than hanging on boot. But it would be nice to keep this working during unclean shutdown, which maybe means doing it sometime after bootup is still better.

...

So this argument seems to be complete garbage - we absolutely do not have to write it, and your patch already just wrote it in the wrong place anyway.

Don't make excuses. That code caused boot failures, it was all done in the wrong place, and at entirely the wrong time.

Yes, my point was entirely wrong. I was mistaken. But it wasn't an *excuse*. I was just momentarily confused. No malice here, I promise.

Jason

Sami Korkalainen

23 Jun

6:20 p.m.

...

However, the failure mode still strikes me as odd, and I'd be interested in finding out whether booting with efi=noruntime makes a difference at all, as that would prevent the SetVariable() call from taking place, without affecting anything else.

No boot stall with efi=noruntime. Tested on 6.3.9 and 6.4-rc7.

Ard Biesheuvel

6:38 p.m.

On Fri, 23 Jun 2023 at 20:20, Sami Korkalainen sami.korkalainen@proton.me wrote:

...

Please don't send me encrypted emails.

Linus Torvalds

7:01 p.m.

On Fri, 23 Jun 2023 at 11:39, Ard Biesheuvel ardb@kernel.org wrote:

...

On Fri, 23 Jun 2023 at 20:20, Sami Korkalainen sami.korkalainen@proton.me wrote:

...
Please don't send me encrypted emails.

Heh. That must be protonmail doing some crazy stuff based on recipient. Here's Sami's email on the lists:

https://lore.kernel.org/all/CzNbNfn7R2cqLMD6_jp11Dku0OoXYJhx2AMfk8JXeQVP2EGd...

(and it's what I got too). No encryption anywhere, just the message ID from hell.

So for some reason protonmail decided that *you* are special, and singled you out for their super sekrit encryption. Presumably because Sami has your pgp key.

Linus

Jason A. Donenfeld

21 Jun

6:49 p.m.

Hi Sami,

Would you try applying https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=13bb0... instead of the revert?

Spender (CC'd) suggested to me that possibly the reason for your first mis-bisect and possibly for the result you wound up with has more to do with some non-determinism in the actual underlying bug that the above commit fixes. If applying 13bb06f8dd4207 fixes the issue, then Linus can revert the revert he just committed.

Jason

Linus Torvalds

7:51 p.m.

On Wed, 21 Jun 2023 at 11:49, Jason A. Donenfeld Jason@zx2c4.com wrote:

...

Would you try applying https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=13bb0... instead of the revert?

That commit just got merged into my tree, and it fixes a real bug, but it _shouldn't_ be what Sami sees.

The bug it fixes was only introduced in this merge window.

So any boot failures seen in older kernels would only be because it was then backported to stable trees, but Sami mentions kernel versions that don't have those stable backports (eg the original questionable bisection that ended up on a bad commit 7e68dd7d07a2).

Now, with non-repeatable boot failures, anything is possible, and Sami does mention 6.1.30 as good (implying that 6.1.31 might not be - and that is when the backport happened).

So it's certainly worth checking out, but on the face of it, that bisection result doesn't really support the bug being due to e9523a0d81899 (which came *after* e7b813b32a42).

Linus

Jason A. Donenfeld

22 Jun

1:40 p.m.

On Wed, Jun 21, 2023 at 9:51 PM Linus Torvalds torvalds@linux-foundation.org wrote:

...

Now, with non-repeatable boot failures, anything is possible, and Sami does mention 6.1.30 as good (implying that 6.1.31 might not be - and that is when the backport happened).

So it's certainly worth checking out, but on the face of it, that bisection result doesn't really support the bug being due to e9523a0d81899 (which came *after* e7b813b32a42).

Sami - awaiting your results.