Updating the machine's kernel from v5.19.x to v6.0.x causes the machine to not
successfully boot. The machine boots successfully (and exhibits stable operation)
with version v5.19.17 and multiple earlier releases in the 5.19 line. Multiple releases
from the 6.0 line (including 6.0.0, 6.0.3, and 6.0.5), with no other changes to the
software environment, do not boot. Instead, the machine hangs after loading services
but before presenting a display manager; the machine instead shows repetitive hard
drive activity at this point and then no apparent activity.
''uname'' output for the machine successfully running v5.19.17 is:
Linux [MACHINE_NAME] 5.19.17 #1 SMP PREEMPT_DYNAMIC Mon Oct 24 13:32:29 2022 i686 Intel(R) Atom(TM) CPU N270 @ 1.60GHz GenuineIntel GNU/Linux
The machine is an OCZ Neutrino netbook, running a custom OS build largely similar to
LFS development. The kernel update uses ''make olddefconfig''.
#regzbot introduced: v5.19..v6.0
---
Dominic Jones
jonesd(a)xmission.com
Reinstating Cc stable, which I removed just before the discussion settled.
On Thu, 27 Oct 2022, Peter Xu wrote:
> ...
>
> After a re-read and 2nd thought, I think David has a valid point in that we
> shouldn't have special handling of !anon pages on CoW during fork(),
> because that seems to be against the fundamental concept of fork().
>
> So now I think I agree the !Anon original check does look a bit cleaner,
> and also make fork() behavior matching with the old/new kernels, irrelevant
> of the pin mess.
Thanks Peter. So Yuanzheng's patch for 5.10 is exactly right.
Sorry for leading everyone astray: my mistake was to suppose that
its !PageAnon check was simply to avoid the later BUG_ON(!anon_vma):
whereas David and Peter now agree that it actually corrects the
semantics for fork() on file pages.
I lift my hold on Yuanzheng's patch: nobody actually said "Acked-by",
but I think the discussion and resolution have given better than that.
(No 3rd thoughts please!)
Hugh
On Tue, Oct 11, 2022 at 03:46:15PM +0800, Yaxiong Tian wrote:
> >Note that there could be also the case in which there are many other
> >threads of executions on different cores trying to send a bunch of SCMI
> >commands (there are indeed many SCMI drivers currently in the stack) so
> >all these requests HAVE TO be queued while waiting for the shared
> >memory area to be free again, so I don't think you can assume in general
> >that shmem_tx_prepare() is called only after a successfull command completion
> >or timeout: it is probably true when an SCMI Mailbox transport is
> >used but it is not neccessarly the case when other shmem based transports
> >like optee/smc are used. You could have N threads trying to send an SCMI
> >command, queued, spinning on that loop, while waiting for the channel to
> >be free: in such a case note that you TX has not even complete, you are
> >still waiting for the channel the SCMI upper layer TX timeout has NOT
> >even being started since your the transmission itself is pending (i.e.
> >send_message has still not returned...)
>
> I read the code in optee/smc, scmi_optee_send_message()/smc_send_message()
> have channel lock before shmem_tx_prepare(). The lock was released when
> transports was completed 、timeout or err. So although they have N threads
> trying to send an SCMI command, queued, but only one could take up the channel.
> Also only one thread call shmem_tx_prepare() in one channel and spin() in it.
>
> It is also true in mailboxs or other shmem based transports,because SCMI
> protocol say:"agent must exclusive the channel."This is very reasonable through
> the channel lock rather than shared memory.
>
> So it is simple for selecting timeout period. Because there is only one thread
> spin() in one channel, so the timeout period only depending on the firmware.
> In other words this time can be a constant.
>
Yes, my bad, I forgot that any shared-mem based transport has some kind
of mechanism to access exclusively the channel for the whole transaction
to avoid some thread can issue a tx_prepare before the previous
transaction has fully completed (i.e. the result in the reply, if any,
was fetched before being overwritten by the next)
> >Not sure that all of this kind of work would be worth to address some,
> >maybe transient, error conditions due to a broken SCMI server, BUT in any
> >case, any kind of timeout you want to introduce in the spin loop MUST
> >result in a failed transmission until the FREE bitflag is cleared by the
> >SCMI server; i.e. if that flag won't be cleared EVER by the server, you
> >have to end up with a sequence of timed-out spinloops and transmission
> >failures, you definetely cannot recover forcibly like this.
>
> I totally agree above.In such broken SCMI server,users cannot get any Any
> hints.So I think it at least pr_warn(). We can set prepare_tx_timout parameter
> in scmi_desc,or just set options for users to check error.
>
Problem is anyway, as you said, you'll have to pick this timeout from the
related transport scmi_desc (even if as of now the max_rx_timeout for
all existent shared mem transport is the same..) and this means anyway
adding more complexity to the chain of calls to just to print a warn of
some kind in a rare error-situation from which you cannot recover anyway.
Due to other unrelated discussions, I was starting to think about
exposing some debug-only (Kconfig dependent) SCMI stats like timeouts, errors,
unpexpected/OoO/late_replies in order to ease the debug and monitoring
of the health of a running SCMI stack: maybe this could be a place where
to flag this FW issues without changing the spinloop above (or
to add the kind of timeout you mentioned but only when some sort of
CONFIG_SCMI_DEBUG is enabled...)...still to fully think it through, though.
Any thoughts ?
Thanks,
Cristian