Hi Russell,
czw., 10 gru 2020 o 16:46 Russell King - ARM Linux admin linux@armlinux.org.uk napisał(a):
On Thu, Dec 10, 2020 at 03:35:29PM +0100, Marcin Wojtas wrote:
Hi Greg,
śr., 9 gru 2020 o 11:59 Greg Kroah-Hartman gregkh@linuxfoundation.org napisał(a):
What part fixes the issue? I can't see it...
I re-checked in my setup and here's the smallest part of the original patch, that fixes previously described issue: diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index e98be8372780..9d71a4fe1750 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -4767,6 +4767,11 @@ static void mvpp2_port_copy_mac_addr(struct net_device *dev, struct mvpp2 *priv, eth_hw_addr_random(dev); }
+static struct mvpp2_port *mvpp2_phylink_to_port(struct phylink_config *config) +{
return container_of(config, struct mvpp2_port, phylink_config);
+}
static void mvpp2_phylink_validate(struct phylink_config *config, unsigned long *supported, struct phylink_link_state *state) @@ -5105,13 +5110,12 @@ static void mvpp2_gmac_config(struct mvpp2_port *port, unsigned int mode, static void mvpp2_mac_config(struct phylink_config *config, unsigned int mode, const struct phylink_link_state *state) {
struct net_device *dev = to_net_dev(config->dev);
struct mvpp2_port *port = netdev_priv(dev);
struct mvpp2_port *port = mvpp2_phylink_to_port(config); bool change_interface = port->phy_interface != state->interface; /* Check for invalid configuration */ if (mvpp2_is_xlg(state->interface) && port->gop_id != 0) {
netdev_err(dev, "Invalid mode on %s\n", dev->name);
netdev_err(port->dev, "Invalid mode on %s\n", port->dev->name); return; }
@@ -5151,8 +5155,7 @@ static void mvpp2_mac_link_up(struct phylink_config *config, int speed, int duplex, bool tx_pause, bool rx_pause) {
struct net_device *dev = to_net_dev(config->dev);
struct mvpp2_port *port = netdev_priv(dev);
struct mvpp2_port *port = mvpp2_phylink_to_port(config); u32 val; if (mvpp2_is_xlg(interface)) {
@@ -5199,7 +5202,7 @@ static void mvpp2_mac_link_up(struct phylink_config *config,
mvpp2_egress_enable(port); mvpp2_ingress_enable(port);
netif_tx_wake_all_queues(dev);
netif_tx_wake_all_queues(port->dev);
}
static void mvpp2_mac_link_down(struct phylink_config *config,
The problem is caused by this hack:
/* Phylink isn't used as of now for ACPI, so the MAC has to be * configured manually when the interface is started. This will * be removed as soon as the phylink ACPI support lands in. */ struct phylink_link_state state = { .interface = port->phy_interface, }; mvpp2_mac_config(&port->phylink_config, MLO_AN_INBAND, &state); mvpp2_mac_link_up(&port->phylink_config, MLO_AN_INBAND, port->phy_interface, NULL);
which passes an un-initialised (zeroed) port->phylink_config, as phylink is not used in ACPI setups.
The problem occurs because port->phylink_config.dev (which is a NULL pointer in this instance) is passed to to_net_dev():
#define to_net_dev(d) container_of(d, struct net_device, dev)
Which then means netdev_priv(dev) attempts to dereference a not-quite NULL pointer, leading to an oops.
Correct. Hopefully the MDIO+ACPI patchset will land and I'll be able to get rid of this hack and do not maintain dual handling paths any longer.
The problem here is that the bug was not noticed; it seems hardly anyone bothers to run mainline kernels with ACPI on Marvell platforms, or if they do, they don't bother reporting to mainline communities when they have problems. Likely, there's posts on some random web-based bulletin board or mailing list that kernel developers don't read somewhere complaining that there's an oops.
I must admit that due to other duties I did not follow the mainline mvpp2 for a couple revisions (and I am not maintainer of it). However recently I got reached-out directly by different developers - the trigger was different distros upgrading the kernel above v5.4+ and for some reasons the DT path is not chosen there (and the ACPI will be chosen more and more in the SystemReady world).
Like...
https://lists.einval.com/pipermail/macchiato/2020-January/000309.html https://gist.github.com/AdrianKoshka/ff9862da2183a2d8e26d47baf8dc04b9
This kind of segmentation is very disappointing; it means potentially lots of bugs go by unnoticed by kernel developers, and bugs only get fixed by chance. Had it been reported to somewhere known earlier this year, it is likely that a proper fix patch would have been created.
Totally agree. As soon as I got noticed directly it took me no time to find the regression root cause.
How this gets handled is ultimately up to the stable developers to decide what they wish to accept. Do they wish to accept a back-ported full version of my commit 6c2b49eb9671 ("net: mvpp2: add mvpp2_phylink_to_port() helper") that unintentionally fixed this unknown issue, or do they want a more minimal fix such as the cut-down version of that commit that Marcin has supplied.
Until something changes in the way bugs get reported, I suspect this won't be the only instance of bug-fixing-by-accident happening.
Thank you Russell for your input.
Best regards, Marcin