If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
Cc: stable@vger.kernel.org Reported-by: Florian Fainelli f.fainelli@gmail.com Suggested-by: Florian Fainelli f.fainelli@gmail.com Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn andrew@lunn.ch --- drivers/net/phy/phy_device.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 17d0d0555a79..53598210be6c 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -3021,6 +3021,15 @@ static int phy_led_blink_set(struct led_classdev *led_cdev, return err; }
+static void phy_leds_unregister(struct phy_device *phydev) +{ + struct phy_led *phyled; + + list_for_each_entry(phyled, &phydev->leds, list) { + led_classdev_unregister(&phyled->led_cdev); + } +} + static int of_phy_led(struct phy_device *phydev, struct device_node *led) { @@ -3054,7 +3063,7 @@ static int of_phy_led(struct phy_device *phydev, init_data.fwnode = of_fwnode_handle(led); init_data.devname_mandatory = true;
- err = devm_led_classdev_register_ext(dev, cdev, &init_data); + err = led_classdev_register_ext(dev, cdev, &init_data); if (err) return err;
@@ -3083,6 +3092,7 @@ static int of_phy_leds(struct phy_device *phydev) err = of_phy_led(phydev, led); if (err) { of_node_put(led); + phy_leds_unregister(phydev); return err; } } @@ -3305,6 +3315,9 @@ static int phy_remove(struct device *dev)
cancel_delayed_work_sync(&phydev->state_queue);
+ if (IS_ENABLED(CONFIG_PHYLIB_LEDS)) + phy_leds_unregister(phydev); + phydev->state = PHY_DOWN;
sfp_bus_del_upstream(phydev->sfp_bus);
Hello:
This patch was applied to netdev/net.git (main) by David S. Miller davem@davemloft.net:
On Sat, 17 Jun 2023 17:55:00 +0200 you wrote:
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
[...]
Here is the summary with links: - [net] net: phy: Manual remove LEDs to ensure correct ordering https://git.kernel.org/netdev/net/c/c938ab4da0eb
You are awesome, thank you!
Hi Andrew,
On 6/17/2023 4:55 PM, Andrew Lunn wrote:
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
Cc: stable@vger.kernel.org Reported-by: Florian Fainelli f.fainelli@gmail.com Suggested-by: Florian Fainelli f.fainelli@gmail.com Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn andrew@lunn.ch
Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following:
# reboot -f [ 18.162000] bcmgenet 8f00000.ethernet eth0: Link is Down [ 18.305163] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError [ 18.305170] GISB: target abort at 0x8f00e14 [R ], core: cpu_0 [ 18.305180] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 6.4.0-rc5-next-20230607-gc7a93fa22690 #98 [ 18.305187] Hardware name: BCM972180HB_V20 (DT) [ 18.305191] Workqueue: events set_brightness_delayed [ 18.305214] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 18.305220] pc : el1_abort+0x30/0x5c [ 18.305230] lr : el1_abort+0x24/0x5c [ 18.305235] sp : ffffffc082b73a90 [ 18.305236] x29: ffffffc082b73a90 x28: ffffff8002fad780 x27: 0000000000000000 [ 18.305243] x26: 0000000000000000 x25: 0000000000000000 x24: ffffff807dbb340d [ 18.305250] x23: 0000000060000005 x22: ffffffc08066d9ac x21: 0000000096000210 [ 18.305256] x20: ffffffc082b55e14 x19: ffffffc082b73ad0 x18: 0000000000000000 [ 18.305263] x17: 74656e2f74656e72 x16: 656874652e303030 x15: 303066382f626472 [ 18.305269] x14: ffffff8004a84cd8 x13: 6e69622f7273752f x12: 0000000000000000 [ 18.305275] x11: ffffff8002d1c710 x10: 0000000000000870 x9 : ffffffc080667e34 [ 18.305282] x8 : ffffff8003d44a80 x7 : fefefefefefefeff x6 : 000073746e657665 [ 18.305288] x5 : ffffff8003d44a80 x4 : ffffffc082b73ad0 x3 : 0000000000000025 [ 18.305294] x2 : 000000000000001c x1 : 0000000004208060 x0 : 0000000000000000 [ 18.305303] Kernel panic - not syncing: Asynchronous SError Interrupt [ 18.305306] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 6.4.0-rc5-next-20230607-gc7a93fa22690 #98 [ 18.305311] Hardware name: BCM972180HB_V20 (DT) [ 18.305314] Workqueue: events set_brightness_delayed [ 18.305319] Call trace: [ 18.305321] dump_backtrace+0xdc/0x114 [ 18.305329] show_stack+0x1c/0x28 [ 18.305333] dump_stack_lvl+0x44/0x58 [ 18.305339] dump_stack+0x14/0x1c [ 18.305342] panic+0x128/0x2f8 [ 18.305350] nmi_panic+0x50/0x70 [ 18.305356] arm64_serror_panic+0x74/0x80 [ 18.305361] do_serror+0x2c/0x5c [ 18.305366] el1h_64_error_handler+0x30/0x44 [ 18.305372] el1h_64_error+0x64/0x68 [ 18.305378] el1_abort+0x30/0x5c [ 18.305383] el1h_64_sync_handler+0x64/0xc8 [ 18.305389] el1h_64_sync+0x64/0x68 [ 18.305392] readl_relaxed+0x0/0x8 [ 18.305401] __mdiobus_write+0x3c/0x94 [ 18.305409] mdiobus_write+0x4c/0x70 [ 18.305415] phy_write+0x1c/0x24 [ 18.305419] bcm_phy_read_shadow+0x24/0x40 [ 18.305423] bcm_phy_led_brightness_set+0x40/0x94 [ 18.305428] phy_led_set_brightness+0x48/0x68 [ 18.305434] set_brightness_delayed_set_brightness+0x44/0x7c [ 18.305443] set_brightness_delayed+0xc4/0x1a4 [ 18.305447] process_one_work+0x1c0/0x284 [ 18.305455] process_scheduled_works+0x44/0x48 [ 18.305461] worker_thread+0x1e8/0x264 [ 18.305467] kthread+0xcc/0xdc [ 18.305474] ret_from_fork+0x10/0x20 [ 18.311812] Kernel Offset: disabled [ 18.311814] CPU features: 0x00000003,00010000,0000420b [ 18.311818] Memory Limit: none [ 18.566507] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
still not clear to me how the workqueue managed to execute and not finish before we unregistered the PHY device.
Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following:
Hi Florian
Yes, i was not expecting this patch to fix that. But i was getting the NULL pointer dereference you pointed out with another setup, and this change does fix that part of the problem.
still not clear to me how the workqueue managed to execute and not finish before we unregistered the PHY device.
Me neither. I took a look at the MDIO bus driver and could not see anything obvious. I think you are going to have to scatter printk() in the code to get a clear understanding of the order things are done. Maybe it is another devm_ timing issue.
Andrew
On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
Hi Andrew,
On 6/17/2023 4:55 PM, Andrew Lunn wrote:
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
Cc: stable@vger.kernel.org Reported-by: Florian Fainelli f.fainelli@gmail.com Suggested-by: Florian Fainelli f.fainelli@gmail.com Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn andrew@lunn.ch
Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following:
Hi Florian,
Can you try setting trigger_data->led_cdev to NULL after the cancel_delayed_work_sync() in netdev_trig_deactivate() and see what the effect is?
Thanks.
Hi Russell,
On 6/21/2023 6:04 PM, Russell King (Oracle) wrote:
On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
Hi Andrew,
On 6/17/2023 4:55 PM, Andrew Lunn wrote:
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
Cc: stable@vger.kernel.org Reported-by: Florian Fainelli f.fainelli@gmail.com Suggested-by: Florian Fainelli f.fainelli@gmail.com Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn andrew@lunn.ch
Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following:
Hi Florian,
Can you try setting trigger_data->led_cdev to NULL after the cancel_delayed_work_sync() in netdev_trig_deactivate() and see what the effect is?
Thanks for the suggestion, getting an identical trace as before with that change.
On Wed, Jun 21, 2023 at 06:12:40PM +0100, Florian Fainelli wrote:
Hi Russell,
On 6/21/2023 6:04 PM, Russell King (Oracle) wrote:
On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
Hi Andrew,
On 6/17/2023 4:55 PM, Andrew Lunn wrote:
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED.
Manually unregister the LEDs at a safe point in phy_remove.
Cc: stable@vger.kernel.org Reported-by: Florian Fainelli f.fainelli@gmail.com Suggested-by: Florian Fainelli f.fainelli@gmail.com Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn andrew@lunn.ch
Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following:
Hi Florian,
Can you try setting trigger_data->led_cdev to NULL after the cancel_delayed_work_sync() in netdev_trig_deactivate() and see what the effect is?
Thanks for the suggestion, getting an identical trace as before with that change.
Thanks for trying. I was wondering whether the work was being re-queued after the flush_work(), but seemingly not.
linux-stable-mirror@lists.linaro.org