Hi,
On Mon, Jun 22, 2020 at 10:22:48AM +0200, Krzysztof Kozlowski wrote:
On Fri, Jun 19, 2020 at 07:55:21PM +0200, Sebastian Reichel wrote:
On Wed, May 27, 2020 at 09:42:54AM +0200, Pali Rohár wrote:
On Tuesday 26 May 2020 21:16:28 Andrew F. Davis wrote:
On 5/25/20 7:32 AM, Krzysztof Kozlowski wrote:
This reverts commit 8cfaaa811894a3ae2d7360a15a6cfccff3ebc7db.
If device was unbound and bound, the polling interval would be set to 0. This is both unexpected and messes up with other bq27xxx devices (if more than one battery device is used).
This reset of polling interval was added in commit 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver") stating that power_supply_unregister() calls get_property(). However in Linux kernel v3.1 and newer, such call trace does not exist. Unregistering power supply does not call get_property() on unregistered power supply.
Fixes: 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver") Cc: stable@vger.kernel.org Signed-off-by: Krzysztof Kozlowski krzk@kernel.org
I really could not identify the issue being fixed in offending commit 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver"), therefore maybe I missed here something important.
Please share your thoughts on this.
I'm having a hard time finding the OOPS also. Maybe there is a window where the poll function is running or about to run where cancel_delayed_work_sync() is called and cancels the work, only to have an interrupt or late get_property call in to the poll function and re-schedule it.
What we really need is to do is look at how we are handling the polling function. It gets called from the workqueue, from a threaded interrupt context, and from a power supply framework callback, possibly all at the same time. Sometimes its protected by a lock, sometimes not. Updating the device's cached data should always be locked.
What's more is the poll function is self-arming, so if we call cancel_delayed_work_sync() (remove it from the work queue then then wait for it to finish if running), are we sure it wont have just re-arm itself?
We should make the only way we call the poll function be through the work queue, (plus make sure all accesses to the cache are locked).
Andrew
I do not remember details too. It is long time ago.
CCing Ivaylo Dimitrov as he may remember something...
Applying this revert introduces at least a race condition when userspace reads sysfs files while kernel removes the driver.
So looking at the entrypoints for schedules:
bq27xxx_battery_i2c_probe: Not relevant, probe is done when the battery is being removed.
poll_interval_param_set: Can be avoided by unregistering from the list earlier. This is the right thing to do considering the battery is added to the list as last step in the probe routine, it should be removed first during teardown.
Yes, good point.
bq27xxx_external_power_changed: This can happen at any time while the power-supply device is registered, because of the code in get_property.
bq27xxx_battery_poll: This can happen at any time while the power-supply device is registered.
As far as I can see the only thing in the delayed work needing the power-supply device is power_supply_changed(). If we add a check, that di->bat is not NULL, we should be able to reorder teardown like this:
Except power_supply structure there is the device state struct bq27xxx_device_info 'di'. If bq27xxx_battery_poll() is called during the unbind, it will access the 'di' which is being freed by devm-framework. And just checking for di->bat is also not thread safe (can be reordered).
I think there is no easy few-line fix for this. Instead, the workqueue scheduling should be guarded everywhere by device-instance mutex (bq27xxx_device_info.lock).
Freeing of bq27xxx_device_info 'di' is not a problem, since the managed resource is attached to the i2c device, not the power-supply device. The I2C device is still available after step A2 from the list below, only the power-supply device will be unregistered/free'd. As a result after step A2 external power change can no longer happen and get_property can no longer be called, so no new work can be scheduled (apart from requeuing). After step A3 work incl. requeuing is stopped (cancel_delayed_work_sync handles work requeing itself).
The alternative would be to introduce a flag in di, that the device is about to teardown and avoid rescheduling work from external_power_changed + get_property when that is set. Then it would be possible to teardown like this:
B1. get mutex, set teardown flag, release mutex B2. remove from list B3. cancel delayed work B4. unregister power-supply device B5. destroy mutex
A1. remove from list A2. unregister power-supply device and set to di->bat to NULL A3. cancel delayed work A4. destroy mutex
Also I agree with Andrew, that the locking looks fishy. I think the lock needs to be moved, so that the call to bq27xx_battery_update(di) in bq27xxx_battery_poll is protected.
Exactly.
Best regards, Krzysztof
-- Sebastian