On Mon, Mar 4, 2024 at 8:49 AM Herve Codina <herve.codina@bootlin.com> wrote:
Hi Rob,
On Mon, 4 Mar 2024 09:22:02 -0600 Rob Herring <robh@kernel.org> wrote:
...
@@ -853,6 +854,14 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs)
 {
 	int i;

+	/*
+	 * Wait for any ongoing device link removals before removing some of
+	 * nodes. Drop the global lock while waiting
+	 */
+	mutex_unlock(&of_mutex);
+	device_link_wait_removal();
+	mutex_lock(&of_mutex);
I'm still not convinced we need to drop the lock. What happens if someone else grabs the lock while we are in device_link_wait_removal()? Can we guarantee that we can't screw things badly?
It is also just ugly because it's the callers of free_overlay_changeset() that hold the lock and now we're releasing it behind their back.
As device_link_wait_removal() is called before we touch anything, can't it be called before we take the lock? And do we need to call it if applying the overlay fails?
Rob,
This[1] scenario Luca reported seems like a reason for the device_link_wait_removal() to be where Herve put it. That example seems reasonable.
[1] - https://lore.kernel.org/all/20231220181627.341e8789@booty/
Indeed, calling device_link_wait_removal() is not needed when applying the overlay fails.
I can call device_link_wait_removal() from the caller, of_overlay_remove(), but not before the lock is taken. We need to call it between __of_changeset_revert_notify() and free_overlay_changeset(), so the lock is already held at that point.
This leads to the following sequence:
--- 8< ---
int of_overlay_remove(int *ovcs_id)
{
	...
	mutex_lock(&of_mutex);
	...
	ret = __of_changeset_revert_notify(&ovcs->cset);
	...
	ret_tmp = overlay_notify(ovcs, OF_OVERLAY_POST_REMOVE);
	...
	mutex_unlock(&of_mutex);
	device_link_wait_removal();
	mutex_lock(&of_mutex);
	free_overlay_changeset(ovcs);
	...
	mutex_unlock(&of_mutex);
	...
}
--- 8< ---
In this sequence, the question is: do we need to release the mutex while device_link_wait_removal() is called?
In general I hate these kinds of sequences that release a lock and then grab it again quickly. It's not always a bug, but my personal take on that is 90% of these introduce a bug.
Drop the unlock/lock and we'll deal with a deadlock if we actually hit one. I'm also fairly certain that device_link_wait_removal() can't trigger something else that can cause an OF overlay change while we are in the middle of one. And like Rob said, I'm not sure this unlock/lock is a good solution for that anyway.
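To make the unlock/lock concern concrete, here is a minimal userspace sketch of the hazard. It is not kernel code: it assumes a pthread mutex as a stand-in for of_mutex and pthread_join() as a stand-in for the wait, and every demo_* name is hypothetical. It only shows that state protected by the lock can change between the unlock and the re-lock, so anything observed before the unlock would have to be revalidated afterwards.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for of_mutex. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
/* Hypothetical shared state, protected by demo_mutex. */
static int demo_generation;

/* Stand-in for "someone else" who grabs the lock while we wait. */
static void *demo_intruder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&demo_mutex);
	demo_generation++;
	pthread_mutex_unlock(&demo_mutex);
	return NULL;
}

/*
 * Mimics the unlock / wait / lock sequence under discussion.
 * Returns true if the protected state changed while the lock was dropped.
 */
bool demo_remove_dropping_lock(void)
{
	pthread_t intruder;
	bool changed;
	int seen;

	pthread_mutex_lock(&demo_mutex);
	seen = demo_generation;

	/* The intruder starts now and blocks on demo_mutex until we unlock. */
	if (pthread_create(&intruder, NULL, demo_intruder, NULL) != 0) {
		pthread_mutex_unlock(&demo_mutex);
		return false;
	}

	pthread_mutex_unlock(&demo_mutex);
	pthread_join(intruder, NULL);	/* stand-in for the wait */
	pthread_mutex_lock(&demo_mutex);

	/* Anything read before the unlock may be stale at this point. */
	changed = (demo_generation != seen);
	pthread_mutex_unlock(&demo_mutex);

	return changed;
}
```

In this sketch the wait can only finish once the lock is dropped, which is the deadlock worry; if the awaited work never needs the lock, holding it across the wait avoids the window entirely.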
Please CC me on the next series. And I'm glad folks convinced you to use flush_workqueue(). As I said in the older series, I think drain_workqueue() will actually break device links.
-Saravana