Re: [PATCH v3 2/2] of: overlay: Synchronize of_overlay_remove() with the devlink removals

5 Mar 2024

On Mon, 2024-03-04 at 22:47 -0800, Saravana Kannan wrote:
...
On Mon, Mar 4, 2024 at 8:49 AM Herve Codina <herve.codina@bootlin.com> wrote:
...
Hi Rob,
On Mon, 4 Mar 2024 09:22:02 -0600
Rob Herring <robh@kernel.org> wrote:
...
...
...
...
@@ -853,6 +854,14 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs)
 {
 	int i;
 
+	/*
+	 * Wait for any ongoing device link removals before removing some of
+	 * nodes. Drop the global lock while waiting
+	 */
+	mutex_unlock(&of_mutex);
+	device_link_wait_removal();
+	mutex_lock(&of_mutex);

I'm still not convinced we need to drop the lock. What happens if
someone else grabs the lock while we are in device_link_wait_removal()?
Can we guarantee that we can't screw things badly?
It is also just ugly because it's the callers of
free_overlay_changeset() that hold the lock and now we're releasing it
behind their back.
As device_link_wait_removal() is called before we touch anything, can't
it be called before we take the lock? And do we need to call it if
applying the overlay fails?
Rob,
This[1] scenario Luca reported seems like a reason for the
device_link_wait_removal() to be where Herve put it. That example
seems reasonable.
[1] - https://lore.kernel.org/all/20231220181627.341e8789@booty/
I'm still not totally convinced about that. Why not put the flush right
before the kref check in __of_changeset_entry_destroy()? I'll contradict
myself a bit because this is just theory, but if we look at pci_stop_dev(),
which, AFAIU, could be reached from a sysfs write(), we have:
device_release_driver(&dev->dev);
...
of_pci_remove_node(dev);
    of_changeset_revert(np->data);
    of_changeset_destroy(np->data);
So looking at the above, we would hit the same issue if the flush lives only
in free_overlay_changeset(): this path never goes through it, so the queue
won't be flushed at all, and we could have devlink removals pending due to
device_release_driver(). Right?
Again, completely theoretical, but it seems like a reasonable scenario, and I
don't understand the push against having the flush in
__of_changeset_entry_destroy(). Conceptually, it looks like the best place to
me, but I may be missing some issue in doing it there?
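To make the ordering argument concrete, here is a minimal user-space sketch, not kernel code. It assumes a toy model in which a queued devlink removal is just a deferred reference drop on a node. All names (node_model, defer_put(), flush_pending(), entry_destroy_check()) are hypothetical stand-ins: flush_pending() plays the role of device_link_wait_removal(), and entry_destroy_check() plays the role of the kref check in __of_changeset_entry_destroy().

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for a device node with a reference count. */
struct node_model {
	int refcount;
};

/* Deferred reference drops, modelling queued devlink removals. */
static struct node_model *pending[MAX_PENDING];
static size_t n_pending;

/* Queue a reference drop instead of performing it immediately. */
static bool defer_put(struct node_model *np)
{
	if (n_pending == MAX_PENDING)
		return false;
	pending[n_pending++] = np;
	return true;
}

/* Perform every queued reference drop (the "flush"). */
static void flush_pending(void)
{
	for (size_t i = 0; i < n_pending; i++)
		pending[i]->refcount--;
	n_pending = 0;
}

/*
 * Model of the refcount check done when destroying a changeset entry.
 * Returns true if the node would be reported as leaked (refcount > 1).
 */
static bool entry_destroy_check(const struct node_model *np, bool flush_first)
{
	if (flush_first)
		flush_pending();
	return np->refcount > 1;
}
```

In this model, a node with one reference still parked in the queue is reported as leaked when checked without flushing, and passes once the flush runs immediately before the check, whichever caller reached the destroy path.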
...
...
...
Indeed, calling device_link_wait_removal() is not needed when applying the
overlay fails.
I can call device_link_wait_removal() from the caller, of_overlay_remove(),
but not before the lock is taken.
We need to call it between __of_changeset_revert_notify() and
free_overlay_changeset(), and at that point the lock is already taken.
This leads to the following sequence:

--- 8< ---
int of_overlay_remove(int *ovcs_id)
{
        ...
        mutex_lock(&of_mutex);
        ...
        ret = __of_changeset_revert_notify(&ovcs->cset);
        ...
        ret_tmp = overlay_notify(ovcs, OF_OVERLAY_POST_REMOVE);
        ...
        mutex_unlock(&of_mutex);
        device_link_wait_removal();
        mutex_lock(&of_mutex);
        free_overlay_changeset(ovcs);
        ...
        mutex_unlock(&of_mutex);
        ...
}
--- 8< ---
In this sequence, the question is:
Do we need to release the mutex lock while device_link_wait_removal() is
called?
In general I hate these kinds of sequences that release a lock and
then grab it again quickly. It's not always a bug, but my personal
take is that 90% of these introduce a bug.
Drop the unlock/lock and we'll deal with a deadlock if we actually hit one.
I'm also fairly certain that device_link_wait_removal() can't trigger
something else that can cause an OF overlay change while we are in the
middle of one. And like Rob said, I'm not sure this unlock/lock is a
good solution for that anyway.
Totally agree. Unless we really see a deadlock, this is a very bad idea (IMHO).
Even in the PCI code, it seems to me that we're never destroying a changeset
from a device/kobj_type release callback. That would be super weird, right?
- Nuno Sá
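
For illustration only, the hazard behind the unlock/lock concern can be sketched in user space with pthreads. This is a toy model under stated assumptions, not the kernel code: of_mutex_model, tree_generation, intruder(), and wait_removal_model() are hypothetical names, and the "intruder" thread stands in for any other path that could take the lock during the window. Whether such a path can really run while device_link_wait_removal() is waiting is exactly the open question in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for of_mutex and the state it protects. */
static pthread_mutex_t of_mutex_model = PTHREAD_MUTEX_INITIALIZER;
static int tree_generation;

/* Another actor that modifies the protected state under the lock. */
static void *intruder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&of_mutex_model);
	tree_generation++;
	pthread_mutex_unlock(&of_mutex_model);
	return NULL;
}

/*
 * Stand-in for the wait: while we wait, another thread runs to
 * completion. It only succeeds in taking the lock because the
 * caller has dropped it.
 */
static void wait_removal_model(void)
{
	pthread_t t;
	int rc = pthread_create(&t, NULL, intruder, NULL);

	assert(rc == 0);
	(void)rc;
	pthread_join(t, NULL);
}

/*
 * Mirrors the unlock / wait / lock sequence. Returns true if the
 * state observed before the window differs from the state afterwards.
 */
static bool remove_with_lock_drop(void)
{
	bool changed;
	int seen;

	pthread_mutex_lock(&of_mutex_model);
	seen = tree_generation;
	pthread_mutex_unlock(&of_mutex_model);

	wait_removal_model();

	pthread_mutex_lock(&of_mutex_model);
	changed = (seen != tree_generation);
	pthread_mutex_unlock(&of_mutex_model);

	return changed;
}
```

In this model remove_with_lock_drop() reports that the state changed inside the window. If the caller kept holding the mutex across the wait instead, the intruder would block and the join would never return, which corresponds to the deadlock the lock drop was meant to avoid; the model says nothing about whether either case is reachable in the real code.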
...

    
