Sending it out to the mailing lists once more because AMD mail servers
tried to convert it to HTML :(
Am 17.12.24 um 01:26 schrieb Matthew Brost:
> On Fri, Nov 22, 2024 at 02:36:59PM +0000, Tvrtko Ursulin wrote:
>> [SNIP]
>>>>>> Do we have system wide workqueues for that? It seems a bit
>>>>>> overkill that amdgpu has to allocate one on his own.
>>>>> I wondered the same but did not find any. Only ones I am aware
>>>>> of are system_wq&co created in workqueue_init_early().
>>>> Gentle ping on this. I don't have any better ideas that creating a
>>>> new wq.
>>> It took me a moment to realize, but I now think this warning message is
>>> a false positive.
>>>
>>> What happens is that the code calls cancel_delayed_work_sync().
>>>
>>> If the work item never run because of lack of memory then it can just be
>>> canceled.
>>>
>>> If the work item is running then we will block for it to finish.
>>>
> Apologies for the late reply. Alex responded to another thread and CC'd
> me, which reminded me to reply here.
>
> The execution of the non-reclaim worker could have led to a few scenarios:
>
> - It might have triggered reclaim through its own memory allocation.
That is unrelated and has nothing todo with WQ_MEM_RECLAIM.
What we should do is to make sure that the lockdep annotation covers all
workers who play a role in fence signaling.
> - It could have been running and then context-switched out, with reclaim
> being triggered elsewhere in the mean time, pausing the execution of
> the non-reclaim worker.
As far as I know non-reclaim workers are not paused because a reclaim
worker is running, that would be really new to me.
What happens is that here (from workqueue.c):
* Workqueue rescuer thread function. There's one rescuer for each *
workqueue which has WQ_MEM_RECLAIM set. * * Regular work processing on a
pool may block trying to create a new * worker which uses GFP_KERNEL
allocation which has slight chance of * developing into deadlock if some
works currently on the same queue * need to be processed to satisfy the
GFP_KERNEL allocation. This is * the problem rescuer solves.
> In either case, during reclaim, if you wait on a DMA fence that depends
> on the DRM scheduler worker,and that worker attempts to flush the above
> non-reclaim worker, it will result in a deadlock.
Well that is only partially correct.
It's true that the worker we wait for can't wait for DMA-fence or do
memory allocations who wait for DMA-fences. But WQ_MEM_RECLAIM is not
related to any DMA fence annotation.
What happens instead is that the kernel always keeps a kernel thread
pre-allocated so that it can guarantee that the worker can start without
allocating memory.
As soon as the worker runs there shouldn't be any difference in the
handling as far as I know.
> The annotation appears correct to me, and I believe Tvrtko's patch is
> indeed accurate. For what it's worth, we encountered several similar
> bugs in Xe that emerged once we added the correct work queue
> annotations.
I think you mean something different. This is the lockdep annotation for
the workers and not WQ_MEM_RECLAIM.
Regards,
Christian.
>>> There is no need to use WQ_MEM_RECLAIM for the workqueue or do I miss
>>> something?
>>>
>>> If I'm not completely mistaken you stumbled over a bug in the warning
>>> code instead :)
>> Hmm your thinking sounds convincing.
>>
>> Adding Tejun if he has time to help brainstorm this.
>>
> Tejun could likely provide insight into whether my above assessment is
> correct.
> Matt
>
>> Question is - does check_flush_dependency() need to skip the !WQ_MEM_RECLAIM
>> flushing WQ_MEM_RECLAIM warning *if* the work is already running *and* it
>> was called from cancel_delayed_work_sync()?
>>
>> Regards,
>>
>> Tvrtko
>>
>>>>>> Apart from that looks good to me.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Tvrtko Ursulin<tvrtko.ursulin(a)igalia.com>
>>>>>>> References: 746ae46c1113 ("drm/sched: Mark scheduler
>>>>>>> work queues with WQ_MEM_RECLAIM")
>>>>>>> Fixes: a6149f039369 ("drm/sched: Convert drm scheduler
>>>>>>> to use a work queue rather than kthread")
>>>>>>> Cc:stable@vger.kernel.org
>>>>>>> Cc: Matthew Brost<matthew.brost(a)intel.com>
>>>>>>> Cc: Danilo Krummrich<dakr(a)kernel.org>
>>>>>>> Cc: Philipp Stanner<pstanner(a)redhat.com>
>>>>>>> Cc: Alex Deucher<alexander.deucher(a)amd.com>
>>>>>>> Cc: Christian König<christian.koenig(a)amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 25
>>>>>>> +++++++++++++++++++++++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 5 +++--
>>>>>>> 3 files changed, 30 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index 7645e498faa4..a6aad687537e 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -268,6 +268,8 @@ extern int amdgpu_agp;
>>>>>>> extern int amdgpu_wbrf;
>>>>>>> +extern struct workqueue_struct *amdgpu_reclaim_wq;
>>>>>>> +
>>>>>>> #define AMDGPU_VM_MAX_NUM_CTX 4096
>>>>>>> #define AMDGPU_SG_THRESHOLD (256*1024*1024)
>>>>>>> #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> index 38686203bea6..f5b7172e8042 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> @@ -255,6 +255,8 @@ struct amdgpu_watchdog_timer
>>>>>>> amdgpu_watchdog_timer = {
>>>>>>> .period = 0x0, /* default to 0x0 (timeout disable) */
>>>>>>> };
>>>>>>> +struct workqueue_struct *amdgpu_reclaim_wq;
>>>>>>> +
>>>>>>> /**
>>>>>>> * DOC: vramlimit (int)
>>>>>>> * Restrict the total amount of VRAM in MiB for
>>>>>>> testing. The default is 0 (Use full VRAM).
>>>>>>> @@ -2971,6 +2973,21 @@ static struct pci_driver
>>>>>>> amdgpu_kms_pci_driver = {
>>>>>>> .dev_groups = amdgpu_sysfs_groups,
>>>>>>> };
>>>>>>> +static int amdgpu_wq_init(void)
>>>>>>> +{
>>>>>>> + amdgpu_reclaim_wq =
>>>>>>> + alloc_workqueue("amdgpu-reclaim", WQ_MEM_RECLAIM, 0);
>>>>>>> + if (!amdgpu_reclaim_wq)
>>>>>>> + return -ENOMEM;
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void amdgpu_wq_fini(void)
>>>>>>> +{
>>>>>>> + destroy_workqueue(amdgpu_reclaim_wq);
>>>>>>> +}
>>>>>>> +
>>>>>>> static int __init amdgpu_init(void)
>>>>>>> {
>>>>>>> int r;
>>>>>>> @@ -2978,6 +2995,10 @@ static int __init amdgpu_init(void)
>>>>>>> if (drm_firmware_drivers_only())
>>>>>>> return -EINVAL;
>>>>>>> + r = amdgpu_wq_init();
>>>>>>> + if (r)
>>>>>>> + goto error_wq;
>>>>>>> +
>>>>>>> r = amdgpu_sync_init();
>>>>>>> if (r)
>>>>>>> goto error_sync;
>>>>>>> @@ -3006,6 +3027,9 @@ static int __init amdgpu_init(void)
>>>>>>> amdgpu_sync_fini();
>>>>>>> error_sync:
>>>>>>> + amdgpu_wq_fini();
>>>>>>> +
>>>>>>> +error_wq:
>>>>>>> return r;
>>>>>>> }
>>>>>>> @@ -3017,6 +3041,7 @@ static void __exit amdgpu_exit(void)
>>>>>>> amdgpu_acpi_release();
>>>>>>> amdgpu_sync_fini();
>>>>>>> amdgpu_fence_slab_fini();
>>>>>>> + amdgpu_wq_fini();
>>>>>>> mmu_notifier_synchronize();
>>>>>>> amdgpu_xcp_drv_release();
>>>>>>> }
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> index 2f3f09dfb1fd..f8fd71d9382f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> @@ -790,8 +790,9 @@ void amdgpu_gfx_off_ctrl(struct
>>>>>>> amdgpu_device *adev, bool enable)
>>>>>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>> adev->gfx.gfx_off_state = true;
>>>>>>> } else {
>>>>>>> - schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
>>>>>>> - delay);
>>>>>>> + queue_delayed_work(amdgpu_reclaim_wq,
>>>>>>> + &adev->gfx.gfx_off_delay_work,
>>>>>>> + delay);
>>>>>>> }
>>>>>>> }
>>>>>>> } else {
On certain i.MX8 series parts [1], the PPS channel 0
is routed internally to eDMA, and the external PPS
pin is available on channel 1. In addition, on
certain boards, the PPS may be wired on the PCB to
an EVENTOUTn pin other than 0. On these systems
it is necessary that the PPS channel be able
to be configured from the Device Tree.
[1] https://lore.kernel.org/all/ZrPYOWA3FESx197L@lizhi-Precision-Tower-5810/
Francesco Dolcini (3):
dt-bindings: net: fec: add pps channel property
net: fec: refactor PPS channel configuration
net: fec: make PPS channel configurable
Documentation/devicetree/bindings/net/fsl,fec.yaml | 7 +++++++
drivers/net/ethernet/freescale/fec_ptp.c | 11 ++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
--
2.34.1
On Sun, Dec 15, 2024 at 11:54:50AM -0500, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> module: Convert default symbol namespace to string literal
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> module-convert-default-symbol-namespace-to-string-li.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
IIUC if you take this one, you would want to take more that are fixing
documentation generation and other noticed regressions.
--
With Best Regards,
Andy Shevchenko
On Sun, Dec 15, 2024 at 11:54:56AM -0500, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> gpio-idio-16-actually-make-use-of-the-gpio_idio_16-s.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 8845b746c447c715080e448d62aeed25f73fb205
> Author: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> Date: Tue Dec 3 18:26:30 2024 +0100
>
> gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
>
> [ Upstream commit 9ac4b58fcef0f9fc03fa6e126a5f53c1c71ada8a ]
>
> DEFAULT_SYMBOL_NAMESPACE must already be defined when <linux/export.h>
> is included. So move the define above the include block.
>
> Fixes: b9b1fc1ae119 ("gpio: idio-16: Introduce the ACCES IDIO-16 GPIO library module")
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> Acked-by: William Breathitt Gray <wbg(a)kernel.org>
> Link: https://lore.kernel.org/r/20241203172631.1647792-2-u.kleine-koenig@baylibre…
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Hmm, I don't think the advantages here are very relevant. The only
problem fixed here is that the symbols provided by the driver are not in
the expected namespace. So this is nothing a user would wail about as
everything works as intended. The big upside of dropping this patch is
that you can (I think) also drop the backport of commit ceb8bf2ceaa7
("module: Convert default symbol namespace to string literal").
Even if you want to fix the namespace issue in the gpio-idio-16 driver,
I'd suggest to just move the definition of DEFAULT_SYMBOL_NAMESPACE
without the quotes for stable as I think this is easy enough to not
justify taking the intrusive ceb8bf2ceaa7. Also ceb8bf2ceaa7 might be an
annoyance for out-of-tree code. I know we don't care much about these,
still I think not adding to their burden is another small argument for
not taking ceb8bf2ceaa7.
Best regards
Uwe
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function to decrement reference count for cleanup.
Or it could cause memory leak.
As comment of device_add() says, if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: ed542bed126c ("[SCSI] raid class: handle component-add errors")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/scsi/raid_class.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/raid_class.c b/drivers/scsi/raid_class.c
index 898a0bdf8df6..2cb2949a78c6 100644
--- a/drivers/scsi/raid_class.c
+++ b/drivers/scsi/raid_class.c
@@ -251,6 +251,7 @@ int raid_component_add(struct raid_template *r,struct device *raid_dev,
list_del(&rc->node);
rd->component_count--;
put_device(component_dev);
+ put_device(&rc->dev);
kfree(rc);
return err;
}
--
2.25.1
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function to decrement reference count for cleanup.
Or it could cause memory leak.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 53d2a715c240 ("phy: Add Tegra XUSB pad controller support")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/phy/tegra/xusb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/phy/tegra/xusb.c b/drivers/phy/tegra/xusb.c
index 79d4814d758d..c89df95aa6ca 100644
--- a/drivers/phy/tegra/xusb.c
+++ b/drivers/phy/tegra/xusb.c
@@ -548,16 +548,16 @@ static int tegra_xusb_port_init(struct tegra_xusb_port *port,
err = dev_set_name(&port->dev, "%s-%u", name, index);
if (err < 0)
- goto unregister;
+ goto put_device;
err = device_add(&port->dev);
if (err < 0)
- goto unregister;
+ goto put_device;
return 0;
-unregister:
- device_unregister(&port->dev);
+put_device:
+ put_device(&port->dev);
return err;
}
--
2.25.1
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: f2b44cde7e16 ("virtio: split device_register into device_initialize and device_add")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the fixes tag according to suggestions;
- modified the bug description.
---
drivers/virtio/virtio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index b9095751e43b..ac721b5597e8 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -503,6 +503,7 @@ int register_virtio_device(struct virtio_device *dev)
out_of_node_put:
of_node_put(dev->dev.of_node);
+ put_device(&dev->dev);
out_ida_remove:
ida_free(&virtio_index_ida, dev->index);
out:
--
2.25.1