Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
Hope this helps!
Best wishes, Guillaume
-------- Forwarded Message -------- Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC) From: kernelci.org bot bot@kernelci.org To: guillaume.tucker@collabora.com
Bisection result for mainline/master (v4.15-rc3) on peach-pi
Good known revision:
c6b3e96 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Bad known revision:
50c4c4e Linux 4.15-rc3
Extra parameters:
Tree: mainline Branch: master Target: peach-pi Lab: lab-collabora Defconfig: exynos_defconfig Plan: boot
Breaking commit found:
------------------------------------------------------------------------------- commit a703c55004e1c5076d57e43771b3e11117796ea0 Author: Daniel Vetter daniel.vetter@ffwll.ch Date: Mon Dec 4 21:48:18 2017 +0100
drm: safely free connectors from connector_iter
In
commit 613051dac40da1751ab269572766d3348d45a197 Author: Daniel Vetter daniel.vetter@ffwll.ch Date: Wed Dec 14 00:08:06 2016 +0100
drm: locking&new iterators for connector_list
we've went to extreme lengths to make sure connector iterations works in any context, without introducing any additional locking context. This worked, except for a small fumble in the implementation:
When we actually race with a concurrent connector unplug event, and our temporary connector reference turns out to be the final one, then everything breaks: We call the connector release function from whatever context we happen to be in, which can be an irq/atomic context. And connector freeing grabs all kinds of locks and stuff.
Fix this by creating a specially safe put function for connetor_iter, which (in this rare case) punts the cleanup to a worker.
Reported-by: Ben Widawsky ben@bwidawsk.net Cc: Ben Widawsky ben@bwidawsk.net Fixes: 613051dac40d ("drm: locking&new iterators for connector_list") Cc: Dave Airlie airlied@gmail.com Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Sean Paul seanpaul@chromium.org Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: Dave Airlie airlied@gmail.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel....
diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index 25f4b2e..4820141 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) connector->funcs->destroy(connector); }
+static void drm_connector_free_work_fn(struct work_struct *work) +{ + struct drm_connector *connector = + container_of(work, struct drm_connector, free_work); + struct drm_device *dev = connector->dev; + + drm_mode_object_unregister(dev, &connector->base); + connector->funcs->destroy(connector); +} + /** * drm_connector_init - Init a preallocated connector * @dev: DRM device @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, if (ret) return ret;
+ INIT_WORK(&connector->free_work, drm_connector_free_work_fn); + connector->base.properties = &connector->properties; connector->dev = dev; connector->funcs = funcs; @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device *dev, } EXPORT_SYMBOL(drm_connector_list_iter_begin);
+/* + * Extra-safe connector put function that works in any context. Should only be + * used from the connector_iter functions, where we never really expect to + * actually release the connector when dropping our final reference. + */ +static void +drm_connector_put_safe(struct drm_connector *conn) +{ + if (refcount_dec_and_test(&conn->base.refcount.refcount)) + schedule_work(&conn->free_work); +} + /** * drm_connector_list_iter_next - return next connector * @iter: connectr_list iterator @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct drm_connector_list_iter *iter) spin_unlock_irqrestore(&config->connector_list_lock, flags);
if (old_conn) - drm_connector_put(old_conn); + drm_connector_put_safe(old_conn);
return iter->conn; } @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter) { iter->dev = NULL; if (iter->conn) - drm_connector_put(iter->conn); + drm_connector_put_safe(iter->conn); lock_release(&connector_list_iter_dep_map, 0, _RET_IP_); } EXPORT_SYMBOL(drm_connector_list_iter_end); diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c index cda8bfa..cc78b3d 100644 --- a/drivers/gpu/drm/drm_mode_config.c +++ b/drivers/gpu/drm/drm_mode_config.c @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) drm_connector_put(connector); } drm_connector_list_iter_end(&conn_iter); + /* connector_iter drops references in a work item. */ + flush_scheduled_work(); if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { drm_connector_list_iter_begin(dev, &conn_iter); drm_for_each_connector_iter(connector, &conn_iter) diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index df9807a..a4649c5 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -916,6 +916,14 @@ struct drm_connector { uint8_t num_h_tile, num_v_tile; uint8_t tile_h_loc, tile_v_loc; uint16_t tile_h_size, tile_v_size; + + /** + * @free_work: + * + * Work used only by &drm_connector_iter to be able to clean up a + * connector from any context. + */ + struct work_struct free_work; };
#define obj_to_connector(x) container_of(x, struct drm_connector, base) -------------------------------------------------------------------------------
Git bisection log:
------------------------------------------------------------------------------- git bisect start # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94 # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3 git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36 # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12 # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992 # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80 # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the shrink request to prevent skip huge pool git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6 # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes git bisect good db8f884ca7fe6af64d443d1510464efe23826131 # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828 # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0 # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter -------------------------------------------------------------------------------
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Hope this helps!
Best wishes, Guillaume
-------- Forwarded Message -------- Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC) From: kernelci.org bot bot@kernelci.org To: guillaume.tucker@collabora.com
Bisection result for mainline/master (v4.15-rc3) on peach-pi
Good known revision:
c6b3e96 Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Bad known revision:
50c4c4e Linux 4.15-rc3
Extra parameters:
Tree: mainline Branch: master Target: peach-pi Lab: lab-collabora Defconfig: exynos_defconfig Plan: boot
Breaking commit found:
commit a703c55004e1c5076d57e43771b3e11117796ea0 Author: Daniel Vetter daniel.vetter@ffwll.ch Date: Mon Dec 4 21:48:18 2017 +0100
drm: safely free connectors from connector_iter In commit 613051dac40da1751ab269572766d3348d45a197 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Dec 14 00:08:06 2016 +0100 drm: locking&new iterators for connector_list we've went to extreme lengths to make sure connector iterations
works in any context, without introducing any additional locking context. This worked, except for a small fumble in the implementation: When we actually race with a concurrent connector unplug event, and our temporary connector reference turns out to be the final one, then everything breaks: We call the connector release function from whatever context we happen to be in, which can be an irq/atomic context. And connector freeing grabs all kinds of locks and stuff. Fix this by creating a specially safe put function for connetor_iter, which (in this rare case) punts the cleanup to a worker. Reported-by: Ben Widawsky ben@bwidawsk.net Cc: Ben Widawsky ben@bwidawsk.net Fixes: 613051dac40d ("drm: locking&new iterators for connector_list") Cc: Dave Airlie airlied@gmail.com Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Sean Paul seanpaul@chromium.org Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: Dave Airlie airlied@gmail.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel....
diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index 25f4b2e..4820141 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) connector->funcs->destroy(connector); } +static void drm_connector_free_work_fn(struct work_struct *work) +{
struct drm_connector *connector =
container_of(work, struct drm_connector, free_work);
struct drm_device *dev = connector->dev;
drm_mode_object_unregister(dev, &connector->base);
connector->funcs->destroy(connector);
+}
/**
- drm_connector_init - Init a preallocated connector
- @dev: DRM device
@@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, if (ret) return ret;
INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
connector->base.properties = &connector->properties; connector->dev = dev; connector->funcs = funcs;
@@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device *dev, } EXPORT_SYMBOL(drm_connector_list_iter_begin); +/*
- Extra-safe connector put function that works in any context. Should only
be
- used from the connector_iter functions, where we never really expect to
- actually release the connector when dropping our final reference.
- */
+static void +drm_connector_put_safe(struct drm_connector *conn) +{
if (refcount_dec_and_test(&conn->base.refcount.refcount))
schedule_work(&conn->free_work);
+}
/**
- drm_connector_list_iter_next - return next connector
- @iter: connectr_list iterator
@@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct drm_connector_list_iter *iter) spin_unlock_irqrestore(&config->connector_list_lock, flags); if (old_conn)
drm_connector_put(old_conn);
drm_connector_put_safe(old_conn); return iter->conn;
} @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter) { iter->dev = NULL; if (iter->conn)
drm_connector_put(iter->conn);
drm_connector_put_safe(iter->conn); lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
} EXPORT_SYMBOL(drm_connector_list_iter_end); diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c index cda8bfa..cc78b3d 100644 --- a/drivers/gpu/drm/drm_mode_config.c +++ b/drivers/gpu/drm/drm_mode_config.c @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) drm_connector_put(connector); } drm_connector_list_iter_end(&conn_iter);
/* connector_iter drops references in a work item. */
flush_scheduled_work(); if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { drm_connector_list_iter_begin(dev, &conn_iter); drm_for_each_connector_iter(connector, &conn_iter)
diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index df9807a..a4649c5 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -916,6 +916,14 @@ struct drm_connector { uint8_t num_h_tile, num_v_tile; uint8_t tile_h_loc, tile_v_loc; uint16_t tile_h_size, tile_v_size;
/**
* @free_work:
*
* Work used only by &drm_connector_iter to be able to clean up a
* connector from any context.
*/
struct work_struct free_work;
};
#define obj_to_connector(x) container_of(x, struct drm_connector, base)
Git bisection log:
git bisect start # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94 # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3 git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36 # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12 # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992 # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80 # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the shrink request to prevent skip huge pool git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6 # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes git bisect good db8f884ca7fe6af64d443d1510464efe23826131 # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828 # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0 # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
[adding Marek and Shuah to cc list]
On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
[0]: https://patchwork.kernel.org/patch/10067711/ [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html
Best regards, Javier
Hope this helps!
Best wishes, Guillaume
-------- Forwarded Message -------- Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC) From: kernelci.org bot bot@kernelci.org To: guillaume.tucker@collabora.com
Bisection result for mainline/master (v4.15-rc3) on peach-pi
Good known revision:
c6b3e96 Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Bad known revision:
50c4c4e Linux 4.15-rc3
Extra parameters:
Tree: mainline Branch: master Target: peach-pi Lab: lab-collabora Defconfig: exynos_defconfig Plan: boot
Breaking commit found:
commit a703c55004e1c5076d57e43771b3e11117796ea0 Author: Daniel Vetter daniel.vetter@ffwll.ch Date: Mon Dec 4 21:48:18 2017 +0100
drm: safely free connectors from connector_iter In commit 613051dac40da1751ab269572766d3348d45a197 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Dec 14 00:08:06 2016 +0100 drm: locking&new iterators for connector_list we've went to extreme lengths to make sure connector iterations
works in any context, without introducing any additional locking context. This worked, except for a small fumble in the implementation: When we actually race with a concurrent connector unplug event, and our temporary connector reference turns out to be the final one, then everything breaks: We call the connector release function from whatever context we happen to be in, which can be an irq/atomic context. And connector freeing grabs all kinds of locks and stuff. Fix this by creating a specially safe put function for connetor_iter, which (in this rare case) punts the cleanup to a worker. Reported-by: Ben Widawsky ben@bwidawsk.net Cc: Ben Widawsky ben@bwidawsk.net Fixes: 613051dac40d ("drm: locking&new iterators for connector_list") Cc: Dave Airlie airlied@gmail.com Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Sean Paul seanpaul@chromium.org Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: Dave Airlie airlied@gmail.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel....
diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index 25f4b2e..4820141 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) connector->funcs->destroy(connector); } +static void drm_connector_free_work_fn(struct work_struct *work) +{
struct drm_connector *connector =
container_of(work, struct drm_connector, free_work);
struct drm_device *dev = connector->dev;
drm_mode_object_unregister(dev, &connector->base);
connector->funcs->destroy(connector);
+}
/**
- drm_connector_init - Init a preallocated connector
- @dev: DRM device
@@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, if (ret) return ret;
INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
connector->base.properties = &connector->properties; connector->dev = dev; connector->funcs = funcs;
@@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device *dev, } EXPORT_SYMBOL(drm_connector_list_iter_begin); +/*
- Extra-safe connector put function that works in any context. Should only
be
- used from the connector_iter functions, where we never really expect to
- actually release the connector when dropping our final reference.
- */
+static void +drm_connector_put_safe(struct drm_connector *conn) +{
if (refcount_dec_and_test(&conn->base.refcount.refcount))
schedule_work(&conn->free_work);
+}
/**
- drm_connector_list_iter_next - return next connector
- @iter: connectr_list iterator
@@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct drm_connector_list_iter *iter) spin_unlock_irqrestore(&config->connector_list_lock, flags); if (old_conn)
drm_connector_put(old_conn);
drm_connector_put_safe(old_conn); return iter->conn;
} @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter) { iter->dev = NULL; if (iter->conn)
drm_connector_put(iter->conn);
drm_connector_put_safe(iter->conn); lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
} EXPORT_SYMBOL(drm_connector_list_iter_end); diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c index cda8bfa..cc78b3d 100644 --- a/drivers/gpu/drm/drm_mode_config.c +++ b/drivers/gpu/drm/drm_mode_config.c @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) drm_connector_put(connector); } drm_connector_list_iter_end(&conn_iter);
/* connector_iter drops references in a work item. */
flush_scheduled_work(); if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { drm_connector_list_iter_begin(dev, &conn_iter); drm_for_each_connector_iter(connector, &conn_iter)
diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index df9807a..a4649c5 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -916,6 +916,14 @@ struct drm_connector { uint8_t num_h_tile, num_v_tile; uint8_t tile_h_loc, tile_v_loc; uint16_t tile_h_size, tile_v_size;
/**
* @free_work:
*
* Work used only by &drm_connector_iter to be able to clean up a
* connector from any context.
*/
struct work_struct free_work;
};
#define obj_to_connector(x) container_of(x, struct drm_connector, base)
Git bisection log:
git bisect start # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94 # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3 git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36 # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12 # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992 # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80 # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the shrink request to prevent skip huge pool git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6 # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes git bisect good db8f884ca7fe6af64d443d1510464efe23826131 # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828 # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0 # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
-- Daniel Vetter Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 11, 2017 at 11:28 PM, Javier Martinez Canillas javier@dowhile0.org wrote:
[adding Marek and Shuah to cc list]
[snip]
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Anyway, I don't have access to the machine now, but it would be nice if someone test. Or I would do in a few days.
Best regards, Javier
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc);
return ERR_PTR(-EPERM); }
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc); return ERR_PTR(-EPERM);
}
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
thanks, -- Shuah
Hi Shuah,
On 2017-12-12 00:25, Shuah Khan wrote:
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc); return ERR_PTR(-EPERM);
}
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Best regards
Hello Marek,
On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
Hi Shuah,
On 2017-12-12 00:25, Shuah Khan wrote:
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc); return ERR_PTR(-EPERM);
}
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved?
I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now.
Best regards
Marek Szyprowski, PhD Samsung R&D Institute Poland
Best regards, Javier
Hi Javier,
On 2017-12-12 09:00, Javier Martinez Canillas wrote:
On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
On 2017-12-12 00:25, Shuah Khan wrote:
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes").
I've posted a fix for that:
https://patchwork.kernel.org/patch/10105921/
I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component:
[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1
Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc); return ERR_PTR(-EPERM);
}
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved?
I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now.
Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue.
I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next.
Best regards
On Tue, Dec 12, 2017 at 9:47 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
Hi Javier,
On 2017-12-12 09:00, Javier Martinez Canillas wrote:
On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
(...)
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved?
I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now.
Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue.
I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next.
Then it looks like we could give EXYNOS_IOMMU a try. At least only on exynos_defconfig which would leave multi_v7 as a platform to compare.
Best regards, Krzysztof
On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
Hi Javier,
On 2017-12-12 09:00, Javier Martinez Canillas wrote:
On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
On 2017-12-12 00:25, Shuah Khan wrote:
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote: > So I gave a quick look to this, and at the very least there's a bug in > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: > exynos: Add status property to Exynos 542x Mixer nodes"). > > I've posted a fix for that: > > https://patchwork.kernel.org/patch/10105921/ > > I believe this could be also be the cause for the boot failure, since > I see in the boot log that things start to go wrong after exynos-drm > fails to bind the HDMI component: > > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops > 0xc1398690): -1 Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM.
I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc);
return ERR_PTR(-EPERM); }
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved?
I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now.
Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue.
I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next.
Good to know it doesn't break Exynos Snow. This is why I test without IOMMU and then enable IOMMU on odroid-xu4 for test with IOMMU
thanks, -- Shuah
On 12/12/2017 07:47 AM, Shuah Khan wrote:
On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
Hi Javier,
On 2017-12-12 09:00, Javier Martinez Canillas wrote:
On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
On 2017-12-12 00:25, Shuah Khan wrote:
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: > On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas > wrote: >> So I gave a quick look to this, and at the very least there's a bug in >> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >> exynos: Add status property to Exynos 542x Mixer nodes"). >> >> I've posted a fix for that: >> >> https://patchwork.kernel.org/patch/10105921/ >> >> I believe this could be also be the cause for the boot failure, since >> I see in the boot log that things start to go wrong after exynos-drm >> fails to bind the HDMI component: >> >> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >> 0xc1398690): -1 > Umm, -1 ? Looking that error code up in > include/uapi/asm-generic/errno-base.h says it's -EPERM. > > I suspect that's someone just returning -1 because they're lazy... > which is real bad form and needs fixing. Oh, it really is -EPERM:
struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc;
drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc);
return ERR_PTR(-EPERM); }
Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me.
Can we please avoid abusing errno codes?
I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for.
However, with 4.15-rc3, dmesg is gets filled with
[ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed.
This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/
This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig.
Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved?
I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now.
Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue.
I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next.
Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent a patch to do that a while back. The decision at the time to not pull that patch is was based on systems not booting with it enabled.
Is it time to revisit that or the recommendation is for IOMMU to be enabled in configs manually on systems it is safe to do so?
thanks, -- Shuah
Hello Shuah,
On Tue, Dec 12, 2017 at 7:26 PM, Shuah Khan shuahkh@osg.samsung.com wrote:
[snip]
Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue.
I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next.
Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent a patch to do that a while back. The decision at the time to not pull that patch is was based on systems not booting with it enabled.
Is it time to revisit that or the recommendation is for IOMMU to be enabled in configs manually on systems it is safe to do so?
Yes, I think it would be good to have it enabled by default if that doesn't cause boot issues anymore.
Could you please resend your patch and cc Marek and me so we can test on Snow and Peach Chromebooks?
thanks, -- Shuah
Best regards, Javier
Hi All,
On 2017-12-11 23:28, Javier Martinez Canillas wrote:
[adding Marek and Shuah to cc list]
On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
After a recent discussion on the Javier's patch: https://patchwork.kernel.org/patch/10106417/ I've managed to reproduce this issue also on Exynos5250 based Samsung Snow Chromebook and investigate a bit.
It is caused by a deadlock in the main kernel workqueue. Here are details:
1. Exynos DRM fails to initialize due to missing regulators and gets moved to deferred probe device list
2. Deferred probe is triggered and kernel "events" workqueue calls deferred_probe_work_func()
3. exynos_drm_bind() is called, component_bind_all() fails due to missing Exynos Mixer device
4. error handling path is executed in exynos_drm_bind(), which calls drm_mode_config_cleanup()
5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes deadlock.
Do You have idea how to fix this issue properly?
Taking a look at git blame, this indeed shows that the issue has been introduced by the commit a703c55004e1 ("drm: safely free connectors from connector_ite"), which added a call to flush_scheduled_work() in drm_mode_config_cleanup().
drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if called from the workqueue, but I don't have idea how to check that. The other way of fixing it would be to resurrect separate workqueue for DRM related events.
Best regards
Hi Marek,
On 12/12/2017 04:38 AM, Marek Szyprowski wrote:
Hi All,
On 2017-12-11 23:28, Javier Martinez Canillas wrote:
[adding Marek and Shuah to cc list]
On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
After a recent discussion on the Javier's patch: https://patchwork.kernel.org/patch/10106417/ I've managed to reproduce this issue also on Exynos5250 based Samsung Snow Chromebook and investigate a bit.
It is caused by a deadlock in the main kernel workqueue. Here are details:
- Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list
- Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()
- exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device
- error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()
- drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.
Do You have idea how to fix this issue properly?
Taking a look at git blame, this indeed shows that the issue has been introduced by the commit a703c55004e1 ("drm: safely free connectors from connector_ite"), which added a call to flush_scheduled_work() in drm_mode_config_cleanup().
This commit is making its way into stable releases. It has been added to 4.14-6 stable. If this patch poses problems, maybe somebody should comment on the stable release thread.
drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if called from the workqueue, but I don't have idea how to check that. The other way of fixing it would be to resurrect separate workqueue for DRM related events.
Especially since there is no solution :)
thanks, -- Shuah
On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
Hi All,
On 2017-12-11 23:28, Javier Martinez Canillas wrote:
[adding Marek and Shuah to cc list]
On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
After a recent discussion on the Javier's patch: https://patchwork.kernel.org/patch/10106417/ I've managed to reproduce this issue also on Exynos5250 based Samsung Snow Chromebook and investigate a bit.
It is caused by a deadlock in the main kernel workqueue. Here are details:
- Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list
- Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()
- exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device
- error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()
- drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.
Do You have idea how to fix this issue properly?
Taking a look at git blame, this indeed shows that the issue has been introduced by the commit a703c55004e1 ("drm: safely free connectors from connector_ite"), which added a call to flush_scheduled_work() in drm_mode_config_cleanup().
drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if called from the workqueue, but I don't have idea how to check that. The other way of fixing it would be to resurrect separate workqueue for DRM related events.
We need to flush the work there, or things will go wrong on unload. I think the real fix is to make sure that the connector cleanup work isn't run on the same work stuff as any driver bind stuff, which yes means new separate workqueue just for this.
I guess the simple fix is to do that in the drm, but tbh I'm surprised that driver bind/deferred probe hasn't run into this problem anywhere else yet. -Daniel
Hi Daniel,
On 2017-12-12 19:14, Daniel Vetter wrote:
On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
Hi All,
On 2017-12-11 23:28, Javier Martinez Canillas wrote:
[adding Marek and Shuah to cc list]
On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker guillaume.tucker@collabora.com wrote:
Hi Daniel,
Please see below, I've had several bisection results pointing at that commit over the week-end on mainline but also on linux-next and net-next. While the peach-pi is a bit flaky at the moment and is likely to have more than one issue, it does seem like this commit is causing some well reproducible kernel hang.
Here's a re-run with v4.15-rc3 showing the issue:
https://lava.collabora.co.uk/scheduler/job/1018478
and here's another one with the change mentioned below reverted:
https://lava.collabora.co.uk/scheduler/job/1018479
They both show a warning about "unbalanced disables for lcd_vdd", I don't know if this is related as I haven't investigated any further. It does appear to reliably hang with v4.15-rc3 and boot most of the time with the commit reverted though.
The automated kernelci.org bisection is still an experimental tool and it may well be a false positive, so please take this result with a pinch of salt...
The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel
Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other.
I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though.
After a recent discussion on the Javier's patch: https://patchwork.kernel.org/patch/10106417/ I've managed to reproduce this issue also on Exynos5250 based Samsung Snow Chromebook and investigate a bit.
It is caused by a deadlock in the main kernel workqueue. Here are details:
- Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list
- Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()
- exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device
- error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()
- drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.
Do You have idea how to fix this issue properly?
Taking a look at git blame, this indeed shows that the issue has been introduced by the commit a703c55004e1 ("drm: safely free connectors from connector_ite"), which added a call to flush_scheduled_work() in drm_mode_config_cleanup().
drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if called from the workqueue, but I don't have idea how to check that. The other way of fixing it would be to resurrect separate workqueue for DRM related events.
We need to flush the work there, or things will go wrong on unload. I think the real fix is to make sure that the connector cleanup work isn't run on the same work stuff as any driver bind stuff, which yes means new separate workqueue just for this.
I guess the simple fix is to do that in the drm, but tbh I'm surprised that driver bind/deferred probe hasn't run into this problem anywhere else yet.
Well, this means that no-one tested the error paths in deferred probe case. It's not that surprising, if we assume that typically platform devices are deferred only once. Second probe() call (which is done from workqueue) is successful in that case.
I've managed to fix this deadlock without additional workqueue: https://patchwork.freedesktop.org/patch/193069/
Best regards
kernel-build-reports@lists.linaro.org