With the transition of pd-mapper into the kernel, the timing was altered such that on some targets the initial rpmsg_send() requests from pmic_glink clients would be attempted before the firmware had announced intents, and the firmware would reject the intent requests.
Fix this.
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
Changes in v2:
- Introduced "intents" and fixed a few spelling mistakes in the commit message of patch 1
- Cleaned up log snippet in commit message of patch 2, added battery manager log
- Changed the arbitrary 10 second timeout to 5... Ought to be enough for anybody.
- Added a small sleep in the send-loop in patch 2, and by that refactored the loop completely.
- Link to v1: https://lore.kernel.org/r/20241022-pmic-glink-ecancelled-v1-0-9e26fc74e0a3@o...
---
Bjorn Andersson (2):
      rpmsg: glink: Handle rejected intent request better
      soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
 drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
 drivers/soc/qcom/pmic_glink.c     | 25 ++++++++++++++++++++++---
 2 files changed, 29 insertions(+), 6 deletions(-)
---
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
change-id: 20241022-pmic-glink-ecancelled-d899a9ca0358
Best regards,
GLINK operates using pre-allocated buffers, aka intents, where incoming messages are aggregated before being passed up the stack. In the case that no suitable intents have been announced by the receiver, the sender can request an intent to be allocated.
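To make the sender side of this concrete, here is a minimal userspace sketch. It assumes a hypothetical `struct remote_intent` that records only the size and busy state of each announced intent (the driver's real bookkeeping differs), and a hypothetical `pick_intent()` that selects the smallest free intent able to hold the message. When nothing fits it returns -1, which is the point where the sender would fall back to requesting an intent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an intent announced by the receiving side. */
struct remote_intent {
	size_t size;	/* size of the buffer the receiver pre-allocated */
	bool in_use;	/* true while it holds a message not yet handed back */
};

/*
 * Pick the smallest free intent that can hold @len bytes.
 * Returns its index, or -1 when no announced intent fits; that is where
 * a GLINK sender would fall back to requesting a new intent.
 */
static int pick_intent(const struct remote_intent *intents, size_t count,
		       size_t len)
{
	int best = -1;

	assert(intents != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (intents[i].in_use || intents[i].size < len)
			continue;
		if (best < 0 || intents[i].size < intents[best].size)
			best = (int)i;
	}
	return best;
}
```

With an empty intent table `pick_intent()` always returns -1, so every send turns into an intent request. That is exactly the situation on channels where the remote only pre-allocates a fixed set of intents and has not announced them yet.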
The initial implementation of the response to such a request dealt with two outcomes: granted allocations, and all other cases, which were treated as -ECANCELED (likely from "cancelling the operation as the remote is going down").
But on some channels intent allocation is not supported; instead, the remote will pre-allocate and announce a fixed number of intents for the sender to use. If rpmsg_send() is invoked on such a channel before any intents have been announced, an intent request will be issued, and as this comes back rejected, the call fails with -ECANCELED.
Given that this is reported in the same way as the remote being shut down, there's no way for the client to differentiate the two cases.
In line with the original GLINK design, change the return value to -EAGAIN for the case where the remote rejects an intent allocation request.
It's tempting to handle this case in the GLINK core, as we expect intents to show up in this case. But there's no way to distinguish between this case and a rejection of a too-big allocation, nor is it possible to predict whether a currently used (and seemingly suitable) intent will be returned for reuse. As such, returning the error to the client and allowing it to react seems to be the only sensible solution.
In addition to this, commit c05dfce0b89e ("rpmsg: glink: Wait for intent, not just request ack") changed the logic such that the code always waits for both an intent request response and an intent. This works out in most cases, but in the event that an intent request is rejected and no further intent arrives (e.g. the client asks for a too-big intent), the code will stall for 10 seconds and then return -ETIMEDOUT instead of a more suitable error.
This change also meant that intent requests racing with the shutdown of the remote would be exposed to the same problem, unless some intent happened to arrive. A patch for this was developed and posted by Sarannya S [1], and has been incorporated here.
To summarize, the intent request can end in 4 ways:
- Timeout, no response arrived => return -ETIMEDOUT
- Abort TX, the edge is going away => return -ECANCELED
- Intent request was rejected => return -EAGAIN
- Intent request was accepted, and an intent arrived => return 0
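The four outcomes can be captured in a small decision function. This is a sketch, not the driver code: `struct intent_wait_state` and `intent_request_outcome()` are hypothetical, and the encoding of `req_result` (negative while no response has arrived, 0 for rejected, positive for granted) is inferred from the diff rather than stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical snapshot of what is known once the wait for an intent
 * request ends. req_result: < 0 no response, 0 rejected, > 0 granted
 * (encoding inferred from the diff, not documented in the series).
 */
struct intent_wait_state {
	bool timed_out;		/* the wait hit its timeout */
	bool abort_tx;		/* the edge is going away */
	int req_result;
	bool intent_received;	/* a usable intent was announced */
};

/* Map the end state of an intent request to the value returned to the sender. */
static int intent_request_outcome(const struct intent_wait_state *s)
{
	if (s->timed_out)
		return -ETIMEDOUT;	/* no response arrived */
	if (s->abort_tx)
		return -ECANCELED;	/* edge going away */
	if (s->req_result == 0)
		return -EAGAIN;		/* remote rejected the request */

	/* The wait only ends for a granted request once an intent is in. */
	assert(s->req_result > 0 && s->intent_received);
	return 0;			/* accepted, and an intent arrived */
}
```

The order of the checks mirrors the patch: a timeout is reported first, then an abort, and only then is the response itself inspected.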
This patch was developed with input from Sarannya S, Deepak Kumar Singh, and Chris Lew.
[1] https://lore.kernel.org/all/20240925072328.1163183-1-quic_deesin@quicinc.com...
Fixes: c05dfce0b89e ("rpmsg: glink: Wait for intent, not just request ack")
Cc: stable@vger.kernel.org
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/rpmsg/qcom_glink_native.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
index 0b2f290069080638581a13b3a580054d31e176c2..d3af1dfa3c7d71b95dda911dfc7ad844679359d6 100644
--- a/drivers/rpmsg/qcom_glink_native.c
+++ b/drivers/rpmsg/qcom_glink_native.c
@@ -1440,14 +1440,18 @@ static int qcom_glink_request_intent(struct qcom_glink *glink,
 		goto unlock;
 
 	ret = wait_event_timeout(channel->intent_req_wq,
-				 READ_ONCE(channel->intent_req_result) >= 0 &&
-				 READ_ONCE(channel->intent_received),
+				 READ_ONCE(channel->intent_req_result) == 0 ||
+				 (READ_ONCE(channel->intent_req_result) > 0 &&
+				  READ_ONCE(channel->intent_received)) ||
+				 glink->abort_tx,
 				 10 * HZ);
 	if (!ret) {
 		dev_err(glink->dev, "intent request timed out\n");
 		ret = -ETIMEDOUT;
+	} else if (glink->abort_tx) {
+		ret = -ECANCELED;
 	} else {
-		ret = READ_ONCE(channel->intent_req_result) ? 0 : -ECANCELED;
+		ret = READ_ONCE(channel->intent_req_result) ? 0 : -EAGAIN;
 	}
 
 unlock:
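The difference between the old and new wake-up conditions in this hunk can be checked in isolation. Below is a minimal userspace model, assuming the `intent_req_result` encoding the diff implies (negative while pending, 0 rejected, positive granted); both helper names are hypothetical. The old condition stays false for a rejection that is never followed by an intent, which is what led to the 10 second stall.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition before the patch: a non-negative result AND an intent. */
static bool old_wake_condition(int req_result, bool intent_received)
{
	return req_result >= 0 && intent_received;
}

/* Condition after the patch: a rejection or an abort also ends the wait. */
static bool new_wake_condition(int req_result, bool intent_received,
			       bool abort_tx)
{
	return req_result == 0 ||
	       (req_result > 0 && intent_received) ||
	       abort_tx;
}
```

For example, `old_wake_condition(0, false)` is false while `new_wake_condition(0, false, false)` is true: a rejected request now wakes the waiter immediately and the caller reports -EAGAIN rather than -ETIMEDOUT. A granted request with no intent yet still keeps waiting under both conditions.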
On 10/23/2024 10:24 AM, Bjorn Andersson wrote:
Reviewed-by: Chris Lew <quic_clew@quicinc.com>
Some versions of the pmic_glink firmware do not allow dynamic GLINK intent allocations; attempting to send a message before the firmware has allocated its receive buffers and announced these intent allocations will fail. When this happens, something like this shows up in the log:
pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125
qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications
GLINK has been updated to distinguish between the cases where the remote is going down (-ECANCELED) and the intent allocation being rejected (-EAGAIN).
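When reading logs like the one above, it helps to map the negative numbers back to errno names. The helper below is hypothetical and uses the host's `<errno.h>`; on a Linux host those values come from the kernel's UAPI headers, so on x86 and arm64 `ECANCELED` is 125, i.e. the `-125` in the log. Other operating systems number errno values differently.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: name the negative errno values seen in the
 * pmic_glink logs and discussed in this series.
 */
static const char *glink_errno_name(int ret)
{
	switch (-ret) {
	case 0:
		return "success";
	case ECANCELED:		/* the remote/edge is going down */
		return "ECANCELED";
	case EAGAIN:		/* intent request rejected, try again */
		return "EAGAIN";
	case ETIMEDOUT:		/* no response within the timeout */
		return "ETIMEDOUT";
	case ECONNRESET:	/* pmic_glink has no endpoint */
		return "ECONNRESET";
	default:
		return "unknown";
	}
}
```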
Retry the send until intent buffers become available, or an actual error occurs.
To avoid waiting indefinitely in the event that the firmware misbehaves and no intents arrive, an arbitrary 5 second timeout is used.
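The retry policy can be modelled outside the kernel to see how it behaves. This sketch assumes POSIX `clock_gettime(CLOCK_MONOTONIC)` and `nanosleep()` as stand-ins for `jiffies` and `usleep_range()`; `send_with_retry()`, `struct flaky_remote` and `flaky_send()` are hypothetical. Like the loop in the patch, the deadline is evaluated after sleeping, so one last attempt is made once the timeout has passed before -ETIMEDOUT is returned.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical send callback standing in for rpmsg_send(). */
typedef int (*send_fn)(void *ctx);

/* Hypothetical remote that rejects the first @rejects sends with -EAGAIN. */
struct flaky_remote {
	int rejects;	/* number of attempts answered with -EAGAIN */
	int final_ret;	/* value returned once the rejections are used up */
	int calls;	/* number of send attempts observed */
};

static int flaky_send(void *ctx)
{
	struct flaky_remote *r = ctx;

	return r->calls++ < r->rejects ? -EAGAIN : r->final_ret;
}

/* Milliseconds elapsed since @start on the monotonic clock. */
static long long elapsed_ms(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) * 1000LL +
	       (now.tv_nsec - start->tv_nsec) / 1000000LL;
}

/*
 * Retry @send while it returns -EAGAIN, sleeping @sleep_ms between attempts.
 * Any other return value (success or a real error) is passed straight back.
 * Once @timeout_ms has elapsed, the next -EAGAIN becomes -ETIMEDOUT.
 */
static int send_with_retry(send_fn send, void *ctx, long sleep_ms,
			   long timeout_ms)
{
	const struct timespec pause = {
		.tv_sec = sleep_ms / 1000,
		.tv_nsec = (sleep_ms % 1000) * 1000000L,
	};
	struct timespec start;
	bool timeout_reached = false;
	int ret;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (;;) {
		ret = send(ctx);
		if (ret != -EAGAIN)
			break;

		if (timeout_reached) {
			ret = -ETIMEDOUT;
			break;
		}

		nanosleep(&pause, NULL);
		timeout_reached = elapsed_ms(&start) > timeout_ms;
	}
	return ret;
}
```

With a 1 ms sleep and a 5 second timeout, a send that is rejected forever would be attempted on the order of 5000 times before giving up, which is the trade-off discussed later in this thread.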
This patch was developed with input from Chris Lew.
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
Cc: stable@vger.kernel.org # rpmsg: glink: Handle rejected intent request better
Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver")
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/soc/qcom/pmic_glink.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 9606222993fd78e80d776ea299cad024a0197e91..baa4ac6704a901661d1055c5caeaab61dc315795 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2022, Linaro Ltd
  */
 #include <linux/auxiliary_bus.h>
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/of.h>
 #include <linux/platform_device.h>
@@ -13,6 +14,8 @@
 #include <linux/soc/qcom/pmic_glink.h>
 #include <linux/spinlock.h>
 
+#define PMIC_GLINK_SEND_TIMEOUT (5 * HZ)
+
 enum {
 	PMIC_GLINK_CLIENT_BATT = 0,
 	PMIC_GLINK_CLIENT_ALTMODE,
@@ -112,13 +115,29 @@ EXPORT_SYMBOL_GPL(pmic_glink_client_register);
 int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
 {
 	struct pmic_glink *pg = client->pg;
+	bool timeout_reached = false;
+	unsigned long start;
 	int ret;
 
 	mutex_lock(&pg->state_lock);
-	if (!pg->ept)
+	if (!pg->ept) {
 		ret = -ECONNRESET;
-	else
-		ret = rpmsg_send(pg->ept, data, len);
+	} else {
+		start = jiffies;
+		for (;;) {
+			ret = rpmsg_send(pg->ept, data, len);
+			if (ret != -EAGAIN)
+				break;
+
+			if (timeout_reached) {
+				ret = -ETIMEDOUT;
+				break;
+			}
+
+			usleep_range(1000, 5000);
+			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
+		}
+	}
 	mutex_unlock(&pg->state_lock);
 
 	return ret;
On 10/23/2024 10:24 AM, Bjorn Andersson wrote:
Reviewed-by: Chris Lew <quic_clew@quicinc.com>
On Wed, Oct 23, 2024 at 05:24:33PM +0000, Bjorn Andersson wrote:
> Some versions of the pmic_glink firmware do not allow dynamic GLINK intent allocations; attempting to send a message before the firmware has allocated its receive buffers and announced these intent allocations will fail.
> Retry the send until intent buffers become available, or an actual error occurs.
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
> Cc: stable@vger.kernel.org # rpmsg: glink: Handle rejected intent request better
> Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver")
> Tested-by: Johan Hovold <johan+linaro@kernel.org>
> Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
Thanks for the update. Still works as intended here.
>  int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
>  {
>  	struct pmic_glink *pg = client->pg;
> +	bool timeout_reached = false;
> +	unsigned long start;
>  	int ret;
>  
>  	mutex_lock(&pg->state_lock);
> -	if (!pg->ept)
> +	if (!pg->ept) {
>  		ret = -ECONNRESET;
> -	else
> -		ret = rpmsg_send(pg->ept, data, len);
> +	} else {
> +		start = jiffies;
> +		for (;;) {
> +			ret = rpmsg_send(pg->ept, data, len);
> +			if (ret != -EAGAIN)
> +				break;
> +
> +			if (timeout_reached) {
> +				ret = -ETIMEDOUT;
> +				break;
> +			}
> +
> +			usleep_range(1000, 5000);
I ran some quick tests of this patch this morning (reproducing the issue five times), and with the above delay it seems a single resend is enough. Dropping the delay, I once hit:
[ 8.723479] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.723877] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.723921] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.723951] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.723981] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.724010] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
[ 8.724046] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
which seems to suggest that a one millisecond sleep is sufficient for the currently observed issue.
It would still mean up to 5k calls if you ever try to send a too-large buffer or similar and spin here for five seconds, however. Perhaps nothing to worry about at this point, but increasing the delay or lowering the timeout could be considered.
> +			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
> +		}
> +	}
>  	mutex_unlock(&pg->state_lock);
>  
>  	return ret;
Johan
On Thu, Oct 24, 2024 at 08:39:25AM GMT, Johan Hovold wrote:
> On Wed, Oct 23, 2024 at 05:24:33PM +0000, Bjorn Andersson wrote:
>> Some versions of the pmic_glink firmware do not allow dynamic GLINK intent allocations; attempting to send a message before the firmware has allocated its receive buffers and announced these intent allocations will fail.
>> Retry the send until intent buffers become available, or an actual error occurs.
>> Reported-by: Johan Hovold <johan@kernel.org>
>> Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
>> Cc: stable@vger.kernel.org # rpmsg: glink: Handle rejected intent request better
>> Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver")
>> Tested-by: Johan Hovold <johan+linaro@kernel.org>
>> Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
>> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
> Thanks for the update. Still works as intended here.
Thanks for the confirmation.
>>  int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
>>  {
>>  	struct pmic_glink *pg = client->pg;
>> +	bool timeout_reached = false;
>> +	unsigned long start;
>>  	int ret;
>>  
>>  	mutex_lock(&pg->state_lock);
>> -	if (!pg->ept)
>> +	if (!pg->ept) {
>>  		ret = -ECONNRESET;
>> -	else
>> -		ret = rpmsg_send(pg->ept, data, len);
>> +	} else {
>> +		start = jiffies;
>> +		for (;;) {
>> +			ret = rpmsg_send(pg->ept, data, len);
>> +			if (ret != -EAGAIN)
>> +				break;
>> +
>> +			if (timeout_reached) {
>> +				ret = -ETIMEDOUT;
>> +				break;
>> +			}
>> +
>> +			usleep_range(1000, 5000);
> I ran some quick tests of this patch this morning (reproducing the issue five times), and with the above delay it seems a single resend is enough. Dropping the delay, I once hit:
> [ 8.723479] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.723877] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.723921] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.723951] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.723981] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.724010] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> [ 8.724046] qcom_pmic_glink pmic-glink: pmic_glink_send - resend
> which seems to suggest that a one millisecond sleep is sufficient for the currently observed issue.
> It would still mean up to 5k calls if you ever try to send a too-large buffer or similar and spin here for five seconds, however. Perhaps nothing to worry about at this point, but increasing the delay or lowering the timeout could be considered.
I did consider this as well, but this code path is specific to pmic-glink, so we shouldn't have any messages of a size unexpected by the other side...
If we do, then let's fix that. If my assumptions are wrong, I'd be happy to see this corrected, and my arbitrarily chosen timeout values with it.
Thanks, Bjorn
>> +			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
>> +		}
>> +	}
>>  	mutex_unlock(&pg->state_lock);
>>  
>>  	return ret;
> Johan
On Wed, 23 Oct 2024 17:24:31 +0000, Bjorn Andersson wrote:
> With the transition of pd-mapper into the kernel, the timing was altered such that on some targets the initial rpmsg_send() requests from pmic_glink clients would be attempted before the firmware had announced intents, and the firmware would reject the intent requests.
> Fix this.
[...]
Applied, thanks!
[1/2] rpmsg: glink: Handle rejected intent request better
      commit: a387e73fedd6307c0e194deaa53c42b153ff0bd6
[2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections
      commit: f8c879192465d9f328cb0df07208ef077c560bb1
Best regards,
linux-stable-mirror@lists.linaro.org