Valentin Schneider <valentin.schneider@arm.com> writes:
Turns out a cfs_rq->runtime_remaining can become positive in assign_cfs_rq_runtime(), but this codepath has no call to unthrottle_cfs_rq().
This can leave us in a situation where we have a throttled cfs_rq with positive ->runtime_remaining, which breaks the math in distribute_cfs_runtime(): this function expects a negative value so that it may safely negate it into a positive value.
Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where we expect negative values, and pull in a comment from the mailing list that didn't make it in [1].
Cc: stable@vger.kernel.org
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
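To make the "breaks the math" part concrete, here is a rough userspace sketch of the runtime computation in distribute_cfs_runtime(). The helper name runtime_to_hand_out() and the typedefs are made up for illustration; only the two-line computation is taken from the patch context, and it assumes runtime is a u64 while ->runtime_remaining is an s64, as in fair.c.

```c
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel's s64/u64. */
typedef int64_t s64;
typedef uint64_t u64;

/*
 * Models this piece of distribute_cfs_runtime():
 *
 *	runtime = -cfs_rq->runtime_remaining + 1;
 *	if (runtime > remaining)
 *		runtime = remaining;
 *
 * runtime_to_hand_out() is an illustrative name, not a kernel function.
 */
static u64 runtime_to_hand_out(s64 runtime_remaining, u64 remaining)
{
	/* The signed result is converted to u64 on assignment. */
	u64 runtime = -runtime_remaining + 1;

	/* Never hand out more than what is left to distribute. */
	if (runtime > remaining)
		runtime = remaining;

	return runtime;
}
```

With runtime_remaining == -5 and 100 units left, this hands out 6, the minimum needed to get back above zero. With runtime_remaining == 5, the negation gives -4, which wraps to a huge u64 and is then clamped, so the one cfs_rq is handed all 100 units.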
Having now seen the rest of the thread:
Could you send the repro, as it doesn't seem to have reached lkml, so that I can confirm my guess as to what's going on?
It seems most likely that we throttle during one of the remove-change-add sequences in set_cpus_allowed and friends, or during the put half of pick_next_task followed by idle balance dropping the lock. Then distribute races with a later assign_cfs_rq_runtime, so that the account finds runtime in the cfs_b.
Re clock_task: it's only frozen for the purposes of PELT, not delta_exec.
The other possible way to fix this would be to skip assign if throttled, since the only time it could succeed is if we're racing with a distribute that will unthrottle us anyways.
The main advantage of that is avoiding the risk of screwy behavior due to unthrottling in the middle of pick_next/put_prev. The disadvantage is that we already hold the lock here, so if unthrottling works we don't need an IPI to trigger a preempt, etc. (But I think one of the issues is that we may trigger the preempt on the previous task, not the next, and I'm not 100% sure that will carry over correctly.)
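Roughly what I mean by "skip assign if throttled", as a toy userspace model rather than a patch. The toy_* structs and function are invented for this sketch and only loosely mirror cfs_bandwidth / cfs_rq. The refill step assumes assign tops the cfs_rq up to one slice above zero, and locking is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct toy_cfs_b {
	uint64_t runtime;		/* runtime left in the global pool */
};

struct toy_cfs_rq {
	int64_t runtime_remaining;	/* local quota, may go negative */
	bool throttled;
	struct toy_cfs_b *b;
};

/* Returns true if the cfs_rq ends up with positive runtime. */
static bool toy_assign_runtime(struct toy_cfs_rq *rq, uint64_t slice)
{
	uint64_t want, amount;

	/*
	 * The alternative fix: a throttled cfs_rq does not pull from the
	 * pool at all and leaves refilling (and unthrottling) to distribute.
	 */
	if (rq->throttled)
		return false;

	/* Enough to cover the deficit plus one slice (wraps to slice + deficit). */
	want = slice - (uint64_t)rq->runtime_remaining;
	amount = rq->b->runtime < want ? rq->b->runtime : want;

	rq->b->runtime -= amount;
	rq->runtime_remaining += (int64_t)amount;

	return rq->runtime_remaining > 0;
}
```

In this model a throttled cfs_rq never touches the pool, so it can only become positive via distribute, which is the invariant the WARN_ON in the patch wants to assert.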
 kernel/sched/fair.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..219ff3f328e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->throttled;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 	cfs_rq->runtime_remaining += amount;
 
+	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
+		unthrottle_cfs_rq(cfs_rq);
+
 	return cfs_rq->runtime_remaining > 0;
 }
 
@@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
-{
-	return cfs_bandwidth_used() && cfs_rq->throttled;
-}
-
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;