On Thu, Mar 28, 2024 at 8:21 AM mingyang.cui <mingyang.cui@horizon.ai> wrote:
> When rt_mutex_setprio changes a task's scheduling class to RT, sometimes the task's vruntime is not updated correctly upon return to the fair class. Specifically, the following is being observed:
> - task has just been created and has run for a short time
> - task sleeps while still in the fair class
> - task is boosted to RT via rt_mutex_setprio, which changes the task to RT and calls check_class_changed
> - check_class_changed leads to detach_task_cfs_rq, at which point the vruntime_normalized check sees that the task's sum_exec_runtime is zero, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime
> - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it was never subtracted earlier
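For anyone following along, the check being described lives in vruntime_normalized(). Roughly (abridged from an older kernel/sched/fair.c, before the p->state to p->__state rename, so the exact form may differ by tree):

static inline bool vruntime_normalized(struct task_struct *p)
{
	struct sched_entity *se = &p->se;

	/* Queued/migrating tasks were already normalized at dequeue. */
	if (p->on_rq)
		return true;

	/*
	 * When !on_rq, vruntime of the task has usually NOT been
	 * normalized. But there are some cases where it has already
	 * been normalized:
	 *
	 * - A forked child which is waiting for being woken up by
	 *   wake_up_new_task().
	 * - A task which has been woken up by try_to_wake_up() and
	 *   waiting for actually being woken up by sched_ttwu_pending().
	 */
	if (!se->sum_exec_runtime ||
	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
		return true;

	return false;
}

The first clause is the one the patch is concerned with: sum_exec_runtime == 0 is taken to mean "freshly forked, vruntime already normalized".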
Just to make sure I am following: since sum_exec_runtime is updated in update_curr(), if the task goes to sleep, shouldn't it be dequeued and thus update_curr() would have been run (and thus sum_exec_runtime would be non-zero)?
Maybe in this analysis the new task is being boosted while it is still newly running (instead of sleeping)?
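That is, the path I have in mind, roughly (abridged from an older kernel/sched/fair.c, so placement may vary by version):

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);
	...
}

where update_curr() does curr->sum_exec_runtime += delta_exec; so a task that actually ran before sleeping should come out of dequeue with a non-zero sum_exec_runtime.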
> Since the task's vruntime is about double that of the other tasks in the cfs_rq, the task is unable to run for a long time when there are continuously runnable tasks in the cfs_rq.
> The immediate result is inflation of the task's vruntime, giving it lower priority (starving it if there's enough available work). The longer-term effect is inflation of all vruntimes because the task's vruntime becomes the rq's min_vruntime when the higher priority tasks go idle. That leads to a vicious cycle, where the vruntime inflation is repeatedly doubled.
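If I'm following the arithmetic, the inflation looks something like this (made-up numbers, just to illustrate):

	at detach:  cfs_rq->min_vruntime = 100ms, task vruntime = 105ms
	            (the ~100ms baseline is erroneously never subtracted)
	at attach:  task vruntime += new min_vruntime (say 110ms) -> ~215ms
	            vs. ~110ms for its peers, i.e. roughly doubled

and once the other tasks go idle, that inflated value becomes the rq's new min_vruntime baseline, so the next round doubles it again.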
This is an interesting find! I'm curious how you detected the problem, as it might make a good correctness test (which I'm selfishly looking for more of myself :)
> The root cause of the problem is that vruntime_normalized() makes a misjudgment: since the sum_exec_runtime of some tasks that were just created and have run only for a short time is zero, vruntime_normalized() mistakenly treats them as tasks that have just been forked. Therefore, the cfs_rq's min_vruntime is not subtracted from the task's vruntime.
> So, we fix this bug by adding a check condition for newly forked tasks.
> Signed-off-by: mingyang.cui <mingyang.cui@horizon.ai>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 73a89fbd81be..3d0c14f3731f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11112,7 +11112,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
>  	 * - A task which has been woken up by try_to_wake_up() and
>  	 *   waiting for actually being woken up by sched_ttwu_pending().
>  	 */
> -	if (!se->sum_exec_runtime ||
> +	if (!se->sum_exec_runtime && p->state == TASK_NEW ||
>  	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
>  		return true;
> 
This looks like it was applied against an older tree? The p->state accesses should be under a READ_ONCE (and likely consolidated before the conditional?)
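i.e. something like (untested, and note that newer trees spell it p->__state):

	unsigned int state = READ_ONCE(p->__state);

	if ((!se->sum_exec_runtime && state == TASK_NEW) ||
	    (state == TASK_WAKING && p->sched_remote_wakeup))
		return true;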
Also, I wonder if, alternatively, a call to update_curr() when cfs_rq->curr == se in switched_from_fair() would be good (i.e.: normalize it on the boost to close the edge case, rather than handling it as an expected non-normalized condition)?
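Roughly what I'm picturing, as an untested sketch against the same older tree (not a real patch):

static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	/*
	 * If the task being boosted is what the cfs_rq is currently
	 * running, fold its pending runtime into sum_exec_runtime
	 * before detaching, so vruntime_normalized() can't mistake
	 * it for a freshly forked task.
	 */
	if (cfs_rq->curr == se)
		update_curr(cfs_rq);

	detach_task_cfs_rq(p);
}

That would normalize the accounting at boost time instead of special-casing TASK_NEW at detach.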
thanks -john