sched_ext tasks can be starved by long-running RT tasks, especially since RT throttling was replaced by deadline servers, which currently boost only SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling sched_ext tasks. This is fairly common on distributions or environments where applications like video compositors, audio services, etc. run as RT tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
...
CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
          curr=sway[994] class=rt_sched_class
 R kworker/0:0[106377] -5043ms
      scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0
      slice=20000000
      cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality they can't do much: RT tasks run outside their control and can potentially consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are also boosted and do not suffer starvation.
Two kselftests are also provided to verify that the starvation fix works and that the bandwidth allocation is correct.
== Design ==
- The EXT server is initialized at boot time and remains configured throughout the system's lifetime
- It starts automatically when the first sched_ext task is enqueued (rq->scx.nr_running == 1)
- The server's pick function (ext_server_pick_task) always selects sched_ext tasks when active
- Runtime accounting happens in update_curr_scx() during task execution and update_curr_idle() when idle
- Bandwidth accounting includes both fair and ext servers in root domain calculations
- A debugfs interface (/sys/kernel/debug/sched/ext_server/) allows runtime tuning of server parameters
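Since the kernel/sched/ext.c diff later in this series is line-wrapped in this posting, the two new functions it adds are reproduced here for readability; the server itself is started from enqueue_task_scx() once rq->scx.nr_running reaches 1 and is charged from update_curr_scx() via dl_server_update():

  /*
   * Select the next task to run from the ext scheduling class.
   *
   * Use do_pick_task_scx() directly with @force_scx enabled, since the
   * dl_server must always select a sched_ext task.
   */
  static struct task_struct *
  ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
  {
          if (!scx_enabled())
                  return NULL;

          return do_pick_task_scx(dl_se->rq, rf, true);
  }

  /*
   * Initialize the ext server deadline entity (called from sched_init()).
   */
  void ext_server_init(struct rq *rq)
  {
          struct sched_dl_entity *dl_se = &rq->ext_server;

          init_dl_entity(dl_se);

          dl_server_init(dl_se, rq, ext_server_pick_task);
  }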
== Highlights in this version ==
As discussed at the sched_ext microconference at LPC Tokyo, the plan is to start with a simpler approach, avoiding automatically creating or tearing down the EXT server bandwidth reservation when a BPF scheduler is loaded or unloaded. Instead, the reservation is kept permanently active. This significantly simplifies the logic while still addressing the starvation issue.
Any fine-tuning of the bandwidth reservation is delegated to the system administrator, who can adjust it via the debugfs interface (per-CPU runtime and period files, in nanoseconds, under /sys/kernel/debug/sched/ext_server/cpu<N>/). In the future, a more suitable interface can be introduced, and automatic removal of the reservation when the BPF scheduler is unloaded can be revisited.
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v11:
- do not create/remove the bandwidth reservation for the ext server when a BPF scheduler is loaded/unloaded, but keep the reservation bandwidth always active
- change rt_stall kselftest to validate both FAIR and EXT DL servers
- Link to v10: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
Changes in v10:
- reordered patches to better isolate sched_ext changes vs sched/deadline changes (Andrea Righi)
- define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
- add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
- wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli)
- remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when invoked by the dl_server (Tejun Heo)
- rename rq_dl_server to dl_server (Peter Zijlstra)
- fixed a missing dl_server start in dl_server_on() (Christian Loehle)
- add a comment to the rt_stall selftest to better explain the 4% threshold (Emil Tsalapatis)
- Link to v9: https://lore.kernel.org/all/20251017093214.70029-1-arighi@nvidia.com/

Changes in v9:
- Drop the ->balance() logic as its functionality is now integrated into ->pick_task(), allowing dl_server to call pick_task_scx() directly
- Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/

Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/

Changes in v6:
- Added Acks to few patches
- Fixes to few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/

Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/

Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/

Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (2):
  sched_ext: Add a DL server for sched_ext tasks
  selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (5):
  sched/deadline: Clear the defer params
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/debug: Add support to change sched_ext server params
  selftests/sched_ext: Add test for DL server total_bw consistency
 kernel/sched/core.c                              |   6 +
 kernel/sched/deadline.c                          |  87 +++++--
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               |  42 ++++
 kernel/sched/idle.c                              |   3 +
 kernel/sched/sched.h                             |   2 +
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 240 +++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 11 files changed, 811 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
From: Joel Fernandes <joelagnelf@nvidia.com>
The defer params were not cleared in __dl_clear_params. Clear them.
Without this, some of my test cases are flaky and the DL timer does not start correctly AFAICS.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 319439fe18702..2789db5217cd4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3642,6 +3642,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_non_contending = 0;
 	dl_se->dl_overrun = 0;
 	dl_se->dl_server = 0;
+	dl_se->dl_defer = 0;
+	dl_se->dl_defer_running = 0;
+	dl_se->dl_defer_armed = 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se = dl_se;
From: Joel Fernandes <joelagnelf@nvidia.com>
Updating "ppos" on error conditions does not make much sense. The pattern is to return the error code directly without modifying the position, or modify the position on success and return the number of bytes written.
Since on success, the return value of apply is 0, there is no point in modifying ppos either. Fix it by removing all this and just returning error code or number of bytes written on success.
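To illustrate the convention (a minimal, hypothetical write-handler sketch, not the actual debugfs code; apply_new_value() is a made-up helper):

  static ssize_t example_write(struct file *filp, const char __user *ubuf,
                               size_t cnt, loff_t *ppos)
  {
          int retval;

          /* Hypothetical helper that parses and applies the new value. */
          retval = apply_new_value(filp, ubuf, cnt);
          if (retval < 0)
                  return retval;  /* error: do not touch *ppos */

          *ppos += cnt;           /* success: the whole write was consumed */
          return cnt;
  }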
Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680a..93f009e1076d8 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -345,8 +345,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 	u64 runtime, period;
+	int retval = 0;
 	size_t err;
-	int retval;
 	u64 value;
 
 	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
@@ -380,8 +380,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		dl_server_stop(&rq->fair_server);
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
-		if (retval)
-			cnt = retval;
 
 		if (!runtime)
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
@@ -389,6 +387,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		if (rq->cfs.h_nr_queued)
 			dl_server_start(&rq->fair_server);
+
+		if (retval < 0)
+			return retval;
 	}
 
 	*ppos += cnt;
From: Joel Fernandes <joelagnelf@nvidia.com>
Currently the DL server interface for applying parameters checks CFS internals to identify whether the server is active. This is error-prone and makes it harder to add new servers in the future.

Fix it by using dl_server_active(), which is also used by the DL server code to determine whether the DL server was started.
v2: do not start the server if runtime is zero (Juri Lelli)
Tested-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 93f009e1076d8..dd793f8f3858a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 		return err;
 
 	scoped_guard (rq_lock_irqsave, rq) {
+		bool is_active;
+
 		runtime = rq->fair_server.dl_runtime;
 		period = rq->fair_server.dl_period;
 
@@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			return -EINVAL;
 		}
 
-		update_rq_clock(rq);
-		dl_server_stop(&rq->fair_server);
+		is_active = dl_server_active(&rq->fair_server);
+		if (is_active) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
 
 		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
 
@@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
 			printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n",
 					cpu_of(rq));
 
-		if (rq->cfs.h_nr_queued)
+		if (is_active && runtime)
 			dl_server_start(&rq->fair_server);
 
 		if (retval < 0)
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT.
A kselftest is also included later to confirm that both DL servers are functioning correctly:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1511) is 0.250000 seconds
# Runtime of RT task (PID 1512) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1514) is 0.250000 seconds
# Runtime of RT task (PID 1515) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of FAIR task (PID 1517) is 0.250000 seconds
# Runtime of RT task (PID 1518) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1521) is 0.250000 seconds
# Runtime of RT task (PID 1522) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
ok 1 rt_stall #
===== END =====
v4:
- initialize EXT server bandwidth reservation at init time and always keep it active (Andrea Righi)
- check for rq->nr_running == 1 to determine when to account idle time (Juri Lelli)
v3:
- clarify that fair is not the only dl_server (Juri Lelli)
- remove explicit stop to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when it's invoked by the dl_server (Tejun Heo)
- depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
v2:
- drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c     |  6 +++
 kernel/sched/deadline.c | 84 ++++++++++++++++++++++++++++++-----------
 kernel/sched/ext.c      | 42 +++++++++++++++++++++
 kernel/sched/idle.c     |  3 ++
 kernel/sched/sched.h    |  2 +
 kernel/sched/topology.c |  5 +++
 6 files changed, 119 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 41ba0be169117..a2400ee33a356 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8475,6 +8475,9 @@ int sched_cpu_dying(unsigned int cpu) dump_rq_tasks(rq, KERN_WARNING); } dl_server_stop(&rq->fair_server); +#ifdef CONFIG_SCHED_CLASS_EXT + dl_server_stop(&rq->ext_server); +#endif rq_unlock_irqrestore(rq, &rf);
calc_load_migrate(rq); @@ -8678,6 +8681,9 @@ void __init sched_init(void) hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); fair_server_init(rq); +#ifdef CONFIG_SCHED_CLASS_EXT + ext_server_init(rq); +#endif
#ifdef CONFIG_SCHED_CORE rq->core = rq; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 2789db5217cd4..88f2b5ed5678a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1445,8 +1445,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 dl_se->dl_defer_idle = 0;
/* - * The fair server can consume its runtime while throttled (not queued/ - * running as regular CFS). + * The DL server can consume its runtime while throttled (not + * queued / running as regular CFS). * * If the server consumes its entire runtime in this state. The server * is not required for the current period. Thus, reset the server by @@ -1531,10 +1531,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 }
/* - * The fair server (sole dl_server) does not account for real-time - * workload because it is running fair work. + * The dl_server does not account for real-time workload because it + * is running fair work. */ - if (dl_se == &rq->fair_server) + if (dl_se->dl_server) return;
#ifdef CONFIG_RT_GROUP_SCHED @@ -1569,9 +1569,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 * In the non-defer mode, the idle time is not accounted, as the * server provides a guarantee. * - * If the dl_server is in defer mode, the idle time is also considered - * as time available for the fair server, avoiding a penalty for the - * rt scheduler that did not consumed that time. + * If the dl_server is in defer mode, the idle time is also considered as + * time available for the dl_server, avoiding a penalty for the rt + * scheduler that did not consumed that time. */ void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec) { @@ -1810,6 +1810,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se) hrtimer_try_to_cancel(&dl_se->dl_timer); dl_se->dl_defer_armed = 0; dl_se->dl_throttled = 0; + dl_se->dl_defer_running = 0; dl_se->dl_defer_idle = 0; dl_se->dl_server_active = 0; } @@ -1844,6 +1845,18 @@ void sched_init_dl_servers(void) dl_se->dl_server = 1; dl_se->dl_defer = 1; setup_new_dl_entity(dl_se); + +#ifdef CONFIG_SCHED_CLASS_EXT + dl_se = &rq->ext_server; + + WARN_ON(dl_server(dl_se)); + + dl_server_apply_params(dl_se, runtime, period, 1); + + dl_se->dl_server = 1; + dl_se->dl_defer = 1; + setup_new_dl_entity(dl_se); +#endif } }
@@ -3183,6 +3196,36 @@ void dl_add_task_root_domain(struct task_struct *p) raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); }
+static void dl_server_add_bw(struct root_domain *rd, int cpu) +{ + struct sched_dl_entity *dl_se; + + dl_se = &cpu_rq(cpu)->fair_server; + if (dl_server(dl_se) && cpu_active(cpu)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); + +#ifdef CONFIG_SCHED_CLASS_EXT + dl_se = &cpu_rq(cpu)->ext_server; + if (dl_server(dl_se) && cpu_active(cpu)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); +#endif +} + +static u64 dl_server_read_bw(int cpu) +{ + u64 dl_bw = 0; + + if (cpu_rq(cpu)->fair_server.dl_server) + dl_bw += cpu_rq(cpu)->fair_server.dl_bw; + +#ifdef CONFIG_SCHED_CLASS_EXT + if (cpu_rq(cpu)->ext_server.dl_server) + dl_bw += cpu_rq(cpu)->ext_server.dl_bw; +#endif + + return dl_bw; +} + void dl_clear_root_domain(struct root_domain *rd) { int i; @@ -3201,12 +3244,8 @@ void dl_clear_root_domain(struct root_domain *rd) * dl_servers are not tasks. Since dl_add_task_root_domain ignores * them, we need to account for them here explicitly. */ - for_each_cpu(i, rd->span) { - struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; - - if (dl_server(dl_se) && cpu_active(i)) - __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i)); - } + for_each_cpu(i, rd->span) + dl_server_add_bw(rd, i); }
void dl_clear_root_domain_cpu(int cpu) @@ -3702,7 +3741,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) unsigned long flags, cap; struct dl_bw *dl_b; bool overflow = 0; - u64 fair_server_bw = 0; + u64 dl_server_bw = 0;
rcu_read_lock_sched(); dl_b = dl_bw_of(cpu); @@ -3735,27 +3774,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) cap -= arch_scale_cpu_capacity(cpu);
/* - * cpu is going offline and NORMAL tasks will be moved away - * from it. We can thus discount dl_server bandwidth - * contribution as it won't need to be servicing tasks after - * the cpu is off. + * cpu is going offline and NORMAL and EXT tasks will be + * moved away from it. We can thus discount dl_server + * bandwidth contribution as it won't need to be servicing + * tasks after the cpu is off. */ - if (cpu_rq(cpu)->fair_server.dl_server) - fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; + dl_server_bw = dl_server_read_bw(cpu);
/* * Not much to check if no DEADLINE bandwidth is present. * dl_servers we can discount, as tasks will be moved out the * offlined CPUs anyway. */ - if (dl_b->total_bw - fair_server_bw > 0) { + if (dl_b->total_bw - dl_server_bw > 0) { /* * Leaving at least one CPU for DEADLINE tasks seems a * wise thing to do. As said above, cpu is not offline * yet, so account for that. */ if (dl_bw_cpus(cpu) - 1) - overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); + overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0); else overflow = 1; } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 94164f2dec6dc..04daaac74f514 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -957,6 +957,8 @@ static void update_curr_scx(struct rq *rq) if (!curr->scx.slice) touch_core_sched(rq, curr); } + + dl_server_update(&rq->ext_server, delta_exec); }
static bool scx_dsq_priq_less(struct rb_node *node_a, @@ -1500,6 +1502,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p);
+ /* Start dl_server if this is the first task being enqueued */ + if (rq->scx.nr_running == 1) + dl_server_start(&rq->ext_server); + do_enqueue_task(rq, p, enq_flags, sticky_cpu); out: rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; @@ -2511,6 +2517,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf) return do_pick_task_scx(rq, rf, false); }
+/* + * Select the next task to run from the ext scheduling class. + * + * Use do_pick_task_scx() directly with @force_scx enabled, since the + * dl_server must always select a sched_ext task. + */ +static struct task_struct * +ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf) +{ + if (!scx_enabled()) + return NULL; + + return do_pick_task_scx(dl_se->rq, rf, true); +} + +/* + * Initialize the ext server deadline entity. + */ +void ext_server_init(struct rq *rq) +{ + struct sched_dl_entity *dl_se = &rq->ext_server; + + init_dl_entity(dl_se); + + dl_server_init(dl_se, rq, ext_server_pick_task); +} + #ifdef CONFIG_SCHED_CORE /** * scx_prio_less - Task ordering for core-sched @@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) static void switched_from_scx(struct rq *rq, struct task_struct *p) { scx_disable_task(p); + + /* + * After class switch, if the DL server is still active, restart it so + * that DL timers will be queued, in case SCX switched to higher class. + */ + if (dl_server_active(&rq->ext_server)) { + dl_server_stop(&rq->ext_server); + dl_server_start(&rq->ext_server); + } }
static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c174afe1dd177..53793b9a04185 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -530,6 +530,9 @@ static void update_curr_idle(struct rq *rq) se->exec_start = now;
dl_server_update_idle(&rq->fair_server, delta_exec); +#ifdef CONFIG_SCHED_CLASS_EXT + dl_server_update_idle(&rq->ext_server, delta_exec); +#endif }
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d30cca6870f5f..28c24cda1c3ce 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, extern void sched_init_dl_servers(void);
extern void fair_server_init(struct rq *rq); +extern void ext_server_init(struct rq *rq); extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); extern int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init); @@ -1151,6 +1152,7 @@ struct rq { struct dl_rq dl; #ifdef CONFIG_SCHED_CLASS_EXT struct scx_rq scx; + struct sched_dl_entity ext_server; #endif
struct sched_dl_entity fair_server; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index cf643a5ddedd2..ac268da917781 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) if (rq->fair_server.dl_server) __dl_server_attach_root(&rq->fair_server, rq);
+#ifdef CONFIG_SCHED_CLASS_EXT + if (rq->ext_server.dl_server) + __dl_server_attach_root(&rq->ext_server, rq); +#endif + rq_unlock_irqrestore(rq, &rf);
if (old_rd)
Hi!
On 17/12/25 10:35, Andrea Righi wrote:
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT.
...
v4: - initialize EXT server bandwidth reservation at init time and always keep it active (Andrea Righi) - check for rq->nr_running == 1 to determine when to account idle time (Juri Lelli) v3: - clarify that fair is not the only dl_server (Juri Lelli) - remove explicit stop to reduce timer reprogramming overhead (Juri Lelli) - do not restart pick_task() when it's invoked by the dl_server (Tejun Heo) - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi) v2: - drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Tested-by: Christian Loehle christian.loehle@arm.com Co-developed-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Andrea Righi arighi@nvidia.com
...
@@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
 	scx_disable_task(p);
+
+	/*
+	 * After class switch, if the DL server is still active, restart it so
+	 * that DL timers will be queued, in case SCX switched to higher class.
+	 */
+	if (dl_server_active(&rq->ext_server)) {
+		dl_server_stop(&rq->ext_server);
+		dl_server_start(&rq->ext_server);
+	}
 }
We might have discussed this already, in that case I forgot, sorry. But why do we need to start the server again if we switched from scx? I couldn't make sense of the comment that is already present.
Thanks, Juri
Hi Juri,
On Wed, Dec 17, 2025 at 04:49:02PM +0100, Juri Lelli wrote:
Hi!
On 17/12/25 10:35, Andrea Righi wrote:
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT.
...
v4: - initialize EXT server bandwidth reservation at init time and always keep it active (Andrea Righi) - check for rq->nr_running == 1 to determine when to account idle time (Juri Lelli) v3: - clarify that fair is not the only dl_server (Juri Lelli) - remove explicit stop to reduce timer reprogramming overhead (Juri Lelli) - do not restart pick_task() when it's invoked by the dl_server (Tejun Heo) - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi) v2: - drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Tested-by: Christian Loehle christian.loehle@arm.com Co-developed-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Andrea Righi arighi@nvidia.com
...
@@ -3090,6 +3123,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
 {
 	scx_disable_task(p);
+
+	/*
+	 * After class switch, if the DL server is still active, restart it so
+	 * that DL timers will be queued, in case SCX switched to higher class.
+	 */
+	if (dl_server_active(&rq->ext_server)) {
+		dl_server_stop(&rq->ext_server);
+		dl_server_start(&rq->ext_server);
+	}
 }
We might have discussed this already, in that case I forgot, sorry. But why do we need to start the server again if we switched from scx? I couldn't make sense of the comment that is already present.
The intention was to restart the DL timers, but thinking more about it, this appears more harmful than helpful, as it may actually disrupt accounting.
I did a quick test without the restart and everything seems to work. I'll run more tests and I'll send an updated patch if everything works well without the restart.
Thanks! -Andrea
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT.
A kselftest is also included later to confirm that both DL servers are functioning correctly:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1511) is 0.250000 seconds
# Runtime of RT task (PID 1512) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1514) is 0.250000 seconds
# Runtime of RT task (PID 1515) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of FAIR task (PID 1517) is 0.250000 seconds
# Runtime of RT task (PID 1518) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1521) is 0.250000 seconds
# Runtime of RT task (PID 1522) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
ok 1 rt_stall #
===== END =====
v5:
- do not restart the EXT server on switch_class() (Juri Lelli)
v4:
- initialize EXT server bandwidth reservation at init time and always keep it active (Andrea Righi)
- check for rq->nr_running == 1 to determine when to account idle time (Juri Lelli)
v3:
- clarify that fair is not the only dl_server (Juri Lelli)
- remove explicit stop to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when it's invoked by the dl_server (Tejun Heo)
- depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
v2:
- drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c     |  6 +++
 kernel/sched/deadline.c | 84 ++++++++++++++++++++++++++++++-----------
 kernel/sched/ext.c      | 33 ++++++++++++++++
 kernel/sched/idle.c     |  3 ++
 kernel/sched/sched.h    |  2 +
 kernel/sched/topology.c |  5 +++
 6 files changed, 110 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 41ba0be169117..a2400ee33a356 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8475,6 +8475,9 @@ int sched_cpu_dying(unsigned int cpu) dump_rq_tasks(rq, KERN_WARNING); } dl_server_stop(&rq->fair_server); +#ifdef CONFIG_SCHED_CLASS_EXT + dl_server_stop(&rq->ext_server); +#endif rq_unlock_irqrestore(rq, &rf);
calc_load_migrate(rq); @@ -8678,6 +8681,9 @@ void __init sched_init(void) hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); fair_server_init(rq); +#ifdef CONFIG_SCHED_CLASS_EXT + ext_server_init(rq); +#endif
#ifdef CONFIG_SCHED_CORE rq->core = rq; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 2789db5217cd4..88f2b5ed5678a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1445,8 +1445,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 dl_se->dl_defer_idle = 0;
/* - * The fair server can consume its runtime while throttled (not queued/ - * running as regular CFS). + * The DL server can consume its runtime while throttled (not + * queued / running as regular CFS). * * If the server consumes its entire runtime in this state. The server * is not required for the current period. Thus, reset the server by @@ -1531,10 +1531,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 }
/* - * The fair server (sole dl_server) does not account for real-time - * workload because it is running fair work. + * The dl_server does not account for real-time workload because it + * is running fair work. */ - if (dl_se == &rq->fair_server) + if (dl_se->dl_server) return;
#ifdef CONFIG_RT_GROUP_SCHED @@ -1569,9 +1569,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 * In the non-defer mode, the idle time is not accounted, as the * server provides a guarantee. * - * If the dl_server is in defer mode, the idle time is also considered - * as time available for the fair server, avoiding a penalty for the - * rt scheduler that did not consumed that time. + * If the dl_server is in defer mode, the idle time is also considered as + * time available for the dl_server, avoiding a penalty for the rt + * scheduler that did not consumed that time. */ void dl_server_update_idle(struct sched_dl_entity *dl_se, s64 delta_exec) { @@ -1810,6 +1810,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se) hrtimer_try_to_cancel(&dl_se->dl_timer); dl_se->dl_defer_armed = 0; dl_se->dl_throttled = 0; + dl_se->dl_defer_running = 0; dl_se->dl_defer_idle = 0; dl_se->dl_server_active = 0; } @@ -1844,6 +1845,18 @@ void sched_init_dl_servers(void) dl_se->dl_server = 1; dl_se->dl_defer = 1; setup_new_dl_entity(dl_se); + +#ifdef CONFIG_SCHED_CLASS_EXT + dl_se = &rq->ext_server; + + WARN_ON(dl_server(dl_se)); + + dl_server_apply_params(dl_se, runtime, period, 1); + + dl_se->dl_server = 1; + dl_se->dl_defer = 1; + setup_new_dl_entity(dl_se); +#endif } }
@@ -3183,6 +3196,36 @@ void dl_add_task_root_domain(struct task_struct *p) raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); }
+static void dl_server_add_bw(struct root_domain *rd, int cpu) +{ + struct sched_dl_entity *dl_se; + + dl_se = &cpu_rq(cpu)->fair_server; + if (dl_server(dl_se) && cpu_active(cpu)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); + +#ifdef CONFIG_SCHED_CLASS_EXT + dl_se = &cpu_rq(cpu)->ext_server; + if (dl_server(dl_se) && cpu_active(cpu)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); +#endif +} + +static u64 dl_server_read_bw(int cpu) +{ + u64 dl_bw = 0; + + if (cpu_rq(cpu)->fair_server.dl_server) + dl_bw += cpu_rq(cpu)->fair_server.dl_bw; + +#ifdef CONFIG_SCHED_CLASS_EXT + if (cpu_rq(cpu)->ext_server.dl_server) + dl_bw += cpu_rq(cpu)->ext_server.dl_bw; +#endif + + return dl_bw; +} + void dl_clear_root_domain(struct root_domain *rd) { int i; @@ -3201,12 +3244,8 @@ void dl_clear_root_domain(struct root_domain *rd) * dl_servers are not tasks. Since dl_add_task_root_domain ignores * them, we need to account for them here explicitly. */ - for_each_cpu(i, rd->span) { - struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; - - if (dl_server(dl_se) && cpu_active(i)) - __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i)); - } + for_each_cpu(i, rd->span) + dl_server_add_bw(rd, i); }
void dl_clear_root_domain_cpu(int cpu) @@ -3702,7 +3741,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) unsigned long flags, cap; struct dl_bw *dl_b; bool overflow = 0; - u64 fair_server_bw = 0; + u64 dl_server_bw = 0;
rcu_read_lock_sched(); dl_b = dl_bw_of(cpu); @@ -3735,27 +3774,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) cap -= arch_scale_cpu_capacity(cpu);
/* - * cpu is going offline and NORMAL tasks will be moved away - * from it. We can thus discount dl_server bandwidth - * contribution as it won't need to be servicing tasks after - * the cpu is off. + * cpu is going offline and NORMAL and EXT tasks will be + * moved away from it. We can thus discount dl_server + * bandwidth contribution as it won't need to be servicing + * tasks after the cpu is off. */ - if (cpu_rq(cpu)->fair_server.dl_server) - fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; + dl_server_bw = dl_server_read_bw(cpu);
/* * Not much to check if no DEADLINE bandwidth is present. * dl_servers we can discount, as tasks will be moved out the * offlined CPUs anyway. */ - if (dl_b->total_bw - fair_server_bw > 0) { + if (dl_b->total_bw - dl_server_bw > 0) { /* * Leaving at least one CPU for DEADLINE tasks seems a * wise thing to do. As said above, cpu is not offline * yet, so account for that. */ if (dl_bw_cpus(cpu) - 1) - overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); + overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0); else overflow = 1; } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 94164f2dec6dc..d8c457c853aef 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -957,6 +957,8 @@ static void update_curr_scx(struct rq *rq) if (!curr->scx.slice) touch_core_sched(rq, curr); } + + dl_server_update(&rq->ext_server, delta_exec); }
static bool scx_dsq_priq_less(struct rb_node *node_a, @@ -1500,6 +1502,10 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p);
+ /* Start dl_server if this is the first task being enqueued */ + if (rq->scx.nr_running == 1) + dl_server_start(&rq->ext_server); + do_enqueue_task(rq, p, enq_flags, sticky_cpu); out: rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; @@ -2511,6 +2517,33 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf) return do_pick_task_scx(rq, rf, false); }
+/* + * Select the next task to run from the ext scheduling class. + * + * Use do_pick_task_scx() directly with @force_scx enabled, since the + * dl_server must always select a sched_ext task. + */ +static struct task_struct * +ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf) +{ + if (!scx_enabled()) + return NULL; + + return do_pick_task_scx(dl_se->rq, rf, true); +} + +/* + * Initialize the ext server deadline entity. + */ +void ext_server_init(struct rq *rq) +{ + struct sched_dl_entity *dl_se = &rq->ext_server; + + init_dl_entity(dl_se); + + dl_server_init(dl_se, rq, ext_server_pick_task); +} + #ifdef CONFIG_SCHED_CORE /** * scx_prio_less - Task ordering for core-sched diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c174afe1dd177..53793b9a04185 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -530,6 +530,9 @@ static void update_curr_idle(struct rq *rq) se->exec_start = now;
dl_server_update_idle(&rq->fair_server, delta_exec); +#ifdef CONFIG_SCHED_CLASS_EXT + dl_server_update_idle(&rq->ext_server, delta_exec); +#endif }
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d30cca6870f5f..28c24cda1c3ce 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -414,6 +414,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, extern void sched_init_dl_servers(void);
extern void fair_server_init(struct rq *rq); +extern void ext_server_init(struct rq *rq); extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); extern int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init); @@ -1151,6 +1152,7 @@ struct rq { struct dl_rq dl; #ifdef CONFIG_SCHED_CLASS_EXT struct scx_rq scx; + struct sched_dl_entity ext_server; #endif
struct sched_dl_entity fair_server; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index cf643a5ddedd2..ac268da917781 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) if (rq->fair_server.dl_server) __dl_server_attach_root(&rq->fair_server, rq);
+#ifdef CONFIG_SCHED_CLASS_EXT + if (rq->ext_server.dl_server) + __dl_server_attach_root(&rq->ext_server, rq); +#endif + rq_unlock_irqrestore(rq, &rf);
if (old_rd)
From: Joel Fernandes <joelagnelf@nvidia.com>
When a sched_ext scheduler is loaded, tasks in the fair class are automatically moved to the sched_ext class. Add support to modify the ext server parameters in the same way as the fair server parameters.
Re-use common code between ext and fair servers as needed.
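Since the kernel/sched/debug.c diff below is line-wrapped in this posting, one of the resulting per-class wrappers is reproduced here; the fair and ext handlers differ only in which server entity they pass to the common helper:

  static ssize_t
  sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
                                 size_t cnt, loff_t *ppos)
  {
          long cpu = (long) ((struct seq_file *) filp->private_data)->private;
          struct rq *rq = cpu_rq(cpu);

          return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME,
                                           &rq->ext_server);
  }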
v2:
- use dl_se->dl_server to determine if dl_se is a DL server (Peter Zijlstra)
- depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/debug.c | 157 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 133 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index dd793f8f3858a..2e9896668c6fd 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -336,14 +336,16 @@ enum dl_param { DL_PERIOD, };
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ -static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ +static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ +static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos, enum dl_param param) +static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos, enum dl_param param, + void *server) { long cpu = (long) ((struct seq_file *) filp->private_data)->private; struct rq *rq = cpu_rq(cpu); + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; u64 runtime, period; int retval = 0; size_t err; @@ -356,8 +358,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu scoped_guard (rq_lock_irqsave, rq) { bool is_active;
- runtime = rq->fair_server.dl_runtime; - period = rq->fair_server.dl_period; + runtime = dl_se->dl_runtime; + period = dl_se->dl_period;
switch (param) { case DL_RUNTIME: @@ -373,25 +375,25 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu }
if (runtime > period || - period > fair_server_period_max || - period < fair_server_period_min) { + period > dl_server_period_max || + period < dl_server_period_min) { return -EINVAL; }
- is_active = dl_server_active(&rq->fair_server); + is_active = dl_server_active(dl_se); if (is_active) { update_rq_clock(rq); - dl_server_stop(&rq->fair_server); + dl_server_stop(dl_se); }
- retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); + retval = dl_server_apply_params(dl_se, runtime, period, 0);
if (!runtime) - printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", - cpu_of(rq)); + printk_deferred("%s server disabled in CPU %d, system may crash due to starvation.\n", + server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
if (is_active && runtime) - dl_server_start(&rq->fair_server); + dl_server_start(dl_se);
if (retval < 0) return retval; @@ -401,36 +403,42 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu return cnt; }
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) +static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param, + void *server) { - unsigned long cpu = (unsigned long) m->private; - struct rq *rq = cpu_rq(cpu); + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; u64 value;
switch (param) { case DL_RUNTIME: - value = rq->fair_server.dl_runtime; + value = dl_se->dl_runtime; break; case DL_PERIOD: - value = rq->fair_server.dl_period; + value = dl_se->dl_period; break; }
seq_printf(m, "%llu\n", value); return 0; - }
static ssize_t sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, + &rq->fair_server); }
static int sched_fair_server_runtime_show(struct seq_file *m, void *v) { - return sched_fair_server_show(m, v, DL_RUNTIME); + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server); }
static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) @@ -446,16 +454,57 @@ static const struct file_operations fair_server_runtime_fops = { .release = single_release, };
+#ifdef CONFIG_SCHED_CLASS_EXT +static ssize_t +sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, + &rq->ext_server); +} + +static int sched_ext_server_runtime_show(struct seq_file *m, void *v) +{ + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server); +} + +static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_ext_server_runtime_show, inode->i_private); +} + +static const struct file_operations ext_server_runtime_fops = { + .open = sched_ext_server_runtime_open, + .write = sched_ext_server_runtime_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; +#endif /* CONFIG_SCHED_CLASS_EXT */ + static ssize_t sched_fair_server_period_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, + &rq->fair_server); }
static int sched_fair_server_period_show(struct seq_file *m, void *v) { - return sched_fair_server_show(m, v, DL_PERIOD); + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server); }
static int sched_fair_server_period_open(struct inode *inode, struct file *filp) @@ -471,6 +520,40 @@ static const struct file_operations fair_server_period_fops = { .release = single_release, };
+#ifdef CONFIG_SCHED_CLASS_EXT +static ssize_t +sched_ext_server_period_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, + &rq->ext_server); +} + +static int sched_ext_server_period_show(struct seq_file *m, void *v) +{ + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server); +} + +static int sched_ext_server_period_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_ext_server_period_show, inode->i_private); +} + +static const struct file_operations ext_server_period_fops = { + .open = sched_ext_server_period_open, + .write = sched_ext_server_period_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; +#endif /* CONFIG_SCHED_CLASS_EXT */ + static struct dentry *debugfs_sched;
static void debugfs_fair_server_init(void) @@ -494,6 +577,29 @@ static void debugfs_fair_server_init(void) } }
+#ifdef CONFIG_SCHED_CLASS_EXT +static void debugfs_ext_server_init(void) +{ + struct dentry *d_ext; + unsigned long cpu; + + d_ext = debugfs_create_dir("ext_server", debugfs_sched); + if (!d_ext) + return; + + for_each_possible_cpu(cpu) { + struct dentry *d_cpu; + char buf[32]; + + snprintf(buf, sizeof(buf), "cpu%lu", cpu); + d_cpu = debugfs_create_dir(buf, d_ext); + + debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops); + debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops); + } +} +#endif /* CONFIG_SCHED_CLASS_EXT */ + static __init int sched_init_debug(void) { struct dentry __maybe_unused *numa; @@ -532,6 +638,9 @@ static __init int sched_init_debug(void) debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
debugfs_fair_server_init(); +#ifdef CONFIG_SCHED_CLASS_EXT + debugfs_ext_server_init(); +#endif
return 0; }
Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class.
v4:
- cover full cycle: fair stop, ext start, ext stop, fair start (Christian Loehle)
v3:
- add a comment to explain the 4% threshold (Emil Tsalapatis)
v2:
- replaced occurrences of CFS in the test with EXT (Joel Fernandes)
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c       |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c | 240 ++++++++++++++++++
 3 files changed, 264 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5fe45f9c5f8fd..c9255d1499b6e 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -183,6 +183,7 @@ auto-test-targets := \
 	select_cpu_dispatch_bad_dsq \
 	select_cpu_dispatch_dbl_dsp \
 	select_cpu_vtime \
+	rt_stall \
 	test_example \
testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c new file mode 100644 index 0000000000000..80086779dd1eb --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that verified if RT tasks can stall SCHED_EXT tasks. + * + * Copyright (c) 2025 NVIDIA Corporation. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops rt_stall_ops = { + .exit = (void *)rt_stall_exit, + .name = "rt_stall", +}; diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c new file mode 100644 index 0000000000000..015200f80f6e2 --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -0,0 +1,240 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sched.h> +#include <sys/prctl.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <time.h> +#include <linux/sched.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <unistd.h> +#include "rt_stall.bpf.skel.h" +#include "scx_test.h" +#include "../kselftest.h" + +#define CORE_ID 0 /* CPU to pin tasks to */ +#define RUN_TIME 5 /* How long to run the test in seconds */ + +/* Simple busy-wait function for test tasks */ +static void process_func(void) +{ + while (1) { + /* Busy wait */ + for (volatile unsigned long i = 0; i < 10000000UL; i++) + ; + } +} + +/* Set CPU affinity to a specific core */ +static void set_affinity(int cpu) +{ + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(cpu, &mask); + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } +} + +/* Set task scheduling policy and priority */ +static void set_sched(int policy, int priority) +{ + struct sched_param param; + + param.sched_priority = priority; + if (sched_setscheduler(0, policy, ¶m) != 0) { + perror("sched_setscheduler"); + exit(EXIT_FAILURE); + } +} + +/* Get process runtime from /proc/<pid>/stat */ +static float get_process_runtime(int pid) +{ + char path[256]; + FILE *file; + long utime, stime; + int fields; + + snprintf(path, sizeof(path), "/proc/%d/stat", pid); + file = fopen(path, "r"); + if (file == NULL) { + perror("Failed to open stat file"); + return -1; + } + + /* Skip the first 13 fields and read the 14th and 15th */ + fields = fscanf(file, + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", + &utime, &stime); + fclose(file); + + if (fields != 2) { + fprintf(stderr, "Failed to read stat file\n"); + return -1; + } + + /* Calculate the total time spent in the process */ + long total_time = utime + stime; + long ticks_per_second = sysconf(_SC_CLK_TCK); + float runtime_seconds = total_time * 1.0 / ticks_per_second; + + return runtime_seconds; +} + +static enum scx_test_status setup(void **ctx) +{ + struct rt_stall *skel; + + skel = rt_stall__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static bool sched_stress_test(bool is_ext) +{ + /* + * We're expecting the 
EXT task to get around 5% of CPU time when + * competing with the RT task (small 1% fluctuations are expected). + * + * However, the EXT task should get at least 4% of the CPU to prove + * that the EXT deadline server is working correctly. A percentage + * less than 4% indicates a bug where RT tasks can potentially + * stall SCHED_EXT tasks, causing the test to fail. + */ + const float expected_min_ratio = 0.04; /* 4% */ + const char *class_str = is_ext ? "EXT" : "FAIR"; + + float ext_runtime, rt_runtime, actual_ratio; + int ext_pid, rt_pid; + + ksft_print_header(); + ksft_set_plan(1); + + /* Create and set up a EXT task */ + ext_pid = fork(); + if (ext_pid == 0) { + set_affinity(CORE_ID); + process_func(); + exit(0); + } else if (ext_pid < 0) { + perror("fork task"); + ksft_exit_fail(); + } + + /* Create an RT task */ + rt_pid = fork(); + if (rt_pid == 0) { + set_affinity(CORE_ID); + set_sched(SCHED_FIFO, 50); + process_func(); + exit(0); + } else if (rt_pid < 0) { + perror("fork for RT task"); + ksft_exit_fail(); + } + + /* Let the processes run for the specified time */ + sleep(RUN_TIME); + + /* Get runtime for the EXT task */ + ext_runtime = get_process_runtime(ext_pid); + if (ext_runtime == -1) + ksft_exit_fail_msg("Error getting runtime for %s task (PID %d)\n", + class_str, ext_pid); + ksft_print_msg("Runtime of %s task (PID %d) is %f seconds\n", + class_str, ext_pid, ext_runtime); + + /* Get runtime for the RT task */ + rt_runtime = get_process_runtime(rt_pid); + if (rt_runtime == -1) + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); + + /* Kill the processes */ + kill(ext_pid, SIGKILL); + kill(rt_pid, SIGKILL); + waitpid(ext_pid, NULL, 0); + waitpid(rt_pid, NULL, 0); + + /* Verify that the scx task got enough runtime */ + actual_ratio = ext_runtime / (ext_runtime + rt_runtime); + ksft_print_msg("%s task got %.2f%% of total runtime\n", + class_str, actual_ratio * 100); + + if (actual_ratio >= expected_min_ratio) { + ksft_test_result_pass("PASS: %s task got more than %.2f%% of runtime\n", + class_str, expected_min_ratio * 100); + return true; + } + ksft_test_result_fail("FAIL: %s task got less than %.2f%% of runtime\n", + class_str, expected_min_ratio * 100); + return false; +} + +static enum scx_test_status run(void *ctx) +{ + struct rt_stall *skel = ctx; + struct bpf_link *link = NULL; + bool res; + int i; + + /* + * Test if the dl_server is working both with and without the + * sched_ext scheduler attached. + * + * This ensures all the scenarios are covered: + * - fair_server stop -> ext_server start + * - ext_server stop -> fair_server stop + */ + for (i = 0; i < 4; i++) { + bool is_ext = i % 2; + + if (is_ext) { + memset(&skel->data->uei, 0, sizeof(skel->data->uei)); + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + } + res = sched_stress_test(is_ext); + if (is_ext) { + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); + bpf_link__destroy(link); + } + + if (!res) + ksft_exit_fail(); + } + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct rt_stall *skel = ctx; + + rt_stall__destroy(skel); +} + +struct scx_test rt_stall = { + .name = "rt_stall", + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&rt_stall)
From: Joel Fernandes <joelagnelf@nvidia.com>
Add a new kselftest to verify that the total_bw value in /sys/kernel/debug/sched/debug remains consistent across all CPUs under different sched_ext BPF program states:
1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded
The test runs CPU stress threads to ensure DL server bandwidth values stabilize before checking consistency. This helps catch potential issues with DL server bandwidth accounting during sched_ext transitions.
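Since the test source below is line-wrapped in this posting, the core consistency check is reproduced here for readability: every total_bw line in /sys/kernel/debug/sched/debug must report the same value.

  /* Reproduced from the test below: all CPUs must report the same total_bw. */
  static bool verify_total_bw_consistency(long *bw_values, int count)
  {
          int i;
          long first_value;

          if (count <= 0)
                  return false;

          first_value = bw_values[0];

          for (i = 1; i < count; i++) {
                  if (bw_values[i] != first_value) {
                          SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
                                  first_value, i, bw_values[i]);
                          return false;
                  }
          }

          return true;
  }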
v2:
- small coding style fixes (Andrea Righi)
Tested-by: Christian Loehle <christian.loehle@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +
 tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index c9255d1499b6e..2c601a7eaff5f 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -185,6 +185,7 @@ auto-test-targets := \
 	select_cpu_vtime \
 	rt_stall \
 	test_example \
+	total_bw \
testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
new file mode 100644
index 0000000000000..5b0a619bab86e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test to verify that total_bw value remains consistent across all CPUs
+ * in different BPF program states.
+ *
+ * Copyright (C) 2025 NVIDIA Corporation.
+ */
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <pthread.h>
+#include <scx/common.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "minimal.bpf.skel.h"
+#include "scx_test.h"
+
+#define MAX_CPUS 512
+#define STRESS_DURATION_SEC 5
+
+struct total_bw_ctx {
+	struct minimal *skel;
+	long baseline_bw[MAX_CPUS];
+	int nr_cpus;
+};
+
+static void *cpu_stress_thread(void *arg)
+{
+	volatile int i;
+	time_t end_time = time(NULL) + STRESS_DURATION_SEC;
+
+	while (time(NULL) < end_time)
+		for (i = 0; i < 1000000; i++)
+			;
+
+	return NULL;
+}
+
+/*
+ * The first enqueue on a CPU causes the DL server to start, for that
+ * reason run stressor threads in the hopes it schedules on all CPUs.
+ */
+static int run_cpu_stress(int nr_cpus)
+{
+	pthread_t *threads;
+	int i, ret = 0;
+
+	threads = calloc(nr_cpus, sizeof(pthread_t));
+	if (!threads)
+		return -ENOMEM;
+
+	/* Create threads to run on each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) {
+			ret = -errno;
+			fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret));
+			break;
+		}
+	}
+
+	/* Wait for all threads to complete */
+	for (i = 0; i < nr_cpus; i++) {
+		if (threads[i])
+			pthread_join(threads[i], NULL);
+	}
+
+	free(threads);
+	return ret;
+}
+
+static int read_total_bw_values(long *bw_values, int max_cpus)
+{
+	FILE *fp;
+	char line[256];
+	int cpu_count = 0;
+
+	fp = fopen("/sys/kernel/debug/sched/debug", "r");
+	if (!fp) {
+		SCX_ERR("Failed to open debug file");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), fp)) {
+		char *bw_str = strstr(line, "total_bw");
+
+		if (bw_str) {
+			bw_str = strchr(bw_str, ':');
+			if (bw_str) {
+				/* Only store up to max_cpus values */
+				if (cpu_count < max_cpus)
+					bw_values[cpu_count] = atol(bw_str + 1);
+				cpu_count++;
+			}
+		}
+	}
+
+	fclose(fp);
+	return cpu_count;
+}
+
+static bool verify_total_bw_consistency(long *bw_values, int count)
+{
+	int i;
+	long first_value;
+
+	if (count <= 0)
+		return false;
+
+	first_value = bw_values[0];
+
+	for (i = 1; i < count; i++) {
+		if (bw_values[i] != first_value) {
+			SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
+				first_value, i, bw_values[i]);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static int fetch_verify_total_bw(long *bw_values, int nr_cpus)
+{
+	int attempts = 0;
+	int max_attempts = 10;
+	int count;
+
+	/*
+	 * The first enqueue on a CPU causes the DL server to start, for that
+	 * reason run stressor threads in the hopes it schedules on all CPUs.
+	 */
+	if (run_cpu_stress(nr_cpus) < 0) {
+		SCX_ERR("Failed to run CPU stress");
+		return -1;
+	}
+
+	/* Try multiple times to get stable values */
+	while (attempts < max_attempts) {
+		count = read_total_bw_values(bw_values, nr_cpus);
+		fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus);
+		/* If system has more CPUs than we're testing, that's OK */
+		if (count < nr_cpus) {
+			SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count);
+			attempts++;
+			sleep(1);
+			continue;
+		}
+
+		/* Only verify the CPUs we're testing */
+		if (verify_total_bw_consistency(bw_values, nr_cpus)) {
+			fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]);
+			return 0;
+		}
+
+		attempts++;
+		sleep(1);
+	}
+
+	return -1;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct total_bw_ctx *test_ctx;
+
+	if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) {
+		fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n");
+		return SCX_TEST_SKIP;
+	}
+
+	test_ctx = calloc(1, sizeof(*test_ctx));
+	if (!test_ctx)
+		return SCX_TEST_FAIL;
+
+	test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	if (test_ctx->nr_cpus <= 0) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */
+	if (test_ctx->nr_cpus > MAX_CPUS)
+		test_ctx->nr_cpus = MAX_CPUS;
+
+	/* Test scenario 1: BPF program not loaded */
+	/* Read and verify baseline total_bw before loading BPF program */
+	fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable baseline values");
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	/* Load the BPF skeleton */
+	test_ctx->skel = minimal__open();
+	if (!test_ctx->skel) {
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	SCX_ENUM_INIT(test_ctx->skel);
+	if (minimal__load(test_ctx->skel)) {
+		minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+		return SCX_TEST_FAIL;
+	}
+
+	*ctx = test_ctx;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+	struct bpf_link *link;
+	long loaded_bw[MAX_CPUS];
+	long unloaded_bw[MAX_CPUS];
+	int i;
+
+	/* Test scenario 2: BPF program loaded */
+	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	fprintf(stderr, "BPF program loaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values with BPF loaded");
+		bpf_link__destroy(link);
+		return SCX_TEST_FAIL;
+	}
+	bpf_link__destroy(link);
+
+	/* Test scenario 3: BPF program unloaded */
+	fprintf(stderr, "BPF program unloaded, reading total_bw values\n");
+	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
+		SCX_ERR("Failed to get stable values after BPF unload");
+		return SCX_TEST_FAIL;
+	}
+
+	/* Verify all three scenarios have the same total_bw values */
+	for (i = 0; i < test_ctx->nr_cpus; i++) {
+		if (test_ctx->baseline_bw[i] != loaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], loaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+
+		if (test_ctx->baseline_bw[i] != unloaded_bw[i]) {
+			SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld",
+				i, test_ctx->baseline_bw[i], unloaded_bw[i]);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct total_bw_ctx *test_ctx = ctx;
+
+	if (test_ctx) {
+		if (test_ctx->skel)
+			minimal__destroy(test_ctx->skel);
+		free(test_ctx);
+	}
+}
+
+struct scx_test total_bw = {
+	.name = "total_bw",
+	.description = "Verify total_bw consistency across BPF program states",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&total_bw)
Hi!
On 17/12/25 22:55, Andrea Righi wrote:
sched_ext currently suffers from starvation by RT tasks: the same workload, when converted to EXT, can receive zero runtime if RT runs 100% of the time, causing EXT processes to stall. Fix it by adding a DL server for EXT.
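For readers unfamiliar with the dl_server infrastructure, the hook-up follows the same pattern as the existing fair server: a per-rq sched_dl_entity is registered via dl_server_init() together with a has-tasks callback and a pick callback. A rough sketch of that shape (illustrative only, not the patch itself: ext_server_has_tasks(), ext_server_init() and the rq->ext_server field are named by analogy with the fair server, and the pick path is simplified):

/*
 * Sketch modeled on the fair server hook-up; helper names other than
 * ext_server_pick_task() are assumptions, not taken from the patch.
 */
static bool ext_server_has_tasks(struct sched_dl_entity *dl_se)
{
	/* The EXT server has work whenever sched_ext tasks are queued. */
	return !!dl_se->rq->scx.nr_running;
}

static struct task_struct *ext_server_pick_task(struct sched_dl_entity *dl_se)
{
	/* Delegate to the regular sched_ext pick path for this runqueue. */
	return pick_task_scx(dl_se->rq);
}

void ext_server_init(struct rq *rq)
{
	struct sched_dl_entity *dl_se = &rq->ext_server;

	init_dl_entity(dl_se);
	dl_server_init(dl_se, rq, ext_server_has_tasks, ext_server_pick_task);
}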
A kselftest is also included later in the series to confirm that both DL servers are functioning correctly:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1511) is 0.250000 seconds
# Runtime of RT task (PID 1512) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1514) is 0.250000 seconds
# Runtime of RT task (PID 1515) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of FAIR task (PID 1517) is 0.250000 seconds
# Runtime of RT task (PID 1518) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1521) is 0.250000 seconds
# Runtime of RT task (PID 1522) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
ok 1 rt_stall #
===== END =====
v5:
 - do not restart the EXT server on switch_class() (Juri Lelli)
v4:
 - initialize EXT server bandwidth reservation at init time and always keep it active (Andrea Righi)
 - check for rq->nr_running == 1 to determine when to account idle time (Juri Lelli)
v3:
 - clarify that fair is not the only dl_server (Juri Lelli)
 - remove explicit stop to reduce timer reprogramming overhead (Juri Lelli)
 - do not restart pick_task() when it's invoked by the dl_server (Tejun Heo)
 - depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
v2:
 - drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Tested-by: Christian Loehle christian.loehle@arm.com
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
This new version looks good to me, thanks!
Reviewed-by: Juri Lelli juri.lelli@redhat.com
Best, Juri