 
sched_ext tasks can be starved by long-running RT tasks, especially since RT throttling was replaced by deadline servers, which currently boost only SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling sched_ext tasks. This is fairly common on distributions or environments where applications like video compositors, audio services, etc. run as RT tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s) ... CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738 curr=sway[994] class=rt_sched_class R kworker/0:0[106377] -5043ms scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000 cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality schedulers can't do much: RT tasks run outside their control and can potentially consume 100% of the CPU bandwidth.
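For illustration, the scenario boils down to something like the sketch below: a plain task and a SCHED_FIFO spinner pinned to the same CPU while an scx scheduler is loaded. This is essentially what the rt_stall selftest added at the end of the series automates; the sketch is only meant to show the shape of the problem, not to replace the test.

/*
 * Sketch only: reproduce the starvation scenario on CPU 0 (needs root for
 * SCHED_FIFO; run while an scx scheduler is loaded so the plain task ends
 * up in SCHED_EXT).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static void pin_to_cpu0(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void)
{
	pid_t ext_pid, rt_pid;

	ext_pid = fork();
	if (ext_pid == 0) {		/* plain task, runs as SCHED_EXT */
		pin_to_cpu0();
		for (;;)
			;
	}

	rt_pid = fork();
	if (rt_pid == 0) {		/* RT spinner on the same CPU */
		struct sched_param param = { .sched_priority = 50 };

		pin_to_cpu0();
		sched_setscheduler(0, SCHED_FIFO, &param);
		for (;;)
			;
	}

	sleep(10);	/* then compare utime/stime in /proc/<pid>/stat */
	kill(ext_pid, SIGKILL);
	kill(rt_pid, SIGKILL);
	return 0;
}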
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are also boosted and do not suffer starvation.
Two kselftests are also provided to verify the starvation fix and that bandwidth allocation is correct.
== Highlights in this version ==
- wait for inactive_task_timer() to fire before removing the bandwidth reservation
  (Juri/Peter: please check if this new dl_server_remove_params() implementation makes sense to you)
- removed the explicit dl_server_stop() from dequeue_task_scx() and rely on the delayed stop behavior
  (Juri/Peter: ditto)
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v10:
- reordered patches to better isolate sched_ext changes vs sched/deadline changes (Andrea Righi)
- define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
- add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
- wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli)
- remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when invoked by the dl_server (Tejun Heo)
- rename rq_dl_server to dl_server (Peter Zijlstra)
- fixed a missing dl_server start in dl_server_on() (Christian Loehle)
- add a comment to the rt_stall selftest to better explain the 4% threshold (Emil Tsalapatis)
Changes in v9:
- Drop the ->balance() logic as its functionality is now integrated into ->pick_task(), allowing dl_server to call pick_task_scx() directly
- Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
Changes in v6:
- Added Acks to a few patches
- Fixed a few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (5):
  sched/deadline: Add support to initialize and remove dl_server bandwidth
  sched_ext: Add a DL server for sched_ext tasks
  sched/deadline: Account ext server bandwidth
  sched_ext: Selectively enable ext and fair DL servers
  selftests/sched_ext: Add test for sched_ext dl_server
Joel Fernandes (6):
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/deadline: Clear the defer params
  sched/deadline: Add a server arg to dl_server_update_idle_time()
  sched/debug: Add support to change sched_ext server params
  selftests/sched_ext: Add test for DL server total_bw consistency
 kernel/sched/core.c                              |   3 +
 kernel/sched/deadline.c                          | 169 +++++++++++---
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               | 144 +++++++++++-
 kernel/sched/fair.c                              |   2 +-
 kernel/sched/idle.c                              |   2 +-
 kernel/sched/sched.h                             |   8 +-
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 222 ++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 12 files changed, 955 insertions(+), 77 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
 
            From: Joel Fernandes joelagnelf@nvidia.com
Updating "ppos" on error conditions does not make much sense. The pattern is to return the error code directly without modifying the position, or modify the position on success and return the number of bytes written.
Since on success the return value of apply is 0, there is no point in modifying ppos either. Fix it by removing all of this and just returning the error code, or the number of bytes written on success.
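For reference, a minimal sketch of the intended pattern for this kind of write handler; apply_params() here is a hypothetical stand-in for dl_server_apply_params() and the surrounding locking, not an actual kernel function:

/* Sketch only: return the error, or advance *ppos and return bytes written. */
static ssize_t example_write(struct file *filp, const char __user *ubuf,
			     size_t cnt, loff_t *ppos)
{
	int retval;

	retval = apply_params(filp, ubuf, cnt);	/* hypothetical helper */
	if (retval < 0)
		return retval;		/* error: leave *ppos untouched */

	*ppos += cnt;			/* success: advance the position */
	return cnt;			/* report the bytes consumed */
}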
Acked-by: Tejun Heo tj@kernel.org
Reviewed-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/sched/debug.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 02e16b70a7901..6cf9be6eea49a 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -345,8 +345,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu long cpu = (long) ((struct seq_file *) filp->private_data)->private; struct rq *rq = cpu_rq(cpu); u64 runtime, period; + int retval = 0; size_t err; - int retval; u64 value;
err = kstrtoull_from_user(ubuf, cnt, 10, &value); @@ -380,8 +380,6 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu dl_server_stop(&rq->fair_server);
retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); - if (retval) - cnt = retval;
if (!runtime) printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", @@ -389,6 +387,9 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu
if (rq->cfs.h_nr_queued) dl_server_start(&rq->fair_server); + + if (retval < 0) + return retval; }
*ppos += cnt;
 
            From: Joel Fernandes joelagnelf@nvidia.com
Currently the DL server interface for applying parameters checks CFS internals to identify if the server is active. This is error-prone and makes it difficult to add new servers in the future.
Fix it by using dl_server_active(), which is also used by the DL server code to determine whether the DL server was started.
Acked-by: Tejun Heo tj@kernel.org
Reviewed-by: Juri Lelli juri.lelli@redhat.com
Reviewed-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/sched/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 6cf9be6eea49a..e71f6618c1a6a 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -354,6 +354,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu return err;
scoped_guard (rq_lock_irqsave, rq) { + bool is_active; + runtime = rq->fair_server.dl_runtime; period = rq->fair_server.dl_period;
@@ -376,8 +378,11 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu return -EINVAL; }
- update_rq_clock(rq); - dl_server_stop(&rq->fair_server); + is_active = dl_server_active(&rq->fair_server); + if (is_active) { + update_rq_clock(rq); + dl_server_stop(&rq->fair_server); + }
retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
@@ -385,7 +390,7 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", cpu_of(rq));
- if (rq->cfs.h_nr_queued) + if (is_active) dl_server_start(&rq->fair_server);
if (retval < 0)
 
            From: Joel Fernandes joelagnelf@nvidia.com
The defer params were not cleared in __dl_clear_params. Clear them.
Without this, some of my test cases are flaky and the DL timer does not start correctly AFAICS.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Acked-by: Juri Lelli juri.lelli@redhat.com
Reviewed-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 48357d4609bf9..4aefb34a1d38b 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -3387,6 +3387,9 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se) dl_se->dl_non_contending = 0; dl_se->dl_overrun = 0; dl_se->dl_server = 0; + dl_se->dl_defer = 0; + dl_se->dl_defer_running = 0; + dl_se->dl_defer_armed = 0;
#ifdef CONFIG_RT_MUTEXES dl_se->pi_se = dl_se;
 
When switching between sched_ext and fair tasks (and vice versa), we need support for initializing and removing the bandwidth contribution of either DL server.
Add support for handling these transitions.
Moreover, remove references specific to the fair server, in preparation for adding the ext server.
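As an illustration (not part of the patch), the expected calling pattern for the new helpers roughly looks like the sketch below; later patches in the series use this shape from the sched_ext enable/disable paths:

/*
 * Sketch only: expected usage of dl_server_remove_params() for a generic
 * DL server entity, with the rq lock held by the caller.
 */
static void example_teardown_server(struct rq *rq, struct sched_dl_entity *dl_se)
{
	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);
	update_rq_clock(rq);

	/* Stop the server first if it is still active... */
	if (dl_server_active(dl_se))
		dl_server_stop(dl_se);

	/*
	 * ...then drop its bandwidth reservation. Note that the rq lock may
	 * be temporarily released internally while waiting for the inactive
	 * timer to fire.
	 */
	dl_server_remove_params(dl_se, rq, &rf);

	rq_unlock_irqrestore(rq, &rf);
}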
v2:
- wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli)
- add WARN_ON_ONCE(!cpus) sanity check in dl_server_apply_params() (Andrea Righi)
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
---
 kernel/sched/deadline.c | 96 ++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h    |  3 ++
 2 files changed, 84 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 4aefb34a1d38b..8aff1aba7b8a9 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1441,8 +1441,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 dl_se->runtime -= scaled_delta_exec;
/* - * The fair server can consume its runtime while throttled (not queued/ - * running as regular CFS). + * The dl_server can consume its runtime while throttled (not + * queued / running as regular fair task). * * If the server consumes its entire runtime in this state. The server * is not required for the current period. Thus, reset the server by @@ -1501,10 +1501,10 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 }
/* - * The fair server (sole dl_server) does not account for real-time - * workload because it is running fair work. + * The dl_server does not account real-time workload because it + * runs non-RT tasks. */ - if (dl_se == &rq->fair_server) + if (dl_se->dl_server) return;
#ifdef CONFIG_RT_GROUP_SCHED @@ -1540,8 +1540,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 * server provides a guarantee. * * If the dl_server is in defer mode, the idle time is also considered - * as time available for the fair server, avoiding a penalty for the - * rt scheduler that did not consumed that time. + * as time available for the dl_server, avoiding a penalty for the rt + * scheduler that did not consumed that time. */ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p) { @@ -1570,11 +1570,37 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec) { - /* 0 runtime = fair server disabled */ + /* 0 runtime = dl_server disabled */ if (dl_se->dl_runtime) update_curr_dl_se(dl_se->rq, dl_se, delta_exec); }
+/** + * dl_server_init_params - Initialize bandwidth reservation for a DL server + * @dl_se: The DL server entity to remove bandwidth for + * + * This function initializes the bandwidth reservation for a DL server + * entity, its bandwidth accounting and server state. + * + * Returns: 0 on success, negative error code on failure + */ +int dl_server_init_params(struct sched_dl_entity *dl_se) +{ + u64 runtime = 50 * NSEC_PER_MSEC; + u64 period = 1000 * NSEC_PER_MSEC; + int err; + + err = dl_server_apply_params(dl_se, runtime, period, 1); + if (err) + return err; + + dl_se->dl_server = 1; + dl_se->dl_defer = 1; + setup_new_dl_entity(dl_se); + + return err; +} + void dl_server_start(struct sched_dl_entity *dl_se) { struct rq *rq = dl_se->rq; @@ -1614,8 +1640,7 @@ void sched_init_dl_servers(void) struct sched_dl_entity *dl_se;
for_each_online_cpu(cpu) { - u64 runtime = 50 * NSEC_PER_MSEC; - u64 period = 1000 * NSEC_PER_MSEC; + int err;
rq = cpu_rq(cpu);
@@ -1625,11 +1650,8 @@ void sched_init_dl_servers(void)
WARN_ON(dl_server(dl_se));
- dl_server_apply_params(dl_se, runtime, period, 1); - - dl_se->dl_server = 1; - dl_se->dl_defer = 1; - setup_new_dl_entity(dl_se); + err = dl_server_init_params(dl_se); + WARN_ON_ONCE(err); } }
@@ -1663,6 +1685,9 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio guard(raw_spinlock)(&dl_b->lock);
cpus = dl_bw_cpus(cpu); + if (WARN_ON_ONCE(!cpus)) + return -ENODEV; + cap = dl_bw_capacity(cpu);
if (__dl_overflow(dl_b, cap, old_bw, new_bw)) @@ -1678,6 +1703,12 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio dl_rq_change_utilization(rq, dl_se, new_bw); }
+ /* Clear these so that the dl_server is reinitialized */ + if (new_bw == 0) { + dl_se->dl_defer = 0; + dl_se->dl_server = 0; + } + dl_se->dl_runtime = runtime; dl_se->dl_deadline = period; dl_se->dl_period = period; @@ -1691,6 +1722,41 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio return retval; }
+/** + * dl_server_remove_params - Remove bandwidth reservation for a DL server + * @dl_se: The DL server entity to remove bandwidth for + * + * This function removes the bandwidth reservation for a DL server entity, + * cleaning up all bandwidth accounting and server state. + * + * Returns: 0 on success, negative error code on failure + */ +int dl_server_remove_params(struct sched_dl_entity *dl_se, + struct rq *rq, struct rq_flags *rf) +{ + if (!dl_se->dl_server) + return 0; /* Already disabled */ + + /* + * First dequeue if still queued. It should not be queued since + * we call this only after the last dl_server_stop(). + */ + if (WARN_ON_ONCE(on_dl_rq(dl_se))) + dequeue_dl_entity(dl_se, DEQUEUE_SLEEP); + + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == -1) { + rq_unlock_irqrestore(rq, rf); + + hrtimer_cancel(&dl_se->inactive_timer); + + rq_lock_irqsave(rq, rf); + update_rq_clock(rq); + } + + /* Remove bandwidth reservation */ + return dl_server_apply_params(dl_se, 0, dl_se->dl_period, false); +} + /* * Update the current task's runtime statistics (provided it is still * a -deadline task and has not been removed from the dl_rq). diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 27aae2a298f8b..4a0bf38dc71e9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -417,6 +417,9 @@ extern void fair_server_init(struct rq *rq); extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); extern int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init); +extern int dl_server_init_params(struct sched_dl_entity *dl_se); +extern int dl_server_remove_params(struct sched_dl_entity *dl_se, + struct rq *rq, struct rq_flags *rf);
static inline bool dl_server_active(struct sched_dl_entity *dl_se) {
 
            From: Joel Fernandes joelagnelf@nvidia.com
Since we are adding more servers, make dl_server_update_idle_time() accept a server argument rather than assuming a specific server.
v2:
- rename rq_dl_server to dl_server (Peter Zijlstra)
Acked-by: Juri Lelli juri.lelli@redhat.com
Reviewed-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/sched/deadline.c | 16 ++++++++--------
 kernel/sched/fair.c     |  2 +-
 kernel/sched/idle.c     |  2 +-
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8aff1aba7b8a9..6ecfaaa1f912d 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1543,26 +1543,26 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 * as time available for the dl_server, avoiding a penalty for the rt * scheduler that did not consumed that time. */ -void dl_server_update_idle_time(struct rq *rq, struct task_struct *p) +void dl_server_update_idle_time(struct rq *rq, struct task_struct *p, + struct sched_dl_entity *dl_server) { s64 delta_exec;
- if (!rq->fair_server.dl_defer) + if (!dl_server->dl_defer) return;
/* no need to discount more */ - if (rq->fair_server.runtime < 0) + if (dl_server->runtime < 0) return;
delta_exec = rq_clock_task(rq) - p->se.exec_start; if (delta_exec < 0) return;
- rq->fair_server.runtime -= delta_exec; - - if (rq->fair_server.runtime < 0) { - rq->fair_server.dl_defer_running = 0; - rq->fair_server.runtime = 0; + dl_server->runtime -= delta_exec; + if (dl_server->runtime < 0) { + dl_server->dl_defer_running = 0; + dl_server->runtime = 0; }
p->se.exec_start = rq_clock_task(rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2554055c1ba13..562cdd253678a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6999,7 +6999,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (!rq_h_nr_queued && rq->cfs.h_nr_queued) { /* Account for idle runtime */ if (!rq->nr_running) - dl_server_update_idle_time(rq, rq->curr); + dl_server_update_idle_time(rq, rq->curr, &rq->fair_server); dl_server_start(&rq->fair_server); }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 7fa0b593bcff7..60a19ea9bdbb7 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -454,7 +454,7 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next) { - dl_server_update_idle_time(rq, prev); + dl_server_update_idle_time(rq, prev, &rq->fair_server); scx_update_idle(rq, false, true); }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4a0bf38dc71e9..eaae470841dea 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -412,7 +412,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq, - struct task_struct *p); + struct task_struct *p, + struct sched_dl_entity *rq_dl_server); extern void fair_server_init(struct rq *rq); extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 
sched_ext currently suffers starvation due to RT: a workload that makes progress under the fair class can get zero runtime when converted to EXT if an RT task is running 100% of the time, causing EXT processes to stall. Fix it by adding a DL server for EXT.
A kselftest is also provided later to verify:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
# Runtime of EXT task (PID 23338) is 0.250000 seconds
# Runtime of RT task (PID 23339) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 1 PASS: EXT task got more than 4.00% of runtime
===== END =====
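The ~5% figure follows from the default bandwidth the DL server is initialized with in dl_server_init_params() (50 ms of runtime every 1000 ms period):

  bw = runtime / period = 50ms / 1000ms = 0.05, i.e. ~5% of CPU time

so the test's 4% threshold leaves some margin for run-to-run fluctuations.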
v3:
- clarify that fair is not the only dl_server (Juri Lelli)
- remove explicit stop to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when it's invoked by the dl_server (Tejun Heo)
- depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)

v2:
- drop ->balance() now that pick_task() has an rf argument (Andrea Righi)
Cc: Luigi De Matteis ldematteis123@gmail.com
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
---
 kernel/sched/core.c  |  3 +++
 kernel/sched/ext.c   | 45 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 50 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 096e8d03d85e7..31a9c9381c63f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8679,6 +8679,9 @@ void __init sched_init(void) hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); fair_server_init(rq); +#ifdef CONFIG_SCHED_CLASS_EXT + ext_server_init(rq); +#endif
#ifdef CONFIG_SCHED_CORE rq->core = rq; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index d1ef5bda95aec..2a25749c54ba1 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -902,6 +902,9 @@ static void update_curr_scx(struct rq *rq) if (!curr->scx.slice) touch_core_sched(rq, curr); } + + if (dl_server_active(&rq->ext_server)) + dl_server_update(&rq->ext_server, delta_exec); }
static bool scx_dsq_priq_less(struct rb_node *node_a, @@ -1409,6 +1412,15 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p);
+ if (rq->scx.nr_running == 1) { + /* Account for idle runtime */ + if (!rq->nr_running) + dl_server_update_idle_time(rq, rq->curr, &rq->ext_server); + + /* Start dl_server if this is the first task being enqueued */ + dl_server_start(&rq->ext_server); + } + do_enqueue_task(rq, p, enq_flags, sticky_cpu); out: rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; @@ -2444,6 +2456,30 @@ static struct task_struct *pick_task_scx(struct rq *rq, struct rq_flags *rf) return do_pick_task_scx(rq, rf, false); }
+/* + * Select the next task to run from the ext scheduling class. + * + * Use do_pick_task_scx() directly with @force_scx enabled, since the + * dl_server must always select a sched_ext task. + */ +static struct task_struct * +ext_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf) +{ + return do_pick_task_scx(dl_se->rq, rf, true); +} + +/* + * Initialize the ext server deadline entity. + */ +void ext_server_init(struct rq *rq) +{ + struct sched_dl_entity *dl_se = &rq->ext_server; + + init_dl_entity(dl_se); + + dl_server_init(dl_se, rq, ext_server_pick_task); +} + #ifdef CONFIG_SCHED_CORE /** * scx_prio_less - Task ordering for core-sched @@ -3023,6 +3059,15 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) static void switched_from_scx(struct rq *rq, struct task_struct *p) { scx_disable_task(p); + + /* + * After class switch, if the DL server is still active, restart it so + * that DL timers will be queued, in case SCX switched to higher class. + */ + if (dl_server_active(&rq->ext_server)) { + dl_server_stop(&rq->ext_server); + dl_server_start(&rq->ext_server); + } }
static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index eaae470841dea..002e5c1808014 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -415,6 +415,7 @@ extern void dl_server_update_idle_time(struct rq *rq, struct task_struct *p, struct sched_dl_entity *rq_dl_server); extern void fair_server_init(struct rq *rq); +extern void ext_server_init(struct rq *rq); extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq); extern int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init); @@ -1154,6 +1155,7 @@ struct rq { struct dl_rq dl; #ifdef CONFIG_SCHED_CLASS_EXT struct scx_rq scx; + struct sched_dl_entity ext_server; #endif
struct sched_dl_entity fair_server;
 
            From: Joel Fernandes joelagnelf@nvidia.com
When a sched_ext scheduler is loaded, tasks in the fair class are automatically moved to the sched_ext class. Add support to modify the ext server parameters, similar to how the fair server parameters are modified.
Re-use common code between ext and fair servers as needed.
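For example, a userspace sketch for tweaking the new knob could look like the following. The values are hypothetical; it assumes debugfs is mounted at /sys/kernel/debug and that the files are created as /sys/kernel/debug/sched/ext_server/cpuN/{runtime,period}, mirroring the existing fair_server layout, with values expressed in nanoseconds:

/* Sketch only: set the ext server runtime on CPU 0 to 20 ms. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/sched/ext_server/cpu0/runtime";
	const char *val = "20000000\n";		/* 20 ms in nanoseconds */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, val, strlen(val)) != (ssize_t)strlen(val))
		perror("write");
	close(fd);
	return 0;
}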
v2:
- use dl_se->dl_server to determine if dl_se is a DL server (Peter Zijlstra)
- depend on CONFIG_SCHED_CLASS_EXT (Andrea Righi)
Reviewed-by: Juri Lelli juri.lelli@redhat.com
Co-developed-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/sched/debug.c | 157 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 133 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index e71f6618c1a6a..9c2084d203df5 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -336,14 +336,16 @@ enum dl_param { DL_PERIOD, };
-static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ -static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ +static unsigned long dl_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */ +static unsigned long dl_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
-static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos, enum dl_param param) +static ssize_t sched_server_write_common(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos, enum dl_param param, + void *server) { long cpu = (long) ((struct seq_file *) filp->private_data)->private; struct rq *rq = cpu_rq(cpu); + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; u64 runtime, period; int retval = 0; size_t err; @@ -356,8 +358,8 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu scoped_guard (rq_lock_irqsave, rq) { bool is_active;
- runtime = rq->fair_server.dl_runtime; - period = rq->fair_server.dl_period; + runtime = dl_se->dl_runtime; + period = dl_se->dl_period;
switch (param) { case DL_RUNTIME: @@ -373,25 +375,25 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu }
if (runtime > period || - period > fair_server_period_max || - period < fair_server_period_min) { + period > dl_server_period_max || + period < dl_server_period_min) { return -EINVAL; }
- is_active = dl_server_active(&rq->fair_server); + is_active = dl_server_active(dl_se); if (is_active) { update_rq_clock(rq); - dl_server_stop(&rq->fair_server); + dl_server_stop(dl_se); }
- retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); + retval = dl_server_apply_params(dl_se, runtime, period, 0);
if (!runtime) - printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", - cpu_of(rq)); + printk_deferred("%s server disabled on CPU %d, system may crash due to starvation.\n", + server == &rq->fair_server ? "Fair" : "Ext", cpu_of(rq));
if (is_active) - dl_server_start(&rq->fair_server); + dl_server_start(dl_se);
if (retval < 0) return retval; @@ -401,36 +403,42 @@ static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubu return cnt; }
-static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) +static size_t sched_server_show_common(struct seq_file *m, void *v, enum dl_param param, + void *server) { - unsigned long cpu = (unsigned long) m->private; - struct rq *rq = cpu_rq(cpu); + struct sched_dl_entity *dl_se = (struct sched_dl_entity *)server; u64 value;
switch (param) { case DL_RUNTIME: - value = rq->fair_server.dl_runtime; + value = dl_se->dl_runtime; break; case DL_PERIOD: - value = rq->fair_server.dl_period; + value = dl_se->dl_period; break; }
seq_printf(m, "%llu\n", value); return 0; - }
static ssize_t sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, + &rq->fair_server); }
static int sched_fair_server_runtime_show(struct seq_file *m, void *v) { - return sched_fair_server_show(m, v, DL_RUNTIME); + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_RUNTIME, &rq->fair_server); }
static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) @@ -446,16 +454,57 @@ static const struct file_operations fair_server_runtime_fops = { .release = single_release, };
+#ifdef CONFIG_SCHED_CLASS_EXT +static ssize_t +sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_RUNTIME, + &rq->ext_server); +} + +static int sched_ext_server_runtime_show(struct seq_file *m, void *v) +{ + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_RUNTIME, &rq->ext_server); +} + +static int sched_ext_server_runtime_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_ext_server_runtime_show, inode->i_private); +} + +static const struct file_operations ext_server_runtime_fops = { + .open = sched_ext_server_runtime_open, + .write = sched_ext_server_runtime_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; +#endif /* CONFIG_SCHED_CLASS_EXT */ + static ssize_t sched_fair_server_period_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, + &rq->fair_server); }
static int sched_fair_server_period_show(struct seq_file *m, void *v) { - return sched_fair_server_show(m, v, DL_PERIOD); + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server); }
static int sched_fair_server_period_open(struct inode *inode, struct file *filp) @@ -471,6 +520,40 @@ static const struct file_operations fair_server_period_fops = { .release = single_release, };
+#ifdef CONFIG_SCHED_CLASS_EXT +static ssize_t +sched_ext_server_period_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD, + &rq->ext_server); +} + +static int sched_ext_server_period_show(struct seq_file *m, void *v) +{ + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + + return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server); +} + +static int sched_ext_server_period_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_ext_server_period_show, inode->i_private); +} + +static const struct file_operations ext_server_period_fops = { + .open = sched_ext_server_period_open, + .write = sched_ext_server_period_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; +#endif /* CONFIG_SCHED_CLASS_EXT */ + static struct dentry *debugfs_sched;
static void debugfs_fair_server_init(void) @@ -494,6 +577,29 @@ static void debugfs_fair_server_init(void) } }
+#ifdef CONFIG_SCHED_CLASS_EXT +static void debugfs_ext_server_init(void) +{ + struct dentry *d_ext; + unsigned long cpu; + + d_ext = debugfs_create_dir("ext_server", debugfs_sched); + if (!d_ext) + return; + + for_each_possible_cpu(cpu) { + struct dentry *d_cpu; + char buf[32]; + + snprintf(buf, sizeof(buf), "cpu%lu", cpu); + d_cpu = debugfs_create_dir(buf, d_ext); + + debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops); + debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops); + } +} +#endif /* CONFIG_SCHED_CLASS_EXT */ + static __init int sched_init_debug(void) { struct dentry __maybe_unused *numa; @@ -532,6 +638,9 @@ static __init int sched_init_debug(void) debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
debugfs_fair_server_init(); +#ifdef CONFIG_SCHED_CLASS_EXT + debugfs_ext_server_init(); +#endif
return 0; }
 
            Always account for both the ext_server and fair_server bandwidth, especially during CPU hotplug operations.
Ignoring either can lead to imbalances in total_bw when sched_ext schedulers are active and CPUs are brought online / offline.
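As a concrete example, consider an scx scheduler loaded in partial mode (both the fair and ext servers active, see the next patch) with the default 50 ms / 1000 ms reservation on each server. Expressed as a fraction of CPU capacity (the kernel stores these as fixed-point dl_bw values), the per-CPU dl_server contribution that has to be discounted when that CPU goes offline is:

  fair_server: 50ms / 1000ms = 0.05
  ext_server:  50ms / 1000ms = 0.05
  dl_server_read_bw(cpu)     = 0.05 + 0.05 = 0.10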
Signed-off-by: Andrea Righi arighi@nvidia.com
---
 kernel/sched/deadline.c | 54 +++++++++++++++++++++++++++++++----------
 kernel/sched/topology.c |  5 ++++
 2 files changed, 46 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6ecfaaa1f912d..f786174a126c8 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2994,6 +2994,36 @@ void dl_add_task_root_domain(struct task_struct *p) task_rq_unlock(rq, p, &rf); }
+static void dl_server_add_bw(struct root_domain *rd, int cpu) +{ + struct sched_dl_entity *dl_se; + + dl_se = &cpu_rq(cpu)->fair_server; + if (dl_server(dl_se)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); + +#ifdef CONFIG_SCHED_CLASS_EXT + dl_se = &cpu_rq(cpu)->ext_server; + if (dl_server(dl_se)) + __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(cpu)); +#endif +} + +static u64 dl_server_read_bw(int cpu) +{ + u64 dl_bw = 0; + + if (cpu_rq(cpu)->fair_server.dl_server) + dl_bw += cpu_rq(cpu)->fair_server.dl_bw; + +#ifdef CONFIG_SCHED_CLASS_EXT + if (cpu_rq(cpu)->ext_server.dl_server) + dl_bw += cpu_rq(cpu)->ext_server.dl_bw; +#endif + + return dl_bw; +} + void dl_clear_root_domain(struct root_domain *rd) { int i; @@ -3013,10 +3043,9 @@ void dl_clear_root_domain(struct root_domain *rd) * them, we need to account for them here explicitly. */ for_each_cpu(i, rd->span) { - struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; - - if (dl_server(dl_se) && cpu_active(i)) - __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i)); + if (!cpu_active(i)) + continue; + dl_server_add_bw(rd, i); } }
@@ -3513,7 +3542,7 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) unsigned long flags, cap; struct dl_bw *dl_b; bool overflow = 0; - u64 fair_server_bw = 0; + u64 dl_server_bw = 0;
rcu_read_lock_sched(); dl_b = dl_bw_of(cpu); @@ -3546,27 +3575,26 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) cap -= arch_scale_cpu_capacity(cpu);
/* - * cpu is going offline and NORMAL tasks will be moved away - * from it. We can thus discount dl_server bandwidth - * contribution as it won't need to be servicing tasks after - * the cpu is off. + * cpu is going offline and NORMAL and EXT tasks will be + * moved away from it. We can thus discount dl_server + * bandwidth contribution as it won't need to be servicing + * tasks after the cpu is off. */ - if (cpu_rq(cpu)->fair_server.dl_server) - fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; + dl_server_bw = dl_server_read_bw(cpu);
/* * Not much to check if no DEADLINE bandwidth is present. * dl_servers we can discount, as tasks will be moved out the * offlined CPUs anyway. */ - if (dl_b->total_bw - fair_server_bw > 0) { + if (dl_b->total_bw - dl_server_bw > 0) { /* * Leaving at least one CPU for DEADLINE tasks seems a * wise thing to do. As said above, cpu is not offline * yet, so account for that. */ if (dl_bw_cpus(cpu) - 1) - overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); + overflow = __dl_overflow(dl_b, cap, dl_server_bw, 0); else overflow = 1; } diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 711076aa49801..1ec8e74b80219 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -508,6 +508,11 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd) if (rq->fair_server.dl_server) __dl_server_attach_root(&rq->fair_server, rq);
+#ifdef CONFIG_SCHED_CLASS_EXT + if (rq->ext_server.dl_server) + __dl_server_attach_root(&rq->ext_server, rq); +#endif + rq_unlock_irqrestore(rq, &rf);
if (old_rd)
 
            Enable or disable the appropriate DL servers (ext and fair) depending on whether an scx scheduler is started in full or partial mode:
- in full mode, disable the fair DL server and enable the ext DL server on all online CPUs,
- in partial mode (%SCX_OPS_SWITCH_PARTIAL), keep both fair and ext DL servers active to support tasks in both scheduling classes.
Additionally, handle CPU hotplug events by selectively enabling or disabling the relevant DL servers on the CPU that is going offline/online. This ensures that bandwidth reservations remain correct when CPUs are brought online or offline.
v2:
- start the dl_server if there's any scx task running in the rq (Christian Loehle)
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
---
 kernel/sched/ext.c | 99 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 89 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 2a25749c54ba1..9e23ead618e16 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2600,6 +2600,59 @@ static void set_cpus_allowed_scx(struct task_struct *p, p, (struct cpumask *)p->cpus_ptr); }
+static void dl_server_on(struct rq *rq, bool switch_all) +{ + struct rq_flags rf; + int err; + + rq_lock_irqsave(rq, &rf); + update_rq_clock(rq); + + if (switch_all) { + /* + * If all fair tasks are moved to the scx scheduler, we + * don't need the fair DL server anymore, so remove it. + * + * When the current scx scheduler is unloaded, the fair DL + * server will be re-initialized. + */ + if (dl_server_active(&rq->fair_server)) + dl_server_stop(&rq->fair_server); + dl_server_remove_params(&rq->fair_server, rq, &rf); + } + + err = dl_server_init_params(&rq->ext_server); + WARN_ON_ONCE(err); + if (rq->scx.nr_running) + dl_server_start(&rq->ext_server); + + rq_unlock_irqrestore(rq, &rf); +} + +static void dl_server_off(struct rq *rq, bool switch_all) +{ + struct rq_flags rf; + int err; + + rq_lock_irqsave(rq, &rf); + update_rq_clock(rq); + + if (dl_server_active(&rq->ext_server)) + dl_server_stop(&rq->ext_server); + dl_server_remove_params(&rq->ext_server, rq, &rf); + + if (switch_all) { + /* + * Re-initialize the fair DL server if it was previously disabled + * because all fair tasks had been moved to the ext class. + */ + err = dl_server_init_params(&rq->fair_server); + WARN_ON_ONCE(err); + } + + rq_unlock_irqrestore(rq, &rf); +} + static void handle_hotplug(struct rq *rq, bool online) { struct scx_sched *sch = scx_root; @@ -2615,9 +2668,20 @@ static void handle_hotplug(struct rq *rq, bool online) if (unlikely(!sch)) return;
- if (scx_enabled()) + if (scx_enabled()) { + bool is_switching_all = READ_ONCE(scx_switching_all); + scx_idle_update_selcpu_topology(&sch->ops);
+ /* + * Update ext and fair DL servers on hotplug events. + */ + if (online) + dl_server_on(rq, is_switching_all); + else + dl_server_off(rq, is_switching_all); + } + if (online && SCX_HAS_OP(sch, cpu_online)) SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); else if (!online && SCX_HAS_OP(sch, cpu_offline)) @@ -3976,6 +4040,7 @@ static void scx_disable_workfn(struct kthread_work *work) struct scx_exit_info *ei = sch->exit_info; struct scx_task_iter sti; struct task_struct *p; + bool is_switching_all = READ_ONCE(scx_switching_all); int kind, cpu;
kind = atomic_read(&sch->exit_kind); @@ -4031,6 +4096,22 @@ static void scx_disable_workfn(struct kthread_work *work)
scx_init_task_enabled = false;
+ for_each_online_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + + /* + * Invalidate all the rq clocks to prevent getting outdated + * rq clocks from a previous scx scheduler. + */ + scx_rq_clock_invalidate(rq); + + /* + * We are unloading the sched_ext scheduler, we do not need its + * DL server bandwidth anymore, remove it for all CPUs. + */ + dl_server_off(rq, is_switching_all); + } + scx_task_iter_start(&sti); while ((p = scx_task_iter_next_locked(&sti))) { unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; @@ -4052,15 +4133,6 @@ static void scx_disable_workfn(struct kthread_work *work) scx_task_iter_stop(&sti); percpu_up_write(&scx_fork_rwsem);
- /* - * Invalidate all the rq clocks to prevent getting outdated - * rq clocks from a previous scx scheduler. - */ - for_each_possible_cpu(cpu) { - struct rq *rq = cpu_rq(cpu); - scx_rq_clock_invalidate(rq); - } - /* no task is on scx, turn off all the switches and flush in-progress calls */ static_branch_disable(&__scx_enabled); bitmap_zero(sch->has_op, SCX_OPI_END); @@ -4834,6 +4906,13 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) } } scx_task_iter_stop(&sti); + + /* + * Enable the ext DL server on all online CPUs. + */ + for_each_online_cpu(cpu) + dl_server_on(cpu_rq(cpu), !(ops->flags & SCX_OPS_SWITCH_PARTIAL)); + percpu_up_write(&scx_fork_rwsem);
scx_bypass(false);
 
            Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class.
v3:
- add a comment to explain the 4% threshold (Emil Tsalapatis)

v2:
- replaced occurrences of CFS in the test with EXT (Joel Fernandes)
Reviewed-by: Emil Tsalapatis emil@etsalapatis.com
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 222 ++++++++++++++++++
 3 files changed, 246 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 5fe45f9c5f8fd..c9255d1499b6e 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -183,6 +183,7 @@ auto-test-targets := \ select_cpu_dispatch_bad_dsq \ select_cpu_dispatch_dbl_dsp \ select_cpu_vtime \ + rt_stall \ test_example \
testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c new file mode 100644 index 0000000000000..80086779dd1eb --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that verified if RT tasks can stall SCHED_EXT tasks. + * + * Copyright (c) 2025 NVIDIA Corporation. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops rt_stall_ops = { + .exit = (void *)rt_stall_exit, + .name = "rt_stall", +}; diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c new file mode 100644 index 0000000000000..d0ffa0e72b37b --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -0,0 +1,222 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sched.h> +#include <sys/prctl.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <time.h> +#include <linux/sched.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "rt_stall.bpf.skel.h" +#include "scx_test.h" +#include "../kselftest.h" + +#define CORE_ID 0 /* CPU to pin tasks to */ +#define RUN_TIME 5 /* How long to run the test in seconds */ + +/* Simple busy-wait function for test tasks */ +static void process_func(void) +{ + while (1) { + /* Busy wait */ + for (volatile unsigned long i = 0; i < 10000000UL; i++) + ; + } +} + +/* Set CPU affinity to a specific core */ +static void set_affinity(int cpu) +{ + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(cpu, &mask); + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } +} + +/* Set task scheduling policy and priority */ +static void set_sched(int policy, int priority) +{ + struct sched_param param; + + param.sched_priority = priority; + if (sched_setscheduler(0, policy, ¶m) != 0) { + perror("sched_setscheduler"); + exit(EXIT_FAILURE); + } +} + +/* Get process runtime from /proc/<pid>/stat */ +static float get_process_runtime(int pid) +{ + char path[256]; + FILE *file; + long utime, stime; + int fields; + + snprintf(path, sizeof(path), "/proc/%d/stat", pid); + file = fopen(path, "r"); + if (file == NULL) { + perror("Failed to open stat file"); + return -1; + } + + /* Skip the first 13 fields and read the 14th and 15th */ + fields = fscanf(file, + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", + &utime, &stime); + fclose(file); + + if (fields != 2) { + fprintf(stderr, "Failed to read stat file\n"); + return -1; + } + + /* Calculate the total time spent in the process */ + long total_time = utime + stime; + long ticks_per_second = sysconf(_SC_CLK_TCK); + float runtime_seconds = total_time * 1.0 / ticks_per_second; + + return runtime_seconds; +} + +static enum scx_test_status setup(void **ctx) +{ + struct rt_stall *skel; + + skel = rt_stall__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static bool sched_stress_test(void) +{ + /* + * We're 
expecting the EXT task to get around 5% of CPU time when + * competing with the RT task (small 1% fluctuations are expected). + * + * However, the EXT task should get at least 4% of the CPU to prove + * that the EXT deadline server is working correctly. A percentage + * less than 4% indicates a bug where RT tasks can potentially + * stall SCHED_EXT tasks, causing the test to fail. + */ + const float expected_min_ratio = 0.04; /* 4% */ + + float ext_runtime, rt_runtime, actual_ratio; + int ext_pid, rt_pid; + + ksft_print_header(); + ksft_set_plan(1); + + /* Create and set up a EXT task */ + ext_pid = fork(); + if (ext_pid == 0) { + set_affinity(CORE_ID); + process_func(); + exit(0); + } else if (ext_pid < 0) { + perror("fork for EXT task"); + ksft_exit_fail(); + } + + /* Create an RT task */ + rt_pid = fork(); + if (rt_pid == 0) { + set_affinity(CORE_ID); + set_sched(SCHED_FIFO, 50); + process_func(); + exit(0); + } else if (rt_pid < 0) { + perror("fork for RT task"); + ksft_exit_fail(); + } + + /* Let the processes run for the specified time */ + sleep(RUN_TIME); + + /* Get runtime for the EXT task */ + ext_runtime = get_process_runtime(ext_pid); + if (ext_runtime == -1) + ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", ext_pid); + ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n", + ext_pid, ext_runtime); + + /* Get runtime for the RT task */ + rt_runtime = get_process_runtime(rt_pid); + if (rt_runtime == -1) + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); + + /* Kill the processes */ + kill(ext_pid, SIGKILL); + kill(rt_pid, SIGKILL); + waitpid(ext_pid, NULL, 0); + waitpid(rt_pid, NULL, 0); + + /* Verify that the scx task got enough runtime */ + actual_ratio = ext_runtime / (ext_runtime + rt_runtime); + ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100); + + if (actual_ratio >= expected_min_ratio) { + ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n", + expected_min_ratio * 100); + return true; + } + ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n", + expected_min_ratio * 100); + return false; +} + +static enum scx_test_status run(void *ctx) +{ + struct rt_stall *skel = ctx; + struct bpf_link *link; + bool res; + + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + res = sched_stress_test(); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); + bpf_link__destroy(link); + + if (!res) + ksft_exit_fail(); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct rt_stall *skel = ctx; + + rt_stall__destroy(skel); +} + +struct scx_test rt_stall = { + .name = "rt_stall", + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&rt_stall)
 
            On 10/29/25 19:08, Andrea Righi wrote:
Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class.
v3: - add a comment to explain the 4% threshold (Emil Tsalapatis) v2: - replaced occurences of CFS in the test with EXT (Joel Fernandes)
Reviewed-by: Emil Tsalapatis emil@etsalapatis.com Co-developed-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Joel Fernandes joelagnelf@nvidia.com Signed-off-by: Andrea Righi arighi@nvidia.com
tools/testing/selftests/sched_ext/Makefile | 1 + .../selftests/sched_ext/rt_stall.bpf.c | 23 ++ tools/testing/selftests/sched_ext/rt_stall.c | 222 ++++++++++++++++++ 3 files changed, 246 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 5fe45f9c5f8fd..c9255d1499b6e 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -183,6 +183,7 @@ auto-test-targets := \ select_cpu_dispatch_bad_dsq \ select_cpu_dispatch_dbl_dsp \ select_cpu_vtime \
- rt_stall \ test_example \
testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c new file mode 100644 index 0000000000000..80086779dd1eb --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- A scheduler that verified if RT tasks can stall SCHED_EXT tasks.
- Copyright (c) 2025 NVIDIA Corporation.
- */
+#include <scx/common.bpf.h>
+char _license[] SEC("license") = "GPL";
+UEI_DEFINE(uei);
+void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) +{
- UEI_RECORD(uei, ei);
+}
+SEC(".struct_ops.link") +struct sched_ext_ops rt_stall_ops = {
- .exit = (void *)rt_stall_exit,
- .name = "rt_stall",
+}; diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c new file mode 100644 index 0000000000000..d0ffa0e72b37b --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -0,0 +1,222 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Copyright (c) 2025 NVIDIA Corporation.
- */
+#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sched.h> +#include <sys/prctl.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <time.h> +#include <linux/sched.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "rt_stall.bpf.skel.h" +#include "scx_test.h" +#include "../kselftest.h"
+#define CORE_ID 0 /* CPU to pin tasks to */ +#define RUN_TIME 5 /* How long to run the test in seconds */
+/* Simple busy-wait function for test tasks */ +static void process_func(void) +{
- while (1) {
/* Busy wait */
for (volatile unsigned long i = 0; i < 10000000UL; i++)
;- }
+}
+/* Set CPU affinity to a specific core */ +static void set_affinity(int cpu) +{
- cpu_set_t mask;
- CPU_ZERO(&mask);
- CPU_SET(cpu, &mask);
- if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);- }
+}
+/* Set task scheduling policy and priority */ +static void set_sched(int policy, int priority) +{
	struct sched_param param;

	param.sched_priority = priority;
	if (sched_setscheduler(0, policy, &param) != 0) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}
}

/* Get process runtime from /proc/<pid>/stat */
static float get_process_runtime(int pid)
{
	char path[256];
	FILE *file;
	long utime, stime;
	int fields;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	file = fopen(path, "r");
	if (file == NULL) {
		perror("Failed to open stat file");
		return -1;
	}

	/* Skip the first 13 fields and read the 14th and 15th */
	fields = fscanf(file,
			"%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
			&utime, &stime);
	fclose(file);

	if (fields != 2) {
		fprintf(stderr, "Failed to read stat file\n");
		return -1;
	}

	/* Calculate the total time spent in the process */
	long total_time = utime + stime;
	long ticks_per_second = sysconf(_SC_CLK_TCK);
	float runtime_seconds = total_time * 1.0 / ticks_per_second;

	return runtime_seconds;
}

static enum scx_test_status setup(void **ctx)
{
	struct rt_stall *skel;

	skel = rt_stall__open();
	SCX_FAIL_IF(!skel, "Failed to open");
	SCX_ENUM_INIT(skel);
	SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel");

	*ctx = skel;

	return SCX_TEST_PASS;
}

static bool sched_stress_test(void)
{
	/*
	 * We're expecting the EXT task to get around 5% of CPU time when
	 * competing with the RT task (small 1% fluctuations are expected).
	 *
	 * However, the EXT task should get at least 4% of the CPU to prove
	 * that the EXT deadline server is working correctly. A percentage
	 * less than 4% indicates a bug where RT tasks can potentially
	 * stall SCHED_EXT tasks, causing the test to fail.
	 */
	const float expected_min_ratio = 0.04; /* 4% */
	float ext_runtime, rt_runtime, actual_ratio;
	int ext_pid, rt_pid;

	ksft_print_header();
	ksft_set_plan(1);

	/* Create and set up a EXT task */
	ext_pid = fork();
	if (ext_pid == 0) {
		set_affinity(CORE_ID);
		process_func();
		exit(0);
	} else if (ext_pid < 0) {
		perror("fork for EXT task");
		ksft_exit_fail();
	}

	/* Create an RT task */
	rt_pid = fork();
	if (rt_pid == 0) {
		set_affinity(CORE_ID);
		set_sched(SCHED_FIFO, 50);
		process_func();
		exit(0);
	} else if (rt_pid < 0) {
		perror("fork for RT task");
		ksft_exit_fail();
	}

	/* Let the processes run for the specified time */
	sleep(RUN_TIME);

	/* Get runtime for the EXT task */
	ext_runtime = get_process_runtime(ext_pid);
	if (ext_runtime == -1)
		ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", ext_pid);
	ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n",
		       ext_pid, ext_runtime);

	/* Get runtime for the RT task */
	rt_runtime = get_process_runtime(rt_pid);
	if (rt_runtime == -1)
		ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid);
	ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime);

	/* Kill the processes */
	kill(ext_pid, SIGKILL);
	kill(rt_pid, SIGKILL);
	waitpid(ext_pid, NULL, 0);
	waitpid(rt_pid, NULL, 0);

	/* Verify that the scx task got enough runtime */
	actual_ratio = ext_runtime / (ext_runtime + rt_runtime);
	ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100);

	if (actual_ratio >= expected_min_ratio) {
		ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n",
				      expected_min_ratio * 100);
		return true;
	}

	ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n",
			      expected_min_ratio * 100);
	return false;
}

static enum scx_test_status run(void *ctx)
{
	struct rt_stall *skel = ctx;
	struct bpf_link *link;
	bool res;

	link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
	SCX_FAIL_IF(!link, "Failed to attach scheduler");

	res = sched_stress_test();

	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
	bpf_link__destroy(link);

	if (!res)
		ksft_exit_fail();

	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	struct rt_stall *skel = ctx;

	rt_stall__destroy(skel);
}

struct scx_test rt_stall = {
	.name = "rt_stall",
	.description = "Verify that RT tasks cannot stall SCHED_EXT tasks",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&rt_stall)
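For context on the 5%/4% numbers in the comment above: assuming the ext DL server is given the same default reservation as the existing fair DL server (50ms of runtime every 1s, i.e. 5% of the CPU; this default is an assumption for illustration, not something stated in this patch), an EXT task competing against a CPU-hog FIFO task on the same CPU should be granted roughly runtime/period of that CPU, and the 4% pass threshold just leaves about 1% of headroom for measurement noise. A minimal, illustrative sketch of that arithmetic (not part of the patch):

/* Illustration only: why the test expects ~5% and passes above 4%. */
#include <assert.h>

int main(void)
{
	const double dl_runtime_us = 50000.0;	/* assumed ext server runtime */
	const double dl_period_us = 1000000.0;	/* assumed ext server period */
	const double expected_share = dl_runtime_us / dl_period_us;	/* ~0.05 */
	const double expected_min_ratio = 0.04;	/* threshold used by the test */

	/* The ~5% reserved share comfortably clears the 4% threshold. */
	assert(expected_share >= expected_min_ratio);
	return 0;
}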
I'd still prefer something like the below to also test if the fair_server stop -> ext_server start -> fair_server start -> ext_server stop flow works correctly, but FWIW

Tested-by: Christian Loehle christian.loehle@arm.com
------8<------

@@ -188,19 +188,24 @@ static bool sched_stress_test(void)
 static enum scx_test_status run(void *ctx)
 {
 	struct rt_stall *skel = ctx;
-	struct bpf_link *link;
+	struct bpf_link *link = NULL;
 	bool res;
 
-	link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
-	SCX_FAIL_IF(!link, "Failed to attach scheduler");
-
-	res = sched_stress_test();
-
-	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
-	bpf_link__destroy(link);
-
-	if (!res)
-		ksft_exit_fail();
+	for (int i = 0; i < 4; i++) {
+		if (i % 2) {
+			memset(&skel->data->uei, 0, sizeof(skel->data->uei));
+			link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+			SCX_FAIL_IF(!link, "Failed to attach scheduler");
+		}
+		res = sched_stress_test();
+		if (i % 2) {
+			SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+			bpf_link__destroy(link);
+		}
+
+		if (!res)
+			ksft_exit_fail();
+	}
 
 	return SCX_TEST_PASS;
 }
 
            Hi Christian,
On Thu, Oct 30, 2025 at 04:49:48PM +0000, Christian Loehle wrote:
On 10/29/25 19:08, Andrea Righi wrote:
Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class.
v3:
- add a comment to explain the 4% threshold (Emil Tsalapatis)
v2:
- replaced occurrences of CFS in the test with EXT (Joel Fernandes)
Reviewed-by: Emil Tsalapatis emil@etsalapatis.com
Co-developed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Andrea Righi arighi@nvidia.com
...
I'd still prefer something like the below to also test if the fair_server stop -> ext_server start -> fair_server start -> ext_server stop flow works correctly, but FWIW

Tested-by: Christian Loehle christian.loehle@arm.com
Ack, I'll also run some tests on my side with this applied.
And yes, this definitely improves the selftest. I think we can also apply it as a follow-up patch later.
Thanks, -Andrea
------8<------

@@ -188,19 +188,24 @@ static bool sched_stress_test(void)
 static enum scx_test_status run(void *ctx)
 {
 	struct rt_stall *skel = ctx;
-	struct bpf_link *link;
+	struct bpf_link *link = NULL;
 	bool res;
 
-	link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
-	SCX_FAIL_IF(!link, "Failed to attach scheduler");
-
-	res = sched_stress_test();
-
-	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
-	bpf_link__destroy(link);
-
-	if (!res)
-		ksft_exit_fail();
+	for (int i = 0; i < 4; i++) {
+		if (i % 2) {
+			memset(&skel->data->uei, 0, sizeof(skel->data->uei));
+			link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops);
+			SCX_FAIL_IF(!link, "Failed to attach scheduler");
+		}
+		res = sched_stress_test();
+		if (i % 2) {
+			SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+			bpf_link__destroy(link);
+		}
+
+		if (!res)
+			ksft_exit_fail();
+	}
 
 	return SCX_TEST_PASS;
 }
 
            From: Joel Fernandes joelagnelf@nvidia.com
Add a new kselftest to verify that the total_bw value in /sys/kernel/debug/sched/debug remains consistent across all CPUs under different sched_ext BPF program states:
1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded
The test runs CPU stress threads to ensure DL server bandwidth values stabilize before checking consistency. This helps catch potential issues with DL server bandwidth accounting during sched_ext transitions.
v2:
- small coding style fixes (Andrea Righi)
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 tools/testing/selftests/sched_ext/Makefile   |   1 +
 tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index c9255d1499b6e..2c601a7eaff5f 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -185,6 +185,7 @@ auto-test-targets := \
 	select_cpu_vtime \
 	rt_stall \
 	test_example \
+	total_bw \
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c
new file mode 100644
index 0000000000000..5b0a619bab86e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/total_bw.c
@@ -0,0 +1,281 @@

// SPDX-License-Identifier: GPL-2.0
/*
 * Test to verify that total_bw value remains consistent across all CPUs
 * in different BPF program states.
 *
 * Copyright (C) 2025 NVIDIA Corporation.
 */
#include <bpf/bpf.h>
#include <errno.h>
#include <pthread.h>
#include <scx/common.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>
#include "minimal.bpf.skel.h"
#include "scx_test.h"

#define MAX_CPUS 512
#define STRESS_DURATION_SEC 5

struct total_bw_ctx {
	struct minimal *skel;
	long baseline_bw[MAX_CPUS];
	int nr_cpus;
};

static void *cpu_stress_thread(void *arg)
{
	volatile int i;
	time_t end_time = time(NULL) + STRESS_DURATION_SEC;

	while (time(NULL) < end_time)
		for (i = 0; i < 1000000; i++)
			;

	return NULL;
}

/*
 * The first enqueue on a CPU causes the DL server to start, for that
 * reason run stressor threads in the hopes it schedules on all CPUs.
 */
static int run_cpu_stress(int nr_cpus)
{
	pthread_t *threads;
	int i, ret = 0;

	threads = calloc(nr_cpus, sizeof(pthread_t));
	if (!threads)
		return -ENOMEM;

	/* Create threads to run on each CPU */
	for (i = 0; i < nr_cpus; i++) {
		if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) {
			ret = -errno;
			fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret));
			break;
		}
	}

	/* Wait for all threads to complete */
	for (i = 0; i < nr_cpus; i++) {
		if (threads[i])
			pthread_join(threads[i], NULL);
	}

	free(threads);
	return ret;
}

static int read_total_bw_values(long *bw_values, int max_cpus)
{
	FILE *fp;
	char line[256];
	int cpu_count = 0;

	fp = fopen("/sys/kernel/debug/sched/debug", "r");
	if (!fp) {
		SCX_ERR("Failed to open debug file");
		return -1;
	}

	while (fgets(line, sizeof(line), fp)) {
		char *bw_str = strstr(line, "total_bw");

		if (bw_str) {
			bw_str = strchr(bw_str, ':');
			if (bw_str) {
				/* Only store up to max_cpus values */
				if (cpu_count < max_cpus)
					bw_values[cpu_count] = atol(bw_str + 1);
				cpu_count++;
			}
		}
	}

	fclose(fp);
	return cpu_count;
}

static bool verify_total_bw_consistency(long *bw_values, int count)
{
	int i;
	long first_value;

	if (count <= 0)
		return false;

	first_value = bw_values[0];

	for (i = 1; i < count; i++) {
		if (bw_values[i] != first_value) {
			SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld",
				first_value, i, bw_values[i]);
			return false;
		}
	}

	return true;
}

static int fetch_verify_total_bw(long *bw_values, int nr_cpus)
{
	int attempts = 0;
	int max_attempts = 10;
	int count;

	/*
	 * The first enqueue on a CPU causes the DL server to start, for that
	 * reason run stressor threads in the hopes it schedules on all CPUs.
	 */
	if (run_cpu_stress(nr_cpus) < 0) {
		SCX_ERR("Failed to run CPU stress");
		return -1;
	}

	/* Try multiple times to get stable values */
	while (attempts < max_attempts) {
		count = read_total_bw_values(bw_values, nr_cpus);
		fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus);
		/* If system has more CPUs than we're testing, that's OK */
		if (count < nr_cpus) {
			SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count);
			attempts++;
			sleep(1);
			continue;
		}

		/* Only verify the CPUs we're testing */
		if (verify_total_bw_consistency(bw_values, nr_cpus)) {
			fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]);
			return 0;
		}

		attempts++;
		sleep(1);
	}

	return -1;
}

static enum scx_test_status setup(void **ctx)
{
	struct total_bw_ctx *test_ctx;

	if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) {
		fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n");
		return SCX_TEST_SKIP;
	}

	test_ctx = calloc(1, sizeof(*test_ctx));
	if (!test_ctx)
		return SCX_TEST_FAIL;

	test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	if (test_ctx->nr_cpus <= 0) {
		free(test_ctx);
		return SCX_TEST_FAIL;
	}

	/* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */
	if (test_ctx->nr_cpus > MAX_CPUS)
		test_ctx->nr_cpus = MAX_CPUS;

	/* Test scenario 1: BPF program not loaded */
	/* Read and verify baseline total_bw before loading BPF program */
	fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n");
	if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) {
		SCX_ERR("Failed to get stable baseline values");
		free(test_ctx);
		return SCX_TEST_FAIL;
	}

	/* Load the BPF skeleton */
	test_ctx->skel = minimal__open();
	if (!test_ctx->skel) {
		free(test_ctx);
		return SCX_TEST_FAIL;
	}

	SCX_ENUM_INIT(test_ctx->skel);
	if (minimal__load(test_ctx->skel)) {
		minimal__destroy(test_ctx->skel);
		free(test_ctx);
		return SCX_TEST_FAIL;
	}

	*ctx = test_ctx;
	return SCX_TEST_PASS;
}

static enum scx_test_status run(void *ctx)
{
	struct total_bw_ctx *test_ctx = ctx;
	struct bpf_link *link;
	long loaded_bw[MAX_CPUS];
	long unloaded_bw[MAX_CPUS];
	int i;

	/* Test scenario 2: BPF program loaded */
	link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops);
	if (!link) {
		SCX_ERR("Failed to attach scheduler");
		return SCX_TEST_FAIL;
	}

	fprintf(stderr, "BPF program loaded, reading total_bw values\n");
	if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) {
		SCX_ERR("Failed to get stable values with BPF loaded");
		bpf_link__destroy(link);
		return SCX_TEST_FAIL;
	}
	bpf_link__destroy(link);

	/* Test scenario 3: BPF program unloaded */
	fprintf(stderr, "BPF program unloaded, reading total_bw values\n");
	if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) {
		SCX_ERR("Failed to get stable values after BPF unload");
		return SCX_TEST_FAIL;
	}

	/* Verify all three scenarios have the same total_bw values */
	for (i = 0; i < test_ctx->nr_cpus; i++) {
		if (test_ctx->baseline_bw[i] != loaded_bw[i]) {
			SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld",
				i, test_ctx->baseline_bw[i], loaded_bw[i]);
			return SCX_TEST_FAIL;
		}

		if (test_ctx->baseline_bw[i] != unloaded_bw[i]) {
			SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld",
				i, test_ctx->baseline_bw[i], unloaded_bw[i]);
			return SCX_TEST_FAIL;
		}
	}

	fprintf(stderr, "All total_bw values are consistent across all scenarios\n");
	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	struct total_bw_ctx *test_ctx = ctx;

	if (test_ctx) {
		if (test_ctx->skel)
			minimal__destroy(test_ctx->skel);
		free(test_ctx);
	}
}

struct scx_test total_bw = {
	.name = "total_bw",
	.description = "Verify total_bw consistency across BPF program states",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&total_bw)
 
            On 10/29/25 19:08, Andrea Righi wrote:
sched_ext tasks can be starved by long-running RT tasks, especially since RT throttling was replaced by deadline servers to boost only SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling sched_ext tasks. This is fairly common on distributions or environments where applications like video compositors, audio services, etc. run as RT tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s) ... CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738 curr=sway[994] class=rt_sched_class R kworker/0:0[106377] -5043ms scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000 cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality schedulers can't do much: RT tasks run outside their control and can potentially consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are also boosted and do not suffer starvation.
Two kselftests are also provided to verify the starvation fixes and bandwidth allocation is correct.
== Highlights in this version ==
- wait for inactive_task_timer() to fire before removing the bandwidth reservation (Juri/Peter: please check if this new dl_server_remove_params() implementation makes sense to you)
- removed the explicit dl_server_stop() from dequeue_task_scx() and rely on the delayed stop behavior (Juri/Peter: ditto)
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v10:
- reordered patches to better isolate sched_ext changes vs sched/deadline changes (Andrea Righi)
- define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
- add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
- wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli)
- remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer reprogramming overhead (Juri Lelli)
- do not restart pick_task() when invoked by the dl_server (Tejun Heo)
- rename rq_dl_server to dl_server (Peter Zijlstra)
- fixed a missing dl_server start in dl_server_on() (Christian Loehle)
- add a comment to the rt_stall selftest to better explain the 4% threshold (Emil Tsalapatis)
Changes in v9:
- Drop the ->balance() logic as its functionality is now integrated into ->pick_task(), allowing dl_server to call pick_task_scx() directly
- Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
Changes in v6:
- Added Acks to few patches
- Fixes to few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (5):
  sched/deadline: Add support to initialize and remove dl_server bandwidth
  sched_ext: Add a DL server for sched_ext tasks
  sched/deadline: Account ext server bandwidth
  sched_ext: Selectively enable ext and fair DL servers
  selftests/sched_ext: Add test for sched_ext dl_server
Joel Fernandes (6):
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/deadline: Clear the defer params
  sched/deadline: Add a server arg to dl_server_update_idle_time()
  sched/debug: Add support to change sched_ext server params
  selftests/sched_ext: Add test for DL server total_bw consistency
 kernel/sched/core.c                              |   3 +
 kernel/sched/deadline.c                          | 169 +++++++++++---
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               | 144 +++++++++++-
 kernel/sched/fair.c                              |   2 +-
 kernel/sched/idle.c                              |   2 +-
 kernel/sched/sched.h                             |   8 +-
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 222 ++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 12 files changed, 955 insertions(+), 77 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
Thanks Andrea, I've tested a few things I had in mind with no complaints. Most importantly, a) it doesn't break the existing fair_server, and b) it ensures BPF schedulers don't stall even with something like:

sudo chrt -r 95 stress-ng --cpu 0 --taskset 0-$(($(nproc)-1)) -t 30m
For patches 0 to 9: Tested-by: Christian Loehle christian.loehle@arm.com

