sched_ext tasks can be starved by long-running RT tasks, especially since RT throttling was replaced by deadline servers, which boost only SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling sched_ext tasks. This is fairly common on distributions or environments where applications like video compositors, audio services, etc. run as RT tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland compositor running as an RT task):
 runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
 ...
 CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
           curr=sway[994] class=rt_sched_class
   R kworker/0:0[106377] -5043ms
       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
       cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality they can't do much: RT tasks run outside their control and can potentially consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are also boosted and do not suffer starvation.
Two kselftests are also provided to verify that the starvation is fixed and that the bandwidth allocation is correct.
== Design ==
 - The EXT server is initialized at boot time and remains configured throughout the system's lifetime
 - It starts automatically when the first sched_ext task is enqueued (rq->scx.nr_running == 1)
 - The server's pick function (ext_server_pick_task) always selects sched_ext tasks when active
 - Runtime accounting happens in update_curr_scx() during task execution and update_curr_idle() when idle
 - Bandwidth accounting includes both fair and ext servers in root domain calculations
 - A debugfs interface (/sys/kernel/debug/sched/ext_server/) allows runtime tuning of server parameters
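The bandwidth accounting point can be illustrated with a small userspace sketch. It mirrors the fixed-point ratio that the deadline code uses for bandwidth (runtime / period scaled by 1 << 20), and sums one fair and one ext reservation per CPU. The helper total_server_bw() is hypothetical and only meant to show the arithmetic; the 50ms/1s values used in the note below are the fair server defaults and are assumed, not confirmed, to apply to the ext server as well.

```c
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)

/*
 * Fixed-point runtime/period ratio, in the style of the kernel's
 * to_ratio(): BW_UNIT corresponds to 100% of one CPU.
 */
static uint64_t to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	if (period_ns == 0)
		return 0;
	return (runtime_ns << BW_SHIFT) / period_ns;
}

/*
 * Hypothetical helper: bandwidth reserved by the DL servers of a root
 * domain spanning @nr_cpus CPUs, where every CPU carries one fair
 * server and one ext server.
 */
static uint64_t total_server_bw(unsigned int nr_cpus, uint64_t fair_bw,
				uint64_t ext_bw)
{
	return (uint64_t)nr_cpus * (fair_bw + ext_bw);
}
```

With a runtime of 50ms and a period of 1s, to_ratio(1000000000, 50000000) evaluates to 52428, i.e. roughly 5% of BW_UNIT (1048576). Under the assumption above, each CPU then contributes twice that value to the root domain total.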
== Highlights in this version ==
As discussed at the sched_ext microconference at LPC Tokyo, the plan is to start with a simpler approach, avoiding automatically creating or tearing down the EXT server bandwidth reservation when a BPF scheduler is loaded or unloaded. Instead, the reservation is kept permanently active. This significantly simplifies the logic while still addressing the starvation issue.
Any fine-tuning of the bandwidth reservation is delegated to the system administrator, who can adjust it via the debugfs interface. In the future, a more suitable interface can be introduced and automatic removal of the reservation when the BPF scheduler is unloaded can be revisited.
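As a rough sketch of how an administrator-side tool could inspect the current reservation, the following userspace C code reads a runtime and a period value from a server directory and reports the share in tenths of a percent. The per-CPU layout (cpuN/runtime and cpuN/period, both in nanoseconds) is assumed by analogy with the existing fair_server debugfs interface; the function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read a single unsigned integer (nanoseconds) from a file. */
static int read_u64(const char *path, uint64_t *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu64, val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Compute runtime/period for one server directory, in tenths of a
 * percent. @dir would be something like
 * "/sys/kernel/debug/sched/ext_server/cpu0" (assumed layout).
 * Returns 0 on success, -1 if a file is missing or malformed.
 */
static int server_share_permille(const char *dir, unsigned int *permille)
{
	char path[512];
	uint64_t runtime, period;

	snprintf(path, sizeof(path), "%s/runtime", dir);
	if (read_u64(path, &runtime))
		return -1;

	snprintf(path, sizeof(path), "%s/period", dir);
	if (read_u64(path, &period) || period == 0)
		return -1;

	*permille = (unsigned int)(runtime * 1000 / period);
	return 0;
}
```

For a directory whose runtime file contains 50000000 and whose period file contains 1000000000, the function stores 50 in *permille, i.e. 5.0% of the CPU.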
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v11:
 - do not create/remove the bandwidth reservation for the ext server when a BPF scheduler is loaded/unloaded, but keep the reserved bandwidth always active
 - change rt_stall kselftest to validate both FAIR and EXT DL servers
 - Link to v10: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v10:
 - reordered patches to better isolate sched_ext changes vs sched/deadline changes (Andrea Righi)
 - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
 - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
 - wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli)
 - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer reprogramming overhead (Juri Lelli)
 - do not restart pick_task() when invoked by the dl_server (Tejun Heo)
 - rename rq_dl_server to dl_server (Peter Zijlstra)
 - fixed a missing dl_server start in dl_server_on() (Christian Loehle)
 - add a comment to the rt_stall selftest to better explain the 4% threshold (Emil Tsalapatis)
 - Link to v9: https://lore.kernel.org/all/20251017093214.70029-1-arighi@nvidia.com/

Changes in v9:
 - Drop the ->balance() logic as its functionality is now integrated into ->pick_task(), allowing dl_server to call pick_task_scx() directly
 - Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v8:
 - Add tj's patch to de-couple balance and pick_task and avoid changing sched/core callbacks to propagate @rf
 - Simplify dl_se->dl_server check (suggested by PeterZ)
 - Small coding style fixes in the kselftests
 - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/

Changes in v7:
 - Rebased to Linus master
 - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/

Changes in v6:
 - Added Acks to a few patches
 - Fixed a few nits suggested by Tejun
 - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/

Changes in v5:
 - Added a kselftest (total_bw) to sched_ext to verify bandwidth values from debugfs
 - Addressed comment from Andrea about redundant rq clock invalidation
 - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/

Changes in v4:
 - Fixed issues with hotplugged CPUs having their DL server bandwidth altered due to loading SCX
 - Fixed other issues
 - Rebased on Linus master
 - All sched_ext kselftests reliably pass now, also verified that the total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
 - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/

Changes in v3:
 - Removed code duplication in debugfs. Made ext interface separate
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
 - Fixed running bw accounting issue in dl_server_remove_params
 - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Changes in v2:
 - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
 - Added support to remove BW of DL servers when they are switched to/from EXT
 - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/

Andrea Righi (2):
  sched_ext: Add a DL server for sched_ext tasks
  selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (5):
  sched/deadline: Clear the defer params
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/debug: Add support to change sched_ext server params
  selftests/sched_ext: Add test for DL server total_bw consistency

 kernel/sched/core.c                              |   6 +
 kernel/sched/deadline.c                          |  87 +++++--
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               |  42 ++++
 kernel/sched/idle.c                              |   3 +
 kernel/sched/sched.h                             |   2 +
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 240 +++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 11 files changed, 811 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c