From: Emil Tsalapatis <etsal@meta.com>
[ Upstream commit a8ad873113d3fe01f9b5d737d4b0570fa36826b0 ]
The sched_ext code calls queue_balance_callback() during enqueue_task() to defer operations that drop multiple locks until we can unpin them. The call assumes that the rq lock is held until the callbacks are invoked, and that the pending callbacks will not be visible to any other threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().
However, balance_one() may actually drop the lock during a BPF dispatch call. Another thread may win the race to get the rq lock and see the pending callback. To avoid this, sched_ext must only queue the callback after the dispatch calls have completed.
  CPU 0                      CPU 1                      CPU 2
  scx_balance()
    rq_unpin_lock()
    scx_balance_one()
      |= IN_BALANCE
                             scx_enqueue()
      ops.dispatch()
        rq_unlock()
                               rq_lock()
                               queue_balance_callback()
                               rq_unlock()
                                                        [WARN] rq_pin_lock()
        rq_lock()
      &= ~IN_BALANCE
    rq_repin_lock()
Changelog
v1 -> v2 (v1: https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4)
- Fixed explanation in patch description (Andrea)
- Fixed scx_rq mask state updates (Andrea)
- Added Reviewed-by tag from Andrea
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Based on my analysis using semantic code analysis tools and repository examination, here is my recommendation:
## **ANSWER: YES** (with critical caveat)
This commit **should be backported to stable kernel trees**, BUT the follow-up fix commit a3c4a0a42e61a ("sched_ext: fix flag check for deferred callbacks") **MUST be included together**, as the original patch has a bug.
---
## **Extensive Analysis:**
### **1. Feature Availability Analysis**
Using git history analysis, I determined:

- **sched_ext was introduced in v6.12-rc1** (commit f0e1a0643a59b)
- **v6.11 and earlier kernels DO NOT have sched_ext** (verified with `git ls-tree`)
- **Only stable trees v6.12+ need this fix** (v6.12.x, v6.13.x, v6.16.x, v6.17.x all have kernel/sched/ext.c)
### **2. Semantic Code Analysis Using MCP Tools**
**Functions analyzed:**

- `mcp__semcode__find_function`: Located schedule_deferred(), balance_one(), balance_scx()
- `mcp__semcode__find_callers`: Traced the call graph to understand the impact scope
**Call chain discovered:**

```
Core scheduler → balance_scx (.balance callback)
        ↓
balance_one() [sets SCX_RQ_IN_BALANCE flag]
        ↓
ops.dispatch() [BPF scheduler callback - CAN DROP RQ LOCK]
        ↓
[RACE WINDOW - other CPUs can acquire lock]
        ↓
schedule_deferred() → queue_balance_callback()
        ↓
WARN_ON_ONCE() in rq_pin_lock() on CPU 2
```
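To make the failing invariant concrete, here is a minimal userspace C model of the balance-callback mechanism. `queue_balance_callback`, `rq_pin_lock`, and the struct layout are simplified stand-ins for the kernel's versions (the real ones live in kernel/sched/sched.h), and the assert models the WARN_ON_ONCE(); this is an illustrative sketch, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct rq;

struct balance_callback {
	struct balance_callback *next;
	void (*func)(struct rq *rq);
};

struct rq {
	struct balance_callback *balance_callback; /* pending-callback list */
};

/* Simplified: push a callback that must run before the rq lock is released. */
static void queue_balance_callback(struct rq *rq, struct balance_callback *head,
				   void (*func)(struct rq *rq))
{
	head->func = func;
	head->next = rq->balance_callback;
	rq->balance_callback = head;
}

/* Simplified pin: a freshly pinned rq must have no leftover callbacks.
 * The assert stands in for the kernel's WARN_ON_ONCE(). */
static void rq_pin_lock(struct rq *rq)
{
	assert(rq->balance_callback == NULL);
}

static void deferred_bal_cb_workfn(struct rq *rq)
{
	(void)rq;
	printf("deferred balance work ran\n");
}

int main(void)
{
	struct rq rq = { .balance_callback = NULL };
	struct balance_callback cb = { NULL, NULL };

	/* What CPU 1 does while CPU 0 has transiently dropped the lock: */
	queue_balance_callback(&rq, &cb, deferred_bal_cb_workfn);

	/* What CPU 2 does next: the leftover pending callback trips the check. */
	rq_pin_lock(&rq);
	return 0;
}
```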
**Impact scope:**

- schedule_deferred() called by: direct_dispatch()
- direct_dispatch() called by: do_enqueue_task()
- do_enqueue_task() called by: enqueue_task_scx, put_prev_task_scx, scx_bpf_reenqueue_local
- These are **core scheduler operations** triggered by normal task scheduling
- **User-space exposure**: Yes, any process using sched_ext can trigger this
### **3. Bug Severity Analysis**
**Race condition mechanism** (from the commit message and code; replayed in the sketch after this list):

1. CPU 0: balance_one() sets the IN_BALANCE flag, calls ops.dispatch()
2. ops.dispatch() **drops the rq lock** during BPF execution
3. CPU 1: Acquires the lock, calls schedule_deferred(), sees IN_BALANCE, queues a callback
4. CPU 2: Calls rq_pin_lock(), sees the pending callback → **WARN_ON_ONCE() triggers**
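The sketch below replays that interleaving single-threaded, with comments labeling which CPU performs each step in the real race. Flag names follow the kernel, but the struct and locking are simplified models I introduce for illustration, not the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

enum { SCX_RQ_IN_BALANCE = 1 << 17 };

struct rq {
	unsigned int scx_flags;
	bool callback_pending; /* models rq->balance_callback != NULL */
};

/* PRE-FIX behavior: queue immediately whenever IN_BALANCE is set. */
static void schedule_deferred(struct rq *rq)
{
	if (rq->scx_flags & SCX_RQ_IN_BALANCE)
		rq->callback_pending = true;
}

int main(void)
{
	struct rq rq = { 0 };

	rq.scx_flags |= SCX_RQ_IN_BALANCE;   /* CPU 0: balance_one() begins         */
	                                     /* CPU 0: ops.dispatch() drops rq lock */
	schedule_deferred(&rq);              /* CPU 1: enqueue path wins the lock   */

	/* CPU 2: rq_lock() + rq_pin_lock() now observes the pending callback;
	 * this assert models the WARN_ON_ONCE() firing. */
	assert(!rq.callback_pending);
	return 0;
}
```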
**Code reference** (kernel/sched/sched.h:1790-1797):

```c
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
	rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
	rf->clock_update_flags = 0;
	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback); // ← VIOLATION
}
```
**Severity**: Medium-High

- Not a crash, but a scheduler correctness issue
- Generates kernel warnings in logs
- Indicates inconsistent scheduler state
- Reported by Jakub Kicinski (well-known kernel developer)
### **4. Code Changes Analysis**
**Changes are minimal and focused:**

- kernel/sched/ext.c: +27 lines, -2 lines
- kernel/sched/sched.h: +1 line (new flag SCX_RQ_BAL_CB_PENDING)
**Behavioral change** (sketched after this list):

- BEFORE: queue_balance_callback() is called immediately when SCX_RQ_IN_BALANCE is set
- AFTER: The SCX_RQ_BAL_CB_PENDING flag is set, deferring the actual queuing until after ops.dispatch()
- NEW: maybe_queue_balance_callback() is called after balance_one() completes
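A compile-and-run sketch of the post-fix flow: schedule_deferred() only records the request, and the queuing happens once balance_scx() holds the lock stably again. The function and flag names mirror the patch, but the bodies are simplified userspace models, not the kernel source.

```c
#include <assert.h>
#include <stdio.h>

enum {
	SCX_RQ_BAL_CB_PENDING = 1 << 6,  /* new flag added by this patch */
	SCX_RQ_IN_BALANCE     = 1 << 17,
};

struct rq {
	unsigned int scx_flags;
	int callbacks_queued; /* stand-in for the real balance_callback list */
};

/* AFTER the patch: only record that a callback must be queued later. */
static void schedule_deferred(struct rq *rq)
{
	if (rq->scx_flags & SCX_RQ_IN_BALANCE) {
		rq->scx_flags |= SCX_RQ_BAL_CB_PENDING;
		return;
	}
	/* other paths (e.g. kicking the CPU) elided */
}

/* Runs in balance_scx() once the rq lock is held and repinned. */
static void maybe_queue_balance_callback(struct rq *rq)
{
	if (!(rq->scx_flags & SCX_RQ_BAL_CB_PENDING))
		return;

	rq->callbacks_queued++; /* models queue_balance_callback() */
	rq->scx_flags &= ~SCX_RQ_BAL_CB_PENDING;
}

int main(void)
{
	struct rq rq = { .scx_flags = SCX_RQ_IN_BALANCE };

	schedule_deferred(&rq);            /* during ops.dispatch(): flag only    */
	assert(rq.callbacks_queued == 0);  /* nothing visible to rq_pin_lock()    */

	rq.scx_flags &= ~SCX_RQ_IN_BALANCE;
	maybe_queue_balance_callback(&rq); /* after dispatch, lock stably held    */
	assert(rq.callbacks_queued == 1);

	printf("callback queued only after dispatch completed\n");
	return 0;
}
```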
**No architectural changes:** Just a timing adjustment to avoid the race window
### **5. Critical Follow-up Fix Required**
**Commit a3c4a0a42e61a analysis:**

```diff
- if (rq->scx.flags & SCX_RQ_BAL_PENDING)
+ if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING)
```
The original patch checks the **wrong flag** in schedule_deferred(). This means:

- Without the follow-up, multiple deferred operations could be queued incorrectly
- **Both commits must be backported together**
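To show why the two similarly named flags must not be confused, here is a small sketch of the corrected dedup check. The flag values follow the enum visible in the sched.h hunk below (SCX_RQ_BAL_PENDING's value is taken from the mainline enum and should be treated as illustrative), and the helper function is hypothetical:

```c
#include <stdio.h>

/* Flag values per enum scx_rq_flags (illustrative). */
enum {
	SCX_RQ_BAL_PENDING    = 1 << 2, /* pre-existing: balance hasn't run yet  */
	SCX_RQ_BAL_CB_PENDING = 1 << 6, /* new in this patch: callback to queue  */
};

struct scx_rq { unsigned int flags; };

/* Dedup check from schedule_deferred(). The original patch tested
 * SCX_RQ_BAL_PENDING, which tracks balance progress, not a pending
 * deferred callback; a3c4a0a42e61a switches it to the new flag. */
static int deferred_cb_already_requested(const struct scx_rq *scx)
{
	return !!(scx->flags & SCX_RQ_BAL_CB_PENDING);
}

int main(void)
{
	struct scx_rq scx = { .flags = SCX_RQ_BAL_PENDING };

	/* With the wrong flag, this rq would look "already deferred" even
	 * though no callback was ever requested. */
	printf("correct check: %d\n", deferred_cb_already_requested(&scx));
	return 0;
}
```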
### **6. Stable Tree Compliance**
✅ **Fixes important bug**: Race condition causing kernel warnings
✅ **No new features**: Pure bug fix
✅ **No architectural changes**: Small, contained fix
✅ **Minimal regression risk**: Changes only affect the sched_ext code path
✅ **Subsystem confined**: Only affects the sched_ext subsystem
### **7. Backport Recommendation**
**YES**, backport to all stable trees with sched_ext (v6.12+), with these requirements:
1. **MUST include both commits:**
   - a8ad873113d3 ("sched_ext: defer queue_balance_callback()")
   - a3c4a0a42e61a ("sched_ext: fix flag check for deferred callbacks")
2. **Target stable trees:** 6.12.x, 6.13.x, 6.14.x, 6.15.x, 6.16.x, 6.17.x
3. **Not needed for:** v6.11.x and earlier (no sched_ext code)
**Reasoning:** This is a correctness fix for a real race condition in core scheduler code that can be triggered by normal scheduling operations when using BPF extensible schedulers. The fix is small, contained, and follows all stable kernel rules.
 kernel/sched/ext.c   | 29 +++++++++++++++++++++++++++--
 kernel/sched/sched.h |  1 +
 2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 040ca7419b4f9..b796ce247fffd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -820,13 +820,23 @@ static void schedule_deferred(struct rq *rq)
 	if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
 		return;
 
+	/* Don't do anything if there already is a deferred operation. */
+	if (rq->scx.flags & SCX_RQ_BAL_PENDING)
+		return;
+
 	/*
 	 * If in balance, the balance callbacks will be called before rq lock is
 	 * released. Schedule one.
+	 *
+	 *
+	 * We can't directly insert the callback into the
+	 * rq's list: The call can drop its lock and make the pending balance
+	 * callback visible to unrelated code paths that call rq_pin_lock().
+	 *
+	 * Just let balance_one() know that it must do it itself.
 	 */
 	if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
-		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
-				       deferred_bal_cb_workfn);
+		rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
 		return;
 	}
 
@@ -2043,6 +2053,19 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
 	dspc->cursor = 0;
 }
 
+static inline void maybe_queue_balance_callback(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+
+	if (!(rq->scx.flags & SCX_RQ_BAL_CB_PENDING))
+		return;
+
+	queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+			       deferred_bal_cb_workfn);
+
+	rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
+}
+
 static int balance_one(struct rq *rq, struct task_struct *prev)
 {
 	struct scx_sched *sch = scx_root;
@@ -2190,6 +2213,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
 #endif
 	rq_repin_lock(rq, rf);
 
+	maybe_queue_balance_callback(rq);
+
 	return ret;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72fb9129afb6a..c7f67f54d4e3e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -782,6 +782,7 @@ enum scx_rq_flags {
 	SCX_RQ_BAL_KEEP		= 1 << 3, /* balance decided to keep current */
 	SCX_RQ_BYPASSING	= 1 << 4,
 	SCX_RQ_CLK_VALID	= 1 << 5, /* RQ clock is fresh and valid */
+	SCX_RQ_BAL_CB_PENDING	= 1 << 6, /* must queue a cb after dispatching */
 
 	SCX_RQ_IN_WAKEUP	= 1 << 16,
 	SCX_RQ_IN_BALANCE	= 1 << 17,