From: Suleiman Souhlal <suleiman@google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, in some situations this results in inaccurate run times for whatever subsequently uses clock_task, as it ends up being charged additional steal time that did not actually happen.

This is because there is a window between reading the elapsed time in update_rq_clock() and sampling the steal time in update_rq_clock_task(). If the VCPU gets preempted between those two points, any additional steal time is accounted to the outgoing task even though the calculated delta did not actually contain any of that "stolen" time. When this race happens, we can end up with steal time that exceeds the calculated delta, and the previous code would try to catch up that excess in future clock updates, charging it to the next, incoming task even though that task did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long, which we have seen when trying to extend steal time to cover the duration of a host suspend [0]. In that case clock_task stays frozen, and the current task keeps running for the whole duration, since its measured run time never increases. However, the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically, to prevent this race from happening in the first place, but doing so is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere, neither to the outgoing task nor the incoming task (because the "end of outgoing task" and "start of incoming task" timestamps are the same), I would argue that the right thing to do is to simply drop any excess steal time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
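For illustration, a minimal self-contained sketch of the accounting change (the standalone variable and function below are hypothetical; in the kernel this logic lives in update_rq_clock_task() and operates on rq->prev_steal_time_rq):

  #include <stdint.h>

  /* Sketch only: "prev_steal_time" stands in for rq->prev_steal_time_rq. */
  static uint64_t prev_steal_time;

  static uint64_t account_steal(uint64_t delta, uint64_t steal_clock_now)
  {
          uint64_t steal = steal_clock_now - prev_steal_time;

          if (steal > delta)
                  steal = delta;          /* clamp to the measured delta */

          /* Before the patch: prev_steal_time += steal, so clamped excess
           * was carried forward into future updates.
           * After the patch: record the sampled value, so the excess is
           * simply dropped. */
          prev_steal_time = steal_clock_now;

          return delta - steal;           /* time credited to clock_task */
  }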
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d07dc87787dff..9015c792830c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
         if (static_key_false((&paravirt_steal_rq_enabled))) {
-                steal = paravirt_steal_clock(cpu_of(rq));
+                u64 prev_steal;
+
+                steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
                 steal -= rq->prev_steal_time_rq;
 
                 if (unlikely(steal > delta))
                         steal = delta;
 
-                rq->prev_steal_time_rq += steal;
+                rq->prev_steal_time_rq = prev_steal;
                 delta -= steal;
         }
 #endif
From: Juri Lelli <juri.lelli@redhat.com>
[ Upstream commit d4742f6ed7ea6df56e381f82ba4532245fa1e561 ]
For hotplug operations, DEADLINE needs to check that there is still enough bandwidth left after removing the CPU that is going offline. However, this check is currently missing.
Restore the correct behavior by restructuring dl_bw_manage() a bit, so that overflow conditions (not enough bandwidth left) are properly checked. Also account for dl_server bandwidth, i.e. discount such bandwidth in the calculation, since NORMAL tasks will be moved away from the CPU anyway as a result of the hotplug operation.
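For illustration, a hypothetical standalone sketch of the admission test being introduced (parameter names and scaling here are simplified assumptions; in the kernel this is the dl_bw_req_deactivate case of dl_bw_manage(), shown in the hunk below, which uses __dl_overflow() and capacity scaling):

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * total_bw        - admitted DEADLINE bandwidth in the root domain
   * fair_server_bw  - dl_server bandwidth of the outgoing CPU, discounted
   *                   because NORMAL tasks are migrated away by hotplug
   * allowed_bw      - bandwidth the remaining CPUs may carry
   * remaining_cpus  - number of CPUs that stay online
   */
  static bool deactivate_would_overflow(uint64_t total_bw, uint64_t fair_server_bw,
                                        uint64_t allowed_bw, unsigned int remaining_cpus)
  {
          if (total_bw - fair_server_bw == 0)
                  return false;   /* no DEADLINE bandwidth left to re-fit */

          if (remaining_cpus == 0)
                  return true;    /* keep at least one CPU for DEADLINE tasks */

          /* Does the admitted bandwidth still fit the remaining capacity? */
          return total_bw - fair_server_bw > allowed_bw;
  }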
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/core.c     |  2 +-
 kernel/sched/deadline.c | 48 +++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h    |  2 +-
 3 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9015c792830c2..7bac05aa3d6c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8081,7 +8081,7 @@ static void cpuset_cpu_active(void)
 static int cpuset_cpu_inactive(unsigned int cpu)
 {
         if (!cpuhp_tasks_frozen) {
-                int ret = dl_bw_check_overflow(cpu);
+                int ret = dl_bw_deactivate(cpu);
 
                 if (ret)
                         return ret;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a17c23b53049c..36fd501ad9958 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3464,29 +3464,31 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
 }
 
 enum dl_bw_request {
-        dl_bw_req_check_overflow = 0,
+        dl_bw_req_deactivate = 0,
         dl_bw_req_alloc,
         dl_bw_req_free
 };
 
 static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 {
-        unsigned long flags;
+        unsigned long flags, cap;
         struct dl_bw *dl_b;
         bool overflow = 0;
+        u64 fair_server_bw = 0;
 
         rcu_read_lock_sched();
         dl_b = dl_bw_of(cpu);
         raw_spin_lock_irqsave(&dl_b->lock, flags);
 
-        if (req == dl_bw_req_free) {
+        cap = dl_bw_capacity(cpu);
+        switch (req) {
+        case dl_bw_req_free:
                 __dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu));
-        } else {
-                unsigned long cap = dl_bw_capacity(cpu);
-
+                break;
+        case dl_bw_req_alloc:
                 overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
 
-                if (req == dl_bw_req_alloc && !overflow) {
+                if (!overflow) {
                         /*
                          * We reserve space in the destination
                          * root_domain, as we can't fail after this point.
@@ -3495,6 +3497,34 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
                          */
                         __dl_add(dl_b, dl_bw, dl_bw_cpus(cpu));
                 }
+                break;
+        case dl_bw_req_deactivate:
+                /*
+                 * cpu is going offline and NORMAL tasks will be moved away
+                 * from it. We can thus discount dl_server bandwidth
+                 * contribution as it won't need to be servicing tasks after
+                 * the cpu is off.
+                 */
+                if (cpu_rq(cpu)->fair_server.dl_server)
+                        fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+
+                /*
+                 * Not much to check if no DEADLINE bandwidth is present.
+                 * dl_servers we can discount, as tasks will be moved out the
+                 * offlined CPUs anyway.
+                 */
+                if (dl_b->total_bw - fair_server_bw > 0) {
+                        /*
+                         * Leaving at least one CPU for DEADLINE tasks seems a
+                         * wise thing to do.
+                         */
+                        if (dl_bw_cpus(cpu))
+                                overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+                        else
+                                overflow = 1;
+                }
+
+                break;
         }
 
         raw_spin_unlock_irqrestore(&dl_b->lock, flags);
@@ -3503,9 +3533,9 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
         return overflow ? -EBUSY : 0;
 }
 
-int dl_bw_check_overflow(int cpu)
+int dl_bw_deactivate(int cpu)
 {
-        return dl_bw_manage(dl_bw_req_check_overflow, cpu, 0);
+        return dl_bw_manage(dl_bw_req_deactivate, cpu, 0);
 }
 
 int dl_bw_alloc(int cpu, u64 dl_bw)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2ef520513c4a..0a8ed2948e5e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -362,7 +362,7 @@ extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
 extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int dl_bw_check_overflow(int cpu);
+extern int dl_bw_deactivate(int cpu);
 extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
 /*
  * SCHED_DEADLINE supports servers (nested scheduling) with the following
From: Juri Lelli <juri.lelli@redhat.com>
[ Upstream commit 53916d5fd3c0b658de3463439dd2b7ce765072cb ]
Currently we check for bandwidth overflow potentially caused by hotplug operations at the end of sched_cpu_deactivate(), after the CPU going offline has already been removed from scheduling, the active_mask, etc. This can create issues for DEADLINE tasks, as there is a substantial race window between the start of sched_cpu_deactivate() and the moment we possibly decide to roll back the operation if dl_bw_deactivate() returns failure in cpuset_cpu_inactive(). An example is a throttled task whose replenishment timer fires while the CPU it was previously running on is already considered offline, but before dl_bw_deactivate() has had a chance to say no and the roll-back has happened.
Fix this by calling dl_bw_deactivate() first thing in sched_cpu_deactivate(), and by making that function do the required calculation as if the CPU passed as an argument were already offline.
By doing so we also simplify sched_cpu_deactivate(), as there is no need anymore for any kind of roll-back if we fail early.
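A condensed sketch of the resulting control flow (not buildable on its own; most of the real function is elided and only hinted at in comments):

  /* Heavily condensed outline of sched_cpu_deactivate() after this change. */
  int sched_cpu_deactivate(unsigned int cpu)
  {
          int ret;

          /* Check DEADLINE bandwidth first, before any state is torn down,
           * so a failure needs no roll-back. */
          ret = dl_bw_deactivate(cpu);
          if (ret)
                  return ret;

          /* ... take the CPU out of the active mask, update nohz/SMT/NUMA
           *     bookkeeping, etc. ... */

          cpuset_cpu_inactive(cpu);       /* now returns void: can no longer fail */
          return 0;
  }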
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/core.c     | 22 +++++++---------------
 kernel/sched/deadline.c | 12 ++++++++++--
 2 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bac05aa3d6c5..01e06dd3184a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8078,19 +8078,14 @@ static void cpuset_cpu_active(void)
         cpuset_update_active_cpus();
 }
 
-static int cpuset_cpu_inactive(unsigned int cpu)
+static void cpuset_cpu_inactive(unsigned int cpu)
 {
         if (!cpuhp_tasks_frozen) {
-                int ret = dl_bw_deactivate(cpu);
-
-                if (ret)
-                        return ret;
                 cpuset_update_active_cpus();
         } else {
                 num_cpus_frozen++;
                 partition_sched_domains(1, NULL, NULL);
         }
-        return 0;
 }
 
 static inline void sched_smt_present_inc(int cpu)
@@ -8152,6 +8147,11 @@ int sched_cpu_deactivate(unsigned int cpu)
         struct rq *rq = cpu_rq(cpu);
         int ret;
 
+        ret = dl_bw_deactivate(cpu);
+
+        if (ret)
+                return ret;
+
         /*
          * Remove CPU from nohz.idle_cpus_mask to prevent participating in
          * load balancing when not active
@@ -8197,15 +8197,7 @@ int sched_cpu_deactivate(unsigned int cpu)
                 return 0;
 
         sched_update_numa(cpu, false);
-        ret = cpuset_cpu_inactive(cpu);
-        if (ret) {
-                sched_smt_present_inc(cpu);
-                sched_set_rq_online(rq, cpu);
-                balance_push_set(cpu, false);
-                set_cpu_active(cpu, true);
-                sched_update_numa(cpu, true);
-                return ret;
-        }
+        cpuset_cpu_inactive(cpu);
         sched_domains_numa_masks_clear(cpu);
         return 0;
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 36fd501ad9958..ec2b66a7aae0e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3499,6 +3499,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
                 }
                 break;
         case dl_bw_req_deactivate:
+                /*
+                 * cpu is not off yet, but we need to do the math by
+                 * considering it off already (i.e., what would happen if we
+                 * turn cpu off?).
+                 */
+                cap -= arch_scale_cpu_capacity(cpu);
+
                 /*
                  * cpu is going offline and NORMAL tasks will be moved away
                  * from it. We can thus discount dl_server bandwidth
@@ -3516,9 +3523,10 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
                 if (dl_b->total_bw - fair_server_bw > 0) {
                         /*
                          * Leaving at least one CPU for DEADLINE tasks seems a
-                         * wise thing to do.
+                         * wise thing to do. As said above, cpu is not offline
+                         * yet, so account for that.
                          */
-                        if (dl_bw_cpus(cpu))
+                        if (dl_bw_cpus(cpu) - 1)
                                 overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
                         else
                                 overflow = 1;
From: Peter Zijlstra <peterz@infradead.org>
[ Upstream commit 2190966fbc14ca2cd4ea76eefeb96a47d8e390df ]
Avoid unreachable() as it can (and will in the absence of UBSAN) generate fallthrough code. Use BUG() so we get a UD2 trap (with unreachable annotation).
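Illustrative before/after of the pattern being changed, taken from the hunks below:

  if (smp_ops.stop_this_cpu) {
          smp_ops.stop_this_cpu();
          /* Was: unreachable();  the compiler may still emit fall-through code.
           * Now: BUG();          guarantees a trapping instruction (UD2 on x86),
           *                      with the same unreachable annotation. */
          BUG();
  }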
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/process.c | 2 +-
 arch/x86/kernel/reboot.c  | 2 +-
 arch/x86/kvm/svm/sev.c    | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91f..15507e739c255 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -838,7 +838,7 @@ void __noreturn stop_this_cpu(void *dummy)
 #ifdef CONFIG_SMP
         if (smp_ops.stop_this_cpu) {
                 smp_ops.stop_this_cpu();
-                unreachable();
+                BUG();
         }
 #endif
 
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 615922838c510..dc1dd3f3e67fc 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -883,7 +883,7 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 
         if (smp_ops.stop_this_cpu) {
                 smp_ops.stop_this_cpu();
-                unreachable();
+                BUG();
         }
 
         /* Assume hlt works */
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index fb854cf20ac3b..e9af87b128140 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3833,7 +3833,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
                 goto next_range;
         }
 
-        unreachable();
+        BUG();
 }
 
 static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
From: Thorsten Blum <thorsten.blum@toblux.com>
[ Upstream commit 0d3547df6934b8f9600630322799a2a76b4567d8 ]
Fixes the following Coccinelle/coccicheck warning reported by swap.cocci:
WARNING opportunity for swap()
Compile-tested only.
[Boqun: Add the report tags from Jiapeng and Abaci Robot [1].]
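For illustration, the transformation swap.cocci suggests, shown in isolation (fragment taken from the hunk below; swap() is the kernel's exchange helper macro):

  /* Open-coded three-statement exchange flagged by swap.cocci ... */
  if (r != n) {
          tmp = order[n];
          order[n] = order[r];
          order[r] = tmp;
  }

  /* ... and the equivalent using swap(): */
  if (r != n)
          swap(order[n], order[r]);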
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reported-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=11531
Link: https://lore.kernel.org/r/20241025081455.55089-1-jiapeng.chong@linux.alibaba... [1]
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20240731135850.81018-2-thorsten.blum@toblux.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/locking/test-ww_mutex.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 10a5736a21c22..b5c2a2de45788 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -402,7 +402,7 @@ static inline u32 prandom_u32_below(u32 ceil)
 static int *get_random_order(int count)
 {
         int *order;
-        int n, r, tmp;
+        int n, r;
 
         order = kmalloc_array(count, sizeof(*order), GFP_KERNEL);
         if (!order)
@@ -413,11 +413,8 @@ static int *get_random_order(int count)
 
         for (n = count - 1; n > 1; n--) {
                 r = prandom_u32_below(n + 1);
-                if (r != n) {
-                        tmp = order[n];
-                        order[n] = order[r];
-                        order[r] = tmp;
-                }
+                if (r != n)
+                        swap(order[n], order[r]);
         }
 
         return order;
From: Carlos Llamas <cmllamas@google.com>
[ Upstream commit e638072e61726cae363d48812815197a2a0e097f ]
Lockdep has a set of configs used to determine the size of the static arrays that it uses. However, the upper limit that was initially set up for these configs is too high (a 30-bit shift). This equates to several GiB of static memory for individual symbols. Using such high values leads to linker errors:
$ make defconfig
$ ./scripts/config -e PROVE_LOCKING --set-val LOCKDEP_BITS 30
$ make olddefconfig all
[...]
ld: kernel image bigger than KERNEL_IMAGE_SIZE
ld: section .bss VMA wraps around address space
Adjust the upper limits to the maximum values that avoid these issues. The need for anything more likely points to a problem elsewhere. Note that LOCKDEP_CHAINS_BITS was intentionally left out, as its upper limit had a different symptom and has already been fixed [1].
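As a rough, illustrative order-of-magnitude check (assuming an 8-byte unsigned long per stack-trace entry; exact array sizes depend on configuration):

  MAX_STACK_TRACE_ENTRIES = 1 << LOCKDEP_STACK_TRACE_BITS = 1 << 30
  static stack-trace buffer ~ (1 << 30) * 8 B = 8 GiB of .bss

which cannot fit under KERNEL_IMAGE_SIZE and matches the ".bss VMA wraps around address space" error above.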
Reported-by: J. R. Okajima <hooanon05g@gmail.com>
Closes: https://lore.kernel.org/all/30795.1620913191@jrobl/ [1]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20241024183631.643450-2-cmllamas@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 lib/Kconfig.debug | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3f9c238bb58ea..e48375fe5a50c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1511,7 +1511,7 @@ config LOCKDEP_SMALL
 config LOCKDEP_BITS
         int "Bitsize for MAX_LOCKDEP_ENTRIES"
         depends on LOCKDEP && !LOCKDEP_SMALL
-        range 10 30
+        range 10 24
         default 15
         help
          Try increasing this value if you hit "BUG: MAX_LOCKDEP_ENTRIES too low!" message.
@@ -1527,7 +1527,7 @@ config LOCKDEP_CHAINS_BITS
 config LOCKDEP_STACK_TRACE_BITS
         int "Bitsize for MAX_STACK_TRACE_ENTRIES"
         depends on LOCKDEP && !LOCKDEP_SMALL
-        range 10 30
+        range 10 26
         default 19
         help
          Try increasing this value if you hit "BUG: MAX_STACK_TRACE_ENTRIES too low!" message.
@@ -1535,7 +1535,7 @@ config LOCKDEP_STACK_TRACE_BITS
 config LOCKDEP_STACK_TRACE_HASH_BITS
         int "Bitsize for STACK_TRACE_HASH_SIZE"
         depends on LOCKDEP && !LOCKDEP_SMALL
-        range 10 30
+        range 10 26
         default 14
         help
          Try increasing this value if you need large STACK_TRACE_HASH_SIZE.
@@ -1543,7 +1543,7 @@ config LOCKDEP_STACK_TRACE_HASH_BITS
 config LOCKDEP_CIRCULAR_QUEUE_BITS
         int "Bitsize for elements in circular_queue struct"
         depends on LOCKDEP
-        range 10 30
+        range 10 26
         default 12
         help
          Try increasing this value if you hit "lockdep bfs error:-1" warning due to __cq_enqueue() failure.
From: Yazen Ghannam <yazen.ghannam@amd.com>
[ Upstream commit bee9e840609cc67d0a7d82f22a2130fb7a0a766d ]
The code implicitly operates on AMD-based systems by matching on PCI IDs. However, the use of these IDs is going away.
Add an explicit CPU vendor check instead of relying on PCI IDs.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-3-yazen.ghannam@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/amd_nb.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 9fe9972d2071b..37b8244899d89 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -582,6 +582,10 @@ static __init void fix_erratum_688(void)
 
 static __init int init_amd_nbs(void)
 {
+        if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+            boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+                return 0;
+
         amd_cache_northbridges();
         amd_cache_gart();
 