On 11/21/25 15:37, Yu-Che Cheng wrote:
Hi Vincent,
On Fri, Nov 21, 2025 at 10:00 PM Vincent Guittot vincent.guittot@linaro.org wrote:
On Fri, 21 Nov 2025 at 04:55, Sergey Senozhatsky senozhatsky@chromium.org wrote:
Hi Christian,
On (25/11/20 10:15), Christian Loehle wrote:
On 11/20/25 04:45, Sergey Senozhatsky wrote:
Hi,
We are observing a performance regression on one of our arm64 boards.
We tracked it down to the linux-6.6.y commit ada8d7fa0ad4 ("sched/cpufreq:
You mentioned that you tracked it down to linux-6.6.y, but which kernel are you using?
We're using the ChromeOS 6.6 kernel, which is currently based on linux-v6.6.99. We've also confirmed that the performance regression still happens with exactly the same scheduler code (`kernel/sched`) as upstream v6.6.99, compared to the scheduler code from v6.6.88.
Rework schedutil governor performance estimation").
UI speedometer benchmark:
w/commit: 395 +/-38
w/o commit: 439 +/-14
Hi Sergey,
Would be nice to get some details. What board?
It's an MT8196 chromebook.
What do the OPPs look like?
How do I find that out?
Look in /sys/kernel/debug/opp/cpu*/, or read /sys/devices/system/cpu/cpufreq/policy*/scaling_available_frequencies together with related_cpus in the same policy directory.
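For anyone who prefers a script over browsing sysfs by hand, here is a minimal sketch of reading the two cpufreq files mentioned above. The helper name `read_policies` is made up for illustration, and the sketch assumes the cpufreq driver exposes `scaling_available_frequencies`, which not every driver does.

```python
from pathlib import Path


def read_policies(root="/sys/devices/system/cpu/cpufreq"):
    """Return {policy_name: (cpus, freqs_khz)} for each cpufreq policy under root."""
    policies = {}
    for policy in sorted(Path(root).glob("policy*")):
        cpus_file = policy / "related_cpus"
        freqs_file = policy / "scaling_available_frequencies"
        if not cpus_file.exists():
            continue
        # related_cpus is a space-separated list of CPU numbers sharing this policy.
        cpus = [int(c) for c in cpus_file.read_text().split()]
        # The frequency list is optional; leave it empty when the file is missing.
        freqs = []
        if freqs_file.exists():
            freqs = sorted(int(f) for f in freqs_file.read_text().split())
        policies[policy.name] = (cpus, freqs)
    return policies


if __name__ == "__main__":
    for name, (cpus, freqs) in read_policies().items():
        print(name, "cpus:", cpus, "freqs (kHz):", freqs)
```

The root directory is a parameter only so the helper can be pointed at a copied sysfs tree when the device itself is not at hand.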
The energy model on the device is:
CPU0-3:
+------------+------------+
| freq (kHz) | power (uW) |
+============+============+
|     339000 |      34362 |
|     400000 |      42099 |
|     500000 |      52907 |
|     600000 |      63795 |
|     700000 |      74747 |
|     800000 |      88445 |
|     900000 |     101444 |
|    1000000 |     120377 |
|    1100000 |     136859 |
|    1200000 |     154162 |
|    1300000 |     174843 |
|    1400000 |     196833 |
|    1500000 |     217052 |
|    1600000 |     247844 |
|    1700000 |     281464 |
|    1800000 |     321764 |
|    1900000 |     352114 |
|    2000000 |     383791 |
|    2100000 |     421809 |
|    2200000 |     461767 |
|    2300000 |     503648 |
|    2400000 |     540731 |
+------------+------------+
CPU4-6:
+------------+------------+
| freq (kHz) | power (uW) |
+============+============+
|     622000 |     131738 |
|     700000 |     147102 |
|     800000 |     172219 |
|     900000 |     205455 |
|    1000000 |     233632 |
|    1100000 |     254313 |
|    1200000 |     288843 |
|    1300000 |     330863 |
|    1400000 |     358947 |
|    1500000 |     400589 |
|    1600000 |     444247 |
|    1700000 |     497941 |
|    1800000 |     539959 |
|    1900000 |     584011 |
|    2000000 |     657172 |
|    2100000 |     746489 |
|    2200000 |     822854 |
|    2300000 |     904913 |
|    2400000 |    1006581 |
|    2500000 |    1115458 |
|    2600000 |    1205167 |
|    2700000 |    1330751 |
|    2800000 |    1450661 |
|    2900000 |    1596740 |
|    3000000 |    1736568 |
|    3100000 |    1887001 |
|    3200000 |    2048877 |
|    3300000 |    2201141 |
+------------+------------+
CPU7:
+------------+------------+
| freq (kHz) | power (uW) |
+============+============+
|     798000 |     320028 |
|     900000 |     330714 |
|    1000000 |     358108 |
|    1100000 |     384730 |
|    1200000 |     410669 |
|    1300000 |     438355 |
|    1400000 |     469865 |
|    1500000 |     502740 |
|    1600000 |     531645 |
|    1700000 |     560380 |
|    1800000 |     588902 |
|    1900000 |     617278 |
|    2000000 |     645584 |
|    2100000 |     698653 |
|    2200000 |     744179 |
|    2300000 |     810471 |
|    2400000 |     895816 |
|    2500000 |     985234 |
|    2600000 |    1097802 |
|    2700000 |    1201162 |
|    2800000 |    1332076 |
|    2900000 |    1439847 |
|    3000000 |    1575917 |
|    3100000 |    1741987 |
|    3200000 |    1877346 |
|    3300000 |    2161512 |
|    3400000 |    2437879 |
|    3500000 |    2933742 |
|    3600000 |    3322959 |
|    3626000 |    3486345 |
+------------+------------+
Does this system use uclamp during the benchmark? How?
How do I find that out?
It can be set per cgroup via /sys/fs/cgroup/system.slice/<name>/cpu.uclamp.min|max, or per task with sched_setattr().
You are most probably using it, because the main purpose of ada8d7fa0ad4 is to remove a wrong overestimate of the OPP.
For the speedometer case, yes, we set uclamp.min to 20 for the whole browser and UI (chrome). There are no system-wide uclamp settings, though.
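As a rough illustration of checking the per-cgroup knobs named above, the sketch below reads `cpu.uclamp.min` and `cpu.uclamp.max` from one cgroup v2 directory. The helper name `read_uclamp` and the example path are hypothetical; the real cgroup used for the browser on this device is not stated in the thread.

```python
from pathlib import Path


def read_uclamp(cgroup_dir):
    """Return the raw cpu.uclamp.min/max strings of a cgroup v2 directory.

    The kernel reports these as percentages (for example "20.00") or "max".
    A value of None means the file is not present in that directory.
    """
    result = {}
    for bound in ("min", "max"):
        path = Path(cgroup_dir) / f"cpu.uclamp.{bound}"
        result[bound] = path.read_text().strip() if path.exists() else None
    return result


if __name__ == "__main__":
    # Hypothetical cgroup path, used only as a placeholder.
    print(read_uclamp("/sys/fs/cgroup/system.slice/example.service"))
```

Per-task clamps set through sched_setattr() do not show up in these files, so an empty result here does not by itself rule out uclamp.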
(From Sergey's traces)
Per-cluster time-weighted average frequency (base => revert):
little (cpu0–3, max 2.4 GHz): 0.746 GHz => 1.132 GHz (+51.6%)
mid (cpu4–6, max 3.3 GHz): 1.043 GHz => 1.303 GHz (+24.9%)
big (cpu7, max 3.626 GHz): 2.563 GHz => 3.116 GHz (+21.6%)

And in particular time spent at OPPs (base => revert):
Big core at upper 10%: 29.6% => 61.5%
little cluster at 339 MHz: 50.1% => 1.0%
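For readers unfamiliar with the metric, a time-weighted average frequency can be sketched as below. This is only an illustration of the arithmetic, not the script used on the traces; the function name and the residency samples are made up.

```python
def time_weighted_avg_khz(samples):
    """Average frequency weighted by the time spent at each frequency.

    samples is a list of (freq_khz, duration_s) residency pairs for one cluster.
    """
    total_time = sum(duration for _, duration in samples)
    if total_time == 0:
        raise ValueError("no residency time recorded")
    return sum(freq * duration for freq, duration in samples) / total_time


if __name__ == "__main__":
    # Hypothetical residency data, not taken from the traces discussed here.
    samples = [(339000, 5.0), (1200000, 3.0), (2400000, 2.0)]
    print(time_weighted_avg_khz(samples))
```

A cluster that sits at its lowest OPP for half the time can still show a fairly high average if the remaining time is spent near the top of the table, which is why the residency split is reported alongside the averages.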
Interesting that a uclamp.min of 20 (which shouldn't really have much effect on the big CPU at all, with or without headroom, AFAICS) makes such a big difference here.
But we also found other performance regressions in an Android guest VM, where there's no uclamp for the VM and vCPU processes on the host side. In particular, RAR extraction throughput drops by about 20% in the RAR app (from RARLAB). It's hard to tell, though, whether this is a side effect of the UI regression, as the UI is running at the same time.
I'd be inclined to say that is because of the vastly different DVFS from the UI workload, yes.
On 11/21/25 16:35, Christian Loehle wrote:
And in particular time spent at OPPs (base => revert):
Big core at upper 10%: 29.6% => 61.5%
little cluster at 339 MHz: 50.1% => 1.0%
Sorry, the little-cluster line should be 1.0% => 50.1%.
Interesting that a uclamp.min of 20 (which shouldn't really have much effect on the big CPU at all, with or without headroom, AFAICS) makes such a big difference here.
Can we get a sched_switch / sched_migrate / sched_wakeup trace for this? Perfetto would also do if that is better for you.
On Fri, 21 Nov 2025 at 17:35, Christian Loehle christian.loehle@arm.com wrote:
Interesting that a uclamp.min of 20 (which shouldn't really have much effect on the big CPU at all, with or without headroom, AFAICS) makes such a big difference here.
Yu-Che, could you give us the capacity-dmips-mhz of each CPU (it's in the DT)?
It could be that the difference is about the same once scaled by capacity:
the diff for big is 21%,
the diff for mid is (24% * mid capacity ratio) ~ 20%,
and probably for little too: (51% * little capacity ratio) ~ 20%.
The patch fixes a problem where the min clamping was sometimes wrongly added to the utilization.
On Sat, Nov 22, 2025 at 1:58 AM Vincent Guittot vincent.guittot@linaro.org wrote:
Yu-Che, could you give us the capacity-dmips-mhz of each CPU (it's in the DT)?
It could be that the difference is about the same once scaled by capacity:
the diff for big is 21%,
the diff for mid is (24% * mid capacity ratio) ~ 20%,
and probably for little too: (51% * little capacity ratio) ~ 20%.
The patch fixes a problem where the min clamping was sometimes wrongly added to the utilization.
Sure. The capacity-dmips-mhz values for CPU0-3, CPU4-6, and CPU7 are 714, 835, and 1024, respectively. The corresponding cpu_capacity values are 472, 759, and 1024. It looks like the frequency differences are indeed close to 20% after multiplying by the CPU capacity ratio.
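The arithmetic behind these numbers can be sketched as follows. The sketch assumes that cpu_capacity scales as capacity-dmips-mhz times the maximum frequency, normalized so that the largest CPU gets 1024 and truncated to an integer; that is a simplification of what the kernel does, and the helper names are made up. The frequency gains are the per-cluster percentages quoted earlier in this thread.

```python
SCHED_CAPACITY_SCALE = 1024

# dmips_mhz and freq_gain come from this thread; max_khz from the energy model.
CLUSTERS = {
    "little": {"dmips_mhz": 714, "max_khz": 2400000, "freq_gain": 0.516},
    "mid": {"dmips_mhz": 835, "max_khz": 3300000, "freq_gain": 0.249},
    "big": {"dmips_mhz": 1024, "max_khz": 3626000, "freq_gain": 0.216},
}


def cpu_capacities(clusters):
    """Capacity per cluster, assuming capacity ~ dmips_mhz * max frequency."""
    raw = {name: c["dmips_mhz"] * c["max_khz"] for name, c in clusters.items()}
    top = max(raw.values())
    return {name: value * SCHED_CAPACITY_SCALE // top for name, value in raw.items()}


def scaled_gains(clusters):
    """Frequency gain of each cluster multiplied by its capacity ratio."""
    caps = cpu_capacities(clusters)
    return {
        name: clusters[name]["freq_gain"] * caps[name] / SCHED_CAPACITY_SCALE
        for name in clusters
    }


if __name__ == "__main__":
    print(cpu_capacities(CLUSTERS))
    for name, gain in scaled_gains(CLUSTERS).items():
        print(f"{name}: {gain:.1%}")
```

Under these assumptions the capacities come out as 472, 759, and 1024, and the scaled gains land at roughly 24%, 18%, and 22%, which is in line with the "close to 20%" observation.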
On Fri, 21 Nov 2025 at 17:43, Christian Loehle christian.loehle@arm.com wrote:
And in particular time spent at OPPs (base => revert):
Big core at upper 10%: 29.6% => 61.5%
little cluster at 339 MHz: 50.1% => 1.0%
Sorry, the little-cluster line should be 1.0% => 50.1%.
Keeping in mind that uclamp min is at 20% (~204), this means that tasks are not placed on the little cluster after the revert, so the little cluster goes back to a low frequency. But 204 is less than half of the little capacity.
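The comparison above can be checked with a couple of lines. This sketch assumes the clamp is a plain percentage of the 1024 capacity scale and ignores any rounding the kernel applies; the little capacity of 472 is the value reported in this thread, and the helper name is made up.

```python
SCHED_CAPACITY_SCALE = 1024
LITTLE_CAPACITY = 472  # cpu_capacity of CPU0-3 as reported in this thread


def uclamp_pct_to_util(pct):
    """Convert a uclamp percentage to the 0..1024 utilization scale (no rounding)."""
    return pct * SCHED_CAPACITY_SCALE / 100


if __name__ == "__main__":
    util = uclamp_pct_to_util(20)
    print(f"uclamp min 20% ~ {util:.1f}")
    print(f"share of little capacity: {util / LITTLE_CAPACITY:.1%}")
    print("below half of little capacity:", util < LITTLE_CAPACITY / 2)
```

With these inputs the clamp is about 205 out of 1024, or roughly 43% of one little CPU, so the clamp alone would still fit on the little cluster.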
On Mon, 24 Nov 2025 at 17:30, Vincent Guittot vincent.guittot@linaro.org wrote:
As Christian said, it would be good to have a trace with scheduler events. Task and CPU utilization would be interesting too; Perfetto should record all of that for you.
Hi Christian and Vincent,
On Tue, Nov 25, 2025 at 12:41 AM Vincent Guittot vincent.guittot@linaro.org wrote:
On Mon, 24 Nov 2025 at 17:30, Vincent Guittot vincent.guittot@linaro.org wrote:
On Fri, 21 Nov 2025 at 17:43, Christian Loehle christian.loehle@arm.com wrote:
On 11/21/25 16:35, Christian Loehle wrote:
On 11/21/25 15:37, Yu-Che Cheng wrote:
Hi Vincent,
On Fri, Nov 21, 2025 at 10:00 PM Vincent Guittot vincent.guittot@linaro.org wrote:
On Fri, 21 Nov 2025 at 04:55, Sergey Senozhatsky senozhatsky@chromium.org wrote: > > Hi Christian, > > On (25/11/20 10:15), Christian Loehle wrote: >> On 11/20/25 04:45, Sergey Senozhatsky wrote: >>> Hi, >>> >>> We are observing a performance regression on one of our arm64
boards.
>>> We tracked it down to the linux-6.6.y commit ada8d7fa0ad4
("sched/cpufreq:
You mentioned that you tracked down to linux-6.6.y but which kernel are you using ?
We're using ChromeOS 6.6 kernel, which is currently on top of linux-v6.6.99. But we've tested that the performance regression still happens on exactly the same scheduler codes (`kernel/sched`) as upstream v6.6.99, compared to those on v6.6.88.
Rework schedutil governor performance estimation").
UI speedometer benchmark:
w/commit: 395 +/-38
w/o commit: 439 +/-14
Hi Sergey,
Would be nice to get some details. What board?
It's an MT8196 chromebook.
What do the OPPs look like?
How do I find that out?
In /sys/kernel/debug/opp/cpu*/ or /sys/devices/system/cpu/cpufreq/policy*/scaling_available_frequencies with related_cpus
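To illustrate the sysfs route mentioned above, here is a minimal Python sketch that lists, per cpufreq policy, the CPUs it covers and the frequencies it exposes. The function name `read_cpufreq_policies` and the returned dictionary layout are only illustrative choices, and the sketch assumes the driver exposes `scaling_available_frequencies` (not every driver does).

```python
from pathlib import Path


def read_cpufreq_policies(root="/sys/devices/system/cpu/cpufreq"):
    """Map each policy directory to its related CPUs and frequency list (kHz).

    Hypothetical helper for illustration; policies that lack either sysfs
    file are skipped rather than treated as an error.
    """
    policies = {}
    for policy in sorted(Path(root).glob("policy*")):
        cpus_file = policy / "related_cpus"
        freqs_file = policy / "scaling_available_frequencies"
        if not (cpus_file.is_file() and freqs_file.is_file()):
            continue
        policies[policy.name] = {
            # Both files hold whitespace-separated integers.
            "related_cpus": [int(c) for c in cpus_file.read_text().split()],
            "frequencies_khz": sorted(int(f) for f in freqs_file.read_text().split()),
        }
    return policies
```

Calling `read_cpufreq_policies()` on the board and printing the result should give one entry per cluster, which can then be compared against the energy model tables.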
The energy model on the device is:
CPU0-3:
+------------+------------+
| freq (khz) | power (uw) |
+============+============+
|     339000 |      34362 |
|     400000 |      42099 |
|     500000 |      52907 |
|     600000 |      63795 |
|     700000 |      74747 |
|     800000 |      88445 |
|     900000 |     101444 |
|    1000000 |     120377 |
|    1100000 |     136859 |
|    1200000 |     154162 |
|    1300000 |     174843 |
|    1400000 |     196833 |
|    1500000 |     217052 |
|    1600000 |     247844 |
|    1700000 |     281464 |
|    1800000 |     321764 |
|    1900000 |     352114 |
|    2000000 |     383791 |
|    2100000 |     421809 |
|    2200000 |     461767 |
|    2300000 |     503648 |
|    2400000 |     540731 |
+------------+------------+
CPU4-6:
+------------+------------+
| freq (khz) | power (uw) |
+============+============+
|     622000 |     131738 |
|     700000 |     147102 |
|     800000 |     172219 |
|     900000 |     205455 |
|    1000000 |     233632 |
|    1100000 |     254313 |
|    1200000 |     288843 |
|    1300000 |     330863 |
|    1400000 |     358947 |
|    1500000 |     400589 |
|    1600000 |     444247 |
|    1700000 |     497941 |
|    1800000 |     539959 |
|    1900000 |     584011 |
|    2000000 |     657172 |
|    2100000 |     746489 |
|    2200000 |     822854 |
|    2300000 |     904913 |
|    2400000 |    1006581 |
|    2500000 |    1115458 |
|    2600000 |    1205167 |
|    2700000 |    1330751 |
|    2800000 |    1450661 |
|    2900000 |    1596740 |
|    3000000 |    1736568 |
|    3100000 |    1887001 |
|    3200000 |    2048877 |
|    3300000 |    2201141 |
+------------+------------+
CPU7:
+------------+------------+
| freq (khz) | power (uw) |
+============+============+
|     798000 |     320028 |
|     900000 |     330714 |
|    1000000 |     358108 |
|    1100000 |     384730 |
|    1200000 |     410669 |
|    1300000 |     438355 |
|    1400000 |     469865 |
|    1500000 |     502740 |
|    1600000 |     531645 |
|    1700000 |     560380 |
|    1800000 |     588902 |
|    1900000 |     617278 |
|    2000000 |     645584 |
|    2100000 |     698653 |
|    2200000 |     744179 |
|    2300000 |     810471 |
|    2400000 |     895816 |
|    2500000 |     985234 |
|    2600000 |    1097802 |
|    2700000 |    1201162 |
|    2800000 |    1332076 |
|    2900000 |    1439847 |
|    3000000 |    1575917 |
|    3100000 |    1741987 |
|    3200000 |    1877346 |
|    3300000 |    2161512 |
|    3400000 |    2437879 |
|    3500000 |    2933742 |
|    3600000 |    3322959 |
|    3626000 |    3486345 |
+------------+------------+
Does this system use uclamp during the benchmark? How?
How do I find that out?
It can be set per cgroup via /sys/fs/cgroup/system.slice/<name>/cpu.uclamp.min|max, or per task with sched_setattr()
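As a rough way to check the cgroup side, the following Python sketch walks a cgroup v2 tree and collects the `cpu.uclamp.min` and `cpu.uclamp.max` values it finds. The helper name `read_cgroup_uclamp` is hypothetical, and the sketch does not cover per-task values set through sched_setattr().

```python
from pathlib import Path


def read_cgroup_uclamp(root="/sys/fs/cgroup"):
    """Return {cgroup path relative to root: (uclamp_min, uclamp_max)}.

    Values are kept as the raw strings found in the files (a percentage
    such as "20.00", or "max"). Hypothetical helper for illustration.
    """
    result = {}
    for min_file in sorted(Path(root).rglob("cpu.uclamp.min")):
        cgroup = min_file.parent
        max_file = cgroup / "cpu.uclamp.max"
        result[str(cgroup.relative_to(root))] = (
            min_file.read_text().strip(),
            max_file.read_text().strip() if max_file.is_file() else None,
        )
    return result
```

Any cgroup whose minimum differs from `0.00`, or whose maximum differs from `max`, has a clamp applied.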
You most probably use it, because the main purpose of ada8d7fa0ad4 is to remove a wrong overestimate of the OPP
For the speedometer case, yes, we set uclamp.min to 20 for the whole browser and UI (chrome). There are no system-wide uclamp settings, though.
(From Sergey's traces)
Per-cluster time-weighted average frequency, base => revert:
little (cpu0–3, max 2.4 GHz): 0.746 GHz => 1.132 GHz (+51.6%)
mid (cpu4–6, max 3.3 GHz): 1.043 GHz => 1.303 GHz (+24.9%)
big (cpu7, max 3.626 GHz): 2.563 GHz => 3.116 GHz (+21.6%)
And in particular, time spent at OPPs (base => revert):
Big core at upper 10%: 29.6% => 61.5%
little cluster at 339 MHz: 50.1% => 1.0%
Sorry, should be 1.0% => 50.1%
Bearing in mind that uclamp min is at 20% (~204), this means that the tasks are not placed on the little cluster after the revert, so the little cluster goes back to low frequency. But 204 is less than half of the little capacity
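For reference, the "20% ~204" figure follows from scaling the percentage onto the scheduler's 1024-point capacity scale. A minimal sketch of that arithmetic, where the function name is hypothetical and the rounding the kernel applies to the result is not modelled:

```python
SCHED_CAPACITY_SCALE = 1024  # full capacity on the scheduler's scale


def uclamp_percent_to_capacity(percent):
    """Scale a uclamp percentage (0-100) to the 0-1024 capacity range.

    Illustrative arithmetic only; the exact rounding done by the kernel
    is an implementation detail that is not reproduced here.
    """
    return percent * SCHED_CAPACITY_SCALE / 100.0


print(uclamp_percent_to_capacity(20))  # 204.8
```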
As Christian said, it would be good to have a trace with scheduler events. Having task and cpu util would be interesting too: perfetto should record all that for you
Here are the Perfetto traces taken during the Speedometer 2.0 workload. Both are based on the ChromeOS 6.6 kernel, with the `kernel/sched` directory checked out to upstream v6.6.88 or v6.6.99 respectively.
v6.6.88 (433 score): https://ui.perfetto.dev/#%21/?s=44cd047c79a32fdba44583312ec5118f1e1162f2
v6.6.99 (408 score): https://ui.perfetto.dev/#%21/?s=529eef4a60ddc921907ed380d901e47ddf3d42c9
Also attached is the time_in_state of the CPU7 frequencies during the workload, which looks highly correlated with the Speedometer performance, since its main thread runs on CPU7 most of the time.
v6.6.88 (433 score):
3626000 567
3600000 54
3500000 54
3400000 88
3300000 77
3200000 61
3100000 80
3000000 61
2900000 75
2800000 59
2700000 51
2600000 58
2500000 54
2400000 57
2300000 49
2200000 42
2100000 37
2000000 397
1900000 0
1800000 0
1700000 0
1600000 0
1500000 0
1400000 0
1300000 0
1200000 0
1100000 0
1000000 0
900000 0
798000 0
v6.6.99 (408 score):
3626000 459
3600000 55
3500000 46
3400000 88
3300000 53
3200000 80
3100000 82
3000000 111
2900000 90
2800000 83
2700000 69
2600000 61
2500000 50
2400000 73
2300000 66
2200000 47
2100000 42
2000000 487
1900000 0
1800000 0
1700000 0
1600000 0
1500000 0
1400000 0
1300000 0
1200000 0
1100000 0
1000000 0
900000 0
798000 0
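To compare two such time_in_state dumps more directly, a small Python sketch can reduce each one to a time-weighted average frequency and to the share of time spent at or above a given frequency. The function names are hypothetical, and the input is assumed to be whitespace-separated "frequency time" pairs as pasted above; the sample string in the demo is made up to keep the arithmetic obvious.

```python
def parse_time_in_state(text):
    """Parse whitespace-separated 'freq_khz time' pairs into a list of tuples."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of fields (freq/time pairs)")
    return [(int(tokens[i]), int(tokens[i + 1])) for i in range(0, len(tokens), 2)]


def weighted_average_khz(entries):
    """Average frequency, weighted by the time spent at each frequency."""
    total = sum(time for _, time in entries)
    if total == 0:
        raise ValueError("no time recorded")
    return sum(freq * time for freq, time in entries) / total


def share_at_or_above(entries, min_khz):
    """Fraction of the recorded time spent at frequencies >= min_khz."""
    total = sum(time for _, time in entries)
    if total == 0:
        raise ValueError("no time recorded")
    return sum(time for freq, time in entries if freq >= min_khz) / total


# Made-up sample: 1 unit at 1.0 GHz, 3 units at 2.0 GHz.
sample = parse_time_in_state("1000000 1\n2000000 3\n")
print(weighted_average_khz(sample))        # 1750000.0
print(share_at_or_above(sample, 2000000))  # 0.75
```

Applied to the two CPU7 dumps above, the counts sum to 1921 (v6.6.88) and 2042 (v6.6.99), so the share of time at 3626000 works out to roughly 29.5% versus 22.5%.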
Interesting that a uclamp.min of 20 (which shouldn't really have much effect on the big CPU at all, with or without headroom AFAICS?) makes such a big difference here?
Can we get a sched_switch / sched_migrate / sched_wakeup trace for this? Perfetto would also do if that is better for you.
But we also found other performance regressions in an Android guest VM, where there's no uclamp for the VM and vCPU processes from the host side. In particular, the RAR extraction throughput drops by about 20% in the RAR app (from RARLAB). It's hard to tell, though, whether this is a side effect of the UI regression, as the UI is running at the same time.
I'd be inclined to say that is because of the vastly different DVFS from the UI workload, yes.
Best regards, Yu-Che
linux-stable-mirror@lists.linaro.org