Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Fixes: 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1698
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Francisco Jerez <currojerez@riseup.net>
Cc: stable@vger.kernel.org # v5.5+
---
 drivers/gpu/drm/i915/gt/intel_rps.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c b/drivers/gpu/drm/i915/gt/intel_rps.c
index 367132092bed..47ddb25edc97 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -542,37 +542,38 @@ static void rps_set_power(struct intel_rps *rps, int new_power)
 	/* Note the units here are not exactly 1us, but 1280ns. */
 	switch (new_power) {
 	case LOW_POWER:
-		/* Upclock if more than 95% busy over 16ms */
-		ei_up = 16000;
+		/* Upclock if more than 95% busy over 160us */
+		ei_up = 160;
 		threshold_up = 95;
 
-		/* Downclock if less than 85% busy over 32ms */
-		ei_down = 32000;
+		/* Downclock if less than 85% busy over 1600us */
+		ei_down = 1600;
 		threshold_down = 85;
 		break;
 
 	case BETWEEN:
-		/* Upclock if more than 90% busy over 13ms */
-		ei_up = 13000;
+		/* Upclock if more than 90% busy over 160us */
+		ei_up = 160;
 		threshold_up = 90;
 
-		/* Downclock if less than 75% busy over 32ms */
-		ei_down = 32000;
+		/* Downclock if less than 75% busy over 1600us */
+		ei_down = 1600;
 		threshold_down = 75;
 		break;
 
 	case HIGH_POWER:
-		/* Upclock if more than 85% busy over 10ms */
-		ei_up = 10000;
+		/* Upclock if more than 85% busy over 160us */
+		ei_up = 160;
 		threshold_up = 85;
 
-		/* Downclock if less than 60% busy over 32ms */
-		ei_down = 32000;
+		/* Downclock if less than 60% busy over 1600us */
+		ei_down = 1600;
 		threshold_down = 60;
 		break;
 	}
 
-	/* When byt can survive without system hang with dynamic
+	/*
+	 * When byt can survive without system hang with dynamic
 	 * sw freq adjustments, this restriction can be lifted.
 	 */
 	if (IS_VALLEYVIEW(i915))
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
We were explicitly told to mask the interrupt generation at source to conserve power. So I would hope for a statement as to whether that is still a requirement from the HW architects; but I can't see why we would not apply the mask and that this is just paper. If the observation about forcewake tallies, would this not suggest that it is still conserving power on icl?
I haven't looked at whether I see the same phenomenon as you [missing interrupts on icl] locally, but I was expecting something like the bug report, since the observation was that render times less than the EI were causing the clocks to stay high. And I noticed your problem statement and was hopeful for a link.
They sound like two different problems. (Underclocking for animations is not icl specific.)
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Probably; as you can guess, these were pulled out of nowhere, based on the observation that the frame lengths are much shorter than the current EI and that ramping up to maxclocks within a single frame of animation would take about 4 samples per frame. Based on the reporter's observations, we do have to ramp up very quickly for a single frame of rendering in order to hit the vblank, as we are ramping down afterwards.
With a target of 4 samples within say 1ms, 160us isn't too far off the mark. (We have to allow some extra time to catch up on rendering.)
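[Rough numbers, taking at face value the in-code note that the interval units are 1280ns rather than exactly 1us:

	ei_up = 160  =>  160 * 1280ns ~= 205us per window  =>  ~4-5 complete EIs per 1ms
	read as plain microseconds: 160us per window       =>  ~6 complete EIs per 1ms

Either reading gives a single ~1ms frame enough complete windows for the ~4 samples needed to ramp up.]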
As for steady state workloads, I'm optimistic the smoothing helps. (It's harder to find steady state, unthrottled workloads!)
-Chris
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
We were explicitly told to mask the interrupt generation at source to conserve power. So I would hope for a statement as to whether that is still a requirement from the HW architects; but I can't see why we would not apply the mask and that this is just paper. If the observation about forcewake tallies, would this not suggest that it is still conserving power on icl?
Yeah, it's hard to see how disabling interrupt generation could save any additional power in a unit which is powered off -- At least on ICL where even the interrupt masking register is powered off...
I haven't looked at whether I see the same phenomenon as you [missing interrupts on icl] locally, but I was expecting something like the bug report, since the observation was that render times less than the EI were causing the clocks to stay high. And I noticed your problem statement and was hopeful for a link.
Right. In the workloads I was looking at last week the GPU would often be active for periods of time several times greater than the EI, and we would still fail to clock up.
They sound like two different problems. (Underclocking for animations is not icl specific.)
Sure. But it seems like the underclocking problem has been greatly exacerbated by 21abf0bf168d, which may have been mitigating the same ICL problem I was looking at leading to RPS interrupt loss. Maybe 21abf0bf168d wouldn't be necessary with working RPS interrupt delivery? And without 21abf0bf168d platforms earlier than ICL wouldn't have as much of an underclocking problem either.
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Probably; as you can guess, these were pulled out of nowhere, based on the observation that the frame lengths are much shorter than the current EI and that ramping up to maxclocks within a single frame of animation would take about 4 samples per frame. Based on the reporter's observations, we do have to ramp up very quickly for a single frame of rendering in order to hit the vblank, as we are ramping down afterwards.
With a target of 4 samples within say 1ms, 160us isn't too far off the mark. (We have to allow some extra time to catch up on rendering.)
How about we stop ramping down after the rendering of a single frame? It's not like we save any power by doing that, since the GPU seems to be forced to the minimum frequency for as long as it remains parked anyway. If the GPU remains idle long enough for the RPS utilization counters to drop below the threshold and qualify for a ramp-down the RPS should send us an interrupt, at which point we will ramp down the frequency.
Unconditionally ramping down on parking seems to disturb the accuracy of that RPS feedback loop, which then needs to be worked around by reducing the averaging window of the RPS to a tiny fraction of the oscillation period of any typical GPU workload, which is going to prevent the RPS from seeing a whole oscillation period before it reacts, which is almost guaranteed to have a serious energy-efficiency cost.
As for steady state workloads, I'm optimistic the smoothing helps. (It's harder to find steady state, unthrottled workloads!)
I'm curious, how is that smoothing you do in PATCH 1 better than simply setting 2x the EIs? (Which would also mean half the interrupt processing overhead as in this series)
-Chris
Quoting Francisco Jerez (2020-04-14 22:00:25)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
We were explicitly told to mask the interrupt generation at source to conserve power. So I would hope for a statement as to whether that is still a requirement from the HW architects; but I can't see why we would not apply the mask and that this is just paper. If the observation about forcewake tallies, would this not suggest that it is still conserving power on icl?
Yeah, it's hard to see how disabling interrupt generation could save any additional power in a unit which is powered off -- At least on ICL where even the interrupt masking register is powered off...
I haven't looked at whether I see the same phenomenon as you [missing interrupts on icl] locally, but I was expecting something like the bug report, since the observation was that render times less than the EI were causing the clocks to stay high. And I noticed your problem statement and was hopeful for a link.
Right. In the workloads I was looking at last week the GPU would often be active for periods of time several times greater than the EI, and we would still fail to clock up.
They sound like two different problems. (Underclocking for animations is not icl specific.)
Sure. But it seems like the underclocking problem has been greatly exacerbated by 21abf0bf168d, which may have been mitigating the same ICL problem I was looking at leading to RPS interrupt loss. Maybe 21abf0bf168d wouldn't be necessary with working RPS interrupt delivery? And without 21abf0bf168d platforms earlier than ICL wouldn't have as much of an underclocking problem either.
21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")
is necessary because we can set a boost frequency and then never run the RPS worker due to short activity cycles. See igt/i915_pm_rc6_residency for rc6-idle; the bg_load is essentially just
for (;;) { execbuf(&nop); sleep(.1); }
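[Spelled out a little more fully, the equivalent igt-style loop might look like the sketch below; fd is an open DRM device, and gem_create/gem_write/gem_execbuf/to_user_pointer are the usual igt wrappers (the batch setup here is assumed, not copied from the test):

	const uint32_t bbe = MI_BATCH_BUFFER_END;
	struct drm_i915_gem_exec_object2 obj = {};
	struct drm_i915_gem_execbuffer2 execbuf = {};

	obj.handle = gem_create(fd, 4096);	/* a trivial nop batch */
	gem_write(fd, obj.handle, 0, &bbe, sizeof(bbe));
	execbuf.buffers_ptr = to_user_pointer(&obj);
	execbuf.buffer_count = 1;

	for (;;) {
		gem_execbuf(fd, &execbuf);	/* wake the GPU, submit the nop */
		usleep(100 * 1000);		/* then idle long enough to park */
	}
]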
Without 21abf0bf168d if you trigger a waitboost just before, it never recovers and power utilisation is measurably higher.
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Probably; as you can guess, these were pulled out of nowhere, based on the observation that the frame lengths are much shorter than the current EI and that ramping up to maxclocks within a single frame of animation would take about 4 samples per frame. Based on the reporter's observations, we do have to ramp up very quickly for a single frame of rendering in order to hit the vblank, as we are ramping down afterwards.
With a target of 4 samples within say 1ms, 160us isn't too far off the mark. (We have to allow some extra time to catch up on rendering.)
How about we stop ramping down after the rendering of a single frame? It's not like we save any power by doing that, since the GPU seems to be forced to the minimum frequency for as long as it remains parked anyway. If the GPU remains idle long enough for the RPS utilization counters to drop below the threshold and qualify for a ramp-down the RPS should send us an interrupt, at which point we will ramp down the frequency.
Because it demonstrably and quite dramatically reduces power consumption for very light desktop workloads.
Unconditionally ramping down on parking seems to disturb the accuracy of that RPS feedback loop, which then needs to be worked around by reducing the averaging window of the RPS to a tiny fraction of the oscillation period of any typical GPU workload, which is going to prevent the RPS from seeing a whole oscillation period before it reacts, which is almost guaranteed to have a serious energy-efficiency cost.
There is no feedback loop in these workloads. There are no completed RPS workers, they are all cancelled if they were even scheduled. (Now we might be tempted to look at the results if they scheduled and take that into consideration instead of unconditionally downclocking.)
As for steady state workloads, I'm optimistic the smoothing helps. (It's harder to find steady state, unthrottled workloads!)
I'm curious, how is that smoothing you do in PATCH 1 better than simply setting 2x the EIs? (Which would also mean half the interrupt processing overhead as in this series)
I'm anticipating cases where the RPS worker may not run until the next jiffy or two, by which point the iir is stale. But yes, there's only a subtle difference between comparing the last 320us every 160us and comparing the last 320us every 320us.
-Chris
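[A minimal sketch of that agree-before-acting filter, with illustrative names (sketch_rps, rps_events_agree) rather than the actual i915 code; the real patch differs in detail:

	/*
	 * Only act on an up/down event if the previous completed EI
	 * requested the same direction, i.e. two consecutive 160us
	 * windows (a 320us span) must agree before we reclock.
	 */
	static bool rps_events_agree(struct sketch_rps *rps, u32 pm_iir)
	{
		u32 events = pm_iir & (GEN6_PM_RP_UP_THRESHOLD |
				       GEN6_PM_RP_DOWN_THRESHOLD);
		bool agree = events && events == rps->last_events;

		rps->last_events = events;	/* remember for the next EI */
		return agree;
	}
]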
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 22:00:25)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
We were explicitly told to mask the interrupt generation at source to conserve power. So I would hope for a statement as to whether that is still a requirement from the HW architects; but I can't see why we would not apply the mask and that this is just paper. If the observation about forcewake tallies, would this not suggest that it is still conserving power on icl?
Yeah, it's hard to see how disabling interrupt generation could save any additional power in a unit which is powered off -- At least on ICL where even the interrupt masking register is powered off...
I haven't looked at whether I see the same phenomenon as you [missing interrupts on icl] locally, but I was expecting something like the bug report, since the observation was that render times less than the EI were causing the clocks to stay high. And I noticed your problem statement and was hopeful for a link.
Right. In the workloads I was looking at last week the GPU would often be active for periods of time several times greater than the EI, and we would still fail to clock up.
They sound like two different problems. (Underclocking for animations is not icl specific.)
Sure. But it seems like the underclocking problem has been greatly exacerbated by 21abf0bf168d, which may have been mitigating the same ICL problem I was looking at leading to RPS interrupt loss. Maybe 21abf0bf168d wouldn't be necessary with working RPS interrupt delivery? And without 21abf0bf168d platforms earlier than ICL wouldn't have as much of an underclocking problem either.
21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")
is necessary because we can set a boost frequency and then never run the RPS worker due to short activity cycles. See igt/i915_pm_rc6_residency for rc6-idle; the bg_load is essentially just
for (;;) { execbuf(&nop); sleep(.1); }
Without 21abf0bf168d if you trigger a waitboost just before, it never recovers and power utilisation is measurably higher.
I can believe that, but can you confirm that the problem isn't a symptom of the PMINTRMSK issue I was talking about earlier? Assuming it isn't, couldn't you do something like 21abf0bf168d that simply stores a timestamp at GPU parking time and delays the actual frequency ramp-down to GPU unpark, at which point it's performed *only* if the time spent idle goes over a down-clock threshold based on the current "power" mode (e.g. something like the value of GEN6_RP_DOWN_EI minus GEN6_RP_DOWN_THRESHOLD) -- so the behavior of frequency ramp-down due to parking more closely matches what the RPS would have done if it had been active.
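[A sketch of that idea; sketch_rps, park_time, down_threshold_us and rps_set_freq are hypothetical names, not the real driver API:

	static void sketch_rps_park(struct sketch_rps *rps)
	{
		rps->park_time = ktime_get();	/* note when we parked */
	}

	static void sketch_rps_unpark(struct sketch_rps *rps)
	{
		s64 idle_us = ktime_us_delta(ktime_get(), rps->park_time);

		/*
		 * Only ramp down if we sat idle for longer than the RPS
		 * itself would have needed to raise a down event, i.e.
		 * roughly GEN6_RP_DOWN_EI minus GEN6_RP_DOWN_THRESHOLD
		 * for the current "power" mode.
		 */
		if (idle_us > rps->down_threshold_us)
			rps_set_freq(rps, rps->cur_freq - 1);
	}
]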
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Probably; as you can guess, these were pulled out of nowhere, based on the observation that the frame lengths are much shorter than the current EI and that ramping up to maxclocks within a single frame of animation would take about 4 samples per frame. Based on the reporter's observations, we do have to ramp up very quickly for a single frame of rendering in order to hit the vblank, as we are ramping down afterwards.
With a target of 4 samples within say 1ms, 160us isn't too far off the mark. (We have to allow some extra time to catch up on rendering.)
How about we stop ramping down after the rendering of a single frame? It's not like we save any power by doing that, since the GPU seems to be forced to the minimum frequency for as long as it remains parked anyway. If the GPU remains idle long enough for the RPS utilization counters to drop below the threshold and qualify for a ramp-down the RPS should send us an interrupt, at which point we will ramp down the frequency.
Because it demonstrably and quite dramatically reduces power consumption for very light desktop workloads.
Unconditionally ramping down on parking seems to disturb the accuracy of that RPS feedback loop, which then needs to be worked around by reducing the averaging window of the RPS to a tiny fraction of the oscillation period of any typical GPU workload, which is going to prevent the RPS from seeing a whole oscillation period before it reacts, which is almost guaranteed to have a serious energy-efficiency cost.
There is no feedback loop in these workloads. There are no completed RPS workers, they are all cancelled if they were even scheduled. (Now we might be tempted to look at the results if they scheduled and take that into consideration instead of unconditionally downclocking.)
Processing any pending RPS work seems better IMHO than unconditionally assuming they indicate a clock-down.
As for steady state workloads, I'm optimistic the smoothing helps. (It's harder to find steady state, unthrottled workloads!)
I'm curious, how is that smoothing you do in PATCH 1 better than simply setting 2x the EIs? (Which would also mean half the interrupt processing overhead as in this series)
I'm anticipating cases where the RPS worker may not run until the next jiffy or two, by which point the iir is stale. But yes, there's only a subtle difference between comparing the last 320us every 160us and comparing the last 320us every 320us.
Sounds like the benefit from smoothing is very slight, considering its accuracy and interrupt processing overhead costs.
-Chris
Francisco Jerez currojerez@riseup.net writes:
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 22:00:25)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
We were explicitly told to mask the interrupt generation at source to conserve power. So I would hope for a statement as to whether that is still a requirement from the HW architects; but I can't see why we would not apply the mask and that this is just paper. If the observation about forcewake tallies, would this not suggest that it is still conserving power on icl?
Yeah, it's hard to see how disabling interrupt generation could save any additional power in a unit which is powered off -- At least on ICL where even the interrupt masking register is powered off...
I haven't looked at whether I see the same phenomenon as you [missing interrupts on icl] locally, but I was expecting something like the bug report, since the observation was that render times less than the EI were causing the clocks to stay high. And I noticed your problem statement and was hopeful for a link.
Right. In the workloads I was looking at last week the GPU would often be active for periods of time several times greater than the EI, and we would still fail to clock up.
They sound like two different problems. (Underclocking for animations is not icl specific.)
Sure. But it seems like the underclocking problem has been greatly exacerbated by 21abf0bf168d, which may have been mitigating the same ICL problem I was looking at leading to RPS interrupt loss. Maybe 21abf0bf168d wouldn't be necessary with working RPS interrupt delivery? And without 21abf0bf168d platforms earlier than ICL wouldn't have as much of an underclocking problem either.
21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")
is necessary because we can set a boost frequency and then never run the RPS worker due to short activity cycles. See igt/i915_pm_rc6_residency for rc6-idle; the bg_load is essentially just
for (;;) { execbuf(&nop); sleep(.1); }
Without 21abf0bf168d if you trigger a waitboost just before, it never recovers and power utilisation is measurably higher.
I can believe that, but can you confirm that the problem isn't a symptom of the PMINTRMSK issue I was talking about earlier? Assuming it isn't, couldn't you do something like 21abf0bf168d that simply stores a timestamp at GPU parking time and delays the actual frequency ramp-down to GPU unpark, at which point it's performed *only* if the time spent idle goes over a down-clock threshold based on the current "power" mode (e.g. something like the value of GEN6_RP_DOWN_EI minus GEN6_RP_DOWN_THRESHOLD) -- so the behavior of frequency ramp-down due to parking more closely matches what the RPS would have done if it had been active.
If you want to match the behavior of the RPS even more closely you could use a variable delay kind of like the one I'm using for overload tracking in:
https://github.com/curro/linux/commit/0669ce23148eafa00e15fc44b9d07f7919802a...
However, using a fixed delay based on GEN6_RP_DOWN_THRESHOLD as I was describing above might be sufficient to avoid the waitboost energy efficiency problem you were talking about.
That said, it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top, since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
Probably; as you can guess, these were pulled out of nowhere, based on the observation that the frame lengths are much shorter than the current EI and that ramping up to maxclocks within a single frame of animation would take about 4 samples per frame. Based on the reporter's observations, we do have to ramp up very quickly for a single frame of rendering in order to hit the vblank, as we are ramping down afterwards.
With a target of 4 samples within say 1ms, 160us isn't too far off the mark. (We have to allow some extra time to catch up on rendering.)
How about we stop ramping down after the rendering of a single frame? It's not like we save any power by doing that, since the GPU seems to be forced to the minimum frequency for as long as it remains parked anyway. If the GPU remains idle long enough for the RPS utilization counters to drop below the threshold and qualify for a ramp-down the RPS should send us an interrupt, at which point we will ramp down the frequency.
Because it demonstrably and quite dramatically reduces power consumption for very light desktop workloads.
Unconditionally ramping down on parking seems to disturb the accuracy of that RPS feedback loop, which then needs to be worked around by reducing the averaging window of the RPS to a tiny fraction of the oscillation period of any typical GPU workload, which is going to prevent the RPS from seeing a whole oscillation period before it reacts, which is almost guaranteed to have a serious energy-efficiency cost.
There is no feedback loop in these workloads. There are no completed RPS workers, they are all cancelled if they were even scheduled. (Now we might be tempted to look at the results if they scheduled and take that into consideration instead of unconditionally downclocking.)
Processing any pending RPS work seems better IMHO than unconditionally assuming they indicate a clock-down.
As for steady state workloads, I'm optimistic the smoothing helps. (It's harder to find steady state, unthrottled workloads!)
I'm curious, how is that smoothing you do in PATCH 1 better than simply setting 2x the EIs? (Which would also mean half the interrupt processing overhead as in this series)
I'm anticipating cases where the RPS worker may not run until the next jiffy or two, by which point the iir is stale. But yes, there's only a subtle difference between comparing the last 320us every 160us and comparing the last 320us every 320us.
Sounds like the benefit from smoothing is very slight, considering its accuracy and interrupt processing overhead costs.
-Chris
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
Just stuck a pr_err() into gen11_handle_rps_events(), and momentarily before we were throttled (and so capped at 100% load), interrupts were being delivered:
[ 887.521727] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.538039] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.538253] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.538555] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.554731] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.554857] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.555604] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.571373] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.571496] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.571646] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.588199] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.588380] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.588692] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.604718] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.604937] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.621591] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.621755] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.637988] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.638166] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.638803] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.654812] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.655029] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.671423] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.671649] gen11_rps_irq_handler: { iir:20, events:20 }
That looks within expectations for the short EI settings. So many interrupts is a drag, and I would be tempted to remove the bottom-half processing.
Oh well, I should check how many of those are translated into frequency updates. I just wanted to check first whether, on this first try, I stumbled into the same loss of interrupts issue.
-Chris
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Francisco Jerez (2020-04-14 20:39:48)
Chris Wilson chris@chris-wilson.co.uk writes:
Quoting Chris Wilson (2020-04-14 17:14:23)
Try to make RPS dramatically more responsive by shrinking the evaluation intervals by a factor of 100! The issue is that as we now park the GPU rapidly upon idling, a short or bursty workload such as the composited desktop never sustains enough work to fill and complete an evaluation window. As such, the frequency we program remains stuck. This was first reported as once boosted, we never relinquished the boost [see commit 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")], but it equally applies in the other direction for bursty workloads that *need* low latency, like desktop animations.
What we could try is to preserve the incomplete EI history across idling; it is not clear whether that would be effective, nor whether the presumption of continuous workloads is accurate. A clearer path seems to be to treat it as symptomatic that we fail to handle bursty workloads with the current EI, and to seek to address that by shrinking the EI so the evaluations are run much more often.
This will likely entail more frequent interrupts, and by the time we process the interrupt in the bottom half [from inside a worker], the workload on the GPU has changed. To address the changeable nature, in the previous patch we compared the previous complete EI with the interrupt request and only up/down clock if both agree. The impact of asking for, and presumably receiving, more interrupts is still to be determined and mitigations sought. The first idea is to differentiate between up/down responsiveness and make upclocking more responsive than downclocking. This should both help thwart jitter on bursty workloads, by making it easier to increase than to decrease frequencies, and reduce the number of interrupts we would need to process.
Another worry I'd like to raise is that by reducing the EI we risk unstable evaluations. I'm not sure how accurate the HW is, and I worry about borderline workloads (if that is possible), but mainly the worry is how the HW is sampling.
The other unmentioned unknown is the latency in reprogramming the frequency. At what point does it start to become a significant factor? I'm presuming the RPS evaluation itself is free, until it has to talk across the chip to send an interrupt.
-Chris
At least on ICL the problem which this patch and 21abf0bf168d were working around seems to have to do with RPS interrupt delivery being inadvertently blocked for extended periods of time. Looking at the GPU utilization and RPS events on a graph I could see the GPU being stuck at low frequency without any RPS interrupts firing, for a time interval orders of magnitude greater than the EI we're theoretically programming today. IOW it seems like the real problem isn't that our EIs are too long, but that we're missing a bunch of them.
Just stuck a pr_err() into gen11_handle_rps_events(), and momentarily before we were throttled (and so capped at 100% load), interrupts were being delivered:
[ 887.521727] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.538039] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.538253] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.538555] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.554731] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.554857] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.555604] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.571373] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.571496] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.571646] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.588199] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.588380] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.588692] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.604718] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.604937] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.621591] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.621755] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.637988] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.638166] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.638803] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.654812] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.655029] gen11_rps_irq_handler: { iir:20, events:20 }
[ 887.671423] gen11_rps_irq_handler: { iir:10, events:10 }
[ 887.671649] gen11_rps_irq_handler: { iir:20, events:20 }
That looks within expectations for the short EI settings. So many interrupts is a drag, though, and I would be tempted to remove the bottom-half processing [the worker] entirely.

Oh well, I should check how many of those are translated into frequency updates. I just wanted to first check whether, on the first try, I had stumbled into the same loss-of-interrupts issue.
The interrupt loss seems to be non-deterministic; it's not like we lose 100% of them, since there is always a chance that the GPU is awake during the PMINTRMSK write. It's easily noticeable anyway with most GPU-bound workloads on ICL, particularly with the current long EIs.
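If the problem really is the PMINTRMSK update being lost while the GT is asleep, one conservative band-aid is to hold forcewake across the write -- a sketch only, using the driver's existing helpers, with the placement and the uncore/rps/val locals standing in for the obvious context:

	/* ensure the GT is awake so the new PM interrupt mask latches */
	intel_uncore_forcewake_get(uncore, FORCEWAKE_GT);
	intel_uncore_write(uncore, GEN6_PMINTRMSK, rps_pm_mask(rps, val));
	intel_uncore_forcewake_put(uncore, FORCEWAKE_GT);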
-Chris
Quoting Francisco Jerez (2020-04-14 20:39:48)
[...]
The solution I was suggesting for this on IRC during the last couple of days wouldn't have any of the drawbacks you mention above; I'll send it to this list in a moment if the general approach seems okay to you:
https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b...
Confirmed that the PMINTRMSK write is sufficiently delayed to cause problems.
That said it *might* be helpful to reduce the EIs we use right now in addition, but a factor of 100 seems over the top since that will cause the evaluation interval to be roughly two orders of magnitude shorter than the rendering time of a typical frame, which can lead to massive oscillations even in workloads that use a small fraction of the GPU time to render a single frame. Maybe we want something in between?
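For scale (simple arithmetic, assuming a typical 60Hz frame):

	frame time  ~ 16.7ms
	ei_up       = 160 * 1280ns ~ 0.205ms
	ratio       ~ 81x, i.e. close to two orders of magnitude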
And confirmed that both are problems :) The EIs are just too large to handle the bursty workload. That is, with the 10+ms EIs, we do not see any interrupts during the simple animations as we park within an EI.

-Chris
Quoting Chris Wilson (2020-04-15 08:37:07)
[...]
Confirmed that the PMINTRMSK write is sufficiently delayed to cause problems.
[ 68.462428] rcs0: UP interrupt not recorded for spinner, pm_iir:0, prev_up:2ee0, up_threshold:2c88, up_ei:2ee0

Have selftest \o/ FW fixes selftest.

-Chris
Hi Chris,
On Tue, Apr 14, 2020 at 05:14:23PM +0100, Chris Wilson wrote:
[...]
This is quite a dramatic change.

Can we have a more dynamic selection of the interval depending on the frequency we are running? We could reduce the interval at low frequencies and increase it at high frequencies -- something like the sketch below.
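A minimal sketch of that idea (purely illustrative -- the helper name and the endpoint values are assumptions; min_freq/max_freq are the existing intel_rps fields):

	/* Scale the evaluation interval with the requested frequency:
	 * short EIs at low clocks so bursty work upclocks quickly,
	 * longer EIs at high clocks to damp oscillation.
	 * Units are the same ~1280ns ticks used by rps_set_power().
	 */
	static unsigned int rps_ei_for_freq(const struct intel_rps *rps, u8 freq)
	{
		const unsigned int ei_short = 160, ei_long = 1600;

		if (rps->max_freq <= rps->min_freq)
			return ei_short;

		return ei_short + (ei_long - ei_short) *
		       (freq - rps->min_freq) /
		       (rps->max_freq - rps->min_freq);
	}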
Andi