During USB transfers on the SC8280XP, __arm_smmu_tlb_sync() is seen to typically take 1-2ms to complete. As expected, this results in poor performance, for which the proposed mitigation has been to run the IOMMU in non-strict mode (boot with iommu.strict=0).
This turns out to be related to the SAFE logic, and programming the QOS SAFE values in the DPU (per suggestion from Rob and Doug) reduces the TLB sync time to below 10us, which means significantly less time spent with interrupts disabled and a significant boost in throughput.
Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
Cc: stable@vger.kernel.org
Suggested-by: Doug Anderson <dianders@chromium.org>
Suggested-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
---
 drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
index 1ccd1edd693c..4c0528794e7a 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
+++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
@@ -406,6 +406,7 @@ static const struct dpu_perf_cfg sc8280xp_perf_data = {
 	.min_llcc_ib = 0,
 	.min_dram_ib = 800000,
 	.danger_lut_tbl = {0xf, 0xffff, 0x0},
+	.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
 	.qos_lut_tbl = {
 		{.nentry = ARRAY_SIZE(sc8180x_qos_linear),
 		.entries = sc8180x_qos_linear
---
base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452
Best regards,
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
[...]
+	.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
What do these values represent? And how SAFE is it to override the default QoS values?
I'm not too familiar with the MSM DRM driver, so please excuse my ignorance.
- Mani
On Tue, Oct 31, 2023 at 1:19 AM Manivannan Sadhasivam manivannan.sadhasivam@linaro.org wrote:
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
[...]
.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
What do these values represent? And how SAFE is it to override the default QoS values?
I'm not too familiar with the MSM DRM driver, so please excuse my ignorance.
For realtime DMA (like scanout), there is a sort of "safe" signal from the DMA master to the SMMU to indicate when it has enough data buffered for it to be safe to do a tlbinv without risking underflow. When things aren't "safe", the SMMU will stall the tlbinv. This is just configuring the thresholds for the "safe" signal.
BR, -R
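As a rough illustration of the threshold idea described above, the sketch below decodes one entry of the table from the patch. It rests on assumptions that are not stated in this thread: that each entry is a 16-bit mask with one bit per buffer fill level, that a set bit means the "safe" signal is asserted at that level, and that the three entries correspond to linear, macrotile and non-realtime usage. The helper names are hypothetical and are not part of the msm driver.

```c
#include <stdint.h>

/* Assumption: one bit per fill level, 16 levels per LUT entry. */
#define QOS_LUT_LEVELS 16

/* Hypothetical helper: 1 if "safe" is flagged at this fill level, else 0. */
static inline int lut_level_is_safe(uint32_t lut, unsigned int level)
{
	if (level >= QOS_LUT_LEVELS)
		return 0;
	return (int)((lut >> level) & 1u);
}

/* Hypothetical helper: lowest fill level flagged as safe, or -1 if none. */
static inline int lut_lowest_safe_level(uint32_t lut)
{
	unsigned int level;

	for (level = 0; level < QOS_LUT_LEVELS; level++)
		if (lut_level_is_safe(lut, level))
			return (int)level;
	return -1;
}

/* Hypothetical helper: number of fill levels flagged as safe. */
static inline unsigned int lut_safe_level_count(uint32_t lut)
{
	unsigned int level, count = 0;

	for (level = 0; level < QOS_LUT_LEVELS; level++)
		count += (unsigned int)lut_level_is_safe(lut, level);
	return count;
}
```

Under these assumptions, 0xfe00 flags fill levels 9 through 15 (seven of sixteen), while 0xffff flags every level.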
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
[...]
I ran some tests with a gigabit ethernet adapter to get an idea of how this performs in comparison to using lazy iommu mode ("non-strict"):
                6.6     6.6-lazy   6.6-dpu   6.6-dpu-lazy
iperf3 recv     114     941        941       941            MBit/s
iperf3 send     124     891        703       940            MBit/s

scp recv        14.6    110        110       111            MB/s
scp send        12.5    98.9       91.5      110            MB/s
This patch in itself indeed improves things quite a bit, but there is still some performance that can be gained by using lazy iommu mode.
Notably, lazy mode with this patch applied appears to saturate the link in both directions.
Tested-by: Johan Hovold johan+linaro@kernel.org
Johan
On Tue, Oct 31, 2023 at 5:35 AM Johan Hovold johan@kernel.org wrote:
[...]
Notably, lazy mode with this patch applied appears to saturate the link in both directions.
Maybe there is still room for SoC-specific udev rules so that DMA masters without firmware can be configured as "lazy", i.e. like:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/refs/...
BR, -R
On Mon, Oct 30, 2023 at 6:23 PM Bjorn Andersson quic_bjorande@quicinc.com wrote:
[...]
Tested-by: Steev Klimaszewski steev@kali.org
On 10/30/2023 4:23 PM, Bjorn Andersson wrote:
[...]
Matches what we have in downstream DT, hence
Reviewed-by: Abhinav Kumar quic_abhinavk@quicinc.com
On Mon, 30 Oct 2023 16:23:20 -0700, Bjorn Andersson wrote:
[...]
Applied, thanks!
[1/1] drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog
      https://gitlab.freedesktop.org/drm/msm/-/commit/a33b2431d11b
Best regards,
linux-stable-mirror@lists.linaro.org