From: Long Li <longli(a)microsoft.com>
It's inefficient to ring the doorbell page every time a WQE is posted to
the receive queue.
Move the code for ringing the doorbell page so that it runs once, after
all WQEs have been posted to the receive queue during a callback from
napi_poll().
Tests showed no regression in network latency benchmarks.
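As a rough userspace illustration of the idea (not the driver code), the
sketch below counts doorbell writes for both schemes. The sketch_rq
structure and every sketch_* name are hypothetical stand-ins and do not
exist in the MANA driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a receive queue; not a MANA structure. */
struct sketch_rq {
	unsigned int head;            /* WQEs posted so far */
	unsigned int doorbell_head;   /* head value last published to the device */
	unsigned int doorbell_writes; /* number of doorbell writes issued */
};

/* Post one WQE without notifying the device. */
static void sketch_post_wqe(struct sketch_rq *rq)
{
	assert(rq != NULL);
	rq->head++;
}

/* Publish the current head to the device: one doorbell write. */
static void sketch_ring_doorbell(struct sketch_rq *rq)
{
	assert(rq != NULL);
	rq->doorbell_head = rq->head;
	rq->doorbell_writes++;
}

/* Old scheme: ring the doorbell after every posted WQE. */
static unsigned int sketch_refill_ring_each(struct sketch_rq *rq,
					    unsigned int n)
{
	for (unsigned int i = 0; i < n; i++) {
		sketch_post_wqe(rq);
		sketch_ring_doorbell(rq);
	}
	return rq->doorbell_writes;
}

/* New scheme: post everything, then ring once if anything was posted. */
static unsigned int sketch_refill_ring_once(struct sketch_rq *rq,
					    unsigned int n)
{
	for (unsigned int i = 0; i < n; i++)
		sketch_post_wqe(rq);
	if (n)
		sketch_ring_doorbell(rq);
	return rq->doorbell_writes;
}
```

Both variants leave the device with the same final head value; only the
number of doorbell writes differs.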
Cc: stable(a)vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Long Li <longli(a)microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index cd4d5ceb9f2d..ef1f0ce8e44d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq *rxq)
recv_buf_oob = &rxq->rx_oobs[curr_index];
- err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob->wqe_req,
- &recv_buf_oob->wqe_inf);
+ err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob->wqe_req,
+ &recv_buf_oob->wqe_inf);
if (WARN_ON_ONCE(err))
return;
@@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
mana_process_rx_cqe(rxq, cq, &comp[i]);
}
+ if (comp_read) {
+ struct gdma_context *gc = rxq->gdma_rq->gdma_dev->gdma_context;
+
+ mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
+ }
+
if (rxq->xdp_flush)
xdp_do_flush();
}
--
2.34.1
Hi Sebastian,
On 18.04.23 14:26, Sebastian Andrzej Siewior wrote:
> [...]. The timer still fires
> every 4ms with HZ=250 but timer is no longer aligned with
> CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
> of the timestamp. The offset differs with every boot and makes it
> impossible for user land to align with the tick.
I can observe these per-boot offsets too, but...
> Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a
> multiple of 1000/CONFIG_HZ ms.
this change doesn't seem to achieve that goal, unfortunately. Quite the
opposite. It makes the (boot) clock run faster and, because of the per-
boot different offset, differently fast for each boot. Up to the point
where it's running too fast to make any progress at all.
This patch causes VM boot hangs for us. It took a while to identify as
the boot hangs occurred in only ~1 out of 30 boots, but this commit is
clearly the cause. Reverting it got me 100 boots in a row without any
issue.
Instrumenting the kernel a little gave me a clue what the bug is. When
switching from the boot timer tick device (which is 'hpet' in my setup)
to 'lapic-deadline', the mode of the timer isn't changed and kept at
TICKDEV_MODE_PERIODIC. As that device doesn't support this mode,
tick_setup_periodic() will switch over to CLOCK_EVT_STATE_ONESHOT mode
and program the next expire event based on tick_next_period.
clockevents_program_event() will calculate the delta of that timestamp
and ktime_get() and pass that value on to dev->set_next_event() (which
is lapic_next_deadline()) which will write it to the IA32_TSC_DEADLINE
MSR.
That delta value -- which is still the per-boot different offset to
ktime_get() your patch introduces -- now gets stuck and is taken as the
new *jiffies tick time*. That's because tick_handle_periodic() ->
tick_periodic() will advance tick_next_period by TICK_NSEC, make
do_timer() increment jiffies_64 by one and then program the next event
to be in TICK_NSEC ns based on the device's old expiry time, i.e. keep
the offset intact. This is followed by re-arming the event by a call to
clockevents_program_event() which does the already-known delta
calculation and writes, again, the too-small value to
IA32_TSC_DEADLINE.
This effectively makes the jiffies based clock go too fast as the timer
IRQ comes too early (less than TICK_NSEC ns). Sometimes it's barely
noticeable, but sometimes it's so fast that the kernel is overloaded
with only handling the local timer IRQ without making any further
progress, especially in (nested) VM setups.
Without commit e9523a0d8189 ("tick/common: Align tick period with the
HZ tick."), which was backported to many stable and LTS kernels (v6.3.2
(571c3b46c9b3), v6.2.15 (f0cb827199ec), v6.1.28 (290e26ec0d01),
v5.15.111 (a55050c7989c), v5.10.180 (c4013689269d) and v5.4.243
(a3e7a3d472c2)) this clock drift is gone and my VMs boot again.
Before that commit, the delta between tick_next_period and ktime_get()
was initially zero, so tick_handle_periodic() had to loop, as
clockevents_program_event() will return with -ETIME. The next attempt
would be done with a delta of TICK_NSEC which will make
clockevents_program_event() succeed and ensure that future events don't
need the additional loop iteration, as the delta got stuck at TICK_NSEC
-- exactly where it should be.
We observed the bug first on the v6.3, v6.1 and v5.15 stable branch
updates from May 11th and then, a week later, on v5.4 too. All first
occurrences were coinciding with the bad commit going into the
corresponding stable and LTS kernel releases.
The issue manifests itself as a fast running clock only during boot,
when the clock source is still jiffies based. That'll eventually lead
to a boot hang as the timer IRQs are firing too fast.
To reproduce this you can either boot loop a VM and try to get "lucky"
to hit a big enough 'rem' value or just apply this little diff instead:
---8<---
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 65b8658da829..b01cf18a5d42 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -225,6 +225,7 @@ static void tick_setup_device(struct tick_device *td,
next_p = ktime_get();
div_u64_rem(next_p, TICK_NSEC, &rem);
+ rem = TICK_NSEC - 123;
if (rem) {
next_p -= rem;
next_p += TICK_NSEC;
--->8---
This should make the kernel get stuck with only handling timer ticks
but not making any further progress.
Change the subtrahend to 1234 to get a system that boots but has an
unrealistically fast clock during kernel initialization.
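For illustration only, the alignment arithmetic can be sketched in plain
userspace C. This is not kernel code: the sketch_* names are made up and
SKETCH_TICK_NSEC assumes HZ=250 (4000000 ns per tick). It merely shows
that a remainder of TICK_NSEC - 123 leaves a first delta of 123 ns
instead of a full tick.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: HZ=250, so one tick is 4,000,000 ns. */
#define SKETCH_TICK_NSEC 4000000ULL

/*
 * Round now_ns up to the next multiple of tick_nsec, mirroring the
 * "next_p -= rem; next_p += TICK_NSEC" step for a non-zero remainder.
 */
static uint64_t sketch_align_next_period(uint64_t now_ns, uint64_t tick_nsec)
{
	uint64_t rem;

	assert(tick_nsec > 0);
	rem = now_ns % tick_nsec;
	if (rem)
		now_ns += tick_nsec - rem;
	return now_ns;
}

/* Distance from "now" to the first aligned expiry. */
static uint64_t sketch_first_delta(uint64_t now_ns, uint64_t tick_nsec)
{
	return sketch_align_next_period(now_ns, tick_nsec) - now_ns;
}
```

A timestamp that sits 123 ns before a tick boundary yields a delta of
123 ns, one that sits 1234 ns before it yields 1234 ns, and an already
aligned timestamp yields 0.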
As reverting that commit fixes the issue for us but it seemingly fixes
another bug for Klaus (or at least attempted to), the now uncovered bug
should be fixed instead.
The fundamental issue is that the jiffies based clock source cannot be
trusted and shouldn't be used to calculate offsets to timestamps in the
future when tick_next_period mod ktime_get() != 0.
Can we defer the offset adjustment of tick_next_period to a later point
in time when a stable clock source gets used, like 'tsc'?
Thanks,
Mathias
The Processor _PDC buffer bits notify ACPI of the OS capabilities, so
that ACPI can adjust the return values of other Processor methods
taking the OS capabilities into account.
When Linux is running as a Xen dom0, the hypervisor is the entity in
charge of processor power management, and hence Xen needs to make
sure the capabilities reported in the _PDC buffer match the
capabilities of the driver in Xen.
Introduce a small helper to sanitize the buffer when running as Xen
dom0.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/include/asm/xen/hypervisor.h | 2 ++
arch/x86/xen/enlighten.c | 17 +++++++++++++++++
drivers/acpi/processor_pdc.c | 8 ++++++++
3 files changed, 27 insertions(+)
diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
index b9f512138043..b4ed90ef5e68 100644
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -63,12 +63,14 @@ void __init mem_map_via_hcall(struct boot_params *boot_params_p);
#ifdef CONFIG_XEN_DOM0
bool __init xen_processor_present(uint32_t acpi_id);
+void xen_sanitize_pdc(uint32_t *buf);
#else
static inline bool xen_processor_present(uint32_t acpi_id)
{
BUG();
return false;
}
+static inline void xen_sanitize_pdc(uint32_t *buf) { BUG(); }
#endif
#endif /* _ASM_X86_XEN_HYPERVISOR_H */
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d4c44361a26c..394dd6675113 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -372,4 +372,21 @@ bool __init xen_processor_present(uint32_t acpi_id)
return false;
}
+
+void xen_sanitize_pdc(uint32_t *buf)
+{
+ struct xen_platform_op op = {
+ .cmd = XENPF_set_processor_pminfo,
+ .interface_version = XENPF_INTERFACE_VERSION,
+ .u.set_pminfo.id = -1,
+ .u.set_pminfo.type = XEN_PM_PDC,
+ };
+ int ret;
+
+ set_xen_guest_handle(op.u.set_pminfo.pdc, buf);
+ ret = HYPERVISOR_platform_op(&op);
+ if (ret)
+ pr_info("sanitize of _PDC buffer bits from Xen failed: %d\n",
+ ret);
+}
#endif
diff --git a/drivers/acpi/processor_pdc.c b/drivers/acpi/processor_pdc.c
index 18fb04523f93..58f4c208517a 100644
--- a/drivers/acpi/processor_pdc.c
+++ b/drivers/acpi/processor_pdc.c
@@ -137,6 +137,14 @@ acpi_processor_eval_pdc(acpi_handle handle, struct acpi_object_list *pdc_in)
buffer[2] &= ~(ACPI_PDC_C_C2C3_FFH | ACPI_PDC_C_C1_FFH);
}
+ if (xen_initial_domain())
+ /*
+ * When Linux is running as Xen dom0 it's the hypervisor the
+ * entity in charge of the processor power management, and so
+ * Xen needs to check the OS capabilities reported in the _PDC
+ * buffer matches what the hypervisor driver supports.
+ */
+ xen_sanitize_pdc((uint32_t *)pdc_in->pointer->buffer.pointer);
status = acpi_evaluate_object(handle, "_PDC", pdc_in, NULL);
if (ACPI_FAILURE(status))
--
2.37.3
Fix the test for the AST2200 in the DRAM initialization. The value
in ast->chip has to be compared against an enum constant instead of
a numerical value.
This bug got introduced when the driver was first imported into the
kernel.
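A minimal standalone sketch of why the old test could never match.
The enum below is hypothetical (its members and order are an assumption,
not a copy of the driver's definition); the point is only that an enum
constant is a small ordinal, not the model number.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical chip enum; values and order are assumptions. */
enum sketch_ast_chip {
	SKETCH_AST2000,
	SKETCH_AST2100,
	SKETCH_AST1100,
	SKETCH_AST2200,
	SKETCH_AST_CHIP_COUNT
};

/* Buggy form: compares the enum value against the literal 2200. */
static bool sketch_uses_ast2100_table_buggy(enum sketch_ast_chip chip)
{
	return chip == SKETCH_AST2100 || (int)chip == 2200;
}

/* Fixed form: compares against the enum constant. */
static bool sketch_uses_ast2100_table_fixed(enum sketch_ast_chip chip)
{
	assert(chip < SKETCH_AST_CHIP_COUNT);
	return chip == SKETCH_AST2100 || chip == SKETCH_AST2200;
}
```

With the buggy comparison, an AST2200 falls through to the else branch
and gets the AST1100 table.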
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Fixes: 312fec1405dd ("drm: Initial KMS driver for AST (ASpeed Technologies) 2000 series (v2)")
Cc: Dave Airlie <airlied(a)redhat.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v3.5+
---
drivers/gpu/drm/ast/ast_post.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
index a005aec18a020..0262aaafdb1c5 100644
--- a/drivers/gpu/drm/ast/ast_post.c
+++ b/drivers/gpu/drm/ast/ast_post.c
@@ -291,7 +291,7 @@ static void ast_init_dram_reg(struct drm_device *dev)
;
} while (ast_read32(ast, 0x10100) != 0xa8);
} else {/* AST2100/1100 */
- if (ast->chip == AST2100 || ast->chip == 2200)
+ if (ast->chip == AST2100 || ast->chip == AST2200)
dram_reg_info = ast2100_dram_table_data;
else
dram_reg_info = ast1100_dram_table_data;
--
2.41.0
If the BO has been moved the PT should be updated, otherwise the VAs
might point to an invalid PT.
This fixes random GPU hangs when replacing sparse mappings from
userspace, while OP_MAP/OP_UNMAP works fine because always-valid BOs
are correctly handled there.
Cc: stable(a)vger.kernel.org
Signed-off-by: Samuel Pitoiset <samuel.pitoiset(a)gmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 143d11afe0e5..eff73c428b12 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1771,18 +1771,30 @@ int amdgpu_vm_bo_clear_mappings(struct amdgpu_device *adev,
/* Insert partial mapping before the range */
if (!list_empty(&before->list)) {
+ struct amdgpu_bo *bo = before->bo_va->base.bo;
+
amdgpu_vm_it_insert(before, &vm->va);
if (before->flags & AMDGPU_PTE_PRT)
amdgpu_vm_prt_get(adev);
+
+ if (bo && bo->tbo.base.resv == vm->root.bo->tbo.base.resv &&
+ !before->bo_va->base.moved)
+ amdgpu_vm_bo_moved(&before->bo_va->base);
} else {
kfree(before);
}
/* Insert partial mapping after the range */
if (!list_empty(&after->list)) {
+ struct amdgpu_bo *bo = after->bo_va->base.bo;
+
amdgpu_vm_it_insert(after, &vm->va);
if (after->flags & AMDGPU_PTE_PRT)
amdgpu_vm_prt_get(adev);
+
+ if (bo && bo->tbo.base.resv == vm->root.bo->tbo.base.resv &&
+ !after->bo_va->base.moved)
+ amdgpu_vm_bo_moved(&after->bo_va->base);
} else {
kfree(after);
}
--
2.41.0
In an ACPI-based dual-bridge system, the IRQ of each bridge's
PCH PIC sent to the CPU is always a zero-based number, which
means that the IRQ on the PCH PIC of each bridge is mapped into
the vector range from 0 to 63 of the upstream irqchip (e.g. EIOINTC).
    EIOINTC N: [0 ... 63 | 64 ... 255]
                --------   ----------
                    ^          ^
                    |          |
                PCH PIC N      |
                           PCH MSI N
For example, the IRQ vector number of the SATA controller on the
PCH PIC of each bridge is 16. It is sent to the upstream EIOINTC
irqchip when an interrupt occurs, which sets bit 16 of EIOINTC.
Since hwirq 16 on EIOINTC has been mapped to an irq_desc for the
SATA controller during hierarchy IRQ allocation, the related
mapped IRQ will be found through irq_resolve_mapping() in the IRQ
domain of EIOINTC.
So, the IRQ number set in the HT vector register should be fixed
to be a zero-based number.
Cc: stable(a)vger.kernel.org
Reviewed-by: Huacai Chen <chenhuacai(a)loongson.cn>
Co-developed-by: liuyun <liuyun(a)loongson.cn>
Signed-off-by: liuyun <liuyun(a)loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin(a)loongson.cn>
---
drivers/irqchip/irq-loongson-pch-pic.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/irqchip/irq-loongson-pch-pic.c b/drivers/irqchip/irq-loongson-pch-pic.c
index e5fe4d50be05..921c5c0190d1 100644
--- a/drivers/irqchip/irq-loongson-pch-pic.c
+++ b/drivers/irqchip/irq-loongson-pch-pic.c
@@ -401,14 +401,12 @@ static int __init acpi_cascade_irqdomain_init(void)
int __init pch_pic_acpi_init(struct irq_domain *parent,
struct acpi_madt_bio_pic *acpi_pchpic)
{
- int ret, vec_base;
+ int ret;
struct fwnode_handle *domain_handle;
if (find_pch_pic(acpi_pchpic->gsi_base) >= 0)
return 0;
- vec_base = acpi_pchpic->gsi_base - GSI_MIN_PCH_IRQ;
-
domain_handle = irq_domain_alloc_fwnode(&acpi_pchpic->address);
if (!domain_handle) {
pr_err("Unable to allocate domain handle\n");
@@ -416,7 +414,7 @@ int __init pch_pic_acpi_init(struct irq_domain *parent,
}
ret = pch_pic_init(acpi_pchpic->address, acpi_pchpic->size,
- vec_base, parent, domain_handle, acpi_pchpic->gsi_base);
+ 0, parent, domain_handle, acpi_pchpic->gsi_base);
if (ret < 0) {
irq_domain_free_fwnode(domain_handle);
--
2.31.1
From: Liu Peibao <liupeibao(a)loongson.cn>
In the DeviceTree path, when ht_vec_base is not zero, the hwirq of the
PCH PIC will be assigned incorrectly, because pch_pic_domain_translate()
adds ht_vec_base to hwirq, but the hwirq does not have ht_vec_base
subtracted again when calling irq_domain_set_info().
The ht_vec_base is designed for the parent irq chip/domain of the PCH PIC.
It does not seem proper to deal with this in callbacks of the PCH PIC
domain, so let's put this back like the initial commit ef8c01eb64ca
("irqchip: Add Loongson PCH PIC controller").
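A small userspace sketch of the intended split, assuming a made-up
sketch_pch_pic structure and sketch_* helpers (none of these names exist
in the driver): the hwirq in the PCH PIC domain stays as given by the
firmware, and the offset is applied only to the value handed to the
parent domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical private data; only the field relevant here. */
struct sketch_pch_pic {
	uint32_t ht_vec_base;
};

/* translate step: the PCH PIC hwirq is the firmware parameter as-is. */
static uint32_t sketch_translate(const struct sketch_pch_pic *priv,
				 uint32_t fw_param)
{
	(void)priv; /* the offset is deliberately not applied here */
	return fw_param;
}

/* alloc step: only the parent's hwirq carries the ht_vec_base offset. */
static uint32_t sketch_parent_hwirq(const struct sketch_pch_pic *priv,
				    uint32_t hwirq)
{
	assert(priv != NULL);
	return hwirq + priv->ht_vec_base;
}
```

With an example ht_vec_base of 64, firmware parameter 19 stays hwirq 19
in the PCH PIC domain and becomes 83 for the parent.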
Fixes: bcdd75c596c8 ("irqchip/loongson-pch-pic: Add ACPI init support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Huacai Chen <chenhuacai(a)loongson.cn>
Signed-off-by: Liu Peibao <liupeibao(a)loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin(a)loongson.cn>
---
drivers/irqchip/irq-loongson-pch-pic.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/irqchip/irq-loongson-pch-pic.c b/drivers/irqchip/irq-loongson-pch-pic.c
index 921c5c0190d1..93a71f66efeb 100644
--- a/drivers/irqchip/irq-loongson-pch-pic.c
+++ b/drivers/irqchip/irq-loongson-pch-pic.c
@@ -164,7 +164,7 @@ static int pch_pic_domain_translate(struct irq_domain *d,
if (fwspec->param_count < 2)
return -EINVAL;
- *hwirq = fwspec->param[0] + priv->ht_vec_base;
+ *hwirq = fwspec->param[0];
*type = fwspec->param[1] & IRQ_TYPE_SENSE_MASK;
} else {
if (fwspec->param_count < 1)
@@ -196,7 +196,7 @@ static int pch_pic_alloc(struct irq_domain *domain, unsigned int virq,
parent_fwspec.fwnode = domain->parent->fwnode;
parent_fwspec.param_count = 1;
- parent_fwspec.param[0] = hwirq;
+ parent_fwspec.param[0] = hwirq + priv->ht_vec_base;
err = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec);
if (err)
--
2.31.1