The PE Reset State "0" obtained from RTAS calls
ibm_read_slot_reset_[state|state2] indicates that
the Reset is deactivated and the PE is not in the MMIO
Stopped or DMA Stopped state.
With PE Reset State "0", MMIO and DMA are allowed for
the PE. The function pseries_eeh_get_state() currently does
not indicate that to the caller, so
drivers are unable to resume MMIO and DMA activity.
The patch fixes that by reflecting what is actually allowed.
Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Narayana Murty N <nnmlinux(a)linux.ibm.com>
---
Changelog:
V1:https://lore.kernel.org/all/20241107042027.338065-1-nnmlinux@linux.ibm.c…
--added Fixes tag for "powerpc/pseries: Cleanup on
pseries_eeh_get_state()".
V2:https://lore.kernel.org/stable/20241212075044.10563-1-nnmlinux%40linux.i…
--Updated the patch description to include it in the stable kernel tree.
---
arch/powerpc/platforms/pseries/eeh_pseries.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 1893f66371fa..b12ef382fec7 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -580,8 +580,10 @@ static int pseries_eeh_get_state(struct eeh_pe *pe, int *delay)
switch(rets[0]) {
case 0:
- result = EEH_STATE_MMIO_ACTIVE |
- EEH_STATE_DMA_ACTIVE;
+ result = EEH_STATE_MMIO_ACTIVE |
+ EEH_STATE_DMA_ACTIVE |
+ EEH_STATE_MMIO_ENABLED |
+ EEH_STATE_DMA_ENABLED;
break;
case 1:
result = EEH_STATE_RESET_ACTIVE |
--
2.47.1
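As a rough userspace illustration of why the extra bits matter (not kernel code: the DEMO_ names, the bit values and the resume check are assumptions made for this sketch), a caller that tests the ENABLED bits only proceeds with the post-patch result:

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's EEH_STATE_* definitions may differ. */
#define DEMO_MMIO_ACTIVE   (1 << 0)
#define DEMO_DMA_ACTIVE    (1 << 1)
#define DEMO_MMIO_ENABLED  (1 << 2)
#define DEMO_DMA_ENABLED   (1 << 3)

/* Result for PE Reset State 0 before the patch: only the ACTIVE bits. */
static int demo_state0_before(void)
{
	return DEMO_MMIO_ACTIVE | DEMO_DMA_ACTIVE;
}

/* Result for PE Reset State 0 after the patch: ACTIVE and ENABLED bits. */
static int demo_state0_after(void)
{
	return DEMO_MMIO_ACTIVE | DEMO_DMA_ACTIVE |
	       DEMO_MMIO_ENABLED | DEMO_DMA_ENABLED;
}

/* Hypothetical caller: resume only if both MMIO and DMA are reported enabled. */
static bool demo_can_resume(int state)
{
	int need = DEMO_MMIO_ENABLED | DEMO_DMA_ENABLED;

	return (state & need) == need;
}
```

With these definitions, demo_can_resume() rejects the pre-patch mask and accepts the post-patch one.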
This driver will soon be getting more features so show it some
refactoring love in the meantime. Switching to using a workqueue and
sleeping locks improves cryptsetup benchmark results for AES encryption.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
---
Bartosz Golaszewski (9):
crypto: qce - fix goto jump in error path
crypto: qce - unregister previously registered algos in error path
crypto: qce - remove unneeded call to icc_set_bw() in error path
crypto: qce - shrink code with devres clk helpers
crypto: qce - convert qce_dma_request() to use devres
crypto: qce - make qce_register_algs() a managed interface
crypto: qce - use __free() for a buffer that's always freed
crypto: qce - convert tasklet to workqueue
crypto: qce - switch to using a mutex
drivers/crypto/qce/core.c | 131 ++++++++++++++++------------------------------
drivers/crypto/qce/core.h | 9 ++--
drivers/crypto/qce/dma.c | 22 ++++----
drivers/crypto/qce/dma.h | 3 +-
drivers/crypto/qce/sha.c | 6 +--
5 files changed, 68 insertions(+), 103 deletions(-)
---
base-commit: f486c8aa16b8172f63bddc70116a0c897a7f3f02
change-id: 20241128-crypto-qce-refactor-ab58869eec34
Best regards,
--
Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
Qualcomm Kryo 300-series Gold cores appear to have a derivative of an
ARM Cortex A75 in them. Since A75 needs Spectre mitigation via
firmware then the Kryo 300-series Gold cores also should need Spectre
mitigation via firmware.
Unless devices with a Kryo 3XX gold core have a firmware that provides
ARM_SMCCC_ARCH_WORKAROUND_3 (which seems unlikely at the time this
patch is posted), this will make devices with these cores report that
they are vulnerable to Spectre BHB with no mitigation in place. This
patch will also cause them not to do a WARN splat at boot about an
unknown CPU ID and to stop trying to do a "loop" mitigation for these
cores which is (presumably) not reliable for them.
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
I don't really have any good way to test this patch but it seems
likely it's needed.
NOTE: presumably this patch won't actually do much on its own because
(I believe) it requires a firmware update (one adding
ARM_SMCCC_ARCH_WORKAROUND_3) to go with it.
Changes in v2:
- Rebased / reworded QCOM_KRYO_3XX_GOLD patch
arch/arm64/kernel/proton-pack.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index 3b179a1bf815..f8e0d87d9e2d 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -845,6 +845,7 @@ static const struct midr_range spectre_bhb_firmware_mitigated_list[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A73),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A75),
MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_2XX_GOLD),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_3XX_GOLD),
{},
};
--
2.47.1.613.gc27f4b7a9f-goog
Qualcomm Kryo 200-series Gold cores appear to have a derivative of an
ARM Cortex A73 in them. Since A73 needs Spectre mitigation via
firmware then the Kryo 200-series Gold cores also should need Spectre
mitigation via firmware.
Unless devices with a Kryo 2XX gold core have a firmware that provides
ARM_SMCCC_ARCH_WORKAROUND_3 (which seems unlikely at the time this
patch is posted), this will make devices with these cores report that
they are vulnerable to Spectre BHB with no mitigation in place. This
patch will also cause them not to do a WARN splat at boot about an
unknown CPU ID and to stop trying to do a "loop" mitigation for these
cores which is (presumably) not reliable for them.
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
I don't really have any good way to test this patch but it seems
likely it's needed. If nothing else the claim is that the Qualcomm
Kryo 280 CPU is vulnerable [1] but I don't see any mitigations in the
kernel for it.
NOTE: presumably this patch won't actually do much on its own because
(I believe) it requires a firmware update (one adding
ARM_SMCCC_ARCH_WORKAROUND_3) to go with it.
[1] https://spectreattack.com/spectre.pdf
Changes in v2:
- Rebased / reworded QCOM_KRYO_2XX_GOLD patch
arch/arm64/kernel/proton-pack.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index 04c3f0567999..3b179a1bf815 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -844,6 +844,7 @@ static unsigned long system_bhb_mitigations;
static const struct midr_range spectre_bhb_firmware_mitigated_list[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A73),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A75),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_2XX_GOLD),
{},
};
--
2.47.1.613.gc27f4b7a9f-goog
Qualcomm Kryo 400-series Gold cores appear to have a derivative of an
ARM Cortex A76 in them. Since A76 needs Spectre mitigation via looping
then the Kryo 400-series Gold cores also should need Spectre
mitigation via looping.
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
The "k" value here really should come from analysis by Qualcomm, but
until we can get that analysis let's choose the same value as A76: 24.
Ideally someone from Qualcomm can confirm that this mitigation is
needed and confirm / provide the proper "k" value.
...or do people think that this should go in the k32 list to be
safe? At least adding it to the list of CPUs we don't warn about seems
like a good idea since it seems very unlikely that it needs a FW
mitigation when the A76 it's based on doesn't.
...or should we just drop this until Qualcomm tells us the right "k"
value here?
Changes in v2:
- Slight change to wording and notes of KRYO_4XX_GOLD patch
arch/arm64/kernel/proton-pack.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index 012485b75019..04c3f0567999 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -887,6 +887,7 @@ u8 spectre_bhb_loop_affected(int scope)
MIDR_ALL_VERSIONS(MIDR_CORTEX_A76),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A77),
MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_4XX_GOLD),
{},
};
static const struct midr_range spectre_bhb_k11_list[] = {
--
2.47.1.613.gc27f4b7a9f-goog
The 2XX Silver cores appear to be based on ARM Cortex A53. The 3XX and 4XX
Silver cores appear to be based on ARM Cortex A55. Both of those cores appear
to be "safe" from a Spectre point of view. While it would be nice to
get confirmation from Qualcomm, it seems hard to believe that they
made big enough changes to these cores to affect the Spectre BHB
vulnerability status. Add them to the safe list.
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
Changes in v2:
- New
arch/arm64/kernel/proton-pack.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index 39c5573c7527..012485b75019 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -851,6 +851,9 @@ static const struct midr_range spectre_bhb_safe_list[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A35),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A53),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_2XX_SILVER),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_3XX_SILVER),
+ MIDR_ALL_VERSIONS(MIDR_QCOM_KRYO_4XX_SILVER),
{},
};
--
2.47.1.613.gc27f4b7a9f-goog
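Taken together, the four proton-pack.c patches above sort the Kryo cores into three lists. The following userspace sketch only illustrates that classification; the enum names, the lookup helper and the order of the checks are assumptions made for this example, and the kernel matches MIDR values rather than an enum:

```c
#include <stddef.h>

/* Hypothetical core identifiers; the kernel uses MIDR ranges instead. */
enum demo_core {
	DEMO_KRYO_2XX_GOLD, DEMO_KRYO_3XX_GOLD, DEMO_KRYO_4XX_GOLD,
	DEMO_KRYO_2XX_SILVER, DEMO_KRYO_3XX_SILVER, DEMO_KRYO_4XX_SILVER,
	DEMO_OTHER,
};

enum demo_bhb_class {
	DEMO_BHB_UNKNOWN, DEMO_BHB_SAFE, DEMO_BHB_LOOP, DEMO_BHB_FIRMWARE,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const enum demo_core demo_fw_list[] = {
	DEMO_KRYO_2XX_GOLD, DEMO_KRYO_3XX_GOLD,
};
static const enum demo_core demo_loop_k24_list[] = {
	DEMO_KRYO_4XX_GOLD,
};
static const enum demo_core demo_safe_list[] = {
	DEMO_KRYO_2XX_SILVER, DEMO_KRYO_3XX_SILVER, DEMO_KRYO_4XX_SILVER,
};

static int demo_in_list(enum demo_core c, const enum demo_core *list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i] == c)
			return 1;
	return 0;
}

/* Classify a core; *k receives the loop count (24 here) or 0 if not applicable. */
static enum demo_bhb_class demo_classify(enum demo_core c, int *k)
{
	*k = 0;
	if (demo_in_list(c, demo_fw_list, DEMO_ARRAY_SIZE(demo_fw_list)))
		return DEMO_BHB_FIRMWARE;
	if (demo_in_list(c, demo_loop_k24_list, DEMO_ARRAY_SIZE(demo_loop_k24_list))) {
		*k = 24;	/* same value as A76, pending confirmation */
		return DEMO_BHB_LOOP;
	}
	if (demo_in_list(c, demo_safe_list, DEMO_ARRAY_SIZE(demo_safe_list)))
		return DEMO_BHB_SAFE;
	return DEMO_BHB_UNKNOWN;
}
```

A core that appears in none of the lists falls through to DEMO_BHB_UNKNOWN, which corresponds to the "unknown CPU ID" case the commit messages mention.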
For interrupt-map entries, the DTS specification requires
that #address-cells is defined for both the child node and the
interrupt parent. For the PCIe interrupt-map entries, the parent
node ("gic") has not specified #address-cells. The existing layout
of the PCIe interrupt-map entries indicates that it assumes
that #address-cells is zero for this node.
Explicitly set #address-cells to zero for "gic" so that it complies
with the device tree specification.
NVIDIA EDK2 has been working around this by assuming #address-cells
is zero in this scenario, but that workaround is being removed and so
this update is needed or else NVIDIA EDK2 cannot successfully parse the
device tree and the board cannot boot.
Fixes: ec142c44b026 ("arm64: tegra: Add P2U and PCIe controller nodes to Tegra234 DT")
Signed-off-by: Brad Griffis <bgriffis(a)nvidia.com>
Cc: stable(a)vger.kernel.org
---
arch/arm64/boot/dts/nvidia/tegra234.dtsi | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/boot/dts/nvidia/tegra234.dtsi b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
index 984c85eab41a..e1c07c99e9bd 100644
--- a/arch/arm64/boot/dts/nvidia/tegra234.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
@@ -4010,6 +4010,7 @@ ccplex@e000000 {
gic: interrupt-controller@f400000 {
compatible = "arm,gic-v3";
+ #address-cells = <0>;
reg = <0x0 0x0f400000 0x0 0x010000>, /* GICD */
<0x0 0x0f440000 0x0 0x200000>; /* GICR */
interrupt-parent = <&gic>;
--
2.34.1
It appears that the relatively popular RK3399 SoC has been put together
using a large amount of illicit substances, as experiments reveal
that its integration of GIC500 exposes the *secure* programming
interface to non-secure.
This has some pretty bad effects on the way priorities are handled,
and results in a dead machine if booting with pseudo-NMI enabled
(irqchip.gicv3_pseudo_nmi=1) if the kernel contains 18fdb6348c480
("arm64: irqchip/gic-v3: Select priorities at boot time"), which
relies on the priorities being programmed using the NS view.
Let's restore some sanity by going one step further and disabling
security altogether in this case. This is not any worse, and
puts us in a mode where priorities actually make some sense.
Huge thanks to Mark Kettenis who initially identified this issue
on OpenBSD, and to Chen-Yu Tsai who reported the problem in
Linux.
Fixes: 18fdb6348c480 ("arm64: irqchip/gic-v3: Select priorities at boot time")
Reported-by: Mark Kettenis <mark.kettenis(a)xs4all.nl>
Reported-by: Chen-Yu Tsai <wenst(a)chromium.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
drivers/irqchip/irq-gic-v3.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 34db379d066a5..79d8cc80693c3 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -161,7 +161,22 @@ static bool cpus_have_group0 __ro_after_init;
static void __init gic_prio_init(void)
{
- cpus_have_security_disabled = gic_dist_security_disabled();
+ bool ds;
+
+ ds = gic_dist_security_disabled();
+ if (!ds) {
+ u32 val;
+
+ val = readl_relaxed(gic_data.dist_base + GICD_CTLR);
+ val |= GICD_CTLR_DS;
+ writel_relaxed(val, gic_data.dist_base + GICD_CTLR);
+
+ ds = gic_dist_security_disabled();
+ if (ds)
+ pr_warn("Broken GIC integration, security disabled");
+ }
+
+ cpus_have_security_disabled = ds;
cpus_have_group0 = gic_has_group0();
/*
--
2.39.2
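The probe in gic_prio_init() above can be modelled in userspace as "try to set the DS bit, then read it back". This is only a sketch: the register model, the DEMO_ names and the bit position are assumptions for illustration, and a well-behaved integration is modelled as one that ignores the write.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CTLR_DS (1u << 6)	/* assumed position of the DS bit */

/* Toy distributor: ds_writable models the broken integration. */
struct demo_dist {
	uint32_t ctlr;
	bool ds_writable;
};

static uint32_t demo_read(const struct demo_dist *d)
{
	return d->ctlr;
}

/* A write to DS only sticks when the model says the bit is writable. */
static void demo_write(struct demo_dist *d, uint32_t val)
{
	uint32_t mask = d->ds_writable ? ~0u : ~DEMO_CTLR_DS;

	d->ctlr = (d->ctlr & ~mask) | (val & mask);
}

/* Returns the final DS state; *forced is set when the write took effect. */
static bool demo_prio_init(struct demo_dist *d, bool *forced)
{
	bool ds = (demo_read(d) & DEMO_CTLR_DS) != 0;

	*forced = false;
	if (!ds) {
		demo_write(d, demo_read(d) | DEMO_CTLR_DS);
		ds = (demo_read(d) & DEMO_CTLR_DS) != 0;
		*forced = ds;	/* the patch warns in this case */
	}
	return ds;
}
```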
Instead of returning a generic NULL on error from
drm_dp_tunnel_mgr_create(), use error pointers with informative codes
to align the function with the stub that is executed when
CONFIG_DRM_DISPLAY_DP_TUNNEL is unset. This will also trigger IS_ERR()
in the current caller (intel_dp_tunnel_mgr_init()) instead of bypassing it
via a NULL pointer.
v2: use error codes inside drm_dp_tunnel_mgr_create() instead of handling
on caller's side (Michal, Imre)
v3: fixup commit message and add "CC"/"Fixes" lines (Andi),
mention aligning function code with stub
Fixes: 91888b5b1ad2 ("drm/i915/dp: Add support for DP tunnel BW allocation")
Cc: Imre Deak <imre.deak(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.9+
Signed-off-by: Krzysztof Karas <krzysztof.karas(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
---
drivers/gpu/drm/display/drm_dp_tunnel.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/display/drm_dp_tunnel.c b/drivers/gpu/drm/display/drm_dp_tunnel.c
index 48b2df120086..90fe07a89260 100644
--- a/drivers/gpu/drm/display/drm_dp_tunnel.c
+++ b/drivers/gpu/drm/display/drm_dp_tunnel.c
@@ -1896,8 +1896,8 @@ static void destroy_mgr(struct drm_dp_tunnel_mgr *mgr)
*
* Creates a DP tunnel manager for @dev.
*
- * Returns a pointer to the tunnel manager if created successfully or NULL in
- * case of an error.
+ * Returns a pointer to the tunnel manager if created successfully or error
+ * pointer in case of failure.
*/
struct drm_dp_tunnel_mgr *
drm_dp_tunnel_mgr_create(struct drm_device *dev, int max_group_count)
@@ -1907,7 +1907,7 @@ drm_dp_tunnel_mgr_create(struct drm_device *dev, int max_group_count)
mgr = kzalloc(sizeof(*mgr), GFP_KERNEL);
if (!mgr)
- return NULL;
+ return ERR_PTR(-ENOMEM);
mgr->dev = dev;
init_waitqueue_head(&mgr->bw_req_queue);
@@ -1916,7 +1916,7 @@ drm_dp_tunnel_mgr_create(struct drm_device *dev, int max_group_count)
if (!mgr->groups) {
kfree(mgr);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
#ifdef CONFIG_DRM_DISPLAY_DP_TUNNEL_STATE_DEBUG
@@ -1927,7 +1927,7 @@ drm_dp_tunnel_mgr_create(struct drm_device *dev, int max_group_count)
if (!init_group(mgr, &mgr->groups[i])) {
destroy_mgr(mgr);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
mgr->group_count++;
--
2.34.1
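To see why a NULL return slips past an IS_ERR() check, here is a small userspace imitation of the error-pointer convention. The demo_ helpers are stand-ins written for this sketch, not the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and demo_create() is a hypothetical constructor.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the last DEMO_MAX_ERRNO bytes of the address space. */
static inline bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_obj;

/* Hypothetical constructor: on failure returns NULL or an error pointer. */
static void *demo_create(bool fail, bool use_err_ptr)
{
	if (fail)
		return use_err_ptr ? demo_err_ptr(-ENOMEM) : NULL;
	return &demo_obj;
}
```

A caller that only tests demo_is_err() treats the NULL failure as success, which mirrors the bypass described in the commit message.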
From: Steven Rostedt <rostedt(a)goodmis.org>
A bug was discovered where the idle shadow stacks were not initialized
for offline CPUs when starting function graph tracer, and when they came
online they were not traced due to the missing shadow stack. To fix
this, the idle task shadow stack initialization was moved to using the
CPU hotplug callbacks. But it removed the initialization when the
function graph was enabled. The problem here is that the hotplug
callbacks are called when the CPUs come online, but the idle shadow
stack initialization only happens if function graph is currently
active. This caused the online CPUs to not get their shadow stack
initialized.
The idle shadow stack initialization still needs to be done when the
function graph is registered, as they will not be allocated if function
graph is not registered.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes: 2c02f7375e65 ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: Linus Walleij <linus.walleij(a)linaro.org>
Tested-by: Linus Walleij <linus.walleij(a)linaro.org>
Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgG…
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/fgraph.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0bf78517b5d4..ddedcb50917f 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1215,7 +1215,7 @@ void fgraph_update_pid_func(void)
static int start_graph_tracing(void)
{
unsigned long **ret_stack_list;
- int ret;
+ int ret, cpu;
ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
sizeof(*ret_stack_list), GFP_KERNEL);
@@ -1223,6 +1223,12 @@ static int start_graph_tracing(void)
if (!ret_stack_list)
return -ENOMEM;
+ /* The cpu_boot init_task->ret_stack will never be freed */
+ for_each_online_cpu(cpu) {
+ if (!idle_task(cpu)->ret_stack)
+ ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+ }
+
do {
ret = alloc_retstack_tasklist(ret_stack_list);
} while (ret == -EAGAIN);
--
2.45.2
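The loop added to start_graph_tracing() follows an "initialize only what is online and still missing" pattern. A minimal userspace sketch of that pattern, with a made-up CPU table and statically allocated stacks standing in for the real allocation (all DEMO_/demo_ names are assumptions):

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4
#define DEMO_STACK_WORDS 16

struct demo_cpu {
	bool online;
	unsigned long *ret_stack;	/* NULL until initialized */
};

static unsigned long demo_stacks[DEMO_NR_CPUS][DEMO_STACK_WORDS];

/*
 * Give every online CPU that lacks a shadow stack one.
 * Returns how many stacks were set up by this call.
 */
static int demo_init_online_idle_stacks(struct demo_cpu cpus[DEMO_NR_CPUS])
{
	int cpu, count = 0;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (!cpus[cpu].online || cpus[cpu].ret_stack)
			continue;
		cpus[cpu].ret_stack = demo_stacks[cpu];
		count++;
	}
	return count;
}
```

Calling the function a second time does nothing, which is why it is safe to run on every registration.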
From: Steven Rostedt <rostedt(a)goodmis.org>
By default, "%p" in the trace output hashes the pointer. An option
was added to disable the hashing as reading trace output is a privileged
operation (just like reading kallsyms). When hashing is disabled, the
iter->fmt temp buffer is used to add "x" to "%p" into "%px" before sending
to the vsnprintf() functions.
The problem with using iter->fmt is that the trace_check_vprintf() that
makes sure that trace events "%pX" pointers are not dereferencing freed
addresses (and prints a warning if it does) also uses the iter->fmt to
save to and use to print out for the trace file. When the hash_ptr option
is disabled, the "%px" version is added to the iter->fmt buffer, and that
then is passed to the trace_check_vprintf() function that then uses the
iter->fmt as a temp buffer. Obviously this caused bad results.
This was noticed when backporting the persistent ring buffer to 5.10 and
adding this code without the option being disabled by default, so it failed
one of the selftests because the sched_wakeup was missing the "comm"
field:
cat-907 [006] dN.4. 249.722403: sched_wakeup: comm= pid=74 prio=120 target_cpu=006
Instead of showing:
<idle>-0 [004] dNs6. 49.076464: sched_wakeup: comm=sshd-session pid=896 prio=120 target_cpu=0040
To fix this, change trace_check_vprintf() to modify the iter->fmt instead
of copying to it. If the fmt passed in is not the iter->fmt, first copy
the entire fmt string to iter->fmt and then iterate the iter->fmt. When
the format needs to be processed, perform actions like the following:
save_ch = p[i];
p[i] = '\0';
trace_seq_printf(&iter->seq, p, str);
p[i] = save_ch;
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Link: https://lore.kernel.org/20241212105426.113f2be3@batman.local.home
Fixes: efbbdaa22bb78 ("tracing: Show real address for trace event arguments")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 90 +++++++++++++++++++++++++++-----------------
1 file changed, 55 insertions(+), 35 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..b44b1cdaa20e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3711,8 +3711,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
{
long text_delta = 0;
long data_delta = 0;
- const char *p = fmt;
const char *str;
+ char save_ch;
+ char *buf = NULL;
+ char *p;
bool good;
int i, j;
@@ -3720,7 +3722,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
return;
if (static_branch_unlikely(&trace_no_verify))
- goto print;
+ goto print_fmt;
/*
* When the kernel is booted with the tp_printk command line
@@ -3735,8 +3737,21 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
/* Don't bother checking when doing a ftrace_dump() */
if (iter->fmt == static_fmt_buf)
- goto print;
+ goto print_fmt;
+ if (fmt != iter->fmt) {
+ int len = strlen(fmt);
+ while (iter->fmt_size < len + 1) {
+ /*
+ * If we can't expand the copy buffer,
+ * just print it.
+ */
+ if (!trace_iter_expand_format(iter))
+ goto print_fmt;
+ }
+ strscpy(iter->fmt, fmt, iter->fmt_size);
+ }
+ p = iter->fmt;
while (*p) {
bool star = false;
int len = 0;
@@ -3748,14 +3763,6 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* as well as %p[sS] if delta is non-zero
*/
for (i = 0; p[i]; i++) {
- if (i + 1 >= iter->fmt_size) {
- /*
- * If we can't expand the copy buffer,
- * just print it.
- */
- if (!trace_iter_expand_format(iter))
- goto print;
- }
if (p[i] == '\\' && p[i+1]) {
i++;
@@ -3788,10 +3795,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (!p[i])
break;
- /* Copy up to the %s, and print that */
- strncpy(iter->fmt, p, i);
- iter->fmt[i] = '\0';
- trace_seq_vprintf(&iter->seq, iter->fmt, ap);
+ /* Print up to the %s */
+ save_ch = p[i];
+ p[i] = '\0';
+ trace_seq_vprintf(&iter->seq, p, ap);
+ p[i] = save_ch;
/* Add delta to %pS pointers */
if (p[i+1] == 'p') {
@@ -3837,6 +3845,8 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
good = trace_safe_str(iter, str, star, len);
}
+ p += i;
+
/*
* If you hit this warning, it is likely that the
* trace event in question used %s on a string that
@@ -3849,41 +3859,51 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
+#define TEMP_BUFSIZ 1024
+
+ if (!buf) {
+ buf = kmalloc(TEMP_BUFSIZ, GFP_KERNEL);
+ if (!buf) {
+ /* Need buffer to read address */
+ trace_seq_printf(&iter->seq, "(0x%px)[UNSAFE-MEMORY]", str);
+ p += j + 1;
+ goto print;
+ }
+ }
+ if (len >= TEMP_BUFSIZ)
+ len = TEMP_BUFSIZ - 1;
/* Try to safely read the string */
if (star) {
- if (len + 1 > iter->fmt_size)
- len = iter->fmt_size - 1;
- if (len < 0)
- len = 0;
- ret = copy_from_kernel_nofault(iter->fmt, str, len);
- iter->fmt[len] = 0;
- star = false;
+ ret = copy_from_kernel_nofault(buf, str, len);
+ buf[len] = 0;
} else {
- ret = strncpy_from_kernel_nofault(iter->fmt, str,
- iter->fmt_size);
+ ret = strncpy_from_kernel_nofault(buf, str, TEMP_BUFSIZ);
}
if (ret < 0)
trace_seq_printf(&iter->seq, "(0x%px)", str);
else
- trace_seq_printf(&iter->seq, "(0x%px:%s)",
- str, iter->fmt);
- str = "[UNSAFE-MEMORY]";
- strcpy(iter->fmt, "%s");
+ trace_seq_printf(&iter->seq, "(0x%px:%s)", str, buf);
+ trace_seq_puts(&iter->seq, "[UNSAFE-MEMORY]");
} else {
- strncpy(iter->fmt, p + i, j + 1);
- iter->fmt[j+1] = '\0';
+ save_ch = p[j + 1];
+ p[j + 1] = '\0';
+ if (star)
+ trace_seq_printf(&iter->seq, p, len, str);
+ else
+ trace_seq_printf(&iter->seq, p, str);
+ p[j + 1] = save_ch;
}
- if (star)
- trace_seq_printf(&iter->seq, iter->fmt, len, str);
- else
- trace_seq_printf(&iter->seq, iter->fmt, str);
- p += i + j + 1;
+ p += j + 1;
}
print:
if (*p)
trace_seq_vprintf(&iter->seq, p, ap);
+ kfree(buf);
+ return;
+ print_fmt:
+ trace_seq_vprintf(&iter->seq, fmt, ap);
}
const char *trace_event_format(struct trace_iterator *iter, const char *fmt)
--
2.45.2
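The save/restore trick from the commit message can be tried in userspace. This sketch uses snprintf() in place of the trace_seq helpers, and demo_print_prefix() is a hypothetical helper written for illustration: it prints only the first n bytes of a writable format string by temporarily terminating it in place, then restores the saved character.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format only the first n bytes of the writable format string fmt.
 * The byte at fmt[n] is saved, replaced by '\0' for the call, and
 * restored afterwards, so no second copy of the format is needed.
 * The prefix is expected to contain a single %s conversion.
 */
static int demo_print_prefix(char *fmt, size_t n, char *out, size_t outsz,
			     const char *str)
{
	char save_ch;
	int ret;

	assert(n <= strlen(fmt));

	save_ch = fmt[n];
	fmt[n] = '\0';
	ret = snprintf(out, outsz, fmt, str);
	fmt[n] = save_ch;

	return ret;
}
```

The format buffer must be writable (an array, not a string literal), which matches the patch's step of first copying fmt into iter->fmt.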
From: Ian Ray <ian.ray(a)gehealthcare.com>
[ Upstream commit bfc6444b57dc7186b6acc964705d7516cbaf3904 ]
Ensure that `i2c_lock' is held when setting interrupt latch and mask in
pca953x_irq_bus_sync_unlock() in order to avoid races.
The other (non-probe) call site pca953x_gpio_set_multiple() ensures the
lock is held before calling pca953x_write_regs().
The problem occurred when a request raced against irq_bus_sync_unlock()
approximately once per thousand reboots on an i.MX8MP based system.
* Normal case
0-0022: write register AI|3a {03,02,00,00,01} Input latch P0
0-0022: write register AI|49 {fc,fd,ff,ff,fe} Interrupt mask P0
0-0022: write register AI|08 {ff,00,00,00,00} Output P3
0-0022: write register AI|12 {fc,00,00,00,00} Config P3
* Race case
0-0022: write register AI|08 {ff,00,00,00,00} Output P3
0-0022: write register AI|08 {03,02,00,00,01} *** Wrong register ***
0-0022: write register AI|12 {fc,00,00,00,00} Config P3
0-0022: write register AI|49 {fc,fd,ff,ff,fe} Interrupt mask P0
Signed-off-by: Ian Ray <ian.ray(a)gehealthcare.com>
Link: https://lore.kernel.org/r/20240620042915.2173-1-ian.ray@gehealthcare.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
Signed-off-by: Guocai He <guocai.he.cn(a)windriver.com>
---
This commit fixes CVE-2024-42253. Please merge it into linux-5.15.y.
drivers/gpio/gpio-pca953x.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpio/gpio-pca953x.c b/drivers/gpio/gpio-pca953x.c
index 4860bf3b7e00..4e97b6ae4f72 100644
--- a/drivers/gpio/gpio-pca953x.c
+++ b/drivers/gpio/gpio-pca953x.c
@@ -672,6 +672,8 @@ static void pca953x_irq_bus_sync_unlock(struct irq_data *d)
int level;
if (chip->driver_data & PCA_PCAL) {
+ guard(mutex)(&chip->i2c_lock);
+
/* Enable latch on interrupt-enabled inputs */
pca953x_write_regs(chip, PCAL953X_IN_LATCH, chip->irq_mask);
--
2.34.1
From: Baokun Li <libaokun1(a)huawei.com>
[ Upstream commit b4b4fda34e535756f9e774fb2d09c4537b7dfd1c ]
In the following concurrency we will access the uninitialized rs->lock:
ext4_fill_super
ext4_register_sysfs
// sysfs registered msg_ratelimit_interval_ms
// Other processes modify rs->interval to
// non-zero via msg_ratelimit_interval_ms
ext4_orphan_cleanup
ext4_msg(sb, KERN_INFO, "Errors on filesystem, "
__ext4_msg
___ratelimit(&(EXT4_SB(sb)->s_msg_ratelimit_state)
if (!rs->interval) // do nothing if interval is 0
return 1;
raw_spin_trylock_irqsave(&rs->lock, flags)
raw_spin_trylock(lock)
_raw_spin_trylock
__raw_spin_trylock
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_)
lock_acquire
__lock_acquire
register_lock_class
assign_lock_key
dump_stack();
ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10);
raw_spin_lock_init(&rs->lock);
// init rs->lock here
and get the following dump_stack:
=========================================================
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 12 PID: 753 Comm: mount Tainted: G E 6.7.0-rc6-next-20231222 #504
[...]
Call Trace:
dump_stack_lvl+0xc5/0x170
dump_stack+0x18/0x30
register_lock_class+0x740/0x7c0
__lock_acquire+0x69/0x13a0
lock_acquire+0x120/0x450
_raw_spin_trylock+0x98/0xd0
___ratelimit+0xf6/0x220
__ext4_msg+0x7f/0x160 [ext4]
ext4_orphan_cleanup+0x665/0x740 [ext4]
__ext4_fill_super+0x21ea/0x2b10 [ext4]
ext4_fill_super+0x14d/0x360 [ext4]
[...]
=========================================================
Normally interval is 0 until s_msg_ratelimit_state is initialized, so
___ratelimit() does nothing. But registering sysfs precedes initializing
rs->lock, so it is possible to change rs->interval to a non-zero value
via the msg_ratelimit_interval_ms interface of sysfs while rs->lock is
uninitialized, and then a call to ext4_msg triggers the problem by
accessing an uninitialized rs->lock. Therefore register sysfs after all
initializations are complete to avoid such problems.
Signed-off-by: Baokun Li <libaokun1(a)huawei.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Link: https://lore.kernel.org/r/20240102133730.1098120-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
[ Resolve merge conflicts ]
Signed-off-by: Bin Lan <bin.lan.cn(a)windriver.com>
---
fs/ext4/super.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 987d49e18dbe..e6f08ee9895f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5506,19 +5506,15 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
if (err)
goto failed_mount6;
- err = ext4_register_sysfs(sb);
- if (err)
- goto failed_mount7;
-
err = ext4_init_orphan_info(sb);
if (err)
- goto failed_mount8;
+ goto failed_mount7;
#ifdef CONFIG_QUOTA
/* Enable quota usage during mount. */
if (ext4_has_feature_quota(sb) && !sb_rdonly(sb)) {
err = ext4_enable_quotas(sb);
if (err)
- goto failed_mount9;
+ goto failed_mount8;
}
#endif /* CONFIG_QUOTA */
@@ -5545,7 +5541,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
ext4_msg(sb, KERN_INFO, "recovery complete");
err = ext4_mark_recovery_complete(sb, es);
if (err)
- goto failed_mount10;
+ goto failed_mount9;
}
if (test_opt(sb, DISCARD) && !bdev_max_discard_sectors(sb->s_bdev))
@@ -5562,15 +5558,17 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
atomic_set(&sbi->s_warning_count, 0);
atomic_set(&sbi->s_msg_count, 0);
+ /* Register sysfs after all initializations are complete. */
+ err = ext4_register_sysfs(sb);
+ if (err)
+ goto failed_mount9;
+
return 0;
-failed_mount10:
+failed_mount9:
ext4_quota_off_umount(sb);
-failed_mount9: __maybe_unused
+failed_mount8: __maybe_unused
ext4_release_orphan_info(sb);
-failed_mount8:
- ext4_unregister_sysfs(sb);
- kobject_put(&sbi->s_kobj);
failed_mount7:
ext4_unregister_li_request(sb);
failed_mount6:
--
2.43.0
From: Juntong Deng <juntong.deng(a)outlook.com>
commit bdcb8aa434c6d36b5c215d02a9ef07551be25a37 upstream.
In gfs2_put_super(), whether withdrawn or not, the quota should
be cleaned up by gfs2_quota_cleanup().
Otherwise, struct gfs2_sbd will be freed before gfs2_qd_dealloc (rcu
callback) has run for all gfs2_quota_data objects, resulting in
use-after-free.
Also, gfs2_destroy_threads() and gfs2_quota_cleanup() are already called
by gfs2_make_fs_ro(), so in gfs2_put_super(), after calling
gfs2_make_fs_ro(), there is no need to call them again.
Reported-by: syzbot+29c47e9e51895928698c(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=29c47e9e51895928698c
Signed-off-by: Juntong Deng <juntong.deng(a)outlook.com>
Signed-off-by: Andreas Gruenbacher <agruenba(a)redhat.com>
Signed-off-by: Guocai He <guocai.he.cn(a)windriver.com>
---
Changes in v2:
Correct the upstream commit id.
This commit is to solve the CVE-2024-52760.
Please merge this commit to linux-5.15.y.
---
fs/gfs2/super.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 268651ac9fc8..98158559893f 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -590,6 +590,8 @@ static void gfs2_put_super(struct super_block *sb)
if (!sb_rdonly(sb)) {
gfs2_make_fs_ro(sdp);
+ } else {
+ gfs2_quota_cleanup(sdp);
}
WARN_ON(gfs2_withdrawing(sdp));
--
2.34.1
From: Tomas Glozar <tglozar(a)redhat.com>
commit 76b3102148135945b013797fac9b206273f0f777 upstream.
Apply the same fix as in the previous commit to timerlat-hist as well.
Link: https://lore.kernel.org/20241011121015.2868751-2-tglozar@redhat.com
Reported-by: Attila Fazekas <afazekas(a)redhat.com>
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
[ Drop hunk fixing printf in timerlat_print_stats_all since that is not
in 6.6 ]
Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
---
tools/tracing/rtla/src/timerlat_hist.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 1c8ecd4ebcbd..667f12f2d67f 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -58,9 +58,9 @@ struct timerlat_hist_cpu {
int *thread;
int *user;
- int irq_count;
- int thread_count;
- int user_count;
+ unsigned long long irq_count;
+ unsigned long long thread_count;
+ unsigned long long user_count;
unsigned long long min_irq;
unsigned long long sum_irq;
@@ -300,15 +300,15 @@ timerlat_print_summary(struct timerlat_hist_params *params,
continue;
if (!params->no_irq)
- trace_seq_printf(trace->seq, "%9d ",
+ trace_seq_printf(trace->seq, "%9llu ",
data->hist[cpu].irq_count);
if (!params->no_thread)
- trace_seq_printf(trace->seq, "%9d ",
+ trace_seq_printf(trace->seq, "%9llu ",
data->hist[cpu].thread_count);
if (params->user_hist)
- trace_seq_printf(trace->seq, "%9d ",
+ trace_seq_printf(trace->seq, "%9llu ",
data->hist[cpu].user_count);
}
trace_seq_printf(trace->seq, "\n");
--
2.47.1
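The hunks above widen the per-CPU counters to unsigned long long and switch the matching format specifiers from "%9d " to "%9llu ". The commit message only says it repeats the fix from the previous commit, which is not shown here, so the motivation assumed below (counts that no longer fit in an int) is a guess. A minimal sketch of the formatting side, where format_count() is a hypothetical helper and not part of rtla:

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of rtla: format one counter the way the
 * patched summary line does, i.e. right-aligned in 9 columns followed by
 * a space. Returns what snprintf() returns.
 */
static int format_count(char *buf, size_t len, unsigned long long count)
{
	/* %llu matches unsigned long long; %d would be a type mismatch. */
	return snprintf(buf, len, "%9llu ", count);
}
```

A value wider than the 9-column field is printed in full rather than truncated, so a very large count only shifts the column alignment.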
The DWC Databook description for the LWR_TARGET_RW and LWR_TARGET_HW fields
in the IATU_LWR_TARGET_ADDR_OFF_INBOUND_i registers states that:
"Field size depends on log2(BAR_MASK+1) in BAR match mode."
I.e. only the upper bits are writable, and the number of writable bits is
dependent on the configured BAR_MASK.
If we do not write the BAR_MASK before writing the iATU registers, we are
relying on the reset value of the BAR_MASK being larger than the requested
size of the first set_bar() call. The reset value of the BAR_MASK is SoC
dependent.
Thus, if the first set_bar() call requests a size that is larger than the
reset value of the BAR_MASK, the iATU will try to write to read-only bits,
which will cause the iATU to end up redirecting to a physical address that
is different from the address that was intended.
Thus, we should always write the iATU registers after writing the BAR_MASK.
Cc: stable(a)vger.kernel.org
Fixes: f8aed6ec624f ("PCI: dwc: designware: Add EP mode support")
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
.../pci/controller/dwc/pcie-designware-ep.c | 28 ++++++++++---------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index f3ac7d46a855..bad588ef69a4 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -222,19 +222,10 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
return -EINVAL;
- reg = PCI_BASE_ADDRESS_0 + (4 * bar);
-
- if (!(flags & PCI_BASE_ADDRESS_SPACE))
- type = PCIE_ATU_TYPE_MEM;
- else
- type = PCIE_ATU_TYPE_IO;
-
- ret = dw_pcie_ep_inbound_atu(ep, func_no, type, epf_bar->phys_addr, bar);
- if (ret)
- return ret;
-
if (ep->epf_bar[bar])
- return 0;
+ goto config_atu;
+
+ reg = PCI_BASE_ADDRESS_0 + (4 * bar);
dw_pcie_dbi_ro_wr_en(pci);
@@ -246,9 +237,20 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
dw_pcie_ep_writel_dbi(ep, func_no, reg + 4, 0);
}
- ep->epf_bar[bar] = epf_bar;
dw_pcie_dbi_ro_wr_dis(pci);
+config_atu:
+ if (!(flags & PCI_BASE_ADDRESS_SPACE))
+ type = PCIE_ATU_TYPE_MEM;
+ else
+ type = PCIE_ATU_TYPE_IO;
+
+ ret = dw_pcie_ep_inbound_atu(ep, func_no, type, epf_bar->phys_addr, bar);
+ if (ret)
+ return ret;
+
+ ep->epf_bar[bar] = epf_bar;
+
return 0;
}
--
2.47.0
In commit 4284c88fff0e ("PCI: designware-ep: Allow pci_epc_set_bar() update
inbound map address") set_bar() was modified to support dynamically
changing the backing physical address of a BAR that was already configured.
This means that set_bar() can be called twice, without ever calling
clear_bar() (as calling clear_bar() would clear the BAR's PCI address
assigned by the host).
This can only be done if the new BAR size/flags does not differ from the
existing BAR configuration. Add these missing checks.
If we allow set_bar() to set e.g. a new BAR size that differs from the
existing BAR size, the new address translation range will be smaller than
the BAR size already determined by the host, which would mean that a read
past the new BAR size would pass the iATU untranslated, which could allow
the host to read memory not belonging to the new struct pci_epf_bar.
While at it, add comments which clarify the support for dynamically
changing the physical address of a BAR. (Which was also missing.)
Cc: stable(a)vger.kernel.org
Fixes: 4284c88fff0e ("PCI: designware-ep: Allow pci_epc_set_bar() update inbound map address")
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
.../pci/controller/dwc/pcie-designware-ep.c | 22 ++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index bad588ef69a4..44a617d54b15 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -222,8 +222,28 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
return -EINVAL;
- if (ep->epf_bar[bar])
+ /*
+ * Certain EPF drivers dynamically change the physical address of a BAR
+ * (i.e. they call set_bar() twice, without ever calling clear_bar(), as
+ * calling clear_bar() would clear the BAR's PCI address assigned by the
+ * host).
+ */
+ if (ep->epf_bar[bar]) {
+ /*
+ * We can only dynamically change a BAR if the new BAR size and
+ * BAR flags do not differ from the existing configuration.
+ */
+ if (ep->epf_bar[bar]->barno != bar ||
+ ep->epf_bar[bar]->size != size ||
+ ep->epf_bar[bar]->flags != flags)
+ return -EINVAL;
+
+ /*
+ * When dynamically changing a BAR, skip writing the BAR reg, as
+ * that would clear the BAR's PCI address assigned by the host.
+ */
goto config_atu;
+ }
reg = PCI_BASE_ADDRESS_0 + (4 * bar);
--
2.47.1
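The rule added by the patch above can be read as a small predicate: a second set_bar() on an already configured BAR is accepted only if the BAR number, size and flags are unchanged, so that only the backing physical address differs. A minimal user-space sketch of that predicate follows. struct epf_bar_model and check_bar_update() are hypothetical stand-ins for illustration, not the driver's types or functions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct pci_epf_bar. */
struct epf_bar_model {
	unsigned long long phys_addr;
	size_t size;
	int barno;
	int flags;
};

/*
 * Decide whether @req may be applied on top of @cur.
 * @cur == NULL models a BAR that has not been configured yet.
 * Returns 0 if the update is allowed, -EINVAL otherwise.
 */
static int check_bar_update(const struct epf_bar_model *cur,
			    const struct epf_bar_model *req)
{
	if (!cur)
		return 0;	/* first set_bar(): nothing to compare against */

	/* Only the physical address may change on a configured BAR. */
	if (cur->barno != req->barno ||
	    cur->size != req->size ||
	    cur->flags != req->flags)
		return -EINVAL;

	return 0;
}
```

In this model a request that changes only phys_addr passes, while one that also changes the size is rejected, which matches the behaviour described in the commit message.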
The "DesignWare Cores PCI Express Controller Register Descriptions,
Version 4.60a", section "1.21.70 IATU_LWR_TARGET_ADDR_OFF_INBOUND_i",
fields LWR_TARGET_RW and LWR_TARGET_HW both state that:
"Field size depends on log2(BAR_MASK+1) in BAR match mode."
I.e. only the upper bits are writable, and the number of writable bits is
dependent on the configured BAR_MASK.
If we do not write the BAR_MASK before writing the iATU registers, we are
relying on the reset value of the BAR_MASK being larger than the requested
BAR size (which is supplied in the struct pci_epf_bar which is passed to
pci_epc_set_bar()). The reset value of the BAR_MASK is SoC dependent.
Thus, if the struct pci_epf_bar requests a BAR size that is larger than the
reset value of the BAR_MASK, the iATU will try to write to read-only bits,
which will cause the iATU to end up redirecting to a physical address that
is different from the address that was intended.
Thus, we should always write the iATU registers after writing the BAR_MASK.
Cc: stable(a)vger.kernel.org
Fixes: f8aed6ec624f ("PCI: dwc: designware: Add EP mode support")
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
---
.../pci/controller/dwc/pcie-designware-ep.c | 28 ++++++++++---------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index f3ac7d46a855..bad588ef69a4 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -222,19 +222,10 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
return -EINVAL;
- reg = PCI_BASE_ADDRESS_0 + (4 * bar);
-
- if (!(flags & PCI_BASE_ADDRESS_SPACE))
- type = PCIE_ATU_TYPE_MEM;
- else
- type = PCIE_ATU_TYPE_IO;
-
- ret = dw_pcie_ep_inbound_atu(ep, func_no, type, epf_bar->phys_addr, bar);
- if (ret)
- return ret;
-
if (ep->epf_bar[bar])
- return 0;
+ goto config_atu;
+
+ reg = PCI_BASE_ADDRESS_0 + (4 * bar);
dw_pcie_dbi_ro_wr_en(pci);
@@ -246,9 +237,20 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
dw_pcie_ep_writel_dbi(ep, func_no, reg + 4, 0);
}
- ep->epf_bar[bar] = epf_bar;
dw_pcie_dbi_ro_wr_dis(pci);
+config_atu:
+ if (!(flags & PCI_BASE_ADDRESS_SPACE))
+ type = PCIE_ATU_TYPE_MEM;
+ else
+ type = PCIE_ATU_TYPE_IO;
+
+ ret = dw_pcie_ep_inbound_atu(ep, func_no, type, epf_bar->phys_addr, bar);
+ if (ret)
+ return ret;
+
+ ep->epf_bar[bar] = epf_bar;
+
return 0;
}
--
2.47.1
Add everything needed to support the DSI output on Renesas r8a779h0
(V4M) SoC, and the DP output (via sn65dsi86 DSI to DP bridge) on the
Renesas grey-hawk board.
Overall, the DSI and the board design are almost identical to the Renesas
r8a779g0 and white-hawk board.
Note: the v4 no longer has the dts and the clk patches, as those have
been merged to renesas-devel.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen+renesas(a)ideasonboard.com>
---
Changes in v4:
- Dropped patches merged to renesas-devel
- Added new patch "dt-bindings: display: renesas,du: Add missing
maxItems" to fix the bindings
- Add the missing maxItems to "dt-bindings: display: renesas,du: Add
r8a779h0"
- Link to v3: https://lore.kernel.org/r/20241206-rcar-gh-dsi-v3-0-d74c2166fa15@ideasonboa…
Changes in v3:
- Update "Write DPTSR only if there are more than one crtc" patch to
"Write DPTSR only if the second source exists"
- Add Laurent's Rb
- Link to v2: https://lore.kernel.org/r/20241205-rcar-gh-dsi-v2-0-42471851df86@ideasonboa…
Changes in v2:
- Add the DT binding with a new conditional block, so that we can set
only the port@0 as required
- Drop port@1 from r8a779h0.dtsi (there's no port@1)
- Add a new patch to write DPTSR only if num_crtcs > 1
- Drop RCAR_DU_FEATURE_NO_DPTSR (not needed anymore)
- Add Cc: stable to the fix, and move it as first patch
- Added the tags from reviews
- Link to v1: https://lore.kernel.org/r/20241203-rcar-gh-dsi-v1-0-738ae1a95d2a@ideasonboa…
---
Tomi Valkeinen (7):
drm/rcar-du: dsi: Fix PHY lock bit check
drm/rcar-du: Write DPTSR only if the second source exists
dt-bindings: display: renesas,du: Add missing maxItems
dt-bindings: display: renesas,du: Add r8a779h0
dt-bindings: display: bridge: renesas,dsi-csi2-tx: Add r8a779h0
drm/rcar-du: dsi: Add r8a779h0 support
drm/rcar-du: Add support for r8a779h0
.../display/bridge/renesas,dsi-csi2-tx.yaml | 1 +
.../devicetree/bindings/display/renesas,du.yaml | 63 ++++++++++++++++++++--
drivers/gpu/drm/renesas/rcar-du/rcar_du_drv.c | 18 +++++++
drivers/gpu/drm/renesas/rcar-du/rcar_du_group.c | 24 ++++++---
drivers/gpu/drm/renesas/rcar-du/rcar_mipi_dsi.c | 4 +-
.../gpu/drm/renesas/rcar-du/rcar_mipi_dsi_regs.h | 1 -
6 files changed, 99 insertions(+), 12 deletions(-)
---
base-commit: adc218676eef25575469234709c2d87185ca223a
change-id: 20241008-rcar-gh-dsi-9c01f5deeac8
Best regards,
--
Tomi Valkeinen <tomi.valkeinen(a)ideasonboard.com>
Commit 1fa08aece425 ("nsfs: convert to path_from_stashed() helper") reused
nsfs dentry's d_fsdata, which no longer contains a pointer to
proc_ns_operations.
Fix the remaining use in is_mnt_ns_file().
Fixes: 1fa08aece425 ("nsfs: convert to path_from_stashed() helper")
Cc: <stable(a)vger.kernel.org> # v6.9
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
---
Came across this while getting the mnt_ns in fsnotify_mark(), tested the
fix in that context. I don't have a test for mainline, though.
fs/namespace.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 23e81c2a1e3f..6eec7794f707 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2055,9 +2055,15 @@ SYSCALL_DEFINE1(oldumount, char __user *, name)
static bool is_mnt_ns_file(struct dentry *dentry)
{
+ struct ns_common *ns;
+
/* Is this a proxy for a mount namespace? */
- return dentry->d_op == &ns_dentry_operations &&
- dentry->d_fsdata == &mntns_operations;
+ if (dentry->d_op != &ns_dentry_operations)
+ return false;
+
+ ns = d_inode(dentry)->i_private;
+
+ return ns->ops == &mntns_operations;
}
struct ns_common *from_mnt_ns(struct mnt_namespace *mnt)
--
2.47.0
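The change above stops comparing d_fsdata against &mntns_operations and instead follows the inode's i_private to the namespace and checks its ops pointer. A minimal user-space model of that lookup is sketched below. Every type and object ending in _model is a hypothetical stand-in, and the model assumes, as the patch does, that i_private of an nsfs inode points at the namespace's common structure.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct ns_ops_model { const char *name; };
struct ns_common_model { const struct ns_ops_model *ops; };
struct inode_model { void *i_private; };
struct dentry_ops_model { int unused; };
struct dentry_model {
	const struct dentry_ops_model *d_op;
	void *d_fsdata;			/* no longer holds the ops pointer */
	struct inode_model *d_inode;
};

static const struct dentry_ops_model ns_dentry_ops_model = { 0 };
static const struct ns_ops_model mntns_ops_model = { "mnt" };
static const struct ns_ops_model netns_ops_model = { "net" };

/* Mirrors the patched logic: check d_op first, then ns->ops. */
static bool is_mnt_ns_file_model(const struct dentry_model *dentry)
{
	const struct ns_common_model *ns;

	if (dentry->d_op != &ns_dentry_ops_model)
		return false;

	ns = dentry->d_inode->i_private;

	return ns->ops == &mntns_ops_model;
}
```

The early return matters in the model as well: the inode is only dereferenced once the dentry is known to belong to nsfs.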
Hi,
This series fixes several suspend issues on Qcom platforms. Patch 1 fixes
the resume failure with spm_lvl=5 suspend on most of the Qcom platforms. For
this patch, I couldn't figure out the exact commit that caused the issue. So I
used the commit that introduced reinit support as a placeholder.
Patch 3 fixes the suspend issue on SM8550 and SM8650 platforms where UFS
PHY retention is not supported. Hence the default spm_lvl=3 suspend fails. So
this patch configures spm_lvl=5 as the default suspend level to force UFSHC/
device powerdown during suspend. This supersedes the previous series [1] that
tried to fix the issue in clock drivers.
This series is tested on Qcom SM8550 MTP and Qcom RB5 boards.
[1] https://lore.kernel.org/linux-arm-msm/20241107-ufs-clk-fix-v1-0-6032ff22a05…
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
Manivannan Sadhasivam (3):
scsi: ufs: qcom: Power off the PHY if it was already powered on in ufs_qcom_power_up_sequence()
scsi: ufs: qcom: Allow passing platform specific OF data
scsi: ufs: qcom: Power down the controller/device during system suspend for SM8550/SM8650 SoCs
drivers/ufs/core/ufshcd-priv.h | 6 ------
drivers/ufs/core/ufshcd.c | 1 -
drivers/ufs/host/ufs-qcom.c | 31 +++++++++++++++++++------------
drivers/ufs/host/ufs-qcom.h | 5 +++++
include/ufs/ufshcd.h | 2 --
5 files changed, 24 insertions(+), 21 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241211-ufs-qcom-suspend-fix-5618e9c56d93
Best regards,
--
Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121203-griminess-blah-4e97@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
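To see the effect of the new cut-off on a narrow counter, the old and new checks from the diff above can be compared in user space. The sketch below copies the two expressions from the patch into standalone functions. The names delta_old(), delta_new() and max_raw_delta() are chosen here for illustration, and uint64_t stands in for the kernel's u64.

```c
#include <stdint.h>

/* Old check: any delta with the MSB of the mask set counts as negative. */
static uint64_t delta_old(uint64_t now, uint64_t last, uint64_t mask)
{
	uint64_t ret = (now - last) & mask;

	return ret & ~(mask >> 1) ? 0 : ret;
}

/* New threshold: 1/2 + 1/4 + 1/8 = 0.875 of the counter range. */
static uint64_t max_raw_delta(uint64_t mask)
{
	return (mask >> 1) + (mask >> 2) + (mask >> 3);
}

/* New check: only deltas above the threshold count as negative motion. */
static uint64_t delta_new(uint64_t now, uint64_t last, uint64_t mask,
			  uint64_t max_delta)
{
	uint64_t ret = (now - last) & mask;

	return ret > max_delta ? 0 : ret;
}
```

For a 24-bit mask (0xFFFFFF) the old check trips at 2^23 = 8388608 cycles, while the new threshold is 14680061 cycles. A delta of 9000000 cycles is therefore discarded by the old check and accepted by the new one. The two tripping points quoted in the commit message are consistent with roughly 80 ns per cycle (671088640 / 2^23); that figure is inferred from the quoted numbers and is not stated in the patch.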
On Mon, Oct 28, 2024 at 05:00:02PM +0530, Hardik Gohil wrote:
> On Mon, Oct 21, 2024 at 3:10 PM Greg KH <gregkh(a)linuxfoundation.org> wrote:
> >
> > On Sat, Oct 19, 2024 at 11:06:40AM +0530, Hardik Gohil wrote:
> > > From: Kenton Groombridge <concord(a)gentoo.org>
> > >
> > > [ Upstream commit 2663d0462eb32ae7c9b035300ab6b1523886c718 ]
> >
> > We can't take patches for 5.10 that are not already in 5.15. Please fix
> > up and resend for ALL relevent trees.
> >
> > thanks,
> >
> > greg k-h
>
> I have just confirmed those are applicable to v5.15 and v5.10.
>
> Request to add those patches.
Please send tested backports.
thanks,
greg k-h
The DSI host must be enabled for the panel to be initialized in
prepare(). Set the prepare_prev_first flag to guarantee this.
This fixes the panel operation on NXP i.MX8MP SoC / Samsung DSIM
DSI host.
Fixes: 849b2e3ff969 ("drm/panel: Add Sitronix ST7701 panel driver")
Signed-off-by: Marek Vasut <marex(a)denx.de>
---
Cc: Chris Morgan <macromorgan(a)hotmail.com>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Hironori KIKUCHI <kikuchan98(a)gmail.com>
Cc: Jagan Teki <jagan(a)amarulasolutions.com>
Cc: Jessica Zhang <quic_jesszhan(a)quicinc.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Neil Armstrong <neil.armstrong(a)linaro.org>
Cc: Simona Vetter <simona(a)ffwll.ch>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: dri-devel(a)lists.freedesktop.org
Cc: stable(a)vger.kernel.org # v6.2+
---
Note that the prepare_prev_first flag was added in Linux 6.2.y commit
5ea6b1702781 ("drm/panel: Add prepare_prev_first flag to drm_panel"),
hence the CC stable v6.2+, even if the Fixes tag points to a commit
in Linux 5.1.y.
---
drivers/gpu/drm/panel/panel-sitronix-st7701.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/panel/panel-sitronix-st7701.c b/drivers/gpu/drm/panel/panel-sitronix-st7701.c
index eef03d04e0cd2..1f72ef7ca74c9 100644
--- a/drivers/gpu/drm/panel/panel-sitronix-st7701.c
+++ b/drivers/gpu/drm/panel/panel-sitronix-st7701.c
@@ -1177,6 +1177,7 @@ static int st7701_probe(struct device *dev, int connector_type)
return dev_err_probe(dev, ret, "Failed to get orientation\n");
drm_panel_init(&st7701->panel, dev, &st7701_funcs, connector_type);
+ st7701->panel.prepare_prev_first = true;
/**
* Once sleep out has been issued, ST7701 IC required to wait 120ms
--
2.45.2
The current implementation of the ucsi glink client connector_status()
callback is only relying on the state of the gpio. This means that even
when the cable is unplugged, the orientation propagated to the switches
along the graph is "orientation normal", instead of "orientation none",
which would be the correct one in this case.
One of the Qualcomm DP-USB PHY combo drivers, which needs to be aware of
the orientation change, is relying on the "orientation none" to skip
the reinitialization of the entire PHY. Since the ucsi glink client
advertises "orientation normal" even when the cable is unplugged, the
mentioned PHY is taken down and reinitialized when in fact it should be
left as-is. This triggers a crash within the displayport controller driver
in turn, which brings the whole system down on some Qualcomm platforms.
Propagating "orientation none" from the ucsi glink client on the
connector_status() callback hides the problem in the mentioned PHY driver
for now. But "orientation none" is nonetheless the correct orientation
to use in this case.
So propagate "orientation none" instead when the connector status
flags say the cable is disconnected.
Fixes: 76716fd5bf09 ("usb: typec: ucsi: glink: move GPIO reading into connector_status callback")
Cc: stable(a)vger.kernel.org # 6.10
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Reviewed-by: Heikki Krogerus <heikki.krogerus(a)linux.intel.com>
Reviewed-by: Neil Armstrong <neil.armstrong(a)linaro.org>
Signed-off-by: Abel Vesa <abel.vesa(a)linaro.org>
---
Changes in v2:
- Re-worded the commit message to explain a bit more what is happening.
- Added Fixes tag and CC'ed stable.
- Dropped the RFC prefix.
- Used the new UCSI_CONSTAT macro which did not exist when v1 was sent.
- Link to v1: https://lore.kernel.org/r/20241017-usb-typec-ucsi-glink-add-orientation-non…
---
drivers/usb/typec/ucsi/ucsi_glink.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/usb/typec/ucsi/ucsi_glink.c b/drivers/usb/typec/ucsi/ucsi_glink.c
index 90948cd6d2972402465a2adaba3e1ed055cf0cfa..fed39d45809050f1e08dc1d34008b5c561461391 100644
--- a/drivers/usb/typec/ucsi/ucsi_glink.c
+++ b/drivers/usb/typec/ucsi/ucsi_glink.c
@@ -185,6 +185,11 @@ static void pmic_glink_ucsi_connector_status(struct ucsi_connector *con)
struct pmic_glink_ucsi *ucsi = ucsi_get_drvdata(con->ucsi);
int orientation;
+ if (!UCSI_CONSTAT(con, CONNECTED)) {
+ typec_set_orientation(con->port, TYPEC_ORIENTATION_NONE);
+ return;
+ }
+
if (con->num > PMIC_GLINK_MAX_PORTS ||
!ucsi->port_orientation[con->num - 1])
return;
---
base-commit: 3e42dc9229c5950e84b1ed705f94ed75ed208228
change-id: 20241017-usb-typec-ucsi-glink-add-orientation-none-73f1f2522999
Best regards,
--
Abel Vesa <abel.vesa(a)linaro.org>
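The decision the patch above introduces can be summarised as: check the connection status first, and consult the GPIO only for a connected cable. A minimal user-space sketch is shown below. The enum and pick_orientation() are hypothetical, and the mapping of GPIO level 0 to "normal" and non-zero to "reverse" is an assumption about the existing callback that the diff does not show.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the Type-C orientation values. */
enum orientation_model {
	ORIENTATION_MODEL_NONE,
	ORIENTATION_MODEL_NORMAL,
	ORIENTATION_MODEL_REVERSE,
};

/*
 * Pick the orientation to propagate to the switches.
 * @connected: connector status reports a cable as attached
 * @gpio_level: level of the orientation GPIO (assumed 0 = normal)
 */
static enum orientation_model pick_orientation(bool connected, int gpio_level)
{
	/* Without a cable the GPIO level carries no meaning. */
	if (!connected)
		return ORIENTATION_MODEL_NONE;

	return gpio_level ? ORIENTATION_MODEL_REVERSE : ORIENTATION_MODEL_NORMAL;
}
```

Before the patch, the disconnected case fell through to the GPIO read and so reported "normal" in this model.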
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the free
pages required for compaction reside in the CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases, that's not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long term GUP cannot allocate memory from the CMA area, so a maximum
of 16 GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16 GB of free CMA memory on the NUMA
node, the order-0 watermark is always met for compaction, so
__compaction_suitable() always returns true, even if the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
To sum up, during long term GUP flow, we should remove ALLOC_CMA
both in __compaction_suitable() and __isolate_free_page().
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
mm/compaction.c | 8 +++++---
mm/page_alloc.c | 4 +++-
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..044c2247 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2384,6 +2384,7 @@ static bool __compaction_suitable(struct zone *zone, int order,
unsigned long wmark_target)
{
unsigned long watermark;
+ bool pin;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
@@ -2395,14 +2396,15 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
+ pin = !!(current->flags & PF_MEMALLOC_PIN);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ pin ? 0 : ALLOC_CMA, wmark_target);
}
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..9a5dfda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,6 +2813,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
+ bool pin;
if (!is_migrate_isolate(mt)) {
unsigned long watermark;
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ pin = !!(current->flags & PF_MEMALLOC_PIN);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
return 0;
}
--
2.7.4
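The reasoning in the commit message above can be reproduced with a simplified model of the watermark check: when ALLOC_CMA is not passed, free CMA pages are not counted as usable. The sketch below is a user-space approximation only. The _model names are hypothetical, lowmem reserves and per-order checks are left out, and free_cma_pages is assumed never to exceed free_pages.

```c
#include <stdbool.h>

#define ALLOC_CMA_MODEL 0x1u	/* hypothetical stand-in for ALLOC_CMA */

/* Hypothetical, simplified view of a zone's free page counters. */
struct zone_model {
	unsigned long free_pages;	/* all free pages, CMA included */
	unsigned long free_cma_pages;	/* free pages in CMA pageblocks */
};

/* Order-0 check: free CMA pages only count when ALLOC_CMA is set. */
static bool watermark_ok_model(const struct zone_model *zone,
			       unsigned long mark, unsigned int alloc_flags)
{
	unsigned long usable = zone->free_pages;

	if (!(alloc_flags & ALLOC_CMA_MODEL))
		usable -= zone->free_cma_pages;

	return usable > mark;
}

/* Mirrors "pin ? 0 : ALLOC_CMA" from the patch. */
static unsigned int compaction_alloc_flags(bool memalloc_pin)
{
	return memalloc_pin ? 0 : ALLOC_CMA_MODEL;
}
```

With 16 GB of free memory that is entirely CMA (4194304 pages of 4 KiB), the check passes for an ordinary task and fails for a task doing long term pinning, which is the case where the patched code lets the allocator stop retrying.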
Hi Carlos,
Please pull this branch with changes for xfs. Christoph said that he'd
rather we rebased the whole xfs for-next branch to preserve
bisectability so this branch folds in his fix for !quota builds and a
missing zero initialization for struct kstat in the mgtime conversion
patch that the build robots just pointed out.
As usual, I did a test-merge with the main upstream branch as of a few
minutes ago, and didn't see any conflicts. Please let me know if you
encounter any problems.
--D
The following changes since commit f92f4749861b06fed908d336b4dee1326003291b:
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux (2024-12-10 18:21:40 -0800)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git tags/xfs-6.13-fixes_2024-12-12
for you to fetch changes up to 12f2930f5f91bc0d67794c69d1961098c7c72040:
xfs: port xfs_ioc_start_commit to multigrain timestamps (2024-12-12 17:45:13 -0800)
----------------------------------------------------------------
xfs: bug fixes for 6.13 [01/12]
Bug fixes for 6.13.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong(a)kernel.org>
----------------------------------------------------------------
Darrick J. Wong (28):
xfs: fix off-by-one error in fsmap's end_daddr usage
xfs: metapath scrubber should use the already loaded inodes
xfs: keep quota directory inode loaded
xfs: return a 64-bit block count from xfs_btree_count_blocks
xfs: don't drop errno values when we fail to ficlone the entire range
xfs: separate healthy clearing mask during repair
xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlink
xfs: mark metadir repair tempfiles with IRECOVERY
xfs: fix null bno_hint handling in xfs_rtallocate_rtg
xfs: fix error bailout in xfs_rtginode_create
xfs: update btree keys correctly when _insrec splits an inode root block
xfs: fix scrub tracepoints when inode-rooted btrees are involved
xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
xfs: only run precommits once per transaction object
xfs: avoid nested calls to __xfs_trans_commit
xfs: don't lose solo superblock counter update transactions
xfs: don't lose solo dquot update transactions
xfs: separate dquot buffer reads from xfs_dqflush
xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
xfs: attach dquot buffer to dquot log item buffer
xfs: convert quotacheck to attach dquot buffers
xfs: fix sb_spino_align checks for large fsblock sizes
xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
xfs: don't crash on corrupt /quotas dirent
xfs: check pre-metadir fields correctly
xfs: fix zero byte checking in the superblock scrubber
xfs: return from xfs_symlink_verify early on V4 filesystems
xfs: port xfs_ioc_start_commit to multigrain timestamps
fs/xfs/libxfs/xfs_btree.c | 33 +++++--
fs/xfs/libxfs/xfs_btree.h | 2 +-
fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +-
fs/xfs/libxfs/xfs_rtgroup.c | 2 +-
fs/xfs/libxfs/xfs_sb.c | 11 ++-
fs/xfs/libxfs/xfs_symlink_remote.c | 4 +-
fs/xfs/scrub/agheader.c | 77 +++++++++++----
fs/xfs/scrub/agheader_repair.c | 6 +-
fs/xfs/scrub/fscounters.c | 2 +-
fs/xfs/scrub/health.c | 57 ++++++-----
fs/xfs/scrub/ialloc.c | 4 +-
fs/xfs/scrub/metapath.c | 68 +++++--------
fs/xfs/scrub/refcount.c | 2 +-
fs/xfs/scrub/scrub.h | 6 ++
fs/xfs/scrub/symlink_repair.c | 3 +-
fs/xfs/scrub/tempfile.c | 22 ++++-
fs/xfs/scrub/trace.h | 2 +-
fs/xfs/xfs_bmap_util.c | 2 +-
fs/xfs/xfs_dquot.c | 195 +++++++++++++++++++++++++++++++------
fs/xfs/xfs_dquot.h | 6 +-
fs/xfs/xfs_dquot_item.c | 51 +++++++---
fs/xfs/xfs_dquot_item.h | 7 ++
fs/xfs/xfs_exchrange.c | 14 +--
fs/xfs/xfs_file.c | 8 ++
fs/xfs/xfs_fsmap.c | 38 +++++---
fs/xfs/xfs_inode.h | 2 +-
fs/xfs/xfs_qm.c | 102 +++++++++++++------
fs/xfs/xfs_qm.h | 1 +
fs/xfs/xfs_quota.h | 5 +-
fs/xfs/xfs_rtalloc.c | 2 +-
fs/xfs/xfs_trans.c | 58 ++++++-----
fs/xfs/xfs_trans_ail.c | 2 +-
fs/xfs/xfs_trans_dquot.c | 31 +++++-
33 files changed, 578 insertions(+), 251 deletions(-)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 54bbee190d42166209185d89070c58a343bf514b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024120223-stunner-letter-9d09@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 54bbee190d42166209185d89070c58a343bf514b Mon Sep 17 00:00:00 2001
From: Raghavendra Rao Ananta <rananta(a)google.com>
Date: Tue, 19 Nov 2024 16:52:29 -0800
Subject: [PATCH] KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow
status
DDI0487K.a D13.3.1 describes the PMU overflow condition, which evaluates
to true if any counter's global enable (PMCR_EL0.E), overflow flag
(PMOVSSET_EL0[n]), and interrupt enable (PMINTENSET_EL1[n]) are all 1.
Of note, this does not require a counter to be enabled
(i.e. PMCNTENSET_EL0[n] = 1) to generate an overflow.
Align kvm_pmu_overflow_status() with the reality of the architecture
and stop using PMCNTENSET_EL0 as part of the overflow condition. The
bug was discovered while running an SBSA PMU test [*], which only sets
PMCR.E, PMOVSSET<0>, PMINTENSET<0>, and expects an overflow interrupt.
Cc: stable(a)vger.kernel.org
Fixes: 76d883c4e640 ("arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register")
Link: https://github.com/ARM-software/sbsa-acs/blob/master/test_pool/pmu/operatin…
Signed-off-by: Raghavendra Rao Ananta <rananta(a)google.com>
[ oliver: massaged changelog ]
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 8ad62284fa23..3855cc9d0ca5 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -381,7 +381,6 @@ static u64 kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
if ((kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E)) {
reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0);
- reg &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
reg &= __vcpu_sys_reg(vcpu, PMINTENSET_EL1);
}
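The overflow condition described in the KVM: arm64 patch above can be sketched in plain C. This is a minimal model, not the KVM code: struct pmu_model and pmu_overflow_status() are hypothetical names, and the registers are reduced to plain integers. It only illustrates that PMCNTENSET_EL0 takes no part in the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the vPMU registers; not the KVM structs. */
struct pmu_model {
	bool     pmcr_e;      /* PMCR_EL0.E: global enable */
	uint64_t pmovsset;    /* PMOVSSET_EL0: per-counter overflow flags */
	uint64_t pmintenset;  /* PMINTENSET_EL1: per-counter interrupt enables */
	uint64_t pmcntenset;  /* PMCNTENSET_EL0: per-counter enables (unused below) */
};

/* Returns the set of counters that currently assert the overflow interrupt. */
static uint64_t pmu_overflow_status(const struct pmu_model *pmu)
{
	uint64_t reg = 0;

	assert(pmu != NULL);
	if (pmu->pmcr_e) {
		reg = pmu->pmovsset;
		reg &= pmu->pmintenset; /* PMCNTENSET_EL0 is deliberately not applied */
	}
	return reg;
}
```

With only PMCR.E, PMOVSSET<0> and PMINTENSET<0> set, as in the SBSA test mentioned in the changelog, this model reports counter 0 as overflowing even though its enable bit is clear.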
On an x86 system under test with 1780 CPUs, topology_span_sane() takes
around 8 seconds cumulatively across all iterations. It is an expensive
operation that sanity-checks the non-NUMA topology masks.
CPU topology does not change very frequently, so make this check
optional for systems where the topology is trusted and faster bootup
is needed.
Restrict the check to the sched_verbose kernel cmdline option so that
systems that want to avoid this penalty can do so.
Cc: stable(a)vger.kernel.org
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
Signed-off-by: Saurabh Sengar <ssengar(a)linux.microsoft.com>
---
[V2]
- Use kernel cmdline param instead of compile time flag.
kernel/sched/topology.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..4ca63bff321d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2363,6 +2363,13 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
{
int i = cpu + 1;
+ /* Skip the topology sanity check for non-debug, as it is a time-consuming operation */
+ if (!sched_debug_verbose) {
+ pr_info_once("%s: Skipping topology span sanity check. Use `sched_verbose` boot parameter to enable it.\n",
+ __func__);
+ return true;
+ }
+
/* NUMA levels are allowed to overlap */
if (tl->flags & SDTL_OVERLAP)
return true;
--
2.43.0
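To illustrate why the check in the sched/topology patch above is costly and how the gate works, here is a simplified user-space sketch. It is an assumption-laden model: spans are 64-bit masks instead of cpumasks, verbose_checks stands in for sched_debug_verbose, and spans_sane() is a hypothetical name that compares all pairs in one call rather than being invoked once per CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's sched_debug_verbose flag. */
static bool verbose_checks;

/*
 * masks[i] is the span of CPU i at one non-NUMA topology level, one bit
 * per CPU. Two spans are sane if they are either identical or disjoint.
 */
static bool spans_sane(const uint64_t *masks, size_t ncpus)
{
	assert(masks != NULL);

	/* The pairwise comparison is O(n^2), so skip it unless asked for. */
	if (!verbose_checks)
		return true;

	for (size_t i = 0; i < ncpus; i++) {
		for (size_t j = i + 1; j < ncpus; j++) {
			if (masks[i] != masks[j] && (masks[i] & masks[j]))
				return false;
		}
	}
	return true;
}
```

The trade-off is visible in the first branch: with the flag clear, a partially overlapping topology is no longer detected, which is acceptable only where the topology is trusted.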
This reverts commit 377548f05bd0905db52a1d50e5b328b9b4eb049d.
Most SoC dtsi files have the display output interfaces disabled by
default, and only enabled on boards that utilize them. The MT8183
has it backwards: the display outputs are left enabled by default,
and only disabled at the board level.
Reverse the situation for the DPI output so that it follows the
normal scheme. For ease of backporting the DSI output is handled
in a separate patch.
Fixes: 009d855a26fd ("arm64: dts: mt8183: add dpi node to mt8183")
Fixes: 377548f05bd0 ("arm64: dts: mediatek: mt8183-kukui: Disable DPI display interface")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Chen-Yu Tsai <wenst(a)chromium.org>
---
arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi | 5 -----
arch/arm64/boot/dts/mediatek/mt8183.dtsi | 1 +
2 files changed, 1 insertion(+), 5 deletions(-)
diff --git a/arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi b/arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
index 07ae3c8e897b..22924f61ec9e 100644
--- a/arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
@@ -290,11 +290,6 @@ dsi_out: endpoint {
};
};
-&dpi0 {
- /* TODO Re-enable after DP to Type-C port muxing can be described */
- status = "disabled";
-};
-
&gic {
mediatek,broken-save-restore-fw;
};
diff --git a/arch/arm64/boot/dts/mediatek/mt8183.dtsi b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
index 1afeeb1155f5..8f31fc9050ec 100644
--- a/arch/arm64/boot/dts/mediatek/mt8183.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
@@ -1845,6 +1845,7 @@ dpi0: dpi@14015000 {
<&mmsys CLK_MM_DPI_MM>,
<&apmixedsys CLK_APMIXED_TVDPLL>;
clock-names = "pixel", "engine", "pll";
+ status = "disabled";
port {
dpi_out: endpoint { };
--
2.47.0.163.g1226f6d8fa-goog
The patch titled
Subject: mm/codetag: clear tags before swap
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-codetag-clear-tags-before-swap.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: David Wang <00107082(a)163.com>
Subject: mm/codetag: clear tags before swap
Date: Fri, 13 Dec 2024 09:33:32 +0800
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is set, kernel WARN would be
triggered when calling __alloc_tag_ref_set() during swap:
alloc_tag was not cleared (got tag for mm/filemap.c:1951)
WARNING: CPU: 0 PID: 816 at ./include/linux/alloc_tag.h...
Clearing the code tags before the swap fixes the warning. This patch also
fixes a potential invalid address dereference in alloc_tag_add_check() when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is set and ref->ct is CODETAG_EMPTY,
which is defined as ((void *)1).
Link: https://lkml.kernel.org/r/20241213013332.89910-1-00107082@163.com
Fixes: 51f43d5d82ed ("mm/codetag: swap tags when migrate pages")
Signed-off-by: David Wang <00107082(a)163.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412112227.df61ebb-lkp@intel.com
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/alloc_tag.h | 2 +-
lib/alloc_tag.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)
--- a/include/linux/alloc_tag.h~mm-codetag-clear-tags-before-swap
+++ a/include/linux/alloc_tag.h
@@ -140,7 +140,7 @@ static inline struct alloc_tag_counters
#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag)
{
- WARN_ONCE(ref && ref->ct,
+ WARN_ONCE(ref && ref->ct && !is_codetag_empty(ref),
"alloc_tag was not cleared (got tag for %s:%u)\n",
ref->ct->filename, ref->ct->lineno);
--- a/lib/alloc_tag.c~mm-codetag-clear-tags-before-swap
+++ a/lib/alloc_tag.c
@@ -209,6 +209,13 @@ void pgalloc_tag_swap(struct folio *new,
return;
}
+ /*
+ * Clear tag references to avoid debug warning when using
+ * __alloc_tag_ref_set() with non-empty reference.
+ */
+ set_codetag_empty(&ref_old);
+ set_codetag_empty(&ref_new);
+
/* swap tags */
__alloc_tag_ref_set(&ref_old, tag_new);
update_page_tag_ref(handle_old, &ref_old);
_
Patches currently in -mm which might be from 00107082(a)163.com are
mm-codetag-clear-tags-before-swap.patch
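The sentinel handling in the mm/codetag patch above can be sketched as follows. The types are cut-down stand-ins whose layout is an assumption, and ref_has_live_tag() is a hypothetical helper that mirrors the condition added to WARN_ONCE(); only the CODETAG_EMPTY value comes from the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; the layout is an assumption. */
struct codetag { const char *filename; unsigned int lineno; };
union codetag_ref { struct codetag *ct; };

/* Sentinel meaning "no tag"; it is not a dereferenceable address. */
#define CODETAG_EMPTY ((void *)1)

static bool is_codetag_empty(const union codetag_ref *ref)
{
	assert(ref != NULL);
	return ref->ct == CODETAG_EMPTY;
}

static void set_codetag_empty(union codetag_ref *ref)
{
	if (ref)
		ref->ct = CODETAG_EMPTY;
}

/* True only when the reference holds a real tag that is safe to dereference. */
static bool ref_has_live_tag(const union codetag_ref *ref)
{
	return ref && ref->ct && !is_codetag_empty(ref);
}
```

Checking for the sentinel before touching ref->ct->filename is what avoids the invalid dereference; a plain non-NULL test would let the address 1 through.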
The patch titled
Subject: mm: convert partially_mapped set/clear operations to be atomic
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-convert-partially_mapped-set-clear-operations-to-be-atomic.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Usama Arif <usamaarif642(a)gmail.com>
Subject: mm: convert partially_mapped set/clear operations to be atomic
Date: Thu, 12 Dec 2024 18:33:51 +0000
Other page flags in the 2nd page, like PG_hwpoison and PG_anon_exclusive
can get modified concurrently. Changes to other page flags might be lost
if they are happening at the same time as non-atomic partially_mapped
operations. Hence, make partially_mapped operations atomic.
Link: https://lkml.kernel.org/r/20241212183351.1345389-1-usamaarif642@gmail.com
Fixes: 8422acdc97ed ("mm: introduce a pageflag for partially mapped folios")
Reported-by: David Hildenbrand <david(a)redhat.com>
Link: https://lore.kernel.org/all/e53b04ad-1827-43a2-a1ab-864c7efecf6e@redhat.com/
Signed-off-by: Usama Arif <usamaarif642(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Domenico Cerasuolo <cerasuolodomenico(a)gmail.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Cc: Nico Pache <npache(a)redhat.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/page-flags.h | 12 ++----------
mm/huge_memory.c | 8 ++++----
2 files changed, 6 insertions(+), 14 deletions(-)
--- a/include/linux/page-flags.h~mm-convert-partially_mapped-set-clear-operations-to-be-atomic
+++ a/include/linux/page-flags.h
@@ -862,18 +862,10 @@ static inline void ClearPageCompound(str
ClearPageHead(page);
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
-FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
-/*
- * PG_partially_mapped is protected by deferred_split split_queue_lock,
- * so its safe to use non-atomic set/clear.
- */
-__FOLIO_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
-__FOLIO_CLEAR_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
+FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
FOLIO_FLAG_FALSE(large_rmappable)
-FOLIO_TEST_FLAG_FALSE(partially_mapped)
-__FOLIO_SET_FLAG_NOOP(partially_mapped)
-__FOLIO_CLEAR_FLAG_NOOP(partially_mapped)
+FOLIO_FLAG_FALSE(partially_mapped)
#endif
#define PG_head_mask ((1UL << PG_head))
--- a/mm/huge_memory.c~mm-convert-partially_mapped-set-clear-operations-to-be-atomic
+++ a/mm/huge_memory.c
@@ -3577,7 +3577,7 @@ int split_huge_page_to_list_to_order(str
!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
@@ -3689,7 +3689,7 @@ bool __folio_unqueue_deferred_split(stru
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
@@ -3733,7 +3733,7 @@ void deferred_split_folio(struct folio *
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (partially_mapped) {
if (!folio_test_partially_mapped(folio)) {
- __folio_set_partially_mapped(folio);
+ folio_set_partially_mapped(folio);
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
@@ -3826,7 +3826,7 @@ static unsigned long deferred_split_scan
} else {
/* We lost race with folio_put() */
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
_
Patches currently in -mm which might be from usamaarif642(a)gmail.com are
mm-convert-partially_mapped-set-clear-operations-to-be-atomic.patch
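The difference between the non-atomic and atomic flag helpers in the partially_mapped patch above can be illustrated with C11 atomics. This is a user-space sketch only: the bit positions and the flag_set()/flag_clear()/flag_test() names are hypothetical, and the kernel uses its own bitops rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions; the real PG_* values differ. */
enum { FLAG_PARTIALLY_MAPPED = 1u << 0, FLAG_HWPOISON = 1u << 1 };

/* Atomic read-modify-write: a concurrent update to another bit is not lost. */
static void flag_set(atomic_uint *flags, unsigned int bit)
{
	assert(bit != 0);
	atomic_fetch_or(flags, bit);
}

static void flag_clear(atomic_uint *flags, unsigned int bit)
{
	assert(bit != 0);
	atomic_fetch_and(flags, ~bit);
}

static bool flag_test(atomic_uint *flags, unsigned int bit)
{
	return (atomic_load(flags) & bit) != 0;
}
```

A non-atomic version would load the word, modify it, and store it back in separate steps. Holding split_queue_lock does not help there, because writers of the other flags in the same word do not take that lock.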
The patch titled
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
Date: Fri, 13 Dec 2024 01:43:28 +0900
When block_invalidatepage was converted to block_invalidate_folio,
folio_invalidate() lost its fallback to block_invalidatepage for the case
where the address_space_operations method invalidatepage (currently
invalidate_folio) is not set.
Unfortunately, some pseudo-inodes in nilfs2 use empty_aops set by
inode_init_always_gfp() as is, or explicitly set it to
address_space_operations. Therefore, with this change,
block_invalidatepage() is no longer called from folio_invalidate(), and as
a result, the buffer_head structures attached to these pages/folios are no
longer freed via try_to_free_buffers().
Thus, these buffer heads are now leaked by truncate_inode_pages(), which
cleans up the page cache from inode evict(), etc.
Three types of caches use empty_aops: gc inode caches and the DAT shadow
inode used by GC, and b-tree node caches. Of these, b-tree node caches
explicitly call invalidate_mapping_pages() during cleanup, which involves
calling try_to_free_buffers(), so the leak was not visible during normal
operation but worsened when GC was performed.
Fix this issue by using address_space_operations with invalidate_folio set
to block_invalidate_folio instead of empty_aops, which will ensure the
same behavior as before.
Link: https://lkml.kernel.org/r/20241212164556.21338-1-konishi.ryusuke@gmail.com
Fixes: 7ba13abbd31e ("fs: Turn block_invalidatepage into block_invalidate_folio")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/btnode.c | 1 +
fs/nilfs2/gcinode.c | 2 +-
fs/nilfs2/inode.c | 5 +++++
fs/nilfs2/nilfs.h | 1 +
4 files changed, 8 insertions(+), 1 deletion(-)
--- a/fs/nilfs2/btnode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/btnode.c
@@ -35,6 +35,7 @@ void nilfs_init_btnc_inode(struct inode
ii->i_flags = 0;
memset(&ii->i_bmap_data, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(btnc_inode->i_mapping, GFP_NOFS);
+ btnc_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
}
void nilfs_btnode_cache_clear(struct address_space *btnc)
--- a/fs/nilfs2/gcinode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/gcinode.c
@@ -163,7 +163,7 @@ int nilfs_init_gcinode(struct inode *ino
inode->i_mode = S_IFREG;
mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
- inode->i_mapping->a_ops = &empty_aops;
+ inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
ii->i_flags = 0;
nilfs_bmap_init_gc(ii->i_bmap);
--- a/fs/nilfs2/inode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/inode.c
@@ -276,6 +276,10 @@ const struct address_space_operations ni
.is_partially_uptodate = block_is_partially_uptodate,
};
+const struct address_space_operations nilfs_buffer_cache_aops = {
+ .invalidate_folio = block_invalidate_folio,
+};
+
static int nilfs_insert_inode_locked(struct inode *inode,
struct nilfs_root *root,
unsigned long ino)
@@ -681,6 +685,7 @@ struct inode *nilfs_iget_for_shadow(stru
NILFS_I(s_inode)->i_flags = 0;
memset(NILFS_I(s_inode)->i_bmap, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(s_inode->i_mapping, GFP_NOFS);
+ s_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
err = nilfs_attach_btree_node_cache(s_inode);
if (unlikely(err)) {
--- a/fs/nilfs2/nilfs.h~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/nilfs.h
@@ -401,6 +401,7 @@ extern const struct file_operations nilf
extern const struct inode_operations nilfs_file_inode_operations;
extern const struct file_operations nilfs_file_operations;
extern const struct address_space_operations nilfs_aops;
+extern const struct address_space_operations nilfs_buffer_cache_aops;
extern const struct inode_operations nilfs_dir_inode_operations;
extern const struct inode_operations nilfs_special_inode_operations;
extern const struct inode_operations nilfs_symlink_inode_operations;
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages.patch
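The leak mechanism in the nilfs2 patch above comes down to an optional callback with no fallback. The sketch below models that with hypothetical miniature types (mini_folio, mini_aops); it is not the VFS code and only shows why an operations table without the hook leaves the buffers attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a folio with attached buffer heads. */
struct mini_folio { int buffers_attached; };

/* Hypothetical miniature of address_space_operations. */
struct mini_aops { void (*invalidate_folio)(struct mini_folio *folio); };

/* Plays the role of block_invalidate_folio(): detaches the buffers. */
static void mini_block_invalidate(struct mini_folio *folio)
{
	folio->buffers_attached = 0;
}

/* Like empty_aops: no hook, so nothing releases the buffers. */
static const struct mini_aops mini_empty_aops = { .invalidate_folio = NULL };

/* Like nilfs_buffer_cache_aops in the patch: the hook is set. */
static const struct mini_aops mini_buffer_cache_aops = {
	.invalidate_folio = mini_block_invalidate,
};

/* No fallback when the hook is missing, mirroring the post-conversion behavior. */
static void mini_truncate(const struct mini_aops *aops, struct mini_folio *folio)
{
	assert(aops != NULL && folio != NULL);
	if (aops->invalidate_folio)
		aops->invalidate_folio(folio);
}
```

Switching the caches from the empty table to one that sets the hook restores the cleanup without touching the truncate path itself.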
The patch titled
Subject: zram: fix panic when using ext4 over zram
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
zram-panic-when-use-ext4-over-zram.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: caiqingfu <caiqingfu(a)ruijie.com.cn>
Subject: zram: fix panic when using ext4 over zram
Date: Fri, 29 Nov 2024 19:57:35 +0800
[ 52.073080 ] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 52.073511 ] Modules linked in:
[ 52.074094 ] CPU: 0 UID: 0 PID: 3825 Comm: a.out Not tainted 6.12.0-07749-g28eb75e178d3-dirty #3
[ 52.074672 ] Hardware name: linux,dummy-virt (DT)
[ 52.075128 ] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 52.075619 ] pc : obj_malloc+0x5c/0x160
[ 52.076402 ] lr : zs_malloc+0x200/0x570
[ 52.076630 ] sp : ffff80008dd335f0
[ 52.076797 ] x29: ffff80008dd335f0 x28: ffff000004104a00 x27: ffff000004dfc400
[ 52.077319 ] x26: 000000000000ca18 x25: ffff00003fcaf0e0 x24: ffff000006925cf0
[ 52.077785 ] x23: 0000000000000c0a x22: ffff0000032ee780 x21: ffff000006925cf0
[ 52.078257 ] x20: 0000000000088000 x19: 0000000000000000 x18: 0000000000fffc18
[ 52.078701 ] x17: 00000000fffffffd x16: 0000000000000803 x15: 00000000fffffffe
[ 52.079203 ] x14: 000000001824429d x13: ffff000006e84000 x12: ffff000006e83fec
[ 52.079711 ] x11: ffff000006e83000 x10: 00000000000002a5 x9 : ffff000006e83ff3
[ 52.080269 ] x8 : 0000000000000001 x7 : 0000000017e80000 x6 : 0000000000017e80
[ 52.080724 ] x5 : 0000000000000003 x4 : ffff00000402a5e8 x3 : 0000000000000066
[ 52.081081 ] x2 : ffff000006925cf0 x1 : ffff00000402a5e8 x0 : ffff000004104a00
[ 52.081595 ] Call trace:
[ 52.081925 ] obj_malloc+0x5c/0x160 (P)
[ 52.082220 ] zs_malloc+0x200/0x570 (L)
[ 52.082504 ] zs_malloc+0x200/0x570
[ 52.082716 ] zram_submit_bio+0x788/0x9e8
[ 52.083017 ] __submit_bio+0x1c4/0x338
[ 52.083343 ] submit_bio_noacct_nocheck+0x128/0x2c0
[ 52.083518 ] submit_bio_noacct+0x1c8/0x308
[ 52.083722 ] submit_bio+0xa8/0x14c
[ 52.083942 ] submit_bh_wbc+0x140/0x1bc
[ 52.084088 ] __block_write_full_folio+0x23c/0x5f0
[ 52.084232 ] block_write_full_folio+0x134/0x21c
[ 52.084524 ] write_cache_pages+0x64/0xd4
[ 52.084778 ] blkdev_writepages+0x50/0x8c
[ 52.085040 ] do_writepages+0x80/0x2b0
[ 52.085292 ] filemap_fdatawrite_wbc+0x6c/0x90
[ 52.085597 ] __filemap_fdatawrite_range+0x64/0x94
[ 52.085900 ] filemap_fdatawrite+0x1c/0x28
[ 52.086158 ] sync_bdevs+0x170/0x17c
[ 52.086374 ] ksys_sync+0x6c/0xb8
[ 52.086597 ] __arm64_sys_sync+0x10/0x20
[ 52.086847 ] invoke_syscall+0x44/0x100
[ 52.087230 ] el0_svc_common.constprop.0+0x40/0xe0
[ 52.087550 ] do_el0_svc+0x1c/0x28
[ 52.087690 ] el0_svc+0x30/0xd0
[ 52.087818 ] el0t_64_sync_handler+0xc8/0xcc
[ 52.088046 ] el0t_64_sync+0x198/0x19c
[ 52.088500 ] Code: 110004a5 6b0500df f9401273 54000160 (f9401664)
[ 52.089097 ] ---[ end trace 0000000000000000 ]---
When using ext4 on zram, the panic above occasionally occurs under
high memory usage.
The reason is that when the handle is obtained using the slow path, the
page is re-compressed. If the data in the page has changed, the compressed
length may exceed the previous one. Writing to the zs_object then
overflows, which causes the panic.
To reproduce it quickly, comment out the fast path to force the slow path
and run a heavy read/write load on the file system.
The solution is to re-obtain the handle after re-compression if the length
is different from the previous one.
Link: https://lkml.kernel.org/r/20241129115735.136033-1-baicaiaichibaicai@gmail.c…
Signed-off-by: caiqingfu <caiqingfu(a)ruijie.com.cn>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--- a/drivers/block/zram/zram_drv.c~zram-panic-when-use-ext4-over-zram
+++ a/drivers/block/zram/zram_drv.c
@@ -1633,6 +1633,7 @@ static int zram_write_page(struct zram *
unsigned long alloced_pages;
unsigned long handle = -ENOMEM;
unsigned int comp_len = 0;
+ unsigned int last_comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
unsigned long element = 0;
@@ -1664,6 +1665,11 @@ compress_again:
if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
+
+ if (last_comp_len && (last_comp_len != comp_len)) {
+ zs_free(zram->mem_pool, handle);
+ handle = (unsigned long)ERR_PTR(-ENOMEM);
+ }
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
@@ -1692,8 +1698,10 @@ compress_again:
if (IS_ERR_VALUE(handle))
return PTR_ERR((void *)handle);
- if (comp_len != PAGE_SIZE)
+ if (comp_len != PAGE_SIZE) {
+ last_comp_len = comp_len;
goto compress_again;
+ }
/*
* If the page is not compressible, you need to acquire the
* lock and execute the code below. The zcomp_stream_get()
_
Patches currently in -mm which might be from caiqingfu(a)ruijie.com.cn are
zram-panic-when-use-ext4-over-zram.patch
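The idea behind the zram fix above, dropping a handle whose size no longer matches the re-compressed length, can be sketched in user space. This is an illustration under stated assumptions: struct handle and handle_fit() are hypothetical, and malloc()/free() stand in for zs_malloc()/zs_free().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical handle: a heap buffer plus the size it was allocated for. */
struct handle { void *mem; size_t size; };

/*
 * Make sure *h can hold comp_len bytes. If a previous pass allocated for a
 * different length, drop that allocation and allocate again.
 * Returns 0 on success, -1 on allocation failure.
 */
static int handle_fit(struct handle *h, size_t comp_len)
{
	assert(h != NULL && comp_len > 0);

	if (h->mem && h->size != comp_len) {
		free(h->mem);
		h->mem = NULL;
		h->size = 0;
	}
	if (!h->mem) {
		h->mem = malloc(comp_len);
		if (!h->mem)
			return -1;
		h->size = comp_len;
	}
	return 0;
}
```

Without the size comparison, a second pass that produced more bytes than the first would write past the end of the original allocation, which is the overflow the changelog describes.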
Since 5.16 and prior to 6.13 KVM can't be used with FSDAX
guest memory (PMD pages). To reproduce the issue you need to reserve
guest memory with `memmap=` cmdline, create and mount FS in DAX mode
(tested both XFS and ext4), see doc link below. ndctl command for test:
ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M
Then pass memory object to qemu like:
-m 8G -object memory-backend-file,id=ram0,size=8G,\
mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
-numa node,memdev=ram0,cpus=0-1
QEMU fails to run guest with error: kvm run failed Bad address
and there are two warnings in dmesg:
WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page() and
WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
It looks like an assumption was made in the past that the pfn won't change
from faultin_pfn() to release_pfn_clean(), e.g. see
commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level")
But the kvm_page_fault structure made the pfn part of mutable state, so
release_pfn_clean() can now take a hugepage-adjusted pfn.
That works for all cases (/dev/shm, hugetlb, devdax) except fsdax.
Apparently in fsdax mode the faultin pfn and the adjusted pfn may refer to
different folios, so we get a get_page/put_page imbalance.
To solve this, preserve the faultin pfn in a separate local variable
and pass it to kvm_release_pfn_clean().
Patch tested for all mentioned guest memory backends with tdp_mmu={0,1}.
No bug in upstream as it was solved fundamentally by
commit 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns")
and related patch series.
Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
Co-developed-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
v1 -> v2:
* Instead of new struct field prefer local variable to snapshot faultin pfn
as suggested by Sean Christopherson.
* Tested patch for 6.1 and 6.12
arch/x86/kvm/mmu/mmu.c | 5 ++++-
arch/x86/kvm/mmu/paging_tmpl.h | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13134954e24d..d392022dcb89 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4245,6 +4245,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
unsigned long mmu_seq;
+ kvm_pfn_t orig_pfn;
int r;
fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4272,6 +4273,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r != RET_PF_CONTINUE)
return r;
+ orig_pfn = fault->pfn;
+
r = RET_PF_RETRY;
if (is_tdp_mmu_fault)
@@ -4296,7 +4299,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(orig_pfn);
return r;
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1f4f5e703f13..685560a45bf6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -790,6 +790,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct guest_walker walker;
+ kvm_pfn_t orig_pfn;
int r;
unsigned long mmu_seq;
bool is_self_change_mapping;
@@ -868,6 +869,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
walker.pte_access &= ~ACC_EXEC_MASK;
}
+ orig_pfn = fault->pfn;
+
r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
@@ -881,7 +884,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(orig_pfn);
return r;
}
--
2.34.1
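The get/put imbalance addressed by the KVM patch above, and the snapshot that fixes it, can be modeled in a few lines. Everything here is hypothetical scaffolding (the refcount array, struct fault, adjust_for_hugepage()); it only demonstrates releasing the pfn that was acquired rather than the one left in the mutable structure.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical per-page reference counts, indexed by pfn. */
static int refcount[NPAGES];

struct fault { size_t pfn; };

static void get_pfn(size_t pfn) { assert(pfn < NPAGES); refcount[pfn]++; }
static void put_pfn(size_t pfn) { assert(pfn < NPAGES); refcount[pfn]--; }

/* Stand-in for the hugepage adjustment: rewrites fault->pfn in place. */
static void adjust_for_hugepage(struct fault *f) { f->pfn &= ~(size_t)3; }

static void handle_fault(struct fault *f)
{
	size_t orig_pfn;

	get_pfn(f->pfn);        /* reference taken on the faulted-in pfn */
	orig_pfn = f->pfn;      /* snapshot before anything can change it */
	adjust_for_hugepage(f);
	put_pfn(orig_pfn);      /* release exactly what was acquired */
}
```

Calling put_pfn(f->pfn) at the end instead would decrement the count of the adjusted page and leave the original one elevated, which corresponds to the two refcount warnings quoted in the changelog.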
This patch series fixes bugs in the following APIs:
- devm_pci_epc_destroy().
- pci_epf_remove_vepf().
and simplifies the API below:
- pci_epc_get().
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v3:
- Remove stable tag of patch 1/3
- Add one more patch 3/3
- Link to v2: https://lore.kernel.org/all/20241102-pci-epc-core_fix-v2-0-0785f8435be5@qui…
Changes in v2:
- Correct title and commit message for patch 1/2.
- Add one more patch 2/2 to simplify API pci_epc_get().
- Link to v1: https://lore.kernel.org/r/20241020-pci-epc-core_fix-v1-1-3899705e3537@quici…
---
Zijun Hu (3):
PCI: endpoint: Fix that API devm_pci_epc_destroy() fails to destroy the EPC device
PCI: endpoint: Simplify API pci_epc_get() implementation
PCI: endpoint: Fix API pci_epf_add_vepf() returning -EBUSY error
drivers/pci/endpoint/pci-epc-core.c | 23 +++++++----------------
drivers/pci/endpoint/pci-epf-core.c | 1 +
2 files changed, 8 insertions(+), 16 deletions(-)
---
base-commit: 11066801dd4b7c4d75fce65c812723a80c1481ae
change-id: 20241020-pci-epc-core_fix-a92512fa9d19
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
While testing the encoded read feature the following crash was observed
and it can be reliably reproduced:
[ 2916.441731] Oops: general protection fault, probably for non-canonical address 0xa3f64e06d5eee2c7: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 2916.441736] CPU: 5 UID: 0 PID: 592 Comm: kworker/u38:4 Kdump: loaded Not tainted 6.13.0-rc1+ #4
[ 2916.441739] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 2916.441740] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[ 2916.441777] RIP: 0010:__wake_up_common+0x29/0xa0
[ 2916.441808] RSP: 0018:ffffaaec0128fd80 EFLAGS: 00010216
[ 2916.441810] RAX: 0000000000000001 RBX: ffff95a6429cf020 RCX: 0000000000000000
[ 2916.441811] RDX: a3f64e06d5eee2c7 RSI: 0000000000000003 RDI: ffff95a6429cf000
^^^^^^^^^^^^^^^^
This comes from `priv->wait.head.next`
[ 2916.441823] Call Trace:
[ 2916.441833] <TASK>
[ 2916.441881] ? __wake_up_common+0x29/0xa0
[ 2916.441883] __wake_up_common_lock+0x37/0x60
[ 2916.441887] btrfs_encoded_read_endio+0x73/0x90 [btrfs] <<< UAF of `priv` object,
[ 2916.441921] btrfs_check_read_bio+0x321/0x500 [btrfs] details below.
[ 2916.441947] process_scheduled_works+0xc1/0x410
[ 2916.441960] worker_thread+0x105/0x240
crash> btrfs_encoded_read_private.wait.head ffff95a6429cf000 # `priv` from RDI ^^
wait.head = {
next = 0xa3f64e06d5eee2c7, # Corrupted as the object was already freed/reused.
prev = 0xffff95a6429cf020 # Stale data still point to itself (`&priv->wait.head`
} also in RBX ^^) ie. the list was free.
This is possibly easier (or even only?) to reproduce on a preemptible kernel.
I just happened to build an RT kernel for additional testing coverage.
Enabling slab debug gives us further related details, mostly confirming
what's expected:
[11:23:07] =============================================================================
[11:23:07] BUG kmalloc-64 (Not tainted): Poison overwritten
[11:23:07] -----------------------------------------------------------------------------
[11:23:07] 0xffff8fc7c5b6b542-0xffff8fc7c5b6b543 @offset=5442. First byte 0x4 instead of 0x6b
^
That makes two bytes into the `priv->wait.lock`
[11:23:07] FIX kmalloc-64: Restoring Poison 0xffff8fc7c5b6b542-0xffff8fc7c5b6b543=0x6b
[11:23:07] Allocated in btrfs_encoded_read_regular_fill_pages+0x5e/0x260 [btrfs] age=4 cpu=0 pid=18295
[11:23:07] __kmalloc_cache_noprof+0x81/0x2a0
[11:23:07] btrfs_encoded_read_regular_fill_pages+0x5e/0x260 [btrfs]
[11:23:07] btrfs_encoded_read_regular+0xee/0x200 [btrfs]
[11:23:07] btrfs_ioctl_encoded_read+0x477/0x600 [btrfs]
[11:23:07] btrfs_ioctl+0xefe/0x2a00 [btrfs]
[11:23:07] __x64_sys_ioctl+0xa3/0xc0
[11:23:07] do_syscall_64+0x74/0x180
[11:23:07] entry_SYSCALL_64_after_hwframe+0x76/0x7e
9121 unsigned long i = 0;
9122 struct btrfs_bio *bbio;
9123 int ret;
9124
* 9125 priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
9126 if (!priv)
9127 return -ENOMEM;
9128
9129 init_waitqueue_head(&priv->wait);
[11:23:07] Freed in btrfs_encoded_read_regular_fill_pages+0x1f9/0x260 [btrfs] age=4 cpu=0 pid=18295
[11:23:07] btrfs_encoded_read_regular_fill_pages+0x1f9/0x260 [btrfs]
[11:23:07] btrfs_encoded_read_regular+0xee/0x200 [btrfs]
[11:23:07] btrfs_ioctl_encoded_read+0x477/0x600 [btrfs]
[11:23:07] btrfs_ioctl+0xefe/0x2a00 [btrfs]
[11:23:07] __x64_sys_ioctl+0xa3/0xc0
[11:23:07] do_syscall_64+0x74/0x180
[11:23:07] entry_SYSCALL_64_after_hwframe+0x76/0x7e
9171 if (atomic_dec_return(&priv->pending) != 0)
9172 io_wait_event(priv->wait, !atomic_read(&priv->pending));
9173 /* See btrfs_encoded_read_endio() for ordering. */
9174 ret = blk_status_to_errno(READ_ONCE(priv->status));
* 9175 kfree(priv);
9176 return ret;
9177 }
9178 }
`priv` was freed here but was still used after that. The report
follows soon after, see below. Note that the report is delayed by a few
seconds due to the RCU stall timeout. (It is the same example as with the
GPF crash above, except that one was reported right away without any delay).
Due to the poison this time instead of the GPF exception as observed above
the UAF caused a CPU hard lockup (reported by the RCU stall check as this
was a VM):
[11:23:28] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[11:23:28] rcu: 0-...!: (1 GPs behind) idle=48b4/1/0x4000000000000000 softirq=0/0 fqs=5 rcuc=5254 jiffies(starved)
[11:23:28] rcu: (detected by 1, t=5252 jiffies, g=1631241, q=250054 ncpus=8)
[11:23:28] Sending NMI from CPU 1 to CPUs 0:
[11:23:28] NMI backtrace for cpu 0
[11:23:28] CPU: 0 UID: 0 PID: 21445 Comm: kworker/u33:3 Kdump: loaded Tainted: G B 6.13.0-rc1+ #4
[11:23:28] Tainted: [B]=BAD_PAGE
[11:23:28] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[11:23:28] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[11:23:28] RIP: 0010:native_halt+0xa/0x10
[11:23:28] RSP: 0018:ffffb42ec277bc48 EFLAGS: 00000046
[11:23:28] Call Trace:
[11:23:28] <TASK>
[11:23:28] kvm_wait+0x53/0x60
[11:23:28] __pv_queued_spin_lock_slowpath+0x2ea/0x350
[11:23:28] _raw_spin_lock_irq+0x2b/0x40
[11:23:28] rtlock_slowlock_locked+0x1f3/0xce0
[11:23:28] rt_spin_lock+0x7b/0xb0
[11:23:28] __wake_up_common_lock+0x23/0x60
[11:23:28] btrfs_encoded_read_endio+0x73/0x90 [btrfs] <<< UAF of `priv` object.
[11:23:28] btrfs_check_read_bio+0x321/0x500 [btrfs]
[11:23:28] process_scheduled_works+0xc1/0x410
[11:23:28] worker_thread+0x105/0x240
9105 if (priv->uring_ctx) {
9106 int err = blk_status_to_errno(READ_ONCE(priv->status));
9107 btrfs_uring_read_extent_endio(priv->uring_ctx, err);
9108 kfree(priv);
9109 } else {
* 9110 wake_up(&priv->wait); <<< So we know UAF/GPF happens here.
9111 }
9112 }
9113 bio_put(&bbio->bio);
Now, the wait queue here does not really guarantee proper
synchronization between `btrfs_encoded_read_regular_fill_pages()`
and `btrfs_encoded_read_endio()`, which eventually results in various
use-after-free effects like a general protection fault or a CPU hard lockup.
Using a plain wait queue without additional instrumentation on top of the
`pending` counter is simply insufficient in this context. The wait queue
fails here because the lifespan of that structure is limited to
the `btrfs_encoded_read_regular_fill_pages()` function. In such a case
a plain wait queue cannot be used to synchronize its own destruction.
Fix this by correctly using a completion instead.
Also, since the lifespan of the structures in the sync case is strictly
limited to the `..._fill_pages()` function, there is no need to
allocate from slab. The stack can be safely used instead.
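The counting scheme described above can be modelled in user space. The
following is only a single-threaded sketch under stated assumptions, not
the kernel code: `struct fake_completion`, `struct read_priv`,
`dec_and_test()`, `endio()` and `submit_and_count_signals()` are
hypothetical names, and a plain counter stands in for `struct completion`.
It shows that with `pending` biased by one for the submitter, exactly one
party signals, whichever side finishes last.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct completion: just counts signals. */
struct fake_completion {
	int done;
};

/* Mirrors the shape of the private data in the sync case (assumption). */
struct read_priv {
	struct fake_completion *sync_read;	/* lives on the submitter's stack */
	atomic_int pending;			/* biased by 1 for the submitter */
};

/* Models atomic_dec_and_test(): true only for the final decrement. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/* Called once per finished bio; only the last finisher signals. */
static void endio(struct read_priv *priv)
{
	if (dec_and_test(&priv->pending))
		priv->sync_read->done++;
}

/*
 * Submit nr_bios reads. With endio_first set, all bios finish before the
 * submitter drops its bias; otherwise they finish after it. Returns how
 * many times the completion was signalled.
 */
static int submit_and_count_signals(int nr_bios, bool endio_first)
{
	struct fake_completion sync_read = { .done = 0 };
	struct read_priv priv = { .sync_read = &sync_read };
	int i;

	atomic_init(&priv.pending, 1);
	for (i = 0; i < nr_bios; i++)
		atomic_fetch_add(&priv.pending, 1);

	if (endio_first)
		for (i = 0; i < nr_bios; i++)
			endio(&priv);

	/* The submitter drops its bias and signals if nothing is in flight. */
	if (dec_and_test(&priv.pending))
		sync_read.done++;

	if (!endio_first)
		for (i = 0; i < nr_bios; i++)
			endio(&priv);

	/* A real submitter would block on the completion before returning. */
	return sync_read.done;
}
```

Calling `submit_and_count_signals(3, true)` and
`submit_and_count_signals(3, false)` both report a single signal, which is
the property the submitter relies on before its stack frame goes away.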
Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable(a)vger.kernel.org # 5.18+
Signed-off-by: Daniel Vacek <neelx(a)suse.com>
---
fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++----------------------
1 file changed, 33 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fa648ab6fe806..61e0fd5c6a15f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9078,7 +9078,7 @@ static ssize_t btrfs_encoded_read_inline(
}
struct btrfs_encoded_read_private {
- wait_queue_head_t wait;
+ struct completion *sync_read;
void *uring_ctx;
atomic_t pending;
blk_status_t status;
@@ -9090,23 +9090,22 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
if (bbio->bio.bi_status) {
/*
- * The memory barrier implied by the atomic_dec_return() here
- * pairs with the memory barrier implied by the
- * atomic_dec_return() or io_wait_event() in
- * btrfs_encoded_read_regular_fill_pages() to ensure that this
- * write is observed before the load of status in
- * btrfs_encoded_read_regular_fill_pages().
+ * The memory barrier implied by the
+ * atomic_dec_and_test() here pairs with the memory
+ * barrier implied by the atomic_dec_and_test() in
+ * btrfs_encoded_read_regular_fill_pages() to ensure
+ * that this write is observed before the load of
+ * status in btrfs_encoded_read_regular_fill_pages().
*/
WRITE_ONCE(priv->status, bbio->bio.bi_status);
}
if (atomic_dec_and_test(&priv->pending)) {
- int err = blk_status_to_errno(READ_ONCE(priv->status));
-
if (priv->uring_ctx) {
+ int err = blk_status_to_errno(READ_ONCE(priv->status));
btrfs_uring_read_extent_endio(priv->uring_ctx, err);
kfree(priv);
} else {
- wake_up(&priv->wait);
+ complete(priv->sync_read);
}
}
bio_put(&bbio->bio);
@@ -9117,16 +9116,21 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
struct page **pages, void *uring_ctx)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct btrfs_encoded_read_private *priv;
+ struct completion sync_read;
+ struct btrfs_encoded_read_private sync_priv, *priv;
unsigned long i = 0;
struct btrfs_bio *bbio;
- int ret;
- priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
- if (!priv)
- return -ENOMEM;
+ if (uring_ctx) {
+ priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
+ if (!priv)
+ return -ENOMEM;
+ } else {
+ priv = &sync_priv;
+ init_completion(&sync_read);
+ priv->sync_read = &sync_read;
+ }
- init_waitqueue_head(&priv->wait);
atomic_set(&priv->pending, 1);
priv->status = 0;
priv->uring_ctx = uring_ctx;
@@ -9158,23 +9162,23 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
atomic_inc(&priv->pending);
btrfs_submit_bbio(bbio, 0);
- if (uring_ctx) {
- if (atomic_dec_return(&priv->pending) == 0) {
- ret = blk_status_to_errno(READ_ONCE(priv->status));
- btrfs_uring_read_extent_endio(uring_ctx, ret);
+ if (atomic_dec_and_test(&priv->pending)) {
+ if (uring_ctx) {
+ int err = blk_status_to_errno(READ_ONCE(priv->status));
+ btrfs_uring_read_extent_endio(uring_ctx, err);
kfree(priv);
- return ret;
+ return err;
+ } else {
+ complete(&sync_read);
}
+ }
+ if (uring_ctx)
return -EIOCBQUEUED;
- } else {
- if (atomic_dec_return(&priv->pending) != 0)
- io_wait_event(priv->wait, !atomic_read(&priv->pending));
- /* See btrfs_encoded_read_endio() for ordering. */
- ret = blk_status_to_errno(READ_ONCE(priv->status));
- kfree(priv);
- return ret;
- }
+
+ wait_for_completion_io(&sync_read);
+ /* See btrfs_encoded_read_endio() for ordering. */
+ return blk_status_to_errno(READ_ONCE(priv->status));
}
ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
--
2.45.2
From: Shu Han <ebpqwerty472123(a)gmail.com>
[ Upstream commit ea7e2d5e49c05e5db1922387b09ca74aa40f46e2 ]
The remap_file_pages syscall handler calls do_mmap() directly, which
doesn't contain the LSM security check. And if the process has called
personality(READ_IMPLIES_EXEC) before and remap_file_pages() is called for
RW pages, this will actually result in remapping the pages to RWX,
bypassing a W^X policy enforced by SELinux.
So we should check prot by security_mmap_file LSM hook in the
remap_file_pages syscall handler before do_mmap() is called. Otherwise, it
potentially permits an attacker to bypass a W^X policy enforced by
SELinux.
The bypass is similar to CVE-2016-10044, which bypasses the same policy via
AIO and can be found in [1].
The PoC:
$ cat > test.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/personality.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    size_t pagesz = sysconf(_SC_PAGE_SIZE);
    int mfd = syscall(SYS_memfd_create, "test", 0);
    const char *buf = mmap(NULL, 4 * pagesz, PROT_READ | PROT_WRITE,
                           MAP_SHARED, mfd, 0);
    unsigned int old = syscall(SYS_personality, 0xffffffff);
    syscall(SYS_personality, READ_IMPLIES_EXEC | old);
    syscall(SYS_remap_file_pages, buf, pagesz, 0, 2, 0);
    syscall(SYS_personality, old);
    // show the RWX page exists even if W^X policy is enforced
    int fd = open("/proc/self/maps", O_RDONLY);
    unsigned char buf2[1024];
    while (1) {
        int ret = read(fd, buf2, 1024);
        if (ret <= 0) break;
        write(1, buf2, ret);
    }
    close(fd);
}
$ gcc test.c -o test
$ ./test | grep rwx
7f1836c34000-7f1836c35000 rwxs 00002000 00:01 2050 /memfd:test (deleted)
Link: https://project-zero.issues.chromium.org/issues/42452389 [1]
Cc: stable(a)vger.kernel.org
Signed-off-by: Shu Han <ebpqwerty472123(a)gmail.com>
Acked-by: Stephen Smalley <stephen.smalley.work(a)gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul(a)paul-moore.com>
[ Resolve merge conflict in mm/mmap.c. ]
Signed-off-by: Bin Lan <bin.lan.cn(a)windriver.com>
---
mm/mmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a9933ede542..ebc3583fa612 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3021,8 +3021,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
flags |= MAP_LOCKED;
file = get_file(vma->vm_file);
+ ret = security_mmap_file(vma->vm_file, prot, flags);
+ if (ret)
+ goto out_fput;
ret = do_mmap(vma->vm_file, start, size,
prot, flags, pgoff, &populate, NULL);
+out_fput:
fput(file);
out:
mmap_write_unlock(mm);
--
2.43.0
[ Sasha's backport helper bot ]
Hi,
The upstream commit SHA1 provided is correct: fcf6a49d79923a234844b8efe830a61f3f0584e4
WARNING: Author mismatch between patch and upstream commit:
Backport author: <gregkh(a)linuxfoundation.org>
Commit author: Wayne Lin <wayne.lin(a)amd.com>
Status in newer kernel trees:
6.12.y | Present (exact SHA1)
6.6.y | Present (different SHA1: c7e65cab54a8)
6.1.y | Not found
Note: The patch differs from the upstream commit:
---
1: fcf6a49d79923 ! 1: 79f06b6c107fd drm/amd/display: Don't refer to dc_sink in is_dsc_need_re_compute
@@
## Metadata ##
-Author: Wayne Lin <wayne.lin(a)amd.com>
+Author: gregkh(a)linuxfoundation.org <gregkh(a)linuxfoundation.org>
## Commit message ##
- drm/amd/display: Don't refer to dc_sink in is_dsc_need_re_compute
+ Patch "[PATCH 6.1.y] drm/amd/display: Don't refer to dc_sink in is_dsc_need_re_compute" has been added to the 5.4-stable tree
+
+ This is a note to let you know that I've just added the patch titled
+
+ [PATCH 6.1.y] drm/amd/display: Don't refer to dc_sink in is_dsc_need_re_compute
+
+ to the 5.4-stable tree which can be found at:
+ http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
+
+ The filename of the patch is:
+ drm-amd-display-don-t-refer-to-dc_sink-in-is_dsc_need_re_compute.patch
+ and it can be found in the queue-5.4 subdirectory.
+
+ If you, or anyone else, feels it should not be added to the stable tree,
+ please let <stable(a)vger.kernel.org> know about it.
+
+ From jianqi.ren.cn(a)windriver.com Thu Dec 12 13:11:21 2024
+ From: <jianqi.ren.cn(a)windriver.com>
+ Date: Wed, 11 Dec 2024 18:15:44 +0800
+ Subject: [PATCH 6.1.y] drm/amd/display: Don't refer to dc_sink in is_dsc_need_re_compute
+ To: <wayne.lin(a)amd.com>, <gregkh(a)linuxfoundation.org>
+ Cc: <patches(a)lists.linux.dev>, <jerry.zuo(a)amd.com>, <zaeem.mohamed(a)amd.com>, <daniel.wheeler(a)amd.com>, <alexander.deucher(a)amd.com>, <stable(a)vger.kernel.org>, <harry.wentland(a)amd.com>, <sunpeng.li(a)amd.com>, <Rodrigo.Siqueira(a)amd.com>, <christian.koenig(a)amd.com>, <airlied(a)gmail.com>, <daniel(a)ffwll.ch>, <Jerry.Zuo(a)amd.com>, <amd-gfx(a)lists.freedesktop.org>, <dri-devel(a)lists.freedesktop.org>, <linux-kernel(a)vger.kernel.org>
+ Message-ID: <20241211101544.2121147-1-jianqi.ren.cn(a)windriver.com>
+
+ From: Wayne Lin <wayne.lin(a)amd.com>
+
+ [ Upstream commit fcf6a49d79923a234844b8efe830a61f3f0584e4 ]
[Why]
When unplug one of monitors connected after mst hub, encounter null pointer dereference.
@@ Commit message
Signed-off-by: Wayne Lin <wayne.lin(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
+ Signed-off-by: Jianqi Ren <jianqi.ren.cn(a)windriver.com>
## drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c ##
@@ drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c: amdgpu_dm_mst_connector_early_unregister(struct drm_connector *connector)
@@ drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c: dm_dp_mst_detect(st
amdgpu_dm_set_mst_status(&aconnector->mst_status,
MST_REMOTE_EDID | MST_ALLOCATE_NEW_PAYLOAD | MST_CLEAR_ALLOCATED_PAYLOAD,
-@@ drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c: static bool is_dsc_need_re_compute(
- if (!aconnector || !aconnector->dsc_aux)
- continue;
-
-- /*
-- * check if cached virtual MST DSC caps are available and DSC is supported
-- * as per specifications in their Virtual DPCD registers.
-- */
-- if (!(aconnector->dc_sink->dsc_caps.dsc_dec_caps.is_dsc_supported ||
-- aconnector->dc_link->dpcd_caps.dsc_caps.dsc_basic_caps.fields.dsc_support.DSC_PASSTHROUGH_SUPPORT))
-- continue;
--
- stream_on_link[new_stream_on_link_num] = aconnector;
- new_stream_on_link_num++;
-
---
Results of testing on various branches:
| Branch | Patch Apply | Build Test |
|---------------------------|-------------|------------|
| stable/linux-6.1.y | Success | Success |
| stable/linux-5.4.y | Failed | N/A |
On Tue, 10 Dec 2024 at 22:04, Sasha Levin <sashal(a)kernel.org> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> rtla/timerlat: Make timerlat_top_cpu->*_count unsigned long long
>
> to the 6.6-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> rtla-timerlat-make-timerlat_top_cpu-_count-unsigned-.patch
> and it can be found in the queue-6.6 subdirectory.
>
Could you also add "rtla/timerlat: Make timerlat_hist_cpu->*_count
unsigned long long" (76b3102148135945b013797fac9b20), just like
we already have in the queue for 6.12? It makes no sense to apply one fix
but not the other (clearly autosel AI won't take over the world yet).
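For reference, the 32-bit wraparound that the quoted commit message
describes can be reproduced in plain C. This is a minimal sketch, assuming
every one of the 127 CPUs reports exactly 43200096 samples (an
idealization, so the wrapped total may differ from the reported one by a
couple of samples). `sum_count_32()` and `sum_count_64()` are hypothetical
helpers, and the narrow accumulator is unsigned here only so that the
wraparound is well defined in C.

```c
#include <stdint.h>

/* Sum per-CPU counts in a 32-bit accumulator; the total wraps modulo 2^32. */
static uint32_t sum_count_32(uint32_t per_cpu_count, unsigned int nr_cpus)
{
	uint32_t sum = 0;
	unsigned int i;

	for (i = 0; i < nr_cpus; i++)
		sum += per_cpu_count;
	return sum;
}

/* Same sum in a 64-bit accumulator, as after the type change. */
static unsigned long long sum_count_64(unsigned long long per_cpu_count,
				       unsigned int nr_cpus)
{
	unsigned long long sum = 0;
	unsigned int i;

	for (i = 0; i < nr_cpus; i++)
		sum += per_cpu_count;
	return sum;
}
```

With these inputs the wide sum is 5486412192 and the narrow one is
5486412192 - 2^32 = 1191444896, so an average computed from the narrow
count comes out roughly 4.6 times too high.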
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 0b8030ad5be8c39c4ad0f27fa740b3140a31023b
> Author: Tomas Glozar <tglozar(a)redhat.com>
> Date: Fri Oct 11 14:10:14 2024 +0200
>
> rtla/timerlat: Make timerlat_top_cpu->*_count unsigned long long
>
> [ Upstream commit 4eba4723c5254ba8251ecb7094a5078d5c300646 ]
>
> Most fields of struct timerlat_top_cpu are unsigned long long, but the
> fields {irq,thread,user}_count are int (32-bit signed).
>
> This leads to overflow when tracing on a large number of CPUs for a long
> enough time:
> $ rtla timerlat top -a20 -c 1-127 -d 12h
> ...
> 0 12:00:00 | IRQ Timer Latency (us) | Thread Timer Latency (us)
> CPU COUNT | cur min avg max | cur min avg max
> 1 #43200096 | 0 0 1 2 | 3 2 6 12
> ...
> 127 #43200096 | 0 0 1 2 | 3 2 5 11
> ALL #119144 e4 | 0 5 4 | 2 28 16
>
> The average latency should be 0-1 for IRQ and 5-6 for thread, but is
> reported as 5 and 28, about 4 to 5 times more, due to the count
> overflowing when summed over all CPUs: 43200096 * 127 = 5486412192,
> however, 1191444898 (= 5486412192 mod MAX_INT) is reported instead, as
> seen on the last line of the output, and the averages are thus ~4.6
> times higher than they should be (5486412192 / 1191444898 = ~4.6).
>
> Fix the issue by changing {irq,thread,user}_count fields to unsigned
> long long, similarly to other fields in struct timerlat_top_cpu and to
> the count variable in timerlat_top_print_sum.
>
> Link: https://lore.kernel.org/20241011121015.2868751-1-tglozar@redhat.com
> Reported-by: Attila Fazekas <afazekas(a)redhat.com>
> Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
> index a84f43857de14..0915092057f85 100644
> --- a/tools/tracing/rtla/src/timerlat_top.c
> +++ b/tools/tracing/rtla/src/timerlat_top.c
> @@ -49,9 +49,9 @@ struct timerlat_top_params {
> };
>
> struct timerlat_top_cpu {
> - int irq_count;
> - int thread_count;
> - int user_count;
> + unsigned long long irq_count;
> + unsigned long long thread_count;
> + unsigned long long user_count;
>
> unsigned long long cur_irq;
> unsigned long long min_irq;
> @@ -237,7 +237,7 @@ static void timerlat_top_print(struct osnoise_tool *top, int cpu)
> /*
> * Unless trace is being lost, IRQ counter is always the max.
> */
> - trace_seq_printf(s, "%3d #%-9d |", cpu, cpu_data->irq_count);
> + trace_seq_printf(s, "%3d #%-9llu |", cpu, cpu_data->irq_count);
>
> if (!cpu_data->irq_count) {
> trace_seq_printf(s, "%s %s %s %s |", no_value, no_value, no_value, no_value);
>
Thanks,
Tomas
The hid-sensor-hub creates the individual device structs and transfers them
to the created mfd platform-devices via the platform_data in the mfd_cell.
Before e651a1da442a ("HID: hid-sensor-hub: Allow parallel synchronous reads")
the sensor-hub was managing access centrally, with one "completion" in the
hub's data structure, which needed to be finished on removal at the latest.
The mentioned commit then moved this central management to each hid sensor
device, resulting in a completion in each struct hid_sensor_hub_device.
The remove procedure was adapted to go through all sensor devices and
finish any pending "completion".
What this didn't take into account was that platform_device_add_data(),
which is used by mfd_add{_hotplug}_devices(), does a kmemdup on the submitted
platform-data. So the data the platform-device gets is a copy of the
original data, meaning that the device worked on a different completion
than what sensor_hub_remove() currently wants to access.
To fix that, use device_for_each_child() to go through each child-device
similar to how mfd_remove_devices() unregisters the devices later and
with that get the live platform_data to finalize the correct completion.
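The copy problem described above can be illustrated in user space. This is
only a sketch under stated assumptions: `struct fake_hsdev`, `dup_mem()`
and the two helper functions are hypothetical, `dup_mem()` stands in for
kmemdup(), and a boolean stands in for the pending completion.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-sensor device data. */
struct fake_hsdev {
	bool completed;
};

/* User-space stand-in for kmemdup(): returns a heap copy of src. */
static void *dup_mem(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}

/* Signal the original template; report whether the live copy saw it. */
static bool signal_original_reaches_copy(void)
{
	struct fake_hsdev original = { .completed = false };
	struct fake_hsdev *live = dup_mem(&original, sizeof(original));
	bool seen;

	if (!live)
		return false;
	original.completed = true;	/* what the old remove path did */
	seen = live->completed;		/* what the device actually waits on */
	free(live);
	return seen;
}

/* Signal the live copy directly; report whether the live copy saw it. */
static bool signal_live_reaches_copy(void)
{
	struct fake_hsdev original = { .completed = false };
	struct fake_hsdev *live = dup_mem(&original, sizeof(original));
	bool seen;

	if (!live)
		return false;
	live->completed = true;		/* what the fixed remove path does */
	seen = live->completed;
	free(live);
	return seen;
}
```

The first helper returns false and the second returns true, which is why
the fix has to reach the data through the child devices rather than the
original array.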
Fixes: e651a1da442a ("HID: hid-sensor-hub: Allow parallel synchronous reads")
Cc: stable(a)vger.kernel.org
Acked-by: Benjamin Tissoires <bentiss(a)kernel.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada(a)linux.intel.com>
Signed-off-by: Heiko Stuebner <heiko(a)sntech.de>
---
drivers/hid/hid-sensor-hub.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/drivers/hid/hid-sensor-hub.c b/drivers/hid/hid-sensor-hub.c
index 7bd86eef6ec7..4c94c03cb573 100644
--- a/drivers/hid/hid-sensor-hub.c
+++ b/drivers/hid/hid-sensor-hub.c
@@ -730,23 +730,30 @@ static int sensor_hub_probe(struct hid_device *hdev,
return ret;
}
+static int sensor_hub_finalize_pending_fn(struct device *dev, void *data)
+{
+ struct hid_sensor_hub_device *hsdev = dev->platform_data;
+
+ if (hsdev->pending.status)
+ complete(&hsdev->pending.ready);
+
+ return 0;
+}
+
static void sensor_hub_remove(struct hid_device *hdev)
{
struct sensor_hub_data *data = hid_get_drvdata(hdev);
unsigned long flags;
- int i;
hid_dbg(hdev, " hardware removed\n");
hid_hw_close(hdev);
hid_hw_stop(hdev);
+
spin_lock_irqsave(&data->lock, flags);
- for (i = 0; i < data->hid_sensor_client_cnt; ++i) {
- struct hid_sensor_hub_device *hsdev =
- data->hid_sensor_hub_client_devs[i].platform_data;
- if (hsdev->pending.status)
- complete(&hsdev->pending.ready);
- }
+ device_for_each_child(&hdev->dev, NULL,
+ sensor_hub_finalize_pending_fn);
spin_unlock_irqrestore(&data->lock, flags);
+
mfd_remove_devices(&hdev->dev);
mutex_destroy(&data->mutex);
}
--
2.45.2
The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.
Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.
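The "-1 means not yet known" convention can be sketched as follows. This
is a simplified, hypothetical model and not the body of dsa_8021q_rcv():
`fill_if_unset()` and `resolved_port()` are illustrative names, and a
fixed non-negative value stands in for stale stack contents, since reading
a truly uninitialized variable is undefined in C.

```c
/*
 * Hypothetical, simplified model of a receive helper that treats -1 as
 * "not yet known" and leaves any other value untouched.
 */
static void fill_if_unset(int *source_port, int *switch_id,
			  int decoded_port, int decoded_switch)
{
	if (*source_port == -1)
		*source_port = decoded_port;
	if (*switch_id == -1)
		*switch_id = decoded_switch;
}

/*
 * Returns the source port a caller ends up with when both arguments start
 * out as initial_value. Anything other than -1 models stale stack data.
 */
static int resolved_port(int initial_value, int decoded_port)
{
	int src_port = initial_value;
	int switch_id = initial_value;

	fill_if_unset(&src_port, &switch_id, decoded_port, 0);
	return src_port;
}
```

Starting from -1 the decoded port is adopted; starting from any other
value the stale number survives, which in this model corresponds to the
lookup failure described above.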
Fixes: dcfe7673787b ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()")
Signed-off-by: Robert Hodaszi <robert.hodaszi(a)digi.com>
---
Cc: stable(a)vger.kernel.org
---
net/dsa/tag_ocelot_8021q.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c
index 8e8b1bef6af6..11ea8cfd6266 100644
--- a/net/dsa/tag_ocelot_8021q.c
+++ b/net/dsa/tag_ocelot_8021q.c
@@ -79,7 +79,7 @@ static struct sk_buff *ocelot_xmit(struct sk_buff *skb,
static struct sk_buff *ocelot_rcv(struct sk_buff *skb,
struct net_device *netdev)
{
- int src_port, switch_id;
+ int src_port = -1, switch_id = -1;
dsa_8021q_rcv(skb, &src_port, &switch_id, NULL, NULL);
--
2.43.0
From: Joe Damato <jdamato(a)fastly.com>
[ Upstream commit 08062af0a52107a243f7608fd972edb54ca5b7f8 ]
In commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature")
napi_defer_irqs was added to net_device and napi_defer_irqs_count was
added to napi_struct, both as type int.
This value never goes below zero, so there is no reason for it to be a
signed int. Change the type for both from int to u32, and add an
overflow check to sysfs to limit the value to S32_MAX.
The limit of S32_MAX was chosen because the practical limit before this
patch was S32_MAX (anything larger was an overflow) and thus there are
no behavioral changes introduced. If the extra bit is needed in the
future, the limit can be raised.
Before this patch:
$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
$ cat /sys/class/net/eth4/napi_defer_hard_irqs
-2147483647
After this patch:
$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
bash: line 0: echo: write error: Numerical result out of range
Similarly, /sys/class/net/XXXXX/tx_queue_len is defined as unsigned:
include/linux/netdevice.h: unsigned int tx_queue_len;
And has an overflow check:
dev_change_tx_queue_len(..., unsigned long new_len):
if (new_len != (unsigned int)new_len)
return -ERANGE;
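The range check and the old wrapped reading can be modelled in user space.
This is a minimal sketch under stated assumptions: `stored_value`,
`change_defer_hard_irqs()` and `as_signed32()` are hypothetical names, and
INT32_MAX from <stdint.h> stands in for the kernel's S32_MAX.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the u32 field in struct net_device. */
static uint32_t stored_value;

/* Reject anything above the signed 32-bit maximum, store the rest. */
static int change_defer_hard_irqs(unsigned long val)
{
	if (val > INT32_MAX)
		return -ERANGE;

	stored_value = (uint32_t)val;
	return 0;
}

/* How a 32-bit pattern reads when interpreted as a signed 32-bit int. */
static long long as_signed32(uint32_t v)
{
	if (v <= INT32_MAX)
		return (long long)v;
	return (long long)v - 4294967296LL;
}
```

Here `as_signed32(2147483649u)` gives -2147483647, matching the "before"
output, while `change_defer_hard_irqs(2147483649UL)` returns -ERANGE,
which the shell reports as "Numerical result out of range".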
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Joe Damato <jdamato(a)fastly.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Link: https://patch.msgid.link/20240904153431.307932-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Jianqi Ren <jianqi.ren.cn(a)windriver.com>
---
include/linux/netdevice.h | 4 ++--
net/core/net-sysfs.c | 6 +++++-
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fbbd0df1106b..8379e938cd89 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -352,7 +352,7 @@ struct napi_struct {
unsigned long state;
int weight;
- int defer_hard_irqs_count;
+ u32 defer_hard_irqs_count;
unsigned long gro_bitmask;
int (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
@@ -2193,7 +2193,7 @@ struct net_device {
struct bpf_prog __rcu *xdp_prog;
unsigned long gro_flush_timeout;
- int napi_defer_hard_irqs;
+ u32 napi_defer_hard_irqs;
#define GRO_LEGACY_MAX_SIZE 65536u
/* TCP minimal MSS is 8 (TCP_MIN_GSO_SIZE),
* and shinfo->gso_segs is a 16bit field.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 8a06f97320e0..4ce57e75d139 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -30,6 +30,7 @@
#ifdef CONFIG_SYSFS
static const char fmt_hex[] = "%#x\n";
static const char fmt_dec[] = "%d\n";
+static const char fmt_uint[] = "%u\n";
static const char fmt_ulong[] = "%lu\n";
static const char fmt_u64[] = "%llu\n";
@@ -405,6 +406,9 @@ NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
{
+ if (val > S32_MAX)
+ return -ERANGE;
+
WRITE_ONCE(dev->napi_defer_hard_irqs, val);
return 0;
}
@@ -418,7 +422,7 @@ static ssize_t napi_defer_hard_irqs_store(struct device *dev,
return netdev_store(dev, attr, buf, len, change_napi_defer_hard_irqs);
}
-NETDEVICE_SHOW_RW(napi_defer_hard_irqs, fmt_dec);
+NETDEVICE_SHOW_RW(napi_defer_hard_irqs, fmt_uint);
static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t len)
--
2.25.1
Changes in v8:
- Picks up change I agreed with Vlad but failed to cherry-pick into my b4
tree - Vlad/Bod
- Rewords the commit log for patch #3. As I read it I decided I might
translate bits of it from thought-stream into English - Bod
- Link to v7: https://lore.kernel.org/r/20241211-b4-linux-next-24-11-18-clock-multiple-po…
Changes in v7:
- Expand commit log in patch #3
I've discussed with Bjorn on IRC and video what to put into the log here
and captured most of what we discussed.
Mostly the point here is voting for voltages in the power-domain list
is up to the drivers to do with performance states/opp-tables not for the
GDSC code. - Bjorn/Bryan
- Link to v6: https://lore.kernel.org/r/20241129-b4-linux-next-24-11-18-clock-multiple-po…
Changes in v6:
- Passes NULL to second parameter of devm_pm_domain_attach_list - Vlad
- Link to v5: https://lore.kernel.org/r/20241128-b4-linux-next-24-11-18-clock-multiple-po…
Changes in v5:
- In-lines devm_pm_domain_attach_list() in probe() directly - Vlad
- Link to v4: https://lore.kernel.org/r/20241127-b4-linux-next-24-11-18-clock-multiple-po…
v4:
- Adds Bjorn's RB to first patch - Bjorn
- Drops the 'd' in "and int" - Bjorn
- Amends commit log of patch 3 to capture a number of open questions -
Bjorn
- Link to v3: https://lore.kernel.org/r/20241126-b4-linux-next-24-11-18-clock-multiple-po…
v3:
- Fixes commit log "per which" - Bryan
- Link to v2: https://lore.kernel.org/r/20241125-b4-linux-next-24-11-18-clock-multiple-po…
v2:
The main change in this version is Bjorn's pointing out that pm_runtime_*
inside of the gdsc_enable/gdsc_disable path would be recursive and cause a
lockdep splat. Dmitry alluded to this too.
Bjorn pointed to stuff being done lower in the gdsc_register() routine that
might be a starting point.
I iterated around that idea and came up with patch #3. When a gdsc has no
parent and the pd_list is non-NULL then attach that orphan GDSC to the
clock controller power-domain list.
Existing subdomain code in gdsc_register() will connect the parent GDSCs in
the clock-controller to the clock-controller subdomain, the new code here
does that same job for a list of power-domains the clock controller depends
on.
To Dmitry's point about MMCX and MCX dependencies for the registers inside
of the clock controller, I have switched off all references in a test dtsi
and confirmed that accessing the clock-controller regs themselves isn't
required.
On the second point, I also verified my test branch with lockdep enabled,
which was a concern with the pm_domain version of this solution, but I wanted
to cover it anyway with the new approach for completeness' sake.
Here's the item-by-item list of changes:
- Adds a patch to capture pm_genpd_add_subdomain() result code - Bryan
- Changes changelog of second patch to remove singleton and generally
to make the commit log easier to understand - Bjorn
- Uses devm_pm_domain_attach_list - Vlad
- Changes error check to if (ret < 0 && ret != -EEXIST) - Vlad
- Retains passing &pd_data instead of NULL - because NULL doesn't do
the same thing - Bryan/Vlad
- Retains standalone function qcom_cc_pds_attach() because the pd_data
enumeration looks neater in a standalone function - Bryan/Vlad
- Drops pm_runtime in favour of gdsc_add_subdomain_list() for each
power-domain in the pd_list.
The pd_list will be whatever is pointed to by power-domains = <>
in the dtsi - Bjorn
- Link to v1: https://lore.kernel.org/r/20241118-b4-linux-next-24-11-18-clock-multiple-po…
v1:
On x1e80100 and its SKUs the Camera Clock Controller - CAMCC has
multiple power-domains which power it. Usually with a single power-domain
the core platform code will automatically switch on the singleton
power-domain for you. If you have multiple power-domains for a device, in
this case the clock controller, you need to switch those power-domains
on/off yourself.
The clock controllers can also contain Global Distributed
Switch Controllers - GDSCs which themselves can be referenced from dtsi
nodes ultimately triggering a gdsc_en() in drivers/clk/qcom/gdsc.c.
As an example:
cci0: cci@ac4a000 {
power-domains = <&camcc TITAN_TOP_GDSC>;
};
This series adds the support to attach a power-domain list to the
clock-controllers and the GDSCs those controllers provide so that in the
case of the above example gdsc_toggle_logic() will trigger the power-domain
list with pm_runtime_resume_and_get() and pm_runtime_put_sync()
respectively.
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
Bryan O'Donoghue (3):
clk: qcom: gdsc: Capture pm_genpd_add_subdomain result code
clk: qcom: common: Add support for power-domain attachment
clk: qcom: Support attaching GDSCs to multiple parents
drivers/clk/qcom/common.c | 6 ++++++
drivers/clk/qcom/gdsc.c | 41 +++++++++++++++++++++++++++++++++++++++--
drivers/clk/qcom/gdsc.h | 1 +
3 files changed, 46 insertions(+), 2 deletions(-)
---
base-commit: 744cf71b8bdfcdd77aaf58395e068b7457634b2c
change-id: 20241118-b4-linux-next-24-11-18-clock-multiple-power-domains-a5f994dc452a
Best regards,
--
Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 64506b3d23a337e98a74b18dcb10c8619365f2bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121240-props-brittle-f872@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 64506b3d23a337e98a74b18dcb10c8619365f2bd Mon Sep 17 00:00:00 2001
From: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Date: Mon, 11 Nov 2024 23:18:31 +0530
Subject: [PATCH] scsi: ufs: qcom: Only free platform MSIs when ESI is enabled
Otherwise, it will result in a NULL pointer dereference as below:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
Call trace:
mutex_lock+0xc/0x54
platform_device_msi_free_irqs_all+0x14/0x20
ufs_qcom_remove+0x34/0x48 [ufs_qcom]
platform_remove+0x28/0x44
device_remove+0x4c/0x80
device_release_driver_internal+0xd8/0x178
driver_detach+0x50/0x9c
bus_remove_driver+0x6c/0xbc
driver_unregister+0x30/0x60
platform_driver_unregister+0x14/0x20
ufs_qcom_pltform_exit+0x18/0xb94 [ufs_qcom]
__arm64_sys_delete_module+0x180/0x260
invoke_syscall+0x44/0x100
el0_svc_common.constprop.0+0xc0/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x34/0xdc
el0t_64_sync_handler+0xc0/0xc4
el0t_64_sync+0x190/0x194
Cc: stable(a)vger.kernel.org # 6.3
Fixes: 519b6274a777 ("scsi: ufs: qcom: Add MCQ ESI config vendor specific ops")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/20241111-ufs_bug_fix-v1-2-45ad8b62f02e@linaro.org
Reviewed-by: Bean Huo <beanhuo(a)micron.com>
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/ufs/host/ufs-qcom.c b/drivers/ufs/host/ufs-qcom.c
index 3b592492e152..5220ec78021d 100644
--- a/drivers/ufs/host/ufs-qcom.c
+++ b/drivers/ufs/host/ufs-qcom.c
@@ -1861,10 +1861,12 @@ static int ufs_qcom_probe(struct platform_device *pdev)
static void ufs_qcom_remove(struct platform_device *pdev)
{
struct ufs_hba *hba = platform_get_drvdata(pdev);
+ struct ufs_qcom_host *host = ufshcd_get_variant(hba);
pm_runtime_get_sync(&(pdev)->dev);
ufshcd_remove(hba);
- platform_device_msi_free_irqs_all(hba->dev);
+ if (host->esi_enabled)
+ platform_device_msi_free_irqs_all(hba->dev);
}
static const struct of_device_id ufs_qcom_of_match[] __maybe_unused = {
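For readers who want the failure mode in isolation, here is a minimal user-space sketch of the guard this patch adds. It is not the driver code: struct msi_state, struct host and host_remove() are hypothetical stand-ins, and the assumption (suggested by the trace, not confirmed in the commit message) is that the MSI bookkeeping only exists once ESI has been set up.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device MSI bookkeeping. */
struct msi_state {
	int nr_irqs;
};

/* Hypothetical stand-in for the host: msi stays NULL unless ESI was enabled. */
struct host {
	bool esi_enabled;
	struct msi_state *msi;
};

/*
 * Model of the remove path after the fix. The MSI state is only touched
 * when ESI was enabled; without the check, the NULL msi pointer would be
 * dereferenced. Returns the number of IRQs released.
 */
static int host_remove(struct host *h)
{
	int freed = 0;

	if (h->esi_enabled) {
		freed = h->msi->nr_irqs;
		h->msi->nr_irqs = 0;
	}
	return freed;
}
```

Calling host_remove() on a host with esi_enabled == false and msi == NULL returns 0 and never touches the pointer, which is the behaviour the one-line check in ufs_qcom_remove() is after.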
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121206-varnish-jackpot-7d74@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
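The numbers in the changelog can be checked with a short stand-alone sketch. This is plain user-space C, not kernel code. The helper names are made up for illustration, and the 80 ns cycle period is an assumption inferred from the quoted figures (671088640 ns / 2^23 cycles), not something the commit message states.

```c
#include <stdint.h>

/* Old behaviour: a masked delta above mask/2 has the MSB of the mask set. */
static inline uint64_t half_width_cutoff(uint64_t mask)
{
	return mask >> 1;
}

/* New threshold, same expression as in the patch: about 87.5% of the range. */
static inline uint64_t max_raw_delta(uint64_t mask)
{
	return (mask >> 1) + (mask >> 2) + (mask >> 3);
}

/* Hypothetical helper: cycles to nanoseconds for a fixed cycle period. */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint64_t ns_per_cycle)
{
	return cycles * ns_per_cycle;
}
```

For a 24-bit mask (0xFFFFFF) the old check accepts deltas up to 8388607 cycles and trips at 8388608, which is 671088640 ns at 80 ns per cycle. max_raw_delta() gives 14680061 cycles, or 1174404880 ns, which agrees with the roughly 1.17 s tripping point quoted above once the rounding from the three shifts is taken into account.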
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121205-could-retype-331c@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
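To see what the changed comparison does to a late wakeup, the two variants of clocksource_delta() can be modelled side by side in user space. The bodies mirror the removed and added lines of the hunk above; the names delta_old() and delta_new() and the sample values are illustrative assumptions only.

```c
#include <stdint.h>

/* Pre-patch check: any masked delta with the MSB of the mask set becomes 0. */
static inline uint64_t delta_old(uint64_t now, uint64_t last, uint64_t mask)
{
	uint64_t ret = (now - last) & mask;

	return (ret & ~(mask >> 1)) ? 0 : ret;
}

/* Post-patch check: only deltas above the supplied threshold become 0. */
static inline uint64_t delta_new(uint64_t now, uint64_t last, uint64_t mask,
				 uint64_t max_delta)
{
	uint64_t ret = (now - last) & mask;

	return ret > max_delta ? 0 : ret;
}
```

With a 24-bit mask and a threshold of 14680061, a counter that advanced by 9000000 cycles across a wrap (last = 0xF00000, now = (last + 9000000) & mask) is discarded by delta_old() but returned as 9000000 by delta_new(). A genuine small step backwards (now = last - 100) masks to 16777116, which is above the threshold, so both variants still return 0.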
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121204-among-product-771b@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121227-undermine-effort-1e7c@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 76031d9536a076bf023bedbdb1b4317fc801dd67
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121226-cathedral-decimeter-88cb@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76031d9536a076bf023bedbdb1b4317fc801dd67 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 3 Dec 2024 11:16:30 +0100
Subject: [PATCH] clocksource: Make negative motion detection more robust
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.
It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.
max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns
If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.
Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.
For the case at hand this results in a tripping point of 1174405120ns.
Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.
Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.
Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.n…
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index ef1b16da6ad5..65b7c41471c3 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -49,6 +49,7 @@ struct module;
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
+ * @max_raw_delta: Maximum safe delta value for negative motion detection
* @name: Pointer to clocksource name
* @list: List head for registration (internal)
* @freq_khz: Clocksource frequency in khz.
@@ -109,6 +110,7 @@ struct clocksource {
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
+ u64 max_raw_delta;
const char *name;
struct list_head list;
u32 freq_khz;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index aab6472853fa..7304d7cf47f2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -24,7 +24,7 @@ static void clocksource_enqueue(struct clocksource *cs);
static noinline u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
{
- u64 delta = clocksource_delta(end, start, cs->mask);
+ u64 delta = clocksource_delta(end, start, cs->mask, cs->max_raw_delta);
if (likely(delta < cs->max_cycles))
return clocksource_cyc2ns(delta, cs->mult, cs->shift);
@@ -993,6 +993,15 @@ static inline void clocksource_update_max_deferment(struct clocksource *cs)
cs->max_idle_ns = clocks_calc_max_nsecs(cs->mult, cs->shift,
cs->maxadj, cs->mask,
&cs->max_cycles);
+
+ /*
+ * Threshold for detecting negative motion in clocksource_delta().
+ *
+ * Allow for 0.875 of the counter width so that overly long idle
+ * sleeps, which go slightly over mask/2, do not trigger the
+ * negative motion detection.
+ */
+ cs->max_raw_delta = (cs->mask >> 1) + (cs->mask >> 2) + (cs->mask >> 3);
}
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0ca85ff4fbb4..3d128825d343 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -755,7 +755,8 @@ static void timekeeping_forward_now(struct timekeeper *tk)
u64 cycle_now, delta;
cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -2230,7 +2231,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
return false;
offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
- tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
+ tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
/* Check if there's really nothing to do */
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h
index 63e600e943a7..8c9079108ffb 100644
--- a/kernel/time/timekeeping_internal.h
+++ b/kernel/time/timekeeping_internal.h
@@ -30,15 +30,15 @@ static inline void timekeeping_inc_mg_floor_swaps(void)
#endif
-static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
+static inline u64 clocksource_delta(u64 now, u64 last, u64 mask, u64 max_delta)
{
u64 ret = (now - last) & mask;
/*
- * Prevent time going backwards by checking the MSB of mask in
- * the result. If set, return 0.
+ * Prevent time going backwards by checking the result against
+ * @max_delta. If greater, return 0.
*/
- return ret & ~(mask >> 1) ? 0 : ret;
+ return ret > max_delta ? 0 : ret;
}
/* Semi public for serialization of non timekeeper VDSO updates. */
This reverts commit dfe6c5692fb5 ("ocfs2: fix the la space leak when
unmounting an ocfs2 volume").
In commit dfe6c5692fb5, the commit log "This bug has existed since the
initial OCFS2 code." is wrong. The correct introduction commit is
30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
The influence of commit dfe6c5692fb5 is that it provides a correct
fix for the latest kernel. However, it shouldn't be pushed to stable
branches. Let's use this commit to revert all branches that include
dfe6c5692fb5 and use a new fix method to fix commit 30dd3478c3cd.
Fixes: dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume")
Signed-off-by: Heming Zhao <heming.zhao(a)suse.com>
Cc: <stable(a)vger.kernel.org>
---
fs/ocfs2/localalloc.c | 19 -------------------
1 file changed, 19 deletions(-)
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 8ac42ea81a17..5df34561c551 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -1002,25 +1002,6 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
start = bit_off + 1;
}
- /* clear the contiguous bits until the end boundary */
- if (count) {
- blkno = la_start_blk +
- ocfs2_clusters_to_blocks(osb->sb,
- start - count);
-
- trace_ocfs2_sync_local_to_main_free(
- count, start - count,
- (unsigned long long)la_start_blk,
- (unsigned long long)blkno);
-
- status = ocfs2_release_clusters(handle,
- main_bm_inode,
- main_bm_bh, blkno,
- count);
- if (status < 0)
- mlog_errno(status);
- }
-
bail:
if (status)
mlog_errno(status);
--
2.43.0
If the caller of vmap() specifies VM_MAP_PUT_PAGES (currently only the
i915 driver), we will decrement nr_vmalloc_pages and MEMCG_VMALLOC in
vfree(). These counters are incremented by vmalloc() but not by vmap()
so this will cause an underflow. Check the VM_MAP_PUT_PAGES flag before
decrementing either counter.
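The accounting imbalance can be illustrated with a small user-space model. This is only a sketch: struct toy_area, TOY_VM_MAP_PUT_PAGES and the toy_* functions are hypothetical stand-ins, not the mm/vmalloc.c implementation, and only the global page counter is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_VM_MAP_PUT_PAGES 0x1u /* hypothetical stand-in for VM_MAP_PUT_PAGES */

struct toy_area {
	unsigned int flags;
	long nr_pages;
};

static long toy_nr_vmalloc_pages; /* stand-in for nr_vmalloc_pages */

/* vmalloc() path: the counter is incremented at allocation time. */
static struct toy_area toy_vmalloc(long nr_pages)
{
	struct toy_area a = { .flags = 0, .nr_pages = nr_pages };

	toy_nr_vmalloc_pages += nr_pages;
	return a;
}

/* vmap() with the put-pages flag: pages are handed over, counter untouched. */
static struct toy_area toy_vmap_put_pages(long nr_pages)
{
	struct toy_area a = { .flags = TOY_VM_MAP_PUT_PAGES, .nr_pages = nr_pages };

	return a;
}

/* vfree() path with the fix: only undo accounting that was actually done. */
static void toy_vfree(const struct toy_area *a)
{
	assert(a != NULL);
	if (!(a->flags & TOY_VM_MAP_PUT_PAGES))
		toy_nr_vmalloc_pages -= a->nr_pages;
}
```

Without the flag check in toy_vfree(), freeing the vmap()ed area would subtract pages that were never added, which is the underflow described above.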
Fixes: b944afc9d64d (mm: add a VM_MAP_PUT_PAGES flag for vmap)
Cc: stable(a)vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
---
mm/vmalloc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f009b21705c1..5c88d0e90c20 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3374,7 +3374,8 @@ void vfree(const void *addr)
struct page *page = vm->pages[i];
BUG_ON(!page);
- mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+ if (!(vm->flags & VM_MAP_PUT_PAGES))
+ mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
/*
* High-order allocs for huge vmallocs are split, so
* can be freed as an array of order-0 allocations
@@ -3382,7 +3383,8 @@ void vfree(const void *addr)
__free_page(page);
cond_resched();
}
- atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
+ if (!(vm->flags & VM_MAP_PUT_PAGES))
+ atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
kvfree(vm->pages);
kfree(vm);
}
--
2.45.2
From: Wayne Lin <wayne.lin(a)amd.com>
[ Upstream commit fcf6a49d79923a234844b8efe830a61f3f0584e4 ]
[Why]
When unplugging one of the monitors connected behind an MST hub, a null pointer dereference occurs.
This is because dc_sink gets released immediately in early_unregister() or detect_ctx(). Committing
a new state that directly refers to info stored in dc_sink then causes a null pointer
dereference.
[how]
Remove redundant checking condition. Relevant condition should already be covered by checking
if dsc_aux is null or not. Also reset dsc_aux to NULL when the connector is disconnected.
Reviewed-by: Jerry Zuo <jerry.zuo(a)amd.com>
Acked-by: Zaeem Mohamed <zaeem.mohamed(a)amd.com>
Signed-off-by: Wayne Lin <wayne.lin(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Jianqi Ren <jianqi.ren.cn(a)windriver.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
index 1acef5f3838f..a1619f4569cf 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
@@ -183,6 +183,8 @@ amdgpu_dm_mst_connector_early_unregister(struct drm_connector *connector)
dc_sink_release(dc_sink);
aconnector->dc_sink = NULL;
aconnector->edid = NULL;
+ aconnector->dsc_aux = NULL;
+ port->passthrough_aux = NULL;
}
aconnector->mst_status = MST_STATUS_DEFAULT;
@@ -487,6 +489,8 @@ dm_dp_mst_detect(struct drm_connector *connector,
dc_sink_release(aconnector->dc_sink);
aconnector->dc_sink = NULL;
aconnector->edid = NULL;
+ aconnector->dsc_aux = NULL;
+ port->passthrough_aux = NULL;
amdgpu_dm_set_mst_status(&aconnector->mst_status,
MST_REMOTE_EDID | MST_ALLOCATE_NEW_PAYLOAD | MST_CLEAR_ALLOCATED_PAYLOAD,
--
2.25.1
Netpoll will explicitly pass the polling call with a budget of 0 to
indicate it's clearing the Tx path only. gve_rx_poll and gve_xdp_poll
mistakenly took the 0 budget as an indication to do all the work.
Add a check so that the rx path and xdp path are not called when
budget is 0, and also avoid napi_complete_done being called
when budget is 0 for netpoll.
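The intended control flow can be sketched in a simplified user-space model. The struct and toy_* names below are hypothetical, and Tx cleanup is reduced to clearing a counter; this is an illustration of the budget handling only, not the gve driver code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_block {
	int tx_pending; /* completed Tx descriptors waiting to be cleaned */
	int rx_pending; /* received packets waiting to be processed */
};

/* In this toy model, Tx cleanup always drains everything. */
static void toy_tx_clean(struct toy_block *b)
{
	b->tx_pending = 0;
}

/* Rx processing is bounded by the budget. */
static int toy_rx_clean(struct toy_block *b, int budget)
{
	int done = b->rx_pending < budget ? b->rx_pending : budget;

	b->rx_pending -= done;
	return done;
}

static int toy_napi_poll(struct toy_block *b, int budget)
{
	assert(b != NULL && budget >= 0);

	toy_tx_clean(b);

	/* Budget 0 means "Tx only" (netpoll): skip Rx and report no work. */
	if (!budget)
		return 0;

	return toy_rx_clean(b, budget);
}
```

A call with budget 0 cleans Tx and leaves the Rx queue untouched, while a call with a positive budget processes at most that many Rx packets.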
The original fix was merged here:
https://lore.kernel.org/r/20231114004144.2022268-1-ziweixiao@google.com
Resend it since the original one did not apply cleanly to the 5.15 kernel.
commit 278a370c1766 ("gve: Fixes for napi_poll when budget is 0")
Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Ziwei Xiao <ziweixiao(a)google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
---
Changes in v2:
* Add the original git commit id
---
drivers/net/ethernet/google/gve/gve_main.c | 7 +++++++
drivers/net/ethernet/google/gve/gve_rx.c | 4 ----
drivers/net/ethernet/google/gve/gve_tx.c | 4 ----
3 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index bf8a4a7c43f7..c3f1959533a8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -198,6 +198,10 @@ static int gve_napi_poll(struct napi_struct *napi, int budget)
if (block->tx)
reschedule |= gve_tx_poll(block, budget);
+
+ if (!budget)
+ return 0;
+
if (block->rx)
reschedule |= gve_rx_poll(block, budget);
@@ -246,6 +250,9 @@ static int gve_napi_poll_dqo(struct napi_struct *napi, int budget)
if (block->tx)
reschedule |= gve_tx_poll_dqo(block, /*do_clean=*/true);
+ if (!budget)
+ return 0;
+
if (block->rx) {
work_done = gve_rx_poll_dqo(block, budget);
reschedule |= work_done == budget;
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 94941d4e4744..368e0e770178 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -599,10 +599,6 @@ bool gve_rx_poll(struct gve_notify_block *block, int budget)
feat = block->napi.dev->features;
- /* If budget is 0, do all the work */
- if (budget == 0)
- budget = INT_MAX;
-
if (budget > 0)
repoll |= gve_clean_rx_done(rx, budget, feat);
else
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index 665ac795a1ad..d56b8356f1f3 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -691,10 +691,6 @@ bool gve_tx_poll(struct gve_notify_block *block, int budget)
u32 nic_done;
u32 to_do;
- /* If budget is 0, do all the work */
- if (budget == 0)
- budget = INT_MAX;
-
/* Find out how much work there is to be done */
tx->last_nic_done = gve_tx_load_event_counter(priv, tx);
nic_done = be32_to_cpu(tx->last_nic_done);
--
2.47.0.338.g60cca15819-goog
From: Juntong Deng <juntong.deng(a)outlook.com>
commit 7ad4e0a4f61c57c3ca291ee010a9d677d0199fba upstream.
In gfs2_put_super(), whether withdrawn or not, the quota should
be cleaned up by gfs2_quota_cleanup().
Otherwise, struct gfs2_sbd will be freed before gfs2_qd_dealloc (rcu
callback) has run for all gfs2_quota_data objects, resulting in
use-after-free.
Also, gfs2_destroy_threads() and gfs2_quota_cleanup() are already called
by gfs2_make_fs_ro(), so in gfs2_put_super(), after calling
gfs2_make_fs_ro(), there is no need to call them again.
Reported-by: syzbot+29c47e9e51895928698c(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=29c47e9e51895928698c
Signed-off-by: Juntong Deng <juntong.deng(a)outlook.com>
Signed-off-by: Andreas Gruenbacher <agruenba(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Clayton Casciato <majortomtosourcecontrol(a)gmail.com>
Signed-off-by: Guocai He <guocai.he.cn(a)windriver.com>
---
This commit backports 7ad4e0a4f61c57c3ca291ee010a9d677d0199fba to the linux-5.15.y branch to
fix CVE-2024-52760. Please merge this commit into linux-5.15.y.
fs/gfs2/super.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 268651ac9fc8..98158559893f 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -590,6 +590,8 @@ static void gfs2_put_super(struct super_block *sb)
if (!sb_rdonly(sb)) {
gfs2_make_fs_ro(sdp);
+ } else {
+ gfs2_quota_cleanup(sdp);
}
WARN_ON(gfs2_withdrawing(sdp));
--
2.34.1
NULL-dereference is possible in amd_pstate_adjust_perf in 6.6 stable
release.
The problem has been fixed by the following upstream patch that was adapted
to 6.6. The patch couldn't be applied cleanly, but the changes made are
minor.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Hi,
ocfs2 on a drbd device, writing something to it, then unmount ends up in:
[ 1135.766639] OCFS2: ERROR (device drbd0): ocfs2_block_group_clear_bits:
Group descriptor # 4128768 has bit count 32256 but claims 33222 are freed.
num_bits 996
[ 1135.766645] On-disk corruption discovered. Please run fsck.ocfs2 once
the filesystem is unmounted.
[ 1135.766647] (umount,10751,3):_ocfs2_free_suballoc_bits:2490 ERROR:
status = -30
[ 1135.766650] (umount,10751,3):_ocfs2_free_clusters:2573 ERROR: status =
-30
[ 1135.766652] (umount,10751,3):ocfs2_sync_local_to_main:1027 ERROR:
status = -30
[ 1135.766654] (umount,10751,3):ocfs2_sync_local_to_main:1032 ERROR:
status = -30
[ 1135.766656] (umount,10751,3):ocfs2_shutdown_local_alloc:449 ERROR:
status = -30
[ 1135.965908] ocfs2: Unmounting device (147,0) on (node 2)
This is since 6.6.55, reverting this patch helps:
commit e7a801014726a691d4aa6e3839b3f0940ea41591
Author: Heming Zhao <heming.zhao(a)suse.com>
Date: Fri Jul 19 19:43:10 2024 +0800
ocfs2: fix the la space leak when unmounting an ocfs2 volume
commit dfe6c5692fb525e5e90cefe306ee0dffae13d35f upstream.
Linux 6.1.119 is also broken, but 6.12.4 is fine.
I guess there is something missing?
Thomas
On Tue, 10 Dec 2024 at 21:44, Sasha Levin <sashal(a)kernel.org> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> rtla/utils: Add idle state disabling via libcpupower
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> rtla-utils-add-idle-state-disabling-via-libcpupower.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
This is a part of a patchset implementing a new feature, rtla idle
state disabling, see [1]. It seems it was included in this stable
queue and also the one for 6.6 by mistake.
Also, the patch by itself does not do anything, because it depends on
preceding commits from the patchset to enable HAVE_LIBCPUPOWER_SUPPORT
and on following commits to actually implement the functionality in
the rtla command line interface.
Perhaps AUTOSEL picked it due to some merge conflicts with other
patches? I don't know of any though.
[1] - https://lore.kernel.org/linux-trace-kernel/20241017140914.3200454-1-tglozar…
>
>
>
> commit 354bcd3b3efd600f4d23cee6898a6968659bb3a9
> Author: Tomas Glozar <tglozar(a)redhat.com>
> Date: Thu Oct 17 16:09:11 2024 +0200
>
> rtla/utils: Add idle state disabling via libcpupower
>
> [ Upstream commit 083d29d3784319e9e9fab3ac02683a7b26ae3480 ]
>
> Add functions to utils.c to disable idle states through functions of
> libcpupower. This will serve as the basis for disabling idle states
> per cpu when running timerlat.
>
> Link: https://lore.kernel.org/20241017140914.3200454-4-tglozar@redhat.com
> Signed-off-by: Tomas Glozar <tglozar(a)redhat.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/tools/tracing/rtla/src/utils.c b/tools/tracing/rtla/src/utils.c
> index 0735fcb827ed7..230f9fc7502dd 100644
> --- a/tools/tracing/rtla/src/utils.c
> +++ b/tools/tracing/rtla/src/utils.c
> @@ -4,6 +4,9 @@
> */
>
> #define _GNU_SOURCE
> +#ifdef HAVE_LIBCPUPOWER_SUPPORT
> +#include <cpuidle.h>
> +#endif /* HAVE_LIBCPUPOWER_SUPPORT */
> #include <dirent.h>
> #include <stdarg.h>
> #include <stdlib.h>
> @@ -519,6 +522,153 @@ int set_cpu_dma_latency(int32_t latency)
> return fd;
> }
>
> +#ifdef HAVE_LIBCPUPOWER_SUPPORT
> +static unsigned int **saved_cpu_idle_disable_state;
> +static size_t saved_cpu_idle_disable_state_alloc_ctr;
> +
> +/*
> + * save_cpu_idle_state_disable - save disable for all idle states of a cpu
> + *
> + * Saves the current disable of all idle states of a cpu, to be subsequently
> + * restored via restore_cpu_idle_disable_state.
> + *
> + * Return: idle state count on success, negative on error
> + */
> +int save_cpu_idle_disable_state(unsigned int cpu)
> +{
> + unsigned int nr_states;
> + unsigned int state;
> + int disabled;
> + int nr_cpus;
> +
> + nr_states = cpuidle_state_count(cpu);
> +
> + if (nr_states == 0)
> + return 0;
> +
> + if (saved_cpu_idle_disable_state == NULL) {
> + nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> + saved_cpu_idle_disable_state = calloc(nr_cpus, sizeof(unsigned int *));
> + if (!saved_cpu_idle_disable_state)
> + return -1;
> + }
> +
> + saved_cpu_idle_disable_state[cpu] = calloc(nr_states, sizeof(unsigned int));
> + if (!saved_cpu_idle_disable_state[cpu])
> + return -1;
> + saved_cpu_idle_disable_state_alloc_ctr++;
> +
> + for (state = 0; state < nr_states; state++) {
> + disabled = cpuidle_is_state_disabled(cpu, state);
> + if (disabled < 0)
> + return disabled;
> + saved_cpu_idle_disable_state[cpu][state] = disabled;
> + }
> +
> + return nr_states;
> +}
> +
> +/*
> + * restore_cpu_idle_disable_state - restore disable for all idle states of a cpu
> + *
> + * Restores the current disable state of all idle states of a cpu that was
> + * previously saved by save_cpu_idle_disable_state.
> + *
> + * Return: idle state count on success, negative on error
> + */
> +int restore_cpu_idle_disable_state(unsigned int cpu)
> +{
> + unsigned int nr_states;
> + unsigned int state;
> + int disabled;
> + int result;
> +
> + nr_states = cpuidle_state_count(cpu);
> +
> + if (nr_states == 0)
> + return 0;
> +
> + if (!saved_cpu_idle_disable_state)
> + return -1;
> +
> + for (state = 0; state < nr_states; state++) {
> + if (!saved_cpu_idle_disable_state[cpu])
> + return -1;
> + disabled = saved_cpu_idle_disable_state[cpu][state];
> + result = cpuidle_state_disable(cpu, state, disabled);
> + if (result < 0)
> + return result;
> + }
> +
> + free(saved_cpu_idle_disable_state[cpu]);
> + saved_cpu_idle_disable_state[cpu] = NULL;
> + saved_cpu_idle_disable_state_alloc_ctr--;
> + if (saved_cpu_idle_disable_state_alloc_ctr == 0) {
> + free(saved_cpu_idle_disable_state);
> + saved_cpu_idle_disable_state = NULL;
> + }
> +
> + return nr_states;
> +}
> +
> +/*
> + * free_cpu_idle_disable_states - free saved idle state disable for all cpus
> + *
> + * Frees the memory used for storing cpu idle state disable for all cpus
> + * and states.
> + *
> + * Normally, the memory is freed automatically in
> + * restore_cpu_idle_disable_state; this is mostly for cleaning up after an
> + * error.
> + */
> +void free_cpu_idle_disable_states(void)
> +{
> + int cpu;
> + int nr_cpus;
> +
> + if (!saved_cpu_idle_disable_state)
> + return;
> +
> + nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> +
> + for (cpu = 0; cpu < nr_cpus; cpu++) {
> + free(saved_cpu_idle_disable_state[cpu]);
> + saved_cpu_idle_disable_state[cpu] = NULL;
> + }
> +
> + free(saved_cpu_idle_disable_state);
> + saved_cpu_idle_disable_state = NULL;
> +}
> +
> +/*
> + * set_deepest_cpu_idle_state - limit idle state of cpu
> + *
> + * Disables all idle states deeper than the one given in
> + * deepest_state (assuming states with higher number are deeper).
> + *
> + * This is used to reduce the exit from idle latency. Unlike
> + * set_cpu_dma_latency, it can disable idle states per cpu.
> + *
> + * Return: idle state count on success, negative on error
> + */
> +int set_deepest_cpu_idle_state(unsigned int cpu, unsigned int deepest_state)
> +{
> + unsigned int nr_states;
> + unsigned int state;
> + int result;
> +
> + nr_states = cpuidle_state_count(cpu);
> +
> + for (state = deepest_state + 1; state < nr_states; state++) {
> + result = cpuidle_state_disable(cpu, state, 1);
> + if (result < 0)
> + return result;
> + }
> +
> + return nr_states;
> +}
> +#endif /* HAVE_LIBCPUPOWER_SUPPORT */
> +
> #define _STR(x) #x
> #define STR(x) _STR(x)
>
> diff --git a/tools/tracing/rtla/src/utils.h b/tools/tracing/rtla/src/utils.h
> index 99c9cf81bcd02..101d4799a0090 100644
> --- a/tools/tracing/rtla/src/utils.h
> +++ b/tools/tracing/rtla/src/utils.h
> @@ -66,6 +66,19 @@ int set_comm_sched_attr(const char *comm_prefix, struct sched_attr *attr);
> int set_comm_cgroup(const char *comm_prefix, const char *cgroup);
> int set_pid_cgroup(pid_t pid, const char *cgroup);
> int set_cpu_dma_latency(int32_t latency);
> +#ifdef HAVE_LIBCPUPOWER_SUPPORT
> +int save_cpu_idle_disable_state(unsigned int cpu);
> +int restore_cpu_idle_disable_state(unsigned int cpu);
> +void free_cpu_idle_disable_states(void);
> +int set_deepest_cpu_idle_state(unsigned int cpu, unsigned int state);
> +static inline int have_libcpupower_support(void) { return 1; }
> +#else
> +static inline int save_cpu_idle_disable_state(unsigned int cpu) { return -1; }
> +static inline int restore_cpu_idle_disable_state(unsigned int cpu) { return -1; }
> +static inline void free_cpu_idle_disable_states(void) { }
> +static inline int set_deepest_cpu_idle_state(unsigned int cpu, unsigned int state) { return -1; }
> +static inline int have_libcpupower_support(void) { return 0; }
> +#endif /* HAVE_LIBCPUPOWER_SUPPORT */
> int auto_house_keeping(cpu_set_t *monitored_cpus);
>
> #define ns_to_usf(x) (((double)x/1000))
>
Tomas
After upgrading from 6.6.52 to 6.6.58, tapping on the touchpad stopped
working. The problem is still present in 6.6.59.
I see the following in dmesg output; the first line was not there
previously:
[ 2.129282] hid-multitouch 0018:27C6:01E0.0001: The byte is not expected for fixing the report descriptor. It's possible that the touchpad firmware is not suitable for applying the fix. got: 9
[ 2.137479] input: GXTP5140:00 27C6:01E0 as /devices/platform/AMDI0010:00/i2c-0/i2c-GXTP5140:00/0018:27C6:01E0.0001/input/input10
[ 2.137680] input: GXTP5140:00 27C6:01E0 as /devices/platform/AMDI0010:00/i2c-0/i2c-GXTP5140:00/0018:27C6:01E0.0001/input/input11
[ 2.137921] hid-multitouch 0018:27C6:01E0.0001: input,hidraw0: I2C HID v1.00 Mouse [GXTP5140:00 27C6:01E0] on i2c-GXTP5140:00
Hardware is a Lenovo ThinkPad L15 Gen 4.
The problem goes away when reverting this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/d…
See also Gentoo bug report: https://bugs.gentoo.org/942797
Hi,
Can you add the below to 6.1-stable? Thanks!
commit 3181e22fb79910c7071e84a43af93ac89e8a7106
Author: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon Jan 9 14:46:10 2023 +0000
io_uring: wake up optimisations
Commit 3181e22fb79910c7071e84a43af93ac89e8a7106 upstream.
Flush completions is done either from the submit syscall or by the
task_work, both are in the context of the submitter task, and when it
goes for single-threaded rings, as implied by ->task_complete, there
won't be any waiters on ->cq_wait but the master task. That means that
there can be no tasks sleeping on cq_wait while we run
__io_submit_flush_completions() and so waking up can be skipped.
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/60ad9768ec74435a0ddaa6eec0ffa7729474f69f.16732742…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4f0ae938b146..0b1361663267 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -582,6 +582,16 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
io_cqring_ev_posted(ctx);
}
+static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+ __releases(ctx->completion_lock)
+{
+ io_commit_cqring(ctx);
+ spin_unlock(&ctx->completion_lock);
+ io_commit_cqring_flush(ctx);
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ __io_cqring_wake(ctx);
+}
+
void io_cq_unlock_post(struct io_ring_ctx *ctx)
{
__io_cq_unlock_post(ctx);
@@ -1339,7 +1349,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
if (!(req->flags & REQ_F_CQE_SKIP))
__io_fill_cqe_req(ctx, req);
}
- __io_cq_unlock_post(ctx);
+ __io_cq_unlock_post_flush(ctx);
io_free_batch_list(ctx, state->compl_reqs.first);
INIT_WQ_LIST(&state->compl_reqs);
--
Jens Axboe
From: Ard Biesheuvel <ardb(a)kernel.org>
When the host stage1 is configured for LPA2, the value currently being
programmed into TCR_EL2.T0SZ may be invalid unless LPA2 is configured
at HYP as well. This means kvm_lpa2_is_enabled() is not the right
condition to test when setting TCR_EL2.DS, as it will return false if
LPA2 is only available for stage 1 but not for stage 2.
Similarly, programming TCR_EL2.PS based on a limited IPA range due to
lack of stage2 LPA2 support could potentially result in problems.
So use lpa2_is_enabled() instead, and set the PS field according to the
host's IPS, which is capped at 48 bits if LPA2 support is absent or
disabled. Whether or not we can make meaningful use of such a
configuration is a different question.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
---
arch/arm64/kvm/arm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a102c3aebdbc..7b2735ad32e9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1990,8 +1990,7 @@ static int kvm_init_vector_slots(void)
static void __init cpu_prepare_hyp_mode(int cpu, u32 hyp_va_bits)
{
struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu);
- u64 mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
- unsigned long tcr;
+ unsigned long tcr, ips;
/*
* Calculate the raw per-cpu offset without a translation from the
@@ -2005,6 +2004,7 @@ static void __init cpu_prepare_hyp_mode(int cpu, u32 hyp_va_bits)
params->mair_el2 = read_sysreg(mair_el1);
tcr = read_sysreg(tcr_el1);
+ ips = FIELD_GET(TCR_IPS_MASK, tcr);
if (cpus_have_final_cap(ARM64_KVM_HVHE)) {
tcr |= TCR_EPD1_MASK;
} else {
@@ -2014,8 +2014,8 @@ static void __init cpu_prepare_hyp_mode(int cpu, u32 hyp_va_bits)
tcr &= ~TCR_T0SZ_MASK;
tcr |= TCR_T0SZ(hyp_va_bits);
tcr &= ~TCR_EL2_PS_MASK;
- tcr |= FIELD_PREP(TCR_EL2_PS_MASK, kvm_get_parange(mmfr0));
- if (kvm_lpa2_is_enabled())
+ tcr |= FIELD_PREP(TCR_EL2_PS_MASK, ips);
+ if (lpa2_is_enabled())
tcr |= TCR_EL2_DS;
params->tcr_el2 = tcr;
--
2.47.1.613.gc27f4b7a9f-goog
From: Ard Biesheuvel <ardb(a)kernel.org>
Currently, LPA2 kernel support implies support for up to 52 bits of
physical addressing, and this is reflected in global definitions such as
PHYS_MASK_SHIFT and MAX_PHYSMEM_BITS.
This is potentially problematic, given that LPA2 hardware support is
modeled as a CPU feature which can be overridden, and with LPA2 hardware
support turned off, attempting to map physical regions with address bits
[51:48] set (which may exist on LPA2 capable systems booting with
arm64.nolva) will result in corrupted mappings with a truncated output
address and bogus shareability attributes.
This means that the accepted physical address range in the mapping
routines should be at most 48 bits wide when LPA2 support is configured
but not enabled at runtime.
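As a rough illustration of the effect, the following user-space sketch models a runtime-dependent mask shift. The toy_* helpers are hypothetical and only mirror the idea of the patch (52 bits when LPA2 is enabled, otherwise 48); they are not the kernel macros themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask covering physical address bits [shift-1:0]. */
static uint64_t toy_phys_mask(unsigned int shift)
{
	assert(shift > 0 && shift < 64);
	return (UINT64_C(1) << shift) - 1;
}

/* Full configured width only when LPA2 is enabled at runtime, else 48 bits. */
static unsigned int toy_phys_mask_shift(bool lpa2_enabled,
					unsigned int config_pa_bits)
{
	return lpa2_enabled ? config_pa_bits : 48;
}

/* True if the address can be represented without losing high bits. */
static bool toy_pa_fits(uint64_t pa, bool lpa2_enabled,
			unsigned int config_pa_bits)
{
	uint64_t mask = toy_phys_mask(toy_phys_mask_shift(lpa2_enabled,
							  config_pa_bits));

	return (pa & ~mask) == 0;
}
```

An address with bit 50 set fits when LPA2 is enabled on a 52-bit configuration, but is outside the accepted range once LPA2 is disabled at runtime, so a mapping routine can reject it instead of silently truncating it.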
Fixes: 352b0395b505 ("arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
---
arch/arm64/include/asm/pgtable-hwdef.h | 6 ------
arch/arm64/include/asm/pgtable-prot.h | 7 +++++++
arch/arm64/include/asm/sparsemem.h | 5 ++++-
3 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index c78a988cca93..a9136cc551cc 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -222,12 +222,6 @@
*/
#define S1_TABLE_AP (_AT(pmdval_t, 3) << 61)
-/*
- * Highest possible physical address supported.
- */
-#define PHYS_MASK_SHIFT (CONFIG_ARM64_PA_BITS)
-#define PHYS_MASK ((UL(1) << PHYS_MASK_SHIFT) - 1)
-
#define TTBR_CNP_BIT (UL(1) << 0)
/*
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 9f9cf13bbd95..a95f1f77bb39 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -81,6 +81,7 @@ extern unsigned long prot_ns_shared;
#define lpa2_is_enabled() false
#define PTE_MAYBE_SHARED PTE_SHARED
#define PMD_MAYBE_SHARED PMD_SECT_S
+#define PHYS_MASK_SHIFT (CONFIG_ARM64_PA_BITS)
#else
static inline bool __pure lpa2_is_enabled(void)
{
@@ -89,8 +90,14 @@ static inline bool __pure lpa2_is_enabled(void)
#define PTE_MAYBE_SHARED (lpa2_is_enabled() ? 0 : PTE_SHARED)
#define PMD_MAYBE_SHARED (lpa2_is_enabled() ? 0 : PMD_SECT_S)
+#define PHYS_MASK_SHIFT (lpa2_is_enabled() ? CONFIG_ARM64_PA_BITS : 48)
#endif
+/*
+ * Highest possible physical address supported.
+ */
+#define PHYS_MASK ((UL(1) << PHYS_MASK_SHIFT) - 1)
+
/*
* If we have userspace only BTI we don't want to mark kernel pages
* guarded even if the system does support BTI.
diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 8a8acc220371..84783efdc9d1 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -5,7 +5,10 @@
#ifndef __ASM_SPARSEMEM_H
#define __ASM_SPARSEMEM_H
-#define MAX_PHYSMEM_BITS CONFIG_ARM64_PA_BITS
+#include <asm/pgtable-prot.h>
+
+#define MAX_PHYSMEM_BITS PHYS_MASK_SHIFT
+#define MAX_POSSIBLE_PHYSMEM_BITS (52)
/*
* Section size must be at least 512MB for 64K base
--
2.47.1.613.gc27f4b7a9f-goog
The PE Reset State "0" returned by the RTAS calls
ibm_read_slot_reset_[state|state2] indicates that
the reset is deactivated and the PE is not in the MMIO
Stopped or DMA Stopped state.
With PE Reset State "0", MMIO and DMA are allowed for
the PE. However, pseries_eeh_get_state() currently does
not report this to the caller, so drivers are unable to
resume MMIO and DMA activity.
Fix this by making the returned state reflect what is actually allowed.
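
For illustration only, the sketch below shows how a caller that tests the
"enabled" bits would see the state before and after this change. The flag
names match those in the diff, but the numeric values here are placeholders
for the example, and eeh_sketch_io_may_resume() is a hypothetical caller-side
check, not an existing kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values for this sketch; the kernel headers define the real ones. */
#define EEH_STATE_MMIO_ACTIVE	(1 << 3)
#define EEH_STATE_DMA_ACTIVE	(1 << 4)
#define EEH_STATE_MMIO_ENABLED	(1 << 5)
#define EEH_STATE_DMA_ENABLED	(1 << 6)

/* State reported for PE Reset State 0 before the patch. */
static int eeh_sketch_state0_before(void)
{
	return EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE;
}

/* State reported for PE Reset State 0 with the patch applied. */
static int eeh_sketch_state0_after(void)
{
	return EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE |
	       EEH_STATE_MMIO_ENABLED | EEH_STATE_DMA_ENABLED;
}

/* Hypothetical check: resume I/O only if both MMIO and DMA are enabled. */
static bool eeh_sketch_io_may_resume(int state)
{
	int need = EEH_STATE_MMIO_ENABLED | EEH_STATE_DMA_ENABLED;

	return (state & need) == need;
}

void eeh_sketch_demo(void)
{
	/* Before: the enabled bits are missing, so the check fails. */
	assert(!eeh_sketch_io_may_resume(eeh_sketch_state0_before()));

	/* After: all four bits are reported, so the check passes. */
	assert(eeh_sketch_io_may_resume(eeh_sketch_state0_after()));
}
```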
Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()")
Signed-off-by: Narayana Murty N <nnmlinux(a)linux.ibm.com>
---
Changelog:
V1:https://lore.kernel.org/all/20241107042027.338065-1-nnmlinux@linux.ibm.c…
--added Fixes tag for "powerpc/pseries: Cleanup on
pseries_eeh_get_state()".
---
arch/powerpc/platforms/pseries/eeh_pseries.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 1893f66371fa..b12ef382fec7 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -580,8 +580,10 @@ static int pseries_eeh_get_state(struct eeh_pe *pe, int *delay)
switch(rets[0]) {
case 0:
- result = EEH_STATE_MMIO_ACTIVE |
- EEH_STATE_DMA_ACTIVE;
+ result = EEH_STATE_MMIO_ACTIVE |
+ EEH_STATE_DMA_ACTIVE |
+ EEH_STATE_MMIO_ENABLED |
+ EEH_STATE_DMA_ENABLED;
break;
case 1:
result = EEH_STATE_RESET_ACTIVE |
--
2.47.1