The differences in the vendor-approved CPU and GPU OPPs for the standard
Rockchip RK3588 variant [1] and the industrial Rockchip RK3588J variant [2]
come from the latter, presumably, supporting an extended temperature range
that's usually associated with industrial applications, despite the two SoC
variant datasheets specifying the same upper limit for the allowed ambient
temperature for both variants. However, the lower temperature limit is
specified much lower for the RK3588J variant. [1][2]
To be on the safe side and to ensure maximum longevity of the RK3588J SoCs,
only the CPU and GPU OPPs that are declared by the vendor to be always safe
for this SoC variant may be provided. As explained by the vendor [3] and
according to the RK3588J datasheet, [2] higher-frequency/higher-voltage
CPU and GPU OPPs can be used as well, but at the risk of reducing the SoC
lifetime expectancy. Presumably, using the higher OPPs may be safe only
when not enjoying the assumed extended temperature range that the RK3588J,
as an SoC variant targeted specifically at higher-temperature, industrial
applications, is made (or binned) for.
Anyone able to keep their RK3588J-based board outside the above-presumed
extended temperature range at all times, and willing to take the associated
risk of possibly reducing the SoC lifetime expectancy, is free to apply
a DT overlay that adds the higher CPU and GPU OPPs.
With all this and the downstream RK3588(J) DT definitions [4][5] in mind,
let's delete the RK3588J CPU and GPU OPPs that are not considered belonging
to the normal operation mode for this SoC variant. To quote the RK3588J
datasheet [2], "normal mode means the chipset works under safety voltage
and frequency; for the industrial environment, highly recommend to keep in
normal mode, the lifetime is reasonably guaranteed", while "overdrive mode
brings higher frequency, and the voltage will increase accordingly; under
the overdrive mode for a long time, the chipset may shorten the lifetime,
especially in high-temperature condition".
To sum the RK3588J datasheet [2] and the vendor-provided DTs up, [4][5]
the maximum allowed CPU core, GPU and NPU frequencies are as follows:
IP core | Normal mode | Overdrive mode
------------+-------------+----------------
Cortex-A55 | 1,296 MHz | 1,704 MHz
Cortex-A76 | 1,608 MHz | 2,016 MHz
GPU | 700 MHz | 850 MHz
NPU | 800 MHz | 950 MHz
Unfortunately, when it comes to the actual voltages for the RK3588J CPU and
GPU OPPs, there's a discrepancy between the RK3588J datasheet [2] and the
downstream kernel code. [4][5] The RK3588J datasheet states that "the max.
working voltage of CPU/GPU/NPU is 0.75 V under the normal mode", while the
downstream kernel code actually allows voltage ranges that go up to 0.95 V,
which is still within the voltage range allowed by the datasheet. However,
the RK3588J datasheet also tells us to "strictly refer to the software
configuration of SDK and the hardware reference design", so let's embrace
the voltage ranges provided by the downstream kernel code, which also
prevents the undesirable theoretical outcome of ending up with no usable
OPPs on a particular board, as a result of the board's voltage regulator(s)
being unable to deliver the exact voltages, for whatever reason.
The above-described voltage ranges for the RK3588J CPU OPPs remain taken
from the downstream kernel code [4][5] by picking the highest, worst-bin
values, which ensure that all RK3588J bins will work reliably. Yes, with
some power inevitably wasted as unnecessarily generated heat, but the
reliability is paramount, together with the longevity. This deficiency
may be revisited separately at some point in the future.
The provided RK3588J CPU OPPs follow the slightly debatable "provide only
the highest-frequency OPP from the same-voltage group" approach that's been
established earlier, [6] as a result of the "same-voltage, lower-frequency"
OPPs being considered inefficient from the IPA governor's standpoint, which
may also be revisited separately at some point in the future.
[1] https://wiki.friendlyelec.com/wiki/images/e/ee/Rockchip_RK3588_Datasheet_V1…
[2] https://wmsc.lcsc.com/wmsc/upload/file/pdf/v2/lcsc/2403201054_Rockchip-RK35…
[3] https://lore.kernel.org/linux-rockchip/e55125ed-64fb-455e-b1e4-cebe2cf006e4…
[4] https://raw.githubusercontent.com/rockchip-linux/kernel/604cec4004abe5a96c7…
[5] https://raw.githubusercontent.com/rockchip-linux/kernel/604cec4004abe5a96c7…
[6] https://lore.kernel.org/all/20240229-rk-dts-additions-v3-5-6afe8473a631@gma…
Fixes: 667885a68658 ("arm64: dts: rockchip: Add OPP data for CPU cores on RK3588j")
Fixes: a7b2070505a2 ("arm64: dts: rockchip: Split GPU OPPs of RK3588 and RK3588j")
Cc: stable(a)vger.kernel.org
Cc: Heiko Stuebner <heiko(a)sntech.de>
Cc: Alexey Charkov <alchark(a)gmail.com>
Helped-by: Quentin Schulz <quentin.schulz(a)cherry.de>
Reviewed-by: Quentin Schulz <quentin.schulz(a)cherry.de>
Signed-off-by: Dragan Simic <dsimic(a)manjaro.org>
---
Notes:
Changes in v2:
- Reworded and expanded the patch description a bit, to include some
more information and to make it more clear what are the implied
speculations and assumptions, and what are the available official
statements from Rockchip, as suggested by Quentin [7]
- Collected Reviewed-by tag from Quentin [7]
Link to v1: https://lore.kernel.org/linux-rockchip/f929da061de35925ea591c969f985430e23c…
[7] https://lore.kernel.org/linux-rockchip/71b7c81b-6a4e-442b-a661-04d63639962a…
arch/arm64/boot/dts/rockchip/rk3588j.dtsi | 53 ++++++++---------------
1 file changed, 17 insertions(+), 36 deletions(-)
diff --git a/arch/arm64/boot/dts/rockchip/rk3588j.dtsi b/arch/arm64/boot/dts/rockchip/rk3588j.dtsi
index bce72bac4503..3045cb3bd68c 100644
--- a/arch/arm64/boot/dts/rockchip/rk3588j.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3588j.dtsi
@@ -11,74 +11,59 @@ cluster0_opp_table: opp-table-cluster0 {
compatible = "operating-points-v2";
opp-shared;
- opp-1416000000 {
- opp-hz = /bits/ 64 <1416000000>;
+ opp-1200000000 {
+ opp-hz = /bits/ 64 <1200000000>;
opp-microvolt = <750000 750000 950000>;
clock-latency-ns = <40000>;
opp-suspend;
};
- opp-1608000000 {
- opp-hz = /bits/ 64 <1608000000>;
- opp-microvolt = <887500 887500 950000>;
- clock-latency-ns = <40000>;
- };
- opp-1704000000 {
- opp-hz = /bits/ 64 <1704000000>;
- opp-microvolt = <937500 937500 950000>;
+ opp-1296000000 {
+ opp-hz = /bits/ 64 <1296000000>;
+ opp-microvolt = <775000 775000 950000>;
clock-latency-ns = <40000>;
};
};
cluster1_opp_table: opp-table-cluster1 {
compatible = "operating-points-v2";
opp-shared;
+ opp-1200000000{
+ opp-hz = /bits/ 64 <1200000000>;
+ opp-microvolt = <750000 750000 950000>;
+ clock-latency-ns = <40000>;
+ };
opp-1416000000 {
opp-hz = /bits/ 64 <1416000000>;
- opp-microvolt = <750000 750000 950000>;
+ opp-microvolt = <762500 762500 950000>;
clock-latency-ns = <40000>;
};
opp-1608000000 {
opp-hz = /bits/ 64 <1608000000>;
opp-microvolt = <787500 787500 950000>;
clock-latency-ns = <40000>;
};
- opp-1800000000 {
- opp-hz = /bits/ 64 <1800000000>;
- opp-microvolt = <875000 875000 950000>;
- clock-latency-ns = <40000>;
- };
- opp-2016000000 {
- opp-hz = /bits/ 64 <2016000000>;
- opp-microvolt = <950000 950000 950000>;
- clock-latency-ns = <40000>;
- };
};
cluster2_opp_table: opp-table-cluster2 {
compatible = "operating-points-v2";
opp-shared;
+ opp-1200000000{
+ opp-hz = /bits/ 64 <1200000000>;
+ opp-microvolt = <750000 750000 950000>;
+ clock-latency-ns = <40000>;
+ };
opp-1416000000 {
opp-hz = /bits/ 64 <1416000000>;
- opp-microvolt = <750000 750000 950000>;
+ opp-microvolt = <762500 762500 950000>;
clock-latency-ns = <40000>;
};
opp-1608000000 {
opp-hz = /bits/ 64 <1608000000>;
opp-microvolt = <787500 787500 950000>;
clock-latency-ns = <40000>;
};
- opp-1800000000 {
- opp-hz = /bits/ 64 <1800000000>;
- opp-microvolt = <875000 875000 950000>;
- clock-latency-ns = <40000>;
- };
- opp-2016000000 {
- opp-hz = /bits/ 64 <2016000000>;
- opp-microvolt = <950000 950000 950000>;
- clock-latency-ns = <40000>;
- };
};
gpu_opp_table: opp-table {
@@ -104,10 +89,6 @@ opp-700000000 {
opp-hz = /bits/ 64 <700000000>;
opp-microvolt = <750000 750000 850000>;
};
- opp-850000000 {
- opp-hz = /bits/ 64 <800000000>;
- opp-microvolt = <787500 787500 850000>;
- };
};
};
__VA_OPT__ is a macro that is useful when some arguments can be present
or not to entirely skip some part of a definition. Unfortunately, it
is a too recent addition that some of the still supported old GCC
versions do not know about, and is anyway not part of C11 that is the
version used in the kernel.
Find a trick to remove this macro, typically '__VA_ARGS__ + 0' is a
workaround used in netlink.h which works very well here, as we either
expect:
- 0
- A positive value
- No value, which means the field should be 0.
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503181330.YcDXGy7F-lkp@intel.com/
Fixes: 7ce0d16d5802 ("mtd: spinand: Add an optional frequency to read from cache macros")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
include/linux/mtd/spinand.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mtd/spinand.h b/include/linux/mtd/spinand.h
index 83301ef11aa9..0bb06aeffa62 100644
--- a/include/linux/mtd/spinand.h
+++ b/include/linux/mtd/spinand.h
@@ -67,7 +67,7 @@
SPI_MEM_OP_ADDR(2, addr, 1), \
SPI_MEM_OP_DUMMY(ndummy, 1), \
SPI_MEM_OP_DATA_IN(len, buf, 1), \
- __VA_OPT__(SPI_MEM_OP_MAX_FREQ(__VA_ARGS__)))
+ SPI_MEM_OP_MAX_FREQ(__VA_ARGS__ + 0))
#define SPINAND_PAGE_READ_FROM_CACHE_FAST_1S_1S_1S_OP(addr, ndummy, buf, len) \
SPI_MEM_OP(SPI_MEM_OP_CMD(0x0b, 1), \
--
2.48.1
In r852_ready(), the dev get from r852_get_dev() need to be checked.
An unstable device should not be ready. A proper implementation can
be found in r852_read_byte(). Add a status check and return 0 when it is
unstable.
Fixes: 50a487e7719c ("mtd: rawnand: Pass a nand_chip object to chip->dev_ready()")
Cc: stable(a)vger.kernel.org # v4.20+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/mtd/nand/raw/r852.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/mtd/nand/raw/r852.c b/drivers/mtd/nand/raw/r852.c
index b07c2f8b4035..918974d088cf 100644
--- a/drivers/mtd/nand/raw/r852.c
+++ b/drivers/mtd/nand/raw/r852.c
@@ -387,6 +387,9 @@ static int r852_wait(struct nand_chip *chip)
static int r852_ready(struct nand_chip *chip)
{
struct r852_device *dev = r852_get_dev(nand_to_mtd(chip));
+ if (dev->card_unstable)
+ return 0;
+
return !(r852_read_reg(dev, R852_CARD_STA) & R852_CARD_STA_BUSY);
}
--
2.42.0.windows.2
In INFTL_findwriteunit(), the return value of inftl_read_oob()
need to be checked. A proper implementation can be
found in INFTL_deleteblock(). The status will be set as
SECTOR_IGNORE to break from the while-loop correctly
if the inftl_read_oob() fails.
Fixes: 8593fbc68b0d ("[MTD] Rework the out of band handling completely")
Cc: stable(a)vger.kernel.org # v2.6+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/mtd/inftlcore.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/mtd/inftlcore.c b/drivers/mtd/inftlcore.c
index 9739387cff8c..58c6e1743f5c 100644
--- a/drivers/mtd/inftlcore.c
+++ b/drivers/mtd/inftlcore.c
@@ -482,10 +482,11 @@ static inline u16 INFTL_findwriteunit(struct INFTLrecord *inftl, unsigned block)
silly = MAX_LOOPS;
while (thisEUN <= inftl->lastEUN) {
- inftl_read_oob(mtd, (thisEUN * inftl->EraseSize) +
- blockofs, 8, &retlen, (char *)&bci);
-
- status = bci.Status | bci.Status1;
+ if (inftl_read_oob(mtd, (thisEUN * inftl->EraseSize) +
+ blockofs, 8, &retlen, (char *)&bci) < 0)
+ status = SECTOR_IGNORE;
+ else
+ status = bci.Status | bci.Status1;
pr_debug("INFTL: status of block %d in EUN %d is %x\n",
block , writeEUN, status);
--
2.42.0.windows.2
The sdio_read32() calls sd_read(), but does not handle the error if
sd_read() fails. This could lead to subsequent operations processing
invalid data. A proper implementation can be found in sdio_readN().
Add error handling for the sd_read() to free tmpbuf and return error
code if sd_read() fails. This ensure that the memcpy() is only performed
when the read operation is successful.
Fixes: 554c0a3abf21 ("staging: Add rtl8723bs sdio wifi driver")
Cc: stable(a)vger.kernel.org # v4.12+
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
v5: Fix error code
v4: Add change log and fix error code
v3: Add Cc flag
v2: Change code to initialize val
drivers/staging/rtl8723bs/hal/sdio_ops.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/rtl8723bs/hal/sdio_ops.c b/drivers/staging/rtl8723bs/hal/sdio_ops.c
index 21e9f1858745..d79d41727042 100644
--- a/drivers/staging/rtl8723bs/hal/sdio_ops.c
+++ b/drivers/staging/rtl8723bs/hal/sdio_ops.c
@@ -185,7 +185,12 @@ static u32 sdio_read32(struct intf_hdl *intfhdl, u32 addr)
return SDIO_ERR_VAL32;
ftaddr &= ~(u16)0x3;
- sd_read(intfhdl, ftaddr, 8, tmpbuf);
+ err = sd_read(intfhdl, ftaddr, 8, tmpbuf);
+ if (err) {
+ kfree(tmpbuf);
+ return SDIO_ERR_VAL32;
+ }
+
memcpy(&le_tmp, tmpbuf + shift, 4);
val = le32_to_cpu(le_tmp);
--
2.42.0.windows.2
The JIT compile of ldimm instructions can be anywhere between 1-5
instructions long depending on the value being loaded.
arch_bpf_trampoline_size() provides JIT size of the BPF trampoline
before the buffer for JIT'ing it is allocated. BPF trampoline JIT
code has ldimm instructions that need to load the value of pointer
to struct bpf_tramp_image. But this pointer value is not same while
calling arch_bpf_trampoline_size() & arch_prepare_bpf_trampoline().
So, the size arrived at using arch_bpf_trampoline_size() can vary
from the size needed in arch_prepare_bpf_trampoline(). When the
number of ldimm instructions emitted in arch_bpf_trampoline_size()
is less than the number of ldimm instructions emitted during the
actual JIT compile of trampoline, the below warning is produced:
WARNING: CPU: 8 PID: 204190 at arch/powerpc/net/bpf_jit_comp.c:981 __arch_prepare_bpf_trampoline.isra.0+0xd2c/0xdcc
which is:
/* Make sure the trampoline generation logic doesn't overflow */
if (image && WARN_ON_ONCE(&image[ctx->idx] >
(u32 *)rw_image_end - BPF_INSN_SAFETY)) {
Pass NULL as the first argument to __arch_prepare_bpf_trampoline()
call from arch_bpf_trampoline_size() function, to differentiate it
from how arch_prepare_bpf_trampoline() calls it and ensure maximum
possible instructions are emitted in arch_bpf_trampoline_size() for
ldimm instructions that load a different value during the actual JIT
compile of BPF trampoline.
Fixes: d243b62b7bd3 ("powerpc64/bpf: Add support for bpf trampolines")
Reported-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Closes: https://lore.kernel.org/all/6168bfc8-659f-4b5a-a6fb-90a916dde3b3@linux.ibm.…
Cc: stable(a)vger.kernel.org # v6.13+
Signed-off-by: Hari Bathini <hbathini(a)linux.ibm.com>
---
* Removed a redundant '/' accidently added in a comment and resending.
arch/powerpc/net/bpf_jit_comp.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 2991bb171a9b..c94717ccb2bd 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -833,7 +833,12 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
EMIT(PPC_RAW_STL(_R26, _R1, nvr_off + SZL));
if (flags & BPF_TRAMP_F_CALL_ORIG) {
- PPC_LI_ADDR(_R3, (unsigned long)im);
+ /*
+ * Emit maximum possible instructions while getting the size of
+ * bpf trampoline to ensure trampoline JIT code doesn't overflow.
+ */
+ PPC_LI_ADDR(_R3, im ? (unsigned long)im :
+ (unsigned long)(~(1UL << (BITS_PER_LONG - 1))));
ret = bpf_jit_emit_func_call_rel(image, ro_image, ctx,
(unsigned long)__bpf_tramp_enter);
if (ret)
@@ -889,7 +894,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
bpf_trampoline_restore_tail_call_cnt(image, ctx, func_frame_offset, r4_off);
/* Reserve space to patch branch instruction to skip fexit progs */
- im->ip_after_call = &((u32 *)ro_image)[ctx->idx];
+ if (im)
+ im->ip_after_call = &((u32 *)ro_image)[ctx->idx];
EMIT(PPC_RAW_NOP());
}
@@ -912,8 +918,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
}
if (flags & BPF_TRAMP_F_CALL_ORIG) {
- im->ip_epilogue = &((u32 *)ro_image)[ctx->idx];
- PPC_LI_ADDR(_R3, im);
+ if (im)
+ im->ip_epilogue = &((u32 *)ro_image)[ctx->idx];
+ /*
+ * Emit maximum possible instructions while getting the size of
+ * bpf trampoline to ensure trampoline JIT code doesn't overflow.
+ */
+ PPC_LI_ADDR(_R3, im ? (unsigned long)im :
+ (unsigned long)(~(1UL << (BITS_PER_LONG - 1))));
ret = bpf_jit_emit_func_call_rel(image, ro_image, ctx,
(unsigned long)__bpf_tramp_exit);
if (ret)
@@ -972,7 +984,6 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
struct bpf_tramp_links *tlinks, void *func_addr)
{
- struct bpf_tramp_image im;
void *image;
int ret;
@@ -988,7 +999,13 @@ int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags,
if (!image)
return -ENOMEM;
- ret = __arch_prepare_bpf_trampoline(&im, image, image + PAGE_SIZE, image,
+ /*
+ * Pass NULL as bpf_tramp_image pointer to differentiate the intent to get the
+ * buffer size for trampoline here. This differentiation helps in accounting for
+ * maximum possible instructions if the JIT code size is likely to vary during
+ * the actual JIT compile of the trampoline.
+ */
+ ret = __arch_prepare_bpf_trampoline(NULL, image, image + PAGE_SIZE, image,
m, flags, tlinks, func_addr);
bpf_jit_free_exec(image);
--
2.48.1
From: Mario Limonciello <mario.limonciello(a)amd.com>
On some platforms it has been observed that STT limits are not being applied
properly causing poor performance as power limits are set too low.
STT limits that are sent to the platform are supposed to be in Q8.8
format. Convert them before sending.
Reported-by: Yijun Shen <Yijun.Shen(a)dell.com>
Fixes: 7c45534afa443 ("platform/x86/amd/pmf: Add support for PMF Policy Binary")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
drivers/platform/x86/amd/pmf/tee-if.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/platform/x86/amd/pmf/tee-if.c b/drivers/platform/x86/amd/pmf/tee-if.c
index 5d513161d7302..9a51258df0564 100644
--- a/drivers/platform/x86/amd/pmf/tee-if.c
+++ b/drivers/platform/x86/amd/pmf/tee-if.c
@@ -123,7 +123,7 @@ static void amd_pmf_apply_policies(struct amd_pmf_dev *dev, struct ta_pmf_enact_
case PMF_POLICY_STT_SKINTEMP_APU:
if (dev->prev_data->stt_skintemp_apu != val) {
- amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, val, NULL);
+ amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, val << 8, NULL);
dev_dbg(dev->dev, "update STT_SKINTEMP_APU: %u\n", val);
dev->prev_data->stt_skintemp_apu = val;
}
@@ -131,7 +131,7 @@ static void amd_pmf_apply_policies(struct amd_pmf_dev *dev, struct ta_pmf_enact_
case PMF_POLICY_STT_SKINTEMP_HS2:
if (dev->prev_data->stt_skintemp_hs2 != val) {
- amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, val, NULL);
+ amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, val << 8, NULL);
dev_dbg(dev->dev, "update STT_SKINTEMP_HS2: %u\n", val);
dev->prev_data->stt_skintemp_hs2 = val;
}
--
2.43.0
Hi Greg,
My apologies! I'll test-build it before submitting again. Thanks for
pointing out my mistake.
build is failing and there is a need for a dependent patch.
Hardik
On Mon, Apr 7, 2025 at 10:43 AM Hardik Gohil <hgohil(a)mvista.com> wrote:
>
> Hi Greg,
>
> My apologies! I'll test-build it before submitting again. Thanks for pointing out my mistake.
>
> build is failing and there is a need for a dependent patch.
>
> Hardik
>
>
> On Thu, Apr 3, 2025 at 7:30 PM Greg KH <gregkh(a)linuxfoundation.org> wrote:
>>
>> On Thu, Apr 03, 2025 at 03:36:05PM +0530, Hardik Gohil wrote:
>> > Hello Greg,
>> >
>> > This patch applies cleanly to the v5.4 kernel.
>> >
>> > dmaengine: ti: edma: Add some null pointer checks to the edma_probe
>> >
>> > [upstream commit 6e2276203ac9ff10fc76917ec9813c660f627369]
>> >
>>
>> You obviously did not test-build this change :(
>>
>> Please always do so when asking for patches to be backported, otherwise
>> we get grumpy as it breaks our workflow...
>>
>> thanks,
>>
>> greg k-h
The patch titled
Subject: mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
Date: Fri, 21 Mar 2025 10:09:37 +0000
Currently, if a VMA merge fails due to an OOM condition arising on commit
merge or a failure to duplicate anon_vma's, we report this so the caller
can handle it.
However there are cases where the caller is only ostensibly trying a
merge, and doesn't mind if it fails due to this condition.
Since we do not want to introduce an implicit assumption that we only
actually modify VMAs after OOM conditions might arise, add a 'give up on
oom' option and make an explicit contract that, should this flag be set, we
absolutely will not modify any VMAs should OOM arise and just bail out.
Since it'd be very unusual for a user to try to vma_modify() with this flag
set but be specifying a range within a VMA which ends up being split (which
can fail due to rlimit issues, not only OOM), we add a debug warning for
this condition.
The motivating reason for this is uffd release - syzkaller (and Pedro
Falcato's VERY astute analysis) found a way in which an injected fault on
allocation, triggering an OOM condition on commit merge, would result in
uffd code becoming confused and treating an error value as if it were a VMA
pointer.
To avoid this, we make use of this new VMG flag to ensure that this never
occurs, utilising the fact that, should we be clearing entire VMAs, we do
not wish an OOM event to be reported to us.
Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for
his insightful and intelligent analysis of the situation, both of whom were
instrumental in this fix.
Link: https://lkml.kernel.org/r/20250321100937.46634-1-lorenzo.stoakes@oracle.com
Reported-by: syzbot+20ed41006cf9d842c2b5(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001e.GAE@google.com/
Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: Pedro Falcato <pfalcato(a)suse.de>
Suggested-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 13 +++++++++--
mm/vma.c | 51 +++++++++++++++++++++++++++++++++++++++++----
mm/vma.h | 9 +++++++
3 files changed, 66 insertions(+), 7 deletions(-)
--- a/mm/userfaultfd.c~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/userfaultfd.c
@@ -1902,6 +1902,14 @@ struct vm_area_struct *userfaultfd_clear
unsigned long end)
{
struct vm_area_struct *ret;
+ bool give_up_on_oom = false;
+
+ /*
+ * If we are modifying only and not splitting, just give up on the merge
+ * if OOM prevents us from merging successfully.
+ */
+ if (start == vma->vm_start && end == vma->vm_end)
+ give_up_on_oom = true;
/* Reset ptes for the whole vma range if wr-protected */
if (userfaultfd_wp(vma))
@@ -1909,7 +1917,7 @@ struct vm_area_struct *userfaultfd_clear
ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
vma->vm_flags & ~__VM_UFFD_FLAGS,
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, give_up_on_oom);
/*
* In the vma_merge() successful mprotect-like case 8:
@@ -1960,7 +1968,8 @@ int userfaultfd_register_range(struct us
new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
new_flags,
- (struct vm_userfaultfd_ctx){ctx});
+ (struct vm_userfaultfd_ctx){ctx},
+ /* give_up_on_oom = */false);
if (IS_ERR(vma))
return PTR_ERR(vma);
--- a/mm/vma.c~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/vma.c
@@ -666,6 +666,9 @@ static void vmg_adjust_set_range(struct
/*
* Actually perform the VMA merge operation.
*
+ * IMPORTANT: We guarantee that, should vmg->give_up_on_oom is set, to not
+ * modify any VMAs or cause inconsistent state should an OOM condition arise.
+ *
* Returns 0 on success, or an error value on failure.
*/
static int commit_merge(struct vma_merge_struct *vmg)
@@ -685,6 +688,12 @@ static int commit_merge(struct vma_merge
init_multi_vma_prep(&vp, vma, vmg);
+ /*
+ * If vmg->give_up_on_oom is set, we're safe, because we don't actually
+ * manipulate any VMAs until we succeed at preallocation.
+ *
+ * Past this point, we will not return an error.
+ */
if (vma_iter_prealloc(vmg->vmi, vma))
return -ENOMEM;
@@ -915,7 +924,13 @@ static __must_check struct vm_area_struc
if (anon_dup)
unlink_anon_vmas(anon_dup);
- vmg->state = VMA_MERGE_ERROR_NOMEM;
+ /*
+ * We've cleaned up any cloned anon_vma's, no VMAs have been
+ * modified, no harm no foul if the user requests that we not
+ * report this and just give up, leaving the VMAs unmerged.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return NULL;
}
@@ -926,7 +941,15 @@ static __must_check struct vm_area_struc
abort:
vma_iter_set(vmg->vmi, start);
vma_iter_load(vmg->vmi);
- vmg->state = VMA_MERGE_ERROR_NOMEM;
+
+ /*
+ * This means we have failed to clone anon_vma's correctly, but no
+ * actual changes to VMAs have occurred, so no harm no foul - if the
+ * user doesn't want this reported and instead just wants to give up on
+ * the merge, allow it.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return NULL;
}
@@ -1068,6 +1091,10 @@ int vma_expand(struct vma_merge_struct *
/* This should already have been checked by this point. */
VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
vma_start_write(next);
+ /*
+ * In this case we don't report OOM, so vmg->give_up_on_mm is
+ * safe.
+ */
ret = dup_anon_vma(middle, next, &anon_dup);
if (ret)
return ret;
@@ -1090,9 +1117,15 @@ int vma_expand(struct vma_merge_struct *
return 0;
nomem:
- vmg->state = VMA_MERGE_ERROR_NOMEM;
if (anon_dup)
unlink_anon_vmas(anon_dup);
+ /*
+ * If the user requests that we just give upon OOM, we are safe to do so
+ * here, as commit merge provides this contract to us. Nothing has been
+ * changed - no harm no foul, just don't report it.
+ */
+ if (!vmg->give_up_on_oom)
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return -ENOMEM;
}
@@ -1534,6 +1567,13 @@ static struct vm_area_struct *vma_modify
if (vmg_nomem(vmg))
return ERR_PTR(-ENOMEM);
+ /*
+ * Split can fail for reasons other than OOM, so if the user requests
+ * this it's probably a mistake.
+ */
+ VM_WARN_ON(vmg->give_up_on_oom &&
+ (vma->vm_start != start || vma->vm_end != end));
+
/* Split any preceding portion of the VMA. */
if (vma->vm_start < start) {
int err = split_vma(vmg->vmi, vma, start, 1);
@@ -1602,12 +1642,15 @@ struct vm_area_struct
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long new_flags,
- struct vm_userfaultfd_ctx new_ctx)
+ struct vm_userfaultfd_ctx new_ctx,
+ bool give_up_on_oom)
{
VMG_VMA_STATE(vmg, vmi, prev, vma, start, end);
vmg.flags = new_flags;
vmg.uffd_ctx = new_ctx;
+ if (give_up_on_oom)
+ vmg.give_up_on_oom = true;
return vma_modify(&vmg);
}
--- a/mm/vma.h~mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release
+++ a/mm/vma.h
@@ -114,6 +114,12 @@ struct vma_merge_struct {
*/
bool just_expand :1;
+ /*
+ * If a merge is possible, but an OOM error occurs, give up and don't
+ * execute the merge, returning NULL.
+ */
+ bool give_up_on_oom :1;
+
/* Internal flags set during merge process: */
/*
@@ -255,7 +261,8 @@ __must_check struct vm_area_struct
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long new_flags,
- struct vm_userfaultfd_ctx new_ctx);
+ struct vm_userfaultfd_ctx new_ctx,
+ bool give_up_on_oom);
__must_check struct vm_area_struct
*vma_merge_new_range(struct vma_merge_struct *vmg);
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-vma-add-give_up_on_oom-option-on-modify-merge-use-in-uffd-release.patch