The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x dbf4ab821804df071c8b566d9813083125e6d97b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012605-emphases-explore-4dff@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
dbf4ab821804 ("serial: sc16is7xx: convert from _raw_ to _noinc_ regmap functions for FIFO")
3837a0379533 ("serial: sc16is7xx: improve regmap debugfs by using one regmap per port")
049994292834 ("serial: sc16is7xx: fix regression with GPIO configuration")
dabc54a45711 ("serial: sc16is7xx: remove obsolete out_thread label")
c8f71b49ee4d ("serial: sc16is7xx: setup GPIO controller later in probe")
267913ecf737 ("serial: sc16is7xx: Fill in rs485_supported")
21144bab4f11 ("sc16is7xx: Handle modem status lines")
cc4c1d05eb10 ("sc16is7xx: Properly resume TX after stop")
d4ab5487cc77 ("Merge 5.17-rc6 into tty-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dbf4ab821804df071c8b566d9813083125e6d97b Mon Sep 17 00:00:00 2001
From: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Date: Mon, 11 Dec 2023 12:13:52 -0500
Subject: [PATCH] serial: sc16is7xx: convert from _raw_ to _noinc_ regmap
functions for FIFO
The SC16IS7XX IC supports a burst mode to access the FIFOs where the
initial register address is sent ($00), followed by all the FIFO data
without having to resend the register address each time. In this mode, the
IC doesn't increment the register address for each R/W byte.
The regmap_raw_read() and regmap_raw_write() are functions which can
perform IO over multiple registers. They are currently used to read/write
from/to the FIFO, and although they operate correctly in this burst mode on
the SPI bus, they would corrupt the regmap cache if it was not disabled
manually. The reason is that when the R/W size is more than 1 byte, these
functions assume that the register address is incremented and handle the
cache accordingly.
Convert FIFO R/W functions to use the regmap _noinc_ versions in order to
remove the manual cache control which was a workaround when using the
_raw_ versions. FIFO registers are properly declared as volatile so
cache will not be used/updated for FIFO accesses.
Fixes: dfeae619d781 ("serial: sc16is7xx")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Link: https://lore.kernel.org/r/20231211171353.2901416-6-hugo@hugovil.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c
index 0bda9b74d096..7e4b9b52841d 100644
--- a/drivers/tty/serial/sc16is7xx.c
+++ b/drivers/tty/serial/sc16is7xx.c
@@ -381,9 +381,7 @@ static void sc16is7xx_fifo_read(struct uart_port *port, unsigned int rxlen)
struct sc16is7xx_port *s = dev_get_drvdata(port->dev);
struct sc16is7xx_one *one = to_sc16is7xx_one(port, port);
- regcache_cache_bypass(one->regmap, true);
- regmap_raw_read(one->regmap, SC16IS7XX_RHR_REG, s->buf, rxlen);
- regcache_cache_bypass(one->regmap, false);
+ regmap_noinc_read(one->regmap, SC16IS7XX_RHR_REG, s->buf, rxlen);
}
static void sc16is7xx_fifo_write(struct uart_port *port, u8 to_send)
@@ -398,9 +396,7 @@ static void sc16is7xx_fifo_write(struct uart_port *port, u8 to_send)
if (unlikely(!to_send))
return;
- regcache_cache_bypass(one->regmap, true);
- regmap_raw_write(one->regmap, SC16IS7XX_THR_REG, s->buf, to_send);
- regcache_cache_bypass(one->regmap, false);
+ regmap_noinc_write(one->regmap, SC16IS7XX_THR_REG, s->buf, to_send);
}
static void sc16is7xx_port_update(struct uart_port *port, u8 reg,
@@ -492,6 +488,11 @@ static bool sc16is7xx_regmap_precious(struct device *dev, unsigned int reg)
return false;
}
+static bool sc16is7xx_regmap_noinc(struct device *dev, unsigned int reg)
+{
+ return reg == SC16IS7XX_RHR_REG;
+}
+
static int sc16is7xx_set_baud(struct uart_port *port, int baud)
{
struct sc16is7xx_one *one = to_sc16is7xx_one(port, port);
@@ -1709,6 +1710,10 @@ static struct regmap_config regcfg = {
.cache_type = REGCACHE_RBTREE,
.volatile_reg = sc16is7xx_regmap_volatile,
.precious_reg = sc16is7xx_regmap_precious,
+ .writeable_noinc_reg = sc16is7xx_regmap_noinc,
+ .readable_noinc_reg = sc16is7xx_regmap_noinc,
+ .max_raw_read = SC16IS7XX_FIFO_SIZE,
+ .max_raw_write = SC16IS7XX_FIFO_SIZE,
.max_register = SC16IS7XX_EFCR_REG,
};
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 5ec8e8ea8b7783fab150cf86404fc38cb4db8800
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012608-bonanza-edition-993c@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
f1dc0db296bd ("mm: use __pfn_to_section() instead of open coding it")
0a9f9f623166 ("mm/sparse.c: only use subsection map in VMEMMAP case")
37bc15020a96 ("mm/sparse.c: introduce a new function clear_subsection_map()")
5d87255cadde ("mm/sparse.c: introduce new function fill_subsection_map()")
b943f045a9af ("mm/sparse: fix kernel crash with pfn_section_valid check")
d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
1f503443e7df ("mm/sparse.c: reset section's mem_map when fully deactivated")
8068df3b6037 ("mm/memory_hotplug: don't free usage map when removing a re-added early section")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5ec8e8ea8b7783fab150cf86404fc38cb4db8800 Mon Sep 17 00:00:00 2001
From: Charan Teja Kalla <quic_charante(a)quicinc.com>
Date: Fri, 13 Oct 2023 18:34:27 +0530
Subject: [PATCH] mm/sparsemem: fix race in accessing memory_section->usage
The below race is observed on a PFN which falls into the device memory
region with the system memory configuration where PFN's are such that
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and end
pfn contains the device memory PFN's as well, the compaction triggered
will try on the device memory PFN's too though they end up in NOP(because
pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When
from other core, the section mappings are being removed for the
ZONE_DEVICE region, that the PFN in question belongs to, on which
compaction is currently being operated is resulting into the kernel crash
with CONFIG_SPASEMEM_VMEMAP enabled. The crash logs can be seen at [1].
compact_zone() memunmap_pages
------------- ---------------
__pageblock_pfn_to_page
......
(a)pfn_valid():
valid_section()//return true
(b)__remove_pages()->
sparse_remove_section()->
section_deactivate():
[Free the array ms->usage and set
ms->usage = NULL]
pfn_section_valid()
[Access ms->usage which
is NULL]
NOTE: From the above it can be said that the race is reduced to between
the pfn_valid()/pfn_section_valid() and the section deactivate with
SPASEMEM_VMEMAP enabled.
The commit b943f045a9af("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
false thus ms->usage is not accessed.
Fix this issue by the below steps:
a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
b) RCU protected read side critical section will either return NULL
when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No
attempt will be made to access ->usage after this as the
SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
Thanks to David/Pavan for their inputs on this patch.
[1] https://lore.kernel.org/linux-mm/994410bb-89aa-d987-1f50-f514903c55aa@quici…
On Snapdragon SoC, with the mentioned memory configuration of PFN's as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of
issues daily while testing on a device farm.
For this particular issue below is the log. Though the below log is
not directly pointing to the pfn_section_valid(){ ms->usage;}, when we
loaded this dump on T32 lauterbach tool, it is pointing.
[ 540.578056] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[ 540.578068] Mem abort info:
[ 540.578070] ESR = 0x0000000096000005
[ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits
[ 540.578077] SET = 0, FnV = 0
[ 540.578080] EA = 0, S1PTW = 0
[ 540.578082] FSC = 0x05: level 1 translation fault
[ 540.578085] Data abort info:
[ 540.578086] ISV = 0, ISS = 0x00000005
[ 540.578088] CM = 0, WnR = 0
[ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--)
[ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579454] lr : compact_zone+0x994/0x1058
[ 540.579460] sp : ffffffc03579b510
[ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c
[ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640
[ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000
[ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140
[ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff
[ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001
[ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440
[ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4
[ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001
[ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800
[ 540.579524] Call trace:
[ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579533] compact_zone+0x994/0x1058
[ 540.579536] try_to_compact_pages+0x128/0x378
[ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0
[ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10
[ 540.579547] __alloc_pages+0x250/0x2d0
[ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
[ 540.579561] iommu_dma_alloc+0xa0/0x320
[ 540.579565] dma_alloc_attrs+0xd4/0x108
[quic_charante(a)quicinc.com: use kfree_rcu() in place of synchronize_rcu(), per David]
Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@q…
Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@q…
Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ec73582e7d27..2efd3be484fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1799,6 +1799,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
struct mem_section_usage {
+ struct rcu_head rcu;
#ifdef CONFIG_SPARSEMEM_VMEMMAP
DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
#endif
@@ -1992,7 +1993,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
- return test_bit(idx, ms->usage->subsection_map);
+ return test_bit(idx, READ_ONCE(ms->usage)->subsection_map);
}
#else
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
@@ -2016,6 +2017,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
static inline int pfn_valid(unsigned long pfn)
{
struct mem_section *ms;
+ int ret;
/*
* Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2029,13 +2031,19 @@ static inline int pfn_valid(unsigned long pfn)
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
ms = __pfn_to_section(pfn);
- if (!valid_section(ms))
+ rcu_read_lock();
+ if (!valid_section(ms)) {
+ rcu_read_unlock();
return 0;
+ }
/*
* Traditionally early sections always returned pfn_valid() for
* the entire section-sized span.
*/
- return early_section(ms) || pfn_section_valid(ms, pfn);
+ ret = early_section(ms) || pfn_section_valid(ms, pfn);
+ rcu_read_unlock();
+
+ return ret;
}
#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..338cf946dee8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -791,6 +791,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
if (empty) {
unsigned long section_nr = pfn_to_section_nr(pfn);
+ /*
+ * Mark the section invalid so that valid_section()
+ * return false. This prevents code from dereferencing
+ * ms->usage array.
+ */
+ ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
/*
* When removing an early section, the usage map is kept (as the
* usage maps of other sections fall into the same page). It
@@ -799,16 +806,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
* was allocated during boot.
*/
if (!PageReserved(virt_to_page(ms->usage))) {
- kfree(ms->usage);
- ms->usage = NULL;
+ kfree_rcu(ms->usage, rcu);
+ WRITE_ONCE(ms->usage, NULL);
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
- /*
- * Mark the section invalid so that valid_section()
- * return false. This prevents code from dereferencing
- * ms->usage array.
- */
- ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
}
/*
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 5ec8e8ea8b7783fab150cf86404fc38cb4db8800
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012604-protegee-reassign-85ae@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
f1dc0db296bd ("mm: use __pfn_to_section() instead of open coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5ec8e8ea8b7783fab150cf86404fc38cb4db8800 Mon Sep 17 00:00:00 2001
From: Charan Teja Kalla <quic_charante(a)quicinc.com>
Date: Fri, 13 Oct 2023 18:34:27 +0530
Subject: [PATCH] mm/sparsemem: fix race in accessing memory_section->usage
The below race is observed on a PFN which falls into the device memory
region with the system memory configuration where PFN's are such that
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and end
pfn contains the device memory PFN's as well, the compaction triggered
will try on the device memory PFN's too though they end up in NOP(because
pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When
from other core, the section mappings are being removed for the
ZONE_DEVICE region, that the PFN in question belongs to, on which
compaction is currently being operated is resulting into the kernel crash
with CONFIG_SPASEMEM_VMEMAP enabled. The crash logs can be seen at [1].
compact_zone() memunmap_pages
------------- ---------------
__pageblock_pfn_to_page
......
(a)pfn_valid():
valid_section()//return true
(b)__remove_pages()->
sparse_remove_section()->
section_deactivate():
[Free the array ms->usage and set
ms->usage = NULL]
pfn_section_valid()
[Access ms->usage which
is NULL]
NOTE: From the above it can be said that the race is reduced to between
the pfn_valid()/pfn_section_valid() and the section deactivate with
SPASEMEM_VMEMAP enabled.
The commit b943f045a9af("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
false thus ms->usage is not accessed.
Fix this issue by the below steps:
a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
b) RCU protected read side critical section will either return NULL
when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No
attempt will be made to access ->usage after this as the
SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
Thanks to David/Pavan for their inputs on this patch.
[1] https://lore.kernel.org/linux-mm/994410bb-89aa-d987-1f50-f514903c55aa@quici…
On Snapdragon SoC, with the mentioned memory configuration of PFN's as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of
issues daily while testing on a device farm.
For this particular issue below is the log. Though the below log is
not directly pointing to the pfn_section_valid(){ ms->usage;}, when we
loaded this dump on T32 lauterbach tool, it is pointing.
[ 540.578056] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[ 540.578068] Mem abort info:
[ 540.578070] ESR = 0x0000000096000005
[ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits
[ 540.578077] SET = 0, FnV = 0
[ 540.578080] EA = 0, S1PTW = 0
[ 540.578082] FSC = 0x05: level 1 translation fault
[ 540.578085] Data abort info:
[ 540.578086] ISV = 0, ISS = 0x00000005
[ 540.578088] CM = 0, WnR = 0
[ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--)
[ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579454] lr : compact_zone+0x994/0x1058
[ 540.579460] sp : ffffffc03579b510
[ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c
[ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640
[ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000
[ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140
[ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff
[ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001
[ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440
[ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4
[ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001
[ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800
[ 540.579524] Call trace:
[ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579533] compact_zone+0x994/0x1058
[ 540.579536] try_to_compact_pages+0x128/0x378
[ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0
[ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10
[ 540.579547] __alloc_pages+0x250/0x2d0
[ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
[ 540.579561] iommu_dma_alloc+0xa0/0x320
[ 540.579565] dma_alloc_attrs+0xd4/0x108
[quic_charante(a)quicinc.com: use kfree_rcu() in place of synchronize_rcu(), per David]
Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@q…
Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@q…
Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ec73582e7d27..2efd3be484fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1799,6 +1799,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
struct mem_section_usage {
+ struct rcu_head rcu;
#ifdef CONFIG_SPARSEMEM_VMEMMAP
DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
#endif
@@ -1992,7 +1993,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
- return test_bit(idx, ms->usage->subsection_map);
+ return test_bit(idx, READ_ONCE(ms->usage)->subsection_map);
}
#else
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
@@ -2016,6 +2017,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
static inline int pfn_valid(unsigned long pfn)
{
struct mem_section *ms;
+ int ret;
/*
* Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2029,13 +2031,19 @@ static inline int pfn_valid(unsigned long pfn)
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
ms = __pfn_to_section(pfn);
- if (!valid_section(ms))
+ rcu_read_lock();
+ if (!valid_section(ms)) {
+ rcu_read_unlock();
return 0;
+ }
/*
* Traditionally early sections always returned pfn_valid() for
* the entire section-sized span.
*/
- return early_section(ms) || pfn_section_valid(ms, pfn);
+ ret = early_section(ms) || pfn_section_valid(ms, pfn);
+ rcu_read_unlock();
+
+ return ret;
}
#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..338cf946dee8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -791,6 +791,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
if (empty) {
unsigned long section_nr = pfn_to_section_nr(pfn);
+ /*
+ * Mark the section invalid so that valid_section()
+ * return false. This prevents code from dereferencing
+ * ms->usage array.
+ */
+ ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
/*
* When removing an early section, the usage map is kept (as the
* usage maps of other sections fall into the same page). It
@@ -799,16 +806,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
* was allocated during boot.
*/
if (!PageReserved(virt_to_page(ms->usage))) {
- kfree(ms->usage);
- ms->usage = NULL;
+ kfree_rcu(ms->usage, rcu);
+ WRITE_ONCE(ms->usage, NULL);
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
- /*
- * Mark the section invalid so that valid_section()
- * return false. This prevents code from dereferencing
- * ms->usage array.
- */
- ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
}
/*
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 5ec8e8ea8b7783fab150cf86404fc38cb4db8800
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012600-swaddling-filled-c787@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
f1dc0db296bd ("mm: use __pfn_to_section() instead of open coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5ec8e8ea8b7783fab150cf86404fc38cb4db8800 Mon Sep 17 00:00:00 2001
From: Charan Teja Kalla <quic_charante(a)quicinc.com>
Date: Fri, 13 Oct 2023 18:34:27 +0530
Subject: [PATCH] mm/sparsemem: fix race in accessing memory_section->usage
The below race is observed on a PFN which falls into the device memory
region with the system memory configuration where PFN's are such that
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and end
pfn contains the device memory PFN's as well, the compaction triggered
will try on the device memory PFN's too though they end up in NOP(because
pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When
from other core, the section mappings are being removed for the
ZONE_DEVICE region, that the PFN in question belongs to, on which
compaction is currently being operated is resulting into the kernel crash
with CONFIG_SPASEMEM_VMEMAP enabled. The crash logs can be seen at [1].
compact_zone() memunmap_pages
------------- ---------------
__pageblock_pfn_to_page
......
(a)pfn_valid():
valid_section()//return true
(b)__remove_pages()->
sparse_remove_section()->
section_deactivate():
[Free the array ms->usage and set
ms->usage = NULL]
pfn_section_valid()
[Access ms->usage which
is NULL]
NOTE: From the above it can be said that the race is reduced to between
the pfn_valid()/pfn_section_valid() and the section deactivate with
SPASEMEM_VMEMAP enabled.
The commit b943f045a9af("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
false thus ms->usage is not accessed.
Fix this issue by the below steps:
a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
b) RCU protected read side critical section will either return NULL
when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No
attempt will be made to access ->usage after this as the
SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
Thanks to David/Pavan for their inputs on this patch.
[1] https://lore.kernel.org/linux-mm/994410bb-89aa-d987-1f50-f514903c55aa@quici…
On Snapdragon SoC, with the mentioned memory configuration of PFN's as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of
issues daily while testing on a device farm.
For this particular issue below is the log. Though the below log is
not directly pointing to the pfn_section_valid(){ ms->usage;}, when we
loaded this dump on T32 lauterbach tool, it is pointing.
[ 540.578056] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[ 540.578068] Mem abort info:
[ 540.578070] ESR = 0x0000000096000005
[ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits
[ 540.578077] SET = 0, FnV = 0
[ 540.578080] EA = 0, S1PTW = 0
[ 540.578082] FSC = 0x05: level 1 translation fault
[ 540.578085] Data abort info:
[ 540.578086] ISV = 0, ISS = 0x00000005
[ 540.578088] CM = 0, WnR = 0
[ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--)
[ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579454] lr : compact_zone+0x994/0x1058
[ 540.579460] sp : ffffffc03579b510
[ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c
[ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640
[ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000
[ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140
[ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff
[ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001
[ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440
[ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4
[ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001
[ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800
[ 540.579524] Call trace:
[ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579533] compact_zone+0x994/0x1058
[ 540.579536] try_to_compact_pages+0x128/0x378
[ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0
[ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10
[ 540.579547] __alloc_pages+0x250/0x2d0
[ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
[ 540.579561] iommu_dma_alloc+0xa0/0x320
[ 540.579565] dma_alloc_attrs+0xd4/0x108
[quic_charante(a)quicinc.com: use kfree_rcu() in place of synchronize_rcu(), per David]
Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@q…
Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@q…
Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ec73582e7d27..2efd3be484fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1799,6 +1799,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
struct mem_section_usage {
+ struct rcu_head rcu;
#ifdef CONFIG_SPARSEMEM_VMEMMAP
DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
#endif
@@ -1992,7 +1993,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
- return test_bit(idx, ms->usage->subsection_map);
+ return test_bit(idx, READ_ONCE(ms->usage)->subsection_map);
}
#else
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
@@ -2016,6 +2017,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
static inline int pfn_valid(unsigned long pfn)
{
struct mem_section *ms;
+ int ret;
/*
* Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2029,13 +2031,19 @@ static inline int pfn_valid(unsigned long pfn)
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
ms = __pfn_to_section(pfn);
- if (!valid_section(ms))
+ rcu_read_lock();
+ if (!valid_section(ms)) {
+ rcu_read_unlock();
return 0;
+ }
/*
* Traditionally early sections always returned pfn_valid() for
* the entire section-sized span.
*/
- return early_section(ms) || pfn_section_valid(ms, pfn);
+ ret = early_section(ms) || pfn_section_valid(ms, pfn);
+ rcu_read_unlock();
+
+ return ret;
}
#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..338cf946dee8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -791,6 +791,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
if (empty) {
unsigned long section_nr = pfn_to_section_nr(pfn);
+ /*
+ * Mark the section invalid so that valid_section()
+ * return false. This prevents code from dereferencing
+ * ms->usage array.
+ */
+ ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
/*
* When removing an early section, the usage map is kept (as the
* usage maps of other sections fall into the same page). It
@@ -799,16 +806,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
* was allocated during boot.
*/
if (!PageReserved(virt_to_page(ms->usage))) {
- kfree(ms->usage);
- ms->usage = NULL;
+ kfree_rcu(ms->usage, rcu);
+ WRITE_ONCE(ms->usage, NULL);
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
- /*
- * Mark the section invalid so that valid_section()
- * return false. This prevents code from dereferencing
- * ms->usage array.
- */
- ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
}
/*
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x d1adb25df7111de83b64655a80b5a135adbded61
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012640-racing-flannels-8384@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
d1adb25df711 ("mm: migrate: fix getting incorrect page mapping during page migration")
eebb3dabbb5c ("mm: migrate: record the mlocked page status to remove unnecessary lru drain")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d1adb25df7111de83b64655a80b5a135adbded61 Mon Sep 17 00:00:00 2001
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Date: Fri, 15 Dec 2023 20:07:52 +0800
Subject: [PATCH] mm: migrate: fix getting incorrect page mapping during page
migration
When running stress-ng testing, we found below kernel crash after a few hours:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......
After analyzing the vmcore, I found this issue is caused by page migration.
The scenario is that, one thread is doing page migration, and we will use the
target page's ->mapping field to save 'anon_vma' pointer between page unmap and
page move, and now the target page is locked and refcount is 1.
Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers that
the refcount of this target page is 1, preventing the offline operation, thus
proceeding to dump the page. However, page_mapping() of the target page may
return an incorrect file mapping to crash the system in dump_mapping(), since
the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.
There are seveval ways to fix this issue:
(1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving
'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target
page has not built mappings yet.
(2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing
the system, however, there are still some PFN walkers that call page_mapping()
without holding the page lock, such as compaction.
(3) Using target page->private field to save the 'anon_vma' pointer and 2 bits
page state, just as page->mapping records an anonymous page, which can remove
the page_mapping() impact for PFN walkers and also seems a simple way.
So I choose option 3 to fix this issue, and this can also fix other potential
issues for PFN walkers, such as compaction.
Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.17026417…
Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang(a)intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Xu Yu <xuyu(a)linux.alibaba.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/migrate.c b/mm/migrate.c
index 397f2a6e34cb..bad3039d165e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1025,38 +1025,31 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
}
/*
- * To record some information during migration, we use some unused
- * fields (mapping and private) of struct folio of the newly allocated
- * destination folio. This is safe because nobody is using them
- * except us.
+ * To record some information during migration, we use unused private
+ * field of struct folio of the newly allocated destination folio.
+ * This is safe because nobody is using it except us.
*/
-union migration_ptr {
- struct anon_vma *anon_vma;
- struct address_space *mapping;
-};
-
enum {
PAGE_WAS_MAPPED = BIT(0),
PAGE_WAS_MLOCKED = BIT(1),
+ PAGE_OLD_STATES = PAGE_WAS_MAPPED | PAGE_WAS_MLOCKED,
};
static void __migrate_folio_record(struct folio *dst,
- unsigned long old_page_state,
+ int old_page_state,
struct anon_vma *anon_vma)
{
- union migration_ptr ptr = { .anon_vma = anon_vma };
- dst->mapping = ptr.mapping;
- dst->private = (void *)old_page_state;
+ dst->private = (void *)anon_vma + old_page_state;
}
static void __migrate_folio_extract(struct folio *dst,
int *old_page_state,
struct anon_vma **anon_vmap)
{
- union migration_ptr ptr = { .mapping = dst->mapping };
- *anon_vmap = ptr.anon_vma;
- *old_page_state = (unsigned long)dst->private;
- dst->mapping = NULL;
+ unsigned long private = (unsigned long)dst->private;
+
+ *anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
+ *old_page_state = private & PAGE_OLD_STATES;
dst->private = NULL;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 00bcfcd47a52f50f07a2e88d730d7931384cb073
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012616-armrest-racoon-14ea@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
00bcfcd47a52 ("selftests: mm: hugepage-vmemmap fails on 64K page size systems")
0183d777c29a ("selftests: mm: remove duplicate unneeded defines")
af605d26a8f2 ("selftests/mm: merge util.h into vm_util.h")
baa489fabd01 ("selftests/vm: rename selftests/vm to selftests/mm")
799fb82aa132 ("tools/vm: rename tools/vm to tools/mm")
93fb70aa5904 ("selftests/vm: add KSM unmerge tests")
7aca5ca15493 ("selftests/vm: anon_cow: prepare for non-anonymous COW tests")
65f199b2b40d ("vmalloc: add reviewers for vmalloc code")
e487ebbd1298 ("selftests/vm: anon_cow: add liburing test cases")
f4b5fd6946e2 ("selftests/vm: anon_cow: THP tests")
a905e82ae44b ("selftests/vm: factor out pagemap_is_populated() into vm_util")
69c66add5663 ("selftests/vm: anon_cow: test COW handling of anonymous memory")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 00bcfcd47a52f50f07a2e88d730d7931384cb073 Mon Sep 17 00:00:00 2001
From: Donet Tom <donettom(a)linux.vnet.ibm.com>
Date: Wed, 10 Jan 2024 14:03:35 +0530
Subject: [PATCH] selftests: mm: hugepage-vmemmap fails on 64K page size
systems
The kernel sefltest mm/hugepage-vmemmap fails on architectures which has
different page size other than 4K. In hugepage-vmemmap page size used is
4k so the pfn calculation will go wrong on systems which has different
page size .The length of MAP_HUGETLB memory must be hugepage aligned but
in hugepage-vmemmap map length is 2M so this will not get aligned if the
system has differnet hugepage size.
Added psize() to get the page size and default_huge_page_size() to
get the default hugepage size at run time, hugepage-vmemmap test pass
on powerpc with 64K page size and x86 with 4K page size.
Result on powerpc without patch (page size 64K)
*# ./hugepage-vmemmap
Returned address is 0x7effff000000 whose pfn is 0
Head page flags (100000000) is invalid
check_page_flags: Invalid argument
*#
Result on powerpc with patch (page size 64K)
*# ./hugepage-vmemmap
Returned address is 0x7effff000000 whose pfn is 600
*#
Result on x86 with patch (page size 4K)
*# ./hugepage-vmemmap
Returned address is 0x7fc7c2c00000 whose pfn is 1dac00
*#
Link: https://lkml.kernel.org/r/3b3a3ae37ba21218481c482a872bbf7526031600.17048657…
Fixes: b147c89cd429 ("selftests: vm: add a hugetlb test case")
Signed-off-by: Donet Tom <donettom(a)linux.vnet.ibm.com>
Reported-by: Geetika Moolchandani <geetika(a)linux.ibm.com>
Tested-by: Geetika Moolchandani <geetika(a)linux.ibm.com>
Acked-by: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/tools/testing/selftests/mm/hugepage-vmemmap.c b/tools/testing/selftests/mm/hugepage-vmemmap.c
index 5b354c209e93..894d28c3dd47 100644
--- a/tools/testing/selftests/mm/hugepage-vmemmap.c
+++ b/tools/testing/selftests/mm/hugepage-vmemmap.c
@@ -10,10 +10,7 @@
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
-
-#define MAP_LENGTH (2UL * 1024 * 1024)
-
-#define PAGE_SIZE 4096
+#include "vm_util.h"
#define PAGE_COMPOUND_HEAD (1UL << 15)
#define PAGE_COMPOUND_TAIL (1UL << 16)
@@ -39,6 +36,9 @@
#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
#endif
+static size_t pagesize;
+static size_t maplength;
+
static void write_bytes(char *addr, size_t length)
{
unsigned long i;
@@ -56,7 +56,7 @@ static unsigned long virt_to_pfn(void *addr)
if (fd < 0)
return -1UL;
- lseek(fd, (unsigned long)addr / PAGE_SIZE * sizeof(pagemap), SEEK_SET);
+ lseek(fd, (unsigned long)addr / pagesize * sizeof(pagemap), SEEK_SET);
read(fd, &pagemap, sizeof(pagemap));
close(fd);
@@ -86,7 +86,7 @@ static int check_page_flags(unsigned long pfn)
* this also verifies kernel has correctly set the fake page_head to tail
* while hugetlb_free_vmemmap is enabled.
*/
- for (i = 1; i < MAP_LENGTH / PAGE_SIZE; i++) {
+ for (i = 1; i < maplength / pagesize; i++) {
read(fd, &pageflags, sizeof(pageflags));
if ((pageflags & TAIL_PAGE_FLAGS) != TAIL_PAGE_FLAGS ||
(pageflags & HEAD_PAGE_FLAGS) == HEAD_PAGE_FLAGS) {
@@ -106,18 +106,25 @@ int main(int argc, char **argv)
void *addr;
unsigned long pfn;
- addr = mmap(MAP_ADDR, MAP_LENGTH, PROT_READ | PROT_WRITE, MAP_FLAGS, -1, 0);
+ pagesize = psize();
+ maplength = default_huge_page_size();
+ if (!maplength) {
+ printf("Unable to determine huge page size\n");
+ exit(1);
+ }
+
+ addr = mmap(MAP_ADDR, maplength, PROT_READ | PROT_WRITE, MAP_FLAGS, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit(1);
}
/* Trigger allocation of HugeTLB page. */
- write_bytes(addr, MAP_LENGTH);
+ write_bytes(addr, maplength);
pfn = virt_to_pfn(addr);
if (pfn == -1UL) {
- munmap(addr, MAP_LENGTH);
+ munmap(addr, maplength);
perror("virt_to_pfn");
exit(1);
}
@@ -125,13 +132,13 @@ int main(int argc, char **argv)
printf("Returned address is %p whose pfn is %lx\n", addr, pfn);
if (check_page_flags(pfn) < 0) {
- munmap(addr, MAP_LENGTH);
+ munmap(addr, maplength);
perror("check_page_flags");
exit(1);
}
/* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
- if (munmap(addr, MAP_LENGTH)) {
+ if (munmap(addr, maplength)) {
perror("munmap");
exit(1);
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x e95fa7404716f6e25021e66067271a4ad8eb1486
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012650-unfrozen-jukebox-4be8@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
e95fa7404716 ("thermal: gov_power_allocator: avoid inability to reset a cdev")
94be1d27aa8d ("thermal: gov_power_allocator: Use trip pointers instead of trip indices")
2c7b4bfadef0 ("thermal: core: Store trip pointer in struct thermal_instance")
790930f44289 ("thermal: core: Introduce thermal_cooling_device_update()")
c43198af05cf ("thermal: core: Introduce thermal_cooling_device_present()")
5b8de18ee902 ("thermal/core: Move the thermal trip code to a dedicated file")
e6ec64f85237 ("thermal/drivers/qcom: Fix set_trip_temp() deadlock")
1fa86b0a3692 ("thermal/drivers/qcom: Use generic thermal_zone_get_trip() function")
7f725a23f2b7 ("thermal/core/governors: Use thermal_zone_get_trip() instead of ops functions")
2e38a2a981b2 ("thermal/core: Add a generic thermal_zone_set_trip() function")
7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function")
91b3aafc2238 ("thermal/core: Remove thermal_zone_set_trips()")
05eeee2b51b4 ("thermal/core: Protect sysfs accesses to thermal operations with thermal zone mutex")
1c439dec359c ("thermal/core: Introduce locked version of thermal_zone_device_update")
a365105c685c ("thermal: sysfs: Reuse cdev->max_state")
c408b3d1d9bb ("thermal: Validate new state in cur_state_store()")
597f500fde76 ("thermal/core: Add a check before calling set_trip_temp()")
a930da9bf583 ("thermal/core: Move the mutex inside the thermal_zone_device_update() function")
670a5e356cb6 ("thermal/core: Move the thermal zone lock out of the governors")
63561fe36b09 ("thermal/governors: Group the thermal zone lock inside the throttle function")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e95fa7404716f6e25021e66067271a4ad8eb1486 Mon Sep 17 00:00:00 2001
From: Di Shen <di.shen(a)unisoc.com>
Date: Wed, 10 Jan 2024 19:55:26 +0800
Subject: [PATCH] thermal: gov_power_allocator: avoid inability to reset a cdev
Commit 0952177f2a1f ("thermal/core/power_allocator: Update once
cooling devices when temp is low") adds an update flag to avoid
triggering a thermal event when there is no need, and the thermal
cdev is updated once when the temperature is low.
But when the trips are writable, and switch_on_temp is set to be a
higher value, the cooling device state may not be reset to 0,
because last_temperature is smaller than switch_on_temp.
For example:
First:
switch_on_temp=70 control_temp=85;
Then userspace change the trip_temp:
switch_on_temp=45 control_temp=55 cur_temp=54
Then userspace reset the trip_temp:
switch_on_temp=70 control_temp=85 cur_temp=57 last_temp=54
At this time, the cooling device state should be reset to 0.
However, because cur_temp(57) < switch_on_temp(70)
last_temp(54) < switch_on_temp(70) ----> update = false,
update is false, the cooling device state can not be reset.
Using the observation that tz->passive can also be regarded as the
temperature status, set the update flag to the tz->passive value.
When the temperature drops below switch_on for the first time, the
states of cooling devices can be reset once, and tz->passive is updated
to 0. In the next round, because tz->passive is 0, cdev->state will not
be updated.
By using the tz->passive value as the "update" flag, the issue above
can be solved, and the cooling devices can be updated only once when the
temperature is low.
Fixes: 0952177f2a1f ("thermal/core/power_allocator: Update once cooling devices when temp is low")
Cc: 5.13+ <stable(a)vger.kernel.org> # 5.13+
Suggested-by: Wei Wang <wvw(a)google.com>
Signed-off-by: Di Shen <di.shen(a)unisoc.com>
Reviewed-by: Lukasz Luba <lukasz.luba(a)arm.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 7b6aa265ff6a..81e061f183ad 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -762,7 +762,7 @@ static int power_allocator_throttle(struct thermal_zone_device *tz,
trip = params->trip_switch_on;
if (trip && tz->temperature < trip->temperature) {
- update = tz->last_temperature >= trip->temperature;
+ update = tz->passive;
tz->passive = 0;
reset_pid_controller(params);
allow_maximum_power(tz, update);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x e95fa7404716f6e25021e66067271a4ad8eb1486
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012648-impending-gala-16f2@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
e95fa7404716 ("thermal: gov_power_allocator: avoid inability to reset a cdev")
94be1d27aa8d ("thermal: gov_power_allocator: Use trip pointers instead of trip indices")
2c7b4bfadef0 ("thermal: core: Store trip pointer in struct thermal_instance")
790930f44289 ("thermal: core: Introduce thermal_cooling_device_update()")
c43198af05cf ("thermal: core: Introduce thermal_cooling_device_present()")
5b8de18ee902 ("thermal/core: Move the thermal trip code to a dedicated file")
e6ec64f85237 ("thermal/drivers/qcom: Fix set_trip_temp() deadlock")
1fa86b0a3692 ("thermal/drivers/qcom: Use generic thermal_zone_get_trip() function")
7f725a23f2b7 ("thermal/core/governors: Use thermal_zone_get_trip() instead of ops functions")
2e38a2a981b2 ("thermal/core: Add a generic thermal_zone_set_trip() function")
7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function")
91b3aafc2238 ("thermal/core: Remove thermal_zone_set_trips()")
05eeee2b51b4 ("thermal/core: Protect sysfs accesses to thermal operations with thermal zone mutex")
1c439dec359c ("thermal/core: Introduce locked version of thermal_zone_device_update")
a365105c685c ("thermal: sysfs: Reuse cdev->max_state")
c408b3d1d9bb ("thermal: Validate new state in cur_state_store()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e95fa7404716f6e25021e66067271a4ad8eb1486 Mon Sep 17 00:00:00 2001
From: Di Shen <di.shen(a)unisoc.com>
Date: Wed, 10 Jan 2024 19:55:26 +0800
Subject: [PATCH] thermal: gov_power_allocator: avoid inability to reset a cdev
Commit 0952177f2a1f ("thermal/core/power_allocator: Update once
cooling devices when temp is low") adds an update flag to avoid
triggering a thermal event when there is no need, and the thermal
cdev is updated once when the temperature is low.
But when the trips are writable, and switch_on_temp is set to be a
higher value, the cooling device state may not be reset to 0,
because last_temperature is smaller than switch_on_temp.
For example:
First:
switch_on_temp=70 control_temp=85;
Then userspace change the trip_temp:
switch_on_temp=45 control_temp=55 cur_temp=54
Then userspace reset the trip_temp:
switch_on_temp=70 control_temp=85 cur_temp=57 last_temp=54
At this time, the cooling device state should be reset to 0.
However, because cur_temp(57) < switch_on_temp(70)
last_temp(54) < switch_on_temp(70) ----> update = false,
update is false, the cooling device state can not be reset.
Using the observation that tz->passive can also be regarded as the
temperature status, set the update flag to the tz->passive value.
When the temperature drops below switch_on for the first time, the
states of cooling devices can be reset once, and tz->passive is updated
to 0. In the next round, because tz->passive is 0, cdev->state will not
be updated.
By using the tz->passive value as the "update" flag, the issue above
can be solved, and the cooling devices can be updated only once when the
temperature is low.
Fixes: 0952177f2a1f ("thermal/core/power_allocator: Update once cooling devices when temp is low")
Cc: 5.13+ <stable(a)vger.kernel.org> # 5.13+
Suggested-by: Wei Wang <wvw(a)google.com>
Signed-off-by: Di Shen <di.shen(a)unisoc.com>
Reviewed-by: Lukasz Luba <lukasz.luba(a)arm.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 7b6aa265ff6a..81e061f183ad 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -762,7 +762,7 @@ static int power_allocator_throttle(struct thermal_zone_device *tz,
trip = params->trip_switch_on;
if (trip && tz->temperature < trip->temperature) {
- update = tz->last_temperature >= trip->temperature;
+ update = tz->passive;
tz->passive = 0;
reset_pid_controller(params);
allow_maximum_power(tz, update);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x e95fa7404716f6e25021e66067271a4ad8eb1486
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024012646-bulb-spur-7ae9@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
e95fa7404716 ("thermal: gov_power_allocator: avoid inability to reset a cdev")
94be1d27aa8d ("thermal: gov_power_allocator: Use trip pointers instead of trip indices")
2c7b4bfadef0 ("thermal: core: Store trip pointer in struct thermal_instance")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e95fa7404716f6e25021e66067271a4ad8eb1486 Mon Sep 17 00:00:00 2001
From: Di Shen <di.shen(a)unisoc.com>
Date: Wed, 10 Jan 2024 19:55:26 +0800
Subject: [PATCH] thermal: gov_power_allocator: avoid inability to reset a cdev
Commit 0952177f2a1f ("thermal/core/power_allocator: Update once
cooling devices when temp is low") adds an update flag to avoid
triggering a thermal event when there is no need, and the thermal
cdev is updated once when the temperature is low.
But when the trips are writable, and switch_on_temp is set to be a
higher value, the cooling device state may not be reset to 0,
because last_temperature is smaller than switch_on_temp.
For example:
First:
switch_on_temp=70 control_temp=85;
Then userspace change the trip_temp:
switch_on_temp=45 control_temp=55 cur_temp=54
Then userspace reset the trip_temp:
switch_on_temp=70 control_temp=85 cur_temp=57 last_temp=54
At this time, the cooling device state should be reset to 0.
However, because cur_temp(57) < switch_on_temp(70)
last_temp(54) < switch_on_temp(70) ----> update = false,
update is false, the cooling device state can not be reset.
Using the observation that tz->passive can also be regarded as the
temperature status, set the update flag to the tz->passive value.
When the temperature drops below switch_on for the first time, the
states of cooling devices can be reset once, and tz->passive is updated
to 0. In the next round, because tz->passive is 0, cdev->state will not
be updated.
By using the tz->passive value as the "update" flag, the issue above
can be solved, and the cooling devices can be updated only once when the
temperature is low.
Fixes: 0952177f2a1f ("thermal/core/power_allocator: Update once cooling devices when temp is low")
Cc: 5.13+ <stable(a)vger.kernel.org> # 5.13+
Suggested-by: Wei Wang <wvw(a)google.com>
Signed-off-by: Di Shen <di.shen(a)unisoc.com>
Reviewed-by: Lukasz Luba <lukasz.luba(a)arm.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 7b6aa265ff6a..81e061f183ad 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -762,7 +762,7 @@ static int power_allocator_throttle(struct thermal_zone_device *tz,
trip = params->trip_switch_on;
if (trip && tz->temperature < trip->temperature) {
- update = tz->last_temperature >= trip->temperature;
+ update = tz->passive;
tz->passive = 0;
reset_pid_controller(params);
allow_maximum_power(tz, update);