commit 3cad1bc010416c6dd780643476bc59ed742436b9 upstream.
When fcntl_setlk() races with close(), it removes the created lock with
do_lock_file_wait().
However, LSMs can allow the first do_lock_file_wait() that created the lock
while denying the second do_lock_file_wait() that tries to remove the lock.
In theory (but AFAIK not in practice), posix_lock_file() could also fail to
remove a lock due to GFP_KERNEL allocation failure (when splitting a range
in the middle).
After the bug has been triggered, use-after-free reads will occur in
lock_get_status() when userspace reads /proc/locks. This can likely be used
to read arbitrary kernel memory, but can't corrupt kernel memory.
This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
enforcing mode and only works from some security contexts.
Fix it by calling locks_remove_posix() instead, which is designed to
reliably get rid of POSIX locks associated with the given file and
files_struct and is also used by filp_flush().
Fixes: c293621bbf67 ("[PATCH] stale POSIX lock handling")
Cc: stable(a)kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh(a)google.com>
Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@goog…
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
[stable fixup: ->c.flc_type was ->fl_type in older kernels]
Signed-off-by: Jann Horn <jannh(a)google.com>
---
fs/locks.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index fb717dae9029..31659a2d9862 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2381,8 +2381,9 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
error = do_lock_file_wait(filp, cmd, file_lock);
/*
- * Attempt to detect a close/fcntl race and recover by releasing the
- * lock that was just acquired. There is no need to do that when we're
+ * Detect close/fcntl races and recover by zapping all POSIX locks
+ * associated with this file and our files_struct, just like on
+ * filp_flush(). There is no need to do that when we're
* unlocking though, or for OFD locks.
*/
if (!error && file_lock->fl_type != F_UNLCK &&
@@ -2397,9 +2398,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
f = files_lookup_fd_locked(files, fd);
spin_unlock(&files->file_lock);
if (f != filp) {
- file_lock->fl_type = F_UNLCK;
- error = do_lock_file_wait(filp, cmd, file_lock);
- WARN_ON_ONCE(error);
+ locks_remove_posix(filp, files);
error = -EBADF;
}
}
base-commit: 2eaf5c0d81911ba05bace3a722cbcd708fdbbcba
--
2.45.2.1089.g2a221341d9-goog
Unaccepted memory is considered unusable free memory, which is not
counted as free on the zone watermark check. This causes
get_page_from_freelist() to accept more memory to hit the high
watermark, but it creates problems in the reclaim path.
The reclaim path encounters a failed zone watermark check and attempts
to reclaim memory. This is usually successful, but if there is little or
no reclaimable memory, it can result in endless reclaim with little to
no progress. This can occur early in the boot process, just after start
of the init process when the only reclaimable memory is the page cache
of the init executable and its libraries.
To address this issue, teach shrink_node() and shrink_zones() to accept
memory before attempting to reclaim.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Jianxiong Gao <jxgao(a)google.com>
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Cc: stable(a)vger.kernel.org # v6.5+
---
mm/internal.h | 9 +++++++++
mm/page_alloc.c | 8 +-------
mm/vmscan.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index cc2c5e07fad3..ea55cbad061f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1515,4 +1515,13 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;
+#ifdef CONFIG_UNACCEPTED_MEMORY
+bool try_to_accept_memory(struct zone *zone, unsigned int order);
+#else
+static inline bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+ return false;
+}
+#endif /* CONFIG_UNACCEPTED_MEMORY */
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..9a108c92245f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -287,7 +287,6 @@ EXPORT_SYMBOL(nr_online_nodes);
static bool page_contains_unaccepted(struct page *page, unsigned int order);
static void accept_page(struct page *page, unsigned int order);
-static bool try_to_accept_memory(struct zone *zone, unsigned int order);
static inline bool has_unaccepted_memory(void);
static bool __free_unaccepted(struct page *page);
@@ -6940,7 +6939,7 @@ static bool try_to_accept_memory_one(struct zone *zone)
return true;
}
-static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+bool try_to_accept_memory(struct zone *zone, unsigned int order)
{
long to_accept;
int ret = false;
@@ -6999,11 +6998,6 @@ static void accept_page(struct page *page, unsigned int order)
{
}
-static bool try_to_accept_memory(struct zone *zone, unsigned int order)
-{
- return false;
-}
-
static inline bool has_unaccepted_memory(void)
{
return false;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..b2af1263b1bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5900,12 +5900,44 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}
+#ifdef CONFIG_UNACCEPTED_MEMORY
+static bool node_try_to_accept_memory(pg_data_t *pgdat, struct scan_control *sc)
+{
+ bool progress = false;
+ struct zone *zone;
+ int z;
+
+ for (z = 0; z <= sc->reclaim_idx; z++) {
+ zone = pgdat->node_zones + z;
+ if (!managed_zone(zone))
+ continue;
+
+ if (try_to_accept_memory(zone, sc->order))
+ progress = true;
+ }
+
+ return progress;
+}
+#else
+static inline bool node_try_to_accept_memory(pg_data_t *pgdat,
+ struct scan_control *sc)
+{
+ return false;
+}
+#endif /* CONFIG_UNACCEPTED_MEMORY */
+
static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed;
struct lruvec *target_lruvec;
bool reclaimable = false;
+ /* Try to accept memory before going for reclaim */
+ if (node_try_to_accept_memory(pgdat, sc)) {
+ if (!should_continue_reclaim(pgdat, 0, sc))
+ return;
+ }
+
if (lru_gen_enabled() && root_reclaim(sc)) {
lru_gen_shrink_node(pgdat, sc);
return;
@@ -6118,6 +6150,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
GFP_KERNEL | __GFP_HARDWALL))
continue;
+ /* Try to accept memory before going for reclaim */
+ if (try_to_accept_memory(zone, sc->order))
+ continue;
+
/*
* If we already have plenty of memory free for
* compaction in this zone, don't free any more.
--
2.43.0
From: Gabriel Krisman Bertazi <krisman(a)collabora.com>
[ Upstream commit 124e7c61deb27d758df5ec0521c36cf08d417f7a ]
ext4_abort will eventually call ext4_errno_to_code, which translates the
errno to an EXT4_ERR specific error. This means that ext4_abort expects
an errno. By using EXT4_ERR_ here, it gets misinterpreted (as an errno),
and ends up saving EXT4_ERR_EBUSY on the superblock during an abort,
which makes no sense.
ESHUTDOWN will get properly translated to EXT4_ERR_SHUTDOWN, so use that
instead.
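As a rough illustration of the mix-up (a userspace sketch, not the ext4
code): the numeric values below are assumptions meant to mirror the Linux
errno numbers and the EXT4_ERR_* codes, and all SKETCH_* names and
sketch_errno_to_code() are hypothetical stand-ins for the real translation
helper.

```c
/* Assumed Linux errno values (hypothetical local names). */
enum { SKETCH_EBUSY = 16, SKETCH_ESHUTDOWN = 108 };

/* Assumed on-disk error codes, modelled on the EXT4_ERR_* definitions. */
enum {
	SKETCH_EXT4_ERR_UNKNOWN   = 1,
	SKETCH_EXT4_ERR_EBUSY     = 13,
	SKETCH_EXT4_ERR_ESHUTDOWN = 16,
};

/* Translate an errno into an on-disk code; the input must be an errno. */
static int sketch_errno_to_code(int errno_val)
{
	switch (errno_val) {
	case SKETCH_EBUSY:
		return SKETCH_EXT4_ERR_EBUSY;
	case SKETCH_ESHUTDOWN:
		return SKETCH_EXT4_ERR_ESHUTDOWN;
	default:
		return SKETCH_EXT4_ERR_UNKNOWN;
	}
}
```

Under these assumed values, passing the on-disk code
SKETCH_EXT4_ERR_ESHUTDOWN (16) where an errno is expected is read as
EBUSY (16) and comes back as SKETCH_EXT4_ERR_EBUSY, while passing
SKETCH_ESHUTDOWN comes back as SKETCH_EXT4_ERR_ESHUTDOWN.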
Signed-off-by: Gabriel Krisman Bertazi <krisman(a)collabora.com>
Link: https://lore.kernel.org/r/20211026173302.84000-1-krisman@collabora.com
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com>
---
fs/ext4/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 160e5824948270..0e8406f5bf0aa0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5820,7 +5820,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
}
if (ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
- ext4_abort(sb, EXT4_ERR_ESHUTDOWN, "Abort forced by user");
+ ext4_abort(sb, ESHUTDOWN, "Abort forced by user");
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
--
cgit 1.2.3-korg
From: Jason Xing <kernelxing(a)tencent.com>
[ Upstream commit 6648e613226e18897231ab5e42ffc29e63fa3365 ]
Fix NULL pointer data-races in sk_psock_skb_ingress_enqueue() which
syzbot reported [1].
[1]
BUG: KCSAN: data-race in sk_psock_drop / sk_psock_skb_ingress_enqueue
write to 0xffff88814b3278b8 of 8 bytes by task 10724 on cpu 1:
sk_psock_stop_verdict net/core/skmsg.c:1257 [inline]
sk_psock_drop+0x13e/0x1f0 net/core/skmsg.c:843
sk_psock_put include/linux/skmsg.h:459 [inline]
sock_map_close+0x1a7/0x260 net/core/sock_map.c:1648
unix_release+0x4b/0x80 net/unix/af_unix.c:1048
__sock_release net/socket.c:659 [inline]
sock_close+0x68/0x150 net/socket.c:1421
__fput+0x2c1/0x660 fs/file_table.c:422
__fput_sync+0x44/0x60 fs/file_table.c:507
__do_sys_close fs/open.c:1556 [inline]
__se_sys_close+0x101/0x1b0 fs/open.c:1541
__x64_sys_close+0x1f/0x30 fs/open.c:1541
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
read to 0xffff88814b3278b8 of 8 bytes by task 10713 on cpu 0:
sk_psock_data_ready include/linux/skmsg.h:464 [inline]
sk_psock_skb_ingress_enqueue+0x32d/0x390 net/core/skmsg.c:555
sk_psock_skb_ingress_self+0x185/0x1e0 net/core/skmsg.c:606
sk_psock_verdict_apply net/core/skmsg.c:1008 [inline]
sk_psock_verdict_recv+0x3e4/0x4a0 net/core/skmsg.c:1202
unix_read_skb net/unix/af_unix.c:2546 [inline]
unix_stream_read_skb+0x9e/0xf0 net/unix/af_unix.c:2682
sk_psock_verdict_data_ready+0x77/0x220 net/core/skmsg.c:1223
unix_stream_sendmsg+0x527/0x860 net/unix/af_unix.c:2339
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x140/0x180 net/socket.c:745
____sys_sendmsg+0x312/0x410 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x1e9/0x280 net/socket.c:2667
__do_sys_sendmsg net/socket.c:2676 [inline]
__se_sys_sendmsg net/socket.c:2674 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2674
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
value changed: 0xffffffff83d7feb0 -> 0x0000000000000000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10713 Comm: syz-executor.4 Tainted: G W 6.8.0-syzkaller-08951-gfe46a7dd189e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Prior to this, commit 4cd12c6065df ("bpf, sockmap: Fix NULL pointer
dereference in sk_psock_verdict_data_ready()") fixed one NULL pointer
dereference that was similarly caused by saved_data_ready being unprotected.
Here a different caller triggers the same issue for the same reason, so
we should protect it with the sk_callback_lock read lock, because the writer
side in sk_psock_drop() uses "write_lock_bh(&sk->sk_callback_lock);".
To avoid errors that could happen in the future, I move those two lock/unlock
pairs into sk_psock_data_ready(), as suggested by John Fastabend.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: syzbot+aa8c8ec2538929f18f2d(a)syzkaller.appspotmail.com
Signed-off-by: Jason Xing <kernelxing(a)tencent.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend(a)gmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=aa8c8ec2538929f18f2d
Link: https://lore.kernel.org/all/20240329134037.92124-1-kerneljasonxing@gmail.com
Link: https://lore.kernel.org/bpf/20240404021001.94815-1-kerneljasonxing@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[Ashwin: Regenerated the patch for v5.10]
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat(a)broadcom.com>
---
include/linux/skmsg.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 1138dd3071db..a197c9a49e97 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -406,10 +406,12 @@ static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock)
static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock)
{
+ read_lock_bh(&sk->sk_callback_lock);
if (psock->parser.enabled)
psock->parser.saved_data_ready(sk);
else
sk->sk_data_ready(sk);
+ read_unlock_bh(&sk->sk_callback_lock);
}
static inline void psock_set_prog(struct bpf_prog **pprog,
--
2.45.1
From: Daniel Borkmann <daniel(a)iogearbox.net>
[ Upstream commit cfa1a2329a691ffd991fcf7248a57d752e712881 ]
The BPF ring buffer internally is implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters: consumer_pos is the
consumer counter, showing the logical position up to which the consumer has
consumed the data, and producer_pos is the producer counter, denoting the
amount of data reserved by all producers.
Each time a record is reserved, the producer that "owns" the record
advances the producer counter. In user space, each time a record is read,
the consumer of the data advances the consumer counter once it has finished
processing. Both counters are stored in separate pages so that from user
space, the producer counter is read-only and the consumer counter is read-write.
One aspect that simplifies and thus speeds up the implementation of both
producers and consumers is how the data area is mapped twice contiguously
back-to-back in the virtual memory, so that no special measures are needed
for samples that have to wrap around at the end of the circular buffer data
area, because the next page after the last data page is the first data page
again, and thus the sample will still appear completely contiguous in virtual
memory.
Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
book-keeping the length and offset, and is inaccessible to the BPF program.
Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
possible to make a second allocated memory chunk overlapping with the first
chunk and as a result, the BPF program is now able to edit first chunk's
header.
For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, let's
allocate a chunk B with size 0x3000. This will succeed because consumer_pos
was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
page and could cause a crash.
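The overlap arithmetic can be checked with a small userspace sketch. It is
not the kernel code: the SKETCH_* constants hard-code the 0x4000 map size
and 8-byte header from the example above, and sketch_data_off() and
sketch_payload_hits_header() are hypothetical helpers that only model how
logical positions fold onto the doubly mapped data pages.

```c
#include <stdbool.h>

#define SKETCH_RB_SIZE 0x4000UL               /* ring size from the example */
#define SKETCH_RB_MASK (SKETCH_RB_SIZE - 1)
#define SKETCH_HDR_SZ  8UL                    /* size of the record header */

/* Offset into the data pages that back a logical position. */
static unsigned long sketch_data_off(unsigned long pos)
{
	return pos & SKETCH_RB_MASK;
}

/*
 * Return true if any byte of the writable payload [start, end) lands on
 * the header that starts at logical position hdr_pos.
 */
static bool sketch_payload_hits_header(unsigned long start, unsigned long end,
				       unsigned long hdr_pos)
{
	unsigned long hdr_off = sketch_data_off(hdr_pos);
	unsigned long pos;

	for (pos = start; pos < end; pos++) {
		unsigned long off = sketch_data_off(pos);

		if (off >= hdr_off && off < hdr_off + SKETCH_HDR_SZ)
			return true;
	}
	return false;
}
```

With chunk A's header at position 0x0, chunk B's payload range starting
at 0x3010 and ending at 0x6010 reaches position 0x4000, which folds back
to offset 0x0, so the helper reports a hit. Chunk A's own payload does
not touch its header.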
Fix it by calculating the oldest pending_pos and checking whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before/after the fix and while it seems a bit slower on some benchmarks, the
difference is not significant enough to matter.
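A minimal userspace sketch of the two space checks follows, assuming the
positions are plain unsigned long values as in the description.
sketch_old_check() and sketch_new_check() are hypothetical names, and the
scan that advances pending_pos past already committed records is left
out.

```c
#include <stdbool.h>

/* Old check: only the distance to the consumer position is bounded. */
static bool sketch_old_check(unsigned long cons_pos, unsigned long prod_pos,
			     unsigned long len, unsigned long mask)
{
	unsigned long new_prod_pos = prod_pos + len;

	return new_prod_pos - cons_pos <= mask;
}

/*
 * New check: the span from the oldest not yet committed record
 * (pend_pos) to the new producer position is bounded as well.
 */
static bool sketch_new_check(unsigned long cons_pos, unsigned long pend_pos,
			     unsigned long prod_pos, unsigned long len,
			     unsigned long mask)
{
	unsigned long new_prod_pos = prod_pos + len;

	if (new_prod_pos - cons_pos > mask)
		return false;
	return new_prod_pos - pend_pos <= mask;
}
```

Take the scenario from the description: mask 0x3fff, consumer_pos set to
0x3000, chunk A still uncommitted (so the pending position stays at 0),
producer position 0x3008, and a second reservation of 0x3008 bytes
including the header. The old check lets it through because
0x6010 - 0x3000 = 0x3010 fits within the mask. The new check rejects it
because 0x6010 - 0 does not.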
Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan(a)starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
Signed-off-by: Dominique Martinet <dominique.martinet(a)atmark-techno.com>
---
The only conflict with the patch was in the comment at top of the patch
(the commit that had changed this comment, 583c1f420173 ("bpf: Define
new BPF_MAP_TYPE_USER_RINGBUF map type"), has nothing to do with this
fix), so I went ahead with it.
I'm not familiar with the ringbuf code but it doesn't look too wrong to
me at first glance; and with this all stable branches are covered.
kernel/bpf/ringbuf.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 1e4bf23528a3..eac0026e2fa6 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -41,9 +41,12 @@ struct bpf_ringbuf {
* mapping consumer page as r/w, but restrict producer page to r/o.
* This protects producer position from being modified by user-space
* application and ruining in-kernel position tracking.
+ * Note that the pending counter is placed in the same
+ * page as the producer, so that it shares the same cache line.
*/
unsigned long consumer_pos __aligned(PAGE_SIZE);
unsigned long producer_pos __aligned(PAGE_SIZE);
+ unsigned long pending_pos;
char data[] __aligned(PAGE_SIZE);
};
@@ -145,6 +148,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
rb->mask = data_sz - 1;
rb->consumer_pos = 0;
rb->producer_pos = 0;
+ rb->pending_pos = 0;
return rb;
}
@@ -323,9 +327,9 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
{
- unsigned long cons_pos, prod_pos, new_prod_pos, flags;
- u32 len, pg_off;
+ unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
struct bpf_ringbuf_hdr *hdr;
+ u32 len, pg_off, tmp_size, hdr_len;
if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
return NULL;
@@ -343,13 +347,29 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
spin_lock_irqsave(&rb->spinlock, flags);
}
+ pend_pos = rb->pending_pos;
prod_pos = rb->producer_pos;
new_prod_pos = prod_pos + len;
- /* check for out of ringbuf space by ensuring producer position
- * doesn't advance more than (ringbuf_size - 1) ahead
+ while (pend_pos < prod_pos) {
+ hdr = (void *)rb->data + (pend_pos & rb->mask);
+ hdr_len = READ_ONCE(hdr->len);
+ if (hdr_len & BPF_RINGBUF_BUSY_BIT)
+ break;
+ tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
+ tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
+ pend_pos += tmp_size;
+ }
+ rb->pending_pos = pend_pos;
+
+ /* check for out of ringbuf space:
+ * - by ensuring producer position doesn't advance more than
+ * (ringbuf_size - 1) ahead
+ * - by ensuring oldest not yet committed record until newest
+ * record does not span more than (ringbuf_size - 1)
*/
- if (new_prod_pos - cons_pos > rb->mask) {
+ if (new_prod_pos - cons_pos > rb->mask ||
+ new_prod_pos - pend_pos > rb->mask) {
spin_unlock_irqrestore(&rb->spinlock, flags);
return NULL;
}
--
2.39.2
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 233323f9b9f828cd7cd5145ad811c1990b692542
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024071528-foothill-overdraft-d69a@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
233323f9b9f8 ("ACPI: processor_idle: Fix invalid comparison with insertion sort for latency")
0e6078c3c673 ("ACPI: processor idle: Use swap() instead of open coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 233323f9b9f828cd7cd5145ad811c1990b692542 Mon Sep 17 00:00:00 2001
From: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Date: Tue, 2 Jul 2024 04:56:39 +0800
Subject: [PATCH] ACPI: processor_idle: Fix invalid comparison with insertion
sort for latency
The acpi_cst_latency_cmp() comparison function currently used for
sorting C-state latencies does not satisfy transitivity, causing
incorrect sorting results.
Specifically, if there are two valid acpi_processor_cx elements A and B
and one invalid element C, it may occur that A < B, A = C, and B = C.
Sorting algorithms assume that if A < B and A = C, then C < B, leading
to incorrect ordering.
Given the small size of the array (<=8), we replace the library sort
function with a simple insertion sort that properly ignores invalid
elements and sorts valid ones based on latency. This change ensures
correct ordering of the C-state latencies.
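The following userspace sketch illustrates both points under simplifying
assumptions: struct sketch_cx is a hypothetical two-field stand-in for
struct acpi_processor_cx, sketch_old_cmp() mirrors the removed comparator,
and sketch_latency_sort() follows the same skip-invalid insertion sort
idea using size_t indices.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct acpi_processor_cx. */
struct sketch_cx {
	bool valid;
	unsigned int latency;
};

/* Old-style comparator: any pair with an invalid element compares equal. */
static int sketch_old_cmp(const struct sketch_cx *x, const struct sketch_cx *y)
{
	if (!(x->valid && y->valid))
		return 0;
	if (x->latency > y->latency)
		return 1;
	if (x->latency < y->latency)
		return -1;
	return 0;
}

/* Insertion sort over the latencies of valid entries only. */
static void sketch_latency_sort(struct sketch_cx *states, size_t length)
{
	size_t i, j, k;

	for (i = 1; i < length; i++) {
		if (!states[i].valid)
			continue;

		/* Walk j from i - 1 down to 0; k tracks the last valid slot. */
		for (j = i, k = i; j-- > 0; ) {
			unsigned int tmp;

			if (!states[j].valid)
				continue;

			if (states[j].latency > states[k].latency) {
				tmp = states[j].latency;
				states[j].latency = states[k].latency;
				states[k].latency = tmp;
			}
			k = j;
		}
	}
}
```

With A = {valid, 10}, B = {valid, 20} and an invalid C, the old comparator
reports A < B while both A and B compare equal to C, which is the
non-transitive case described above. The insertion sort leaves invalid
slots untouched and orders only the latencies stored in valid slots.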
Fixes: 65ea8f2c6e23 ("ACPI: processor idle: Fix up C-state latency if not ordered")
Reported-by: Julian Sikorski <belegdol(a)gmail.com>
Closes: https://lore.kernel.org/lkml/70674dc7-5586-4183-8953-8095567e73df@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Tested-by: Julian Sikorski <belegdol(a)gmail.com>
Cc: All applicable <stable(a)vger.kernel.org>
Link: https://patch.msgid.link/20240701205639.117194-1-visitorckw@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index bd6a7857ce05..831fa4a12159 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -16,7 +16,6 @@
#include <linux/acpi.h>
#include <linux/dmi.h>
#include <linux/sched.h> /* need_resched() */
-#include <linux/sort.h>
#include <linux/tick.h>
#include <linux/cpuidle.h>
#include <linux/cpu.h>
@@ -386,25 +385,24 @@ static void acpi_processor_power_verify_c3(struct acpi_processor *pr,
acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_RLD, 1);
}
-static int acpi_cst_latency_cmp(const void *a, const void *b)
+static void acpi_cst_latency_sort(struct acpi_processor_cx *states, size_t length)
{
- const struct acpi_processor_cx *x = a, *y = b;
+ int i, j, k;
- if (!(x->valid && y->valid))
- return 0;
- if (x->latency > y->latency)
- return 1;
- if (x->latency < y->latency)
- return -1;
- return 0;
-}
-static void acpi_cst_latency_swap(void *a, void *b, int n)
-{
- struct acpi_processor_cx *x = a, *y = b;
+ for (i = 1; i < length; i++) {
+ if (!states[i].valid)
+ continue;
- if (!(x->valid && y->valid))
- return;
- swap(x->latency, y->latency);
+ for (j = i - 1, k = i; j >= 0; j--) {
+ if (!states[j].valid)
+ continue;
+
+ if (states[j].latency > states[k].latency)
+ swap(states[j].latency, states[k].latency);
+
+ k = j;
+ }
+ }
}
static int acpi_processor_power_verify(struct acpi_processor *pr)
@@ -449,10 +447,7 @@ static int acpi_processor_power_verify(struct acpi_processor *pr)
if (buggy_latency) {
pr_notice("FW issue: working around C-state latencies out of order\n");
- sort(&pr->power.states[1], max_cstate,
- sizeof(struct acpi_processor_cx),
- acpi_cst_latency_cmp,
- acpi_cst_latency_swap);
+ acpi_cst_latency_sort(&pr->power.states[1], max_cstate);
}
lapic_timer_propagate_broadcast(pr);
On Tue, Sep 26, 2023 at 9:09 AM Masahiro Yamada <masahiroy(a)kernel.org> wrote:
>
> The 32-bit ARM kernel stops working if the kernel grows to the point
> where veneers for __get_user_* are created.
>
> AAPCS32 [1] states, "Register r12 (IP) may be used by a linker as a
> scratch register between a routine and any subroutine it calls. It
> can also be used within a routine to hold intermediate values between
> subroutine calls."
>
> However, bl instructions buried within the inline asm are unpredictable
> for compilers; hence, "ip" must be added to the clobber list.
>
> This becomes critical when veneers for __get_user_* are created because
> veneers use the ip register since commit 02e541db0540 ("ARM: 8323/1:
> force linker to use PIC veneers").
>
> [1]: https://github.com/ARM-software/abi-aa/blob/2023Q1/aapcs32/aapcs32.rst
>
> Signed-off-by: Masahiro Yamada <masahiroy(a)kernel.org>
> Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org>
+ stable(a)vger.kernel.org
It seems like this (commit 24d3ba0a7b44c1617c27f5045eecc4f34752ab03
upstream) would be a good candidate for -stable?
The issue it fixes can manifest in lots of very strange ways, so it
would be good to avoid others getting tripped up by it on -stable
branches.
(Apologies for being a bit verbose in the following, I've included a
lot of details and breadcrumbs so others might find this if they run
into the same issues.)
I was recently looking into an arm32 issue, and found getting a custom
built kernel consistently working in qemu-system-arm to bisect issues
in the range of 5.15-6.6 was a bit difficult, as I would hit a couple
different odd errors.
For 5.15 I was seeing systemd fail to start in a fairly opaque way:
starting systemd-udevd.service - Rule-based Manager for Device
Events and Files.
systemd-udevd.service: Main process exited, code=exited, status=1/FAILURE
systemd-udevd.service: Failed with result 'exit-code'.
Failed to start systemd-udevd.service - Rule-based Manager for
Device Events and Files.
But further looking through the logs I found:
systemd[1]: Failed to open netlink: Operation not permitted
Despite lots of digging to try to understand what was going wrong, the
one thing that worked was switching to CONFIG_CC_OPTIMIZE_FOR_SIZE
(which I only tried as I came across this old thread:
https://lists.yoctoproject.org/g/linux-yocto/message/8035 ). This
seemed very suspicious, but I didn't have a lot of time to dig
further.
That resolved things until ~6.1, where I started seeing crashes at init:
[ 16.982562] Run /init as init process
[ 16.989311] Failed to execute /init (error -22)
[ 16.990017] Run /sbin/init as init process
[ 16.994737] Starting init: /sbin/init exists but couldn't execute
it (error -22)
I bisected that failure down to being supposedly caused by commit
5750121ae738 ("kbuild: list sub-directories in ./Kbuild")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
And searching around that commit luckily led me to this change, which
finally seems to resolve the different issues I saw for 6.6, 6.1 and
5.15!
Now, in my rush to get something booting with qemu, I started with the
debian config but disabled modules, and didn't put much time into
getting rid of config options or drivers I wouldn't need. So the
kernel is pretty large. So maybe not super common, but I definitely
wouldn't want others to have to go down this debugging rabbit hole.
thanks
-john