Syzkaller reports suspicious RCU usage in xfrm_set_default in 5.10 stable
releases. The problem has been fixed by the following patch which can be
cleanly applied to 5.10 branch.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 544d163d659d45a206d8929370d5a2984e546cb7 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 12 Jan 2023 13:08:56 +0000
Subject: [PATCH] io_uring: lock overflowing for IOPOLL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8227af2e1c0f..9c3ddd46a1ad 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1062,7 +1062,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
continue;
req->cqe.flags = io_put_kbuf(req, 0);
- io_fill_cqe_req(req->ctx, req);
+ if (unlikely(!__io_fill_cqe_req(ctx, req))) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ }
}
if (unlikely(!nr_events))
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
6e5aedb9324a ("io_uring/poll: attempt request issue after racy poll wakeup")
443e57550670 ("io_uring: combine poll tw handlers")
047b6aef0966 ("io_uring: remove ctx variable in io_poll_check_events")
9b8c54755a2b ("io_uring: add io_aux_cqe which allows deferred completion")
973fc83f3a94 ("io_uring: defer all io_req_complete_failed")
c06c6c5d2767 ("io_uring: always lock in io_apoll_task_func")
1bec951c3809 ("io_uring: iopoll protect complete_post")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
e276ae344a77 ("io_uring: hold locks for io_req_complete_failed")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
cd42a53d25d4 ("io_uring/poll: remove outdated comments of caching")
e2ad599d1ed3 ("io_uring: allow multishot recv CQEs to overflow")
515e26961295 ("io_uring: revert "io_uring fix multishot accept ordering"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6e5aedb9324aab1c14a23fae3d8eeb64a679c20e Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 10 Jan 2023 10:44:37 -0700
Subject: [PATCH] io_uring/poll: attempt request issue after racy poll wakeup
If we have multiple requests waiting on the same target poll waitqueue,
then it's quite possible to get a request triggered and get disappointed
in not being able to make any progress with it. If we race in doing so,
we'll potentially leave the poll request on the internal tables, but
removed from the waitqueue. That means that any subsequent trigger of
the poll waitqueue will not kick that request into action, causing an
application to potentially wait for completion of a request that will
never happen.
Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather
than have complicated logic for how to re-arm a given type of request,
just punt it for a reissue.
While in there, move the 'ret' variable to the only section where it
gets used. This avoids confusion the scope of it.
Cc: stable(a)vger.kernel.org
Fixes: eb0089d629ba ("io_uring: single shot poll removal optimisation")
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/poll.c b/io_uring/poll.c
index cf6a70bd54e0..32e5fc8365e6 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -223,21 +223,22 @@ enum {
IOU_POLL_DONE = 0,
IOU_POLL_NO_ACTION = 1,
IOU_POLL_REMOVE_POLL_USE_RES = 2,
+ IOU_POLL_REISSUE = 3,
};
/*
* All poll tw should go through this. Checks for poll events, manages
* references, does rewait, etc.
*
- * Returns a negative error on failure. IOU_POLL_NO_ACTION when no action require,
- * which is either spurious wakeup or multishot CQE is served.
- * IOU_POLL_DONE when it's done with the request, then the mask is stored in req->cqe.res.
- * IOU_POLL_REMOVE_POLL_USE_RES indicates to remove multishot poll and that the result
- * is stored in req->cqe.
+ * Returns a negative error on failure. IOU_POLL_NO_ACTION when no action
+ * require, which is either spurious wakeup or multishot CQE is served.
+ * IOU_POLL_DONE when it's done with the request, then the mask is stored in
+ * req->cqe.res. IOU_POLL_REMOVE_POLL_USE_RES indicates to remove multishot
+ * poll and that the result is stored in req->cqe.
*/
static int io_poll_check_events(struct io_kiocb *req, bool *locked)
{
- int v, ret;
+ int v;
/* req->task == current here, checking PF_EXITING is safe */
if (unlikely(req->task->flags & PF_EXITING))
@@ -276,10 +277,15 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
if (!req->cqe.res) {
struct poll_table_struct pt = { ._key = req->apoll_events };
req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
+ /*
+ * We got woken with a mask, but someone else got to
+ * it first. The above vfs_poll() doesn't add us back
+ * to the waitqueue, so if we get nothing back, we
+ * should be safe and attempt a reissue.
+ */
+ if (unlikely(!req->cqe.res))
+ return IOU_POLL_REISSUE;
}
-
- if ((unlikely(!req->cqe.res)))
- continue;
if (req->apoll_events & EPOLLONESHOT)
return IOU_POLL_DONE;
@@ -294,7 +300,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
return IOU_POLL_REMOVE_POLL_USE_RES;
}
} else {
- ret = io_poll_issue(req, locked);
+ int ret = io_poll_issue(req, locked);
if (ret == IOU_STOP_MULTISHOT)
return IOU_POLL_REMOVE_POLL_USE_RES;
if (ret < 0)
@@ -330,6 +336,9 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
poll = io_kiocb_to_cmd(req, struct io_poll);
req->cqe.res = mangle_poll(req->cqe.res & poll->events);
+ } else if (ret == IOU_POLL_REISSUE) {
+ io_req_task_submit(req, locked);
+ return;
} else if (ret != IOU_POLL_REMOVE_POLL_USE_RES) {
req->cqe.res = ret;
req_set_fail(req);
@@ -342,7 +351,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
io_req_task_complete(req, locked);
- else if (ret == IOU_POLL_DONE)
+ else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
io_req_task_submit(req, locked);
else
io_req_defer_failed(req, ret);
Hi! Since kernel 6.0.18 resume from hibernate fails, the system hangs
and a hard reset is necessary. The CPU is a Ryzen 5600G, the system is
Linux From Scratch-11.1.
I found this in the system log:
[...]
Jan 12 19:30:03 LUX kernel: [ 50.248036] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:67:crtc-0] flip_done timed out
Jan 12 19:30:03 LUX kernel: [ 50.248040] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:70:crtc-1] flip_done timed out
Jan 12 19:30:14 LUX kernel: [ 60.488034] amdgpu 0000:30:00.0: [drm]
*ERROR* flip_done timed out
Jan 12 19:30:14 LUX kernel: [ 60.488040] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:67:crtc-0] commit wait timed out
^@^@^@^@^@^@^@^@^@^[...]@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 12 19:31:20 LUX
syslogd 1.5.1: restart.
[...]
Bisecting the problem turned up this:
~/Downloads/linux-stable-BLFS-11.1> git bisect bad
306df163069e78160e7a534b892c5cd6fefdd537 is the first bad commit
commit 306df163069e78160e7a534b892c5cd6fefdd537
Author: Alex Deucher <alexander.deucher(a)amd.com>
Date: Wed Dec 7 11:08:53 2022 -0500
drm/amdgpu: make display pinning more flexible (v2)
commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb upstream.
Only apply the static threshold for Stoney and Carrizo.
This hardware has certain requirements that don't allow
mixing of GTT and VRAM. Newer asics do not have these
requirements so we should be able to be more flexible
with where buffers end up.
[...]
Let me know if you need more info. Thanks.
Rainer Fiebig
--
The truth always turns out to be simpler than you thought.
Richard Feynman
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 52531258318ed59a2dc5a43df2eaf0eb1d65438e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 16 Dec 2022 15:33:55 -0800
Subject: [PATCH] drm/virtio: Fix GEM handle creation UAF
Userspace can guess the handle value and try to race GEM object creation
with handle close, resulting in a use-after-free if we dereference the
object after dropping the handle's reference. For that reason, dropping
the handle's reference must be done *after* we are done dereferencing
the object.
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Reviewed-by: Chia-I Wu <olvaffe(a)gmail.com>
Fixes: 62fb7a5e1096 ("virtio-gpu: add 3d/virgl support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Osipenko <dmitry.osipenko(a)collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216233355.542197-2-robdc…
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 5d05093014ac..9f4a90493aea 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -358,10 +358,18 @@ static int virtio_gpu_resource_create_ioctl(struct drm_device *dev, void *data,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc->res_handle = qobj->hw_res_handle; /* similiar to a VM address */
rc->bo_handle = handle;
+
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
@@ -723,11 +731,18 @@ static int virtio_gpu_resource_create_blob_ioctl(struct drm_device *dev,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc_blob->res_handle = bo->hw_res_handle;
rc_blob->bo_handle = handle;
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 52531258318ed59a2dc5a43df2eaf0eb1d65438e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 16 Dec 2022 15:33:55 -0800
Subject: [PATCH] drm/virtio: Fix GEM handle creation UAF
Userspace can guess the handle value and try to race GEM object creation
with handle close, resulting in a use-after-free if we dereference the
object after dropping the handle's reference. For that reason, dropping
the handle's reference must be done *after* we are done dereferencing
the object.
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Reviewed-by: Chia-I Wu <olvaffe(a)gmail.com>
Fixes: 62fb7a5e1096 ("virtio-gpu: add 3d/virgl support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Osipenko <dmitry.osipenko(a)collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216233355.542197-2-robdc…
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 5d05093014ac..9f4a90493aea 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -358,10 +358,18 @@ static int virtio_gpu_resource_create_ioctl(struct drm_device *dev, void *data,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc->res_handle = qobj->hw_res_handle; /* similiar to a VM address */
rc->bo_handle = handle;
+
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
@@ -723,11 +731,18 @@ static int virtio_gpu_resource_create_blob_ioctl(struct drm_device *dev,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc_blob->res_handle = bo->hw_res_handle;
rc_blob->bo_handle = handle;
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 52531258318ed59a2dc5a43df2eaf0eb1d65438e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 16 Dec 2022 15:33:55 -0800
Subject: [PATCH] drm/virtio: Fix GEM handle creation UAF
Userspace can guess the handle value and try to race GEM object creation
with handle close, resulting in a use-after-free if we dereference the
object after dropping the handle's reference. For that reason, dropping
the handle's reference must be done *after* we are done dereferencing
the object.
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Reviewed-by: Chia-I Wu <olvaffe(a)gmail.com>
Fixes: 62fb7a5e1096 ("virtio-gpu: add 3d/virgl support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Osipenko <dmitry.osipenko(a)collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216233355.542197-2-robdc…
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 5d05093014ac..9f4a90493aea 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -358,10 +358,18 @@ static int virtio_gpu_resource_create_ioctl(struct drm_device *dev, void *data,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc->res_handle = qobj->hw_res_handle; /* similiar to a VM address */
rc->bo_handle = handle;
+
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
@@ -723,11 +731,18 @@ static int virtio_gpu_resource_create_blob_ioctl(struct drm_device *dev,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc_blob->res_handle = bo->hw_res_handle;
rc_blob->bo_handle = handle;
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 52531258318ed59a2dc5a43df2eaf0eb1d65438e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 16 Dec 2022 15:33:55 -0800
Subject: [PATCH] drm/virtio: Fix GEM handle creation UAF
Userspace can guess the handle value and try to race GEM object creation
with handle close, resulting in a use-after-free if we dereference the
object after dropping the handle's reference. For that reason, dropping
the handle's reference must be done *after* we are done dereferencing
the object.
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Reviewed-by: Chia-I Wu <olvaffe(a)gmail.com>
Fixes: 62fb7a5e1096 ("virtio-gpu: add 3d/virgl support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Osipenko <dmitry.osipenko(a)collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216233355.542197-2-robdc…
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 5d05093014ac..9f4a90493aea 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -358,10 +358,18 @@ static int virtio_gpu_resource_create_ioctl(struct drm_device *dev, void *data,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc->res_handle = qobj->hw_res_handle; /* similiar to a VM address */
rc->bo_handle = handle;
+
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
@@ -723,11 +731,18 @@ static int virtio_gpu_resource_create_blob_ioctl(struct drm_device *dev,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc_blob->res_handle = bo->hw_res_handle;
rc_blob->bo_handle = handle;
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ae9dcb91c606 ("net: stmmac: add aux timestamps fifo clearance wait")
f4da56529da6 ("net: stmmac: Add support for external trigger timestamping")
8532f613bc78 ("net: stmmac: introduce MSI Interrupt routines for mac, safety, RX & TX")
7e1c520c0d20 ("net: stmmac: introduce DMA interrupt status masking per traffic direction")
341f67e424e5 ("net: stmmac: Add hardware supported cross-timestamp")
76da35dc99af ("stmmac: intel: Add PSE and PCH PTP clock source selection")
b4d45aee6635 ("net: stmmac: add platform level clocks management")
5ec55823438e ("net: stmmac: add clocks management for gmac driver")
7310fe538ea5 ("stmmac: intel: add pcs-xpcs for Intel mGbE controller")
20e07e2c3cf3 ("net: stmmac: Add PCI bus info to ethtool driver query output")
7cfc4486e7ea ("stmmac: intel: Configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing")
bff6f1db91e3 ("stmmac: intel: change all EHL/TGL to auto detect phy addr")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae9dcb91c6069e20b3b9505d79cbc89fd6e086f5 Mon Sep 17 00:00:00 2001
From: Noor Azura Ahmad Tarmizi <noor.azura.ahmad.tarmizi(a)intel.com>
Date: Wed, 11 Jan 2023 13:02:00 +0800
Subject: [PATCH] net: stmmac: add aux timestamps fifo clearance wait
Add timeout polling wait for auxiliary timestamps snapshot FIFO clear bit
(ATSFC) to clear. This is to ensure no residue fifo value is being read
erroneously.
Fixes: f4da56529da6 ("net: stmmac: Add support for external trigger timestamping")
Cc: <stable(a)vger.kernel.org> # 5.10.x
Signed-off-by: Noor Azura Ahmad Tarmizi <noor.azura.ahmad.tarmizi(a)intel.com>
Link: https://lore.kernel.org/r/20230111050200.2130-1-noor.azura.ahmad.tarmizi@in…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index fc06ddeac0d5..b4388ca8d211 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -210,7 +210,10 @@ static int stmmac_enable(struct ptp_clock_info *ptp,
}
writel(acr_value, ptpaddr + PTP_ACR);
mutex_unlock(&priv->aux_ts_lock);
- ret = 0;
+ /* wait for auxts fifo clear to finish */
+ ret = readl_poll_timeout(ptpaddr + PTP_ACR, acr_value,
+ !(acr_value & PTP_ACR_ATSFC),
+ 10, 10000);
break;
default:
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
87ca4f9efbd7 ("sched/core: Fix use-after-free bug in dup_user_cpus_ptr()")
8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
713a2e21a513 ("sched: Introduce affinity_context")
d664e399128b ("sched: Fix missing prototype warnings")
e81daa7b6489 ("sched/headers: Reorganize, clean up and optimize kernel/sched/build_utility.c dependencies")
0dda4eeb4849 ("sched/headers: Reorganize, clean up and optimize kernel/sched/build_policy.c dependencies")
c4ad6fcb67c4 ("sched/headers: Reorganize, clean up and optimize kernel/sched/fair.c dependencies")
e66f6481a8c7 ("sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies")
b9e9c6ca6e54 ("sched/headers: Standardize kernel/sched/sched.h header dependencies")
f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")
d90a2f160a1c ("sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h")
99cf983cc8bc ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
33c64734be34 ("sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY")
4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched() callers")
8a69fe0be143 ("sched/preempt: Refactor sched_dynamic_update()")
4c7485584d48 ("sched/preempt: Move PREEMPT_DYNAMIC logic later")
6ae71436cda7 ("Merge tag 'sched_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87ca4f9efbd7cc649ff43b87970888f2812945b8 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman(a)redhat.com>
Date: Fri, 30 Dec 2022 23:11:19 -0500
Subject: [PATCH] sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be
restricted on asymmetric systems"), the setting and clearing of
user_cpus_ptr are done under pi_lock for arm64 architecture. However,
dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
protection. Since sched_setaffinity() can be invoked from another
process, the process being modified may be undergoing fork() at
the same time. When racing with the clearing of user_cpus_ptr in
__set_cpus_allowed_ptr_locked(), it can lead to user-after-free and
possibly double-free in arm64 kernel.
Commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask") fixes this problem as user_cpus_ptr, once set, will never
be cleared in a task's lifetime. However, this bug was re-introduced
in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
do_set_cpus_allowed(). This time, it will affect all arches.
Fix this bug by always clearing the user_cpus_ptr of the newly
cloned/forked task before the copying process starts and check the
user_cpus_ptr state of the source task under pi_lock.
Note to stable, this patch won't be applicable to stable releases.
Just copy the new dup_user_cpus_ptr() function over.
Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: David Wang 王标 <wangbiao3(a)xiaomi.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965d813c28ad..f9f6e5413dcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2612,19 +2612,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
int node)
{
+ cpumask_t *user_mask;
unsigned long flags;
- if (!src->user_cpus_ptr)
+ /*
+ * Always clear dst->user_cpus_ptr first as their user_cpus_ptr's
+ * may differ by now due to racing.
+ */
+ dst->user_cpus_ptr = NULL;
+
+ /*
+ * This check is racy and losing the race is a valid situation.
+ * It is not worth the extra overhead of taking the pi_lock on
+ * every fork/clone.
+ */
+ if (data_race(!src->user_cpus_ptr))
return 0;
- dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
- if (!dst->user_cpus_ptr)
+ user_mask = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+ if (!user_mask)
return -ENOMEM;
- /* Use pi_lock to protect content of user_cpus_ptr */
+ /*
+ * Use pi_lock to protect content of user_cpus_ptr
+ *
+ * Though unlikely, user_cpus_ptr can be reset to NULL by a concurrent
+ * do_set_cpus_allowed().
+ */
raw_spin_lock_irqsave(&src->pi_lock, flags);
- cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ if (src->user_cpus_ptr) {
+ swap(dst->user_cpus_ptr, user_mask);
+ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ }
raw_spin_unlock_irqrestore(&src->pi_lock, flags);
+
+ if (unlikely(user_mask))
+ kfree(user_mask);
+
return 0;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
52120f0fadcb ("io_uring: add allow_overflow to io_post_aux_cqe")
e6130eba8a84 ("io_uring: add support for passing fixed file descriptors")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 544d163d659d45a206d8929370d5a2984e546cb7 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 12 Jan 2023 13:08:56 +0000
Subject: [PATCH] io_uring: lock overflowing for IOPOLL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8227af2e1c0f..9c3ddd46a1ad 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1062,7 +1062,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
continue;
req->cqe.flags = io_put_kbuf(req, 0);
- io_fill_cqe_req(req->ctx, req);
+ if (unlikely(!__io_fill_cqe_req(ctx, req))) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ }
}
if (unlikely(!nr_events))
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
88e9a5b7965c ("efi/fake_mem: arrange for a resource entry per efi_fake_mem instance")
4d0a4388ccdd ("Merge branch 'efi/urgent' into efi/core, to pick up fixes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
88e9a5b7965c ("efi/fake_mem: arrange for a resource entry per efi_fake_mem instance")
4d0a4388ccdd ("Merge branch 'efi/urgent' into efi/core, to pick up fixes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d3f450533bbc ("efi: tpm: Avoid READ_ONCE() for accessing the event log")
047d50aee341 ("efi/tpm: Don't access event->count when it isn't mapped")
c46f3405692d ("tpm: Reserve the TPM final events table")
44038bc514a2 ("tpm: Abstract crypto agile event size calculations")
c8faabfc6f48 ("tpm: add _head suffix to tcg_efi_specid_event and tcg_pcr_event2")
3425d934fc03 ("efi/x86: Handle page faults occurring while running EFI runtime services")
9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault handler")
71e0940d52e1 ("efi: honour memory reservations passed via a linux specific config table")
50a7ca3c6fc8 ("mm: convert return type of handle_mm_fault() caller to vm_fault_t")
3eb420e70d87 ("efi: Use a work queue to invoke EFI Runtime Services")
c90fca951e90 ("Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d3f450533bbcb6dd4d7d59cadc9b61b7321e4ac1 Mon Sep 17 00:00:00 2001
From: Ard Biesheuvel <ardb(a)kernel.org>
Date: Mon, 9 Jan 2023 10:44:31 +0100
Subject: [PATCH] efi: tpm: Avoid READ_ONCE() for accessing the event log
Nathan reports that recent kernels built with LTO will crash when doing
EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a
misaligned load from the TPM event log, which is annotated with
READ_ONCE(), and under LTO, this gets translated into a LDAR instruction
which does not tolerate misaligned accesses.
Interestingly, this does not happen when booting the same kernel
straight from the UEFI shell, and so the fact that the event log may
appear misaligned in memory may be caused by a bug in GRUB or SHIM.
However, using READ_ONCE() to access firmware tables is slightly unusual
in any case, and here, we only need to ensure that 'event' is not
dereferenced again after it gets unmapped, but this is already taken
care of by the implicit barrier() semantics of the early_memunmap()
call.
Cc: <stable(a)vger.kernel.org>
Cc: Peter Jones <pjones(a)redhat.com>
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Matthew Garrett <mjg59(a)srcf.ucam.org>
Reported-by: Nathan Chancellor <nathan(a)kernel.org>
Tested-by: Nathan Chancellor <nathan(a)kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1782
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/include/linux/tpm_eventlog.h b/include/linux/tpm_eventlog.h
index 20c0ff54b7a0..7d68a5cc5881 100644
--- a/include/linux/tpm_eventlog.h
+++ b/include/linux/tpm_eventlog.h
@@ -198,8 +198,8 @@ static __always_inline int __calc_tpm2_event_size(struct tcg_pcr_event2_head *ev
* The loop below will unmap these fields if the log is larger than
* one page, so save them here for reference:
*/
- count = READ_ONCE(event->count);
- event_type = READ_ONCE(event->event_type);
+ count = event->count;
+ event_type = event->event_type;
/* Verify that it's the log header */
if (event_header->pcr_idx != 0 ||
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d3f450533bbc ("efi: tpm: Avoid READ_ONCE() for accessing the event log")
047d50aee341 ("efi/tpm: Don't access event->count when it isn't mapped")
c46f3405692d ("tpm: Reserve the TPM final events table")
44038bc514a2 ("tpm: Abstract crypto agile event size calculations")
c8faabfc6f48 ("tpm: add _head suffix to tcg_efi_specid_event and tcg_pcr_event2")
3425d934fc03 ("efi/x86: Handle page faults occurring while running EFI runtime services")
9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault handler")
71e0940d52e1 ("efi: honour memory reservations passed via a linux specific config table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d3f450533bbcb6dd4d7d59cadc9b61b7321e4ac1 Mon Sep 17 00:00:00 2001
From: Ard Biesheuvel <ardb(a)kernel.org>
Date: Mon, 9 Jan 2023 10:44:31 +0100
Subject: [PATCH] efi: tpm: Avoid READ_ONCE() for accessing the event log
Nathan reports that recent kernels built with LTO will crash when doing
EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a
misaligned load from the TPM event log, which is annotated with
READ_ONCE(), and under LTO, this gets translated into a LDAR instruction
which does not tolerate misaligned accesses.
Interestingly, this does not happen when booting the same kernel
straight from the UEFI shell, and so the fact that the event log may
appear misaligned in memory may be caused by a bug in GRUB or SHIM.
However, using READ_ONCE() to access firmware tables is slightly unusual
in any case, and here, we only need to ensure that 'event' is not
dereferenced again after it gets unmapped, but this is already taken
care of by the implicit barrier() semantics of the early_memunmap()
call.
Cc: <stable(a)vger.kernel.org>
Cc: Peter Jones <pjones(a)redhat.com>
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Matthew Garrett <mjg59(a)srcf.ucam.org>
Reported-by: Nathan Chancellor <nathan(a)kernel.org>
Tested-by: Nathan Chancellor <nathan(a)kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1782
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/include/linux/tpm_eventlog.h b/include/linux/tpm_eventlog.h
index 20c0ff54b7a0..7d68a5cc5881 100644
--- a/include/linux/tpm_eventlog.h
+++ b/include/linux/tpm_eventlog.h
@@ -198,8 +198,8 @@ static __always_inline int __calc_tpm2_event_size(struct tcg_pcr_event2_head *ev
* The loop below will unmap these fields if the log is larger than
* one page, so save them here for reference:
*/
- count = READ_ONCE(event->count);
- event_type = READ_ONCE(event->event_type);
+ count = event->count;
+ event_type = event->event_type;
/* Verify that it's the log header */
if (event_header->pcr_idx != 0 ||
Hi,
On arm machine we have a compilation error while building
tools/testing/selftests/kvm/ with LTS 5.15.y kernel
rseq_test.c: In function ‘main’:
rseq_test.c:236:33: warning: implicit declaration of function ‘gettid’;
did you mean ‘getgid’? [-Wimplicit-function-declaration]
(void *)(unsigned long)gettid());
^~~~~~
getgid
/tmp/cc4Wfmv2.o: In function `main':
/home/test/linux/tools/testing/selftests/kvm/rseq_test.c:236: undefined
reference to `gettid'
collect2: error: ld returned 1 exit status
make: *** [../lib.mk:151:
/home/test/linux/tools/testing/selftests/kvm/rseq_test] Error 1
make: Leaving directory '/home/test/linux/tools/testing/selftests/kvm'
This is fixed by commit 561cafebb2cf97b0927b4fb0eba22de6200f682e upstream.
Kernel version to apply: 5.15.y LTS
Could we please backport the above fix onto 5.15.y LTS. It applies
cleanly and the kvm selftests compile properly after applying the patch.
Liam pointed the same in [1]
Thanks,
Harshit
[1]
https://lore.kernel.org/all/fdfb143a-45c4-aaff-aa95-d20c076ff555@oracle.com/
From: Denis Nikitin <denik(a)chromium.org>
commit bde971a83bbff78561458ded236605a365411b87 upstream.
Kernel build with clang and KCFLAGS=-fprofile-sample-use=<profile> fails with:
error: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.tmp.o: Unexpected SHT_REL
section ".rel.llvm.call-graph-profile"
Starting from 13.0.0 llvm can generate SHT_REL section, see
https://reviews.llvm.org/rGca3bdb57fa1ac98b711a735de048c12b5fdd8086.
gen-hyprel does not support SHT_REL relocation section.
Filter out profile use flags to fix the build with profile optimization.
Signed-off-by: Denis Nikitin <denik(a)chromium.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20221014184532.3153551-1-denik@chromium.org
Signed-off-by: Stephen Boyd <swboyd(a)chromium.org>
---
I need this to properly compile 5.15.y stable kernels in the chromeos build
system.
arch/arm64/kvm/hyp/nvhe/Makefile | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 8d741f71377f..964c2134ea1e 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -83,6 +83,10 @@ quiet_cmd_hypcopy = HYPCOPY $@
# Remove ftrace, Shadow Call Stack, and CFI CFLAGS.
# This is equivalent to the 'notrace', '__noscs', and '__nocfi' annotations.
KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_FTRACE) $(CC_FLAGS_SCS) $(CC_FLAGS_CFI), $(KBUILD_CFLAGS))
+# Starting from 13.0.0 llvm emits SHT_REL section '.llvm.call-graph-profile'
+# when profile optimization is applied. gen-hyprel does not support SHT_REL and
+# causes a build failure. Remove profile optimization flags.
+KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% -fprofile-use=%, $(KBUILD_CFLAGS))
# KVM nVHE code is run at a different exception code with a different map, so
# compiler instrumentation that inserts callbacks or checks into the code may
--
https://chromeos.dev
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
406504c7b040 ("KVM: arm64: Fix S1PTW handling on RO memslots")
c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch")
16314874b12b ("Merge branch 'kvm-arm64/misc-5.9' into kvmarm-master/next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 406504c7b0405d74d74c15a667cd4c4620c3e7a9 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 20 Dec 2022 14:03:52 +0000
Subject: [PATCH] KVM: arm64: Fix S1PTW handling on RO memslots
A recent development on the EFI front has resulted in guests having
their page tables baked in the firmware binary, and mapped into the
IPA space as part of a read-only memslot. Not only is this legitimate,
but it also results in added security, so thumbs up.
It is possible to take an S1PTW translation fault if the S1 PTs are
unmapped at stage-2. However, KVM unconditionally treats S1PTW as a
write to correctly handle hardware AF/DB updates to the S1 PTs.
Furthermore, KVM injects an exception into the guest for S1PTW writes.
In the aforementioned case this results in the guest taking an abort
it won't recover from, as the S1 PTs mapping the vectors suffer from
the same problem.
So clearly our handling is... wrong.
Instead, switch to a two-pronged approach:
- On S1PTW translation fault, handle the fault as a read
- On S1PTW permission fault, handle the fault as a write
This is of no consequence to SW that *writes* to its PTs (the write
will trigger a non-S1PTW fault), and SW that uses RO PTs will not
use HW-assisted AF/DB anyway, as that'd be wrong.
Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write
fault on S1PTW permission fault on instruction fetch") do we end-up
with two back-to-back faults (page being evicted and faulted back).
I don't think this is a case worth optimising for.
Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch")
Reviewed-by: Oliver Upton <oliver.upton(a)linux.dev>
Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org>
Regression-tested-by: Ard Biesheuvel <ardb(a)kernel.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 9bdba47f7e14..0d40c48d8132 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
{
- if (kvm_vcpu_abt_iss1tw(vcpu))
- return true;
+ if (kvm_vcpu_abt_iss1tw(vcpu)) {
+ /*
+ * Only a permission fault on a S1PTW should be
+ * considered as a write. Otherwise, page tables baked
+ * in a read-only memslot will result in an exception
+ * being delivered in the guest.
+ *
+ * The drawback is that we end-up faulting twice if the
+ * guest is using any of HW AF/DB: a translation fault
+ * to map the page containing the PT (read only at
+ * first), then a permission fault to allow the flags
+ * to be set.
+ */
+ switch (kvm_vcpu_trap_get_fault_type(vcpu)) {
+ case ESR_ELx_FSC_PERM:
+ return true;
+ default:
+ return false;
+ }
+ }
if (kvm_vcpu_trap_is_iabt(vcpu))
return false;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
16f1f838442d ("Revert "ALSA: usb-audio: Drop superfluous interface setup at parsing"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 16f1f838442dc6430d32d51ddda347b8421ec34b Mon Sep 17 00:00:00 2001
From: Takashi Iwai <tiwai(a)suse.de>
Date: Wed, 4 Jan 2023 16:09:44 +0100
Subject: [PATCH] Revert "ALSA: usb-audio: Drop superfluous interface setup at
parsing"
This reverts commit ac5e2fb425e1121ceef2b9d1b3ffccc195d55707.
The commit caused a regression on Behringer UMC404HD (and likely
others). As the change was meant only as a minor optimization, it's
better to revert it to address the regression.
Reported-and-tested-by: Michael Ralston <michael(a)ralston.id.au>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/CAC2975JXkS1A5Tj9b02G_sy25ZWN-ys+tc9wmkoS=qPgKCog…
Link: https://lore.kernel.org/r/20230104150944.24918-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/usb/stream.c b/sound/usb/stream.c
index f75601ca2d52..f10f4e6d3fb8 100644
--- a/sound/usb/stream.c
+++ b/sound/usb/stream.c
@@ -1222,6 +1222,12 @@ static int __snd_usb_parse_audio_interface(struct snd_usb_audio *chip,
if (err < 0)
return err;
}
+
+ /* try to set the interface... */
+ usb_set_interface(chip->dev, iface_no, 0);
+ snd_usb_init_pitch(chip, fp);
+ snd_usb_init_sample_rate(chip, fp, fp->rate_max);
+ usb_set_interface(chip->dev, iface_no, altno);
}
return 0;
}
I'm announcing the release of the 6.1.6 kernel.
All users of the 6.1 kernel series must upgrade.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/parisc/include/uapi/asm/mman.h | 29 ++---
arch/parisc/kernel/sys_parisc.c | 28 +++++
arch/parisc/kernel/syscalls/syscall.tbl | 2
arch/x86/kernel/fpu/core.c | 19 +--
arch/x86/kernel/fpu/regset.c | 2
arch/x86/kernel/fpu/signal.c | 2
arch/x86/kernel/fpu/xstate.c | 52 +++++++++-
arch/x86/kernel/fpu/xstate.h | 4
fs/nfsd/nfs4proc.c | 7 -
fs/nfsd/nfs4xdr.c | 2
init/Kconfig | 6 +
net/sched/sch_api.c | 5 +
net/sunrpc/auth_gss/svcauth_gss.c | 4
net/sunrpc/svc.c | 6 -
net/sunrpc/svc_xprt.c | 2
net/sunrpc/svcsock.c | 8 -
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2
sound/core/control.c | 24 +++-
sound/pci/hda/cs35l41_hda.c | 20 +++-
sound/pci/hda/patch_hdmi.c | 1
sound/pci/hda/patch_realtek.c | 2
tools/arch/parisc/include/uapi/asm/mman.h | 12 +-
tools/perf/bench/bench.h | 12 --
tools/testing/selftests/vm/pkey-x86.h | 12 ++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
26 files changed, 308 insertions(+), 88 deletions(-)
Adrian Chan (1):
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
Chris Chiu (1):
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
Chuck Lever (1):
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
Clement Lecigne (1):
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
Frederick Lawler (1):
net: sched: disallow noqueue for qdisc classes
Greg Kroah-Hartman (1):
Linux 6.1.6
Helge Deller (1):
parisc: Align parisc MADV_XXX constants with all other architectures
Jeremy Szu (1):
ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform
Kyle Huey (6):
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
x86/fpu: Allow PKRU to be (once again) written by ptrace.
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Linus Torvalds (1):
gcc: disable -Warray-bounds for gcc-11 too
Takashi Iwai (2):
ALSA: hda: cs35l41: Don't return -EINVAL from system suspend/resume
ALSA: hda: cs35l41: Check runtime suspend capability at runtime_idle
I'm announcing the release of the 5.15.88 kernel.
All users of the 5.15 kernel series must upgrade.
The updated 5.15.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.15.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/parisc/include/uapi/asm/mman.h | 27 ++---
arch/parisc/kernel/sys_parisc.c | 27 +++++
arch/parisc/kernel/syscalls/syscall.tbl | 2
arch/x86/include/asm/fpu/xstate.h | 4
arch/x86/kernel/fpu/regset.c | 2
arch/x86/kernel/fpu/signal.c | 2
arch/x86/kernel/fpu/xstate.c | 41 +++++++-
drivers/tty/serial/fsl_lpuart.c | 2
drivers/tty/serial/serial_core.c | 3
net/ipv4/inet_connection_sock.c | 16 +++
net/ipv4/tcp_ulp.c | 4
net/sched/sch_api.c | 5 +
sound/core/control.c | 24 +++-
sound/pci/hda/patch_hdmi.c | 1
sound/pci/hda/patch_realtek.c | 1
tools/arch/parisc/include/uapi/asm/mman.h | 12 +-
tools/perf/bench/bench.h | 12 --
tools/testing/selftests/vm/pkey-x86.h | 12 ++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
20 files changed, 273 insertions(+), 57 deletions(-)
Adrian Chan (1):
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
Chris Chiu (1):
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
Clement Lecigne (1):
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
Frederick Lawler (1):
net: sched: disallow noqueue for qdisc classes
Greg Kroah-Hartman (1):
Linux 5.15.88
Helge Deller (1):
parisc: Align parisc MADV_XXX constants with all other architectures
Kyle Huey (6):
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
x86/fpu: Allow PKRU to be (once again) written by ptrace.
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Paolo Abeni (1):
net/ulp: prevent ULP without clone op from entering the LISTEN status
Rasmus Villemoes (1):
serial: fixup backport of "serial: Deassert Transmit Enable on probe in driver-specific way"
The x86-64 memmove code has two ALTERNATIVE statements in a row, one to
handle FSRM ("Fast Short REP MOVSB"), and one to handle ERMS ("Enhanced
REP MOVSB"). If either of these features is present, the goal is to jump
directly to a REP MOVSB; otherwise, some setup code that handles short
lengths is executed. The first comparison of a sequence of specific
small sizes is included in the first ALTERNATIVE, so it will be replaced
by NOPs if FSRM is set, and then (assuming ERMS is also set) execution
will fall through to the JMP to a REP MOVSB in the next ALTERNATIVE.
The two ALTERNATIVE invocations can be combined into a single instance
of ALTERNATIVE_2 to simplify and slightly shorten the code. If either
FSRM or ERMS is set, the first instruction in the memmove_begin_forward
path will be replaced with a jump to the REP MOVSB.
This also prevents a problem when FSRM is set but ERMS is not; in this
case, the previous code would have replaced both ALTERNATIVEs with NOPs
and skipped the first check for sizes less than 0x20 bytes. This
combination of CPU features is arguably a firmware bug, but this patch
makes the function robust against this badness.
Fixes: f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
Signed-off-by: Daniel Verkamp <dverkamp(a)chromium.org>
---
arch/x86/lib/memmove_64.S | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 724bbf83eb5b..1fc36dbd3bdc 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -38,8 +38,10 @@ SYM_FUNC_START(__memmove)
/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
- ALTERNATIVE "", "jmp .Lmemmove_erms", X86_FEATURE_ERMS
+ ALTERNATIVE_2 \
+ "cmp $0x20, %rdx; jb 1f", \
+ "jmp .Lmemmove_erms", X86_FEATURE_FSRM, \
+ "jmp .Lmemmove_erms", X86_FEATURE_ERMS
/*
* movsq instruction have many startup latency
--
2.39.0.314.g84b9a713c41-goog
This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
stalling CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck in repeating fast_find_migrateblock()
based migrate page isolation, and then fails to migrate all isolated
pages. Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") was suspected as it was merged in 6.1 and in
theory can indeed remove a termination condition for
fast_find_migrateblock() under certain conditions, as it removes a place
that always marks a scanned pageblock from being re-scanned. There are
other such places, but those can be skipped under certain conditions,
which seems to match the tracepoint data.
Testing of revert also appears to have resolved the issue, thus revert
the commit until a more robust solution for the original problem is
developed.
It's also likely this will fix qemu stalls with 6.1 kernel reported in
Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: <stable(a)vger.kernel.org>
---
mm/compaction.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/compaction.c b/mm/compaction.c
index ca1603524bbe..8238e83385a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1839,6 +1839,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
+ set_pageblock_skip(freepage);
break;
}
}
--
2.39.0
The x86-64 memmove code has two ALTERNATIVE statements in a row, one to
handle FSRM ("Fast Short REP MOVSB"), and one to handle ERMS ("Enhanced
REP MOVSB"). If either of these features is present, the goal is to jump
directly to a REP MOVSB; otherwise, some setup code that handles short
lengths is executed. The first comparison of a sequence of specific
small sizes is included in the first ALTERNATIVE, so it will be replaced
by NOPs if FSRM is set, and then (assuming ERMS is also set) execution
will fall through to the JMP to a REP MOVSB in the next ALTERNATIVE.
The two ALTERNATIVE invocations can be combined into a single instance
of ALTERNATIVE_2 to simplify and slightly shorten the code. If either
FSRM or ERMS is set, the first instruction in the memmove_begin_forward
path will be replaced with a jump to the REP MOVSB.
This also prevents a problem when FSRM is set but ERMS is not; in this
case, the previous code would have replaced both ALTERNATIVEs with NOPs
and skipped the first check for sizes less than 0x20 bytes. This
combination of CPU features is arguably a firmware bug, but this patch
makes the function robust against this badness.
Fixes: f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
Signed-off-by: Daniel Verkamp <dverkamp(a)chromium.org>
---
arch/x86/lib/memmove_64.S | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 724bbf83eb5b..1fc36dbd3bdc 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -38,8 +38,10 @@ SYM_FUNC_START(__memmove)
/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
- ALTERNATIVE "", "jmp .Lmemmove_erms", X86_FEATURE_ERMS
+ ALTERNATIVE_2 \
+ "cmp $0x20, %rdx; jb 1f", \
+ "jmp .Lmemmove_erms", X86_FEATURE_FSRM, \
+ "jmp .Lmemmove_erms", X86_FEATURE_ERMS
/*
* movsq instruction have many startup latency
--
2.39.0.314.g84b9a713c41-goog
Before these patches, the Userspace Path Manager would allow the
creation of subflows with wrong families: taking the one of the MPTCP
socket instead of the provided ones and resulting in the creation of
subflows with likely not the right source and/or destination IPs. It
would also allow the creation of subflows between different families or
not respecting v4/v6-only socket attributes.
Patch 1 lets the userspace PM select the proper family to avoid creating
subflows with the wrong source and/or destination addresses because the
family is not the expected one.
Patch 2 makes sure the userspace PM doesn't allow the userspace to
create subflows for a family that is not allowed.
Patch 3 validates scenarios with a mix of v4 and v6 subflows for the
same MPTCP connection.
These patches fix issues introduced in v5.19 when the userspace path
manager has been introduced.
To: "David S. Miller" <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Kishen Maloor <kishen.maloor(a)intel.com>
To: Florian Westphal <fw(a)strlen.de>
To: Shuah Khan <shuah(a)kernel.org>
Cc: netdev(a)vger.kernel.org
Cc: mptcp(a)lists.linux.dev
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Matthieu Baerts (2):
mptcp: netlink: respect v4/v6-only sockets
selftests: mptcp: userspace: validate v4-v6 subflows mix
Paolo Abeni (1):
mptcp: explicitly specify sock family at subflow creation time
net/mptcp/pm.c | 25 ++++++++++++
net/mptcp/pm_userspace.c | 7 ++++
net/mptcp/protocol.c | 2 +-
net/mptcp/protocol.h | 6 ++-
net/mptcp/subflow.c | 9 +++--
tools/testing/selftests/net/mptcp/userspace_pm.sh | 47 +++++++++++++++++++++++
6 files changed, 90 insertions(+), 6 deletions(-)
---
base-commit: be53771c87f4e322a9835d3faa9cd73a4ecdec5b
change-id: 20230112-upstream-net-20230112-netlink-v4-v6-b6b958039ee0
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
Hello!
This is an experimental semi-automated report about issues detected by
Coverity from a scan of next-20230113 as part of the linux-next scan project:
https://scan.coverity.com/projects/linux-next-weekly-scan
You're getting this email because you were associated with the identified
lines of code (noted below) that were touched by commits:
Wed Jan 11 16:15:43 2023 -0800
ebc4c1bcc2a5 ("maple_tree: fix mas_empty_area_rev() lower bound validation")
Coverity reported the following:
*** CID 1530569: Error handling issues (CHECKED_RETURN)
lib/test_maple_tree.c:2598 in check_empty_area_window()
2592 MT_BUG_ON(mt, mas_empty_area(&mas, 5, 100, 6) != -EBUSY);
2593
2594 mas_reset(&mas);
2595 MT_BUG_ON(mt, mas_empty_area(&mas, 0, 8, 10) != -EBUSY);
2596
2597 mas_reset(&mas);
vvv CID 1530569: Error handling issues (CHECKED_RETURN)
vvv Calling "mas_empty_area" without checking return value (as is done elsewhere 8 out of 10 times).
2598 mas_empty_area(&mas, 100, 165, 3);
2599
2600 mas_reset(&mas);
2601 MT_BUG_ON(mt, mas_empty_area(&mas, 100, 163, 6) != -EBUSY);
2602 rcu_read_unlock();
2603 }
If this is a false positive, please let us know so we can mark it as
such, or teach the Coverity rules to be smarter. If not, please make
sure fixes get into linux-next. :) For patches fixing this, please
include these lines (but double-check the "Fixes" first):
Reported-by: coverity-bot <keescook+coverity-bot(a)chromium.org>
Addresses-Coverity-ID: 1530569 ("Error handling issues")
Fixes: ebc4c1bcc2a5 ("maple_tree: fix mas_empty_area_rev() lower bound validation")
Thanks for your attention!
--
Coverity-bot
This is the start of the stable review cycle for the 5.15.88 release.
There are 10 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 14 Jan 2023 13:53:18 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.88-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.88-rc1
Paolo Abeni <pabeni(a)redhat.com>
net/ulp: prevent ULP without clone op from entering the LISTEN status
Frederick Lawler <fred(a)cloudflare.com>
net: sched: disallow noqueue for qdisc classes
Rasmus Villemoes <linux(a)rasmusvillemoes.dk>
serial: fixup backport of "serial: Deassert Transmit Enable on probe in driver-specific way"
Kyle Huey <me(a)kylehuey.com>
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Allow PKRU to be (once again) written by ptrace.
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
Helge Deller <deller(a)gmx.de>
parisc: Align parisc MADV_XXX constants with all other architectures
-------------
Diffstat:
Makefile | 4 +-
arch/parisc/include/uapi/asm/mman.h | 27 +++---
arch/parisc/kernel/sys_parisc.c | 27 ++++++
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/include/asm/fpu/xstate.h | 4 +-
arch/x86/kernel/fpu/regset.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 41 ++++++++-
drivers/tty/serial/fsl_lpuart.c | 2 +-
drivers/tty/serial/serial_core.c | 3 +-
net/ipv4/inet_connection_sock.c | 16 +++-
net/ipv4/tcp_ulp.c | 4 +
net/sched/sch_api.c | 5 +
tools/arch/parisc/include/uapi/asm/mman.h | 12 +--
tools/perf/bench/bench.h | 12 ---
tools/testing/selftests/vm/pkey-x86.h | 12 +++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
17 files changed, 257 insertions(+), 49 deletions(-)
From: Sean Christopherson <seanjc(a)google.com>
Since VMX and SVM both would never update the control bits if exits
are disable after vCPUs are created, only allow setting exits
disable flag before vCPU creation.
Fixes: 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT
intercepts")
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Kechen Lu <kechenl(a)nvidia.com>
Cc: stable(a)vger.kernel.org
---
Documentation/virt/kvm/api.rst | 1 +
arch/x86/kvm/x86.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9807b05a1b57..fb0fcc566d5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7087,6 +7087,7 @@ branch to guests' 0x200 interrupt vector.
:Architectures: x86
:Parameters: args[0] defines which exits are disabled
:Returns: 0 on success, -EINVAL when args[0] contains invalid exits
+ or if any vCPU has already been created
Valid bits in args[0] are::
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da4bbd043a7b..c8ae9c4f9f08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6227,6 +6227,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
if (cap->args[0] & ~KVM_X86_DISABLE_VALID_EXITS)
break;
+ mutex_lock(&kvm->lock);
+ if (kvm->created_vcpus)
+ goto disable_exits_unlock;
+
if ((cap->args[0] & KVM_X86_DISABLE_EXITS_MWAIT) &&
kvm_can_mwait_in_guest())
kvm->arch.mwait_in_guest = true;
@@ -6237,6 +6241,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
if (cap->args[0] & KVM_X86_DISABLE_EXITS_CSTATE)
kvm->arch.cstate_in_guest = true;
r = 0;
+disable_exits_unlock:
+ mutex_unlock(&kvm->lock);
break;
case KVM_CAP_MSR_PLATFORM_INFO:
kvm->arch.guest_can_read_msr_platform_info = cap->args[0];
--
2.34.1
After the qmp phy driver was split it looks like 5.15.y stable kernels
aren't getting fixes like commit 7a7d86d14d07 ("phy: qcom-qmp-combo: fix
broken power on") which is tagged for stable 5.10. Trogdor boards use
the qmp phy on 5.15.y kernels, so I backported the fixes I could find
that looked like we may possibly trip over at some point.
USB and DP work on my Trogdor.Lazor board with this set.
Johan Hovold (4):
phy: qcom-qmp-combo: disable runtime PM on unbind
phy: qcom-qmp-combo: fix memleak on probe deferral
phy: qcom-qmp-combo: fix broken power on
phy: qcom-qmp-combo: fix runtime suspend
drivers/phy/qualcomm/phy-qcom-qmp.c | 72 ++++++++++++++---------------
1 file changed, 36 insertions(+), 36 deletions(-)
Cc: Johan Hovold <johan+linaro(a)kernel.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
Cc: Vinod Koul <vkoul(a)kernel.org>
base-commit: d57287729e229188e7d07ef0117fe927664e08cb
--
https://chromeos.dev
Hi!
> Results from Linaro’s test farm.
> Regressions on arm64 Raspberry Pi 4 Model B.
>
> Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
>
> While running LTP controllers cgroup_fj_stress_blkio test cases
> the Insufficient stack space to handle exception! occurred and
> followed by kernel panic on arm64 Raspberry Pi 4 Model B with
> clang-15 built kernel Image.
>
> The full boot and test log attached to this email and build and
> Kconfig links provided in the bottom of this email.
Full log is 11MB. That's rather... big for an email. Please post such
stuff as a link or at least compress them...
Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
From: Clement Lecigne <clecigne(a)google.com>
[ Note: this is a fix that works around the bug equivalently as the
two upstream commits:
1fa4445f9adf ("ALSA: control - introduce snd_ctl_notify_one() helper")
56b88b50565c ("ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF")
but in a simpler way to fit with older stable trees -- tiwai ]
Add missing locking in ctl_elem_read_user/ctl_elem_write_user which can be
easily triggered and turned into an use-after-free.
Example code paths with SNDRV_CTL_IOCTL_ELEM_READ:
64-bits:
snd_ctl_ioctl
snd_ctl_elem_read_user
[takes controls_rwsem]
snd_ctl_elem_read [lock properly held, all good]
[drops controls_rwsem]
32-bits (compat):
snd_ctl_ioctl_compat
snd_ctl_elem_write_read_compat
ctl_elem_write_read
snd_ctl_elem_read [missing lock, not good]
CVE-2023-0266 was assigned for this issue.
Signed-off-by: Clement Lecigne <clecigne(a)google.com>
Cc: stable(a)kernel.org # 5.12 and older
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
---
Greg, this is a patch for the last ALSA PCM UCM fix for the older
stable trees. Please take this to 5.10.y and older stable trees.
Thanks!
sound/core/control_compat.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/sound/core/control_compat.c b/sound/core/control_compat.c
index 97467f6a32a1..980ab3580f1b 100644
--- a/sound/core/control_compat.c
+++ b/sound/core/control_compat.c
@@ -304,7 +304,9 @@ static int ctl_elem_read_user(struct snd_card *card,
err = snd_power_wait(card, SNDRV_CTL_POWER_D0);
if (err < 0)
goto error;
+ down_read(&card->controls_rwsem);
err = snd_ctl_elem_read(card, data);
+ up_read(&card->controls_rwsem);
if (err < 0)
goto error;
err = copy_ctl_value_to_user(userdata, valuep, data, type, count);
@@ -332,7 +334,9 @@ static int ctl_elem_write_user(struct snd_ctl_file *file,
err = snd_power_wait(card, SNDRV_CTL_POWER_D0);
if (err < 0)
goto error;
+ down_write(&card->controls_rwsem);
err = snd_ctl_elem_write(card, file, data);
+ up_write(&card->controls_rwsem);
if (err < 0)
goto error;
err = copy_ctl_value_to_user(userdata, valuep, data, type, count);
--
2.35.3
Eine Spende wurde an Sie getätigt, antworten Sie für weitere Einzelheiten.
Grüße
Theresia Steven
--
This email has been checked for viruses by Avast antivirus software.
www.avast.com
On Tue, Jan 03, 2023 at 11:58:48AM +0100, Ard Biesheuvel wrote:
> On Tue, 3 Jan 2023 at 03:13, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
> >
> > On Mon, Jan 2, 2023 at 5:45 PM Guenter Roeck <linux(a)roeck-us.net> wrote:
> > >
> > > ... and reverting commit 99cb0d917ff indeed fixes the problem.
> >
> > Hmm. My gut feel is that this just exposes some bug in binutils.
> >
> > That said, maybe that commit should not have added its own /DISCARDS/
> > thing, and instead just added that "*(.note.GNU-stack)" to the general
> > /DISCARDS/ thing that is defined by the
> >
> > #define DISCARDS ..
> >
> > a little bit later, so that we only end up with one single DISCARD
> > list. Something like this (broken patch on purpose):
> >
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -897,5 +897,4 @@
> > */
> > #define NOTES \
> > - /DISCARD/ : { *(.note.GNU-stack) } \
> > .notes : AT(ADDR(.notes) - LOAD_OFFSET) { \
> > BOUNDED_SECTION_BY(.note.*, _notes) \
> > @@ -1016,4 +1015,5 @@
> > #define DISCARDS \
> > /DISCARD/ : { \
> > + *(.note.GNU-stack) \
> > EXIT_DISCARDS \
> > EXIT_CALL \
> >
> > But maybe that DISCARDS macrop ends up being used too late?
> >
>
> Masahiro's v1 did something like this, and it caused an issue on
> RISC-V, which is why we ended up with this approach instead.
>
> > It really shouldn't matter, but here we are, with a build problem with
> > some random old binutils on an odd platform..
> >
>
> AIUI, the way ld.bfd used to combine output sections may also affect
> the /DISCARD/ pseudo-section, and so introducing it much earlier
> results in these discards to be interpreted in a different order.
>
> The purpose of this change is to prevent .note.GNU-stack from deciding
> the section type of the .notes output section, and so keeping it in
> its own section should be sufficient. E.g.,
>
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -896,7 +896,7 @@
> * Otherwise, the type of .notes section would become PROGBITS
> instead of NOTES.
> */
> #define NOTES \
> - /DISCARD/ : { *(.note.GNU-stack) } \
> + .note.GNU-stack : { *(.note.GNU-stack) } \
> .notes : AT(ADDR(.notes) - LOAD_OFFSET) { \
> BOUNDED_SECTION_BY(.note.*, _notes) \
> } NOTES_HEADERS \
>
> The .note.GNU-stack has zero size, so the result should be the same.
+Greg +Nick
This also fixes Build ID on arm64 for stable 5.15, 5.10, and 5.4
which has been broken since backport of:
0d362be5b142 ("Makefile: link with -z noexecstack --no-warn-rwx-segments")
Discussed here:
https://lore.kernel.org/stable/3df32572ec7016e783d37e185f88495831671f5d.167…https://lore.kernel.org/stable/cover.1670358255.git.tom.saeger@oracle.com/
Perhaps add:
Cc: <stable(a)vger.kernel.org> # 5.15, 5.10, 5.4
for stable 5.15, 5.10, 5.4
Tested-by: Tom Saeger <tom.saeger(a)oracle.com>