The decision whether to enable a wake irq during suspend cannot be based
on the runtime PM state directly, as a driver may use wake irqs without
implementing runtime PM. Such drivers specifically leave the state set
to the default 'suspended', and the wake irq is thus never enabled at
suspend.
Add a new wake irq flag to track whether a dedicated wake irq has been
enabled at runtime suspend and therefore must not be enabled at system
suspend.
Note that pm_runtime_enabled() cannot be used, as runtime PM is always
disabled during late suspend.
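For illustration, a minimal sketch of the driver pattern this fix
targets (foo_probe() is hypothetical; device_init_wakeup() and
dev_pm_set_dedicated_wake_irq() are the real APIs such drivers use):

/* Hypothetical driver: requests a dedicated wake irq but never enables
 * runtime PM, so the runtime PM status stays at the default 'suspended'
 * even while the device is in use.
 */
static int foo_probe(struct platform_device *pdev)
{
	int irq = platform_get_irq(pdev, 1);

	if (irq < 0)
		return irq;

	device_init_wakeup(&pdev->dev, true);

	/* Without this fix, dev_pm_arm_wake_irq() saw the 'suspended'
	 * runtime PM state and skipped enable_irq() at system suspend,
	 * so this irq could never wake the system.
	 */
	return dev_pm_set_dedicated_wake_irq(&pdev->dev, irq);
}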
Fixes: 69728051f5bf ("PM / wakeirq: Fix unbalanced IRQ enable for wakeirq")
Cc: stable(a)vger.kernel.org # 4.16
Cc: Tony Lindgren <tony(a)atomide.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/base/power/power.h | 1 +
drivers/base/power/wakeirq.c | 12 ++++++++----
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/base/power/power.h b/drivers/base/power/power.h
index 0eb7f02b3ad5..922ed457db19 100644
--- a/drivers/base/power/power.h
+++ b/drivers/base/power/power.h
@@ -29,6 +29,7 @@ extern u64 pm_runtime_active_time(struct device *dev);
#define WAKE_IRQ_DEDICATED_MASK (WAKE_IRQ_DEDICATED_ALLOCATED | \
WAKE_IRQ_DEDICATED_MANAGED | \
WAKE_IRQ_DEDICATED_REVERSE)
+#define WAKE_IRQ_DEDICATED_ENABLED BIT(3)
struct wake_irq {
struct device *dev;
diff --git a/drivers/base/power/wakeirq.c b/drivers/base/power/wakeirq.c
index d487a6bac630..afd094dec5ca 100644
--- a/drivers/base/power/wakeirq.c
+++ b/drivers/base/power/wakeirq.c
@@ -314,8 +314,10 @@ void dev_pm_enable_wake_irq_check(struct device *dev,
return;
enable:
- if (!can_change_status || !(wirq->status & WAKE_IRQ_DEDICATED_REVERSE))
+ if (!can_change_status || !(wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) {
enable_irq(wirq->irq);
+ wirq->status |= WAKE_IRQ_DEDICATED_ENABLED;
+ }
}
/**
@@ -336,8 +338,10 @@ void dev_pm_disable_wake_irq_check(struct device *dev, bool cond_disable)
if (cond_disable && (wirq->status & WAKE_IRQ_DEDICATED_REVERSE))
return;
- if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED)
+ if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED) {
+ wirq->status &= ~WAKE_IRQ_DEDICATED_ENABLED;
disable_irq_nosync(wirq->irq);
+ }
}
/**
@@ -376,7 +380,7 @@ void dev_pm_arm_wake_irq(struct wake_irq *wirq)
if (device_may_wakeup(wirq->dev)) {
if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED &&
- !pm_runtime_status_suspended(wirq->dev))
+ !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED))
enable_irq(wirq->irq);
enable_irq_wake(wirq->irq);
@@ -399,7 +403,7 @@ void dev_pm_disarm_wake_irq(struct wake_irq *wirq)
disable_irq_wake(wirq->irq);
if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED &&
- !pm_runtime_status_suspended(wirq->dev))
+ !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED))
disable_irq_nosync(wirq->irq);
}
}
--
2.41.0
The runtime PM state should not be changed by drivers that do not
implement runtime PM, even if doing so happens to work around a bug in
PM core. With the wake irq arming now fixed, drop the bogus runtime PM
state update, which left the device in the active state (and could
potentially prevent a parent device from suspending).
Fixes: f3974413cf02 ("tty: serial: qcom_geni_serial: Wakeup IRQ cleanup")
Cc: stable(a)vger.kernel.org # 5.6
Cc: Matthias Kaehlcke <mka(a)chromium.org>
Cc: Stephen Boyd <swboyd(a)chromium.org>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 88ed5bbe25a8..b825b05e6137 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -1681,13 +1681,6 @@ static int qcom_geni_serial_probe(struct platform_device *pdev)
if (ret)
return ret;
- /*
- * Set pm_runtime status as ACTIVE so that wakeup_irq gets
- * enabled/disabled from dev_pm_arm_wake_irq during system
- * suspend/resume respectively.
- */
- pm_runtime_set_active(&pdev->dev);
-
if (port->wakeup_irq > 0) {
device_init_wakeup(&pdev->dev, true);
ret = dev_pm_set_dedicated_wake_irq(&pdev->dev,
--
2.41.0
From: Gavin Shan <gshan(a)redhat.com>
We run into a guest hang in the edk2 firmware when KSM is kept running
on the host. The edk2 firmware waits for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
operation. The status is returned by reading the memory region of the
pflash device, and the read request should be forwarded to QEMU and
emulated by it. Unfortunately, when the guest hang occurs, the read
request is covered by an illegal stage2 mapping, so it completes with
QEMU bypassed and the wrong status is fetched. The edk2 firmware then
spins in an infinite loop on the wrong status.
The illegal stage2 mapping is populated by same-page sharing from KSM
at (C), even though the associated memory slot has already been marked
invalid at (B) when its deletion was requested. Note that the active
and inactive memory slots can't be swapped while we're in the middle of
kvm_mmu_notifier_change_pte(): kvm->mn_active_invalidate_count is
elevated, so kvm_swap_active_memslots() busy-loops until it reaches
zero again. Swapping from the active to the inactive memory slots is
also prevented by holding &kvm->srcu in __kvm_handle_hva_range(), which
pairs with synchronize_srcu_expedited() in kvm_swap_active_memslots().
  CPU-A                                   CPU-B
  -----                                   -----
  ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
  kvm_vm_ioctl_set_memory_region
  kvm_set_memory_region
  __kvm_set_memory_region
  kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
    kvm_invalidate_memslot
      kvm_copy_memslot
      kvm_replace_memslot
      kvm_swap_active_memslots        (A)
      kvm_arch_flush_shadow_memslot   (B)
                                          same page sharing by KSM
                                          kvm_mmu_notifier_invalidate_range_start
                                          :
                                          kvm_mmu_notifier_change_pte
                                            kvm_handle_hva_range
                                              __kvm_handle_hva_range
                                                kvm_set_spte_gfn      (C)
                                          :
                                          kvm_mmu_notifier_invalidate_range_end
Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping, so that the read request for the pflash's
status is forwarded to QEMU and emulated by it. This way, the correct
pflash status is returned from QEMU, breaking the infinite loop in the
edk2 firmware.
A git bisect points to cd4c71835228 ("KVM: arm64: Convert to the
gfn-based MMU notifier callbacks") as the first problematic commit.
With it, clean_dcache_guest_page() is called after the memory slots are
iterated in kvm_mmu_notifier_change_pte(), while before that commit it
was called before the iteration. The change effectively enlarges the
race window between kvm_mmu_notifier_change_pte() and memory slot
removal, which is what made the issue reproducible in a practical test
case. The issue itself, however, has existed since commit d5d8184d35c9
("KVM: ARM: Memory virtualization setup").
Cc: stable(a)vger.kernel.org # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai(a)redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha(a)redhat.com>
Signed-off-by: Gavin Shan <gshan(a)redhat.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Oliver Upton <oliver.upton(a)linux.dev>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Shaoqin Huang <shahuang(a)redhat.com>
Message-Id: <20230615054259.14911-1-gshan(a)redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..65f94f592ff8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
return __kvm_handle_hva_range(kvm, &range);
}
+
+static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ /*
+ * Skipping invalid memslots is correct if and only if change_pte() is
+ * surrounded by invalidate_range_{start,end}(), which is currently
+ * guaranteed by the primary MMU. If that ever changes, KVM needs to
+ * unmap the memslot instead of skipping the memslot to ensure that KVM
+ * doesn't hold references to the old PFN.
+ */
+ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+
+ if (range->slot->flags & KVM_MEMSLOT_INVALID)
+ return false;
+
+ return kvm_set_spte_gfn(kvm, range);
+}
+
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
return;
- kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+ kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
}
void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
--
2.25.1
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
were to set this sysctl to a more restrictive option in the host pidns
you would need to reboot your machine in order to reset it.
The justification given in [1] is that this is a security feature and
thus it should not be possible to disable. Aside from the fact that we
have plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):
% cat mount-memfd.c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
int main(void)
{
int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert(fsfd >= 0);
assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert(dfd >= 0);
int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
assert(execfd >= 0);
assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
assert(!close(execfd));
char *execpath = NULL;
char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
assert(execfd >= 0);
assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%
Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host
filesystem (not to mention the many other things a CAP_SYS_ADMIN process
would be able to do that would be equivalent or worse), it seems strange
to cause a fair amount of headache to admins when there doesn't appear
to be an actual security benefit to blocking this.
It should be noted that with this change, programs that can do an
unprivileged unshare(CLONE_NEWUSER) would be able to create an
executable memfd even if their current pidns didn't allow it. However,
the same sample program above can also be used in this scenario, meaning
that even with this consideration, blocking CAP_SYS_ADMIN makes little
sense:
% unshare -rm ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
This simply further reinforces that locked-down environments need to
disallow CLONE_NEWUSER for unprivileged users (as is already the case in
most container environments).
[1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5T…
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: stable(a)vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
kernel/pid_sysctl.h | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h
index b26e027fc9cd..8a22bc29ebb4 100644
--- a/kernel/pid_sysctl.h
+++ b/kernel/pid_sysctl.h
@@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
if (ns != &init_pid_ns)
table_copy.data = &ns->memfd_noexec_scope;
- /*
- * set minimum to current value, the effect is only bigger
- * value is accepted.
- */
- if (*(int *)table_copy.data > *(int *)table_copy.extra1)
- table_copy.extra1 = table_copy.data;
-
return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
}
--
2.41.0
The quilt patch titled
Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
has been removed from the -mm tree. Its filename was
hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
Date: Tue, 11 Jul 2023 15:09:42 -0700
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The hugetlb
page destructor must be changed before pages are freed to lower level
allocators. However, the destructor must be changed under the hugetlb
lock. This means there is potentially one lock cycle per page.
Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating the necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to the low level allocators
Link: https://lkml.kernel.org/r/20230711220942.43706-3-mike.kravetz@oracle.com
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Jiaqi Yan <jiaqiyan(a)google.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
--- a/mm/hugetlb.c~hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles
+++ a/mm/hugetlb.c
@@ -1855,11 +1855,45 @@ static void update_and_free_pages_bulk(s
{
struct page *page, *t_page;
struct folio *folio;
+ bool clear_dtor = false;
+ /*
+ * First allocate required vmemmap for all pages on list. If vmemmap
+ * cannot be allocated, we cannot free page to lower level allocator,
+ * so add back as hugetlb surplus page.
+ */
+ list_for_each_entry_safe(page, t_page, list, lru) {
+ if (HPageVmemmapOptimized(page)) {
+ clear_dtor = true;
+ if (hugetlb_vmemmap_restore(h, page)) {
+ folio = page_folio(page);
+ spin_lock_irq(&hugetlb_lock);
+ add_hugetlb_folio(h, folio, true);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ cond_resched();
+ }
+ }
+
+ /*
+ * If vmemmap allocation was performed above, then take the lock to clear
+ * destructor of all pages on list.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(page, list, lru)
+ __clear_hugetlb_destructor(h, page_folio(page));
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
+ /*
+ * Free pages back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
list_for_each_entry_safe(page, t_page, list, lru) {
folio = page_folio(page);
update_and_free_hugetlb_folio(h, folio, false);
- cond_resched();
}
}
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch
The patch titled
Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
Date: Tue, 11 Jul 2023 15:09:42 -0700
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The hugetlb
page destructor must be changed before pages are freed to lower level
allocators. However, the destructor must be changed under the hugetlb
lock. This means there is potentially one lock cycle per page.
Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating the necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to the low level allocators
Link: https://lkml.kernel.org/r/20230711220942.43706-3-mike.kravetz@oracle.com
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Jiaqi Yan <jiaqiyan(a)google.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
--- a/mm/hugetlb.c~hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles
+++ a/mm/hugetlb.c
@@ -1855,11 +1855,45 @@ static void update_and_free_pages_bulk(s
{
struct page *page, *t_page;
struct folio *folio;
+ bool clear_dtor = false;
+ /*
+ * First allocate required vmemmap for all pages on list. If vmemmap
+ * cannot be allocated, we cannot free page to lower level allocator,
+ * so add back as hugetlb surplus page.
+ */
+ list_for_each_entry_safe(page, t_page, list, lru) {
+ if (HPageVmemmapOptimized(page)) {
+ clear_dtor = true;
+ if (hugetlb_vmemmap_restore(h, page)) {
+ folio = page_folio(page);
+ spin_lock_irq(&hugetlb_lock);
+ add_hugetlb_folio(h, folio, true);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ cond_resched();
+ }
+ }
+
+ /*
+ * If vmemmap allocation was performed above, then take the lock to clear
+ * destructor of all pages on list.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(page, list, lru)
+ __clear_hugetlb_destructor(h, page_folio(page));
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
+ /*
+ * Free pages back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
list_for_each_entry_safe(page, t_page, list, lru) {
folio = page_folio(page);
update_and_free_hugetlb_folio(h, folio, false);
- cond_resched();
}
}
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch
hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch
The patch titled
Subject: hugetlb: do not clear hugetlb dtor until allocating vmemmap
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlb: do not clear hugetlb dtor until allocating vmemmap
Date: Tue, 11 Jul 2023 15:09:41 -0700
Patch series "Fix hugetlb free path race with memory errors".
In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update the page
flags of non-existent page structs.
A much more detailed description is in the first patch.
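As a rough sketch of that window (function names are from the patches
below; the annotations are mine):

/*
 * spin_lock_irq(&hugetlb_lock);
 * remove_hugetlb_folio(h, folio, false);	// step 1: off the lists,
 * spin_unlock_irq(&hugetlb_lock);		//         dtor cleared
 *
 *	<-- a memory error handled here sees an apparently normal
 *	    compound page and calls SetPageHWPoison() on the faulting
 *	    subpage; if that subpage is a tail page whose vmemmap was
 *	    optimized away, the flag write lands on a non-existent
 *	    struct page
 *
 * update_and_free_hugetlb_folio(h, folio, false);	// step 2: vmemmap
 *							// restored, freed
 */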
The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation. This could lead to slowdowns if one is freeing a large
number of hugetlb pages.
The second patch optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.
The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression. It can be combined with
the first, but was done separately to make reviewing easier.
This patch (of 2):
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, the destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Jiaqi Yan <jiaqiyan(a)google.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++----------------
1 file changed, 51 insertions(+), 24 deletions(-)
--- a/mm/hugetlb.c~hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap
+++ a/mm/hugetlb.c
@@ -1579,9 +1579,37 @@ static inline void destroy_compound_giga
unsigned int order) { }
#endif
+static inline void __clear_hugetlb_destructor(struct hstate *h,
+ struct folio *folio)
+{
+ lockdep_assert_held(&hugetlb_lock);
+
+ /*
+ * Very subtle
+ *
+ * For non-gigantic pages set the destructor to the normal compound
+ * page dtor. This is needed in case someone takes an additional
+ * temporary ref to the page, and freeing is delayed until they drop
+ * their reference.
+ *
+ * For gigantic pages set the destructor to the null dtor. This
+ * destructor will never be called. Before freeing the gigantic
+ * page destroy_compound_gigantic_folio will turn the folio into a
+ * simple group of pages. After this the destructor does not
+ * apply.
+ *
+ */
+ if (hstate_is_gigantic(h))
+ folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
+ else
+ folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+}
+
/*
- * Remove hugetlb folio from lists, and update dtor so that the folio appears
- * as just a compound page.
+ * Remove hugetlb folio from lists.
+ * If vmemmap exists for the folio, update dtor so that the folio appears
+ * as just a compound page. Otherwise, wait until after allocating vmemmap
+ * to update dtor.
*
* A reference is held on the folio, except in the case of demote.
*
@@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struc
}
/*
- * Very subtle
- *
- * For non-gigantic pages set the destructor to the normal compound
- * page dtor. This is needed in case someone takes an additional
- * temporary ref to the page, and freeing is delayed until they drop
- * their reference.
- *
- * For gigantic pages set the destructor to the null dtor. This
- * destructor will never be called. Before freeing the gigantic
- * page destroy_compound_gigantic_folio will turn the folio into a
- * simple group of pages. After this the destructor does not
- * apply.
- *
- * This handles the case where more than one ref is held when and
- * after update_and_free_hugetlb_folio is called.
- *
- * In the case of demote we do not ref count the page as it will soon
- * be turned into a page of smaller size.
+ * We can only clear the hugetlb destructor after allocating vmemmap
+ * pages. Otherwise, someone (memory error handling) may try to write
+ * to tail struct pages.
+ */
+ if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ __clear_hugetlb_destructor(h, folio);
+
+ /*
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
if (!demote)
folio_ref_unfreeze(folio, 1);
- if (hstate_is_gigantic(h))
- folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
- else
- folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_fo
{
int i;
struct page *subpage;
+ bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
@@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_fo
if (unlikely(folio_test_hwpoison(folio)))
folio_clear_hugetlb_hwpoison(folio);
+ /*
+ * If vmemmap pages were allocated above, then we need to clear the
+ * hugetlb destructor under the hugetlb lock.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
for (i = 0; i < pages_per_huge_page(h); i++) {
subpage = folio_page(folio, i);
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch
hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch
This is an RFC and a bit rough for now. I only set out to create the
first of the three patches. But while doing so I noticed a few things
that seemed odd to me, given my background in writing and editing texts.
So I quickly made a few additional changes to fix those, to see
if the stable team would appreciate them, as this document is clearly
their domain.
If those changes or even the initial patch are not welcome, I'll simply
drop them. I'd totally understand this, as texts like these are delicate
and it's easy to accidentally change the intent or the meaning while
adjusting things in good faith.
At the same time I might be willing to do a few more changes, if people
like the direction this takes and want a bit more fine tuning.
CC: Greg KH <gregkh(a)linuxfoundation.org>
CC: Sasha Levin <sashal(a)kernel.org>
CC: Jonathan Corbet <corbet(a)lwn.net>
Thorsten Leemhuis (3):
docs: stable-kernel-rules: mention other usages for stable tag
comments
docs: stable-kernel-rules: make rule section more straight forward
docs: stable-kernel-rules: improve structure to optimize reading flow
Documentation/process/stable-kernel-rules.rst | 189 ++++++++++--------
1 file changed, 105 insertions(+), 84 deletions(-)
base-commit: 016571b6d52deb473676fb4d24baf8ed3667ae21
--
2.40.1
Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread
cross-thread naming if !PR_DUMPABLE") when switching from REG to NOD,
chmod operations on /proc/thread-self/comm were no longer blocked as
they are on almost all other procfs files.
A very similar situation with /proc/self/environ was used as a root
exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a
correctness issue.
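A hedged userspace illustration of the change (the exact errno is an
assumption; proc_setattr() normally rejects mode changes with EPERM):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* Before this fix: succeeds, and the mode change sticks.
	 * After: rejected like other procfs files (expected EPERM). */
	if (chmod("/proc/thread-self/comm", 0777) != 0)
		perror("chmod");
	return 0;
}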
Ref: https://lwn.net/Articles/191954/
Ref: 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files")
Fixes: 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE")
Cc: stable(a)vger.kernel.org # v4.7+
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
v2: removed nolibc selftests as per review
v1: <https://lore.kernel.org/linux-fsdevel/20230713121907.9693-1-cyphar@cyphar.c…>
---
fs/proc/base.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 05452c3b9872..7394229816f3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3583,7 +3583,8 @@ static int proc_tid_comm_permission(struct mnt_idmap *idmap,
}
static const struct inode_operations proc_tid_comm_inode_operations = {
- .permission = proc_tid_comm_permission,
+ .setattr = proc_setattr,
+ .permission = proc_tid_comm_permission,
};
/*
--
2.41.0
From: Thomas Bourgoin <thomas.bourgoin(a)foss.st.com>
We were reading the length of the scatterlist sg after copying the value
of tsg into it.
So we were using the size of the previous scatterlist, and for the first
one we were using an uninitialised value.
Fix this by copying tsg into sg[0] before reading the size.
Fixes: 8a1012d3f2ab ("crypto: stm32 - Support for STM32 HASH module")
Cc: stable(a)vger.kernel.org
Signed-off-by: Thomas Bourgoin <thomas.bourgoin(a)foss.st.com>
---
Changes in v2:
- Add Fixes 8a1012d3f2ab ("crypto: stm32 - Support for STM32 HASH module")
- Add Cc: stable(a)vger.kernel.org
drivers/crypto/stm32/stm32-hash.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/crypto/stm32/stm32-hash.c b/drivers/crypto/stm32/stm32-hash.c
index c179a6c1a457..519fb716acee 100644
--- a/drivers/crypto/stm32/stm32-hash.c
+++ b/drivers/crypto/stm32/stm32-hash.c
@@ -678,9 +678,9 @@ static int stm32_hash_dma_send(struct stm32_hash_dev *hdev)
}
for_each_sg(rctx->sg, tsg, rctx->nents, i) {
+ sg[0] = *tsg;
len = sg->length;
- sg[0] = *tsg;
if (sg_is_last(sg)) {
if (hdev->dma_mode == 1) {
len = (ALIGN(sg->length, 16) - 16);
--
2.25.1
The previous implementation of vm.memfd_noexec=2 did not actually
enforce the usage of MFD_NOEXEC_SEAL, as any program that set MFD_EXEC
would not be affected by the "enforcement" mechanism. This was fixed
in Andrew's tree recently, but there were still some things that could
be cleaned up.
On the topic of older programs, it seems far less disruptive to have
vm.memfd_noexec=2 behave like vm.memfd_noexec=1 (default to
MFD_NOEXEC_SEAL if unspecified) so as not to break older programs that
never cared about the exec bits -- which includes the vast majority of
programs that use memfd_create(2). This lets users enable the sysctl
without all older programs needlessly breaking; otherwise
vm.memfd_noexec=2 would be unusable on most general-purpose systems, as
it would require an audit of all of userspace.
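As a sketch of the resulting semantics for such an older caller (the
behaviour notes describe this patch; the MFD_* flags are the real ones):

#include <sys/mman.h>

int make_legacy_memfd(void)
{
	/*
	 * An older program that predates the exec flags:
	 *   vm.memfd_noexec=0: treated as MFD_EXEC (mode 0777)
	 *   vm.memfd_noexec=1: treated as MFD_NOEXEC_SEAL (0666, F_SEAL_EXEC)
	 *   vm.memfd_noexec=2: with this patch, same as =1 rather than
	 *                      failing; only an explicit MFD_EXEC is rejected
	 */
	return memfd_create("buf", MFD_CLOEXEC);
}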
While we're at it, fix the warnings emitted by memfd_create() to use
pr_warn_ratelimited(). If the intention of the warning is to get
developers to switch to explicitly specifying if they want exec bits or
not, you need to warn them whenever they use it. The systemd version on
my box doesn't pass MFD_EXEC, making the warning useless for most
userspace developers because it was already emitted during boot. Commit
105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") mentions
that this was switched to pr_warn_once "as per review" but I couldn't
find the discussion anywhere, and the obvious issue (the ability for
unprivileged userspace to spam the kernel log) should be handled by
pr_warn_ratelimited. If the issue is that this is too spammy, we could
tie it to using vm.memfd_noexec=1 or higher.
This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: stable(a)vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
include/linux/pid_namespace.h | 16 +++--------
mm/memfd.c | 32 ++++++++--------------
tools/testing/selftests/memfd/memfd_test.c | 22 +++++++++++----
3 files changed, 33 insertions(+), 37 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c758809d5bcf..53974d79d98e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -17,18 +17,10 @@
struct fs_pin;
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-/*
- * sysctl for vm.memfd_noexec
- * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- * acts like MFD_EXEC was set.
- * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- * acts like MFD_NOEXEC_SEAL was set.
- * 2: memfd_create() without MFD_NOEXEC_SEAL will be
- * rejected.
- */
-#define MEMFD_NOEXEC_SCOPE_EXEC 0
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2
+/* modes for vm.memfd_noexec sysctl */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */
#endif
struct pid_namespace {
diff --git a/mm/memfd.c b/mm/memfd.c
index 0bdbd2335af7..4f1f841ae39d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -271,30 +271,22 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
static int check_sysctl_memfd_noexec(unsigned int *flags)
{
#ifdef CONFIG_SYSCTL
- char comm[TASK_COMM_LEN];
- int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
- struct pid_namespace *ns;
-
- ns = task_active_pid_ns(current);
- if (ns)
- sysctl = ns->memfd_noexec_scope;
+ int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
- if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
+ if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
*flags |= MFD_NOEXEC_SEAL;
else
*flags |= MFD_EXEC;
}
- if (*flags & MFD_EXEC && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
- pr_warn_once(
- "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
- task_pid_nr(current), get_task_comm(comm, current));
-
+ if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
+ pr_warn_ratelimited(
+ "%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n",
+ current->comm, task_pid_nr(current), sysctl);
return -EACCES;
}
#endif
-
return 0;
}
@@ -302,7 +294,6 @@ SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
unsigned int, flags)
{
- char comm[TASK_COMM_LEN];
unsigned int *file_seals;
struct file *file;
int fd, error;
@@ -324,13 +315,14 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
- pr_warn_once(
- "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
- task_pid_nr(current), get_task_comm(comm, current));
+ pr_warn_ratelimited(
+ "%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
+ current->comm, task_pid_nr(current));
}
- if (check_sysctl_memfd_noexec(&flags) < 0)
- return -EACCES;
+ error = check_sysctl_memfd_noexec(&flags);
+ if (error < 0)
+ return error;
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index dbdd9ec5e397..d8342989c547 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1145,11 +1145,23 @@ static void test_sysctl_child(void)
printf("%s sysctl 2\n", memfd_str);
sysctl_assert_write("2");
- mfd_fail_new("kern_memfd_sysctl_2",
- MFD_CLOEXEC | MFD_ALLOW_SEALING);
- mfd_fail_new("kern_memfd_sysctl_2_MFD_EXEC",
- MFD_CLOEXEC | MFD_EXEC);
- fd = mfd_assert_new("", 0, MFD_NOEXEC_SEAL);
+ mfd_fail_new("kern_memfd_sysctl_2_exec",
+ MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ fd = mfd_assert_new("kern_memfd_sysctl_2_dfl",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ close(fd);
+
+ fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal",
+ mfd_def_size,
+ MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
close(fd);
sysctl_fail_write("0");
--
2.41.0
Don't assume that the device is fully under the control of the PCI core.
Use RMW capability accessors in link retraining, which do proper locking
to avoid losing concurrent updates to the register values.
Fixes: 4ec73791a64b ("PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratum")
Fixes: 7d715a6c1ae5 ("PCI: add PCI Express ASPM support")
Suggested-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
pci/enumeration branch moves the link retraining code into PCI core and
also conflicts with a link retraining fix in pci/aspm. The changelog
(and patch splitting) takes the move into account by not referring to
ASPM while the change itself is not based on pci/enumeration (as per
Bjorn's preference).
---
drivers/pci/pci.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 60230da957e0..f7315b13bb82 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4927,7 +4927,6 @@ static int pcie_wait_for_link_status(struct pci_dev *pdev,
int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
{
int rc;
- u16 lnkctl;
/*
* Ensure the updated LNKCTL parameters are used during link
@@ -4939,17 +4938,14 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
if (rc)
return rc;
- pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
- lnkctl |= PCI_EXP_LNKCTL_RL;
- pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+ pcie_capability_set_word(pdev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_RL);
if (pdev->clear_retrain_link) {
/*
* Due to an erratum in some devices the Retrain Link bit
* needs to be cleared again manually to allow the link
* training to succeed.
*/
- lnkctl &= ~PCI_EXP_LNKCTL_RL;
- pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+ pcie_capability_clear_word(pdev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_RL);
}
return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
--
2.30.2
Many places in the kernel write the Link Control and Root Control PCI
Express Capability Registers without proper concurrency control, which
could result in losing the changes one of the writers intended to make.
Add a pcie_cap_lock spinlock to struct pci_dev and use it to protect
bit changes made in the RMW capability accessors. Protect only a
selected set of registers by differentiating the RMW accessor
internally into locked/unlocked variants using a wrapper which has the
same signature as pcie_capability_clear_and_set_word(). As the
Capability Register (pos) given to the wrapper is always a constant,
the compiler should be able to simplify all the dead code away.
So far only the Link Control Register (ASPM, hotplug, link retraining,
various drivers) and the Root Control Register (AER & PME) seem to
require RMW locking.
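For illustration, a sketch of the call-site view (the masks and values
are arbitrary; the registers and accessors are the real ones):

/* pos is the compile-time constant PCI_EXP_LNKCTL, so the wrapper
 * resolves to the _locked variant and this RMW is serialized against
 * concurrent writers (ASPM, hotplug, drivers):
 */
pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
				   PCI_EXP_LNKCTL_ASPMC, val);

/* A register outside the selected set, e.g. PCI_EXP_DEVCTL, resolves
 * to the _unlocked variant and behaves exactly as before:
 */
pcie_capability_clear_and_set_word(pdev, PCI_EXP_DEVCTL,
				   PCI_EXP_DEVCTL_READRQ, rq);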
Fixes: c7f486567c1d ("PCI PM: PCIe PME root port service driver")
Fixes: f12eb72a268b ("PCI/ASPM: Use PCI Express Capability accessors")
Fixes: 7d715a6c1ae5 ("PCI: add PCI Express ASPM support")
Fixes: affa48de8417 ("staging/rdma/hfi1: Add support for enabling/disabling PCIe ASPM")
Fixes: 849a9366cba9 ("misc: rtsx: Add support new chip rts5228 mmc: rtsx: Add support MMC_CAP2_NO_MMC")
Fixes: 3d1e7aa80d1c ("misc: rtsx: Use pcie_capability_clear_and_set_word() for PCI_EXP_LNKCTL")
Fixes: c0e5f4e73a71 ("misc: rtsx: Add support for RTS5261")
Fixes: 3df4fce739e2 ("misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG")
Fixes: 121e9c6b5c4c ("misc: rtsx: modify and fix init_hw function")
Fixes: 19f3bd548f27 ("mfd: rtsx: Remove LCTLR defination")
Fixes: 773ccdfd9cc6 ("mfd: rtsx: Read vendor setting from config space")
Fixes: 8275b77a1513 ("mfd: rts5249: Add support for RTS5250S power saving")
Fixes: 5da4e04ae480 ("misc: rtsx: Add support for RTS5260")
Fixes: 0f49bfbd0f2e ("tg3: Use PCI Express Capability accessors")
Fixes: 5e7dfd0fb94a ("tg3: Prevent corruption at 10 / 100Mbps w CLKREQ")
Fixes: b726e493e8dc ("r8169: sync existing 8168 device hardware start sequences with vendor driver")
Fixes: e6de30d63eb1 ("r8169: more 8168dp support.")
Fixes: 8a06127602de ("Bluetooth: hci_bcm4377: Add new driver for BCM4377 PCIe boards")
Fixes: 6f461f6c7c96 ("e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata")
Fixes: 1eae4eb2a1c7 ("e1000e: Disable L1 ASPM power savings for 82573 mobile variants")
Fixes: 8060e169e02f ("ath9k: Enable extended synch for AR9485 to fix L0s recovery issue")
Fixes: 69ce674bfa69 ("ath9k: do btcoex ASPM disabling at initialization time")
Fixes: f37f05503575 ("mt76: mt76x2e: disable pcie_aspm by default")
Suggested-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
drivers/pci/access.c | 20 +++++++++++++++++---
drivers/pci/probe.c | 1 +
include/linux/pci.h | 34 ++++++++++++++++++++++++++++++++--
3 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 3c230ca3de58..0b2e90d2f04f 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -497,8 +497,8 @@ int pcie_capability_write_dword(struct pci_dev *dev, int pos, u32 val)
}
EXPORT_SYMBOL(pcie_capability_write_dword);
-int pcie_capability_clear_and_set_word(struct pci_dev *dev, int pos,
- u16 clear, u16 set)
+int pcie_capability_clear_and_set_word_unlocked(struct pci_dev *dev, int pos,
+ u16 clear, u16 set)
{
int ret;
u16 val;
@@ -512,7 +512,21 @@ int pcie_capability_clear_and_set_word(struct pci_dev *dev, int pos,
return ret;
}
-EXPORT_SYMBOL(pcie_capability_clear_and_set_word);
+EXPORT_SYMBOL(pcie_capability_clear_and_set_word_unlocked);
+
+int pcie_capability_clear_and_set_word_locked(struct pci_dev *dev, int pos,
+ u16 clear, u16 set)
+{
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&dev->pcie_cap_lock, flags);
+ ret = pcie_capability_clear_and_set_word_unlocked(dev, pos, clear, set);
+ spin_unlock_irqrestore(&dev->pcie_cap_lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(pcie_capability_clear_and_set_word_locked);
int pcie_capability_clear_and_set_dword(struct pci_dev *dev, int pos,
u32 clear, u32 set)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8bac3ce02609..f1587fb0ba71 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2324,6 +2324,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
.end = -1,
};
+ spin_lock_init(&dev->pcie_cap_lock);
#ifdef CONFIG_PCI_MSI
raw_spin_lock_init(&dev->msi_lock);
#endif
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c69a2cc1f412..7ee498cd1f37 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -467,6 +467,7 @@ struct pci_dev {
pci_dev_flags_t dev_flags;
atomic_t enable_cnt; /* pci_enable_device has been called */
+ spinlock_t pcie_cap_lock; /* Protects RMW ops in capability accessors */
u32 saved_config_space[16]; /* Config space saved at suspend time */
struct hlist_head saved_cap_space;
int rom_attr_enabled; /* Display of ROM attribute enabled? */
@@ -1217,11 +1218,40 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val);
int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val);
int pcie_capability_write_word(struct pci_dev *dev, int pos, u16 val);
int pcie_capability_write_dword(struct pci_dev *dev, int pos, u32 val);
-int pcie_capability_clear_and_set_word(struct pci_dev *dev, int pos,
- u16 clear, u16 set);
+int pcie_capability_clear_and_set_word_unlocked(struct pci_dev *dev, int pos,
+ u16 clear, u16 set);
+int pcie_capability_clear_and_set_word_locked(struct pci_dev *dev, int pos,
+ u16 clear, u16 set);
int pcie_capability_clear_and_set_dword(struct pci_dev *dev, int pos,
u32 clear, u32 set);
+/**
+ * pcie_capability_clear_and_set_word - RMW accessor for PCI Express Capability Registers
+ * @dev: PCI device structure of the PCI Express device
+ * @pos: PCI Express Capability Register
+ * @clear: Clear bitmask
+ * @set: Set bitmask
+ *
+ * Perform a Read-Modify-Write (RMW) operation using @clear and @set
+ * bitmasks on PCI Express Capability Register at @pos. Certain PCI Express
+ * Capability Registers are accessed concurrently in RMW fashion, hence
+ * require locking which is handled transparently to the caller.
+ */
+static inline int pcie_capability_clear_and_set_word(struct pci_dev *dev,
+ int pos,
+ u16 clear, u16 set)
+{
+ switch (pos) {
+ case PCI_EXP_LNKCTL:
+ case PCI_EXP_RTCTL:
+ return pcie_capability_clear_and_set_word_locked(dev, pos,
+ clear, set);
+ default:
+ return pcie_capability_clear_and_set_word_unlocked(dev, pos,
+ clear, set);
+ }
+}
+
static inline int pcie_capability_set_word(struct pci_dev *dev, int pos,
u16 set)
{
--
2.30.2
Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.
However, the vfs doesn't block mode changes on symlinks. This has led
to an untenable mess roughly classifiable into the following two
categories:
(1) Filesystems that don't implement an i_op->setattr() for symlinks.
Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.
That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.
Example: afs
(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.
Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.
The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);
Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.
When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.
Perversely, this has the consequence that this behavior may depend
on two kconfig options and mount options:
* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl
Example: btrfs, ext4, xfs
The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:
fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);
But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.
So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.
If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.
But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.
We could block this in chmod_common() but it's more appropriate to do it
in notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidentally.
Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fc… [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Cc: stable(a)vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Suggested-by: Florian Weimer <fweimer(a)redhat.com>
Reviewed-by: Aleksa Sarai <cyphar(a)cyphar.com>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
Changes in v2:
- declarations first...
- Link to v1: https://lore.kernel.org/r/20230712-vfs-chmod-symlinks-v1-1-27921df6011f@ker…
---
---
fs/attr.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index d60dc1edb526..775bf77ce16f 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -394,9 +394,11 @@ int notify_change(struct mnt_idmap *idmap, struct dentry *dentry,
return error;
if ((ia_valid & ATTR_MODE)) {
- umode_t amode = attr->ia_mode;
+ if (S_ISLNK(inode->i_mode))
+ return -EOPNOTSUPP;
+
/* Flag setting protected by i_mutex */
- if (is_sxid(amode))
+ if (is_sxid(attr->ia_mode))
inode->i_flags &= ~S_NOSEC;
}
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230712-vfs-chmod-symlinks-a3ad1923a992
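For reference, the O_PATH trick quoted above can be turned into a
standalone program; a minimal sketch (the link path is hypothetical)
which, after this patch, fails with EOPNOTSUPP instead of silently
changing the cached mode:

  #include <errno.h>
  #include <fcntl.h>
  #include <limits.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>

  int main(void)
  {
          /* O_PATH | O_NOFOLLOW yields a handle to the symlink itself */
          int fd = open("/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
          char path[PATH_MAX];

          if (fd < 0)
                  return 1;
          snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
          if (chmod(path, 0000) < 0)      /* now rejected in notify_change() */
                  fprintf(stderr, "chmod: %s\n", strerror(errno));
          return 0;
  }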
Conversion from big-endian to native is already done in the common
function mmc_app_send_scr(), so converting again in moxart_transfer_pio()
is redundant. The double conversion on a little-endian system returns an
incorrect SCR value and leads to errors:
mmc0: unrecognised SCR structure version 8
Fixes: 1b66e94e6b99 ("mmc: moxart: Add MOXA ART SD/MMC driver")
Signed-off-by: Sergei Antonov <saproj(a)gmail.com>
Cc: Jonas Jensen <jonas.jensen(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
drivers/mmc/host/moxart-mmc.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/drivers/mmc/host/moxart-mmc.c b/drivers/mmc/host/moxart-mmc.c
index 2d002c81dcf3..d0d6ffcf78d4 100644
--- a/drivers/mmc/host/moxart-mmc.c
+++ b/drivers/mmc/host/moxart-mmc.c
@@ -338,13 +338,7 @@ static void moxart_transfer_pio(struct moxart_host *host)
return;
}
for (len = 0; len < remain && len < host->fifo_width;) {
- /* SCR data must be read in big endian. */
- if (data->mrq->cmd->opcode == SD_APP_SEND_SCR)
- *sgp = ioread32be(host->base +
- REG_DATA_WINDOW);
- else
- *sgp = ioread32(host->base +
- REG_DATA_WINDOW);
+ *sgp = ioread32(host->base + REG_DATA_WINDOW);
sgp++;
len += 4;
}
--
2.37.2
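To see why the double conversion corrupts the value, consider a
standalone illustration (the SCR word below is made up): a second byte
swap on a little-endian CPU undoes the first one and restores the raw
byte order, which is wrong as a native value.

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t swap32(uint32_t v)
  {
          return (v >> 24) | ((v >> 8) & 0x0000ff00) |
                 ((v << 8) & 0x00ff0000) | (v << 24);
  }

  int main(void)
  {
          uint32_t raw = 0x02350000;     /* hypothetical big-endian SCR word */
          uint32_t once = swap32(raw);   /* single conversion: correct native value */
          uint32_t twice = swap32(once); /* double conversion: raw again, wrong */

          printf("correct: %08x, double-converted: %08x\n", once, twice);
          return 0;
  }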
Hi Greg, Sasha,
[ this is v2, using "Upstream commit tag XYZ" as suggested by Salvatore Bonaccorso,
only this cosmetic change in this batch. ]
The following list shows the backported patches; I am using the original
commit IDs for reference:
1) 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase")
2) 3e70489721b6 ("netfilter: nf_tables: unbind non-anonymous set if rule construction fails")
Please, apply.
Thanks.
Pablo Neira Ayuso (2):
netfilter: nf_tables: drop map element references from preparation phase
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
include/net/netfilter/nf_tables.h | 5 +-
net/netfilter/nf_tables_api.c | 147 ++++++++++++++++++++++++++----
net/netfilter/nft_set_bitmap.c | 5 +-
net/netfilter/nft_set_hash.c | 23 ++++-
net/netfilter/nft_set_pipapo.c | 14 ++-
net/netfilter/nft_set_rbtree.c | 5 +-
6 files changed, 168 insertions(+), 31 deletions(-)
--
2.30.2
If the system is busy during the command status polling function, the
driver may not get the chance to poll the status register before the
timeout expires and will return a premature status. Do a final check
after the timeout to ensure the correct status is read.
Fixes: 9d2ee0a60b8b ("mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program")
Signed-off-by: William Zhang <william.zhang(a)broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli(a)broadcom.com>
Cc: stable(a)vger.kernel.org
---
Changes in v4:
- Update comment in the polling status function
- Add cc stable tag
Changes in v3: None
Changes in v2: None
drivers/mtd/nand/raw/brcmnand/brcmnand.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/mtd/nand/raw/brcmnand/brcmnand.c b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
index 9ea96911d16b..9a373a10304d 100644
--- a/drivers/mtd/nand/raw/brcmnand/brcmnand.c
+++ b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
@@ -1080,6 +1080,14 @@ static int bcmnand_ctrl_poll_status(struct brcmnand_controller *ctrl,
cpu_relax();
} while (time_after(limit, jiffies));
+ /*
+ * do a final check after time out in case the CPU was busy and the driver
+ * did not get enough time to perform the polling to avoid false alarms
+ */
+ val = brcmnand_read_reg(ctrl, BRCMNAND_INTFC_STATUS);
+ if ((val & mask) == expected_val)
+ return 0;
+
dev_warn(ctrl->dev, "timeout on status poll (expected %x got %x)\n",
expected_val, val & mask);
--
2.37.3
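The pattern generalizes beyond this driver: a polling loop that gets
preempted past its deadline can miss a condition that did become true in
time. A minimal sketch of the idiom, with read_status(), mask, expected
and timeout as placeholders:

  unsigned long limit = jiffies + timeout;
  u32 val;

  do {
          val = read_status();            /* hypothetical register read */
          if ((val & mask) == expected)
                  return 0;
          cpu_relax();
  } while (time_after(limit, jiffies));

  /* Sample once more: the CPU may have been busy past the deadline while
   * the condition was already satisfied.
   */
  val = read_status();
  if ((val & mask) == expected)
          return 0;

  return -ETIMEDOUT;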
When executing a NAND command within the panic write path, wait for any
pending command instead of calling BUG_ON to avoid crashing while
already crashing.
Fixes: 27c5b17cd1b1 ("mtd: nand: add NAND driver "library" for Broadcom STB NAND controller")
Signed-off-by: William Zhang <william.zhang(a)broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli(a)broadcom.com>
Reviewed-by: Kursad Oney <kursad.oney(a)broadcom.com>
Reviewed-by: Kamal Dasu <kamal.dasu(a)broadcom.com>
Cc: stable(a)vger.kernel.org
---
Changes in v4:
- Update commit message
- Add cc stable tag
Changes in v3: None
Changes in v2: None
drivers/mtd/nand/raw/brcmnand/brcmnand.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/mtd/nand/raw/brcmnand/brcmnand.c b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
index 9a373a10304d..b2c6396060db 100644
--- a/drivers/mtd/nand/raw/brcmnand/brcmnand.c
+++ b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
@@ -1608,7 +1608,17 @@ static void brcmnand_send_cmd(struct brcmnand_host *host, int cmd)
dev_dbg(ctrl->dev, "send native cmd %d addr 0x%llx\n", cmd, cmd_addr);
- BUG_ON(ctrl->cmd_pending != 0);
+ /*
+ * If we came here through _panic_write and there is a pending
+ * command, try to wait for it. If it times out, rather than
+ * hitting BUG_ON, just return so we don't crash while crashing.
+ */
+ if (oops_in_progress) {
+ if (ctrl->cmd_pending &&
+ bcmnand_ctrl_poll_status(ctrl, NAND_CTRL_RDY, NAND_CTRL_RDY, 0))
+ return;
+ } else
+ BUG_ON(ctrl->cmd_pending != 0);
ctrl->cmd_pending = cmd;
ret = bcmnand_ctrl_poll_status(ctrl, NAND_CTRL_RDY, NAND_CTRL_RDY, 0);
--
2.37.3
When the oob buffer length is not a multiple of words, the oob write
function does an out-of-bounds read on the oob source buffer at the last
iteration. Fix that by always checking the length limit when reading the
oob buffer and filling the remaining bytes with 0xff when writing to the
oob registers.
Fixes: 27c5b17cd1b1 ("mtd: nand: add NAND driver "library" for Broadcom STB NAND controller")
Signed-off-by: William Zhang <william.zhang(a)broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli(a)broadcom.com>
Cc: stable(a)vger.kernel.org
---
Changes in v4:
- Add comments in the write_oob_to_regs function
- Add cc stable tag
Changes in v3:
- Fix kernel test robot sparse warning:
drivers/mtd/nand/raw/brcmnand/brcmnand.c:1500:54: sparse: expected unsigned int [usertype] data
drivers/mtd/nand/raw/brcmnand/brcmnand.c:1500:54: sparse: got restricted __be32 [usertype]
Changes in v2:
- Handle the remaining unaligned oob data after the oob data write loop
drivers/mtd/nand/raw/brcmnand/brcmnand.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/mtd/nand/raw/brcmnand/brcmnand.c b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
index b2c6396060db..71d0ba652bee 100644
--- a/drivers/mtd/nand/raw/brcmnand/brcmnand.c
+++ b/drivers/mtd/nand/raw/brcmnand/brcmnand.c
@@ -1477,19 +1477,33 @@ static int write_oob_to_regs(struct brcmnand_controller *ctrl, int i,
const u8 *oob, int sas, int sector_1k)
{
int tbytes = sas << sector_1k;
- int j;
+ int j, k = 0;
+ u32 last = 0xffffffff;
+ u8 *plast = (u8 *)&last;
/* Adjust OOB values for 1K sector size */
if (sector_1k && (i & 0x01))
tbytes = max(0, tbytes - (int)ctrl->max_oob);
tbytes = min_t(int, tbytes, ctrl->max_oob);
- for (j = 0; j < tbytes; j += 4)
+ /*
+ * tbytes may not be a multiple of words. Make sure we don't read out of
+ * the boundary and stop at last word.
+ */
+ for (j = 0; (j + 3) < tbytes; j += 4)
oob_reg_write(ctrl, j,
(oob[j + 0] << 24) |
(oob[j + 1] << 16) |
(oob[j + 2] << 8) |
(oob[j + 3] << 0));
+
+ /* handle the remaining bytes */
+ while (j < tbytes)
+ plast[k++] = oob[j++];
+
+ if (tbytes & 0x3)
+ oob_reg_write(ctrl, (tbytes & ~0x3), (__force u32)cpu_to_be32(last));
+
return tbytes;
}
--
2.37.3
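A standalone walk-through of the new loop bound (buffer contents made
up): for tbytes = 10 the loop now stops after two full words (j = 0, 4)
and the trailing two bytes land in a 0xff-padded final word instead of
reading oob[10] and oob[11] out of bounds.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint8_t oob[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
          int tbytes = sizeof(oob), j, k = 0;
          uint32_t last = 0xffffffff;
          uint8_t *plast = (uint8_t *)&last;

          /* full words only: (j + 3) < tbytes stops before a partial word */
          for (j = 0; (j + 3) < tbytes; j += 4)
                  printf("word %d: %02x %02x %02x %02x\n", j / 4,
                         oob[j], oob[j + 1], oob[j + 2], oob[j + 3]);

          /* tail bytes; the 0xff initializer provides the padding */
          while (j < tbytes)
                  plast[k++] = oob[j++];
          if (tbytes & 0x3)
                  printf("tail word: %02x %02x %02x %02x\n",
                         plast[0], plast[1], plast[2], plast[3]);
          return 0;
  }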
Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
running a preemptible kernel, as it is possible that a vCPU is blocked
without requesting a doorbell interrupt.
The issue is that any preemption that occurs between vgic_v4_put() and
schedule() on the block path will mark the vPE as nonresident and *not*
request a doorbell irq.
Fix it by consistently requesting a doorbell irq in the vcpu put path if
the vCPU is blocking. While this technically means we could drop the
early doorbell irq request in kvm_vcpu_wfi(), deliberately leave it
intact such that vCPU halt polling can properly detect the wakeup
condition before actually scheduling out a vCPU.
Cc: stable(a)vger.kernel.org
Fixes: 8e01d9a396e6 ("KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put")
Reported-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Tested-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
arch/arm64/kvm/vgic/vgic-v3.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index c3b8e132d599..8c467e9f4f11 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -749,7 +749,7 @@ void vgic_v3_put(struct kvm_vcpu *vcpu)
{
struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
- WARN_ON(vgic_v4_put(vcpu, false));
+ WARN_ON(vgic_v4_put(vcpu, kvm_vcpu_is_blocking(vcpu)));
vgic_v3_vmcr_sync(vcpu);
--
2.41.0.255.g8b1d071c50-goog
This is the start of the stable review cycle for the 6.4.3 release.
There are 6 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Tue, 11 Jul 2023 20:38:10 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.3-rc2.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.4.3-rc2
Suren Baghdasaryan <surenb(a)google.com>
fork: lock VMAs of the parent process when forking
Liu Shixin <liushixin2(a)huawei.com>
bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
Peter Collingbourne <pcc(a)google.com>
mm: call arch_swap_restore() from do_swap_page()
Hugh Dickins <hughd(a)google.com>
mm: lock newly mapped VMA with corrected ordering
Suren Baghdasaryan <surenb(a)google.com>
mm: lock newly mapped VMA which can be modified after it becomes visible
Suren Baghdasaryan <surenb(a)google.com>
mm: lock a vma before stack expansion
-------------
Diffstat:
Makefile | 4 ++--
include/linux/bootmem_info.h | 2 ++
kernel/fork.c | 1 +
mm/memory.c | 7 +++++++
mm/mmap.c | 6 ++++++
5 files changed, 18 insertions(+), 2 deletions(-)
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 51 insertions(+), 24 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4a28ce0667f..1b67bf341c32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1580,9 +1580,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
unsigned int order) { }
#endif
+static inline void __clear_hugetlb_destructor(struct hstate *h,
+ struct folio *folio)
+{
+ lockdep_assert_held(&hugetlb_lock);
+
+ /*
+ * Very subtle
+ *
+ * For non-gigantic pages set the destructor to the normal compound
+ * page dtor. This is needed in case someone takes an additional
+ * temporary ref to the page, and freeing is delayed until they drop
+ * their reference.
+ *
+ * For gigantic pages set the destructor to the null dtor. This
+ * destructor will never be called. Before freeing the gigantic
+ * page destroy_compound_gigantic_folio will turn the folio into a
+ * simple group of pages. After this the destructor does not
+ * apply.
+ *
+ */
+ if (hstate_is_gigantic(h))
+ folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
+ else
+ folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+}
+
/*
- * Remove hugetlb folio from lists, and update dtor so that the folio appears
- * as just a compound page.
+ * Remove hugetlb folio from lists.
+ * If vmemmap exists for the folio, update dtor so that the folio appears
+ * as just a compound page. Otherwise, wait until after allocating vmemmap
+ * to update dtor.
*
* A reference is held on the folio, except in the case of demote.
*
@@ -1613,31 +1641,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
}
/*
- * Very subtle
- *
- * For non-gigantic pages set the destructor to the normal compound
- * page dtor. This is needed in case someone takes an additional
- * temporary ref to the page, and freeing is delayed until they drop
- * their reference.
- *
- * For gigantic pages set the destructor to the null dtor. This
- * destructor will never be called. Before freeing the gigantic
- * page destroy_compound_gigantic_folio will turn the folio into a
- * simple group of pages. After this the destructor does not
- * apply.
- *
- * This handles the case where more than one ref is held when and
- * after update_and_free_hugetlb_folio is called.
- *
- * In the case of demote we do not ref count the page as it will soon
- * be turned into a page of smaller size.
+ * We can only clear the hugetlb destructor after allocating vmemmap
+ * pages. Otherwise, someone (memory error handling) may try to write
+ * to tail struct pages.
+ */
+ if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ __clear_hugetlb_destructor(h, folio);
+
+ /*
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
if (!demote)
folio_ref_unfreeze(folio, 1);
- if (hstate_is_gigantic(h))
- folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
- else
- folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1706,6 +1722,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
{
int i;
struct page *subpage;
+ bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
@@ -1736,6 +1753,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
if (unlikely(folio_test_hwpoison(folio)))
folio_clear_hugetlb_hwpoison(folio);
+ /*
+ * If vmemmap pages were allocated above, then we need to clear the
+ * hugetlb destructor under the hugetlb lock.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
for (i = 0; i < pages_per_huge_page(h); i++) {
subpage = folio_page(folio, i);
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
--
2.41.0
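For orientation, the two-step free path described above has roughly this
shape (simplified sketch, argument values illustrative):

  /* Step 1: under hugetlb_lock, unlink the folio and take a ref. With
   * this patch the destructor is cleared here only if vmemmap is still
   * present.
   */
  spin_lock_irq(&hugetlb_lock);
  remove_hugetlb_folio(h, folio, false);
  spin_unlock_irq(&hugetlb_lock);

  /* Step 2: outside the lock, allocate vmemmap if needed; if it was
   * allocated, re-take the lock to clear the destructor, then hand the
   * pages back to the underlying allocator.
   */
  update_and_free_hugetlb_folio(h, folio, true);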
Hist triggers can have referenced variables without having direct
variable fields. This can be the case if referenced variables are added
for trigger actions. In this case the newly added references will not
have field variables. Not taking such referenced variables into
consideration makes it possible to remove a hist trigger whose variables
are still being referenced. This results in a bug that is easily
reproducible like so:
$ cd /sys/kernel/tracing
$ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events
$ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
$ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger
$ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
[ 100.263533] ==================================================================
[ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180
[ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439
[ 100.266320]
[ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4
[ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 100.268561] Call Trace:
[ 100.268902] <TASK>
[ 100.269189] dump_stack_lvl+0x4c/0x70
[ 100.269680] print_report+0xc5/0x600
[ 100.270165] ? resolve_var_refs+0xc7/0x180
[ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0
[ 100.271389] ? resolve_var_refs+0xc7/0x180
[ 100.271913] kasan_report+0xbd/0x100
[ 100.272380] ? resolve_var_refs+0xc7/0x180
[ 100.272920] __asan_load8+0x71/0xa0
[ 100.273377] resolve_var_refs+0xc7/0x180
[ 100.273888] event_hist_trigger+0x749/0x860
[ 100.274505] ? kasan_save_stack+0x2a/0x50
[ 100.275024] ? kasan_set_track+0x29/0x40
[ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10
[ 100.276138] ? ksys_write+0xd1/0x170
[ 100.276607] ? do_syscall_64+0x3c/0x90
[ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.277771] ? destroy_hist_data+0x446/0x470
[ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860
[ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10
[ 100.279627] ? __kasan_check_write+0x18/0x20
[ 100.280177] ? mutex_unlock+0x85/0xd0
[ 100.280660] ? __pfx_mutex_unlock+0x10/0x10
[ 100.281200] ? kfree+0x7b/0x120
[ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0
[ 100.282197] ? event_trigger_write+0xac/0x100
[ 100.282764] ? __kasan_slab_free+0x16/0x20
[ 100.283293] ? __kmem_cache_free+0x153/0x2f0
[ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250
[ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10
[ 100.285221] ? event_trigger_write+0xbc/0x100
[ 100.285781] ? __kasan_check_read+0x15/0x20
[ 100.286321] ? __bitmap_weight+0x66/0xa0
[ 100.286833] ? _find_next_bit+0x46/0xe0
[ 100.287334] ? task_mm_cid_work+0x37f/0x450
[ 100.287872] event_triggers_call+0x84/0x150
[ 100.288408] trace_event_buffer_commit+0x339/0x430
[ 100.289073] ? ring_buffer_event_data+0x3f/0x60
[ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0
[ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0
[ 100.298653] syscall_enter_from_user_mode+0x32/0x40
[ 100.301808] do_syscall_64+0x1a/0x90
[ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.307775] RIP: 0033:0x7f686c75c1cb
[ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48
[ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021
[ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb
[ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a
[ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a
[ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007
[ 100.338381] </TASK>
We hit the bug because when the second hist trigger was created,
has_hist_vars() returned false since the hist trigger did not have
variables. As a result, save_hist_vars() was not called to add the
trigger to trace_array->hist_vars. Later on, when we attempted to remove
the first histogram, find_any_var_ref() failed to detect that it was
still in use because it did not find the second trigger in the hist_vars
list.
With this change we wait until trigger actions are created so we can
take into consideration whether the hist trigger has variable references.
Also, we now check the return value of save_hist_vars() and fail trigger
creation if save_hist_vars() fails.
Signed-off-by: Mohamed Khalfella <mkhalfella(a)purestorage.com>
Cc: stable(a)vger.kernel.org
---
kernel/trace/trace_events_hist.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index b97d3ad832f1..c8c61381eba4 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -6663,13 +6663,15 @@ static int event_hist_trigger_parse(struct event_command *cmd_ops,
if (get_named_trigger_data(trigger_data))
goto enable;
- if (has_hist_vars(hist_data))
- save_hist_vars(hist_data);
-
ret = create_actions(hist_data);
if (ret)
goto out_unreg;
+ if (has_hist_vars(hist_data) || hist_data->n_var_refs) {
+ if (save_hist_vars(hist_data))
+ goto out_unreg;
+ }
+
ret = tracing_map_init(hist_data->map);
if (ret)
goto out_unreg;
--
2.34.1
AutoRetry has been found to cause issues in this controller revision.
This feature allows the controller to send non-terminating/burst retry
ACKs (Retry=1 and Nump!=0) as opposed to terminating retry ACKs
(Retry=1 and Nump=0) to devices when a transaction error occurs.
Unfortunately, some USB devices do not retry transfers when
the controller sends them a non-terminating retry ACK. After the transfer
times out, the xHCI driver tries to resume normal operation of the
controller by sending a Stop Endpoint command to it. However, this
revision of the controller fails to respond to that command in this
state and the xHCI driver therefore gives up. This is reported via dmesg:
[sda] tag#29 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[sda] tag#29 CDB: opcode=0x28 28 00 00 69 42 80 00 00 48 00
xhci-hcd: xHCI host not responding to stop endpoint command
xhci-hcd: xHCI host controller not responding, assume dead
xhci-hcd: HC died; cleaning up
This problem has been observed on Odroid HC2. This board has
an integrated JMS578 USB3-to-SATA bridge. The above problem could be
triggered by starting a read-heavy workload on an attached SSD.
After a while, the host controller would die and the SSD would disappear
from the system.
Reverting b138e23d3dff ("usb: dwc3: core: Enable AutoRetry feature in
the controller") stopped the issue from occurring on Odroid HC2.
As this problem hasn't been reported on other devices, disable
AutoRetry just for the dwc_usb3 revision v2.00a that the board SoC
(Exynos 5422) uses.
Fixes: b138e23d3dff ("usb: dwc3: core: Enable AutoRetry feature in the controller")
Link: https://lore.kernel.org/r/a21f34c04632d250cd0a78c7c6f4a1c9c7a43142.camel@gm…
Link: https://forum.odroid.com/viewtopic.php?t=42630
Cc: stable(a)vger.kernel.org
Cc: Mauro Ribeiro <mauro.ribeiro(a)hardkernel.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Suggested-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Signed-off-by: Jakub Vanek <linuxtardis(a)gmail.com>
---
drivers/usb/dwc3/core.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index d68958e151a7..d742fc882d7e 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1218,9 +1218,15 @@ static int dwc3_core_init(struct dwc3 *dwc)
* Host mode on seeing transaction errors(CRC errors or internal
* overrun scenerios) on IN transfers to reply to the device
* with a non-terminating retry ACK (i.e, an ACK transcation
- * packet with Retry=1 & Nump != 0)
+ * packet with Retry=1 & Nump != 0).
+ *
+ * Do not enable this for DWC_usb3 v2.00a. This controller
+ * revision stops responding to Stop Endpoint commands if
+ * a USB device does not retry after a non-terminating retry
+ * ACK is sent to it.
*/
- reg |= DWC3_GUCTL_HSTINAUTORETRY;
+ if (!DWC3_VER_IS(DWC3, 200A))
+ reg |= DWC3_GUCTL_HSTINAUTORETRY;
dwc3_writel(dwc->regs, DWC3_GUCTL, reg);
}
--
2.25.1
When allocating the 2D array for handling IRQ type registers in
regmap_add_irq_chip_fwnode(), the intent is to allocate a matrix
with num_config_bases rows and num_config_regs columns.
This is currently handled by allocating a buffer to hold a pointer for
each row (i.e. num_config_bases). After that, the logic attempts to
allocate the memory required to hold the register configuration for
each row. However, instead of doing this allocation for each row
(i.e. num_config_bases allocations), the logic erroneously does this
allocation num_config_regs number of times.
This scenario can lead to out-of-bounds accesses when num_config_regs
is greater than num_config_bases. Fix this by updating the terminating
condition of the loop that allocates the memory for holding the register
configuration to allocate memory only for each row in the matrix.
Amit Pundir reported a crash that was occurring on his db845c device
due to memory corruption (see "Closes" tag for Amit's report). The KASAN
report below helped narrow it down to this issue:
[ 14.033877][ T1] ==================================================================
[ 14.042507][ T1] BUG: KASAN: invalid-access in regmap_add_irq_chip_fwnode+0x594/0x1364
[ 14.050796][ T1] Write of size 8 at addr 06ffff8081021850 by task init/1
[ 14.057841][ T1] Pointer tag: [06], memory tag: [fe]
[ 14.063124][ T1]
[ 14.065349][ T1] CPU: 2 PID: 1 Comm: init Tainted: G W E 6.4.0-mainline-g6a4b67fef3e2 #1
[ 14.075014][ T1] Hardware name: Thundercomm Dragonboard 845c (DT)
[ 14.081432][ T1] Call trace:
[ 14.084618][ T1] dump_backtrace+0xe8/0x108
[ 14.089144][ T1] show_stack+0x18/0x30
[ 14.093215][ T1] dump_stack_lvl+0x50/0x6c
[ 14.097642][ T1] print_report+0x178/0x4c0
[ 14.102070][ T1] kasan_report+0xd4/0x12c
[ 14.106407][ T1] kasan_tag_mismatch+0x28/0x40
[ 14.111178][ T1] __hwasan_tag_mismatch+0x2c/0x5c
[ 14.116222][ T1] regmap_add_irq_chip_fwnode+0x594/0x1364
[ 14.121961][ T1] devm_regmap_add_irq_chip+0xb8/0x144
[ 14.127346][ T1] wcd934x_slim_status+0x210/0x28c [wcd934x]
[ 14.133307][ T1] slim_device_alloc_laddr+0x1ac/0x1ec [slimbus]
[ 14.139669][ T1] slim_device_probe+0x80/0x124 [slimbus]
[ 14.145394][ T1] really_probe+0x250/0x4d8
[ 14.149826][ T1] __driver_probe_device+0x104/0x1ac
[ 14.155041][ T1] driver_probe_device+0x80/0x218
[ 14.159990][ T1] __driver_attach+0x19c/0x2e4
[ 14.164678][ T1] bus_for_each_dev+0x158/0x1b4
[ 14.169454][ T1] driver_attach+0x34/0x44
[ 14.173790][ T1] bus_add_driver+0x1fc/0x328
[ 14.178390][ T1] driver_register+0xdc/0x1b4
[ 14.182995][ T1] __slim_driver_register+0x6c/0x84 [slimbus]
[ 14.189068][ T1] init_module+0x20/0xfe4 [wcd934x]
[ 14.194219][ T1] do_one_initcall+0x110/0x418
[ 14.198916][ T1] do_init_module+0x124/0x30c
[ 14.203521][ T1] load_module+0x1938/0x1ab0
[ 14.208034][ T1] __arm64_sys_finit_module+0x110/0x138
[ 14.213509][ T1] invoke_syscall+0x70/0x170
[ 14.218015][ T1] el0_svc_common+0xf0/0x138
[ 14.222523][ T1] do_el0_svc+0x40/0xb8
[ 14.226596][ T1] el0_svc+0x2c/0x78
[ 14.230405][ T1] el0t_64_sync_handler+0x68/0xb4
[ 14.235354][ T1] el0t_64_sync+0x19c/0x1a0
[ 14.239778][ T1]
[ 14.242004][ T1] The buggy address belongs to the object at ffffff8081021850
[ 14.242004][ T1] which belongs to the cache kmalloc-8 of size 8
[ 14.255669][ T1] The buggy address is located 0 bytes inside of
[ 14.255669][ T1] 8-byte region [ffffff8081021850, ffffff8081021858)
[ 14.255685][ T1]
[ 14.255689][ T1] The buggy address belongs to the physical page:
[ 14.255699][ T1] page:0000000080887a30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x85ffff8081021ee0 pfn:0x101021
[ 14.275062][ T1] flags: 0x4000000000000200(slab|zone=1|kasantag=0x0)
[ 14.275078][ T1] page_type: 0xffffffff()
[ 14.275091][ T1] raw: 4000000000000200 49ffff8080002200 dead000000000122 0000000000000000
[ 14.275103][ T1] raw: 85ffff8081021ee0 00000000810000ea 00000001ffffffff 0000000000000000
[ 14.275110][ T1] page dumped because: kasan: bad access detected
[ 14.275116][ T1]
[ 14.275119][ T1] Memory state around the buggy address:
[ 14.275125][ T1] ffffff8081021600: fe fe fe 9a fe fe 5b 3c fe fe c9 fe b4 3f fe 54
[ 14.275133][ T1] ffffff8081021700: fe fe ad 6b fe fe fe 87 fe fe 39 c9 fe 03 fe ea
[ 14.275143][ T1] >ffffff8081021800: fe fe e1 fe 06 fe fe 21 fe fe e7 fe de fe fe 70
[ 14.275149][ T1] ^
[ 14.371674][ T1] ffffff8081021900: d7 fe fe 87 fe a0 fe fe fe 80 e0 f0 05 fe fe fe
[ 14.379667][ T1] ffffff8081021a00: 94 fe 31 fe fe e5 c8 00 d0 fe a1 fe fe e2 e5 fe
[ 14.387664][ T1] ==================================================================
Fixes: faa87ce9196d ("regmap-irq: Introduce config registers for irq types")
Reported-by: Amit Pundir <amit.pundir(a)linaro.org>
Closes: https://lore.kernel.org/all/CAMi1Hd04mu6JojT3y6wyN2YeVkPR5R3qnkKJ8iR8if_YBy…
Tested-by: John Stultz <jstultz(a)google.com>
Cc: stable(a)vger.kernel.org # v6.0+
Cc: Aidan MacDonald <aidanmacdonald.0x0(a)gmail.com>
Cc: Saravana Kannan <saravanak(a)google.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com>
---
drivers/base/regmap/regmap-irq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/regmap/regmap-irq.c b/drivers/base/regmap/regmap-irq.c
index ced0dcf86e0b..45fd13ef13fc 100644
--- a/drivers/base/regmap/regmap-irq.c
+++ b/drivers/base/regmap/regmap-irq.c
@@ -717,7 +717,7 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
if (!d->config_buf)
goto err_alloc;
- for (i = 0; i < chip->num_config_regs; i++) {
+ for (i = 0; i < chip->num_config_bases; i++) {
d->config_buf[i] = kcalloc(chip->num_config_regs,
sizeof(**d->config_buf),
GFP_KERNEL);
--
2.41.0.255.g8b1d071c50-goog
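The intended allocation shape, shown in isolation (field names reused
from the driver, error handling abridged): one row pointer per config
base, each row sized for num_config_regs entries, which is exactly the
loop bound the fix restores.

  unsigned int **config_buf;
  int i;

  config_buf = kcalloc(chip->num_config_bases, sizeof(*config_buf),
                       GFP_KERNEL);
  if (!config_buf)
          return -ENOMEM;

  /* one allocation per row (base), each num_config_regs entries wide */
  for (i = 0; i < chip->num_config_bases; i++) {
          config_buf[i] = kcalloc(chip->num_config_regs,
                                  sizeof(**config_buf), GFP_KERNEL);
          if (!config_buf[i])
                  goto err_alloc; /* unwind rows allocated so far */
  }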
The Winbond W25Q128 (actual vendor name W25Q128JV)
has exactly the same flags as the sibling device
w25q128fw. The devices both require unlocking and
support dual and quad SPI transport.
The actual product naming between devices:
0xef4018: "w25q128" W25Q128JV-IM/JM
0xef7018: "w25q128fw" W25Q128JV-IN/IQ/JQ
The latter device, "w25q128fw", supports features
named DTQ and QPI; otherwise it is the same.
Not having the right flags has the annoying side
effect that write access does not work.
After this patch I can write to the flash on the
Inteno XG6846 router.
Cc: stable(a)vger.kernel.org
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
drivers/mtd/spi-nor/winbond.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/mtd/spi-nor/winbond.c b/drivers/mtd/spi-nor/winbond.c
index 834d6ba5ce70..a67e1d4206f3 100644
--- a/drivers/mtd/spi-nor/winbond.c
+++ b/drivers/mtd/spi-nor/winbond.c
@@ -121,7 +121,9 @@ static const struct flash_info winbond_nor_parts[] = {
{ "w25q80bl", INFO(0xef4014, 0, 64 * 1024, 16)
NO_SFDP_FLAGS(SECT_4K) },
{ "w25q128", INFO(0xef4018, 0, 64 * 1024, 256)
- NO_SFDP_FLAGS(SECT_4K) },
+ FLAGS(SPI_NOR_HAS_LOCK | SPI_NOR_HAS_TB)
+ NO_SFDP_FLAGS(SECT_4K | SPI_NOR_DUAL_READ |
+ SPI_NOR_QUAD_READ) },
{ "w25q256", INFO(0xef4019, 0, 64 * 1024, 512)
NO_SFDP_FLAGS(SECT_4K | SPI_NOR_DUAL_READ | SPI_NOR_QUAD_READ)
.fixups = &w25q256_fixups },
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230711-spi-nor-winbond-w25q128-321a602ee267
Best regards,
--
Linus Walleij <linus.walleij(a)linaro.org>
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest
memory. This is even the case with hugepages, although it is a rather
suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified
range is exactly one page/block of memory, which in the aforementioned
case is not necessarily true. Altogether, this leads to the WARN() in
kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit
at most a single leaf PTE. For hugepage-backed memory in a memslot that
isn't hugepage-aligned, page aging entirely misses accesses to the
hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable
of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new
walker and drop the WARN that caught the issue in the first place. The
implementation of this walker was inspired by the test_clear_young()
implementation by Yu Zhao [*], but repurposed to address a bug in the
existing aging implementation.
Cc: stable(a)vger.kernel.org # v5.15
Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/
Co-developed-by: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Reiji Watanabe <reijiw(a)google.com>
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
arch/arm64/include/asm/kvm_pgtable.h | 26 ++++++------------
arch/arm64/kvm/hyp/pgtable.c | 41 ++++++++++++++++++++++------
arch/arm64/kvm/mmu.c | 18 ++++++------
3 files changed, 49 insertions(+), 36 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dc3c072e862f..75f437e8cd15 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -556,22 +556,26 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr);
/**
- * kvm_pgtable_stage2_mkold() - Clear the access flag in a page-table entry.
+ * kvm_pgtable_stage2_test_clear_young() - Test and optionally clear the access
+ * flag in a page-table entry.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address to identify the page-table entry.
+ * @size: Size of the address range to visit.
+ * @mkold: True if the access flag should be cleared.
*
* The offset of @addr within a page is ignored.
*
- * If there is a valid, leaf page-table entry used to translate @addr, then
- * clear the access flag in that entry.
+ * Tests and conditionally clears the access flag for every valid, leaf
+ * page-table entry used to translate the range [@addr, @addr + @size).
*
* Note that it is the caller's responsibility to invalidate the TLB after
* calling this function to ensure that the updated permissions are visible
* to the CPUs.
*
- * Return: The old page-table entry prior to clearing the flag, 0 on failure.
+ * Return: True if any of the visited PTEs had the access flag set.
*/
-kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr);
+bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
+ u64 size, bool mkold);
/**
* kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a
@@ -593,18 +597,6 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr);
int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
enum kvm_pgtable_prot prot);
-/**
- * kvm_pgtable_stage2_is_young() - Test whether a page-table entry has the
- * access flag set.
- * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
- * @addr: Intermediate physical address to identify the page-table entry.
- *
- * The offset of @addr within a page is ignored.
- *
- * Return: True if the page-table entry has the access flag set, false otherwise.
- */
-bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
-
/**
* kvm_pgtable_stage2_flush_range() - Clean and invalidate data cache to Point
* of Coherency for guest stage-2 address
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5282cb9ca4cf..5d701e9adf5c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1153,25 +1153,48 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
return pte;
}
-kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
+struct stage2_age_data {
+ bool mkold;
+ bool young;
+};
+
+static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
{
- kvm_pte_t pte = 0;
- stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF,
- &pte, NULL, 0);
+ kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+ struct stage2_age_data *data = ctx->arg;
+
+ if (!kvm_pte_valid(ctx->old) || new == ctx->old)
+ return 0;
+
+ data->young = true;
+
+ if (data->mkold && !stage2_try_set_pte(ctx, new))
+ return -EAGAIN;
+
/*
* "But where's the TLBI?!", you scream.
* "Over in the core code", I sigh.
*
* See the '->clear_flush_young()' callback on the KVM mmu notifier.
*/
- return pte;
+ return 0;
}
-bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr)
+bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
+ u64 size, bool mkold)
{
- kvm_pte_t pte = 0;
- stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL, 0);
- return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF;
+ struct stage2_age_data data = {
+ .mkold = mkold,
+ };
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_age_walker,
+ .arg = &data,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+ return data.young;
}
int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..8a7e9381710e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1639,27 +1639,25 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
- kvm_pte_t kpte;
- pte_t pte;
if (!kvm->arch.mmu.pgt)
return false;
- WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
-
- kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT);
- pte = __pte(kpte);
- return pte_valid(pte) && pte_young(pte);
+ return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, true);
}
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
+ u64 size = (range->end - range->start) << PAGE_SHIFT;
+
if (!kvm->arch.mmu.pgt)
return false;
- return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT);
+ return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, false);
}
phys_addr_t kvm_mmu_get_httbr(void)
base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
--
2.41.0.178.g377b9f9a00-goog
The patch titled
Subject: maple_tree: fix node allocation testing on 32 bit
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-fix-node-allocation-testing-on-32-bit.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: maple_tree: fix node allocation testing on 32 bit
Date: Wed, 12 Jul 2023 13:39:16 -0400
Internal node counting was altered and the 64 bit test was updated;
however, the 32 bit test was missed.
Restore the 32 bit test to a functional state.
Link: https://lore.kernel.org/linux-mm/CAMuHMdV4T53fOw7VPoBgPR7fP6RYqf=CBhD_y_vOg…
Link: https://lkml.kernel.org/r/20230712173916.168805-2-Liam.Howlett@oracle.com
Fixes: 541e06b772c1 ("maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/radix-tree/maple.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/tools/testing/radix-tree/maple.c~maple_tree-fix-node-allocation-testing-on-32-bit
+++ a/tools/testing/radix-tree/maple.c
@@ -206,9 +206,9 @@ static noinline void __init check_new_no
e = i - 1;
} else {
if (i >= 4)
- e = i - 4;
- else if (i == 3)
- e = i - 2;
+ e = i - 3;
+ else if (i >= 1)
+ e = i - 1;
else
e = 0;
}
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
mm-mlock-fix-vma-iterator-conversion-of-apply_vma_lock_flags.patch
maple_tree-fix-32-bit-mas_next-testing.patch
maple_tree-fix-node-allocation-testing-on-32-bit.patch
The patch titled
Subject: maple_tree: fix 32 bit mas_next testing
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-fix-32-bit-mas_next-testing.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: maple_tree: fix 32 bit mas_next testing
Date: Wed, 12 Jul 2023 13:39:15 -0400
The test setup of mas_next is dependent on node entry size to create a 2
level tree, but the tests did not account for this in the expected value
when shifting beyond the scope of the tree.
Fix this by setting up the test to succeed depending on the number of
node entries, which in turn depends on the 32/64 bit setup.
Link: https://lkml.kernel.org/r/20230712173916.168805-1-Liam.Howlett@oracle.com
Fixes: 120b116208a0 ("maple_tree: reorganize testing to restore module testing")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reported-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Link: https://lore.kernel.org/linux-mm/CAMuHMdV4T53fOw7VPoBgPR7fP6RYqf=CBhD_y_vOg…
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/test_maple_tree.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/lib/test_maple_tree.c~maple_tree-fix-32-bit-mas_next-testing
+++ a/lib/test_maple_tree.c
@@ -1898,13 +1898,16 @@ static noinline void __init next_prev_te
725};
static const unsigned long level2_32[] = { 1747, 2000, 1750, 1755,
1760, 1765};
+ unsigned long last_index;
if (MAPLE_32BIT) {
nr_entries = 500;
level2 = level2_32;
+ last_index = 0x138e;
} else {
nr_entries = 200;
level2 = level2_64;
+ last_index = 0x7d6;
}
for (i = 0; i <= nr_entries; i++)
@@ -2011,7 +2014,7 @@ static noinline void __init next_prev_te
val = mas_next(&mas, ULONG_MAX);
MT_BUG_ON(mt, val != NULL);
- MT_BUG_ON(mt, mas.index != 0x7d6);
+ MT_BUG_ON(mt, mas.index != last_index);
MT_BUG_ON(mt, mas.last != ULONG_MAX);
val = mas_prev(&mas, 0);
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
mm-mlock-fix-vma-iterator-conversion-of-apply_vma_lock_flags.patch
maple_tree-fix-32-bit-mas_next-testing.patch
maple_tree-fix-node-allocation-testing-on-32-bit.patch
The patch titled
Subject: selftests/mm: mkdirty: Fix incorrect position of #endif
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
selftests-mm-mkdirty-fix-incorrect-position-of-endif.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Colin Ian King <colin.i.king(a)gmail.com>
Subject: selftests/mm: mkdirty: Fix incorrect position of #endif
Date: Wed, 12 Jul 2023 14:46:48 +0100
The #endif is on the wrong side of a }, causing a build failure when
__NR_userfaultfd is not defined. Fix this by moving the #endif to
enclose the }.
Link: https://lkml.kernel.org/r/20230712134648.456349-1-colin.i.king@gmail.com
Fixes: 9eac40fc0cc7 ("selftests/mm: mkdirty: test behavior of (pte|pmd)_mkdirty on VMAs without write permissions")
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/mkdirty.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/tools/testing/selftests/mm/mkdirty.c~selftests-mm-mkdirty-fix-incorrect-position-of-endif
+++ a/tools/testing/selftests/mm/mkdirty.c
@@ -321,8 +321,8 @@ close_uffd:
munmap:
munmap(dst, pagesize);
free(src);
-#endif /* __NR_userfaultfd */
}
+#endif /* __NR_userfaultfd */
int main(void)
{
_
Patches currently in -mm which might be from colin.i.king(a)gmail.com are
selftests-mm-mkdirty-fix-incorrect-position-of-endif.patch
Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.
However, the vfs doesn't block mode changes on symlinks. This, however,
has led to an untenable mess roughly classifiable into the following
two categories:
(1) Filesystems that don't implement an i_op->setattr() for symlinks.
Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.
That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.
Example: afs
(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.
Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.
The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);
Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.
When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.
Perversely, this has the consequence that this behavior may depend
on two kconfig options and mount options:
* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl
Example: btrfs, ext4, xfs
The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:
fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);
But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.
So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.
If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.
But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.
We could block this in chmod_common() but it's more appropriate to do it
in notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidentally.
Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fc… [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Cc: stable(a)vger.kernel.org # no backport before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Suggested-by: Florian Weimer <fweimer(a)redhat.com>
Reviewed-by: Aleksa Sarai <cyphar(a)cyphar.com>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
---
fs/attr.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/attr.c b/fs/attr.c
index d60dc1edb526..529153ecde25 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -394,6 +394,9 @@ int notify_change(struct mnt_idmap *idmap, struct dentry *dentry,
return error;
if ((ia_valid & ATTR_MODE)) {
+ if (S_ISLNK(inode->i_mode))
+ return -EOPNOTSUPP;
+
umode_t amode = attr->ia_mode;
/* Flag setting protected by i_mutex */
if (is_sxid(amode))
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230712-vfs-chmod-symlinks-a3ad1923a992
From: Zhikai Zhai <zhikai.zhai(a)amd.com>
[WHY]
All pipes will be used when MPC split is enabled on a dcn that has
only 2 pipes. Entering MPO will then trigger the minimal transition,
which needs to reprogram the dcn from 2-pipe MPC split to 2-pipe MPO.
This action will cause lag if it happens frequently.
[HOW]
Disable the MPC split for platforms where dcn resources are limited.
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Alvin Lee <alvin.lee2(a)amd.com>
Acked-by: Alan Liu <haoping.liu(a)amd.com>
Signed-off-by: Zhikai Zhai <zhikai.zhai(a)amd.com>
---
drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
index 45956ef6f3f9..131b8b82afc0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
@@ -65,7 +65,7 @@ static const struct dc_debug_options debug_defaults_drv = {
.timing_trace = false,
.clock_trace = true,
.disable_pplib_clock_request = true,
- .pipe_split_policy = MPC_SPLIT_DYNAMIC,
+ .pipe_split_policy = MPC_SPLIT_AVOID,
.force_single_disp_pipe_split = false,
.disable_dcc = DCC_ENABLE,
.vsr_support = true,
--
2.34.1
From: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
[Why & How]
Port of a change that went into DCN314 to keep the PHY enabled
when we have a connected and active DP display.
The PHY can hang if PHY refclk is disabled inadvertently.
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Josip Pavic <josip.pavic(a)amd.com>
Acked-by: Alan Liu <haoping.liu(a)amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
---
drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c
index 7ccd96959256..3db4ef564b99 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c
@@ -87,6 +87,11 @@ static int dcn31_get_active_display_cnt_wa(
stream->signal == SIGNAL_TYPE_DVI_SINGLE_LINK ||
stream->signal == SIGNAL_TYPE_DVI_DUAL_LINK)
tmds_present = true;
+
+ /* Checking stream / link detection ensuring that PHY is active*/
+ if (dc_is_dp_signal(stream->signal) && !stream->dpms_off)
+ display_count++;
+
}
for (i = 0; i < dc->link_count; i++) {
--
2.34.1
From: Daniel Miess <daniel.miess(a)amd.com>
[Why]
In dcn314 DML the destination pipe vtotal was being set
to the crtc adjustment vtotal_min value even in cases
where that value is 0.
[How]
Only set vtotal to the crtc adjustment vtotal_min value
in cases where the value is non-zero.
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Acked-by: Alan Liu <haoping.liu(a)amd.com>
Signed-off-by: Daniel Miess <daniel.miess(a)amd.com>
---
drivers/gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c b/drivers/gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c
index d9e049e7ff0a..ed8ddb75b333 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c
@@ -295,7 +295,11 @@ int dcn314_populate_dml_pipes_from_context_fpu(struct dc *dc, struct dc_state *c
pipe = &res_ctx->pipe_ctx[i];
timing = &pipe->stream->timing;
- pipes[pipe_cnt].pipe.dest.vtotal = pipe->stream->adjust.v_total_min;
+ if (pipe->stream->adjust.v_total_min != 0)
+ pipes[pipe_cnt].pipe.dest.vtotal = pipe->stream->adjust.v_total_min;
+ else
+ pipes[pipe_cnt].pipe.dest.vtotal = timing->v_total;
+
pipes[pipe_cnt].pipe.dest.vblank_nom = timing->v_total - pipes[pipe_cnt].pipe.dest.vactive;
pipes[pipe_cnt].pipe.dest.vblank_nom = min(pipes[pipe_cnt].pipe.dest.vblank_nom, dcn3_14_ip.VBlankNomDefaultUS);
pipes[pipe_cnt].pipe.dest.vblank_nom = max(pipes[pipe_cnt].pipe.dest.vblank_nom, timing->v_sync_width);
--
2.34.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 430635a0ef1ce958b7b4311f172694ece2c692b8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023050813-twiddling-olive-0fd5@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
430635a0ef1c ("perf intel-pt: Fix CYC timestamps after standalone CBR")
3f05516758be ("perf intel-pt: Accumulate cycle count from TSC/TMA/MTC packets")
f3c98c4b5ac8 ("perf intel-pt: Re-factor TIP cases in intel_pt_walk_to_ip")
7b4b4f83881e ("perf intel-pt: Accumulate cycle count from CYC packets")
948e9dc8bb26 ("perf intel-pt: Factor out intel_pt_update_sample_time")
1b6599a9d8e6 ("perf intel-pt: Fix sample timestamp wrt non-taken branches")
bea6385789b8 ("perf intel-pt: Implement decoder flags for trace begin / end")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 430635a0ef1ce958b7b4311f172694ece2c692b8 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 3 Apr 2023 18:48:31 +0300
Subject: [PATCH] perf intel-pt: Fix CYC timestamps after standalone CBR
After a standalone CBR (not associated with TSC), update the cycles
reference timestamp and reset the cycle count, so that CYC timestamps
are calculated relative to that point with the new frequency.
Fixes: cc33618619cefc6d ("perf tools: Add Intel PT support for decoding CYC packets")
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: Ian Rogers <irogers(a)google.com>
Cc: Jiri Olsa <jolsa(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230403154831.8651-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
diff --git a/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c b/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
index 0ac860c8dd2b..7145c5890de0 100644
--- a/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
+++ b/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
@@ -1998,6 +1998,8 @@ static void intel_pt_calc_cbr(struct intel_pt_decoder *decoder)
decoder->cbr = cbr;
decoder->cbr_cyc_to_tsc = decoder->max_non_turbo_ratio_fp / cbr;
+ decoder->cyc_ref_timestamp = decoder->timestamp;
+ decoder->cycle_cnt = 0;
intel_pt_mtc_cyc_cnt_cbr(decoder);
}
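For context, the decoder derives CYC-based timestamps from the cycle count
accumulated since a reference point; a condensed sketch of that arithmetic
(field names mirror the decoder's, but this is a simplification, not the
exact perf code):

  #include <stdint.h>

  /* Convert accumulated cycles to a TSC-based timestamp. */
  static uint64_t cyc_timestamp(uint64_t cyc_ref_timestamp, uint64_t cycle_cnt,
                                double cbr_cyc_to_tsc)
  {
          /*
           * cbr_cyc_to_tsc is max_non_turbo_ratio / cbr, so it changes with
           * every CBR packet. Resetting cycle_cnt and cyc_ref_timestamp at a
           * standalone CBR keeps this product consistent with the new ratio.
           */
          return cyc_ref_timestamp + (uint64_t)(cycle_cnt * cbr_cyc_to_tsc);
  }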
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 430635a0ef1ce958b7b4311f172694ece2c692b8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023050812-swaddling-stardust-e90d@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
430635a0ef1c ("perf intel-pt: Fix CYC timestamps after standalone CBR")
3f05516758be ("perf intel-pt: Accumulate cycle count from TSC/TMA/MTC packets")
f3c98c4b5ac8 ("perf intel-pt: Re-factor TIP cases in intel_pt_walk_to_ip")
7b4b4f83881e ("perf intel-pt: Accumulate cycle count from CYC packets")
948e9dc8bb26 ("perf intel-pt: Factor out intel_pt_update_sample_time")
1b6599a9d8e6 ("perf intel-pt: Fix sample timestamp wrt non-taken branches")
bea6385789b8 ("perf intel-pt: Implement decoder flags for trace begin / end")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 430635a0ef1ce958b7b4311f172694ece2c692b8 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 3 Apr 2023 18:48:31 +0300
Subject: [PATCH] perf intel-pt: Fix CYC timestamps after standalone CBR
After a standalone CBR (not associated with TSC), update the cycles
reference timestamp and reset the cycle count, so that CYC timestamps
are calculated relative to that point with the new frequency.
Fixes: cc33618619cefc6d ("perf tools: Add Intel PT support for decoding CYC packets")
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: Ian Rogers <irogers(a)google.com>
Cc: Jiri Olsa <jolsa(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230403154831.8651-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
diff --git a/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c b/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
index 0ac860c8dd2b..7145c5890de0 100644
--- a/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
+++ b/tools/perf/util/intel-pt-decoder/intel-pt-decoder.c
@@ -1998,6 +1998,8 @@ static void intel_pt_calc_cbr(struct intel_pt_decoder *decoder)
decoder->cbr = cbr;
decoder->cbr_cyc_to_tsc = decoder->max_non_turbo_ratio_fp / cbr;
+ decoder->cyc_ref_timestamp = decoder->timestamp;
+ decoder->cycle_cnt = 0;
intel_pt_mtc_cyc_cnt_cbr(decoder);
}
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit a24c1aab652ebacf9ea62470a166514174c96fe1 ]
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
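As background, KCSAN reports unmarked concurrent accesses as data races;
marking both sides with READ_ONCE()/WRITE_ONCE() documents the intentional
race and keeps the compiler from tearing or refetching the access. A generic
sketch of the pattern, not the tree.c code itself:

  #include <linux/compiler.h>
  #include <linux/types.h>

  static int shared_work;               /* may be written by any CPU */

  /* Writer side, e.g. a CPU waking the rcuc kthread. */
  static void set_work_pending(void)
  {
          WRITE_ONCE(shared_work, 1);
  }

  /* Reader side, e.g. the kthread checking for more work. */
  static bool work_pending(void)
  {
          return READ_ONCE(shared_work) != 0;
  }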
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index eec8e2f7537eb..b2c1ab260ed56 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2810,12 +2810,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
*statusp = RCU_KTHREAD_RUNNING;
local_irq_disable();
work = *workp;
- *workp = 0;
+ WRITE_ONCE(*workp, 0);
local_irq_enable();
if (work)
rcu_core();
local_bh_enable();
- if (*workp == 0) {
+ if (!READ_ONCE(*workp)) {
trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
*statusp = RCU_KTHREAD_WAITING;
return;
--
2.39.2
apply_vma_lock_flags() calls mlock_fixup(), which may merge the VMA
beyond where the vma iterator is located. Although the merge itself is
not an issue, the next iteration of the loop compares the start of the
vma against the locally saved 'tmp' variable and incorrectly reports a
failure. Fix the error by setting tmp to the end of the vma iterator
value before restarting the loop, as shown in the sketch below.
There is also the potential for the error code to be overwritten when
the loop terminates early. Fix the return issue by returning directly
when an error is encountered, since there is nothing to undo after the
loop.
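Pieced together from the hunks below, the fixed walk has this shape; it is
a condensed fragment, with the newflags computation elided into a
hypothetical calc_newflags() helper:

  for_each_vma_range(vmi, vma, end) {
          int error;
          vm_flags_t newflags = calc_newflags(vma, flags); /* hypothetical */

          if (vma->vm_start != tmp)       /* hole in the requested range */
                  return -ENOMEM;

          tmp = vma->vm_end;
          if (tmp > end)
                  tmp = end;
          error = mlock_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
          if (error)
                  return error;           /* nothing to undo after the loop */

          tmp = vma_iter_end(&vmi);       /* resync after a possible merge */
          nstart = tmp;
  }
  if (tmp < end)
          return -ENOMEM;
  return 0;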
Reported-by: Ryan Roberts <ryan.roberts(a)arm.com>
Link: https://lore.kernel.org/linux-mm/50341ca1-d582-b33a-e3d0-acb08a65166f@arm.c…
Cc: <stable(a)vger.kernel.org>
Fixes: 37598f5a9d8b ("mlock: convert mlock to vma iterator")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
mm/mlock.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index d7db94519884..0a0c996c5c21 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -477,7 +477,6 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
{
unsigned long nstart, end, tmp;
struct vm_area_struct *vma, *prev;
- int error;
VMA_ITERATOR(vmi, current->mm, start);
VM_BUG_ON(offset_in_page(start));
@@ -498,6 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
nstart = start;
tmp = vma->vm_start;
for_each_vma_range(vmi, vma, end) {
+ int error;
vm_flags_t newflags;
if (vma->vm_start != tmp)
@@ -511,14 +511,15 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
tmp = end;
error = mlock_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
if (error)
- break;
+ return error;
+ tmp = vma_iter_end(&vmi);
nstart = tmp;
}
- if (vma_iter_end(&vmi) < end)
+ if (tmp < end)
return -ENOMEM;
- return error;
+ return 0;
}
/*
--
2.39.2
In the screen rotation loop test below, run thousands of times with
virtual display enabled, a CPU hard lockup may occur, leaving the system
unresponsive until it crashes.
do {
xrandr --output Virtual --rotate inverted
xrandr --output Virtual --rotate right
xrandr --output Virtual --rotate left
xrandr --output Virtual --rotate normal
} while (1);
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
? hrtimer_run_softirq+0x140/0x140
? store_vblank+0xe0/0xe0 [drm]
hrtimer_cancel+0x15/0x30
amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
drm_vblank_disable_and_save+0x185/0x1f0 [drm]
drm_crtc_vblank_off+0x159/0x4c0 [drm]
? record_print_text.cold+0x11/0x11
? wait_for_completion_timeout+0x232/0x280
? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
? bit_wait_io_timeout+0xe0/0xe0
? wait_for_completion_interruptible+0x1d7/0x320
? mutex_unlock+0x81/0xd0
amdgpu_vkms_crtc_atomic_disable
It is caused by a lock dependency deadlock between two CPUs in the
following scenario:

CPU1                                      CPU2

drm_crtc_vblank_off                       hrtimer_interrupt
  grab event_lock (irq disabled)            __hrtimer_run_queues
  grab vbl_lock/vblank_time_block             amdgpu_vkms_vblank_simulate
  amdgpu_vkms_disable_vblank                    drm_handle_vblank
    hrtimer_cancel                                grab dev->event_lock
CPU1 is stuck in hrtimer_cancel because the timer callback keeps running
on the current clock base: the timer queue on CPU2 never gets a chance
to finish it, as it fails to take the lock. The NMI watchdog therefore
reports errors once its threshold is exceeded, and all later CPUs are
impacted/blocked.
Fix this by using hrtimer_try_to_cancel, as the disable_vblank callback
does not need to wait for the handler to finish. It is also unnecessary
to check the return value of hrtimer_try_to_cancel: even if it returns
-1, meaning the timer callback is currently running, the timer will be
reprogrammed by hrtimer_start when enable_vblank is called, so it keeps
working.
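For reference, the only behavioural difference between the two cancel APIs
is whether they wait for a running callback; a condensed sketch of the
pattern used here (simplified from the driver code below):

  #include <linux/hrtimer.h>

  static void vblank_timer_disable(struct hrtimer *timer)
  {
          /*
           * hrtimer_cancel() spins until the callback finishes, which can
           * deadlock when the callback is blocked on a lock we already hold.
           * hrtimer_try_to_cancel() returns immediately with 0 (not queued),
           * 1 (dequeued) or -1 (callback running); the -1 case is harmless
           * here because the callback returns HRTIMER_NORESTART once vblank
           * is disabled.
           */
          hrtimer_try_to_cancel(timer);
  }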
v2: only re-arm timer when vblank is enabled (Christian) and add a Fixes
tag as well
v3: drop warn printing (Christian)
v4: drop superfluous check of vblank->enabled in the timer function, as
it is guaranteed in drm_handle_vblank (Christian)
Fixes: 84ec374bd580 ("drm/amdgpu: create amdgpu_vkms (v4)")
Cc: stable(a)vger.kernel.org
Suggested-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Guchun Chen <guchun.chen(a)amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 53ff91fc6cf6..d0748bcfad16 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -55,8 +55,9 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
DRM_WARN("%s: vblank timer overrun\n", __func__);
ret = drm_crtc_handle_vblank(crtc);
+ /* Don't queue timer again when vblank is disabled. */
if (!ret)
- DRM_ERROR("amdgpu_vkms failure on handling vblank");
+ return HRTIMER_NORESTART;
return HRTIMER_RESTART;
}
@@ -81,7 +82,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
{
struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
- hrtimer_cancel(&amdgpu_crtc->vblank_timer);
+ hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
}
static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,
--
2.25.1
Hi Greg, Sasha,
[ This v2 appends patch #11, f838e0906dd3, as Florian suggested. ]
The following list shows the backported patches, I am using original
commit IDs for reference:
1) 0854db2aaef3 ("netfilter: nf_tables: use net_generic infra for transaction data")
2) 81ea01066741 ("netfilter: nf_tables: add rescheduling points during loop detection walks")
3) 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE")
4) 4bedf9eee016 ("netfilter: nf_tables: fix chain binding transaction logic")
5) 26b5a5712eb8 ("netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain")
6) 938154b93be8 ("netfilter: nf_tables: reject unbound anonymous set before commit phase")
7) 62e1e94b246e ("netfilter: nf_tables: reject unbound chain set before commit phase")
8) f8bb7889af58 ("netfilter: nftables: rename set element data activation/deactivation functions")
9) 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase")
10) 3e70489721b6 ("netfilter: nf_tables: unbind non-anonymous set if rule construction fails")
11) f838e0906dd3 ("netfilter: nf_tables: fix scheduling-while-atomic splat")
Note:
- Patch #1 is a backported dependency patch required by these fixes.
- Patch #2 needs a follow-up fix, included in this series as patch #11.
Please, apply,
Thanks.
Florian Westphal (3):
netfilter: nf_tables: use net_generic infra for transaction data
netfilter: nf_tables: add rescheduling points during loop detection walks
netfilter: nf_tables: fix scheduling-while-atomic splat
Pablo Neira Ayuso (8):
netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
netfilter: nf_tables: fix chain binding transaction logic
netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
netfilter: nf_tables: reject unbound anonymous set before commit phase
netfilter: nf_tables: reject unbound chain set before commit phase
netfilter: nftables: rename set element data activation/deactivation functions
netfilter: nf_tables: drop map element references from preparation phase
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
include/net/netfilter/nf_tables.h | 41 +-
include/net/netns/nftables.h | 7 -
net/netfilter/nf_tables_api.c | 692 +++++++++++++++++++++---------
net/netfilter/nf_tables_offload.c | 30 +-
net/netfilter/nft_chain_filter.c | 11 +-
net/netfilter/nft_dynset.c | 6 +-
net/netfilter/nft_immediate.c | 90 +++-
net/netfilter/nft_set_bitmap.c | 5 +-
net/netfilter/nft_set_hash.c | 23 +-
net/netfilter/nft_set_pipapo.c | 14 +-
net/netfilter/nft_set_rbtree.c | 5 +-
11 files changed, 678 insertions(+), 246 deletions(-)
--
2.30.2
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x fc80fc2d4e39137869da3150ee169b40bf879287
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023071117-anyplace-zipfile-8701@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fc80fc2d4e39137869da3150ee169b40bf879287 Mon Sep 17 00:00:00 2001
From: Ding Hui <dinghui(a)sangfor.com.cn>
Date: Mon, 15 May 2023 10:13:07 +0800
Subject: [PATCH] SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
After the listener svc_sock is freed, and before invoking svc_tcp_accept()
for the established child sock, there is a window in which the newsock
still retains the freed listener svc_sock in sk_user_data, cloned from
its parent. If data is received on the newsock in this race window, we
will observe a use-after-free report in svc_tcp_listen_data_ready().
Reproduce by two tasks:
1. while :; do rpc.nfsd 0 ; rpc.nfsd; done
2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done
KASAN report:
==================================================================
BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
Read of size 8 at addr ffff888139d96228 by task nc/102553
CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Call Trace:
<IRQ>
dump_stack_lvl+0x33/0x50
print_address_description.constprop.0+0x27/0x310
print_report+0x3e/0x70
kasan_report+0xae/0xe0
svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
tcp_data_queue+0x9f4/0x20e0
tcp_rcv_established+0x666/0x1f60
tcp_v4_do_rcv+0x51c/0x850
tcp_v4_rcv+0x23fc/0x2e80
ip_protocol_deliver_rcu+0x62/0x300
ip_local_deliver_finish+0x267/0x350
ip_local_deliver+0x18b/0x2d0
ip_rcv+0x2fb/0x370
__netif_receive_skb_one_core+0x166/0x1b0
process_backlog+0x24c/0x5e0
__napi_poll+0xa2/0x500
net_rx_action+0x854/0xc90
__do_softirq+0x1bb/0x5de
do_softirq+0xcb/0x100
</IRQ>
<TASK>
...
</TASK>
Allocated by task 102371:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x7b/0x90
svc_setup_socket+0x52/0x4f0 [sunrpc]
svc_addsock+0x20d/0x400 [sunrpc]
__write_ports_addfd+0x209/0x390 [nfsd]
write_ports+0x239/0x2c0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 102551:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0x106/0x190
__kmem_cache_free+0x133/0x270
svc_xprt_free+0x1e2/0x350 [sunrpc]
svc_xprt_destroy_all+0x25a/0x440 [sunrpc]
nfsd_put+0x125/0x240 [nfsd]
nfsd_svc+0x2cb/0x3c0 [nfsd]
write_threads+0x1ac/0x2a0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready()
if state != TCP_LISTEN, which avoids dereferencing svsk for all
child sockets.
Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/
Fixes: fa9251afc33c ("SUNRPC: Call the default socket callbacks instead of open coding")
Signed-off-by: Ding Hui <dinghui(a)sangfor.com.cn>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f77cebe2c071..15f4d0d40bdd 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -826,12 +826,6 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
trace_sk_data_ready(sk);
- if (svsk) {
- /* Refer to svc_setup_socket() for details. */
- rmb();
- svsk->sk_odata(sk);
- }
-
/*
* This callback may called twice when a new connection
* is established as a child socket inherits everything
@@ -840,13 +834,18 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
* when one of child sockets become ESTABLISHED.
* 2) data_ready method of the child socket may be called
* when it receives data before the socket is accepted.
- * In case of 2, we should ignore it silently.
+ * In case of 2, we should ignore it silently and DO NOT
+ * dereference svsk.
*/
- if (sk->sk_state == TCP_LISTEN) {
- if (svsk) {
- set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
- svc_xprt_enqueue(&svsk->sk_xprt);
- }
+ if (sk->sk_state != TCP_LISTEN)
+ return;
+
+ if (svsk) {
+ /* Refer to svc_setup_socket() for details. */
+ rmb();
+ svsk->sk_odata(sk);
+ set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
+ svc_xprt_enqueue(&svsk->sk_xprt);
}
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x fc80fc2d4e39137869da3150ee169b40bf879287
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023071119-underdone-monotype-f516@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fc80fc2d4e39137869da3150ee169b40bf879287 Mon Sep 17 00:00:00 2001
From: Ding Hui <dinghui(a)sangfor.com.cn>
Date: Mon, 15 May 2023 10:13:07 +0800
Subject: [PATCH] SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
After the listener svc_sock is freed, and before invoking svc_tcp_accept()
for the established child sock, there is a window in which the newsock
still retains the freed listener svc_sock in sk_user_data, cloned from
its parent. If data is received on the newsock in this race window, we
will observe a use-after-free report in svc_tcp_listen_data_ready().
Reproduce by two tasks:
1. while :; do rpc.nfsd 0 ; rpc.nfsd; done
2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done
KASAN report:
==================================================================
BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
Read of size 8 at addr ffff888139d96228 by task nc/102553
CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Call Trace:
<IRQ>
dump_stack_lvl+0x33/0x50
print_address_description.constprop.0+0x27/0x310
print_report+0x3e/0x70
kasan_report+0xae/0xe0
svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
tcp_data_queue+0x9f4/0x20e0
tcp_rcv_established+0x666/0x1f60
tcp_v4_do_rcv+0x51c/0x850
tcp_v4_rcv+0x23fc/0x2e80
ip_protocol_deliver_rcu+0x62/0x300
ip_local_deliver_finish+0x267/0x350
ip_local_deliver+0x18b/0x2d0
ip_rcv+0x2fb/0x370
__netif_receive_skb_one_core+0x166/0x1b0
process_backlog+0x24c/0x5e0
__napi_poll+0xa2/0x500
net_rx_action+0x854/0xc90
__do_softirq+0x1bb/0x5de
do_softirq+0xcb/0x100
</IRQ>
<TASK>
...
</TASK>
Allocated by task 102371:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x7b/0x90
svc_setup_socket+0x52/0x4f0 [sunrpc]
svc_addsock+0x20d/0x400 [sunrpc]
__write_ports_addfd+0x209/0x390 [nfsd]
write_ports+0x239/0x2c0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 102551:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0x106/0x190
__kmem_cache_free+0x133/0x270
svc_xprt_free+0x1e2/0x350 [sunrpc]
svc_xprt_destroy_all+0x25a/0x440 [sunrpc]
nfsd_put+0x125/0x240 [nfsd]
nfsd_svc+0x2cb/0x3c0 [nfsd]
write_threads+0x1ac/0x2a0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready()
if state != TCP_LISTEN, which avoids dereferencing svsk for all
child sockets.
Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/
Fixes: fa9251afc33c ("SUNRPC: Call the default socket callbacks instead of open coding")
Signed-off-by: Ding Hui <dinghui(a)sangfor.com.cn>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f77cebe2c071..15f4d0d40bdd 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -826,12 +826,6 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
trace_sk_data_ready(sk);
- if (svsk) {
- /* Refer to svc_setup_socket() for details. */
- rmb();
- svsk->sk_odata(sk);
- }
-
/*
* This callback may called twice when a new connection
* is established as a child socket inherits everything
@@ -840,13 +834,18 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
* when one of child sockets become ESTABLISHED.
* 2) data_ready method of the child socket may be called
* when it receives data before the socket is accepted.
- * In case of 2, we should ignore it silently.
+ * In case of 2, we should ignore it silently and DO NOT
+ * dereference svsk.
*/
- if (sk->sk_state == TCP_LISTEN) {
- if (svsk) {
- set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
- svc_xprt_enqueue(&svsk->sk_xprt);
- }
+ if (sk->sk_state != TCP_LISTEN)
+ return;
+
+ if (svsk) {
+ /* Refer to svc_setup_socket() for details. */
+ rmb();
+ svsk->sk_odata(sk);
+ set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
+ svc_xprt_enqueue(&svsk->sk_xprt);
}
}
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The
hugetlb page destructor must be changed before pages are freed to lower
level allocators, and the destructor can only be changed under the
hugetlb lock. This means there is potentially one lock cycle per page.
Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating the necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to the low level allocators
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1b67bf341c32..e751fced870a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1856,11 +1856,44 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
struct page *page, *t_page;
struct folio *folio;
+ bool clear_dtor = false;
+ /*
+	 * First allocate required vmemmap for all pages on list. If vmemmap
+	 * can not be allocated, we can not free the page to the lower level
+	 * allocator, so add it back as a hugetlb surplus page.
+ */
+ list_for_each_entry_safe(page, t_page, list, lru) {
+ if (HPageVmemmapOptimized(page)) {
+ clear_dtor = true;
+ if (hugetlb_vmemmap_restore(h, page)) {
+ spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, page_folio(page), true);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ cond_resched();
+ }
+ }
+
+ /*
+	 * If vmemmap allocation was performed above, then take the lock to
+	 * clear the destructor of all pages on the list.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(page, list, lru)
+ __clear_hugetlb_destructor(h, page_folio(page));
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
+ /*
+ * Free pages back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
list_for_each_entry_safe(page, t_page, list, lru) {
folio = page_folio(page);
update_and_free_hugetlb_folio(h, folio, false);
- cond_resched();
}
}
--
2.41.0
From: Fabio Estevam <festevam(a)denx.de>
This reverts commit ff87d619ac180444db297f043962a5c325ded47b.
Andreas reports that on an i.MX8MP-based system where MCLK needs to be
used as an input, the MCLK pin is actually an output, despite not having
the 'fsl,sai-mclk-direction-output' property present in the devicetree.
This is caused by commit ff87d619ac18 ("ASoC: fsl_sai: Enable
MCTL_MCLK_EN bit for master mode") that sets FSL_SAI_MCTL_MCLK_EN
unconditionally for imx8mm/8mn/8mp/93, causing the MCLK to always
be configured as output.
FSL_SAI_MCTL_MCLK_EN corresponds to the MOE (MCLK Output Enable) bit
of register MCR and the driver sets it when the
'fsl,sai-mclk-direction-output' devicetree property is present.
Revert the commit to allow SAI to use MCLK as input as well.
Cc: stable(a)vger.kernel.org
Fixes: ff87d619ac18 ("ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode")
Reported-by: Andreas Henriksson <andreas(a)fatal.se>
Signed-off-by: Fabio Estevam <festevam(a)denx.de>
---
sound/soc/fsl/fsl_sai.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
index 5e09f634c61b..54b4bf3744c6 100644
--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -507,12 +507,6 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
savediv / 2 - 1);
}
- if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
- /* SAI is in master mode at this point, so enable MCLK */
- regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
- FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);
- }
-
return 0;
}
--
2.34.1
On Sun, 2023-07-09 at 02:12 -0400, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> mmc: core: Support zeroout using TRIM for eMMC
>
> to the 5.15-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> mmc-core-support-zeroout-using-trim-for-emmc.patch
> and it can be found in the queue-5.15 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit e71c90bcc65727d323b9a3cb40b263db7be5ee4c
> Author: Vincent Whitchurch <vincent.whitchurch(a)axis.com>
> Date: Fri Apr 29 17:21:18 2022 +0200
>
> mmc: core: Support zeroout using TRIM for eMMC
>
>
> [ Upstream commit f7b6fc327327698924ef3afa0c3e87a5b7466af3 ]
>
>
> If an eMMC card supports TRIM and indicates that it erases to zeros, we can
> use it to support hardware offloading of REQ_OP_WRITE_ZEROES, so let's add
> support for this.
>
>
> Signed-off-by: Vincent Whitchurch <vincent.whitchurch(a)axis.com>
> Reviewed-by: Avri Altman <Avri.Altman(a)wdc.com>
> Link: https://lore.kernel.org/r/20220429152118.3617303-1-vincent.whitchurch@axis.…
> Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
> Stable-dep-of: c467c8f08185 ("mmc: Add MMC_QUIRK_BROKEN_SD_CACHE for Kingston Canvas Go Plus from 11/2019")
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
I still[0] prefer that this not be backported to stable.
[0] https://lore.kernel.org/all/Y1owQYlr+vfxEmS8@axis.com/
The WinSystems WS16C48 I/O address region spans offsets 0x0 through 0xA,
which is a total of 11 bytes. Fix the WS16C48_EXTENT define to the
correct value of 11 so that access to necessary device registers is
properly requested in the ws16c48_probe() callback by the
devm_request_region() function call.
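For clarity, the arithmetic is inclusive: offsets 0x0 through 0xA span
0xA - 0x0 + 1 = 11 bytes, so the old extent of 10 left the last register
(offset 0xA) outside the requested region.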
Fixes: 2c05a0f29f41 ("gpio: ws16c48: Implement and utilize register structures")
Cc: stable(a)vger.kernel.org
Cc: Paul Demetrotion <pdemetrotion(a)winsystems.com>
Signed-off-by: William Breathitt Gray <william.gray(a)linaro.org>
---
drivers/gpio/gpio-ws16c48.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpio/gpio-ws16c48.c b/drivers/gpio/gpio-ws16c48.c
index e73885a4dc32..afb42a8e916f 100644
--- a/drivers/gpio/gpio-ws16c48.c
+++ b/drivers/gpio/gpio-ws16c48.c
@@ -18,7 +18,7 @@
#include <linux/spinlock.h>
#include <linux/types.h>
-#define WS16C48_EXTENT 10
+#define WS16C48_EXTENT 11
#define MAX_NUM_WS16C48 max_num_isa_dev(WS16C48_EXTENT)
static unsigned int base[MAX_NUM_WS16C48];
base-commit: 4827aae061337251bb91801b316157a78b845ec7
--
2.39.2
From: Johannes Berg <johannes.berg(a)intel.com>
[ Upstream commit 34d4e3eb67fed9c19719bedb748e5a8b6ccc97a5 ]
Since links are only controlled by userspace via cfg80211
in AP mode, also only remove them from the driver in that
case.
Signed-off-by: Johannes Berg <johannes.berg(a)intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman(a)intel.com>
Link: https://lore.kernel.org/r/20230608163202.ed65b94916fa.I2458c46888284cc5ce30…
Signed-off-by: Johannes Berg <johannes.berg(a)intel.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
net/wireless/util.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/wireless/util.c b/net/wireless/util.c
index 39680e7bad45a..f433f3fdd9e94 100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -5,7 +5,7 @@
* Copyright 2007-2009 Johannes Berg <johannes(a)sipsolutions.net>
* Copyright 2013-2014 Intel Mobile Communications GmbH
* Copyright 2017 Intel Deutschland GmbH
- * Copyright (C) 2018-2022 Intel Corporation
+ * Copyright (C) 2018-2023 Intel Corporation
*/
#include <linux/export.h>
#include <linux/bitops.h>
@@ -2479,6 +2479,13 @@ void cfg80211_remove_links(struct wireless_dev *wdev)
{
unsigned int link_id;
+ /*
+ * links are controlled by upper layers (userspace/cfg)
+ * only for AP mode, so only remove them here for AP
+ */
+ if (wdev->iftype != NL80211_IFTYPE_AP)
+ return;
+
wdev_lock(wdev);
if (wdev->valid_links) {
for_each_valid_link(wdev, link_id)
--
2.39.2
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 39020d8abc7ec62c4de9b260e3d10d4a1c2478ce
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023071151-caddy-carded-8f7a@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39020d8abc7ec62c4de9b260e3d10d4a1c2478ce Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 8 Jun 2023 11:27:41 +0100
Subject: [PATCH] btrfs: do not BUG_ON() on tree mod log failure at
balance_level()
At balance_level(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release
all resources in this context, and we have to abort because other future
tree searches that use the tree mod log (btrfs_search_old_slot()) may get
inconsistent results if other operations modify the tree after that
failure and before the tree mod log based search.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d6c29564ce49..d60b28c6bd1b 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1054,7 +1054,12 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
}
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_tree_unlock(child);
+ free_extent_buffer(child);
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
rcu_assign_pointer(root->node, child);
add_root_to_dirty_list(root);
@@ -1142,7 +1147,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
btrfs_node_key(right, &right_key, 0);
ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
BTRFS_MOD_LOG_KEY_REPLACE);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
btrfs_set_node_key(parent, &right_key, pslot + 1);
btrfs_mark_buffer_dirty(parent);
}
@@ -1188,7 +1196,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
btrfs_node_key(mid, &mid_key, 0);
ret = btrfs_tree_mod_log_insert_key(parent, pslot,
BTRFS_MOD_LOG_KEY_REPLACE);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
btrfs_set_node_key(parent, &mid_key, pslot);
btrfs_mark_buffer_dirty(parent);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 39020d8abc7ec62c4de9b260e3d10d4a1c2478ce
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023071152-quaintly-ventricle-7673@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 39020d8abc7ec62c4de9b260e3d10d4a1c2478ce Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 8 Jun 2023 11:27:41 +0100
Subject: [PATCH] btrfs: do not BUG_ON() on tree mod log failure at
balance_level()
At balance_level(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release
all resources in this context, and we have to abort because other future
tree searches that use the tree mod log (btrfs_search_old_slot()) may get
inconsistent results if other operations modify the tree after that
failure and before the tree mod log based search.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d6c29564ce49..d60b28c6bd1b 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1054,7 +1054,12 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
}
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_tree_unlock(child);
+ free_extent_buffer(child);
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
rcu_assign_pointer(root->node, child);
add_root_to_dirty_list(root);
@@ -1142,7 +1147,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
btrfs_node_key(right, &right_key, 0);
ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
BTRFS_MOD_LOG_KEY_REPLACE);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
btrfs_set_node_key(parent, &right_key, pslot + 1);
btrfs_mark_buffer_dirty(parent);
}
@@ -1188,7 +1196,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
btrfs_node_key(mid, &mid_key, 0);
ret = btrfs_tree_mod_log_insert_key(parent, pslot,
BTRFS_MOD_LOG_KEY_REPLACE);
- BUG_ON(ret < 0);
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ goto enospc;
+ }
btrfs_set_node_key(parent, &mid_key, pslot);
btrfs_mark_buffer_dirty(parent);
}