This is the start of the stable review cycle for the 4.14.195 release. There are 50 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Wed, 26 Aug 2020 08:23:34 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.195-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 4.14.195-rc1
Stephen Boyd sboyd@kernel.org clk: Evict unregistered clks from parent caches
Juergen Gross jgross@suse.com xen: don't reschedule in preemption off sections
Peter Xu peterx@redhat.com mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
Al Viro viro@zeniv.linux.org.uk do_epoll_ctl(): clean the failure exits up a bit
Marc Zyngier maz@kernel.org epoll: Keep a reference on files added to the check list
Vasant Hegde hegdevasant@linux.vnet.ibm.com powerpc/pseries: Do not initiate shutdown when system is running on UPS
Tom Rix trix@redhat.com net: dsa: b53: check for timeout
Haiyang Zhang haiyangz@microsoft.com hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
Jiri Wiesner jwiesner@suse.com bonding: fix active-backup failover for current ARP slave
Alex Williamson alex.williamson@redhat.com vfio/type1: Add proper error unwind for vfio_iommu_replay()
Dinghao Liu dinghao.liu@zju.edu.cn ASoC: intel: Fix memleak in sst_media_open
Srinivas Kandagatla srinivas.kandagatla@linaro.org ASoC: msm8916-wcd-analog: fix register Interrupt offset
Cong Wang xiyou.wangcong@gmail.com bonding: fix a potential double-unregister
Jarod Wilson jarod@redhat.com bonding: show saner speed for broadcast mode
Fugang Duan fugang.duan@nxp.com net: fec: correct the error path for regulator disable in probe
Grzegorz Szczurek grzegorzx.szczurek@intel.com i40e: Fix crash during removing i40e driver
Przemyslaw Patynowski przemyslawx.patynowski@intel.com i40e: Set RX_ONLY mode for unicast promiscuous on VLAN
Eric Sandeen sandeen@redhat.com ext4: fix potential negative array index in do_split()
Luc Van Oostenryck luc.vanoostenryck@gmail.com alpha: fix annotation of io{read,write}{16,32}be()
Eiichi Tsukata devel@etsukata.com xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
Mao Wenan wenan.mao@linux.alibaba.com virtio_ring: Avoid loop when vq is broken in virtqueue_poll
Javed Hasan jhasan@marvell.com scsi: libfc: Free skb in fc_disc_gpn_id_resp() for valid cases
Srinivas Pandruvada srinivas.pandruvada@linux.intel.com cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0
Zhe Li lizhe67@huawei.com jffs2: fix UAF problem
Darrick J. Wong darrick.wong@oracle.com xfs: fix inode quota reservation checks
Greg Ungerer gerg@linux-m68k.org m68knommu: fix overwriting of bits in ColdFire V3 cache control
Xiongfeng Wang wangxiongfeng2@huawei.com Input: psmouse - add a newline when printing 'proto' by sysfs
Evgeny Novikov novikov@ispras.ru media: vpss: clean up resources in init
Huacai Chen chenhc@lemote.com rtc: goldfish: Enable interrupt in set_alarm() when necessary
Chuhong Yuan hslester96@gmail.com media: budget-core: Improve exception handling in budget_register()
Stanley Chu stanley.chu@mediatek.com scsi: ufs: Add DELAY_BEFORE_LPM quirk for Micron devices
Lukas Wunner lukas@wunner.de spi: Prevent adding devices below an unregistering controller
Yang Shi shy828301@gmail.com mm/memory.c: skip spurious TLB flush for retried page fault
zhangyi (F) yi.zhang@huawei.com jbd2: add the missing unlock_buffer() in the error path of jbd2_write_superblock()
Jan Kara jack@suse.cz ext4: fix checking of directory entry validity for inline directories
Charan Teja Reddy charante@codeaurora.org mm, page_alloc: fix core hung in free_pcppages_bulk()
Doug Berger opendmb@gmail.com mm: include CMA pages in lowmem_reserve at boot
Wei Yongjun weiyongjun1@huawei.com kernel/relay.c: fix memleak on destroy relay channel
Jann Horn jannh@google.com romfs: fix uninitialized memory leak in romfs_dev_read()
Josef Bacik josef@toxicpanda.com btrfs: sysfs: use NOFS for device creation
Qu Wenruo wqu@suse.com btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
Nikolay Borisov nborisov@suse.com btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
Josef Bacik josef@toxicpanda.com btrfs: don't show full path of bind mounts in subvol=
Marcos Paulo de Souza mpdesouza@suse.com btrfs: export helpers for subvolume name/id resolution
Michael Ellerman mpe@ellerman.id.au powerpc: Allow 4224 bytes of stack expansion for the signal frame
Christophe Leroy christophe.leroy@c-s.fr powerpc/mm: Only read faulting instruction when necessary in do_page_fault()
Hugh Dickins hughd@google.com khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
Hugh Dickins hughd@google.com khugepaged: khugepaged_test_exit() check mmget_still_valid()
Masami Hiramatsu mhiramat@kernel.org perf probe: Fix memory leakage when the probe point is not found
Chris Wilson chris@chris-wilson.co.uk drm/vgem: Replace opencoded version of drm_gem_dumb_map_offset()
-------------
Diffstat:
Makefile | 4 +- arch/alpha/include/asm/io.h | 8 +-- arch/m68k/include/asm/m53xxacr.h | 6 +- arch/powerpc/mm/fault.c | 55 ++++++++++++------ arch/powerpc/platforms/pseries/ras.c | 1 - drivers/clk/clk.c | 52 +++++++++++++---- drivers/cpufreq/intel_pstate.c | 1 + drivers/gpu/drm/vgem/vgem_drv.c | 27 --------- drivers/input/mouse/psmouse-base.c | 2 +- drivers/media/pci/ttpci/budget-core.c | 11 +++- drivers/media/platform/davinci/vpss.c | 20 +++++-- drivers/net/bonding/bond_main.c | 42 ++++++++++++-- drivers/net/dsa/b53/b53_common.c | 2 + drivers/net/ethernet/freescale/fec_main.c | 4 +- drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 2 +- drivers/net/ethernet/intel/i40e/i40e_common.c | 35 ++++++++--- drivers/net/ethernet/intel/i40e/i40e_main.c | 3 + drivers/net/hyperv/netvsc_drv.c | 2 +- drivers/rtc/rtc-goldfish.c | 1 + drivers/scsi/libfc/fc_disc.c | 12 +++- drivers/scsi/ufs/ufs_quirks.h | 1 + drivers/scsi/ufs/ufshcd.c | 2 + drivers/spi/Kconfig | 3 + drivers/spi/spi.c | 21 ++++++- drivers/vfio/vfio_iommu_type1.c | 71 +++++++++++++++++++++-- drivers/virtio/virtio_ring.c | 3 + drivers/xen/preempt.c | 2 +- fs/btrfs/ctree.h | 2 + fs/btrfs/export.c | 8 +-- fs/btrfs/export.h | 5 ++ fs/btrfs/inode.c | 23 +++++--- fs/btrfs/super.c | 18 ++++-- fs/btrfs/sysfs.c | 4 ++ fs/eventpoll.c | 19 +++--- fs/ext4/namei.c | 22 +++++-- fs/jbd2/journal.c | 4 +- fs/jffs2/dir.c | 6 +- fs/romfs/storage.c | 4 +- fs/xfs/xfs_sysfs.h | 6 +- fs/xfs/xfs_trans_dquot.c | 2 +- kernel/relay.c | 1 + mm/hugetlb.c | 24 ++++---- mm/khugepaged.c | 7 +-- mm/memory.c | 3 + mm/page_alloc.c | 7 ++- sound/soc/codecs/msm8916-wcd-analog.c | 4 +- sound/soc/intel/atom/sst-mfld-platform-pcm.c | 5 +- tools/perf/util/probe-finder.c | 2 +- 48 files changed, 403 insertions(+), 166 deletions(-)
From: Chris Wilson chris@chris-wilson.co.uk
[ Upstream commit 119c53d2d4044c59c450c4f5a568d80b9d861856 ]
drm_gem_dumb_map_offset() now exists and does everything vgem_gem_dump_map does and *ought* to do.
In particular, vgem_gem_dumb_map() was trying to reject mmapping an imported dmabuf by checking the existence of obj->filp. Unfortunately, we always allocated an obj->filp, even if unused for an imported dmabuf. Instead, the drm_gem_dumb_map_offset(), since commit 90378e589192 ("drm/gem: drm_gem_dumb_map_offset(): reject dma-buf"), uses the obj->import_attach to reject such invalid mmaps.
This prevents vgem from allowing userspace mmapping the dumb handle and attempting to incorrectly fault in remote pages belonging to another device, where there may not even be a struct page.
v2: Use the default drm_gem_dumb_map_offset() callback
Fixes: af33a9190d02 ("drm/vgem: Enable dmabuf import interfaces") Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch Cc: stable@vger.kernel.org # v4.13+ Link: https://patchwork.freedesktop.org/patch/msgid/20200708154911.21236-1-chris@c... Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/vgem/vgem_drv.c | 27 --------------------------- 1 file changed, 27 deletions(-)
diff --git a/drivers/gpu/drm/vgem/vgem_drv.c b/drivers/gpu/drm/vgem/vgem_drv.c index aa592277d5108..67037eb9a80ee 100644 --- a/drivers/gpu/drm/vgem/vgem_drv.c +++ b/drivers/gpu/drm/vgem/vgem_drv.c @@ -220,32 +220,6 @@ static int vgem_gem_dumb_create(struct drm_file *file, struct drm_device *dev, return 0; }
-static int vgem_gem_dumb_map(struct drm_file *file, struct drm_device *dev, - uint32_t handle, uint64_t *offset) -{ - struct drm_gem_object *obj; - int ret; - - obj = drm_gem_object_lookup(file, handle); - if (!obj) - return -ENOENT; - - if (!obj->filp) { - ret = -EINVAL; - goto unref; - } - - ret = drm_gem_create_mmap_offset(obj); - if (ret) - goto unref; - - *offset = drm_vma_node_offset_addr(&obj->vma_node); -unref: - drm_gem_object_put_unlocked(obj); - - return ret; -} - static struct drm_ioctl_desc vgem_ioctls[] = { DRM_IOCTL_DEF_DRV(VGEM_FENCE_ATTACH, vgem_fence_attach_ioctl, DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(VGEM_FENCE_SIGNAL, vgem_fence_signal_ioctl, DRM_AUTH|DRM_RENDER_ALLOW), @@ -439,7 +413,6 @@ static struct drm_driver vgem_driver = { .fops = &vgem_driver_fops,
.dumb_create = vgem_gem_dumb_create, - .dumb_map_offset = vgem_gem_dumb_map,
.prime_handle_to_fd = drm_gem_prime_handle_to_fd, .prime_fd_to_handle = drm_gem_prime_fd_to_handle,
From: Masami Hiramatsu mhiramat@kernel.org
[ Upstream commit 12d572e785b15bc764e956caaa8a4c846fd15694 ]
Fix the memory leakage in debuginfo__find_trace_events() when the probe point is not found in the debuginfo. If there is no probe point found in the debuginfo, debuginfo__find_probes() will NOT return -ENOENT, but 0.
Thus the caller of debuginfo__find_probes() must check the tf.ntevs and release the allocated memory for the array of struct probe_trace_event.
The current code releases the memory only if the debuginfo__find_probes() hits an error but not checks tf.ntevs. In the result, the memory allocated on *tevs are not released if tf.ntevs == 0.
This fixes the memory leakage by checking tf.ntevs == 0 in addition to ret < 0.
Fixes: ff741783506c ("perf probe: Introduce debuginfo to encapsulate dwarf information") Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Reviewed-by: Srikar Dronamraju srikar@linux.vnet.ibm.com Cc: Andi Kleen ak@linux.intel.com Cc: Oleg Nesterov oleg@redhat.com Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/159438668346.62703.10887420400718492503.stgit@de... Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- tools/perf/util/probe-finder.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c index 8f7f9d05f38c0..bfa6d9d215569 100644 --- a/tools/perf/util/probe-finder.c +++ b/tools/perf/util/probe-finder.c @@ -1354,7 +1354,7 @@ int debuginfo__find_trace_events(struct debuginfo *dbg, tf.ntevs = 0;
ret = debuginfo__find_probes(dbg, &tf.pf); - if (ret < 0) { + if (ret < 0 || tf.ntevs == 0) { for (i = 0; i < tf.ntevs; i++) clear_probe_trace_event(&tf.tevs[i]); zfree(tevs);
From: Hugh Dickins hughd@google.com
[ Upstream commit bbe98f9cadff58cdd6a4acaeba0efa8565dabe65 ]
Move collapse_huge_page()'s mmget_still_valid() check into khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP only, and earned its mmget_still_valid() check because it inserts a huge pmd entry in place of the page table's pmd entry; whereas collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp() merely clears the page table's pmd entry. But core dumping without mmap lock must have been as open to mistaking a racily cleared pmd entry for a page table at physical page 0, as exit_mmap() was. And we certainly have no interest in mapping as a THP once dumping core.
Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Andrea Arcangeli aarcange@redhat.com Cc: Song Liu songliubraving@fb.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: stable@vger.kernel.org [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- mm/khugepaged.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 04b4c38d0c184..a1b7475c05d04 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -394,7 +394,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
static inline int khugepaged_test_exit(struct mm_struct *mm) { - return atomic_read(&mm->mm_users) == 0; + return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm); }
int __khugepaged_enter(struct mm_struct *mm) @@ -1006,9 +1006,6 @@ static void collapse_huge_page(struct mm_struct *mm, * handled by the anon_vma lock + PG_lock. */ down_write(&mm->mmap_sem); - result = SCAN_ANY_PROCESS; - if (!mmget_still_valid(mm)) - goto out; result = hugepage_vma_revalidate(mm, address, &vma); if (result) goto out;
From: Hugh Dickins hughd@google.com
[ Upstream commit f3f99d63a8156c7a4a6b20aac22b53c5579c7dc1 ]
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in __khugepaged_enter(): yes, when one thread is about to dump core, has set core_state, and is waiting for others, another might do something calling __khugepaged_enter(), which now crashes because I lumped the core_state test (known as "mmget_still_valid") into khugepaged_test_exit(). I still think it's best to lump them together, so just in this exceptional case, check mm->mm_users directly instead of khugepaged_test_exit().
Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()") Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Yang Shi shy828301@gmail.com Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Song Liu songliubraving@fb.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Eric Dumazet edumazet@google.com Cc: stable@vger.kernel.org [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- mm/khugepaged.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a1b7475c05d04..9dfe364d4c0d1 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -407,7 +407,7 @@ int __khugepaged_enter(struct mm_struct *mm) return -ENOMEM;
/* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm); + VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm); if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { free_mm_slot(mm_slot); return 0;
From: Christophe Leroy christophe.leroy@c-s.fr
[ Upstream commit 0e36b0d12501e278686634712975b785bae11641 ]
Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every userspace instruction miss") has shown that limiting the read of faulting instruction to likely cases improves performance.
This patch goes further into this direction by limiting the read of the faulting instruction to the only cases where it is likely needed.
On an MPC885, with the same benchmark app as in the commit referred above, we see a reduction of about 3900 dTLB misses (approx 3%):
Before the patch: Performance counter stats for './fault 500' (10 runs):
683033312 cpu-cycles ( +- 0.03% ) 134538 dTLB-load-misses ( +- 0.03% ) 46099 iTLB-load-misses ( +- 0.02% ) 19681 faults ( +- 0.02% )
5.389747878 seconds time elapsed ( +- 0.06% )
With the patch:
Performance counter stats for './fault 500' (10 runs):
682112862 cpu-cycles ( +- 0.03% ) 130619 dTLB-load-misses ( +- 0.03% ) 46073 iTLB-load-misses ( +- 0.05% ) 19681 faults ( +- 0.01% )
5.381342641 seconds time elapsed ( +- 0.07% )
The proper work of the huge stack expansion was tested with the following app:
int main(int argc, char **argv) { char buf[1024 * 1025];
sprintf(buf, "Hello world !\n"); printf(buf);
exit(0); }
Signed-off-by: Christophe Leroy christophe.leroy@c-s.fr Reviewed-by: Nicholas Piggin npiggin@gmail.com [mpe: Add include of pagemap.h to fix build errors] Signed-off-by: Michael Ellerman mpe@ellerman.id.au Signed-off-by: Sasha Levin sashal@kernel.org --- arch/powerpc/mm/fault.c | 50 ++++++++++++++++++++++++++++------------- 1 file changed, 34 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5fc8a010fdf07..998c77e600a43 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -22,6 +22,7 @@ #include <linux/errno.h> #include <linux/string.h> #include <linux/types.h> +#include <linux/pagemap.h> #include <linux/ptrace.h> #include <linux/mman.h> #include <linux/mm.h> @@ -66,15 +67,11 @@ static inline bool notify_page_fault(struct pt_regs *regs) }
/* - * Check whether the instruction at regs->nip is a store using + * Check whether the instruction inst is a store using * an update addressing form which will update r1. */ -static bool store_updates_sp(struct pt_regs *regs) +static bool store_updates_sp(unsigned int inst) { - unsigned int inst; - - if (get_user(inst, (unsigned int __user *)regs->nip)) - return false; /* check for 1 in the rA field */ if (((inst >> 16) & 0x1f) != 1) return false; @@ -228,8 +225,8 @@ static bool bad_kernel_fault(bool is_exec, unsigned long error_code, }
static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, - struct vm_area_struct *vma, - bool store_update_sp) + struct vm_area_struct *vma, unsigned int flags, + bool *must_retry) { /* * N.B. The POWER/Open ABI allows programs to access up to @@ -241,6 +238,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, * expand to 1MB without further checks. */ if (address + 0x100000 < vma->vm_end) { + unsigned int __user *nip = (unsigned int __user *)regs->nip; /* get user regs even if this fault is in kernel mode */ struct pt_regs *uregs = current->thread.regs; if (uregs == NULL) @@ -258,8 +256,22 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, * between the last mapped region and the stack will * expand the stack rather than segfaulting. */ - if (address + 2048 < uregs->gpr[1] && !store_update_sp) - return true; + if (address + 2048 >= uregs->gpr[1]) + return false; + + if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) && + access_ok(VERIFY_READ, nip, sizeof(*nip))) { + unsigned int inst; + int res; + + pagefault_disable(); + res = __get_user_inatomic(inst, nip); + pagefault_enable(); + if (!res) + return !store_updates_sp(inst); + *must_retry = true; + } + return true; } return false; } @@ -392,7 +404,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, int is_user = user_mode(regs); int is_write = page_fault_is_write(error_code); int fault, major = 0; - bool store_update_sp = false; + bool must_retry = false;
if (notify_page_fault(regs)) return 0; @@ -439,9 +451,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, * can result in fault, which will cause a deadlock when called with * mmap_sem held */ - if (is_write && is_user) - store_update_sp = store_updates_sp(regs); - if (is_user) flags |= FAULT_FLAG_USER; if (is_write) @@ -488,8 +497,17 @@ retry: return bad_area(regs, address);
/* The stack is being expanded, check if it's valid */ - if (unlikely(bad_stack_expansion(regs, address, vma, store_update_sp))) - return bad_area(regs, address); + if (unlikely(bad_stack_expansion(regs, address, vma, flags, + &must_retry))) { + if (!must_retry) + return bad_area(regs, address); + + up_read(&mm->mmap_sem); + if (fault_in_pages_readable((const char __user *)regs->nip, + sizeof(unsigned int))) + return bad_area_nosemaphore(regs, address); + goto retry; + }
/* Try to expand it */ if (unlikely(expand_stack(vma, address)))
From: Michael Ellerman mpe@ellerman.id.au
[ Upstream commit 63dee5df43a31f3844efabc58972f0a206ca4534 ]
We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA.
The code was originally added in 2004 "ported from 2.4". The rough logic is that the stack is allowed to grow to 1MB with no extra checking. Over 1MB the access must be within 2048 bytes of the stack pointer, or be from a user instruction that updates the stack pointer.
The 2048 byte allowance below the stack pointer is there to cover the 288 byte "red zone" as well as the "about 1.5kB" needed by the signal delivery code.
Unfortunately since then the signal frame has expanded, and is now 4224 bytes on 64-bit kernels with transactional memory enabled. This means if a process has consumed more than 1MB of stack, and its stack pointer lies less than 4224 bytes from the next page boundary, signal delivery will fault when trying to expand the stack and the process will see a SEGV.
The total size of the signal frame is the size of struct rt_sigframe (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on 64-bit).
The 2048 byte allowance was correct until 2008 as the signal frame was:
struct rt_sigframe { struct ucontext uc; /* 0 1440 */ /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */ long unsigned int _unused[2]; /* 1440 16 */ unsigned int tramp[6]; /* 1456 24 */ struct siginfo * pinfo; /* 1480 8 */ void * puc; /* 1488 8 */ struct siginfo info; /* 1496 128 */ /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */ char abigap[288]; /* 1624 288 */
/* size: 1920, cachelines: 15, members: 7 */ /* padding: 8 */ };
1920 + 128 = 2048
Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") (Jul 2008) the signal frame expanded to 2304 bytes:
struct rt_sigframe { struct ucontext uc; /* 0 1696 */ <-- /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */ long unsigned int _unused[2]; /* 1696 16 */ unsigned int tramp[6]; /* 1712 24 */ struct siginfo * pinfo; /* 1736 8 */ void * puc; /* 1744 8 */ struct siginfo info; /* 1752 128 */ /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */ char abigap[288]; /* 1880 288 */
/* size: 2176, cachelines: 17, members: 7 */ /* padding: 8 */ };
2176 + 128 = 2304
At this point we should have been exposed to the bug, though as far as I know it was never reported. I no longer have a system old enough to easily test on.
Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a grow-down stack segment") caused our stack expansion code to never trigger, as there was always a VMA found for a write up to PAGE_SIZE below r1.
That meant the bug was hidden as we continued to expand the signal frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") (Feb 2013):
struct rt_sigframe { struct ucontext uc; /* 0 1696 */ /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */ struct ucontext uc_transact; /* 1696 1696 */ <-- /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */ long unsigned int _unused[2]; /* 3392 16 */ unsigned int tramp[6]; /* 3408 24 */ struct siginfo * pinfo; /* 3432 8 */ void * puc; /* 3440 8 */ struct siginfo info; /* 3448 128 */ /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */ char abigap[288]; /* 3576 288 */
/* size: 3872, cachelines: 31, members: 8 */ /* padding: 8 */ /* last cacheline: 32 bytes */ };
3872 + 128 = 4000
And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit userspace to 512 bytes") (Feb 2014):
struct rt_sigframe { struct ucontext uc; /* 0 1696 */ /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */ struct ucontext uc_transact; /* 1696 1696 */ /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */ long unsigned int _unused[2]; /* 3392 16 */ unsigned int tramp[6]; /* 3408 24 */ struct siginfo * pinfo; /* 3432 8 */ void * puc; /* 3440 8 */ struct siginfo info; /* 3448 128 */ /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */ char abigap[512]; /* 3576 512 */ <--
/* size: 4096, cachelines: 32, members: 8 */ /* padding: 8 */ };
4096 + 128 = 4224
Then finally in 2017, commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") exposed us to the existing bug, because it changed the stack VMA to be the correct/real size, meaning our stack expansion code is now triggered.
Fix it by increasing the allowance to 4224 bytes.
Hard-coding 4224 is obviously unsafe against future expansions of the signal frame in the same way as the existing code. We can't easily use sizeof() because the signal frame structure is not in a header. We will either fix that, or rip out all the custom stack expansion checking logic entirely.
Fixes: ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") Cc: stable@vger.kernel.org # v2.6.27+ Reported-by: Tom Lane tgl@sss.pgh.pa.us Tested-by: Daniel Axtens dja@axtens.net Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au Signed-off-by: Sasha Levin sashal@kernel.org --- arch/powerpc/mm/fault.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 998c77e600a43..ebe97e5500ee5 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -224,6 +224,9 @@ static bool bad_kernel_fault(bool is_exec, unsigned long error_code, return is_exec || (address >= TASK_SIZE); }
+// This comes from 64-bit struct rt_sigframe + __SIGNAL_FRAMESIZE +#define SIGFRAME_MAX_SIZE (4096 + 128) + static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, struct vm_area_struct *vma, unsigned int flags, bool *must_retry) @@ -231,7 +234,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, /* * N.B. The POWER/Open ABI allows programs to access up to * 288 bytes below the stack pointer. - * The kernel signal delivery code writes up to about 1.5kB + * The kernel signal delivery code writes a bit over 4KB * below the stack pointer (r1) before decrementing it. * The exec code can write slightly over 640kB to the stack * before setting the user r1. Thus we allow the stack to @@ -256,7 +259,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, * between the last mapped region and the stack will * expand the stack rather than segfaulting. */ - if (address + 2048 >= uregs->gpr[1]) + if (address + SIGFRAME_MAX_SIZE >= uregs->gpr[1]) return false;
if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
From: Marcos Paulo de Souza mpdesouza@suse.com
[ Upstream commit c0c907a47dccf2cf26251a8fb4a8e7a3bf79ce84 ]
The functions will be used outside of export.c and super.c to allow resolving subvolume name from a given id, eg. for subvolume deletion by id ioctl.
Signed-off-by: Marcos Paulo de Souza mpdesouza@suse.com Reviewed-by: David Sterba dsterba@suse.com [ split from the next patch ] Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/export.c | 8 ++++---- fs/btrfs/export.h | 5 +++++ fs/btrfs/super.c | 8 ++++---- 4 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5412b12491cb8..de951987fd23d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3262,6 +3262,8 @@ ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size); int btrfs_parse_options(struct btrfs_fs_info *info, char *options, unsigned long new_flags); int btrfs_sync_fs(struct super_block *sb, int wait); +char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info, + u64 subvol_objectid);
static inline __printf(2, 3) void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...) diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c index 3aeb5770f8965..b6ce765aa7f33 100644 --- a/fs/btrfs/export.c +++ b/fs/btrfs/export.c @@ -56,9 +56,9 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len, return type; }
-static struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid, - u64 root_objectid, u32 generation, - int check_generation) +struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid, + u64 root_objectid, u32 generation, + int check_generation) { struct btrfs_fs_info *fs_info = btrfs_sb(sb); struct btrfs_root *root; @@ -151,7 +151,7 @@ static struct dentry *btrfs_fh_to_dentry(struct super_block *sb, struct fid *fh, return btrfs_get_dentry(sb, objectid, root_objectid, generation, 1); }
-static struct dentry *btrfs_get_parent(struct dentry *child) +struct dentry *btrfs_get_parent(struct dentry *child) { struct inode *dir = d_inode(child); struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb); diff --git a/fs/btrfs/export.h b/fs/btrfs/export.h index 91b3908e7c549..15db024621414 100644 --- a/fs/btrfs/export.h +++ b/fs/btrfs/export.h @@ -17,4 +17,9 @@ struct btrfs_fid { u64 parent_root_objectid; } __attribute__ ((packed));
+struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid, + u64 root_objectid, u32 generation, + int check_generation); +struct dentry *btrfs_get_parent(struct dentry *child); + #endif diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 17a8463ef35c1..ca95e57b60ee1 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -939,8 +939,8 @@ out: return error; }
-static char *get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info, - u64 subvol_objectid) +char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info, + u64 subvol_objectid) { struct btrfs_root *root = fs_info->tree_root; struct btrfs_root *fs_root; @@ -1427,8 +1427,8 @@ static struct dentry *mount_subvol(const char *subvol_name, u64 subvol_objectid, goto out; } } - subvol_name = get_subvol_name_from_objectid(btrfs_sb(mnt->mnt_sb), - subvol_objectid); + subvol_name = btrfs_get_subvol_name_from_objectid( + btrfs_sb(mnt->mnt_sb), subvol_objectid); if (IS_ERR(subvol_name)) { root = ERR_CAST(subvol_name); subvol_name = NULL;
From: Josef Bacik josef@toxicpanda.com
[ Upstream commit 3ef3959b29c4a5bd65526ab310a1a18ae533172a ]
Chris Murphy reported a problem where rpm ostree will bind mount a bunch of things for whatever voodoo it's doing. But when it does this /proc/mounts shows something like
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0 /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
Despite subvolid=256 being subvol=/foo. This is because we're just spitting out the dentry of the mount point, which in the case of bind mounts is the source path for the mountpoint. Instead we should spit out the path to the actual subvol. Fix this by looking up the name for the subvolid we have mounted. With this fix the same test looks like this
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0 /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
Reported-by: Chris Murphy chris@colorremedies.com CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/super.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index ca95e57b60ee1..eb64d4b159e07 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1221,6 +1221,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) { struct btrfs_fs_info *info = btrfs_sb(dentry->d_sb); char *compress_type; + const char *subvol_name;
if (btrfs_test_opt(info, DEGRADED)) seq_puts(seq, ",degraded"); @@ -1307,8 +1308,13 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) #endif seq_printf(seq, ",subvolid=%llu", BTRFS_I(d_inode(dentry))->root->root_key.objectid); - seq_puts(seq, ",subvol="); - seq_dentry(seq, dentry, " \t\n\"); + subvol_name = btrfs_get_subvol_name_from_objectid(info, + BTRFS_I(d_inode(dentry))->root->root_key.objectid); + if (!IS_ERR(subvol_name)) { + seq_puts(seq, ",subvol="); + seq_escape(seq, subvol_name, " \t\n\"); + kfree(subvol_name); + } return 0; }
From: Nikolay Borisov nborisov@suse.com
[ Upstream commit cecc8d9038d164eda61fbcd72520975a554ea63e ]
This label is only executed if compress_file_range fails to create an inline extent. So move its code in the semantically related inline extent handling branch. No functional changes.
Signed-off-by: Nikolay Borisov nborisov@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/inode.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 57908ee964a20..dc520749f51db 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -629,7 +629,14 @@ cont: btrfs_free_reserved_data_space_noquota(inode, start, end - start + 1); - goto free_pages_out; + + for (i = 0; i < nr_pages; i++) { + WARN_ON(pages[i]->mapping); + put_page(pages[i]); + } + kfree(pages); + + return; } }
@@ -708,13 +715,6 @@ cleanup_and_bail_uncompressed: *num_added += 1;
return; - -free_pages_out: - for (i = 0; i < nr_pages; i++) { - WARN_ON(pages[i]->mapping); - put_page(pages[i]); - } - kfree(pages); }
static void free_async_extent_pages(struct async_extent *async_extent)
From: Qu Wenruo wqu@suse.com
[ Upstream commit 1e6e238c3002ea3611465ce5f32777ddd6a40126 ]
[BUG] There is a bug report of NULL pointer dereference caused in compress_file_extent():
Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs] LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] Call Trace: [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable) [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs] [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs] [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0 [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660 [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0 [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c ---[ end trace f16954aa20d822f6 ]---
[CAUSE] For the following execution route of compress_file_range(), it's possible to hit NULL pointer dereference:
compress_file_extent() |- pages = NULL; |- start = async_chunk->start = 0; |- end = async_chunk = 4095; |- nr_pages = 1; |- inode_need_compress() == false; <<< Possible, see later explanation | Now, we have nr_pages = 1, pages = NULL |- cont: |- ret = cow_file_range_inline(); |- if (ret <= 0) { |- for (i = 0; i < nr_pages; i++) { |- WARN_ON(pages[i]->mapping); <<< Crash
To enter above call execution branch, we need the following race:
Thread 1 (chattr) | Thread 2 (writeback) --------------------------+------------------------------ | btrfs_run_delalloc_range | |- inode_need_compress = true | |- cow_file_range_async() btrfs_ioctl_set_flag() | |- binode_flags |= | BTRFS_INODE_NOCOMPRESS | | compress_file_range() | |- inode_need_compress = false | |- nr_page = 1 while pages = NULL | | Then hit the crash
[FIX] This patch will fix it by checking @pages before doing accessing it. This patch is only designed as a hot fix and easy to backport.
More elegant fix may make btrfs only check inode_need_compress() once to avoid such race, but that would be another story.
Reported-by: Luciano Chavez chavez@us.ibm.com Fixes: 4d3a800ebb12 ("btrfs: merge nr_pages input and output parameter in compress_pages") CC: stable@vger.kernel.org # 4.14.x: cecc8d9038d16: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Qu Wenruo wqu@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/inode.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index dc520749f51db..17856e92b93d1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -630,11 +630,18 @@ cont: start, end - start + 1);
- for (i = 0; i < nr_pages; i++) { - WARN_ON(pages[i]->mapping); - put_page(pages[i]); + /* + * Ensure we only free the compressed pages if we have + * them allocated, as we can still reach here with + * inode_need_compress() == false. + */ + if (pages) { + for (i = 0; i < nr_pages; i++) { + WARN_ON(pages[i]->mapping); + put_page(pages[i]); + } + kfree(pages); } - kfree(pages);
return; }
From: Josef Bacik josef@toxicpanda.com
Dave hit this splat during testing btrfs/078:
====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc6-default+ #1191 Not tainted ------------------------------------------------------ kswapd0/75 is trying to acquire lock: ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
but task is already holding lock: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc_track_caller+0x49/0x330 kstrdup+0x2e/0x60 __kernfs_new_node.constprop.0+0x44/0x250 kernfs_new_node+0x25/0x50 kernfs_create_link+0x34/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs] btrfs_init_new_device+0x44c/0x12b0 [btrfs] btrfs_ioctl+0xc3c/0x25c0 [btrfs] ksys_ioctl+0x68/0xa0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 btrfs_chunk_alloc+0x137/0x3e0 [btrfs] find_free_extent+0xb44/0xfb0 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x143/0x7a0 [btrfs] btrfs_cow_block+0x15f/0x310 [btrfs] push_leaf_right+0x150/0x240 [btrfs] split_leaf+0x3cd/0x6d0 [btrfs] btrfs_search_slot+0xd14/0xf70 [btrfs] btrfs_insert_empty_items+0x64/0xc0 [btrfs] __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs] btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs] btrfs_work_helper+0x2f9/0x650 [btrfs] process_one_work+0x22c/0x600 worker_thread+0x50/0x3b0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30
-> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 kthread+0x137/0x150 ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/75: #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0 #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
stack backtrace: CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x16f/0x190 check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __mutex_lock+0xa0/0xaf0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? __lock_acquire+0x56f/0xaa0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? lock_acquire+0xa3/0x440 ? btrfs_evict_inode+0x138/0x560 [btrfs] ? btrfs_evict_inode+0x2fe/0x560 [btrfs] ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 ? _raw_spin_unlock_irqrestore+0x3e/0x50 ? finish_wait+0x90/0x90 ? balance_pgdat+0x750/0x750 kthread+0x137/0x150 ? kthread_stop+0x2a0/0x2a0 ret_from_fork+0x1f/0x30
This is because we're holding the chunk_mutex while adding this device and adding its sysfs entries. We actually hold different locks in different places when calling this function, the dev_replace semaphore for instance in dev replace, so instead of moving this call around simply wrap it's operations in NOFS.
CC: stable@vger.kernel.org # 4.14+ Reported-by: David Sterba dsterba@suse.com Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com --- fs/btrfs/sysfs.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index f05341bda1d14..383546ff62f04 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -25,6 +25,7 @@ #include <linux/bug.h> #include <linux/genhd.h> #include <linux/debugfs.h> +#include <linux/sched/mm.h>
#include "ctree.h" #include "disk-io.h" @@ -749,7 +750,9 @@ int btrfs_sysfs_add_device_link(struct btrfs_fs_devices *fs_devices, { int error = 0; struct btrfs_device *dev; + unsigned int nofs_flag;
+ nofs_flag = memalloc_nofs_save(); list_for_each_entry(dev, &fs_devices->devices, dev_list) { struct hd_struct *disk; struct kobject *disk_kobj; @@ -768,6 +771,7 @@ int btrfs_sysfs_add_device_link(struct btrfs_fs_devices *fs_devices, if (error) break; } + memalloc_nofs_restore(nofs_flag);
return error; }
From: Jann Horn jannh@google.com
commit bcf85fcedfdd17911982a3e3564fcfec7b01eebd upstream.
romfs has a superblock field that limits the size of the filesystem; data beyond that limit is never accessed.
romfs_dev_read() fetches a caller-supplied number of bytes from the backing device. It returns 0 on success or an error code on failure; therefore, its API can't represent short reads, it's all-or-nothing.
However, when romfs_dev_read() detects that the requested operation would cross the filesystem size limit, it currently silently truncates the requested number of bytes. This e.g. means that when the content of a file with size 0x1000 starts one byte before the filesystem size limit, ->readpage() will only fill a single byte of the supplied page while leaving the rest uninitialized, leaking that uninitialized memory to userspace.
Fix it by returning an error code instead of truncating the read when the requested read operation would go beyond the end of the filesystem.
Fixes: da4458bda237 ("NOMMU: Make it possible for RomFS to use MTD devices directly") Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: David Howells dhowells@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/romfs/storage.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
--- a/fs/romfs/storage.c +++ b/fs/romfs/storage.c @@ -221,10 +221,8 @@ int romfs_dev_read(struct super_block *s size_t limit;
limit = romfs_maxsize(sb); - if (pos >= limit) + if (pos >= limit || buflen > limit - pos) return -EIO; - if (buflen > limit - pos) - buflen = limit - pos;
#ifdef CONFIG_ROMFS_ON_MTD if (sb->s_mtd)
From: Wei Yongjun weiyongjun1@huawei.com
commit 71e843295c680898959b22dc877ae3839cc22470 upstream.
kmemleak report memory leak as follows:
unreferenced object 0x607ee4e5f948 (size 8): comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: relay_open kernel/relay.c:583 [inline] relay_open+0xb6/0x970 kernel/relay.c:563 do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557 __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597 blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738 blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613 block_ioctl+0xe5/0x120 fs/block_dev.c:1871 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9
'chan->buf' is malloced in relay_open() by alloc_percpu() but not free while destroy the relay channel. Fix it by adding free_percpu() before return from relay_destroy_channel().
Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Chris Wilson chris@chris-wilson.co.uk Cc: Al Viro viro@zeniv.linux.org.uk Cc: Michael Ellerman mpe@ellerman.id.au Cc: David Rientjes rientjes@google.com Cc: Michel Lespinasse walken@google.com Cc: Daniel Axtens dja@axtens.net Cc: Thomas Gleixner tglx@linutronix.de Cc: Akash Goel akash.goel@intel.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/relay.c | 1 + 1 file changed, 1 insertion(+)
--- a/kernel/relay.c +++ b/kernel/relay.c @@ -196,6 +196,7 @@ free_buf: static void relay_destroy_channel(struct kref *kref) { struct rchan *chan = container_of(kref, struct rchan, kref); + free_percpu(chan->buf); kfree(chan); }
From: Doug Berger opendmb@gmail.com
commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream.
The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function.
The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file.
The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting.
The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot.
This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged.
In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout
cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff]
would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily.
Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve.
This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values.
Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger opendmb@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Michal Hocko mhocko@suse.com Cc: Jason Baron jbaron@akamai.com Cc: David Rientjes rientjes@google.com Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7018,7 +7018,7 @@ int __meminit init_per_zone_wmark_min(vo
return 0; } -core_initcall(init_per_zone_wmark_min) +postcore_initcall(init_per_zone_wmark_min)
/* * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
From: Charan Teja Reddy charante@codeaurora.org
commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream.
The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone.
P1 P2
Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1.
Allocate the pages from the movable zone.
Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1.
Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung.
Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages.
The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block.
With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself.
This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet).
[1]: https://patchwork.kernel.org/patch/11696389/
Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: Charan Teja Reddy charante@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: David Hildenbrand david@redhat.com Acked-by: David Rientjes rientjes@google.com Acked-by: Michal Hocko mhocko@suse.com Cc: Michal Hocko mhocko@suse.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Vinayak Menon vinmenon@codeaurora.org Cc: stable@vger.kernel.org [2.6+] Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeauro... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1114,6 +1114,11 @@ static void free_pcppages_bulk(struct zo spin_lock(&zone->lock); isolated_pageblocks = has_isolate_pageblock(zone);
+ /* + * Ensure proper count is passed which otherwise would stuck in the + * below while (list_empty(list)) loop. + */ + count = min(pcp->count, count); while (count) { struct page *page; struct list_head *list;
From: Jan Kara jack@suse.cz
commit 7303cb5bfe845f7d43cd9b2dbd37dbb266efda9b upstream.
ext4_search_dir() and ext4_generic_delete_entry() can be called both for standard director blocks and for inline directories stored inside inode or inline xattr space. For the second case we didn't call ext4_check_dir_entry() with proper constraints that could result in accepting corrupted directory entry as well as false positive filesystem errors like:
EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400: block 113246792: comm dockerd: bad entry in directory: directory entry too close to block end - offset=0, inode=28320403, rec_len=32, name_len=8, size=4096
Fix the arguments passed to ext4_check_dir_entry().
Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end") CC: stable@vger.kernel.org Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ext4/namei.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1308,8 +1308,8 @@ int ext4_search_dir(struct buffer_head * ext4_match(fname, de)) { /* found a match - just to be sure, do * a full check */ - if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data, - bh->b_size, offset)) + if (ext4_check_dir_entry(dir, NULL, de, bh, search_buf, + buf_size, offset)) return -1; *res_dir = de; return 1; @@ -2353,7 +2353,7 @@ int ext4_generic_delete_entry(handle_t * de = (struct ext4_dir_entry_2 *)entry_buf; while (i < buf_size - csum_size) { if (ext4_check_dir_entry(dir, NULL, de, bh, - bh->b_data, bh->b_size, i)) + entry_buf, buf_size, i)) return -EFSCORRUPTED; if (de == de_del) { if (pde)
From: zhangyi (F) yi.zhang@huawei.com
commit ef3f5830b859604eda8723c26d90ab23edc027a4 upstream.
jbd2_write_superblock() is under the buffer lock of journal superblock before ending that superblock write, so add a missing unlock_buffer() in in the error path before submitting buffer.
Fixes: 742b06b5628f ("jbd2: check superblock mapped prior to committing") Signed-off-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Ritesh Harjani riteshh@linux.ibm.com Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/jbd2/journal.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -1356,8 +1356,10 @@ static int jbd2_write_superblock(journal int ret;
/* Buffer got discarded which means block device got invalidated */ - if (!buffer_mapped(bh)) + if (!buffer_mapped(bh)) { + unlock_buffer(bh); return -EIO; + }
trace_jbd2_write_superblock(journal, write_flags); if (!(journal->j_flags & JBD2_BARRIER))
From: Yang Shi shy828301@gmail.com
commit b7333b58f358f38d90d78e00c1ee5dec82df10ad upstream.
Recently we found regression when running will_it_scale/page_fault3 test on ARM64. Over 70% down for the multi processes cases and over 20% down for the multi threads cases. It turns out the regression is caused by commit 89b15332af7c ("mm: drop mmap_sem before calling balance_dirty_pages() in write fault").
The test mmaps a memory size file then write to the mapping, this would make all memory dirty and trigger dirty pages throttle, that upstream commit would release mmap_sem then retry the page fault. The retried page fault would see correct PTEs installed then just fall through to spurious TLB flush. The regression is caused by the excessive spurious TLB flush. It is fine on x86 since x86's spurious TLB flush is no-op.
We could just skip the spurious TLB flush to mitigate the regression.
Suggested-by: Linus Torvalds torvalds@linux-foundation.org Reported-by: Xu Yu xuyu@linux.alibaba.com Debugged-by: Xu Yu xuyu@linux.alibaba.com Tested-by: Xu Yu xuyu@linux.alibaba.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com Cc: stable@vger.kernel.org Signed-off-by: Yang Shi shy828301@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/memory.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/mm/memory.c +++ b/mm/memory.c @@ -4010,6 +4010,9 @@ static int handle_pte_fault(struct vm_fa vmf->flags & FAULT_FLAG_WRITE)) { update_mmu_cache(vmf->vma, vmf->address, vmf->pte); } else { + /* Skip spurious TLB flush for retried page fault */ + if (vmf->flags & FAULT_FLAG_TRIED) + goto unlock; /* * This is needed only for protection faults but the arch code * is not yet telling us if this is a protection fault or not.
From: Lukas Wunner lukas@wunner.de
[ Upstream commit ddf75be47ca748f8b12d28ac64d624354fddf189 ]
CONFIG_OF_DYNAMIC and CONFIG_ACPI allow adding SPI devices at runtime using a DeviceTree overlay or DSDT patch. CONFIG_SPI_SLAVE allows the same via sysfs.
But there are no precautions to prevent adding a device below a controller that's being removed. Such a device is unusable and may not even be able to unbind cleanly as it becomes inaccessible once the controller has been torn down. E.g. it is then impossible to quiesce the device's interrupt.
of_spi_notify() and acpi_spi_notify() do hold a ref on the controller, but otherwise run lockless against spi_unregister_controller().
Fix by holding the spi_add_lock in spi_unregister_controller() and bailing out of spi_add_device() if the controller has been unregistered concurrently.
Fixes: ce79d54ae447 ("spi/of: Add OF notifier handler") Signed-off-by: Lukas Wunner lukas@wunner.de Cc: stable@vger.kernel.org # v3.19+ Cc: Geert Uytterhoeven geert+renesas@glider.be Cc: Octavian Purdila octavian.purdila@intel.com Cc: Pantelis Antoniou pantelis.antoniou@konsulko.com Link: https://lore.kernel.org/r/a8c3205088a969dc8410eec1eba9aface60f36af.159645103... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/spi/Kconfig | 3 +++ drivers/spi/spi.c | 21 ++++++++++++++++++++- 2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/drivers/spi/Kconfig b/drivers/spi/Kconfig index a75f2a2cf7805..4b6a1629969f3 100644 --- a/drivers/spi/Kconfig +++ b/drivers/spi/Kconfig @@ -827,4 +827,7 @@ config SPI_SLAVE_SYSTEM_CONTROL
endif # SPI_SLAVE
+config SPI_DYNAMIC + def_bool ACPI || OF_DYNAMIC || SPI_SLAVE + endif # SPI diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index 49eee894f51d4..ab6a4f85bcde7 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -428,6 +428,12 @@ static LIST_HEAD(spi_controller_list); */ static DEFINE_MUTEX(board_lock);
+/* + * Prevents addition of devices with same chip select and + * addition of devices below an unregistering controller. + */ +static DEFINE_MUTEX(spi_add_lock); + /** * spi_alloc_device - Allocate a new SPI device * @ctlr: Controller to which device is connected @@ -506,7 +512,6 @@ static int spi_dev_check(struct device *dev, void *data) */ int spi_add_device(struct spi_device *spi) { - static DEFINE_MUTEX(spi_add_lock); struct spi_controller *ctlr = spi->controller; struct device *dev = ctlr->dev.parent; int status; @@ -534,6 +539,13 @@ int spi_add_device(struct spi_device *spi) goto done; }
+ /* Controller may unregister concurrently */ + if (IS_ENABLED(CONFIG_SPI_DYNAMIC) && + !device_is_registered(&ctlr->dev)) { + status = -ENODEV; + goto done; + } + if (ctlr->cs_gpios) spi->cs_gpio = ctlr->cs_gpios[spi->chip_select];
@@ -2265,6 +2277,10 @@ void spi_unregister_controller(struct spi_controller *ctlr) struct spi_controller *found; int id = ctlr->bus_num;
+ /* Prevent addition of new devices, unregister existing ones */ + if (IS_ENABLED(CONFIG_SPI_DYNAMIC)) + mutex_lock(&spi_add_lock); + device_for_each_child(&ctlr->dev, NULL, __unregister);
/* First make sure that this controller was ever added */ @@ -2285,6 +2301,9 @@ void spi_unregister_controller(struct spi_controller *ctlr) if (found == ctlr) idr_remove(&spi_master_idr, id); mutex_unlock(&board_lock); + + if (IS_ENABLED(CONFIG_SPI_DYNAMIC)) + mutex_unlock(&spi_add_lock); } EXPORT_SYMBOL_GPL(spi_unregister_controller);
From: Stanley Chu stanley.chu@mediatek.com
[ Upstream commit c0a18ee0ce78d7957ec1a53be35b1b3beba80668 ]
It is confirmed that Micron device needs DELAY_BEFORE_LPM quirk to have a delay before VCC is powered off. Sdd Micron vendor ID and this quirk for Micron devices.
Link: https://lore.kernel.org/r/20200612012625.6615-2-stanley.chu@mediatek.com Reviewed-by: Bean Huo beanhuo@micron.com Reviewed-by: Alim Akhtar alim.akhtar@samsung.com Signed-off-by: Stanley Chu stanley.chu@mediatek.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/scsi/ufs/ufs_quirks.h | 1 + drivers/scsi/ufs/ufshcd.c | 2 ++ 2 files changed, 3 insertions(+)
diff --git a/drivers/scsi/ufs/ufs_quirks.h b/drivers/scsi/ufs/ufs_quirks.h index 71f73d1d1ad1f..6c944fbefd40a 100644 --- a/drivers/scsi/ufs/ufs_quirks.h +++ b/drivers/scsi/ufs/ufs_quirks.h @@ -21,6 +21,7 @@ #define UFS_ANY_VENDOR 0xFFFF #define UFS_ANY_MODEL "ANY_MODEL"
+#define UFS_VENDOR_MICRON 0x12C #define UFS_VENDOR_TOSHIBA 0x198 #define UFS_VENDOR_SAMSUNG 0x1CE #define UFS_VENDOR_SKHYNIX 0x1AD diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c index 1e2a97a10033b..11e917b44a0f1 100644 --- a/drivers/scsi/ufs/ufshcd.c +++ b/drivers/scsi/ufs/ufshcd.c @@ -189,6 +189,8 @@ ufs_get_desired_pm_lvl_for_dev_link_state(enum ufs_dev_pwr_mode dev_state,
static struct ufs_dev_fix ufs_fixups[] = { /* UFS cards deviations table */ + UFS_FIX(UFS_VENDOR_MICRON, UFS_ANY_MODEL, + UFS_DEVICE_QUIRK_DELAY_BEFORE_LPM), UFS_FIX(UFS_VENDOR_SAMSUNG, UFS_ANY_MODEL, UFS_DEVICE_QUIRK_DELAY_BEFORE_LPM), UFS_FIX(UFS_VENDOR_SAMSUNG, UFS_ANY_MODEL, UFS_DEVICE_NO_VCCQ),
From: Chuhong Yuan hslester96@gmail.com
[ Upstream commit fc0456458df8b3421dba2a5508cd817fbc20ea71 ]
budget_register() has no error handling after its failure. Add the missed undo functions for error handling to fix it.
Signed-off-by: Chuhong Yuan hslester96@gmail.com Signed-off-by: Sean Young sean@mess.org Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/media/pci/ttpci/budget-core.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/media/pci/ttpci/budget-core.c b/drivers/media/pci/ttpci/budget-core.c index 97499b2af7144..20524376b83be 100644 --- a/drivers/media/pci/ttpci/budget-core.c +++ b/drivers/media/pci/ttpci/budget-core.c @@ -383,20 +383,25 @@ static int budget_register(struct budget *budget) ret = dvbdemux->dmx.add_frontend(&dvbdemux->dmx, &budget->hw_frontend);
if (ret < 0) - return ret; + goto err_release_dmx;
budget->mem_frontend.source = DMX_MEMORY_FE; ret = dvbdemux->dmx.add_frontend(&dvbdemux->dmx, &budget->mem_frontend); if (ret < 0) - return ret; + goto err_release_dmx;
ret = dvbdemux->dmx.connect_frontend(&dvbdemux->dmx, &budget->hw_frontend); if (ret < 0) - return ret; + goto err_release_dmx;
dvb_net_init(&budget->dvb_adapter, &budget->dvb_net, &dvbdemux->dmx);
return 0; + +err_release_dmx: + dvb_dmxdev_release(&budget->dmxdev); + dvb_dmx_release(&budget->demux); + return ret; }
static void budget_unregister(struct budget *budget)
From: Huacai Chen chenhc@lemote.com
[ Upstream commit 22f8d5a1bf230cf8567a4121fc3789babb46336d ]
When use goldfish rtc, the "hwclock" command fails with "select() to /dev/rtc to wait for clock tick timed out". This is because "hwclock" need the set_alarm() hook to enable interrupt when alrm->enabled is true. This operation is missing in goldfish rtc (but other rtc drivers, such as cmos rtc, enable interrupt here), so add it.
Signed-off-by: Huacai Chen chenhc@lemote.com Signed-off-by: Jiaxun Yang jiaxun.yang@flygoat.com Signed-off-by: Alexandre Belloni alexandre.belloni@bootlin.com Link: https://lore.kernel.org/r/1592654683-31314-1-git-send-email-chenhc@lemote.co... Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/rtc/rtc-goldfish.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/rtc/rtc-goldfish.c b/drivers/rtc/rtc-goldfish.c index a1c44d0c85578..30cbe22c57a8e 100644 --- a/drivers/rtc/rtc-goldfish.c +++ b/drivers/rtc/rtc-goldfish.c @@ -87,6 +87,7 @@ static int goldfish_rtc_set_alarm(struct device *dev, rtc_alarm64 = rtc_alarm * NSEC_PER_SEC; writel((rtc_alarm64 >> 32), base + TIMER_ALARM_HIGH); writel(rtc_alarm64, base + TIMER_ALARM_LOW); + writel(1, base + TIMER_IRQ_ENABLED); } else { /* * if this function was called with enabled=0
From: Evgeny Novikov novikov@ispras.ru
[ Upstream commit 9c487b0b0ea7ff22127fe99a7f67657d8730ff94 ]
If platform_driver_register() fails within vpss_init() resources are not cleaned up. The patch fixes this issue by introducing the corresponding error handling.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Evgeny Novikov novikov@ispras.ru Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/media/platform/davinci/vpss.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/media/platform/davinci/vpss.c b/drivers/media/platform/davinci/vpss.c index 2ee4cd9e6d80f..d984f45c03149 100644 --- a/drivers/media/platform/davinci/vpss.c +++ b/drivers/media/platform/davinci/vpss.c @@ -514,19 +514,31 @@ static void vpss_exit(void)
static int __init vpss_init(void) { + int ret; + if (!request_mem_region(VPSS_CLK_CTRL, 4, "vpss_clock_control")) return -EBUSY;
oper_cfg.vpss_regs_base2 = ioremap(VPSS_CLK_CTRL, 4); if (unlikely(!oper_cfg.vpss_regs_base2)) { - release_mem_region(VPSS_CLK_CTRL, 4); - return -ENOMEM; + ret = -ENOMEM; + goto err_ioremap; }
writel(VPSS_CLK_CTRL_VENCCLKEN | - VPSS_CLK_CTRL_DACCLKEN, oper_cfg.vpss_regs_base2); + VPSS_CLK_CTRL_DACCLKEN, oper_cfg.vpss_regs_base2); + + ret = platform_driver_register(&vpss_driver); + if (ret) + goto err_pd_register; + + return 0;
- return platform_driver_register(&vpss_driver); +err_pd_register: + iounmap(oper_cfg.vpss_regs_base2); +err_ioremap: + release_mem_region(VPSS_CLK_CTRL, 4); + return ret; } subsys_initcall(vpss_init); module_exit(vpss_exit);
From: Xiongfeng Wang wangxiongfeng2@huawei.com
[ Upstream commit 4aec14de3a15cf9789a0e19c847f164776f49473 ]
When I cat parameter 'proto' by sysfs, it displays as follows. It's better to add a newline for easy reading.
root@syzkaller:~# cat /sys/module/psmouse/parameters/proto autoroot@syzkaller:~#
Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Link: https://lore.kernel.org/r/20200720073846.120724-1-wangxiongfeng2@huawei.com Signed-off-by: Dmitry Torokhov dmitry.torokhov@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/input/mouse/psmouse-base.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/input/mouse/psmouse-base.c b/drivers/input/mouse/psmouse-base.c index 8ac9e03c05b45..ca8f726dab2e7 100644 --- a/drivers/input/mouse/psmouse-base.c +++ b/drivers/input/mouse/psmouse-base.c @@ -2012,7 +2012,7 @@ static int psmouse_get_maxproto(char *buffer, const struct kernel_param *kp) { int type = *((unsigned int *)kp->arg);
- return sprintf(buffer, "%s", psmouse_protocol_by_type(type)->name); + return sprintf(buffer, "%s\n", psmouse_protocol_by_type(type)->name); }
static int __init psmouse_init(void)
From: Greg Ungerer gerg@linux-m68k.org
[ Upstream commit bdee0e793cea10c516ff48bf3ebb4ef1820a116b ]
The Cache Control Register (CACR) of the ColdFire V3 has bits that control high level caching functions, and also enable/disable the use of the alternate stack pointer register (the EUSP bit) to provide separate supervisor and user stack pointer registers. The code as it is today will blindly clear the EUSP bit on cache actions like invalidation. So it is broken for this case - and that will result in failed booting (interrupt entry and exit processing will be completely hosed).
This only affects ColdFire V3 parts that support the alternate stack register (like the 5329 for example) - generally speaking new parts do, older parts don't. It has no impact on ColdFire V3 parts with the single stack pointer, like the 5307 for example.
Fix the cache bit defines used, so they maintain the EUSP bit when carrying out cache actions through the CACR register.
Signed-off-by: Greg Ungerer gerg@linux-m68k.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/m68k/include/asm/m53xxacr.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/m68k/include/asm/m53xxacr.h b/arch/m68k/include/asm/m53xxacr.h index 9138a624c5c81..692f90e7fecc1 100644 --- a/arch/m68k/include/asm/m53xxacr.h +++ b/arch/m68k/include/asm/m53xxacr.h @@ -89,9 +89,9 @@ * coherency though in all cases. And for copyback caches we will need * to push cached data as well. */ -#define CACHE_INIT CACR_CINVA -#define CACHE_INVALIDATE CACR_CINVA -#define CACHE_INVALIDATED CACR_CINVA +#define CACHE_INIT (CACHE_MODE + CACR_CINVA - CACR_EC) +#define CACHE_INVALIDATE (CACHE_MODE + CACR_CINVA) +#define CACHE_INVALIDATED (CACHE_MODE + CACR_CINVA)
#define ACR0_MODE ((CONFIG_RAMBASE & 0xff000000) + \ (0x000f0000) + \
From: Darrick J. Wong darrick.wong@oracle.com
[ Upstream commit f959b5d037e71a4d69b5bf71faffa065d9269b4a ]
xfs_trans_dqresv is the function that we use to make reservations against resource quotas. Each resource contains two counters: the q_core counter, which tracks resources allocated on disk; and the dquot reservation counter, which tracks how much of that resource has either been allocated or reserved by threads that are working on metadata updates.
For disk blocks, we compare the proposed reservation counter against the hard and soft limits to decide if we're going to fail the operation. However, for inodes we inexplicably compare against the q_core counter, not the incore reservation count.
Since the q_core counter is always lower than the reservation count and we unlock the dquot between reservation and transaction commit, this means that multiple threads can reserve the last inode count before we hit the hard limit, and when they commit, we'll be well over the hard limit.
Fix this by checking against the incore inode reservation counter, since we would appear to maintain that correctly (and that's what we report in GETQUOTA).
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Allison Collins allison.henderson@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org --- fs/xfs/xfs_trans_dquot.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c index c3d547211d160..9c42e50a5cb7e 100644 --- a/fs/xfs/xfs_trans_dquot.c +++ b/fs/xfs/xfs_trans_dquot.c @@ -669,7 +669,7 @@ xfs_trans_dqresv( } } if (ninos > 0) { - total_count = be64_to_cpu(dqp->q_core.d_icount) + ninos; + total_count = dqp->q_res_icount + ninos; timer = be32_to_cpu(dqp->q_core.d_itimer); warns = be16_to_cpu(dqp->q_core.d_iwarns); warnlimit = dqp->q_mount->m_quotainfo->qi_iwarnlimit;
From: Zhe Li lizhe67@huawei.com
[ Upstream commit 798b7347e4f29553db4b996393caf12f5b233daf ]
The log of UAF problem is listed below. BUG: KASAN: use-after-free in jffs2_rmdir+0xa4/0x1cc [jffs2] at addr c1f165fc Read of size 4 by task rm/8283 ============================================================================= BUG kmalloc-32 (Tainted: P B O ): kasan: bad access detected -----------------------------------------------------------------------------
INFO: Allocated in 0xbbbbbbbb age=3054364 cpu=0 pid=0 0xb0bba6ef jffs2_write_dirent+0x11c/0x9c8 [jffs2] __slab_alloc.isra.21.constprop.25+0x2c/0x44 __kmalloc+0x1dc/0x370 jffs2_write_dirent+0x11c/0x9c8 [jffs2] jffs2_do_unlink+0x328/0x5fc [jffs2] jffs2_rmdir+0x110/0x1cc [jffs2] vfs_rmdir+0x180/0x268 do_rmdir+0x2cc/0x300 ret_from_syscall+0x0/0x3c INFO: Freed in 0x205b age=3054364 cpu=0 pid=0 0x2e9173 jffs2_add_fd_to_list+0x138/0x1dc [jffs2] jffs2_add_fd_to_list+0x138/0x1dc [jffs2] jffs2_garbage_collect_dirent.isra.3+0x21c/0x288 [jffs2] jffs2_garbage_collect_live+0x16bc/0x1800 [jffs2] jffs2_garbage_collect_pass+0x678/0x11d4 [jffs2] jffs2_garbage_collect_thread+0x1e8/0x3b0 [jffs2] kthread+0x1a8/0x1b0 ret_from_kernel_thread+0x5c/0x64 Call Trace: [c17ddd20] [c02452d4] kasan_report.part.0+0x298/0x72c (unreliable) [c17ddda0] [d2509680] jffs2_rmdir+0xa4/0x1cc [jffs2] [c17dddd0] [c026da04] vfs_rmdir+0x180/0x268 [c17dde00] [c026f4e4] do_rmdir+0x2cc/0x300 [c17ddf40] [c001a658] ret_from_syscall+0x0/0x3c
The root cause is that we don't get "jffs2_inode_info.sem" before we scan list "jffs2_inode_info.dents" in function jffs2_rmdir. This patch add codes to get "jffs2_inode_info.sem" before we scan "jffs2_inode_info.dents" to slove the UAF problem.
Signed-off-by: Zhe Li lizhe67@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Richard Weinberger richard@nod.at Signed-off-by: Sasha Levin sashal@kernel.org --- fs/jffs2/dir.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index e5a6deb38e1e1..f4a5ec92f5dc7 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -590,10 +590,14 @@ static int jffs2_rmdir (struct inode *dir_i, struct dentry *dentry) int ret; uint32_t now = get_seconds();
+ mutex_lock(&f->sem); for (fd = f->dents ; fd; fd = fd->next) { - if (fd->ino) + if (fd->ino) { + mutex_unlock(&f->sem); return -ENOTEMPTY; + } } + mutex_unlock(&f->sem);
ret = jffs2_do_unlink(c, dir_f, dentry->d_name.name, dentry->d_name.len, f, now);
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
[ Upstream commit 4daca379c703ff55edc065e8e5173dcfeecf0148 ]
The MSR_TURBO_RATIO_LIMIT can be 0. This is not an error. User can update this MSR via BIOS settings on some systems or can use msr tools to update. Also some systems boot with value = 0.
This results in display of cpufreq/cpuinfo_max_freq wrong. This value will be equal to cpufreq/base_frequency, even though turbo is enabled.
But platform will still function normally in HWP mode as we get max 1-core frequency from the MSR_HWP_CAPABILITIES. This MSR is already used to calculate cpu->pstate.turbo_freq, which is used for to set policy->cpuinfo.max_freq. But some other places cpu->pstate.turbo_pstate is used. For example to set policy->max.
To fix this, also update cpu->pstate.turbo_pstate when updating cpu->pstate.turbo_freq.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/cpufreq/intel_pstate.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 1aa0b05c8cbdf..5c41dc9aaa46d 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -1378,6 +1378,7 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
intel_pstate_get_hwp_max(cpu->cpu, &phy_max, ¤t_max); cpu->pstate.turbo_freq = phy_max * cpu->pstate.scaling; + cpu->pstate.turbo_pstate = phy_max; } else { cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * cpu->pstate.scaling; }
From: Javed Hasan jhasan@marvell.com
[ Upstream commit ec007ef40abb6a164d148b0dc19789a7a2de2cc8 ]
In fc_disc_gpn_id_resp(), skb is supposed to get freed in all cases except for PTR_ERR. However, in some cases it didn't.
This fix is to call fc_frame_free(fp) before function returns.
Link: https://lore.kernel.org/r/20200729081824.30996-2-jhasan@marvell.com Reviewed-by: Girish Basrur gbasrur@marvell.com Reviewed-by: Santosh Vernekar svernekar@marvell.com Reviewed-by: Saurav Kashyap skashyap@marvell.com Reviewed-by: Shyam Sundar ssundar@marvell.com Signed-off-by: Javed Hasan jhasan@marvell.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/scsi/libfc/fc_disc.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/libfc/fc_disc.c b/drivers/scsi/libfc/fc_disc.c index 28b50ab2fbb01..62f83cc151b22 100644 --- a/drivers/scsi/libfc/fc_disc.c +++ b/drivers/scsi/libfc/fc_disc.c @@ -605,8 +605,12 @@ static void fc_disc_gpn_id_resp(struct fc_seq *sp, struct fc_frame *fp,
if (PTR_ERR(fp) == -FC_EX_CLOSED) goto out; - if (IS_ERR(fp)) - goto redisc; + if (IS_ERR(fp)) { + mutex_lock(&disc->disc_mutex); + fc_disc_restart(disc); + mutex_unlock(&disc->disc_mutex); + goto out; + }
cp = fc_frame_payload_get(fp, sizeof(*cp)); if (!cp) @@ -633,7 +637,7 @@ static void fc_disc_gpn_id_resp(struct fc_seq *sp, struct fc_frame *fp, new_rdata->disc_id = disc->disc_id; fc_rport_login(new_rdata); } - goto out; + goto free_fp; } rdata->disc_id = disc->disc_id; mutex_unlock(&rdata->rp_mutex); @@ -650,6 +654,8 @@ redisc: fc_disc_restart(disc); mutex_unlock(&disc->disc_mutex); } +free_fp: + fc_frame_free(fp); out: kref_put(&rdata->kref, fc_rport_destroy); if (!IS_ERR(fp))
From: Mao Wenan wenan.mao@linux.alibaba.com
[ Upstream commit 481a0d7422db26fb63e2d64f0652667a5c6d0f3e ]
The loop may exist if vq->broken is true, virtqueue_get_buf_ctx_packed or virtqueue_get_buf_ctx_split will return NULL, so virtnet_poll will reschedule napi to receive packet, it will lead cpu usage(si) to 100%.
call trace as below: virtnet_poll virtnet_receive virtqueue_get_buf_ctx virtqueue_get_buf_ctx_packed virtqueue_get_buf_ctx_split virtqueue_napi_complete virtqueue_poll //return true virtqueue_napi_schedule //it will reschedule napi
to fix this, return false if vq is broken in virtqueue_poll.
Signed-off-by: Mao Wenan wenan.mao@linux.alibaba.com Acked-by: Michael S. Tsirkin mst@redhat.com Link: https://lore.kernel.org/r/1596354249-96204-1-git-send-email-wenan.mao@linux.... Signed-off-by: Michael S. Tsirkin mst@redhat.com Acked-by: Jason Wang jasowang@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/virtio/virtio_ring.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index b82bb0b081615..51278f8bd3ab3 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -829,6 +829,9 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx) { struct vring_virtqueue *vq = to_vvq(_vq);
+ if (unlikely(vq->broken)) + return false; + virtio_mb(vq->weak_barriers); return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx); }
From: Eiichi Tsukata devel@etsukata.com
[ Upstream commit 96cf2a2c75567ff56195fe3126d497a2e7e4379f ]
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN shows the following warning:
UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23 member access within null pointer of type 'struct xfs_kobj' Call Trace: dump_stack+0x10e/0x195 ubsan_type_mismatch_common+0x241/0x280 __ubsan_handle_type_mismatch_v1+0x32/0x40 init_xfs_fs+0x12b/0x28f do_one_initcall+0xdd/0x1d0 do_initcall_level+0x151/0x1b6 do_initcalls+0x50/0x8f do_basic_setup+0x29/0x2b kernel_init_freeable+0x19f/0x20b kernel_init+0x11/0x1e0 ret_from_fork+0x22/0x30
Fix it by checking parent_kobj before the code accesses its member.
Signed-off-by: Eiichi Tsukata devel@etsukata.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com [darrick: minor whitespace edits] Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/xfs/xfs_sysfs.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_sysfs.h b/fs/xfs/xfs_sysfs.h index d04637181ef21..980c9429abec5 100644 --- a/fs/xfs/xfs_sysfs.h +++ b/fs/xfs/xfs_sysfs.h @@ -44,9 +44,11 @@ xfs_sysfs_init( struct xfs_kobj *parent_kobj, const char *name) { + struct kobject *parent; + + parent = parent_kobj ? &parent_kobj->kobject : NULL; init_completion(&kobj->complete); - return kobject_init_and_add(&kobj->kobject, ktype, - &parent_kobj->kobject, "%s", name); + return kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name); }
static inline void
From: Luc Van Oostenryck luc.vanoostenryck@gmail.com
[ Upstream commit bd72866b8da499e60633ff28f8a4f6e09ca78efe ]
These accessors must be used to read/write a big-endian bus. The value returned or written is native-endian.
However, these accessors are defined using be{16,32}_to_cpu() or cpu_to_be{16,32}() to make the endian conversion but these expect a __be{16,32} when none is present. Keeping them would need a force cast that would solve nothing at all.
So, do the conversion using swab{16,32}, like done in asm-generic for similar situations.
Reported-by: kernel test robot lkp@intel.com Signed-off-by: Luc Van Oostenryck luc.vanoostenryck@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Richard Henderson rth@twiddle.net Cc: Ivan Kokshaysky ink@jurassic.park.msu.ru Cc: Matt Turner mattst88@gmail.com Cc: Stephen Boyd sboyd@kernel.org Cc: Arnd Bergmann arnd@arndb.de Link: http://lkml.kernel.org/r/20200622114232.80039-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/alpha/include/asm/io.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index d123ff90f7a83..9995bed6e92e2 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -493,10 +493,10 @@ extern inline void writeq(u64 b, volatile void __iomem *addr) } #endif
-#define ioread16be(p) be16_to_cpu(ioread16(p)) -#define ioread32be(p) be32_to_cpu(ioread32(p)) -#define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p)) -#define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p)) +#define ioread16be(p) swab16(ioread16(p)) +#define ioread32be(p) swab32(ioread32(p)) +#define iowrite16be(v,p) iowrite16(swab16(v), (p)) +#define iowrite32be(v,p) iowrite32(swab32(v), (p))
#define inb_p inb #define inw_p inw
From: Eric Sandeen sandeen@redhat.com
[ Upstream commit 5872331b3d91820e14716632ebb56b1399b34fe1 ]
If for any reason a directory passed to do_split() does not have enough active entries to exceed half the size of the block, we can end up iterating over all "count" entries without finding a split point.
In this case, count == move, and split will be zero, and we will attempt a negative index into map[].
Guard against this by detecting this case, and falling back to split-to-half-of-count instead; in this case we will still have plenty of space (> half blocksize) in each split block.
Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks") Signed-off-by: Eric Sandeen sandeen@redhat.com Reviewed-by: Andreas Dilger adilger@dilger.ca Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ext4/namei.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index ed17edb31e22f..3f999053457b6 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1741,7 +1741,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, blocksize, hinfo, map); map -= count; dx_sort_map(map, count); - /* Split the existing block in the middle, size-wise */ + /* Ensure that neither split block is over half full */ size = 0; move = 0; for (i = count-1; i >= 0; i--) { @@ -1751,8 +1751,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, size += map[i].size; move++; } - /* map index at which we will split */ - split = count - move; + /* + * map index at which we will split + * + * If the sum of active entries didn't exceed half the block size, just + * split it in half by count; each resulting block will have at least + * half the space free. + */ + if (i > 0) + split = count - move; + else + split = count/2; + hash2 = map[split].hash; continued = hash2 == map[split - 1].hash; dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
From: Przemyslaw Patynowski przemyslawx.patynowski@intel.com
[ Upstream commit 4bd5e02a2ed1575c2f65bd3c557a077dd399f0e8 ]
Trusted VF with unicast promiscuous mode set, could listen to TX traffic of other VFs. Set unicast promiscuous mode to RX traffic, if VSI has port VLAN configured. Rename misleading I40E_AQC_SET_VSI_PROMISC_TX bit to I40E_AQC_SET_VSI_PROMISC_RX_ONLY. Aligned unicast promiscuous with VLAN to the one without VLAN.
Fixes: 6c41a7606967 ("i40e: Add promiscuous on VLAN support") Fixes: 3b1200891b7f ("i40e: When in promisc mode apply promisc mode to Tx Traffic as well") Signed-off-by: Przemyslaw Patynowski przemyslawx.patynowski@intel.com Signed-off-by: Aleksandr Loktionov aleksandr.loktionov@intel.com Signed-off-by: Arkadiusz Kubalewski arkadiusz.kubalewski@intel.com Tested-by: Andrew Bowers andrewx.bowers@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- .../net/ethernet/intel/i40e/i40e_adminq_cmd.h | 2 +- drivers/net/ethernet/intel/i40e/i40e_common.c | 35 ++++++++++++++----- 2 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h index 5d5f422cbae55..f82da2b47d9a5 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h @@ -1175,7 +1175,7 @@ struct i40e_aqc_set_vsi_promiscuous_modes { #define I40E_AQC_SET_VSI_PROMISC_BROADCAST 0x04 #define I40E_AQC_SET_VSI_DEFAULT 0x08 #define I40E_AQC_SET_VSI_PROMISC_VLAN 0x10 -#define I40E_AQC_SET_VSI_PROMISC_TX 0x8000 +#define I40E_AQC_SET_VSI_PROMISC_RX_ONLY 0x8000 __le16 seid; #define I40E_AQC_VSI_PROM_CMD_SEID_MASK 0x3FF __le16 vlan_tag; diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c index 111426ba5fbce..3fd2dfaf2bd53 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -1914,6 +1914,21 @@ i40e_status i40e_aq_set_phy_debug(struct i40e_hw *hw, u8 cmd_flags, return status; }
+/** + * i40e_is_aq_api_ver_ge + * @aq: pointer to AdminQ info containing HW API version to compare + * @maj: API major value + * @min: API minor value + * + * Assert whether current HW API version is greater/equal than provided. + **/ +static bool i40e_is_aq_api_ver_ge(struct i40e_adminq_info *aq, u16 maj, + u16 min) +{ + return (aq->api_maj_ver > maj || + (aq->api_maj_ver == maj && aq->api_min_ver >= min)); +} + /** * i40e_aq_add_vsi * @hw: pointer to the hw struct @@ -2039,18 +2054,16 @@ i40e_status i40e_aq_set_vsi_unicast_promiscuous(struct i40e_hw *hw,
if (set) { flags |= I40E_AQC_SET_VSI_PROMISC_UNICAST; - if (rx_only_promisc && - (((hw->aq.api_maj_ver == 1) && (hw->aq.api_min_ver >= 5)) || - (hw->aq.api_maj_ver > 1))) - flags |= I40E_AQC_SET_VSI_PROMISC_TX; + if (rx_only_promisc && i40e_is_aq_api_ver_ge(&hw->aq, 1, 5)) + flags |= I40E_AQC_SET_VSI_PROMISC_RX_ONLY; }
cmd->promiscuous_flags = cpu_to_le16(flags);
cmd->valid_flags = cpu_to_le16(I40E_AQC_SET_VSI_PROMISC_UNICAST); - if (((hw->aq.api_maj_ver >= 1) && (hw->aq.api_min_ver >= 5)) || - (hw->aq.api_maj_ver > 1)) - cmd->valid_flags |= cpu_to_le16(I40E_AQC_SET_VSI_PROMISC_TX); + if (i40e_is_aq_api_ver_ge(&hw->aq, 1, 5)) + cmd->valid_flags |= + cpu_to_le16(I40E_AQC_SET_VSI_PROMISC_RX_ONLY);
cmd->seid = cpu_to_le16(seid); status = i40e_asq_send_command(hw, &desc, NULL, 0, cmd_details); @@ -2147,11 +2160,17 @@ enum i40e_status_code i40e_aq_set_vsi_uc_promisc_on_vlan(struct i40e_hw *hw, i40e_fill_default_direct_cmd_desc(&desc, i40e_aqc_opc_set_vsi_promiscuous_modes);
- if (enable) + if (enable) { flags |= I40E_AQC_SET_VSI_PROMISC_UNICAST; + if (i40e_is_aq_api_ver_ge(&hw->aq, 1, 5)) + flags |= I40E_AQC_SET_VSI_PROMISC_RX_ONLY; + }
cmd->promiscuous_flags = cpu_to_le16(flags); cmd->valid_flags = cpu_to_le16(I40E_AQC_SET_VSI_PROMISC_UNICAST); + if (i40e_is_aq_api_ver_ge(&hw->aq, 1, 5)) + cmd->valid_flags |= + cpu_to_le16(I40E_AQC_SET_VSI_PROMISC_RX_ONLY); cmd->seid = cpu_to_le16(seid); cmd->vlan_tag = cpu_to_le16(vid | I40E_AQC_SET_VSI_VLAN_VALID);
From: Grzegorz Szczurek grzegorzx.szczurek@intel.com
[ Upstream commit 5b6d4a7f20b09c47ca598760f6dafd554af8b6d5 ]
Fix the reason of crashing system by add waiting time to finish reset recovery process before starting remove driver procedure. Now VSI is releasing if VSI is not in reset recovery mode. Without this fix it was possible to start remove driver if other processing command need reset recovery procedure which resulted in null pointer dereference. VSI used by the ethtool process has been cleared by remove driver process.
[ 6731.508665] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 6731.508668] #PF: supervisor read access in kernel mode [ 6731.508670] #PF: error_code(0x0000) - not-present page [ 6731.508671] PGD 0 P4D 0 [ 6731.508674] Oops: 0000 [#1] SMP PTI [ 6731.508679] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0021.032120170601 03/21/2017 [ 6731.508694] RIP: 0010:i40e_down+0x252/0x310 [i40e] [ 6731.508696] Code: c7 78 de fa c0 e8 61 02 3a c1 66 83 bb f6 0c 00 00 00 0f 84 bf 00 00 00 45 31 e4 45 31 ff eb 03 41 89 c7 48 8b 83 98 0c 00 00 <4a> 8b 3c 20 e8 a5 79 02 00 48 83 bb d0 0c 00 00 00 74 10 48 8b 83 [ 6731.508698] RSP: 0018:ffffb75ac7b3faf0 EFLAGS: 00010246 [ 6731.508700] RAX: 0000000000000000 RBX: ffff9c9874bd5000 RCX: 0000000000000007 [ 6731.508701] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9c987f4d9780 [ 6731.508703] RBP: ffffb75ac7b3fb30 R08: 0000000000005b60 R09: 0000000000000004 [ 6731.508704] R10: ffffb75ac64fbd90 R11: 0000000000000001 R12: 0000000000000000 [ 6731.508706] R13: ffff9c97a08e0000 R14: ffff9c97a08e0a68 R15: 0000000000000000 [ 6731.508708] FS: 00007f2617cd2740(0000) GS:ffff9c987f4c0000(0000) knlGS:0000000000000000 [ 6731.508710] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6731.508711] CR2: 0000000000000000 CR3: 0000001e765c4006 CR4: 00000000003606e0 [ 6731.508713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6731.508714] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6731.508715] Call Trace: [ 6731.508734] i40e_vsi_close+0x84/0x90 [i40e] [ 6731.508742] i40e_quiesce_vsi.part.98+0x3c/0x40 [i40e] [ 6731.508749] i40e_pf_quiesce_all_vsi+0x55/0x60 [i40e] [ 6731.508757] i40e_prep_for_reset+0x59/0x130 [i40e] [ 6731.508765] i40e_reconfig_rss_queues+0x5a/0x120 [i40e] [ 6731.508774] i40e_set_channels+0xda/0x170 [i40e] [ 6731.508778] ethtool_set_channels+0xe9/0x150 [ 6731.508781] dev_ethtool+0x1b94/0x2920 [ 6731.508805] dev_ioctl+0xc2/0x590 [ 6731.508811] sock_do_ioctl+0xae/0x150 [ 6731.508813] sock_ioctl+0x34f/0x3c0 [ 6731.508821] ksys_ioctl+0x98/0xb0 [ 6731.508828] __x64_sys_ioctl+0x1a/0x20 [ 6731.508831] do_syscall_64+0x57/0x1c0 [ 6731.508835] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 4b8164467b85 ("i40e: Add common function for finding VSI by type") Signed-off-by: Grzegorz Szczurek grzegorzx.szczurek@intel.com Signed-off-by: Arkadiusz Kubalewski arkadiusz.kubalewski@intel.com Tested-by: Aaron Brown aaron.f.brown@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index aa2b446d6ad0f..f4475cbf8ce86 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -11822,6 +11822,9 @@ static void i40e_remove(struct pci_dev *pdev) i40e_write_rx_ctl(hw, I40E_PFQF_HENA(0), 0); i40e_write_rx_ctl(hw, I40E_PFQF_HENA(1), 0);
+ while (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state)) + usleep_range(1000, 2000); + /* no more scheduling of any task */ set_bit(__I40E_SUSPENDED, pf->state); set_bit(__I40E_DOWN, pf->state);
From: Fugang Duan fugang.duan@nxp.com
[ Upstream commit c6165cf0dbb82ded90163dce3ac183fc7a913dc4 ]
Correct the error path for regulator disable.
Fixes: 9269e5560b26 ("net: fec: add phy-reset-gpios PROBE_DEFER check") Signed-off-by: Fugang Duan fugang.duan@nxp.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/freescale/fec_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 8ba915cc4c2e4..22f964ef859e5 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -3536,11 +3536,11 @@ fec_probe(struct platform_device *pdev) failed_irq: failed_init: fec_ptp_stop(pdev); - if (fep->reg_phy) - regulator_disable(fep->reg_phy); failed_reset: pm_runtime_put_noidle(&pdev->dev); pm_runtime_disable(&pdev->dev); + if (fep->reg_phy) + regulator_disable(fep->reg_phy); failed_regulator: clk_disable_unprepare(fep->clk_ahb); failed_clk_ahb:
From: Jarod Wilson jarod@redhat.com
[ Upstream commit 4ca0d9ac3fd8f9f90b72a15d8da2aca3ffb58418 ]
Broadcast mode bonds transmit a copy of all traffic simultaneously out of all interfaces, so the "speed" of the bond isn't really the aggregate of all interfaces, but rather, the speed of the slowest active interface.
Also, the type of the speed field is u32, not unsigned long, so adjust that accordingly, as required to make min() function here without complaining about mismatching types.
Fixes: bb5b052f751b ("bond: add support to read speed and duplex via ethtool") CC: Jay Vosburgh j.vosburgh@gmail.com CC: Veaceslav Falico vfalico@gmail.com CC: Andy Gospodarek andy@greyhouse.net CC: "David S. Miller" davem@davemloft.net CC: netdev@vger.kernel.org Acked-by: Jay Vosburgh jay.vosburgh@canonical.com Signed-off-by: Jarod Wilson jarod@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/bonding/bond_main.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 1f867e275408e..9ddbafdca3b05 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -4156,13 +4156,23 @@ static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev) return ret; }
+static u32 bond_mode_bcast_speed(struct slave *slave, u32 speed) +{ + if (speed == 0 || speed == SPEED_UNKNOWN) + speed = slave->speed; + else + speed = min(speed, slave->speed); + + return speed; +} + static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, struct ethtool_link_ksettings *cmd) { struct bonding *bond = netdev_priv(bond_dev); - unsigned long speed = 0; struct list_head *iter; struct slave *slave; + u32 speed = 0;
cmd->base.duplex = DUPLEX_UNKNOWN; cmd->base.port = PORT_OTHER; @@ -4174,8 +4184,13 @@ static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, */ bond_for_each_slave(bond, slave, iter) { if (bond_slave_can_tx(slave)) { - if (slave->speed != SPEED_UNKNOWN) - speed += slave->speed; + if (slave->speed != SPEED_UNKNOWN) { + if (BOND_MODE(bond) == BOND_MODE_BROADCAST) + speed = bond_mode_bcast_speed(slave, + speed); + else + speed += slave->speed; + } if (cmd->base.duplex == DUPLEX_UNKNOWN && slave->duplex != DUPLEX_UNKNOWN) cmd->base.duplex = slave->duplex;
From: Cong Wang xiyou.wangcong@gmail.com
[ Upstream commit 832707021666411d04795c564a4adea5d6b94f17 ]
When we tear down a network namespace, we unregister all the netdevices within it. So we may queue a slave device and a bonding device together in the same unregister queue.
If the only slave device is non-ethernet, it would automatically unregister the bonding device as well. Thus, we may end up unregistering the bonding device twice.
Workaround this special case by checking reg_state.
Fixes: 9b5e383c11b0 ("net: Introduce unregister_netdevice_many()") Reported-by: syzbot+af23e7f3e0a7e10c8b67@syzkaller.appspotmail.com Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Andy Gospodarek andy@greyhouse.net Cc: Jay Vosburgh j.vosburgh@gmail.com Signed-off-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/bonding/bond_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 9ddbafdca3b05..a6d8d3b3c903d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2010,7 +2010,8 @@ static int bond_release_and_destroy(struct net_device *bond_dev, int ret;
ret = __bond_release_one(bond_dev, slave_dev, false, true); - if (ret == 0 && !bond_has_slaves(bond)) { + if (ret == 0 && !bond_has_slaves(bond) && + bond_dev->reg_state != NETREG_UNREGISTERING) { bond_dev->priv_flags |= IFF_DISABLE_NETPOLL; netdev_info(bond_dev, "Destroying bond %s\n", bond_dev->name);
From: Srinivas Kandagatla srinivas.kandagatla@linaro.org
[ Upstream commit ff69c97ef84c9f7795adb49e9f07c9adcdd0c288 ]
For some reason interrupt set and clear register offsets are not set correctly. This patch corrects them!
Fixes: 585e881e5b9e ("ASoC: codecs: Add msm8916-wcd analog codec") Signed-off-by: Srinivas Kandagatla srinivas.kandagatla@linaro.org Tested-by: Stephan Gerhold stephan@gerhold.net Reviewed-by: Stephan Gerhold stephan@gerhold.net Link: https://lore.kernel.org/r/20200811103452.20448-1-srinivas.kandagatla@linaro.... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/codecs/msm8916-wcd-analog.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sound/soc/codecs/msm8916-wcd-analog.c b/sound/soc/codecs/msm8916-wcd-analog.c index 3633eb30dd135..4f949ad50d6a7 100644 --- a/sound/soc/codecs/msm8916-wcd-analog.c +++ b/sound/soc/codecs/msm8916-wcd-analog.c @@ -16,8 +16,8 @@
#define CDC_D_REVISION1 (0xf000) #define CDC_D_PERPH_SUBTYPE (0xf005) -#define CDC_D_INT_EN_SET (0x015) -#define CDC_D_INT_EN_CLR (0x016) +#define CDC_D_INT_EN_SET (0xf015) +#define CDC_D_INT_EN_CLR (0xf016) #define MBHC_SWITCH_INT BIT(7) #define MBHC_MIC_ELECTRICAL_INS_REM_DET BIT(6) #define MBHC_BUTTON_PRESS_DET BIT(5)
From: Dinghao Liu dinghao.liu@zju.edu.cn
[ Upstream commit 062fa09f44f4fb3776a23184d5d296b0c8872eb9 ]
When power_up_sst() fails, stream needs to be freed just like when try_module_get() fails. However, current code is returning directly and ends up leaking memory.
Fixes: 0121327c1a68b ("ASoC: Intel: mfld-pcm: add control for powering up/down dsp") Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Acked-by: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com Link: https://lore.kernel.org/r/20200813084112.26205-1-dinghao.liu@zju.edu.cn Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/intel/atom/sst-mfld-platform-pcm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/sound/soc/intel/atom/sst-mfld-platform-pcm.c b/sound/soc/intel/atom/sst-mfld-platform-pcm.c index 4558c8b930363..3a645fc425cd4 100644 --- a/sound/soc/intel/atom/sst-mfld-platform-pcm.c +++ b/sound/soc/intel/atom/sst-mfld-platform-pcm.c @@ -339,7 +339,7 @@ static int sst_media_open(struct snd_pcm_substream *substream,
ret_val = power_up_sst(stream); if (ret_val < 0) - return ret_val; + goto out_power_up;
/* Make sure, that the period size is always even */ snd_pcm_hw_constraint_step(substream->runtime, 0, @@ -348,8 +348,9 @@ static int sst_media_open(struct snd_pcm_substream *substream, return snd_pcm_hw_constraint_integer(runtime, SNDRV_PCM_HW_PARAM_PERIODS); out_ops: - kfree(stream); mutex_unlock(&sst_lock); +out_power_up: + kfree(stream); return ret_val; }
From: Alex Williamson alex.williamson@redhat.com
[ Upstream commit aae7a75a821a793ed6b8ad502a5890fb8e8f172d ]
The vfio_iommu_replay() function does not currently unwind on error, yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma structure to indicate IOMMU mapping. The IOMMU mappings are torn down when the domain is destroyed, but the other actions go on to cause trouble later. For example, the iommu->domain_list can be empty if we only have a non-IOMMU backed mdev attached. We don't currently check if the list is empty before getting the first entry in the list, which leads to a bogus domain pointer. If a vfio_dma entry is erroneously marked as iommu_mapped, we'll attempt to use that bogus pointer to retrieve the existing physical page addresses.
This is the scenario that uncovered this issue, attempting to hot-add a vfio-pci device to a container with an existing mdev device and DMA mappings, one of which could not be pinned, causing a failure adding the new group to the existing container and setting the conditions for a subsequent attempt to explode.
To resolve this, we can first check if the domain_list is empty so that we can reject replay of a bogus domain, should we ever encounter this inconsistent state again in the future. The real fix though is to add the necessary unwind support, which means cleaning up the current pinning if an IOMMU mapping fails, then walking back through the r-b tree of DMA entries, reading from the IOMMU which ranges are mapped, and unmapping and unpinning those ranges. To be able to do this, we also defer marking the DMA entry as IOMMU mapped until all entries are processed, in order to allow the unwind to know the disposition of each entry.
Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices") Reported-by: Zhiyi Guo zhguo@redhat.com Tested-by: Zhiyi Guo zhguo@redhat.com Reviewed-by: Cornelia Huck cohuck@redhat.com Signed-off-by: Alex Williamson alex.williamson@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/vfio/vfio_iommu_type1.c | 71 ++++++++++++++++++++++++++++++--- 1 file changed, 66 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 35a3750a6ddd3..f22425501bc16 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1086,13 +1086,16 @@ static int vfio_bus_type(struct device *dev, void *data) static int vfio_iommu_replay(struct vfio_iommu *iommu, struct vfio_domain *domain) { - struct vfio_domain *d; + struct vfio_domain *d = NULL; struct rb_node *n; unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; int ret;
/* Arbitrarily pick the first domain in the list for lookups */ - d = list_first_entry(&iommu->domain_list, struct vfio_domain, next); + if (!list_empty(&iommu->domain_list)) + d = list_first_entry(&iommu->domain_list, + struct vfio_domain, next); + n = rb_first(&iommu->dma_list);
for (; n; n = rb_next(n)) { @@ -1110,6 +1113,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, phys_addr_t p; dma_addr_t i;
+ if (WARN_ON(!d)) { /* mapped w/o a domain?! */ + ret = -EINVAL; + goto unwind; + } + phys = iommu_iova_to_phys(d->domain, iova);
if (WARN_ON(!phys)) { @@ -1139,7 +1147,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; - return ret; + goto unwind; }
phys = pfn << PAGE_SHIFT; @@ -1148,14 +1156,67 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
ret = iommu_map(domain->domain, iova, phys, size, dma->prot | domain->prot); - if (ret) - return ret; + if (ret) { + if (!dma->iommu_mapped) + vfio_unpin_pages_remote(dma, iova, + phys >> PAGE_SHIFT, + size >> PAGE_SHIFT, + true); + goto unwind; + }
iova += size; } + } + + /* All dmas are now mapped, defer to second tree walk for unwind */ + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) { + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node); + dma->iommu_mapped = true; } + return 0; + +unwind: + for (; n; n = rb_prev(n)) { + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node); + dma_addr_t iova; + + if (dma->iommu_mapped) { + iommu_unmap(domain->domain, dma->iova, dma->size); + continue; + } + + iova = dma->iova; + while (iova < dma->iova + dma->size) { + phys_addr_t phys, p; + size_t size; + dma_addr_t i; + + phys = iommu_iova_to_phys(domain->domain, iova); + if (!phys) { + iova += PAGE_SIZE; + continue; + } + + size = PAGE_SIZE; + p = phys + size; + i = iova + size; + while (i < dma->iova + dma->size && + p == iommu_iova_to_phys(domain->domain, i)) { + size += PAGE_SIZE; + p += PAGE_SIZE; + i += PAGE_SIZE; + } + + iommu_unmap(domain->domain, iova, size); + vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, + size >> PAGE_SHIFT, true); + } + } + + return ret; }
/*
From: Jiri Wiesner jwiesner@suse.com
[ Upstream commit 0410d07190961ac526f05085765a8d04d926545b ]
When the ARP monitor is used for link detection, ARP replies are validated for all slaves (arp_validate=3) and fail_over_mac is set to active, two slaves of an active-backup bond may get stuck in a state where both of them are active and pass packets that they receive to the bond. This state makes IPv6 duplicate address detection fail. The state is reached thus: 1. The current active slave goes down because the ARP target is not reachable. 2. The current ARP slave is chosen and made active. 3. A new slave is enslaved. This new slave becomes the current active slave and can reach the ARP target. As a result, the current ARP slave stays active after the enslave action has finished and the log is littered with "PROBE BAD" messages:
bond0: PROBE: c_arp ens10 && cas ens11 BAD
The workaround is to remove the slave with "going back" status from the bond and re-enslave it. This issue was encountered when DPDK PMD interfaces were being enslaved to an active-backup bond.
I would be possible to fix the issue in bond_enslave() or bond_change_active_slave() but the ARP monitor was fixed instead to keep most of the actions changing the current ARP slave in the ARP monitor code. The current ARP slave is set as inactive and backup during the commit phase. A new state, BOND_LINK_FAIL, has been introduced for slaves in the context of the ARP monitor. This allows administrators to see how slaves are rotated for sending ARP requests and attempts are made to find a new active slave.
Fixes: b2220cad583c9 ("bonding: refactor ARP active-backup monitor") Signed-off-by: Jiri Wiesner jwiesner@suse.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/bonding/bond_main.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a6d8d3b3c903d..861d2c0a521a4 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2753,6 +2753,9 @@ static int bond_ab_arp_inspect(struct bonding *bond) if (bond_time_in_interval(bond, last_rx, 1)) { bond_propose_link_state(slave, BOND_LINK_UP); commit++; + } else if (slave->link == BOND_LINK_BACK) { + bond_propose_link_state(slave, BOND_LINK_FAIL); + commit++; } continue; } @@ -2863,6 +2866,19 @@ static void bond_ab_arp_commit(struct bonding *bond)
continue;
+ case BOND_LINK_FAIL: + bond_set_slave_link_state(slave, BOND_LINK_FAIL, + BOND_SLAVE_NOTIFY_NOW); + bond_set_slave_inactive_flags(slave, + BOND_SLAVE_NOTIFY_NOW); + + /* A slave has just been enslaved and has become + * the current active slave. + */ + if (rtnl_dereference(bond->curr_active_slave)) + RCU_INIT_POINTER(bond->current_arp_slave, NULL); + continue; + default: netdev_err(bond->dev, "impossible: new_link %d on slave %s\n", slave->link_new_state, slave->dev->name); @@ -2912,8 +2928,6 @@ static bool bond_ab_arp_probe(struct bonding *bond) return should_notify_rtnl; }
- bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER); - bond_for_each_slave_rcu(bond, slave, iter) { if (!found && !before && bond_slave_is_up(slave)) before = slave;
From: Haiyang Zhang haiyangz@microsoft.com
[ Upstream commit c3d897e01aef8ddc43149e4d661b86f823e3aae7 ]
netvsc_vf_xmit() / dev_queue_xmit() will call VF NIC’s ndo_select_queue or netdev_pick_tx() again. They will use skb_get_rx_queue() to get the queue number, so the “skb->queue_mapping - 1” will be used. This may cause the last queue of VF not been used.
Use skb_record_rx_queue() here, so that the skb_get_rx_queue() called later will get the correct queue number, and VF will be able to use all queues.
Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF") Signed-off-by: Haiyang Zhang haiyangz@microsoft.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/hyperv/netvsc_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 10c3480c2da89..dbc6c9ed1c8f8 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -500,7 +500,7 @@ static int netvsc_vf_xmit(struct net_device *net, struct net_device *vf_netdev, int rc;
skb->dev = vf_netdev; - skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; + skb_record_rx_queue(skb, qdisc_skb_cb(skb)->slave_dev_queue_mapping);
rc = dev_queue_xmit(skb); if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
From: Tom Rix trix@redhat.com
[ Upstream commit 774d977abfd024e6f73484544b9abe5a5cd62de7 ]
clang static analysis reports this problem
b53_common.c:1583:13: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage ent.port &= ~BIT(port); ~~~~~~~~ ^
ent is set by a successful call to b53_arl_read(). Unsuccessful calls are caught by an switch statement handling specific returns. b32_arl_read() calls b53_arl_op_wait() which fails with the unhandled -ETIMEDOUT.
So add -ETIMEDOUT to the switch statement. Because b53_arl_op_wait() already prints out a message, do not add another one.
Fixes: 1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations") Signed-off-by: Tom Rix trix@redhat.com Acked-by: Florian Fainelli f.fainelli@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/b53/b53_common.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 274d369151107..5c3fa0be8844e 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -1160,6 +1160,8 @@ static int b53_arl_op(struct b53_device *dev, int op, int port, return ret;
switch (ret) { + case -ETIMEDOUT: + return ret; case -ENOSPC: dev_dbg(dev->dev, "{%pM,%.4d} no space left in ARL\n", addr, vid);
From: Vasant Hegde hegdevasant@linux.vnet.ibm.com
commit 90a9b102eddf6a3f987d15f4454e26a2532c1c98 upstream.
As per PAPR we have to look for both EPOW sensor value and event modifier to identify the type of event and take appropriate action.
In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":
SYSTEM_SHUTDOWN 3
The system must be shut down. An EPOW-aware OS logs the EPOW error log information, then schedules the system to be shut down to begin after an OS defined delay internal (default is 10 minutes.)
Then in section 10.3.2.2.8 there is table 146 "Platform Event Log Format, Version 6, EPOW Section", which includes the "EPOW Event Modifier":
For EPOW sensor value = 3 0x01 = Normal system shutdown with no additional delay 0x02 = Loss of utility power, system is running on UPS/Battery 0x03 = Loss of system critical functions, system should be shutdown 0x04 = Ambient temperature too high All other values = reserved
We have a user space tool (rtas_errd) on LPAR to monitor for EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown after predefined time. It also starts monitoring for any new EPOW events. If it receives "Power restored" event before predefined time it will cancel the shutdown. Otherwise after predefined time it will shutdown the system.
Commit 79872e35469b ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of the "on UPS/Battery" case, to immediately shutdown the system. This breaks existing setups that rely on the userspace tool to delay shutdown and let the system run on the UPS.
Fixes: 79872e35469b ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Vasant Hegde hegdevasant@linux.vnet.ibm.com [mpe: Massage change log and add PAPR references] Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20200820061844.306460-1-hegdevasant@linux.vnet.ibm... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/powerpc/platforms/pseries/ras.c | 1 - 1 file changed, 1 deletion(-)
--- a/arch/powerpc/platforms/pseries/ras.c +++ b/arch/powerpc/platforms/pseries/ras.c @@ -115,7 +115,6 @@ static void handle_system_shutdown(char case EPOW_SHUTDOWN_ON_UPS: pr_emerg("Loss of system power detected. System is running on" " UPS/battery. Check RTAS error log for details\n"); - orderly_poweroff(true); break;
case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS:
From: Marc Zyngier maz@kernel.org
commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.
When adding a new fd to an epoll, and that this new fd is an epoll fd itself, we recursively scan the fds attached to it to detect cycles, and add non-epool files to a "check list" that gets subsequently parsed.
However, this check list isn't completely safe when deletions can happen concurrently. To sidestep the issue, make sure that a struct file placed on the check list sees its f_count increased, ensuring that a concurrent deletion won't result in the file disapearing from under our feet.
Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/eventpoll.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1900,9 +1900,11 @@ static int ep_loop_check_proc(void *priv * not already there, and calling reverse_path_check() * during ep_insert(). */ - if (list_empty(&epi->ffd.file->f_tfile_llink)) + if (list_empty(&epi->ffd.file->f_tfile_llink)) { + get_file(epi->ffd.file); list_add(&epi->ffd.file->f_tfile_llink, &tfile_check_list); + } } } mutex_unlock(&ep->mtx); @@ -1946,6 +1948,7 @@ static void clear_tfile_check_list(void) file = list_first_entry(&tfile_check_list, struct file, f_tfile_llink); list_del_init(&file->f_tfile_llink); + fput(file); } INIT_LIST_HEAD(&tfile_check_list); } @@ -2100,9 +2103,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in clear_tfile_check_list(); goto error_tgt_fput; } - } else + } else { + get_file(tf.file); list_add(&tf.file->f_tfile_llink, &tfile_check_list); + } mutex_lock_nested(&ep->mtx, 0); if (is_file_epoll(tf.file)) { tep = tf.file->private_data;
From: Al Viro viro@zeniv.linux.org.uk
commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.
Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/eventpoll.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2099,10 +2099,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in mutex_lock(&epmutex); if (is_file_epoll(tf.file)) { error = -ELOOP; - if (ep_loop_check(ep, tf.file) != 0) { - clear_tfile_check_list(); + if (ep_loop_check(ep, tf.file) != 0) goto error_tgt_fput; - } } else { get_file(tf.file); list_add(&tf.file->f_tfile_llink, @@ -2131,8 +2129,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in error = ep_insert(ep, &epds, tf.file, fd, full_check); } else error = -EEXIST; - if (full_check) - clear_tfile_check_list(); break; case EPOLL_CTL_DEL: if (epi) @@ -2155,8 +2151,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in mutex_unlock(&ep->mtx);
error_tgt_fput: - if (full_check) + if (full_check) { + clear_tfile_check_list(); mutex_unlock(&epmutex); + }
fdput(tf); error_fput:
From: Peter Xu peterx@redhat.com
commit 75802ca66354a39ab8e35822747cd08b3384a99a upstream.
This is found by code observation only.
Firstly, the worst case scenario should assume the whole range was covered by pmd sharing. The old algorithm might not work as expected for ranges like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the expected range should be (0, 2g).
Since at it, remove the loop since it should not be required. With that, the new code should be faster too when the invalidating range is huge.
Mike said:
: With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only : adjust to (0, 1g+2m) which is incorrect. : : We should cc stable. The original reason for adjusting the range was to : prevent data corruption (getting wrong page). Since the range is not : always adjusted correctly, the potential for corruption still exists. : : However, I am fairly confident that adjust_range_if_pmd_sharing_possible : is only gong to be called in two cases: : : 1) for a single page : 2) for range == entire vma : : In those cases, the current code should produce the correct results. : : To be safe, let's just cc stable.
Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages") Signed-off-by: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Matthew Wilcox willy@infradead.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Mike Kravetz mike.kravetz@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/hugetlb.c | 24 ++++++++++-------------- 1 file changed, 10 insertions(+), 14 deletions(-)
--- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4575,25 +4575,21 @@ static bool vma_shareable(struct vm_area void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end) { - unsigned long check_addr = *start; + unsigned long a_start, a_end;
if (!(vma->vm_flags & VM_MAYSHARE)) return;
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) { - unsigned long a_start = check_addr & PUD_MASK; - unsigned long a_end = a_start + PUD_SIZE; + /* Extend the range to be PUD aligned for a worst case scenario */ + a_start = ALIGN_DOWN(*start, PUD_SIZE); + a_end = ALIGN(*end, PUD_SIZE);
- /* - * If sharing is possible, adjust start/end if necessary. - */ - if (range_in_vma(vma, a_start, a_end)) { - if (a_start < *start) - *start = a_start; - if (a_end > *end) - *end = a_end; - } - } + /* + * Intersect the range with the vma range, since pmd sharing won't be + * across vma after all + */ + *start = max(vma->vm_start, a_start); + *end = min(vma->vm_end, a_end); }
/*
From: Juergen Gross jgross@suse.com
For support of long running hypercalls xen_maybe_preempt_hcall() is calling cond_resched() in case a hypercall marked as preemptible has been interrupted.
Normally this is no problem, as only hypercalls done via some ioctl()s are marked to be preemptible. In rare cases when during such a preemptible hypercall an interrupt occurs and any softirq action is started from irq_exit(), a further hypercall issued by the softirq handler will be regarded to be preemptible, too. This might lead to rescheduling in spite of the softirq handler potentially having set preempt_disable(), leading to splats like:
BUG: sleeping function called from invalid context at drivers/xen/preempt.c:37 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 20775, name: xl INFO: lockdep is turned off. CPU: 1 PID: 20775 Comm: xl Tainted: G D W 5.4.46-1_prgmr_debug.el7.x86_64 #1 Call Trace: <IRQ> dump_stack+0x8f/0xd0 ___might_sleep.cold.76+0xb2/0x103 xen_maybe_preempt_hcall+0x48/0x70 xen_do_hypervisor_callback+0x37/0x40 RIP: e030:xen_hypercall_xen_version+0xa/0x20 Code: ... RSP: e02b:ffffc900400dcc30 EFLAGS: 00000246 RAX: 000000000004000d RBX: 0000000000000200 RCX: ffffffff8100122a RDX: ffff88812e788000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffff83ee3ad0 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: ffff8881824aa0b0 R13: 0000000865496000 R14: 0000000865496000 R15: ffff88815d040000 ? xen_hypercall_xen_version+0xa/0x20 ? xen_force_evtchn_callback+0x9/0x10 ? check_events+0x12/0x20 ? xen_restore_fl_direct+0x1f/0x20 ? _raw_spin_unlock_irqrestore+0x53/0x60 ? debug_dma_sync_single_for_cpu+0x91/0xc0 ? _raw_spin_unlock_irqrestore+0x53/0x60 ? xen_swiotlb_sync_single_for_cpu+0x3d/0x140 ? mlx4_en_process_rx_cq+0x6b6/0x1110 [mlx4_en] ? mlx4_en_poll_rx_cq+0x64/0x100 [mlx4_en] ? net_rx_action+0x151/0x4a0 ? __do_softirq+0xed/0x55b ? irq_exit+0xea/0x100 ? xen_evtchn_do_upcall+0x2c/0x40 ? xen_do_hypervisor_callback+0x29/0x40 </IRQ> ? xen_hypercall_domctl+0xa/0x20 ? xen_hypercall_domctl+0x8/0x20 ? privcmd_ioctl+0x221/0x990 [xen_privcmd] ? do_vfs_ioctl+0xa5/0x6f0 ? ksys_ioctl+0x60/0x90 ? trace_hardirqs_off_thunk+0x1a/0x20 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x62/0x250 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix that by testing preempt_count() before calling cond_resched().
In kernel 5.8 this can't happen any more due to the entry code rework (more than 100 patches, so not a candidate for backporting).
The issue was introduced in kernel 4.3, so this patch should go into all stable kernels in [4.3 ... 5.7].
Reported-by: Sarah Newman srn@prgmr.com Fixes: 0fa2f5cb2b0ecd8 ("sched/preempt, xen: Use need_resched() instead of should_resched()") Cc: Sarah Newman srn@prgmr.com Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross jgross@suse.com Tested-by: Chris Brannon cmb@prgmr.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/xen/preempt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/xen/preempt.c +++ b/drivers/xen/preempt.c @@ -31,7 +31,7 @@ EXPORT_SYMBOL_GPL(xen_in_preemptible_hca asmlinkage __visible void xen_maybe_preempt_hcall(void) { if (unlikely(__this_cpu_read(xen_in_preemptible_hcall) - && need_resched())) { + && need_resched() && !preempt_count())) { /* * Clear flag as we may be rescheduled on a different * cpu.
From: Stephen Boyd sboyd@kernel.org
commit bdcf1dc253248542537a742ae1e7ccafdd03f2d3 upstream.
We leave a dangling pointer in each clk_core::parents array that has an unregistered clk as a potential parent when that clk_core pointer is freed by clk{_hw}_unregister(). It is impossible for the true parent of a clk to be set with clk_set_parent() once the dangling pointer is left in the cache because we compare parent pointers in clk_fetch_parent_index() instead of checking for a matching clk name or clk_hw pointer.
Before commit ede77858473a ("clk: Remove global clk traversal on fetch parent index"), we would check clk_hw pointers, which has a higher chance of being the same between registration and unregistration, but it can still be allocated and freed by the clk provider. In fact, this has been a long standing problem since commit da0f0b2c3ad2 ("clk: Correct lookup logic in clk_fetch_parent_index()") where we stopped trying to compare clk names and skipped over entries in the cache that weren't NULL.
There are good (performance) reasons to not do the global tree lookup in cases where the cache holds dangling pointers to parents that have been unregistered. Let's take the performance hit on the uncommon registration path instead. Loop through all the clk_core::parents arrays when a clk is unregistered and set the entry to NULL when the parent cache entry and clk being unregistered are the same pointer. This will fix this problem and avoid the overhead for the "normal" case.
Based on a patch by Bjorn Andersson.
Fixes: da0f0b2c3ad2 ("clk: Correct lookup logic in clk_fetch_parent_index()") Reviewed-by: Bjorn Andersson bjorn.andersson@linaro.org Tested-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org Signed-off-by: Stephen Boyd sboyd@kernel.org Link: https://lkml.kernel.org/r/20190828181959.204401-1-sboyd@kernel.org Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/clk/clk.c | 52 +++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 41 insertions(+), 11 deletions(-)
--- a/drivers/clk/clk.c +++ b/drivers/clk/clk.c @@ -39,6 +39,17 @@ static HLIST_HEAD(clk_root_list); static HLIST_HEAD(clk_orphan_list); static LIST_HEAD(clk_notifier_list);
+static struct hlist_head *all_lists[] = { + &clk_root_list, + &clk_orphan_list, + NULL, +}; + +static struct hlist_head *orphan_list[] = { + &clk_orphan_list, + NULL, +}; + /*** private data structures ***/
struct clk_core { @@ -1993,17 +2004,6 @@ static int inited = 0; static DEFINE_MUTEX(clk_debug_lock); static HLIST_HEAD(clk_debug_list);
-static struct hlist_head *all_lists[] = { - &clk_root_list, - &clk_orphan_list, - NULL, -}; - -static struct hlist_head *orphan_list[] = { - &clk_orphan_list, - NULL, -}; - static void clk_summary_show_one(struct seq_file *s, struct clk_core *c, int level) { @@ -2735,6 +2735,34 @@ static const struct clk_ops clk_nodrv_op .set_parent = clk_nodrv_set_parent, };
+static void clk_core_evict_parent_cache_subtree(struct clk_core *root, + struct clk_core *target) +{ + int i; + struct clk_core *child; + + for (i = 0; i < root->num_parents; i++) + if (root->parents[i] == target) + root->parents[i] = NULL; + + hlist_for_each_entry(child, &root->children, child_node) + clk_core_evict_parent_cache_subtree(child, target); +} + +/* Remove this clk from all parent caches */ +static void clk_core_evict_parent_cache(struct clk_core *core) +{ + struct hlist_head **lists; + struct clk_core *root; + + lockdep_assert_held(&prepare_lock); + + for (lists = all_lists; *lists; lists++) + hlist_for_each_entry(root, *lists, child_node) + clk_core_evict_parent_cache_subtree(root, core); + +} + /** * clk_unregister - unregister a currently registered clock * @clk: clock to unregister @@ -2773,6 +2801,8 @@ void clk_unregister(struct clk *clk) clk_core_set_parent(child, NULL); }
+ clk_core_evict_parent_cache(clk->core); + hlist_del_init(&clk->core->child_node);
if (clk->core->prepare_count)
On Mon, 24 Aug 2020 10:31:01 +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.14.195 release. There are 50 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Wed, 26 Aug 2020 08:23:34 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.195-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y and the diffstat can be found below.
thanks,
greg k-h
All tests passing for Tegra ...
Test results for stable-v4.14: 8 builds: 8 pass, 0 fail 16 boots: 16 pass, 0 fail 30 tests: 30 pass, 0 fail
Linux version: 4.14.195-rc1-g509a87150f17 Boards tested: tegra124-jetson-tk1, tegra20-ventana, tegra210-p2371-2180, tegra30-cardhu-a04
Tested-by: Jon Hunter jonathanh@nvidia.com
Jon
linux-stable-mirror@lists.linaro.org