This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.6.13-rc1
Amir Goldstein amir73il@gmail.com fanotify: merge duplicate events on parent and child
Amir Goldstein amir73il@gmail.com fsnotify: replace inode pointer with an object id
Max Kellermann mk@cm4all.com io_uring: don't use 'fd' for openat/openat2/statx
Christoph Hellwig hch@lst.de bdi: add a ->dev_name field to struct backing_dev_info
Christoph Hellwig hch@lst.de bdi: move bdi_dev_name out of line
Yafang Shao laoar.shao@gmail.com mm, memcg: fix error return value of mem_cgroup_css_alloc()
Ivan Delalande colona@arista.com scripts/decodecode: fix trapping instruction formatting
Julia Lawall Julia.Lawall@inria.fr iommu/virtio: Reverse arguments to list_add
Josh Poimboeuf jpoimboe@redhat.com objtool: Fix stack offset tracking for indirect CFAs
Paolo Bonzini pbonzini@redhat.com kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts
Arnd Bergmann arnd@arndb.de netfilter: nf_osf: avoid passing pointer to local var
Guillaume Nault gnault@redhat.com netfilter: nat: never update the UDP checksum when it's 0
Janakarajan Natarajan Janakarajan.Natarajan@amd.com arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
Suravee Suthikulpanit suravee.suthikulpanit@amd.com KVM: x86: Fixes posted interrupt check for IRQs delivery modes
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Fix premature unwind stoppage due to IRET frames
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Fix error path for bad ORC entry type
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Prevent unwinding before ORC initialization
Miroslav Benes mbenes@suse.cz x86/unwind/orc: Don't skip the first frame for inactive tasks
Jann Horn jannh@google.com x86/entry/64: Fix unwind hints in rewind_stack_do_exit()
Josh Poimboeuf jpoimboe@redhat.com x86/entry/64: Fix unwind hints in __switch_to_asm()
Josh Poimboeuf jpoimboe@redhat.com x86/entry/64: Fix unwind hints in kernel exit path
Josh Poimboeuf jpoimboe@redhat.com x86/entry/64: Fix unwind hints in register clearing code
Rick Edgecombe rick.p.edgecombe@intel.com x86/mm/cpa: Flush direct map alias during cpa
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_v_ogm_process
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_store_throughput_override
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_show_throughput_override
George Spelvin lkml@sdf.org batman-adv: fix batadv_nc_random_weight_tq
Tejun Heo tj@kernel.org iocost: protect iocg->abs_vdebt with iocg->waitq.lock
Vincent Chen vincent.chen@sifive.com riscv: set max_pfn to the PFN of the last page
Luis Chamberlain mcgrof@kernel.org coredump: fix crash when umh is disabled
Oscar Carter oscar.carter@gmx.com staging: gasket: Check the return value of gasket_get_bar_index()
Luis Henriques lhenriques@suse.com ceph: demote quotarealm lookup warning to a debug message
Jeff Layton jlayton@kernel.org ceph: fix endianness bug when handling MDS session feature bits
Henry Willard henry.willard@oracle.com mm: limit boost_watermark on small zones
David Hildenbrand david@redhat.com mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
Khazhismel Kumykov khazhy@google.com eventpoll: fix missing wakeup for ovflist in ep_poll_callback
Roman Penyaev rpenyaev@suse.de epoll: atomically remove wait entry on wake up
Oleg Nesterov oleg@redhat.com ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Daniel Kolesa daniel@octaforge.org drm/amd/display: work around fp code being emitted outside of DC_FP_START/END
H. Nikolaus Schaller hns@goldelico.com drm: ingenic-drm: add MODULE_DEVICE_TABLE
Tomas Winkler tomas.winkler@intel.com mei: me: disable mei interface on LBG servers.
Ulf Hansson ulf.hansson@linaro.org amba: Initialize dma_parms for amba devices
Ulf Hansson ulf.hansson@linaro.org driver core: platform: Initialize dma_parms for platform devices
Mark Rutland mark.rutland@arm.com arm64: hugetlb: avoid potential NULL dereference
Marc Zyngier maz@kernel.org KVM: arm64: Fix 32bit PC wrap-around
Marc Zyngier maz@kernel.org KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVER
Sean Christopherson sean.j.christopherson@intel.com KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path
Christian Borntraeger borntraeger@de.ibm.com KVM: s390: Remove false WARN_ON_ONCE for the PQAP instruction
Jason A. Donenfeld Jason@zx2c4.com crypto: arch/lib - limit simd usage to 4k chunks
Jason A. Donenfeld Jason@zx2c4.com crypto: arch/nhpoly1305 - process in explicit 4k chunks
Steven Rostedt (VMware) rostedt@goodmis.org tracing: Add a vmalloc_sync_mappings() for safe measure
Steven Rostedt (VMware) rostedt@goodmis.org tracing: Wait for preempt irq delay thread to finish
Masami Hiramatsu mhiramat@kernel.org tracing/kprobes: Reject new event if loc is NULL
Masami Hiramatsu mhiramat@kernel.org tracing/boottime: Fix kprobe event API usage
Oliver Neukum oneukum@suse.com USB: serial: garmin_gps: add sanity checking for data length
Bryan O'Donoghue bryan.odonoghue@linaro.org usb: chipidea: msm: Ensure proper controller reset using role switch API
Oliver Neukum oneukum@suse.com USB: uas: add quirk for LaCie 2Big Quadra
Jason Gerecke killertofu@gmail.com HID: wacom: Report 2nd-gen Intuos Pro S center button status over BT
Alan Stern stern@rowland.harvard.edu HID: usbhid: Fix race between usbhid_close() and usbhid_stop()
Jason Gerecke killertofu@gmail.com Revert "HID: wacom: generic: read the number of expected touches on a per collection basis"
Jere Leppänen jere.leppanen@nokia.com sctp: Fix bundling of SHUTDOWN with COOKIE-ACK
Jason Gerecke jason.gerecke@wacom.com HID: wacom: Read HID_DG_CONTACTMAX directly for non-generic devices
Jason A. Donenfeld Jason@zx2c4.com wireguard: send/receive: cond_resched() when processing worker ringbuffers
Jason A. Donenfeld Jason@zx2c4.com wireguard: socket: remove errant restriction on looping to self
Dejin Zheng zhengdejin5@gmail.com net: enetc: fix an issue about leak system resources
Toke Høiland-Jørgensen toke@redhat.com wireguard: receive: use tunnel helpers for decapsulating ECN markings
Jason A. Donenfeld Jason@zx2c4.com wireguard: queueing: cleanup ptr_ring in error path of packet_queue_init
Dan Carpenter dan.carpenter@oracle.com net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del()
Dan Carpenter dan.carpenter@oracle.com net: mvpp2: prevent buffer overflow in mvpp22_rss_ctx()
Roi Dayan roid@mellanox.com net/mlx5e: Fix q counters on uplink representors
Moshe Shemesh moshe@mellanox.com net/mlx5: Fix command entry leak in Internal Error State
Moshe Shemesh moshe@mellanox.com net/mlx5: Fix forced completion access non initialized command entry
Erez Shitrit erezsh@mellanox.com net/mlx5: DR, On creation set CQ's arm_db member to right value
Michael Chan michael.chan@broadcom.com bnxt_en: Fix VLAN acceleration handling in bnxt_fix_features().
Michael Chan michael.chan@broadcom.com bnxt_en: Return error when allocating zero size context memory.
Michael Chan michael.chan@broadcom.com bnxt_en: Improve AER slot reset.
Vasundhara Volam vasundhara-v.volam@broadcom.com bnxt_en: Reduce BNXT_MSIX_VEC_MAX value to supported CQs per PF.
Michael Chan michael.chan@broadcom.com bnxt_en: Fix VF anti-spoof filter setup.
Toke Høiland-Jørgensen toke@redhat.com tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040
Tuong Lien tuong.t.lien@dektech.com.au tipc: fix partial topology connection closure
Eric Dumazet edumazet@google.com selftests: net: tcp_mmap: fix SO_RCVLOWAT setting
Eric Dumazet edumazet@google.com selftests: net: tcp_mmap: clear whole tcp_zerocopy_receive struct
Eric Dumazet edumazet@google.com sch_sfq: validate silly quantum values
Eric Dumazet edumazet@google.com sch_choke: avoid potential panic in choke_reset()
Qiushi Wu wu000273@umn.edu nfp: abm: fix a memory leak bug
Matt Jolly Kangie@footclan.ninja net: usb: qmi_wwan: add support for DW5816e
Xiyu Yang xiyuyang19@fudan.edu.cn net/tls: Fix sk_psock refcnt leak when in tls_data_ready()
Xiyu Yang xiyuyang19@fudan.edu.cn net/tls: Fix sk_psock refcnt leak in bpf_exec_tx_verdict()
Anthony Felice tony.felice@timesys.com net: tc35815: Fix phydev supported/advertising mask
Willem de Bruijn willemb@google.com net: stricter validation of untrusted gso packets
Eric Dumazet edumazet@google.com net_sched: sch_skbprio: add message validation to skbprio_change()
Baruch Siach baruch@tkos.co.il net: phy: marvell10g: fix temperature sensor on 2110
Tariq Toukan tariqt@mellanox.com net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc()
Scott Dial scott@scottdial.com net: macsec: preserve ingress frame ordering
Dejin Zheng zhengdejin5@gmail.com net: macb: fix an issue about leak related system resources
Florian Fainelli f.fainelli@gmail.com net: dsa: Do not make user port errors fatal
Florian Fainelli f.fainelli@gmail.com net: dsa: Do not leave DSA master with NULL netdev_ops
Ido Schimmel idosch@mellanox.com net: bridge: vlan: Add a schedule point during VLAN processing
Roman Mashak mrv@mojatatu.com neigh: send protocol value in neighbor create notification
Jiri Pirko jiri@mellanox.com mlxsw: spectrum_acl_tcam: Position vchunk in a vregion list properly
David Ahern dsahern@kernel.org ipv6: Use global sernum for dst validation with nexthop objects
Eric Dumazet edumazet@google.com fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks
Julia Lawall Julia.Lawall@inria.fr dp83640: reverse arguments to list_add_tail
Jakub Kicinski kuba@kernel.org devlink: fix return value after hitting end in region read
Aya Levin ayal@mellanox.com devlink: Fix reporter's recovery condition
Rahul Lakkireddy rahul.lakkireddy@chelsio.com cxgb4: fix EOTID leak when disabling TC-MQPRIO offload
Andy Shevchenko andriy.shevchenko@linux.intel.com net: macb: Fix runtime PM refcounting
Masami Hiramatsu mhiramat@kernel.org tracing/kprobes: Fix a double initialization typo
Sagi Grimberg sagi@grimberg.me nvme: fix possible hang when ns scanning fails during error recovery
Christoph Hellwig hch@lst.de nvme: refactor nvme_identify_ns_descs error handling
Eric Whitney enwlinux@gmail.com ext4: disable dioread_nolock whenever delayed allocation is disabled
Ritesh Harjani riteshh@linux.ibm.com ext4: don't set dioread_nolock by default for blocksize < pagesize
Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com tty: xilinx_uartps: Fix missing id assignment to the console
Nicolas Pitre nico@fluxnic.net vt: fix unicode console freeing with a common interface
Evan Quan evan.quan@amd.com drm/amdgpu: drop redundant cg/pg ungate on runpm enter
Evan Quan evan.quan@amd.com drm/amdgpu: move kfd suspend after ip_suspend_phase1
Matt Jolly Kangie@footclan.ninja USB: serial: qcserial: Add DW5816e support
Mika Westerberg mika.westerberg@linux.intel.com thunderbolt: Check return value of tb_sw_read() in usb4_switch_op()
-------------
Diffstat:
Makefile | 4 +- arch/arm/crypto/chacha-glue.c | 14 ++- arch/arm/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm/crypto/poly1305-glue.c | 15 ++- arch/arm64/crypto/chacha-neon-glue.c | 14 ++- arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/crypto/poly1305-glue.c | 15 ++- arch/arm64/kvm/guest.c | 7 ++ arch/arm64/mm/hugetlbpage.c | 2 + arch/riscv/mm/init.c | 3 +- arch/s390/kvm/priv.c | 4 +- arch/x86/crypto/blake2s-glue.c | 10 +- arch/x86/crypto/chacha_glue.c | 14 ++- arch/x86/crypto/nhpoly1305-avx2-glue.c | 2 +- arch/x86/crypto/nhpoly1305-sse2-glue.c | 2 +- arch/x86/crypto/poly1305_glue.c | 13 ++- arch/x86/entry/calling.h | 40 +++---- arch/x86/entry/entry_64.S | 14 +-- arch/x86/include/asm/kvm_host.h | 4 +- arch/x86/include/asm/unwind.h | 2 +- arch/x86/kernel/unwind_orc.c | 61 ++++++++--- arch/x86/kvm/ioapic.c | 10 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx/vmenter.S | 3 + arch/x86/mm/pat/set_memory.c | 12 ++- block/blk-iocost.c | 117 +++++++++++++-------- drivers/amba/bus.c | 1 + drivers/base/platform.c | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +- .../gpu/drm/amd/display/dc/dcn20/dcn20_resource.c | 31 ++++-- drivers/gpu/drm/ingenic/ingenic-drm.c | 1 + drivers/hid/usbhid/hid-core.c | 37 +++++-- drivers/hid/usbhid/usbhid.h | 1 + drivers/hid/wacom_sys.c | 4 +- drivers/hid/wacom_wac.c | 88 ++++------------ drivers/iommu/virtio-iommu.c | 2 +- drivers/misc/mei/hw-me.c | 8 ++ drivers/misc/mei/hw-me.h | 4 + drivers/misc/mei/pci-me.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 20 ++-- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 - drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 10 +- drivers/net/ethernet/cadence/macb_main.c | 24 ++--- drivers/net/ethernet/chelsio/cxgb4/sge.c | 39 ++++++- .../net/ethernet/freescale/enetc/enetc_pci_mdio.c | 2 +- drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c | 3 + drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 + drivers/net/ethernet/mellanox/mlx4/main.c | 4 +- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 6 +- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 9 +- .../ethernet/mellanox/mlx5/core/steering/dr_send.c | 14 ++- .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c | 12 ++- drivers/net/ethernet/netronome/nfp/abm/main.c | 1 + drivers/net/ethernet/toshiba/tc35815.c | 2 +- drivers/net/macsec.c | 3 +- drivers/net/phy/dp83640.c | 2 +- drivers/net/phy/marvell10g.c | 27 ++++- drivers/net/usb/qmi_wwan.c | 1 + drivers/net/wireguard/queueing.c | 4 +- drivers/net/wireguard/receive.c | 8 +- drivers/net/wireguard/send.c | 4 + drivers/net/wireguard/socket.c | 12 --- drivers/nvme/host/core.c | 28 +++-- drivers/staging/gasket/gasket_core.c | 4 + drivers/thunderbolt/usb4.c | 3 + drivers/tty/serial/xilinx_uartps.c | 1 + drivers/tty/vt/vt.c | 9 +- drivers/usb/chipidea/ci_hdrc_msm.c | 2 +- drivers/usb/serial/garmin_gps.c | 4 +- drivers/usb/serial/qcserial.c | 1 + drivers/usb/storage/unusual_uas.h | 7 ++ fs/ceph/mds_client.c | 8 +- fs/ceph/quota.c | 4 +- fs/coredump.c | 8 ++ fs/eventpoll.c | 61 ++++++----- fs/ext4/ext4_jbd2.h | 3 + fs/ext4/super.c | 13 ++- fs/io_uring.c | 6 -- fs/notify/fanotify/fanotify.c | 9 +- fs/notify/inotify/inotify_fsnotify.c | 4 +- fs/notify/inotify/inotify_user.c | 2 +- include/linux/amba/bus.h | 1 + include/linux/backing-dev-defs.h | 1 + include/linux/backing-dev.h | 9 +- include/linux/fsnotify_backend.h | 7 +- include/linux/platform_device.h | 1 + include/linux/virtio_net.h | 26 ++++- include/net/inet_ecn.h | 57 +++++++++- include/net/ip6_fib.h | 4 + include/net/net_namespace.h | 7 ++ ipc/mqueue.c | 34 ++++-- kernel/trace/preemptirq_delay_test.c | 30 ++++-- kernel/trace/trace.c | 13 +++ kernel/trace/trace_boot.c | 20 ++-- kernel/trace/trace_kprobe.c | 8 +- kernel/umh.c | 5 + mm/backing-dev.c | 13 ++- mm/memcontrol.c | 15 +-- mm/page_alloc.c | 9 ++ net/batman-adv/bat_v_ogm.c | 2 +- net/batman-adv/network-coding.c | 9 +- net/batman-adv/sysfs.c | 3 +- net/bridge/br_netlink.c | 1 + net/core/devlink.c | 12 ++- net/core/neighbour.c | 6 +- net/dsa/dsa2.c | 2 +- net/dsa/master.c | 3 +- net/ipv6/route.c | 25 +++++ net/netfilter/nf_nat_proto.c | 4 +- net/netfilter/nfnetlink_osf.c | 12 ++- net/sched/sch_choke.c | 3 +- net/sched/sch_fq_codel.c | 2 +- net/sched/sch_sfq.c | 9 ++ net/sched/sch_skbprio.c | 3 + net/sctp/sm_statefuns.c | 6 +- net/tipc/topsrv.c | 5 +- net/tls/tls_sw.c | 7 +- scripts/decodecode | 2 +- tools/cgroup/iocost_monitor.py | 7 +- tools/objtool/check.c | 2 +- tools/testing/selftests/net/tcp_mmap.c | 7 +- tools/testing/selftests/wireguard/netns.sh | 54 +++++++++- virt/kvm/arm/hyp/aarch32.c | 8 +- virt/kvm/arm/vgic/vgic-mmio.c | 4 +- 125 files changed, 964 insertions(+), 459 deletions(-)
From: Mika Westerberg mika.westerberg@linux.intel.com
commit c3bf9930921b33edb31909006607e478751a6f5e upstream.
The function misses checking return value of tb_sw_read() before it accesses the value that was read. Fix this by checking the return value first.
Fixes: b04079837b20 ("thunderbolt: Add initial support for USB4") Signed-off-by: Mika Westerberg mika.westerberg@linux.intel.com Reviewed-by: Yehezkel Bernat yehezkelshb@gmail.com Cc: stable stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/thunderbolt/usb4.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/thunderbolt/usb4.c +++ b/drivers/thunderbolt/usb4.c @@ -182,6 +182,9 @@ static int usb4_switch_op(struct tb_swit return ret;
ret = tb_sw_read(sw, &val, TB_CFG_SWITCH, ROUTER_CS_26, 1); + if (ret) + return ret; + if (val & ROUTER_CS_26_ONS) return -EOPNOTSUPP;
From: Matt Jolly Kangie@footclan.ninja
commit 78d6de3cfbd342918d31cf68d0d2eda401338aef upstream.
Add support for Dell Wireless 5816e to drivers/usb/serial/qcserial.c
Signed-off-by: Matt Jolly Kangie@footclan.ninja Cc: stable stable@vger.kernel.org Signed-off-by: Johan Hovold johan@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/serial/qcserial.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/usb/serial/qcserial.c +++ b/drivers/usb/serial/qcserial.c @@ -173,6 +173,7 @@ static const struct usb_device_id id_tab {DEVICE_SWI(0x413c, 0x81b3)}, /* Dell Wireless 5809e Gobi(TM) 4G LTE Mobile Broadband Card (rev3) */ {DEVICE_SWI(0x413c, 0x81b5)}, /* Dell Wireless 5811e QDL */ {DEVICE_SWI(0x413c, 0x81b6)}, /* Dell Wireless 5811e QDL */ + {DEVICE_SWI(0x413c, 0x81cc)}, /* Dell Wireless 5816e */ {DEVICE_SWI(0x413c, 0x81cf)}, /* Dell Wireless 5819 */ {DEVICE_SWI(0x413c, 0x81d0)}, /* Dell Wireless 5819 */ {DEVICE_SWI(0x413c, 0x81d1)}, /* Dell Wireless 5818 */
From: Evan Quan evan.quan@amd.com
[ Upstream commit c457a273e118bb96e1db8d1825f313e6cafe4258 ]
This sequence change should be safe as what did in ip_suspend_phase1 is to suspend DCE only. And this is a prerequisite for coming redundant cg/pg ungate dropping.
Fixes: 487eca11a321ef ("drm/amdgpu: fix gfx hang during suspend with video playback (v2)") Signed-off-by: Evan Quan evan.quan@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index f184cdca938de..1ddf2460cf834 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3328,12 +3328,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon) amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
- amdgpu_amdkfd_suspend(adev); - amdgpu_ras_suspend(adev);
r = amdgpu_device_ip_suspend_phase1(adev);
+ amdgpu_amdkfd_suspend(adev); + /* evict vram memory */ amdgpu_bo_evict_vram(adev);
From: Evan Quan evan.quan@amd.com
[ Upstream commit f7b52890daba570bc8162d43c96b5583bbdd4edd ]
CG/PG ungate is already performed in ip_suspend_phase1. Otherwise, the CG/PG ungate will be performed twice. That will cause gfxoff disablement is performed twice also on runpm enter while gfxoff enablemnt once on rump exit. That will put gfxoff into disabled state.
Fixes: b2a7e9735ab286 ("drm/amdgpu: fix the hw hang during perform system reboot and reset") Signed-off-by: Evan Quan evan.quan@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1ddf2460cf834..5fcbacddb9b0e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3325,9 +3325,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon) } }
- amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); - amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); - amdgpu_ras_suspend(adev);
r = amdgpu_device_ip_suspend_phase1(adev);
From: Nicolas Pitre nico@fluxnic.net
[ Upstream commit 57d38f26d81e4275748b69372f31df545dcd9b71 ]
By directly using kfree() in different places we risk missing one if it is switched to using vfree(), especially if the corresponding vmalloc() is hidden away within a common abstraction.
Oh wait, that's exactly what happened here.
So let's fix this by creating a common abstraction for the free case as well.
Signed-off-by: Nicolas Pitre nico@fluxnic.net Reported-by: syzbot+0bfda3ade1ee9288a1be@syzkaller.appspotmail.com Fixes: 9a98e7a80f95 ("vt: don't use kmalloc() for the unicode screen buffer") Cc: stable@vger.kernel.org Reviewed-by: Sam Ravnborg sam@ravnborg.org Link: https://lore.kernel.org/r/nycvar.YSQ.7.76.2005021043110.2671@knanqh.ubzr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/tty/vt/vt.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c index cc1a041913654..699d8b56cbe75 100644 --- a/drivers/tty/vt/vt.c +++ b/drivers/tty/vt/vt.c @@ -365,9 +365,14 @@ static struct uni_screen *vc_uniscr_alloc(unsigned int cols, unsigned int rows) return uniscr; }
+static void vc_uniscr_free(struct uni_screen *uniscr) +{ + vfree(uniscr); +} + static void vc_uniscr_set(struct vc_data *vc, struct uni_screen *new_uniscr) { - vfree(vc->vc_uni_screen); + vc_uniscr_free(vc->vc_uni_screen); vc->vc_uni_screen = new_uniscr; }
@@ -1230,7 +1235,7 @@ static int vc_do_resize(struct tty_struct *tty, struct vc_data *vc, err = resize_screen(vc, new_cols, new_rows, user); if (err) { kfree(newscreen); - kfree(new_uniscr); + vc_uniscr_free(new_uniscr); return err; }
From: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com
[ Upstream commit 2ae11c46d5fdc46cb396e35911c713d271056d35 ]
When serial console has been assigned to ttyPS1 (which is serial1 alias) console index is not updated property and pointing to index -1 (statically initialized) which ends up in situation where nothing has been printed on the port.
The commit 18cc7ac8a28e ("Revert "serial: uartps: Register own uart console and driver structures"") didn't contain this line which was removed by accident.
Fixes: 18cc7ac8a28e ("Revert "serial: uartps: Register own uart console and driver structures"") Signed-off-by: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com Cc: stable stable@vger.kernel.org Signed-off-by: Michal Simek michal.simek@xilinx.com Link: https://lore.kernel.org/r/ed3111533ef5bd342ee5ec504812240b870f0853.158860244... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/tty/serial/xilinx_uartps.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/tty/serial/xilinx_uartps.c b/drivers/tty/serial/xilinx_uartps.c index 7a9b360b04386..1d8b6993a4357 100644 --- a/drivers/tty/serial/xilinx_uartps.c +++ b/drivers/tty/serial/xilinx_uartps.c @@ -1471,6 +1471,7 @@ static int cdns_uart_probe(struct platform_device *pdev) cdns_uart_uart_driver.nr = CDNS_UART_NR_PORTS; #ifdef CONFIG_SERIAL_XILINX_PS_UART_CONSOLE cdns_uart_uart_driver.cons = &cdns_uart_console; + cdns_uart_console.index = id; #endif
rc = uart_register_driver(&cdns_uart_uart_driver);
From: Ritesh Harjani riteshh@linux.ibm.com
commit 626b035b816b61a7a7b4d2205a6807e2f11a18c1 upstream.
Currently on calling echo 3 > drop_caches on host machine, we see FS corruption in the guest. This happens on Power machine where blocksize < pagesize.
So as a temporary workaound don't enable dioread_nolock by default for blocksize < pagesize until we identify the root cause.
Also emit a warning msg in case if this mount option is manually enabled for blocksize < pagesize.
Reported-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Signed-off-by: Ritesh Harjani riteshh@linux.ibm.com Link: https://lore.kernel.org/r/20200327200744.12473-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ext4/super.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 446158ab507dd..70796de7c4682 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2181,6 +2181,14 @@ static int parse_options(char *options, struct super_block *sb, } } #endif + if (test_opt(sb, DIOREAD_NOLOCK)) { + int blocksize = + BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); + if (blocksize < PAGE_SIZE) + ext4_msg(sb, KERN_WARNING, "Warning: mounting with an " + "experimental mount option 'dioread_nolock' " + "for blocksize < PAGE_SIZE"); + } return 1; }
@@ -3787,7 +3795,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) set_opt(sb, NO_UID32); /* xattr user namespace & acls are now defaulted on */ set_opt(sb, XATTR_USER); - set_opt(sb, DIOREAD_NOLOCK); #ifdef CONFIG_EXT4_FS_POSIX_ACL set_opt(sb, POSIX_ACL); #endif @@ -3837,6 +3844,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size); + + if (blocksize == PAGE_SIZE) + set_opt(sb, DIOREAD_NOLOCK); + if (blocksize < EXT4_MIN_BLOCK_SIZE || blocksize > EXT4_MAX_BLOCK_SIZE) { ext4_msg(sb, KERN_ERR,
From: Eric Whitney enwlinux@gmail.com
[ Upstream commit c8980e1980ccdc2229aa2218d532ddc62e0aabe5 ]
The patch "ext4: make dioread_nolock the default" (244adf6426ee) causes generic/422 to fail when run in kvm-xfstests' ext3conv test case. This applies both the dioread_nolock and nodelalloc mount options, a combination not previously tested by kvm-xfstests. The failure occurs because the dioread_nolock code path splits a previously fallocated multiblock extent into a series of single block extents when overwriting a portion of that extent. That causes allocation of an extent tree leaf node and a reshuffling of extents. Once writeback is completed, the individual extents are recombined into a single extent, the extent is moved again, and the leaf node is deleted. The difference in block utilization before and after writeback due to the leaf node triggers the failure.
The original reason for this behavior was to avoid ENOSPC when handling I/O completions during writeback in the dioread_nolock code paths when delayed allocation is disabled. It may no longer be necessary, because code was added in the past to reserve extra space to solve this problem when delayed allocation is enabled, and this code may also apply when delayed allocation is disabled. Until this can be verified, don't use the dioread_nolock code paths if delayed allocation is disabled.
Signed-off-by: Eric Whitney enwlinux@gmail.com Link: https://lore.kernel.org/r/20200319150028.24592-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ext4/ext4_jbd2.h | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index 7ea4f6fa173b4..4b9002f0e84c0 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -512,6 +512,9 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) return 0; if (ext4_should_journal_data(inode)) return 0; + /* temporary fix to prevent generic/422 test failures */ + if (!test_opt(inode->i_sb, DELALLOC)) + return 0; return 1; }
From: Christoph Hellwig hch@lst.de
[ Upstream commit fb314eb0cbb2e11540d1ae1a7b28346397f621ef ]
Move the handling of an error into the function from the caller, and only do it for an actual error on the admin command itself, not the command parsing, as that should be enough to deal with devices claiming a bogus version compliance.
Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Keith Busch kbusch@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/core.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index fb4c35a430650..545e9e5f1b737 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1075,8 +1075,17 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
status = nvme_submit_sync_cmd(ctrl->admin_q, &c, data, NVME_IDENTIFY_DATA_SIZE); - if (status) + if (status) { + dev_warn(ctrl->device, + "Identify Descriptors failed (%d)\n", status); + /* + * Don't treat an error as fatal, as we potentially already + * have a NGUID or EUI-64. + */ + if (status > 0) + status = 0; goto free_data; + }
for (pos = 0; pos < NVME_IDENTIFY_DATA_SIZE; pos += len) { struct nvme_ns_id_desc *cur = data + pos; @@ -1734,26 +1743,15 @@ static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns) static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, struct nvme_id_ns *id, struct nvme_ns_ids *ids) { - int ret = 0; - memset(ids, 0, sizeof(*ids));
if (ctrl->vs >= NVME_VS(1, 1, 0)) memcpy(ids->eui64, id->eui64, sizeof(id->eui64)); if (ctrl->vs >= NVME_VS(1, 2, 0)) memcpy(ids->nguid, id->nguid, sizeof(id->nguid)); - if (ctrl->vs >= NVME_VS(1, 3, 0)) { - /* Don't treat error as fatal we potentially - * already have a NGUID or EUI-64 - */ - ret = nvme_identify_ns_descs(ctrl, nsid, ids); - if (ret) - dev_warn(ctrl->device, - "Identify Descriptors failed (%d)\n", ret); - if (ret > 0) - ret = 0; - } - return ret; + if (ctrl->vs >= NVME_VS(1, 3, 0)) + return nvme_identify_ns_descs(ctrl, nsid, ids); + return 0; }
static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
From: Sagi Grimberg sagi@grimberg.me
[ Upstream commit 59c7c3caaaf8750df4ec3255082f15eb4e371514 ]
When the controller is reconnecting, the host fails I/O and admin commands as the host cannot reach the controller. ns scanning may revalidate namespaces during that period and it is wrong to remove namespaces due to these failures as we may hang (see 205da2434301).
One command that may fail is nvme_identify_ns_descs. Since we return success due to having ns identify descriptor list optional, we continue to compare ns identifiers in nvme_revalidate_disk, obviously fail and return -ENODEV to nvme_validate_ns, which will remove the namespace.
Exactly what we don't want to happen.
Fixes: 22802bf742c2 ("nvme: Namepace identification descriptor list is optional") Tested-by: Anton Eidelman anton@lightbitslabs.com Signed-off-by: Sagi Grimberg sagi@grimberg.me Reviewed-by: Keith Busch kbusch@kernel.org Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 545e9e5f1b737..84f20369d8467 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1082,7 +1082,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid, * Don't treat an error as fatal, as we potentially already * have a NGUID or EUI-64. */ - if (status > 0) + if (status > 0 && !(status & NVME_SC_DNR)) status = 0; goto free_data; }
From: Masami Hiramatsu mhiramat@kernel.org
[ Upstream commit dcbd21c9fca5e954fd4e3d91884907eb6d47187e ]
Fix a typo that resulted in an unnecessary double initialization to addr.
Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote...
Cc: Tom Zanussi zanussi@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: stable@vger.kernel.org Fixes: c7411a1a126f ("tracing/kprobe: Check whether the non-suffixed symbol is notrace") Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d0568af4a0ef6..0d9300c3b0846 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -453,7 +453,7 @@ static bool __within_notrace_func(unsigned long addr)
static bool within_notrace_func(struct trace_kprobe *tk) { - unsigned long addr = addr = trace_kprobe_address(tk); + unsigned long addr = trace_kprobe_address(tk); char symname[KSYM_NAME_LEN], *p;
if (!__within_notrace_func(addr))
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
[ Upstream commit 0ce205d4660c312cdeb4a81066616dcc6f3799c4 ]
The commit e6a41c23df0d, while trying to fix an issue,
("net: macb: ensure interface is not suspended on at91rm9200")
introduced a refcounting regression, because in error case refcounter must be balanced. Fix it by calling pm_runtime_put_noidle() in error case.
While here, fix the same mistake in other couple of places.
Fixes: e6a41c23df0d ("net: macb: ensure interface is not suspended on at91rm9200") Cc: Alexandre Belloni alexandre.belloni@bootlin.com Cc: Claudiu Beznea claudiu.beznea@microchip.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/cadence/macb_main.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index b3a51935e8e0b..1fc83cd31cf28 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -334,8 +334,10 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum) int status;
status = pm_runtime_get_sync(&bp->pdev->dev); - if (status < 0) + if (status < 0) { + pm_runtime_put_noidle(&bp->pdev->dev); goto mdio_pm_exit; + }
status = macb_mdio_wait_for_idle(bp); if (status < 0) @@ -386,8 +388,10 @@ static int macb_mdio_write(struct mii_bus *bus, int mii_id, int regnum, int status;
status = pm_runtime_get_sync(&bp->pdev->dev); - if (status < 0) + if (status < 0) { + pm_runtime_put_noidle(&bp->pdev->dev); goto mdio_pm_exit; + }
status = macb_mdio_wait_for_idle(bp); if (status < 0) @@ -3803,8 +3807,10 @@ static int at91ether_open(struct net_device *dev) int ret;
ret = pm_runtime_get_sync(&lp->pdev->dev); - if (ret < 0) + if (ret < 0) { + pm_runtime_put_noidle(&lp->pdev->dev); return ret; + }
/* Clear internal statistics */ ctl = macb_readl(lp, NCR);
From: Rahul Lakkireddy rahul.lakkireddy@chelsio.com
[ Upstream commit 69422a7e5d578aab277091f4ebb7c1b387f3e355 ]
Under heavy load, the EOTID termination FLOWC request fails to get enqueued to the end of the Tx ring due to lack of credits. This results in EOTID leak.
When disabling TC-MQPRIO offload, the link is already brought down to cleanup EOTIDs. So, flush any pending enqueued skbs that can't be sent outside the wire, to make room for FLOWC request. Also, move the FLOWC descriptor consumption logic closer to when the FLOWC request is actually posted to hardware.
Fixes: 0e395b3cb1fb ("cxgb4: add FLOWC based QoS offload") Signed-off-by: Rahul Lakkireddy rahul.lakkireddy@chelsio.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/chelsio/cxgb4/sge.c | 39 ++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2202,6 +2202,9 @@ static void ethofld_hard_xmit(struct net if (unlikely(skip_eotx_wr)) { start = (u64 *)wr; eosw_txq->state = next_state; + eosw_txq->cred -= wrlen16; + eosw_txq->ncompl++; + eosw_txq->last_compl = 0; goto write_wr_headers; }
@@ -2360,6 +2363,34 @@ netdev_tx_t t4_start_xmit(struct sk_buff return cxgb4_eth_xmit(skb, dev); }
+static void eosw_txq_flush_pending_skbs(struct sge_eosw_txq *eosw_txq) +{ + int pktcount = eosw_txq->pidx - eosw_txq->last_pidx; + int pidx = eosw_txq->pidx; + struct sk_buff *skb; + + if (!pktcount) + return; + + if (pktcount < 0) + pktcount += eosw_txq->ndesc; + + while (pktcount--) { + pidx--; + if (pidx < 0) + pidx += eosw_txq->ndesc; + + skb = eosw_txq->desc[pidx].skb; + if (skb) { + dev_consume_skb_any(skb); + eosw_txq->desc[pidx].skb = NULL; + eosw_txq->inuse--; + } + } + + eosw_txq->pidx = eosw_txq->last_pidx + 1; +} + /** * cxgb4_ethofld_send_flowc - Send ETHOFLD flowc request to bind eotid to tc. * @dev - netdevice @@ -2435,9 +2466,11 @@ int cxgb4_ethofld_send_flowc(struct net_ FW_FLOWC_MNEM_EOSTATE_CLOSING : FW_FLOWC_MNEM_EOSTATE_ESTABLISHED);
- eosw_txq->cred -= len16; - eosw_txq->ncompl++; - eosw_txq->last_compl = 0; + /* Free up any pending skbs to ensure there's room for + * termination FLOWC. + */ + if (tc == FW_SCHED_CLS_NONE) + eosw_txq_flush_pending_skbs(eosw_txq);
ret = eosw_txq_enqueue(eosw_txq, skb); if (ret) {
From: Aya Levin ayal@mellanox.com
[ Upstream commit bea0c5c942d3b4e9fb6ed45f6a7de74c6b112437 ]
Devlink health core conditions the reporter's recovery with the expiration of the grace period. This is not relevant for the first recovery. Explicitly demand that the grace period will only apply to recoveries other than the first.
Fixes: c8e1da0bf923 ("devlink: Add health report functionality") Signed-off-by: Aya Levin ayal@mellanox.com Reviewed-by: Moshe Shemesh moshe@mellanox.com Reviewed-by: Jiri Pirko jiri@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/devlink.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
--- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -5029,6 +5029,7 @@ int devlink_health_report(struct devlink { enum devlink_health_reporter_state prev_health_state; struct devlink *devlink = reporter->devlink; + unsigned long recover_ts_threshold;
/* write a log message of the current error */ WARN_ON(!msg); @@ -5039,10 +5040,12 @@ int devlink_health_report(struct devlink devlink_recover_notify(reporter, DEVLINK_CMD_HEALTH_REPORTER_RECOVER);
/* abort if the previous error wasn't recovered */ + recover_ts_threshold = reporter->last_recovery_ts + + msecs_to_jiffies(reporter->graceful_period); if (reporter->auto_recover && (prev_health_state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY || - jiffies - reporter->last_recovery_ts < - msecs_to_jiffies(reporter->graceful_period))) { + (reporter->last_recovery_ts && reporter->recovery_count && + time_is_after_jiffies(recover_ts_threshold)))) { trace_devlink_health_recover_aborted(devlink, reporter->ops->name, reporter->health_state,
From: Jakub Kicinski kuba@kernel.org
[ Upstream commit 610a9346c138b9c2c93d38bf5f3728e74ae9cbd5 ]
Commit d5b90e99e1d5 ("devlink: report 0 after hitting end in region read") fixed region dump, but region read still returns a spurious error:
$ devlink region read netdevsim/netdevsim1/dummy snapshot 0 addr 0 len 128 0000000000000000 a6 f4 c4 1c 21 35 95 a6 9d 34 c3 5b 87 5b 35 79 0000000000000010 f3 a0 d7 ee 4f 2f 82 7f c6 dd c4 f6 a5 c3 1b ae 0000000000000020 a4 fd c8 62 07 59 48 03 70 3b c7 09 86 88 7f 68 0000000000000030 6f 45 5d 6d 7d 0e 16 38 a9 d0 7a 4b 1e 1e 2e a6 0000000000000040 e6 1d ae 06 d6 18 00 85 ca 62 e8 7e 11 7e f6 0f 0000000000000050 79 7e f7 0f f3 94 68 bd e6 40 22 85 b6 be 6f b1 0000000000000060 af db ef 5e 34 f0 98 4b 62 9a e3 1b 8b 93 fc 17 devlink answers: Invalid argument 0000000000000070 61 e8 11 11 66 10 a5 f7 b1 ea 8d 40 60 53 ed 12
This is a minimal fix, I'll follow up with a restructuring so we don't have two checks for the same condition.
Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read") Signed-off-by: Jakub Kicinski kuba@kernel.org Reviewed-by: Jacob Keller jacob.e.keller@intel.com Reviewed-by: Jiri Pirko jiri@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/devlink.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -4030,6 +4030,11 @@ static int devlink_nl_cmd_region_read_du end_offset = nla_get_u64(attrs[DEVLINK_ATTR_REGION_CHUNK_ADDR]); end_offset += nla_get_u64(attrs[DEVLINK_ATTR_REGION_CHUNK_LEN]); dump = false; + + if (start_offset == end_offset) { + err = 0; + goto nla_put_failure; + } }
err = devlink_nl_region_read_snapshot_fill(skb, devlink,
From: Julia Lawall Julia.Lawall@inria.fr
[ Upstream commit 865308373ed49c9fb05720d14cbf1315349b32a9 ]
In this code, it appears that phyter_clocks is a list head, based on the previous list_for_each, and that clock->list is intended to be a list element, given that it has just been initialized in dp83640_clock_init. Accordingly, switch the arguments to list_add_tail, which takes the list head as the second argument.
Fixes: cb646e2b02b27 ("ptp: Added a clock driver for the National Semiconductor PHYTER.") Signed-off-by: Julia Lawall Julia.Lawall@inria.fr Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/phy/dp83640.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/phy/dp83640.c +++ b/drivers/net/phy/dp83640.c @@ -1120,7 +1120,7 @@ static struct dp83640_clock *dp83640_clo goto out; } dp83640_clock_init(clock, bus); - list_add_tail(&phyter_clocks, &clock->list); + list_add_tail(&clock->list, &phyter_clocks); out: mutex_unlock(&phyter_clocks_lock);
From: Eric Dumazet edumazet@google.com
[ Upstream commit 14695212d4cd8b0c997f6121b6df8520038ce076 ]
My intent was to not let users set a zero drop_batch_size, it seems I once again messed with min()/max().
Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()") Signed-off-by: Eric Dumazet edumazet@google.com Acked-by: Toke Høiland-Jørgensen toke@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_fq_codel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/sched/sch_fq_codel.c +++ b/net/sched/sch_fq_codel.c @@ -416,7 +416,7 @@ static int fq_codel_change(struct Qdisc q->quantum = max(256U, nla_get_u32(tb[TCA_FQ_CODEL_QUANTUM]));
if (tb[TCA_FQ_CODEL_DROP_BATCH_SIZE]) - q->drop_batch_size = min(1U, nla_get_u32(tb[TCA_FQ_CODEL_DROP_BATCH_SIZE])); + q->drop_batch_size = max(1U, nla_get_u32(tb[TCA_FQ_CODEL_DROP_BATCH_SIZE]));
if (tb[TCA_FQ_CODEL_MEMORY_LIMIT]) q->memory_limit = min(1U << 31, nla_get_u32(tb[TCA_FQ_CODEL_MEMORY_LIMIT]));
From: David Ahern dsahern@kernel.org
[ Upstream commit 8f34e53b60b337e559f1ea19e2780ff95ab2fa65 ]
Nik reported a bug with pcpu dst cache when nexthop objects are used illustrated by the following: $ ip netns add foo $ ip -netns foo li set lo up $ ip -netns foo addr add 2001:db8:11::1/128 dev lo $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1 $ ip li add veth1 type veth peer name veth2 $ ip li set veth1 up $ ip addr add 2001:db8:10::1/64 dev veth1 $ ip li set dev veth2 netns foo $ ip -netns foo li set veth2 up $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2 $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1 $ ip -6 route add 2001:db8:11::1/128 nhid 100
Create a pcpu entry on cpu 0: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
Re-add the route entry: $ ip -6 ro del 2001:db8:11::1 $ ip -6 route add 2001:db8:11::1/128 nhid 100
Route get on cpu 0 returns the stale pcpu: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 RTNETLINK answers: Network is unreachable
While cpu 1 works: $ taskset -a -c 1 ip -6 route get 2001:db8:11::1 2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
Conversion of FIB entries to work with external nexthop objects missed an important difference between IPv4 and IPv6 - how dst entries are invalidated when the FIB changes. IPv4 has a per-network namespace generation id (rt_genid) that is bumped on changes to the FIB. Checking if a dst_entry is still valid means comparing rt_genid in the rtable to the current value of rt_genid for the namespace.
IPv6 also has a per network namespace counter, fib6_sernum, but the count is saved per fib6_node. With the per-node counter only dst_entries based on fib entries under the node are invalidated when changes are made to the routes - limiting the scope of invalidations. IPv6 uses a reference in the rt6_info, 'from', to track the corresponding fib entry used to create the dst_entry. When validating a dst_entry, the 'from' is used to backtrack to the fib6_node and check the sernum of it to the cookie passed to the dst_check operation.
With the inline format (nexthop definition inline with the fib6_info), dst_entries cached in the fib6_nh have a 1:1 correlation between fib entries, nexthop data and dst_entries. With external nexthops, IPv6 looks more like IPv4 which means multiple fib entries across disparate fib6_nodes can all reference the same fib6_nh. That means validation of dst_entries based on external nexthops needs to use the IPv4 format - the per-network namespace counter.
Add sernum to rt6_info and set it when creating a pcpu dst entry. Update rt6_get_cookie to return sernum if it is set and update dst_check for IPv6 to look for sernum set and based the check on it if so. Finally, rt6_get_pcpu_route needs to validate the cached entry before returning a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and __mkroute_output for IPv4).
This problem only affects routes using the new, external nexthops.
Thanks to the kbuild test robot for catching the IS_ENABLED needed around rt_genid_ipv6 before I sent this out.
Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects") Reported-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David Ahern dsahern@kernel.org Reviewed-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Tested-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/ip6_fib.h | 4 ++++ include/net/net_namespace.h | 7 +++++++ net/ipv6/route.c | 25 +++++++++++++++++++++++++ 3 files changed, 36 insertions(+)
--- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -204,6 +204,7 @@ struct fib6_info { struct rt6_info { struct dst_entry dst; struct fib6_info __rcu *from; + int sernum;
struct rt6key rt6i_dst; struct rt6key rt6i_src; @@ -292,6 +293,9 @@ static inline u32 rt6_get_cookie(const s struct fib6_info *from; u32 cookie = 0;
+ if (rt->sernum) + return rt->sernum; + rcu_read_lock();
from = rcu_dereference(rt->from); --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -432,6 +432,13 @@ static inline int rt_genid_ipv4(const st return atomic_read(&net->ipv4.rt_genid); }
+#if IS_ENABLED(CONFIG_IPV6) +static inline int rt_genid_ipv6(const struct net *net) +{ + return atomic_read(&net->ipv6.fib6_sernum); +} +#endif + static inline void rt_genid_bump_ipv4(struct net *net) { atomic_inc(&net->ipv4.rt_genid); --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1388,9 +1388,18 @@ static struct rt6_info *ip6_rt_pcpu_allo } ip6_rt_copy_init(pcpu_rt, res); pcpu_rt->rt6i_flags |= RTF_PCPU; + + if (f6i->nh) + pcpu_rt->sernum = rt_genid_ipv6(dev_net(dev)); + return pcpu_rt; }
+static bool rt6_is_valid(const struct rt6_info *rt6) +{ + return rt6->sernum == rt_genid_ipv6(dev_net(rt6->dst.dev)); +} + /* It should be called with rcu_read_lock() acquired */ static struct rt6_info *rt6_get_pcpu_route(const struct fib6_result *res) { @@ -1398,6 +1407,19 @@ static struct rt6_info *rt6_get_pcpu_rou
pcpu_rt = this_cpu_read(*res->nh->rt6i_pcpu);
+ if (pcpu_rt && pcpu_rt->sernum && !rt6_is_valid(pcpu_rt)) { + struct rt6_info *prev, **p; + + p = this_cpu_ptr(res->nh->rt6i_pcpu); + prev = xchg(p, NULL); + if (prev) { + dst_dev_put(&prev->dst); + dst_release(&prev->dst); + } + + pcpu_rt = NULL; + } + return pcpu_rt; }
@@ -2596,6 +2618,9 @@ static struct dst_entry *ip6_dst_check(s
rt = container_of(dst, struct rt6_info, dst);
+ if (rt->sernum) + return rt6_is_valid(rt) ? dst : NULL; + rcu_read_lock();
/* All IPV6 dsts are created with ->obsolete set to the value
From: Jiri Pirko jiri@mellanox.com
[ Upstream commit 6ef4889fc0b3aa6ab928e7565935ac6f762cee6e ]
Vregion helpers to get min and max priority depend on the correct ordering of vchunks in the vregion list. However, the current code always adds new chunk to the end of the list, no matter what the priority is. Fix this by finding the correct place in the list and put vchunk there.
Fixes: 22a677661f56 ("mlxsw: spectrum: Introduce ACL core with simple TCAM implementation") Signed-off-by: Jiri Pirko jiri@mellanox.com Signed-off-by: Ido Schimmel idosch@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c @@ -986,8 +986,9 @@ mlxsw_sp_acl_tcam_vchunk_create(struct m unsigned int priority, struct mlxsw_afk_element_usage *elusage) { + struct mlxsw_sp_acl_tcam_vchunk *vchunk, *vchunk2; struct mlxsw_sp_acl_tcam_vregion *vregion; - struct mlxsw_sp_acl_tcam_vchunk *vchunk; + struct list_head *pos; int err;
if (priority == MLXSW_SP_ACL_TCAM_CATCHALL_PRIO) @@ -1025,7 +1026,14 @@ mlxsw_sp_acl_tcam_vchunk_create(struct m }
mlxsw_sp_acl_tcam_rehash_ctx_vregion_changed(vregion); - list_add_tail(&vchunk->list, &vregion->vchunk_list); + + /* Position the vchunk inside the list according to priority */ + list_for_each(pos, &vregion->vchunk_list) { + vchunk2 = list_entry(pos, typeof(*vchunk2), list); + if (vchunk2->priority > priority) + break; + } + list_add_tail(&vchunk->list, pos); mutex_unlock(&vregion->lock);
return vchunk;
From: Roman Mashak mrv@mojatatu.com
[ Upstream commit 38212bb31fe923d0a2c6299bd2adfbb84cddef2a ]
When a new neighbor entry has been added, event is generated but it does not include protocol, because its value is assigned after the event notification routine has run, so move protocol assignment code earlier.
Fixes: df9b0e30d44c ("neighbor: Add protocol attribute") Cc: David Ahern dsahern@gmail.com Signed-off-by: Roman Mashak mrv@mojatatu.com Reviewed-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/neighbour.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1954,6 +1954,9 @@ static int neigh_add(struct sk_buff *skb NEIGH_UPDATE_F_OVERRIDE_ISROUTER); }
+ if (protocol) + neigh->protocol = protocol; + if (ndm->ndm_flags & NTF_EXT_LEARNED) flags |= NEIGH_UPDATE_F_EXT_LEARNED;
@@ -1967,9 +1970,6 @@ static int neigh_add(struct sk_buff *skb err = __neigh_update(neigh, lladdr, ndm->ndm_state, flags, NETLINK_CB(skb).portid, extack);
- if (protocol) - neigh->protocol = protocol; - neigh_release(neigh);
out:
From: Ido Schimmel idosch@mellanox.com
[ Upstream commit 7979457b1d3a069cd857f5bd69e070e30223dd0c ]
User space can request to delete a range of VLANs from a bridge slave in one netlink request. For each deleted VLAN the FDB needs to be traversed in order to flush all the affected entries.
If a large range of VLANs is deleted and the number of FDB entries is large or the FDB lock is contented, it is possible for the kernel to loop through the deleted VLANs for a long time. In case preemption is disabled, this can result in a soft lockup.
Fix this by adding a schedule point after each VLAN is deleted to yield the CPU, if needed. This is safe because the VLANs are traversed in process context.
Fixes: bdced7ef7838 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests") Signed-off-by: Ido Schimmel idosch@mellanox.com Reported-by: Stefan Priebe - Profihost AG s.priebe@profihost.ag Tested-by: Stefan Priebe - Profihost AG s.priebe@profihost.ag Acked-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bridge/br_netlink.c | 1 + 1 file changed, 1 insertion(+)
--- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -612,6 +612,7 @@ int br_process_vlan_info(struct net_brid v - 1, rtm_cmd); v_change_start = 0; } + cond_resched(); } /* v_change_start is set only if the last/whole range changed */ if (v_change_start)
From: Florian Fainelli f.fainelli@gmail.com
[ Upstream commit 050569fc8384c8056bacefcc246bcb2dfe574936 ]
When ndo_get_phys_port_name() for the CPU port was added we introduced an early check for when the DSA master network device in dsa_master_ndo_setup() already implements ndo_get_phys_port_name(). When we perform the teardown operation in dsa_master_ndo_teardown() we would not be checking that cpu_dp->orig_ndo_ops was successfully allocated and non-NULL initialized.
With network device drivers such as virtio_net, this leads to a NPD as soon as the DSA switch hanging off of it gets torn down because we are now assigning the virtio_net device's netdev_ops a NULL pointer.
Fixes: da7b9e9b00d4 ("net: dsa: Add ndo_get_phys_port_name() for CPU port") Reported-by: Allen Pais allen.pais@oracle.com Signed-off-by: Florian Fainelli f.fainelli@gmail.com Tested-by: Allen Pais allen.pais@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/dsa/master.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/dsa/master.c +++ b/net/dsa/master.c @@ -289,7 +289,8 @@ static void dsa_master_ndo_teardown(stru { struct dsa_port *cpu_dp = dev->dsa_ptr;
- dev->netdev_ops = cpu_dp->orig_ndo_ops; + if (cpu_dp->orig_ndo_ops) + dev->netdev_ops = cpu_dp->orig_ndo_ops; cpu_dp->orig_ndo_ops = NULL; }
From: Florian Fainelli f.fainelli@gmail.com
[ Upstream commit 86f8b1c01a0a537a73d2996615133be63cdf75db ]
Prior to 1d27732f411d ("net: dsa: setup and teardown ports"), we would not treat failures to set-up an user port as fatal, but after this commit we would, which is a regression for some systems where interfaces may be declared in the Device Tree, but the underlying hardware may not be present (pluggable daughter cards for instance).
Fixes: 1d27732f411d ("net: dsa: setup and teardown ports") Signed-off-by: Florian Fainelli f.fainelli@gmail.com Reviewed-by: Andrew Lunn andrew@lunn.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/dsa/dsa2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -459,7 +459,7 @@ static int dsa_tree_setup_switches(struc list_for_each_entry(dp, &dst->ports, list) { err = dsa_port_setup(dp); if (err) - goto teardown; + continue; }
return 0;
From: Dejin Zheng zhengdejin5@gmail.com
[ Upstream commit b959c77dac09348955f344104c6a921ebe104753 ]
A call of the function macb_init() can fail in the function fu540_c000_init. The related system resources were not released then. use devm_platform_ioremap_resource() to replace ioremap() to fix it.
Fixes: c218ad559020ff9 ("macb: Add support for SiFive FU540-C000") Cc: Andy Shevchenko andy.shevchenko@gmail.com Reviewed-by: Yash Shah yash.shah@sifive.com Suggested-by: Nicolas Ferre nicolas.ferre@microchip.com Suggested-by: Andy Shevchenko andy.shevchenko@gmail.com Signed-off-by: Dejin Zheng zhengdejin5@gmail.com Acked-by: Nicolas Ferre nicolas.ferre@microchip.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/cadence/macb_main.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-)
--- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -4165,15 +4165,9 @@ static int fu540_c000_clk_init(struct pl
static int fu540_c000_init(struct platform_device *pdev) { - struct resource *res; - - res = platform_get_resource(pdev, IORESOURCE_MEM, 1); - if (!res) - return -ENODEV; - - mgmt->reg = ioremap(res->start, resource_size(res)); - if (!mgmt->reg) - return -ENOMEM; + mgmt->reg = devm_platform_ioremap_resource(pdev, 1); + if (IS_ERR(mgmt->reg)) + return PTR_ERR(mgmt->reg);
return macb_init(pdev); }
From: Scott Dial scott@scottdial.com
[ Upstream commit ab046a5d4be4c90a3952a0eae75617b49c0cb01b ]
MACsec decryption always occurs in a softirq context. Since the FPU may not be usable in the softirq context, the call to decrypt may be scheduled on the cryptd work queue. The cryptd work queue does not provide ordering guarantees. Therefore, preserving order requires masking out ASYNC implementations of gcm(aes).
For instance, an Intel CPU with AES-NI makes available the generic-gcm-aesni driver from the aesni_intel module to implement gcm(aes). However, this implementation requires the FPU, so it is not always available to use from a softirq context, and will fallback to the cryptd work queue, which does not preserve frame ordering. With this change, such a system would select gcm_base(ctr(aes-aesni),ghash-generic). While the aes-aesni implementation prefers to use the FPU, it will fallback to the aes-asm implementation if unavailable.
By using a synchronous version of gcm(aes), the decryption will complete before returning from crypto_aead_decrypt(). Therefore, the macsec_decrypt_done() callback will be called before returning from macsec_decrypt(). Thus, the order of calls to macsec_post_decrypt() for the frames is preserved.
While it's presumable that the pure AES-NI version of gcm(aes) is more performant, the hybrid solution is capable of gigabit speeds on modest hardware. Regardless, preserving the order of frames is paramount for many network protocols (e.g., triggering TCP retries). Within the MACsec driver itself, the replay protection is tripped by the out-of-order frames, and can cause frames to be dropped.
This bug has been present in this code since it was added in v4.6, however it may not have been noticed since not all CPUs have FPU offload available. Additionally, the bug manifests as occasional out-of-order packets that are easily misattributed to other network phenomena.
When this code was added in v4.6, the crypto/gcm.c code did not restrict selection of the ghash function based on the ASYNC flag. For instance, x86 CPUs with PCLMULQDQ would select the ghash-clmulni driver instead of ghash-generic, which submits to the cryptd work queue if the FPU is busy. However, this bug was was corrected in v4.8 by commit b30bdfa86431afbafe15284a3ad5ac19b49b88e3, and was backported all the way back to the v3.14 stable branch, so this patch should be applicable back to the v4.6 stable branch.
Signed-off-by: Scott Dial scott@scottdial.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/macsec.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -1226,7 +1226,8 @@ static struct crypto_aead *macsec_alloc_ struct crypto_aead *tfm; int ret;
- tfm = crypto_alloc_aead("gcm(aes)", 0, 0); + /* Pick a sync gcm(aes) cipher to ensure order is preserved. */ + tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm)) return tfm;
From: Tariq Toukan tariqt@mellanox.com
[ Upstream commit 40e473071dbad04316ddc3613c3a3d1c75458299 ]
When ENOSPC is set the idx is still valid and gets set to the global MLX4_SINK_COUNTER_INDEX. However gcc's static analysis cannot tell that ENOSPC is impossible from mlx4_cmd_imm() and gives this warning:
drivers/net/ethernet/mellanox/mlx4/main.c:2552:28: warning: 'idx' may be used uninitialized in this function [-Wmaybe-uninitialized] 2552 | priv->def_counter[port] = idx;
Also, when ENOSPC is returned mlx4_allocate_default_counters should not fail.
Fixes: 6de5f7f6a1fa ("net/mlx4_core: Allocate default counter per port") Signed-off-by: Jason Gunthorpe jgg@mellanox.com Signed-off-by: Tariq Toukan tariqt@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -2550,6 +2550,7 @@ static int mlx4_allocate_default_counter
if (!err || err == -ENOSPC) { priv->def_counter[port] = idx; + err = 0; } else if (err == -ENOENT) { err = 0; continue; @@ -2600,7 +2601,8 @@ int mlx4_counter_alloc(struct mlx4_dev * MLX4_CMD_TIME_CLASS_A, MLX4_CMD_WRAPPED); if (!err) *idx = get_param_l(&out_param); - + if (WARN_ON(err == -ENOSPC)) + err = -EINVAL; return err; } return __mlx4_counter_alloc(dev, idx);
From: Baruch Siach baruch@tkos.co.il
[ Upstream commit c3e302edca2457bbd0c958c445a7538fbf6a6ac8 ]
Read the temperature sensor register from the correct location for the 88E2110 PHY. There is no enable/disable bit on 2110, so make mv3310_hwmon_config() run on 88X3310 only.
Fixes: 62d01535474b61 ("net: phy: marvell10g: add support for the 88x2110 PHY") Cc: Maxime Chevallier maxime.chevallier@bootlin.com Reviewed-by: Andrew Lunn andrew@lunn.ch Signed-off-by: Baruch Siach baruch@tkos.co.il Reviewed-by: Russell King rmk+kernel@armlinux.org.uk Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/phy/marvell10g.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-)
--- a/drivers/net/phy/marvell10g.c +++ b/drivers/net/phy/marvell10g.c @@ -44,6 +44,9 @@ enum { MV_PCS_PAIRSWAP_AB = 0x0002, MV_PCS_PAIRSWAP_NONE = 0x0003,
+ /* Temperature read register (88E2110 only) */ + MV_PCS_TEMP = 0x8042, + /* These registers appear at 0x800X and 0xa00X - the 0xa00X control * registers appear to set themselves to the 0x800X when AN is * restarted, but status registers appear readable from either. @@ -54,6 +57,7 @@ enum { /* Vendor2 MMD registers */ MV_V2_PORT_CTRL = 0xf001, MV_V2_PORT_CTRL_PWRDOWN = 0x0800, + /* Temperature control/read registers (88X3310 only) */ MV_V2_TEMP_CTRL = 0xf08a, MV_V2_TEMP_CTRL_MASK = 0xc000, MV_V2_TEMP_CTRL_SAMPLE = 0x0000, @@ -79,6 +83,24 @@ static umode_t mv3310_hwmon_is_visible(c return 0; }
+static int mv3310_hwmon_read_temp_reg(struct phy_device *phydev) +{ + return phy_read_mmd(phydev, MDIO_MMD_VEND2, MV_V2_TEMP); +} + +static int mv2110_hwmon_read_temp_reg(struct phy_device *phydev) +{ + return phy_read_mmd(phydev, MDIO_MMD_PCS, MV_PCS_TEMP); +} + +static int mv10g_hwmon_read_temp_reg(struct phy_device *phydev) +{ + if (phydev->drv->phy_id == MARVELL_PHY_ID_88X3310) + return mv3310_hwmon_read_temp_reg(phydev); + else /* MARVELL_PHY_ID_88E2110 */ + return mv2110_hwmon_read_temp_reg(phydev); +} + static int mv3310_hwmon_read(struct device *dev, enum hwmon_sensor_types type, u32 attr, int channel, long *value) { @@ -91,7 +113,7 @@ static int mv3310_hwmon_read(struct devi }
if (type == hwmon_temp && attr == hwmon_temp_input) { - temp = phy_read_mmd(phydev, MDIO_MMD_VEND2, MV_V2_TEMP); + temp = mv10g_hwmon_read_temp_reg(phydev); if (temp < 0) return temp;
@@ -144,6 +166,9 @@ static int mv3310_hwmon_config(struct ph u16 val; int ret;
+ if (phydev->drv->phy_id != MARVELL_PHY_ID_88X3310) + return 0; + ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, MV_V2_TEMP, MV_V2_TEMP_UNKNOWN); if (ret < 0)
From: Eric Dumazet edumazet@google.com
[ Upstream commit 2761121af87de45951989a0adada917837d8fa82 ]
Do not assume the attribute has the right size.
Fixes: aea5f654e6b7 ("net/sched: add skbprio scheduler") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_skbprio.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/net/sched/sch_skbprio.c +++ b/net/sched/sch_skbprio.c @@ -169,6 +169,9 @@ static int skbprio_change(struct Qdisc * { struct tc_skbprio_qopt *ctl = nla_data(opt);
+ if (opt->nla_len != nla_attr_size(sizeof(*ctl))) + return -EINVAL; + sch->limit = ctl->limit; return 0; }
From: Willem de Bruijn willemb@google.com
[ Upstream commit 9274124f023b5c56dc4326637d4f787968b03607 ]
Syzkaller again found a path to a kernel crash through bad gso input: a packet with transport header extending beyond skb_headlen(skb).
Tighten validation at kernel entry:
- Verify that the transport header lies within the linear section.
To avoid pulling linux/tcp.h, verify just sizeof tcphdr. tcp_gso_segment will call pskb_may_pull (th->doff * 4) before use.
- Match the gso_type against the ip_proto found by the flow dissector.
Fixes: bfd5f4a3d605 ("packet: Add GSO/csum offload support.") Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/virtio_net.h | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-)
--- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -3,6 +3,8 @@ #define _LINUX_VIRTIO_NET_H
#include <linux/if_vlan.h> +#include <uapi/linux/tcp.h> +#include <uapi/linux/udp.h> #include <uapi/linux/virtio_net.h>
static inline int virtio_net_hdr_set_proto(struct sk_buff *skb, @@ -28,17 +30,25 @@ static inline int virtio_net_hdr_to_skb( bool little_endian) { unsigned int gso_type = 0; + unsigned int thlen = 0; + unsigned int ip_proto;
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { case VIRTIO_NET_HDR_GSO_TCPV4: gso_type = SKB_GSO_TCPV4; + ip_proto = IPPROTO_TCP; + thlen = sizeof(struct tcphdr); break; case VIRTIO_NET_HDR_GSO_TCPV6: gso_type = SKB_GSO_TCPV6; + ip_proto = IPPROTO_TCP; + thlen = sizeof(struct tcphdr); break; case VIRTIO_NET_HDR_GSO_UDP: gso_type = SKB_GSO_UDP; + ip_proto = IPPROTO_UDP; + thlen = sizeof(struct udphdr); break; default: return -EINVAL; @@ -57,16 +67,22 @@ static inline int virtio_net_hdr_to_skb(
if (!skb_partial_csum_set(skb, start, off)) return -EINVAL; + + if (skb_transport_offset(skb) + thlen > skb_headlen(skb)) + return -EINVAL; } else { /* gso packets without NEEDS_CSUM do not set transport_offset. * probe and drop if does not match one of the above types. */ if (gso_type && skb->network_header) { + struct flow_keys_basic keys; + if (!skb->protocol) virtio_net_hdr_set_proto(skb, hdr); retry: - skb_probe_transport_header(skb); - if (!skb_transport_header_was_set(skb)) { + if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, + NULL, 0, 0, 0, + 0)) { /* UFO does not specify ipv4 or 6: try both */ if (gso_type & SKB_GSO_UDP && skb->protocol == htons(ETH_P_IP)) { @@ -75,6 +91,12 @@ retry: } return -EINVAL; } + + if (keys.control.thoff + thlen > skb_headlen(skb) || + keys.basic.ip_proto != ip_proto) + return -EINVAL; + + skb_set_transport_header(skb, keys.control.thoff); } }
From: Anthony Felice tony.felice@timesys.com
[ Upstream commit 4b5b71f770e2edefbfe74203777264bfe6a9927c ]
Commit 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode") updated ethernet drivers to use a linkmode bitmap. It mistakenly dropped a bitwise negation in the tc35815 ethernet driver on a bitmask to set the supported/advertising flags.
Found by Anthony via code inspection, not tested as I do not have the required hardware.
Fixes: 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode") Signed-off-by: Anthony Felice tony.felice@timesys.com Reviewed-by: Akshay Bhat akshay.bhat@timesys.com Reviewed-by: Heiner Kallweit hkallweit1@gmail.com Reviewed-by: Andrew Lunn andrew@lunn.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/toshiba/tc35815.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/toshiba/tc35815.c +++ b/drivers/net/ethernet/toshiba/tc35815.c @@ -643,7 +643,7 @@ static int tc_mii_probe(struct net_devic linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT, mask); linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT, mask); } - linkmode_and(phydev->supported, phydev->supported, mask); + linkmode_andnot(phydev->supported, phydev->supported, mask); linkmode_copy(phydev->advertising, phydev->supported);
lp->link = 0;
From: Xiyu Yang xiyuyang19@fudan.edu.cn
[ Upstream commit 095f5614bfe16e5b3e191b34ea41b10d6fdd4ced ]
bpf_exec_tx_verdict() invokes sk_psock_get(), which returns a reference of the specified sk_psock object to "psock" with increased refcnt.
When bpf_exec_tx_verdict() returns, local variable "psock" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of bpf_exec_tx_verdict(). When "policy" equals to NULL but "psock" is not NULL, the function forgets to decrease the refcnt increased by sk_psock_get(), causing a refcnt leak.
Fix this issue by calling sk_psock_put() on this error path before bpf_exec_tx_verdict() returns.
Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_sw.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -800,6 +800,8 @@ static int bpf_exec_tx_verdict(struct sk *copied -= sk_msg_free(sk, msg); tls_free_open_rec(sk); } + if (psock) + sk_psock_put(sk, psock); return err; } more_data:
From: Xiyu Yang xiyuyang19@fudan.edu.cn
[ Upstream commit 62b4011fa7bef9fa00a6aeec26e69685dc1cc21e ]
tls_data_ready() invokes sk_psock_get(), which returns a reference of the specified sk_psock object to "psock" with increased refcnt.
When tls_data_ready() returns, local variable "psock" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of tls_data_ready(). When "psock->ingress_msg" is empty but "psock" is not NULL, the function forgets to decrease the refcnt increased by sk_psock_get(), causing a refcnt leak.
Fix this issue by calling sk_psock_put() on all paths when "psock" is not NULL.
Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_sw.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -2083,8 +2083,9 @@ static void tls_data_ready(struct sock * strp_data_ready(&ctx->strp);
psock = sk_psock_get(sk); - if (psock && !list_empty(&psock->ingress_msg)) { - ctx->saved_data_ready(sk); + if (psock) { + if (!list_empty(&psock->ingress_msg)) + ctx->saved_data_ready(sk); sk_psock_put(sk, psock); } }
From: Matt Jolly Kangie@footclan.ninja
[ Upstream commit 57c7f2bd758eed867295c81d3527fff4fab1ed74 ]
Add support for Dell Wireless 5816e to drivers/net/usb/qmi_wwan.c
Signed-off-by: Matt Jolly Kangie@footclan.ninja Acked-by: Bjørn Mork bjorn@mork.no Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1359,6 +1359,7 @@ static const struct usb_device_id produc {QMI_FIXED_INTF(0x413c, 0x81b3, 8)}, /* Dell Wireless 5809e Gobi(TM) 4G LTE Mobile Broadband Card (rev3) */ {QMI_FIXED_INTF(0x413c, 0x81b6, 8)}, /* Dell Wireless 5811e */ {QMI_FIXED_INTF(0x413c, 0x81b6, 10)}, /* Dell Wireless 5811e */ + {QMI_FIXED_INTF(0x413c, 0x81cc, 8)}, /* Dell Wireless 5816e */ {QMI_FIXED_INTF(0x413c, 0x81d7, 0)}, /* Dell Wireless 5821e */ {QMI_FIXED_INTF(0x413c, 0x81d7, 1)}, /* Dell Wireless 5821e preproduction config */ {QMI_FIXED_INTF(0x413c, 0x81e0, 0)}, /* Dell Wireless 5821e with eSIM support*/
From: Qiushi Wu wu000273@umn.edu
[ Upstream commit bd4af432cc71b5fbfe4833510359a6ad3ada250d ]
In function nfp_abm_vnic_set_mac, pointer nsp is allocated by nfp_nsp_open. But when nfp_nsp_has_hwinfo_lookup fail, the pointer is not released, which can lead to a memory leak bug. Fix this issue by adding nfp_nsp_close(nsp) in the error path.
Fixes: f6e71efdf9fb1 ("nfp: abm: look up MAC addresses via management FW") Signed-off-by: Qiushi Wu wu000273@umn.edu Acked-by: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/netronome/nfp/abm/main.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/net/ethernet/netronome/nfp/abm/main.c +++ b/drivers/net/ethernet/netronome/nfp/abm/main.c @@ -283,6 +283,7 @@ nfp_abm_vnic_set_mac(struct nfp_pf *pf, if (!nfp_nsp_has_hwinfo_lookup(nsp)) { nfp_warn(pf->cpp, "NSP doesn't support PF MAC generation\n"); eth_hw_addr_random(nn->dp.netdev); + nfp_nsp_close(nsp); return; }
From: Eric Dumazet edumazet@google.com
[ Upstream commit 8738c85c72b3108c9b9a369a39868ba5f8e10ae0 ]
If choke_init() could not allocate q->tab, we would crash later in choke_reset().
BUG: KASAN: null-ptr-deref in memset include/linux/string.h:366 [inline] BUG: KASAN: null-ptr-deref in choke_reset+0x208/0x340 net/sched/sch_choke.c:326 Write of size 8 at addr 0000000000000000 by task syz-executor822/7022
CPU: 1 PID: 7022 Comm: syz-executor822 Not tainted 5.7.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 __kasan_report.cold+0x5/0x4d mm/kasan/report.c:515 kasan_report+0x33/0x50 mm/kasan/common.c:625 check_memory_region_inline mm/kasan/generic.c:187 [inline] check_memory_region+0x141/0x190 mm/kasan/generic.c:193 memset+0x20/0x40 mm/kasan/common.c:85 memset include/linux/string.h:366 [inline] choke_reset+0x208/0x340 net/sched/sch_choke.c:326 qdisc_reset+0x6b/0x520 net/sched/sch_generic.c:910 dev_deactivate_queue.constprop.0+0x13c/0x240 net/sched/sch_generic.c:1138 netdev_for_each_tx_queue include/linux/netdevice.h:2197 [inline] dev_deactivate_many+0xe2/0xba0 net/sched/sch_generic.c:1195 dev_deactivate+0xf8/0x1c0 net/sched/sch_generic.c:1233 qdisc_graft+0xd25/0x1120 net/sched/sch_api.c:1051 tc_modify_qdisc+0xbab/0x1a00 net/sched/sch_api.c:1670 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5454 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
Fixes: 77e62da6e60c ("sch_choke: drop all packets in queue during reset") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Cc: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_choke.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/sched/sch_choke.c +++ b/net/sched/sch_choke.c @@ -323,7 +323,8 @@ static void choke_reset(struct Qdisc *sc
sch->q.qlen = 0; sch->qstats.backlog = 0; - memset(q->tab, 0, (q->tab_mask + 1) * sizeof(struct sk_buff *)); + if (q->tab) + memset(q->tab, 0, (q->tab_mask + 1) * sizeof(struct sk_buff *)); q->head = q->tail = 0; red_restart(&q->vars); }
From: Eric Dumazet edumazet@google.com
[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]
syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue()
More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range.
Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000.
Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_sfq.c | 9 +++++++++ 1 file changed, 9 insertions(+)
--- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
From: Eric Dumazet edumazet@google.com
[ Upstream commit bf5525f3a8e3248be5aa5defe5aaadd60e1c1ba1 ]
We added fields in tcp_zerocopy_receive structure, so make sure to clear all fields to not pass garbage to the kernel.
We were lucky because recent additions added 'out' parameters, still we need to clean our reference implementation, before folks copy/paste it.
Fixes: c8856c051454 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.") Fixes: 33946518d493 ("tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Arjun Roy arjunroy@google.com Cc: Soheil Hassas Yeganeh soheil@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- tools/testing/selftests/net/tcp_mmap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/net/tcp_mmap.c +++ b/tools/testing/selftests/net/tcp_mmap.c @@ -165,9 +165,10 @@ void *child_thread(void *arg) socklen_t zc_len = sizeof(zc); int res;
+ memset(&zc, 0, sizeof(zc)); zc.address = (__u64)((unsigned long)addr); zc.length = chunk_size; - zc.recv_skip_hint = 0; + res = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len); if (res == -1)
From: Eric Dumazet edumazet@google.com
[ Upstream commit a84724178bd7081cf3bd5b558616dd6a9a4ca63b ]
Since chunk_size is no longer an integer, we can not use it directly as an argument of setsockopt().
This patch should fix tcp_mmap for Big Endian kernels.
Fixes: 597b01edafac ("selftests: net: avoid ptl lock contention in tcp_mmap") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Soheil Hassas Yeganeh soheil@google.com Cc: Arjun Roy arjunroy@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- tools/testing/selftests/net/tcp_mmap.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/net/tcp_mmap.c +++ b/tools/testing/selftests/net/tcp_mmap.c @@ -282,12 +282,14 @@ static void setup_sockaddr(int domain, c static void do_accept(int fdlisten) { pthread_attr_t attr; + int rcvlowat;
pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+ rcvlowat = chunk_size; if (setsockopt(fdlisten, SOL_SOCKET, SO_RCVLOWAT, - &chunk_size, sizeof(chunk_size)) == -1) { + &rcvlowat, sizeof(rcvlowat)) == -1) { perror("setsockopt SO_RCVLOWAT"); }
From: Tuong Lien tuong.t.lien@dektech.com.au
[ Upstream commit 980d69276f3048af43a045be2925dacfb898a7be ]
When an application connects to the TIPC topology server and subscribes to some services, a new connection is created along with some objects - 'tipc_subscription' to store related data correspondingly... However, there is one omission in the connection handling that when the connection or application is orderly shutdown (e.g. via SIGQUIT, etc.), the connection is not closed in kernel, the 'tipc_subscription' objects are not freed too. This results in: - The maximum number of subscriptions (65535) will be reached soon, new subscriptions will be rejected; - TIPC module cannot be removed (unless the objects are somehow forced to release first);
The commit fixes the issue by closing the connection if the 'recvmsg()' returns '0' i.e. when the peer is shutdown gracefully. It also includes the other unexpected cases.
Acked-by: Jon Maloy jmaloy@redhat.com Acked-by: Ying Xue ying.xue@windriver.com Signed-off-by: Tuong Lien tuong.t.lien@dektech.com.au Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tipc/topsrv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -402,10 +402,11 @@ static int tipc_conn_rcv_from_sock(struc read_lock_bh(&sk->sk_callback_lock); ret = tipc_conn_rcv_sub(srv, con, &s); read_unlock_bh(&sk->sk_callback_lock); + if (!ret) + return 0; } - if (ret < 0) - tipc_conn_close(con);
+ tipc_conn_close(con); return ret; }
From: "Toke H�iland-J�rgensen" toke@redhat.com
[ Upstream commit b723748750ece7d844cdf2f52c01d37f83387208 ]
RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header to the inner header if that inner header is already marked as ECT(0). When RFC 6040 decapsulation was implemented, this case of propagation was not added. This simply appears to be an oversight, so let's fix that.
Fixes: eccc1bb8d4b4 ("tunnel: drop packet if ECN present with not-ECT") Reported-by: Bob Briscoe ietf@bobbriscoe.net Reported-by: Olivier Tilmans olivier.tilmans@nokia-bell-labs.com Cc: Dave Taht dave.taht@gmail.com Cc: Stephen Hemminger stephen@networkplumber.org Signed-off-by: Toke Høiland-Jørgensen toke@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/inet_ecn.h | 57 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-)
--- a/include/net/inet_ecn.h +++ b/include/net/inet_ecn.h @@ -99,6 +99,20 @@ static inline int IP_ECN_set_ce(struct i return 1; }
+static inline int IP_ECN_set_ect1(struct iphdr *iph) +{ + u32 check = (__force u32)iph->check; + + if ((iph->tos & INET_ECN_MASK) != INET_ECN_ECT_0) + return 0; + + check += (__force u16)htons(0x100); + + iph->check = (__force __sum16)(check + (check>=0xFFFF)); + iph->tos ^= INET_ECN_MASK; + return 1; +} + static inline void IP_ECN_clear(struct iphdr *iph) { iph->tos &= ~INET_ECN_MASK; @@ -134,6 +148,22 @@ static inline int IP6_ECN_set_ce(struct return 1; }
+static inline int IP6_ECN_set_ect1(struct sk_buff *skb, struct ipv6hdr *iph) +{ + __be32 from, to; + + if ((ipv6_get_dsfield(iph) & INET_ECN_MASK) != INET_ECN_ECT_0) + return 0; + + from = *(__be32 *)iph; + to = from ^ htonl(INET_ECN_MASK << 20); + *(__be32 *)iph = to; + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_add(csum_sub(skb->csum, (__force __wsum)from), + (__force __wsum)to); + return 1; +} + static inline void ipv6_copy_dscp(unsigned int dscp, struct ipv6hdr *inner) { dscp &= ~INET_ECN_MASK; @@ -159,6 +189,25 @@ static inline int INET_ECN_set_ce(struct return 0; }
+static inline int INET_ECN_set_ect1(struct sk_buff *skb) +{ + switch (skb->protocol) { + case cpu_to_be16(ETH_P_IP): + if (skb_network_header(skb) + sizeof(struct iphdr) <= + skb_tail_pointer(skb)) + return IP_ECN_set_ect1(ip_hdr(skb)); + break; + + case cpu_to_be16(ETH_P_IPV6): + if (skb_network_header(skb) + sizeof(struct ipv6hdr) <= + skb_tail_pointer(skb)) + return IP6_ECN_set_ect1(skb, ipv6_hdr(skb)); + break; + } + + return 0; +} + /* * RFC 6040 4.2 * To decapsulate the inner header at the tunnel egress, a compliant @@ -208,8 +257,12 @@ static inline int INET_ECN_decapsulate(s int rc;
rc = __INET_ECN_decapsulate(outer, inner, &set_ce); - if (!rc && set_ce) - INET_ECN_set_ce(skb); + if (!rc) { + if (set_ce) + INET_ECN_set_ce(skb); + else if ((outer & INET_ECN_MASK) == INET_ECN_ECT_1) + INET_ECN_set_ect1(skb); + }
return rc; }
From: Michael Chan michael.chan@broadcom.com
[ Upstream commit c71c4e49afe173823a2a85b0cabc9b3f1176ffa2 ]
Fix the logic that sets the enable/disable flag for the source MAC filter according to firmware spec 1.7.1.
In the original firmware spec. before 1.7.1, the VF spoof check flags were not latched after making the HWRM_FUNC_CFG call, so there was a need to keep the func_flags so that subsequent calls would perserve the VF spoof check setting. A change was made in the 1.7.1 spec so that the flags became latched. So we now set or clear the anti- spoof setting directly without retrieving the old settings in the stored vf->func_flags which are no longer valid. We also remove the unneeded vf->func_flags.
Fixes: 8eb992e876a8 ("bnxt_en: Update firmware interface spec to 1.7.6.2.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 - drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 10 ++-------- 2 files changed, 2 insertions(+), 9 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -1064,7 +1064,6 @@ struct bnxt_vf_info { #define BNXT_VF_LINK_FORCED 0x4 #define BNXT_VF_LINK_UP 0x8 #define BNXT_VF_TRUST 0x10 - u32 func_flags; /* func cfg flags */ u32 min_tx_rate; u32 max_tx_rate; void *hwrm_cmd_req_addr; --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c @@ -85,11 +85,10 @@ int bnxt_set_vf_spoofchk(struct net_devi if (old_setting == setting) return 0;
- func_flags = vf->func_flags; if (setting) - func_flags |= FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_ENABLE; + func_flags = FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_ENABLE; else - func_flags |= FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_DISABLE; + func_flags = FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_DISABLE; /*TODO: if the driver supports VLAN filter on guest VLAN, * the spoof check should also include vlan anti-spoofing */ @@ -98,7 +97,6 @@ int bnxt_set_vf_spoofchk(struct net_devi req.flags = cpu_to_le32(func_flags); rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); if (!rc) { - vf->func_flags = func_flags; if (setting) vf->flags |= BNXT_VF_SPOOFCHK; else @@ -230,7 +228,6 @@ int bnxt_set_vf_mac(struct net_device *d memcpy(vf->mac_addr, mac, ETH_ALEN); bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_MAC_ADDR); memcpy(req.dflt_mac_addr, mac, ETH_ALEN); return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); @@ -268,7 +265,6 @@ int bnxt_set_vf_vlan(struct net_device *
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.dflt_vlan = cpu_to_le16(vlan_tag); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_VLAN); rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); @@ -307,7 +303,6 @@ int bnxt_set_vf_bw(struct net_device *de return 0; bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_MAX_BW); req.max_bw = cpu_to_le32(max_tx_rate); req.enables |= cpu_to_le32(FUNC_CFG_REQ_ENABLES_MIN_BW); @@ -479,7 +474,6 @@ static void __bnxt_set_vf_params(struct vf = &bp->pf.vf[vf_id]; bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags);
if (is_valid_ether_addr(vf->mac_addr)) { req.enables |= cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_MAC_ADDR);
From: Vasundhara Volam vasundhara-v.volam@broadcom.com
[ Upstream commit 9e68cb0359b20f99c7b070f1d3305e5e0a9fae6d ]
Broadcom adapters support only maximum of 512 CQs per PF. If user sets MSIx vectors more than supported CQs, firmware is setting incorrect value for msix_vec_per_pf_max parameter. Fix it by reducing the BNXT_MSIX_VEC_MAX value to 512, even though the maximum # of MSIx vectors supported by adapter are 1280.
Fixes: f399e8497826 ("bnxt_en: Use msix_vec_per_pf_max and msix_vec_per_pf_min devlink params.") Signed-off-by: Vasundhara Volam vasundhara-v.volam@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h @@ -43,7 +43,7 @@ static inline void bnxt_link_bp_to_dl(st #define BNXT_NVM_CFG_VER_BITS 24 #define BNXT_NVM_CFG_VER_BYTES 4
-#define BNXT_MSIX_VEC_MAX 1280 +#define BNXT_MSIX_VEC_MAX 512 #define BNXT_MSIX_VEC_MIN_MAX 128
enum bnxt_nvm_dir_type {
From: Michael Chan michael.chan@broadcom.com
[ Upstream commit bae361c54fb6ac6eba3b4762f49ce14beb73ef13 ]
Improve the slot reset sequence by disabling the device to prevent bad DMAs if slot reset fails. Return the proper result instead of always PCI_ERS_RESULT_RECOVERED to the caller.
Fixes: 6316ea6db93d ("bnxt_en: Enable AER support.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -12173,12 +12173,15 @@ static pci_ers_result_t bnxt_io_slot_res bnxt_ulp_start(bp, err); }
- if (result != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) - dev_close(netdev); + if (result != PCI_ERS_RESULT_RECOVERED) { + if (netif_running(netdev)) + dev_close(netdev); + pci_disable_device(pdev); + }
rtnl_unlock();
- return PCI_ERS_RESULT_RECOVERED; + return result; }
/**
From: Michael Chan michael.chan@broadcom.com
[ Upstream commit bbf211b1ecb891c7e0cc7888834504183fc8b534 ]
bnxt_alloc_ctx_pg_tbls() should return error when the memory size of the context memory to set up is zero. By returning success (0), the caller may proceed normally and may crash later when it tries to set up the memory.
Fixes: 08fe9d181606 ("bnxt_en: Add Level 2 context memory paging support.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -6662,7 +6662,7 @@ static int bnxt_alloc_ctx_pg_tbls(struct int rc;
if (!mem_size) - return 0; + return -EINVAL;
ctx_pg->nr_pages = DIV_ROUND_UP(mem_size, BNXT_PAGE_SIZE); if (ctx_pg->nr_pages > MAX_CTX_TOTAL_PAGES) {
From: Michael Chan michael.chan@broadcom.com
[ Upstream commit c72cb303aa6c2ae7e4184f0081c6d11bf03fb96b ]
The current logic in bnxt_fix_features() will inadvertently turn on both CTAG and STAG VLAN offload if the user tries to disable both. Fix it by checking that the user is trying to enable CTAG or STAG before enabling both. The logic is supposed to enable or disable both CTAG and STAG together.
Fixes: 5a9f6b238e59 ("bnxt_en: Enable and disable RX CTAG and RX STAG VLAN acceleration together.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -9794,6 +9794,7 @@ static netdev_features_t bnxt_fix_featur netdev_features_t features) { struct bnxt *bp = netdev_priv(dev); + netdev_features_t vlan_features;
if ((features & NETIF_F_NTUPLE) && !bnxt_rfs_capable(bp)) features &= ~NETIF_F_NTUPLE; @@ -9810,12 +9811,14 @@ static netdev_features_t bnxt_fix_featur /* Both CTAG and STAG VLAN accelaration on the RX side have to be * turned on or off together. */ - if ((features & (NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX)) != - (NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX)) { + vlan_features = features & (NETIF_F_HW_VLAN_CTAG_RX | + NETIF_F_HW_VLAN_STAG_RX); + if (vlan_features != (NETIF_F_HW_VLAN_CTAG_RX | + NETIF_F_HW_VLAN_STAG_RX)) { if (dev->features & NETIF_F_HW_VLAN_CTAG_RX) features &= ~(NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX); - else + else if (vlan_features) features |= NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX; }
From: Erez Shitrit erezsh@mellanox.com
[ Upstream commit 8075411d93b6efe143d9f606f6531077795b7fbf ]
In polling mode, set arm_db member to a value that will avoid CQ event recovery by the HW. Otherwise we might get event without completion function. In addition,empty completion function to was added to protect from unexpected events.
Fixes: 297cccebdc5a ("net/mlx5: DR, Expose an internal API to issue RDMA operations") Signed-off-by: Erez Shitrit erezsh@mellanox.com Reviewed-by: Tariq Toukan tariqt@mellanox.com Reviewed-by: Alex Vesker valex@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c | 14 ++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c @@ -689,6 +689,12 @@ static void dr_cq_event(struct mlx5_core pr_info("CQ event %u on CQ #%u\n", event, mcq->cqn); }
+static void dr_cq_complete(struct mlx5_core_cq *mcq, + struct mlx5_eqe *eqe) +{ + pr_err("CQ completion CQ: #%u\n", mcq->cqn); +} + static struct mlx5dr_cq *dr_create_cq(struct mlx5_core_dev *mdev, struct mlx5_uars_page *uar, size_t ncqe) @@ -750,6 +756,7 @@ static struct mlx5dr_cq *dr_create_cq(st mlx5_fill_page_frag_array(&cq->wq_ctrl.buf, pas);
cq->mcq.event = dr_cq_event; + cq->mcq.comp = dr_cq_complete;
err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out)); kvfree(in); @@ -761,7 +768,12 @@ static struct mlx5dr_cq *dr_create_cq(st cq->mcq.set_ci_db = cq->wq_ctrl.db.db; cq->mcq.arm_db = cq->wq_ctrl.db.db + 1; *cq->mcq.set_ci_db = 0; - *cq->mcq.arm_db = 0; + + /* set no-zero value, in order to avoid the HW to run db-recovery on + * CQ that used in polling mode. + */ + *cq->mcq.arm_db = cpu_to_be32(2 << 28); + cq->mcq.vector = 0; cq->mcq.irqn = irqn; cq->mcq.uar = uar;
From: Moshe Shemesh moshe@mellanox.com
[ Upstream commit f3cb3cebe26ed4c8036adbd9448b372129d3c371 ]
mlx5_cmd_flush() will trigger forced completions to all valid command entries. Triggered by an asynch event such as fast teardown it can happen at any stage of the command, including command initialization. It will trigger forced completion and that can lead to completion on an uninitialized command entry.
Setting MLX5_CMD_ENT_STATE_PENDING_COMP only after command entry is initialized will ensure force completion is treated only if command entry is initialized.
Fixes: 73dd3a4839c1 ("net/mlx5: Avoid using pending command interface slots") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Eran Ben Elisha eranbe@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -888,7 +888,6 @@ static void cmd_work_handler(struct work }
cmd->ent_arr[ent->idx] = ent; - set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state); lay = get_inst(cmd, ent->idx); ent->lay = lay; memset(lay, 0, sizeof(*lay)); @@ -910,6 +909,7 @@ static void cmd_work_handler(struct work
if (ent->callback) schedule_delayed_work(&ent->cb_timeout_work, cb_timeout); + set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
/* Skip sending command to fw if internal error */ if (pci_channel_offline(dev->pdev) ||
From: Moshe Shemesh moshe@mellanox.com
[ Upstream commit cece6f432cca9f18900463ed01b97a152a03600a ]
Processing commands by cmd_work_handler() while already in Internal Error State will result in entry leak, since the handler process force completion without doorbell. Forced completion doesn't release the entry and event completion will never arrive, so entry should be released.
Fixes: 73dd3a4839c1 ("net/mlx5: Avoid using pending command interface slots") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Eran Ben Elisha eranbe@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -922,6 +922,10 @@ static void cmd_work_handler(struct work MLX5_SET(mbox_out, ent->out, syndrome, drv_synd);
mlx5_cmd_comp_handler(dev, 1UL << ent->idx, true); + /* no doorbell, no need to keep the entry */ + free_ent(cmd, ent->idx); + if (ent->callback) + free_cmd(ent); return; }
From: Roi Dayan roid@mellanox.com
[ Upstream commit 67b38de646894c9a94fe4d6d17719e70cc6028eb ]
Need to allocate the q counters before init_rx which needs them when creating the rq.
Fixes: 8520fa57a4e9 ("net/mlx5e: Create q counters on uplink representors") Signed-off-by: Roi Dayan roid@mellanox.com Reviewed-by: Vlad Buslov vladbu@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c @@ -1692,19 +1692,14 @@ static void mlx5e_cleanup_rep_rx(struct
static int mlx5e_init_ul_rep_rx(struct mlx5e_priv *priv) { - int err = mlx5e_init_rep_rx(priv); - - if (err) - return err; - mlx5e_create_q_counters(priv); - return 0; + return mlx5e_init_rep_rx(priv); }
static void mlx5e_cleanup_ul_rep_rx(struct mlx5e_priv *priv) { - mlx5e_destroy_q_counters(priv); mlx5e_cleanup_rep_rx(priv); + mlx5e_destroy_q_counters(priv); }
static int mlx5e_init_uplink_rep_tx(struct mlx5e_rep_priv *rpriv)
From: Dan Carpenter dan.carpenter@oracle.com
[ Upstream commit 39bd16df7c31bb8cf5dfd0c88e42abd5ae10029d ]
The "rss_context" variable comes from the user via ethtool_get_rxfh(). It can be any u32 value except zero. Eventually it gets passed to mvpp22_rss_ctx() and if it is over MVPP22_N_RSS_TABLES (8) then it results in an array overflow.
Fixes: 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -4325,6 +4325,8 @@ static int mvpp2_ethtool_get_rxfh_contex
if (!mvpp22_rss_is_supported()) return -EOPNOTSUPP; + if (rss_context >= MVPP22_N_RSS_TABLES) + return -EINVAL;
if (hfunc) *hfunc = ETH_RSS_HASH_CRC32;
From: Dan Carpenter dan.carpenter@oracle.com
[ Upstream commit 722c0f00d4feea77475a5dc943b53d60824a1e4e ]
The "info->fs.location" is a u32 that comes from the user via the ethtool_set_rxnfc() function. We need to check for invalid values to prevent a buffer overflow.
I copy and pasted this check from the mvpp2_ethtool_cls_rule_ins() function.
Fixes: 90b509b39ac9 ("net: mvpp2: cls: Add Classification offload support") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c @@ -1422,6 +1422,9 @@ int mvpp2_ethtool_cls_rule_del(struct mv struct mvpp2_ethtool_fs *efs; int ret;
+ if (info->fs.location >= MVPP2_N_RFS_ENTRIES_PER_FLOW) + return -EINVAL; + efs = port->rfs_rules[info->fs.location]; if (!efs) return -EINVAL;
From: "Jason A. Donenfeld" Jason@zx2c4.com
[ Upstream commit 130c58606171326c81841a49cc913cd354113dd9 ]
Prior, if the alloc_percpu of packet_percpu_multicore_worker_alloc failed, the previously allocated ptr_ring wouldn't be freed. This commit adds the missing call to ptr_ring_cleanup in the error case.
Reported-by: Sultan Alsawaf sultan@kerneltoast.com Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/wireguard/queueing.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/net/wireguard/queueing.c +++ b/drivers/net/wireguard/queueing.c @@ -35,8 +35,10 @@ int wg_packet_queue_init(struct crypt_qu if (multicore) { queue->worker = wg_packet_percpu_multicore_worker_alloc( function, queue); - if (!queue->worker) + if (!queue->worker) { + ptr_ring_cleanup(&queue->ring, NULL); return -ENOMEM; + } } else { INIT_WORK(&queue->work, function); }
From: "Toke H�iland-J�rgensen" toke@redhat.com
[ Upstream commit eebabcb26ea1e3295704477c6cd4e772c96a9559 ]
WireGuard currently only propagates ECN markings on tunnel decap according to the old RFC3168 specification. However, the spec has since been updated in RFC6040 to recommend slightly different decapsulation semantics. This was implemented in the kernel as a set of common helpers for ECN decapsulation, so let's just switch over WireGuard to using those, so it can benefit from this enhancement and any future tweaks. We do not drop packets with invalid ECN marking combinations, because WireGuard is frequently used to work around broken ISPs, which could be doing that.
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Reported-by: Olivier Tilmans olivier.tilmans@nokia-bell-labs.com Cc: Dave Taht dave.taht@gmail.com Cc: Rodney W. Grimes ietf@gndrsh.dnsmgr.net Signed-off-by: Toke Høiland-Jørgensen toke@redhat.com Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/wireguard/receive.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
--- a/drivers/net/wireguard/receive.c +++ b/drivers/net/wireguard/receive.c @@ -393,13 +393,11 @@ static void wg_packet_consume_data_done( len = ntohs(ip_hdr(skb)->tot_len); if (unlikely(len < sizeof(struct iphdr))) goto dishonest_packet_size; - if (INET_ECN_is_ce(PACKET_CB(skb)->ds)) - IP_ECN_set_ce(ip_hdr(skb)); + INET_ECN_decapsulate(skb, PACKET_CB(skb)->ds, ip_hdr(skb)->tos); } else if (skb->protocol == htons(ETH_P_IPV6)) { len = ntohs(ipv6_hdr(skb)->payload_len) + sizeof(struct ipv6hdr); - if (INET_ECN_is_ce(PACKET_CB(skb)->ds)) - IP6_ECN_set_ce(skb, ipv6_hdr(skb)); + INET_ECN_decapsulate(skb, PACKET_CB(skb)->ds, ipv6_get_dsfield(ipv6_hdr(skb))); } else { goto dishonest_packet_type; }
From: Dejin Zheng zhengdejin5@gmail.com
[ Upstream commit d975cb7ea915e64a3ebcfef8a33051f3e6bf22a8 ]
the related system resources were not released when enetc_hw_alloc() return error in the enetc_pci_mdio_probe(), add iounmap() for error handling label "err_hw_alloc" to fix it.
Fixes: 6517798dd3432a ("enetc: Make MDIO accessors more generic and export to include/linux/fsl") Cc: Andy Shevchenko andy.shevchenko@gmail.com Signed-off-by: Dejin Zheng zhengdejin5@gmail.com Reviewed-by: Vladimir Oltean vladimir.oltean@nxp.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c +++ b/drivers/net/ethernet/freescale/enetc/enetc_pci_mdio.c @@ -74,8 +74,8 @@ err_pci_mem_reg: pci_disable_device(pdev); err_pci_enable: err_mdiobus_alloc: - iounmap(port_regs); err_hw_alloc: + iounmap(port_regs); err_ioremap: return err; }
From: "Jason A. Donenfeld" Jason@zx2c4.com
[ Upstream commit b673e24aad36981f327a6570412ffa7754de8911 ]
It's already possible to create two different interfaces and loop packets between them. This has always been possible with tunnels in the kernel, and isn't specific to wireguard. Therefore, the networking stack already needs to deal with that. At the very least, the packet winds up exceeding the MTU and is discarded at that point. So, since this is already something that happens, there's no need to forbid the not very exceptional case of routing a packet back to the same interface; this loop is no different than others, and we shouldn't special case it, but rather rely on generic handling of loops in general. This also makes it easier to do interesting things with wireguard such as onion routing.
At the same time, we add a selftest for this, ensuring that both onion routing works and infinite routing loops do not crash the kernel. We also add a test case for wireguard interfaces nesting packets and sending traffic between each other, as well as the loop in this case too. We make sure to send some throughput-heavy traffic for this use case, to stress out any possible recursion issues with the locks around workqueues.
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/wireguard/socket.c | 12 ------ tools/testing/selftests/wireguard/netns.sh | 54 +++++++++++++++++++++++++++-- 2 files changed, 51 insertions(+), 15 deletions(-)
--- a/drivers/net/wireguard/socket.c +++ b/drivers/net/wireguard/socket.c @@ -76,12 +76,6 @@ static int send4(struct wg_device *wg, s net_dbg_ratelimited("%s: No route to %pISpfsc, error %d\n", wg->dev->name, &endpoint->addr, ret); goto err; - } else if (unlikely(rt->dst.dev == skb->dev)) { - ip_rt_put(rt); - ret = -ELOOP; - net_dbg_ratelimited("%s: Avoiding routing loop to %pISpfsc\n", - wg->dev->name, &endpoint->addr); - goto err; } if (cache) dst_cache_set_ip4(cache, &rt->dst, fl.saddr); @@ -149,12 +143,6 @@ static int send6(struct wg_device *wg, s net_dbg_ratelimited("%s: No route to %pISpfsc, error %d\n", wg->dev->name, &endpoint->addr, ret); goto err; - } else if (unlikely(dst->dev == skb->dev)) { - dst_release(dst); - ret = -ELOOP; - net_dbg_ratelimited("%s: Avoiding routing loop to %pISpfsc\n", - wg->dev->name, &endpoint->addr); - goto err; } if (cache) dst_cache_set_ip6(cache, dst, &fl.saddr); --- a/tools/testing/selftests/wireguard/netns.sh +++ b/tools/testing/selftests/wireguard/netns.sh @@ -48,8 +48,11 @@ cleanup() { exec 2>/dev/null printf "$orig_message_cost" > /proc/sys/net/core/message_cost ip0 link del dev wg0 + ip0 link del dev wg1 ip1 link del dev wg0 + ip1 link del dev wg1 ip2 link del dev wg0 + ip2 link del dev wg1 local to_kill="$(ip netns pids $netns0) $(ip netns pids $netns1) $(ip netns pids $netns2)" [[ -n $to_kill ]] && kill $to_kill pp ip netns del $netns1 @@ -77,18 +80,20 @@ ip0 link set wg0 netns $netns2 key1="$(pp wg genkey)" key2="$(pp wg genkey)" key3="$(pp wg genkey)" +key4="$(pp wg genkey)" pub1="$(pp wg pubkey <<<"$key1")" pub2="$(pp wg pubkey <<<"$key2")" pub3="$(pp wg pubkey <<<"$key3")" +pub4="$(pp wg pubkey <<<"$key4")" psk="$(pp wg genpsk)" [[ -n $key1 && -n $key2 && -n $psk ]]
configure_peers() { ip1 addr add 192.168.241.1/24 dev wg0 - ip1 addr add fd00::1/24 dev wg0 + ip1 addr add fd00::1/112 dev wg0
ip2 addr add 192.168.241.2/24 dev wg0 - ip2 addr add fd00::2/24 dev wg0 + ip2 addr add fd00::2/112 dev wg0
n1 wg set wg0 \ private-key <(echo "$key1") \ @@ -230,9 +235,38 @@ n1 ping -W 1 -c 1 192.168.241.2 n1 wg set wg0 private-key <(echo "$key3") n2 wg set wg0 peer "$pub3" preshared-key <(echo "$psk") allowed-ips 192.168.241.1/32 peer "$pub1" remove n1 ping -W 1 -c 1 192.168.241.2 +n2 wg set wg0 peer "$pub3" remove
-ip1 link del wg0 +# Test that we can route wg through wg +ip1 addr flush dev wg0 +ip2 addr flush dev wg0 +ip1 addr add fd00::5:1/112 dev wg0 +ip2 addr add fd00::5:2/112 dev wg0 +n1 wg set wg0 private-key <(echo "$key1") peer "$pub2" preshared-key <(echo "$psk") allowed-ips fd00::5:2/128 endpoint 127.0.0.1:2 +n2 wg set wg0 private-key <(echo "$key2") listen-port 2 peer "$pub1" preshared-key <(echo "$psk") allowed-ips fd00::5:1/128 endpoint 127.212.121.99:9998 +ip1 link add wg1 type wireguard +ip2 link add wg1 type wireguard +ip1 addr add 192.168.241.1/24 dev wg1 +ip1 addr add fd00::1/112 dev wg1 +ip2 addr add 192.168.241.2/24 dev wg1 +ip2 addr add fd00::2/112 dev wg1 +ip1 link set mtu 1340 up dev wg1 +ip2 link set mtu 1340 up dev wg1 +n1 wg set wg1 listen-port 5 private-key <(echo "$key3") peer "$pub4" allowed-ips 192.168.241.2/32,fd00::2/128 endpoint [fd00::5:2]:5 +n2 wg set wg1 listen-port 5 private-key <(echo "$key4") peer "$pub3" allowed-ips 192.168.241.1/32,fd00::1/128 endpoint [fd00::5:1]:5 +tests +# Try to set up a routing loop between the two namespaces +ip1 link set netns $netns0 dev wg1 +ip0 addr add 192.168.241.1/24 dev wg1 +ip0 link set up dev wg1 +n0 ping -W 1 -c 1 192.168.241.2 +n1 wg set wg0 peer "$pub2" endpoint 192.168.241.2:7 ip2 link del wg0 +ip2 link del wg1 +! n0 ping -W 1 -c 10 -f 192.168.241.2 || false # Should not crash kernel + +ip0 link del wg1 +ip1 link del wg0
# Test using NAT. We now change the topology to this: # ┌────────────────────────────────────────┐ ┌────────────────────────────────────────────────┐ ┌────────────────────────────────────────┐ @@ -282,6 +316,20 @@ pp sleep 3 n2 ping -W 1 -c 1 192.168.241.1 n1 wg set wg0 peer "$pub2" persistent-keepalive 0
+# Test that onion routing works, even when it loops +n1 wg set wg0 peer "$pub3" allowed-ips 192.168.242.2/32 endpoint 192.168.241.2:5 +ip1 addr add 192.168.242.1/24 dev wg0 +ip2 link add wg1 type wireguard +ip2 addr add 192.168.242.2/24 dev wg1 +n2 wg set wg1 private-key <(echo "$key3") listen-port 5 peer "$pub1" allowed-ips 192.168.242.1/32 +ip2 link set wg1 up +n1 ping -W 1 -c 1 192.168.242.2 +ip2 link del wg1 +n1 wg set wg0 peer "$pub3" endpoint 192.168.242.2:5 +! n1 ping -W 1 -c 1 192.168.242.2 || false # Should not crash kernel +n1 wg set wg0 peer "$pub3" remove +ip1 addr del 192.168.242.1/24 dev wg0 + # Do a wg-quick(8)-style policy routing for the default route, making sure vethc has a v6 address to tease out bugs. ip1 -6 addr add fc00::9/96 dev vethc ip1 -6 route add default via fc00::1
From: "Jason A. Donenfeld" Jason@zx2c4.com
[ Upstream commit 4005f5c3c9d006157ba716594e0d70c88a235c5e ]
Users with pathological hardware reported CPU stalls on CONFIG_ PREEMPT_VOLUNTARY=y, because the ringbuffers would stay full, meaning these workers would never terminate. That turned out not to be okay on systems without forced preemption, which Sultan observed. This commit adds a cond_resched() to the bottom of each loop iteration, so that these workers don't hog the core. Note that we don't need this on the napi poll worker, since that terminates after its budget is expended.
Suggested-by: Sultan Alsawaf sultan@kerneltoast.com Reported-by: Wang Jian larkwang@gmail.com Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/wireguard/receive.c | 2 ++ drivers/net/wireguard/send.c | 4 ++++ 2 files changed, 6 insertions(+)
--- a/drivers/net/wireguard/receive.c +++ b/drivers/net/wireguard/receive.c @@ -516,6 +516,8 @@ void wg_packet_decrypt_worker(struct wor &PACKET_CB(skb)->keypair->receiving)) ? PACKET_STATE_CRYPTED : PACKET_STATE_DEAD; wg_queue_enqueue_per_peer_napi(skb, state); + if (need_resched()) + cond_resched(); } }
--- a/drivers/net/wireguard/send.c +++ b/drivers/net/wireguard/send.c @@ -281,6 +281,8 @@ void wg_packet_tx_worker(struct work_str
wg_noise_keypair_put(keypair, false); wg_peer_put(peer); + if (need_resched()) + cond_resched(); } }
@@ -305,6 +307,8 @@ void wg_packet_encrypt_worker(struct wor wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first, state);
+ if (need_resched()) + cond_resched(); } }
From: Jason Gerecke jason.gerecke@wacom.com
commit 778fbf4179991e7652e97d7f1ca1f657ef828422 upstream.
We've recently switched from extracting the value of HID_DG_CONTACTMAX at a fixed offset (which may not be correct for all tablets) to injecting the report into the driver for the generic codepath to handle. Unfortunately, this change was made for *all* tablets, even those which aren't generic. Because `wacom_wac_report` ignores reports from non- generic devices, the contact count never gets initialized. Ultimately this results in the touch device itself failing to probe, and thus the loss of touch input.
This commit adds back the fixed-offset extraction for non-generic devices.
Link: https://github.com/linuxwacom/input-wacom/issues/155 Fixes: 184eccd40389 ("HID: wacom: generic: read HID_DG_CONTACTMAX from any feature report") Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Reviewed-by: Aaron Armstrong Skomra aaron.skomra@wacom.com CC: stable@vger.kernel.org # 5.3+ Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Cc: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_sys.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/hid/wacom_sys.c +++ b/drivers/hid/wacom_sys.c @@ -319,9 +319,11 @@ static void wacom_feature_mapping(struct data[0] = field->report->id; ret = wacom_get_report(hdev, HID_FEATURE_REPORT, data, n, WAC_CMD_RETRIES); - if (ret == n) { + if (ret == n && features->type == HID_GENERIC) { ret = hid_report_raw_event(hdev, HID_FEATURE_REPORT, data, n, 0); + } else if (ret == 2 && features->type != HID_GENERIC) { + features->touch_max = data[1]; } else { features->touch_max = 16; hid_warn(hdev, "wacom_feature_mapping: "
From: Jere Leppänen jere.leppanen@nokia.com
commit 145cb2f7177d94bc54563ed26027e952ee0ae03c upstream.
When we start shutdown in sctp_sf_do_dupcook_a(), we want to bundle the SHUTDOWN with the COOKIE-ACK to ensure that the peer receives them at the same time and in the correct order. This bundling was broken by commit 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc"), which assigns a transport for the COOKIE-ACK, but not for the SHUTDOWN.
Fix this by passing a reference to the COOKIE-ACK chunk as an argument to sctp_sf_do_9_2_start_shutdown() and onward to sctp_make_shutdown(). This way the SHUTDOWN chunk is assigned the same transport as the COOKIE-ACK chunk, which allows them to be bundled.
In sctp_sf_do_9_2_start_shutdown(), the void *arg parameter was previously unused. Now that we're taking it into use, it must be a valid pointer to a chunk, or NULL. There is only one call site where it's not, in sctp_sf_autoclose_timer_expire(). Fix that too.
Fixes: 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc") Signed-off-by: Jere Leppänen jere.leppanen@nokia.com Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Cc: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/sctp/sm_statefuns.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/sctp/sm_statefuns.c +++ b/net/sctp/sm_statefuns.c @@ -1865,7 +1865,7 @@ static enum sctp_disposition sctp_sf_do_ */ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl)); return sctp_sf_do_9_2_start_shutdown(net, ep, asoc, - SCTP_ST_CHUNK(0), NULL, + SCTP_ST_CHUNK(0), repl, commands); } else { sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE, @@ -5470,7 +5470,7 @@ enum sctp_disposition sctp_sf_do_9_2_sta * in the Cumulative TSN Ack field the last sequential TSN it * has received from the peer. */ - reply = sctp_make_shutdown(asoc, NULL); + reply = sctp_make_shutdown(asoc, arg); if (!reply) goto nomem;
@@ -6068,7 +6068,7 @@ enum sctp_disposition sctp_sf_autoclose_ disposition = SCTP_DISPOSITION_CONSUME; if (sctp_outq_is_empty(&asoc->outqueue)) { disposition = sctp_sf_do_9_2_start_shutdown(net, ep, asoc, type, - arg, commands); + NULL, commands); }
return disposition;
From: Jason Gerecke killertofu@gmail.com
commit b43f977dd281945960c26b3ef67bba0fa07d39d9 upstream.
This reverts commit 15893fa40109f5e7c67eeb8da62267d0fdf0be9d.
The referenced commit broke pen and touch input for a variety of devices such as the Cintiq Pro 32. Affected devices may appear to work normally for a short amount of time, but eventually loose track of actual touch state and can leave touch arbitration enabled which prevents the pen from working. The commit is not itself required for any currently-available Bluetooth device, and so we revert it to correct the behavior of broken devices.
This breakage occurs due to a mismatch between the order of collections and the order of usages on some devices. This commit tries to read the contact count before processing events, but will fail if the contact count does not occur prior to the first logical finger collection. This is the case for devices like the Cintiq Pro 32 which place the contact count at the very end of the report.
Without the contact count set, touches will only be partially processed. The `wacom_wac_finger_slot` function will not open any slots since the number of contacts seen is greater than the expectation of 0, but we will still end up calling `input_mt_sync_frame` for each finger anyway. This can cause problems for userspace separate from the issue currently taking place in the kernel. Only once all of the individual finger collections have been processed do we finally get to the enclosing collection which contains the contact count. The value ends up being used for the *next* report, however.
This delayed use of the contact count can cause the driver to loose track of the actual touch state and believe that there are contacts down when there aren't. This leaves touch arbitration enabled and prevents the pen from working. It can also cause userspace to incorrectly treat single- finger input as gestures.
Link: https://github.com/linuxwacom/input-wacom/issues/146 Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Reviewed-by: Aaron Armstrong Skomra aaron.skomra@wacom.com Fixes: 15893fa40109 ("HID: wacom: generic: read the number of expected touches on a per collection basis") Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_wac.c | 79 +++++++++--------------------------------------- 1 file changed, 16 insertions(+), 63 deletions(-)
--- a/drivers/hid/wacom_wac.c +++ b/drivers/hid/wacom_wac.c @@ -2637,9 +2637,25 @@ static void wacom_wac_finger_pre_report( case HID_DG_TIPSWITCH: hid_data->last_slot_field = equivalent_usage; break; + case HID_DG_CONTACTCOUNT: + hid_data->cc_report = report->id; + hid_data->cc_index = i; + hid_data->cc_value_index = j; + break; } } } + + if (hid_data->cc_report != 0 && + hid_data->cc_index >= 0) { + struct hid_field *field = report->field[hid_data->cc_index]; + int value = field->value[hid_data->cc_value_index]; + if (value) + hid_data->num_expected = value; + } + else { + hid_data->num_expected = wacom_wac->features.touch_max; + } }
static void wacom_wac_finger_report(struct hid_device *hdev, @@ -2649,7 +2665,6 @@ static void wacom_wac_finger_report(stru struct wacom_wac *wacom_wac = &wacom->wacom_wac; struct input_dev *input = wacom_wac->touch_input; unsigned touch_max = wacom_wac->features.touch_max; - struct hid_data *hid_data = &wacom_wac->hid_data;
/* If more packets of data are expected, give us a chance to * process them rather than immediately syncing a partial @@ -2663,7 +2678,6 @@ static void wacom_wac_finger_report(stru
input_sync(input); wacom_wac->hid_data.num_received = 0; - hid_data->num_expected = 0;
/* keep touch state for pen event */ wacom_wac->shared->touch_down = wacom_wac_finger_count_touches(wacom_wac); @@ -2738,73 +2752,12 @@ static void wacom_report_events(struct h } }
-static void wacom_set_num_expected(struct hid_device *hdev, - struct hid_report *report, - int collection_index, - struct hid_field *field, - int field_index) -{ - struct wacom *wacom = hid_get_drvdata(hdev); - struct wacom_wac *wacom_wac = &wacom->wacom_wac; - struct hid_data *hid_data = &wacom_wac->hid_data; - unsigned int original_collection_level = - hdev->collection[collection_index].level; - bool end_collection = false; - int i; - - if (hid_data->num_expected) - return; - - // find the contact count value for this segment - for (i = field_index; i < report->maxfield && !end_collection; i++) { - struct hid_field *field = report->field[i]; - unsigned int field_level = - hdev->collection[field->usage[0].collection_index].level; - unsigned int j; - - if (field_level != original_collection_level) - continue; - - for (j = 0; j < field->maxusage; j++) { - struct hid_usage *usage = &field->usage[j]; - - if (usage->collection_index != collection_index) { - end_collection = true; - break; - } - if (wacom_equivalent_usage(usage->hid) == HID_DG_CONTACTCOUNT) { - hid_data->cc_report = report->id; - hid_data->cc_index = i; - hid_data->cc_value_index = j; - - if (hid_data->cc_report != 0 && - hid_data->cc_index >= 0) { - - struct hid_field *field = - report->field[hid_data->cc_index]; - int value = - field->value[hid_data->cc_value_index]; - - if (value) - hid_data->num_expected = value; - } - } - } - } - - if (hid_data->cc_report == 0 || hid_data->cc_index < 0) - hid_data->num_expected = wacom_wac->features.touch_max; -} - static int wacom_wac_collection(struct hid_device *hdev, struct hid_report *report, int collection_index, struct hid_field *field, int field_index) { struct wacom *wacom = hid_get_drvdata(hdev);
- if (WACOM_FINGER_FIELD(field)) - wacom_set_num_expected(hdev, report, collection_index, field, - field_index); wacom_report_events(hdev, report, collection_index, field_index);
/*
From: Alan Stern stern@rowland.harvard.edu
commit 0ed08faded1da03eb3def61502b27f81aef2e615 upstream.
The syzbot fuzzer discovered a bad race between in the usbhid driver between usbhid_stop() and usbhid_close(). In particular, usbhid_stop() does:
usb_free_urb(usbhid->urbin); ... usbhid->urbin = NULL; /* don't mess up next start */
and usbhid_close() does:
usb_kill_urb(usbhid->urbin);
with no mutual exclusion. If the two routines happen to run concurrently so that usb_kill_urb() is called in between the usb_free_urb() and the NULL assignment, it will access the deallocated urb structure -- a use-after-free bug.
This patch adds a mutex to the usbhid private structure and uses it to enforce mutual exclusion of the usbhid_start(), usbhid_stop(), usbhid_open() and usbhid_close() callbacks.
Reported-and-tested-by: syzbot+7bf5a7b0f0a1f9446f4c@syzkaller.appspotmail.com Signed-off-by: Alan Stern stern@rowland.harvard.edu CC: stable@vger.kernel.org Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/usbhid/hid-core.c | 37 +++++++++++++++++++++++++++++-------- drivers/hid/usbhid/usbhid.h | 1 + 2 files changed, 30 insertions(+), 8 deletions(-)
--- a/drivers/hid/usbhid/hid-core.c +++ b/drivers/hid/usbhid/hid-core.c @@ -682,16 +682,21 @@ static int usbhid_open(struct hid_device struct usbhid_device *usbhid = hid->driver_data; int res;
+ mutex_lock(&usbhid->mutex); + set_bit(HID_OPENED, &usbhid->iofl);
- if (hid->quirks & HID_QUIRK_ALWAYS_POLL) - return 0; + if (hid->quirks & HID_QUIRK_ALWAYS_POLL) { + res = 0; + goto Done; + }
res = usb_autopm_get_interface(usbhid->intf); /* the device must be awake to reliably request remote wakeup */ if (res < 0) { clear_bit(HID_OPENED, &usbhid->iofl); - return -EIO; + res = -EIO; + goto Done; }
usbhid->intf->needs_remote_wakeup = 1; @@ -725,6 +730,9 @@ static int usbhid_open(struct hid_device msleep(50);
clear_bit(HID_RESUME_RUNNING, &usbhid->iofl); + + Done: + mutex_unlock(&usbhid->mutex); return res; }
@@ -732,6 +740,8 @@ static void usbhid_close(struct hid_devi { struct usbhid_device *usbhid = hid->driver_data;
+ mutex_lock(&usbhid->mutex); + /* * Make sure we don't restart data acquisition due to * a resumption we no longer care about by avoiding racing @@ -743,12 +753,13 @@ static void usbhid_close(struct hid_devi clear_bit(HID_IN_POLLING, &usbhid->iofl); spin_unlock_irq(&usbhid->lock);
- if (hid->quirks & HID_QUIRK_ALWAYS_POLL) - return; + if (!(hid->quirks & HID_QUIRK_ALWAYS_POLL)) { + hid_cancel_delayed_stuff(usbhid); + usb_kill_urb(usbhid->urbin); + usbhid->intf->needs_remote_wakeup = 0; + }
- hid_cancel_delayed_stuff(usbhid); - usb_kill_urb(usbhid->urbin); - usbhid->intf->needs_remote_wakeup = 0; + mutex_unlock(&usbhid->mutex); }
/* @@ -1057,6 +1068,8 @@ static int usbhid_start(struct hid_devic unsigned int n, insize = 0; int ret;
+ mutex_lock(&usbhid->mutex); + clear_bit(HID_DISCONNECTED, &usbhid->iofl);
usbhid->bufsize = HID_MIN_BUFFER_SIZE; @@ -1177,6 +1190,8 @@ static int usbhid_start(struct hid_devic usbhid_set_leds(hid); device_set_wakeup_enable(&dev->dev, 1); } + + mutex_unlock(&usbhid->mutex); return 0;
fail: @@ -1187,6 +1202,7 @@ fail: usbhid->urbout = NULL; usbhid->urbctrl = NULL; hid_free_buffers(dev, hid); + mutex_unlock(&usbhid->mutex); return ret; }
@@ -1202,6 +1218,8 @@ static void usbhid_stop(struct hid_devic usbhid->intf->needs_remote_wakeup = 0; }
+ mutex_lock(&usbhid->mutex); + clear_bit(HID_STARTED, &usbhid->iofl); spin_lock_irq(&usbhid->lock); /* Sync with error and led handlers */ set_bit(HID_DISCONNECTED, &usbhid->iofl); @@ -1222,6 +1240,8 @@ static void usbhid_stop(struct hid_devic usbhid->urbout = NULL;
hid_free_buffers(hid_to_usb_dev(hid), hid); + + mutex_unlock(&usbhid->mutex); }
static int usbhid_power(struct hid_device *hid, int lvl) @@ -1382,6 +1402,7 @@ static int usbhid_probe(struct usb_inter INIT_WORK(&usbhid->reset_work, hid_reset); timer_setup(&usbhid->io_retry, hid_retry_timeout, 0); spin_lock_init(&usbhid->lock); + mutex_init(&usbhid->mutex);
ret = hid_add_device(hid); if (ret) { --- a/drivers/hid/usbhid/usbhid.h +++ b/drivers/hid/usbhid/usbhid.h @@ -80,6 +80,7 @@ struct usbhid_device { dma_addr_t outbuf_dma; /* Output buffer dma */ unsigned long last_out; /* record of last output for timeouts */
+ struct mutex mutex; /* start/stop/open/close */ spinlock_t lock; /* fifo spinlock */ unsigned long iofl; /* I/O flags (CTRL_RUNNING, OUT_RUNNING) */ struct timer_list io_retry; /* Retry timer */
From: Jason Gerecke killertofu@gmail.com
commit dcce8ef8f70a8e38e6c47c1bae8b312376c04420 upstream.
The state of the center button was not reported to userspace for the 2nd-gen Intuos Pro S when used over Bluetooth due to the pad handling code not being updated to support its reduced number of buttons. This patch uses the actual number of buttons present on the tablet to assemble a button state bitmap.
Link: https://github.com/linuxwacom/xf86-input-wacom/issues/112 Fixes: cd47de45b855 ("HID: wacom: Add 2nd gen Intuos Pro Small support") Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_wac.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/hid/wacom_wac.c +++ b/drivers/hid/wacom_wac.c @@ -1427,11 +1427,13 @@ static void wacom_intuos_pro2_bt_pad(str { struct input_dev *pad_input = wacom->pad_input; unsigned char *data = wacom->data; + int nbuttons = wacom->features.numbered_buttons;
- int buttons = data[282] | ((data[281] & 0x40) << 2); + int expresskeys = data[282]; + int center = (data[281] & 0x40) >> 6; int ring = data[285] & 0x7F; bool ringstatus = data[285] & 0x80; - bool prox = buttons || ringstatus; + bool prox = expresskeys || center || ringstatus;
/* Fix touchring data: userspace expects 0 at left and increasing clockwise */ ring = 71 - ring; @@ -1439,7 +1441,8 @@ static void wacom_intuos_pro2_bt_pad(str if (ring > 71) ring -= 72;
- wacom_report_numbered_buttons(pad_input, 9, buttons); + wacom_report_numbered_buttons(pad_input, nbuttons, + expresskeys | (center << (nbuttons - 1)));
input_report_abs(pad_input, ABS_WHEEL, ringstatus ? ring : 0);
From: Oliver Neukum oneukum@suse.com
commit 9f04db234af691007bb785342a06abab5fb34474 upstream.
This device needs US_FL_NO_REPORT_OPCODES to avoid going through prolonged error handling on enumeration.
Signed-off-by: Oliver Neukum oneukum@suse.com Reported-by: Julian Groß julian.g@posteo.de Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20200429155218.7308-1-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/storage/unusual_uas.h | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/drivers/usb/storage/unusual_uas.h +++ b/drivers/usb/storage/unusual_uas.h @@ -28,6 +28,13 @@ * and don't forget to CC: the USB development list linux-usb@vger.kernel.org */
+/* Reported-by: Julian Groß julian.g@posteo.de */ +UNUSUAL_DEV(0x059f, 0x105f, 0x0000, 0x9999, + "LaCie", + "2Big Quadra USB3", + USB_SC_DEVICE, USB_PR_DEVICE, NULL, + US_FL_NO_REPORT_OPCODES), + /* * Apricorn USB3 dongle sometimes returns "USBSUSBSUSBS" in response to SCSI * commands in UAS mode. Observed with the 1.28 firmware; are there others?
From: Bryan O'Donoghue bryan.odonoghue@linaro.org
commit 91edf63d5022bd0464788ffb4acc3d5febbaf81d upstream.
Currently we check to make sure there is no error state on the extcon handle for VBUS when writing to the HS_PHY_GENCONFIG_2 register. When using the USB role-switch API we still need to write to this register absent an extcon handle.
This patch makes the appropriate update to ensure the write happens if role-switching is true.
Fixes: 05559f10ed79 ("usb: chipidea: add role switch class support") Cc: stable stable@vger.kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Philipp Zabel p.zabel@pengutronix.de Cc: linux-usb@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Stephen Boyd swboyd@chromium.org Signed-off-by: Bryan O'Donoghue bryan.odonoghue@linaro.org Signed-off-by: Peter Chen peter.chen@nxp.com Link: https://lore.kernel.org/r/20200507004918.25975-2-peter.chen@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/chipidea/ci_hdrc_msm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/usb/chipidea/ci_hdrc_msm.c +++ b/drivers/usb/chipidea/ci_hdrc_msm.c @@ -114,7 +114,7 @@ static int ci_hdrc_msm_notify_event(stru hw_write_id_reg(ci, HS_PHY_GENCONFIG_2, HS_PHY_ULPI_TX_PKT_EN_CLR_FIX, 0);
- if (!IS_ERR(ci->platdata->vbus_extcon.edev)) { + if (!IS_ERR(ci->platdata->vbus_extcon.edev) || ci->role_switch) { hw_write_id_reg(ci, HS_PHY_GENCONFIG_2, HS_PHY_SESS_VLD_CTRL_EN, HS_PHY_SESS_VLD_CTRL_EN);
From: Oliver Neukum oneukum@suse.com
commit e9b3c610a05c1cdf8e959a6d89c38807ff758ee6 upstream.
We must not process packets shorter than a packet ID
Signed-off-by: Oliver Neukum oneukum@suse.com Reported-and-tested-by: syzbot+d29e9263e13ce0b9f4fd@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable stable@vger.kernel.org Signed-off-by: Johan Hovold johan@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/serial/garmin_gps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/usb/serial/garmin_gps.c +++ b/drivers/usb/serial/garmin_gps.c @@ -1138,8 +1138,8 @@ static void garmin_read_process(struct g send it directly to the tty port */ if (garmin_data_p->flags & FLAGS_QUEUING) { pkt_add(garmin_data_p, data, data_length); - } else if (bulk_data || - getLayerId(data) == GARMIN_LAYERID_APPL) { + } else if (bulk_data || (data_length >= sizeof(u32) && + getLayerId(data) == GARMIN_LAYERID_APPL)) {
spin_lock_irqsave(&garmin_data_p->lock, flags); garmin_data_p->flags |= APP_RESP_SEEN;
From: Masami Hiramatsu mhiramat@kernel.org
commit da0f1f4167e3af69e1d8b32d6d65195ddd2bfb64 upstream.
Fix boottime kprobe events to use API correctly for multiple events.
For example, when we set a multiprobe kprobe events in bootconfig like below,
ftrace.event.kprobes.myevent { probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2" }
This cause an error;
trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2
This shows the 1st argument becomes NULL and multiprobes are merged to 1 probe.
Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2
Cc: Ingo Molnar mingo@kernel.org Cc: stable@vger.kernel.org Fixes: 29a154810546 ("tracing: Change trace_boot to use kprobe_event interface") Reviewed-by: Tom Zanussi zanussi@kernel.org Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/trace/trace_boot.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-)
--- a/kernel/trace/trace_boot.c +++ b/kernel/trace/trace_boot.c @@ -95,24 +95,20 @@ trace_boot_add_kprobe_event(struct xbc_n struct xbc_node *anode; char buf[MAX_BUF_LEN]; const char *val; - int ret; + int ret = 0;
- kprobe_event_cmd_init(&cmd, buf, MAX_BUF_LEN); + xbc_node_for_each_array_value(node, "probes", anode, val) { + kprobe_event_cmd_init(&cmd, buf, MAX_BUF_LEN);
- ret = kprobe_event_gen_cmd_start(&cmd, event, NULL); - if (ret) - return ret; + ret = kprobe_event_gen_cmd_start(&cmd, event, val); + if (ret) + break;
- xbc_node_for_each_array_value(node, "probes", anode, val) { - ret = kprobe_event_add_field(&cmd, val); + ret = kprobe_event_gen_cmd_end(&cmd); if (ret) - return ret; + pr_err("Failed to add probe: %s\n", buf); }
- ret = kprobe_event_gen_cmd_end(&cmd); - if (ret) - pr_err("Failed to add probe: %s\n", buf); - return ret; } #else
From: Masami Hiramatsu mhiramat@kernel.org
commit 5b4dcd2d201a395ad4054067bfae4a07554fbd65 upstream.
Reject the new event which has NULL location for kprobes. For kprobes, user must specify at least the location.
Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote...
Cc: Tom Zanussi zanussi@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: stable@vger.kernel.org Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/trace/trace_kprobe.c | 6 ++++++ 1 file changed, 6 insertions(+)
--- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -940,6 +940,9 @@ EXPORT_SYMBOL_GPL(kprobe_event_cmd_init) * complete command or only the first part of it; in the latter case, * kprobe_event_add_fields() can be used to add more fields following this. * + * Unlikely the synth_event_gen_cmd_start(), @loc must be specified. This + * returns -EINVAL if @loc == NULL. + * * Return: 0 if successful, error otherwise. */ int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, @@ -953,6 +956,9 @@ int __kprobe_event_gen_cmd_start(struct if (cmd->type != DYNEVENT_TYPE_KPROBE) return -EINVAL;
+ if (!loc) + return -EINVAL; + if (kretprobe) snprintf(buf, MAX_EVENT_NAME_LEN, "r:kprobes/%s", name); else
From: Steven Rostedt (VMware) rostedt@goodmis.org
commit d16a8c31077e75ecb9427fbfea59b74eed00f698 upstream.
Running on a slower machine, it is possible that the preempt delay kernel thread may still be executing if the module was immediately removed after added, and this can cause the kernel to crash as the kernel thread might be executing after its code has been removed.
There's no reason that the caller of the code shouldn't just wait for the delay thread to finish, as the thread can also be created by a trigger in the sysfs code, which also has the same issues.
Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com
Cc: stable@vger.kernel.org Fixes: 793937236d1ee ("lib: Add module for testing preemptoff/irqsoff latency tracers") Reported-by: Xiao Yang yangx.jy@cn.fujitsu.com Reviewed-by: Xiao Yang yangx.jy@cn.fujitsu.com Reviewed-by: Joel Fernandes joel@joelfernandes.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/trace/preemptirq_delay_test.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-)
--- a/kernel/trace/preemptirq_delay_test.c +++ b/kernel/trace/preemptirq_delay_test.c @@ -113,22 +113,42 @@ static int preemptirq_delay_run(void *da
for (i = 0; i < s; i++) (testfuncs[i])(i); + + set_current_state(TASK_INTERRUPTIBLE); + while (!kthread_should_stop()) { + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } + + __set_current_state(TASK_RUNNING); + return 0; }
-static struct task_struct *preemptirq_start_test(void) +static int preemptirq_run_test(void) { + struct task_struct *task; + char task_name[50];
snprintf(task_name, sizeof(task_name), "%s_test", test_mode); - return kthread_run(preemptirq_delay_run, NULL, task_name); + task = kthread_run(preemptirq_delay_run, NULL, task_name); + if (IS_ERR(task)) + return PTR_ERR(task); + if (task) + kthread_stop(task); + return 0; }
static ssize_t trigger_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - preemptirq_start_test(); + ssize_t ret; + + ret = preemptirq_run_test(); + if (ret) + return ret; return count; }
@@ -148,11 +168,9 @@ static struct kobject *preemptirq_delay_
static int __init preemptirq_delay_init(void) { - struct task_struct *test_task; int retval;
- test_task = preemptirq_start_test(); - retval = PTR_ERR_OR_ZERO(test_task); + retval = preemptirq_run_test(); if (retval != 0) return retval;
From: Steven Rostedt (VMware) rostedt@goodmis.org
commit 11f5efc3ab66284f7aaacc926e9351d658e2577b upstream.
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults.
Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear.
Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home
Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" tz.stoyanov@gmail.com Suggested-by: Joerg Roedel jroedel@suse.de Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/trace/trace.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
--- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8452,6 +8452,19 @@ static int allocate_trace_buffers(struct */ allocate_snapshot = false; #endif + + /* + * Because of some magic with the way alloc_percpu() works on + * x86_64, we need to synchronize the pgd of all the tables, + * otherwise the trace events that happen in x86_64 page fault + * handlers can't cope with accessing the chance that a + * alloc_percpu()'d memory might be touched in the page fault trace + * event. Oh, and we need to audit all other alloc_percpu() and vmalloc() + * calls in tracing, because something might get triggered within a + * page fault trace event! + */ + vmalloc_sync_mappings(); + return 0; }
From: Jason A. Donenfeld Jason@zx2c4.com
commit a9a8ba90fa5857c2c8a0e32eef2159cec717da11 upstream.
Rather than chunking via PAGE_SIZE, this commit changes the arch implementations to chunk in explicit 4k parts, so that calculations on maximum acceptable latency don't suddenly become invalid on platforms where PAGE_SIZE isn't 4k, such as arm64.
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305") Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Reviewed-by: Eric Biggers ebiggers@google.com Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +- arch/x86/crypto/nhpoly1305-avx2-glue.c | 2 +- arch/x86/crypto/nhpoly1305-sse2-glue.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-)
--- a/arch/arm/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); --- a/arch/arm64/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); --- a/arch/x86/crypto/nhpoly1305-avx2-glue.c +++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2); --- a/arch/x86/crypto/nhpoly1305-sse2-glue.c +++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
From: Jason A. Donenfeld Jason@zx2c4.com
commit 706024a52c614b478b63f7728d202532ce6591a9 upstream.
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that kernel_fpu_enable would not be kept on for more than a 4k chunk, since it disables preemption. The choice of 4k isn't totally scientific, but it's not a bad guess either, and it's what's used in both the x86 poly1305, blake2s, and nhpoly1305 code already (in the form of PAGE_SIZE, which this commit corrects to be explicitly 4k for the former two).
Ard did some back of the envelope calculations and found that at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k means we have a maximum preemption disabling of 20us, which Sebastian confirmed was probably a good limit.
Unfortunately the chunking appears to have been left out of the final patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function") Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function") Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function") Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel") Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation") Cc: Eric Biggers ebiggers@google.com Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Reviewed-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm/crypto/chacha-glue.c | 14 +++++++++++--- arch/arm/crypto/poly1305-glue.c | 15 +++++++++++---- arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++--- arch/arm64/crypto/poly1305-glue.c | 15 +++++++++++---- arch/x86/crypto/blake2s-glue.c | 10 ++++------ arch/x86/crypto/chacha_glue.c | 14 +++++++++++--- arch/x86/crypto/poly1305_glue.c | 13 ++++++------- 7 files changed, 65 insertions(+), 30 deletions(-)
--- a/arch/arm/crypto/chacha-glue.c +++ b/arch/arm/crypto/chacha-glue.c @@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *d return; }
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
--- a/arch/arm/crypto/poly1305-glue.c +++ b/arch/arm/crypto/poly1305-glue.c @@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly130 unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && do_neon) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks_arm(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
--- a/arch/arm64/crypto/chacha-neon-glue.c +++ b/arch/arm64/crypto/chacha-neon-glue.c @@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *d !crypto_simd_usable()) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
--- a/arch/arm64/crypto/poly1305-glue.c +++ b/arch/arm64/crypto/poly1305-glue.c @@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly130 unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && crypto_simd_usable()) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
--- a/arch/x86/crypto/blake2s-glue.c +++ b/arch/x86/crypto/blake2s-glue.c @@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2 const u32 inc) { /* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8); + BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) { blake2s_compress_generic(state, block, nblocks, inc); return; }
- for (;;) { + do { const size_t blocks = min_t(size_t, nblocks, - PAGE_SIZE / BLAKE2S_BLOCK_SIZE); + SZ_4K / BLAKE2S_BLOCK_SIZE);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && @@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2 kernel_fpu_end();
nblocks -= blocks; - if (!nblocks) - break; block += blocks * BLAKE2S_BLOCK_SIZE; - } + } while (nblocks); } EXPORT_SYMBOL(blake2s_compress_arch);
--- a/arch/x86/crypto/chacha_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -154,9 +154,17 @@ void chacha_crypt_arch(u32 *state, u8 *d bytes <= CHACHA_BLOCK_SIZE) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_fpu_begin(); - chacha_dosimd(state, dst, src, bytes, nrounds); - kernel_fpu_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_fpu_begin(); + chacha_dosimd(state, dst, src, todo, nrounds); + kernel_fpu_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
--- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *c struct poly1305_arch_internal *state = ctx;
/* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE || - PAGE_SIZE % POLY1305_BLOCK_SIZE); + BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE || + SZ_4K % POLY1305_BLOCK_SIZE);
if (!IS_ENABLED(CONFIG_AS_AVX) || !static_branch_likely(&poly1305_use_avx) || (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) || @@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *c return; }
- for (;;) { - const size_t bytes = min_t(size_t, len, PAGE_SIZE); + do { + const size_t bytes = min_t(size_t, len, SZ_4K);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512)) @@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *c else poly1305_blocks_avx(ctx, inp, bytes, padbit); kernel_fpu_end(); + len -= bytes; - if (!len) - break; inp += bytes; - } + } while (len); }
static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
From: Christian Borntraeger borntraeger@de.ibm.com
commit 5615e74f48dcc982655543e979b6c3f3f877e6f6 upstream.
In LPAR we will only get an intercept for FC==3 for the PQAP instruction. Running nested under z/VM can result in other intercepts as well as ECA_APIE is an effective bit: If one hypervisor layer has turned this bit off, the end result will be that we will get intercepts for all function codes. Usually the first one will be a query like PQAP(QCI). So the WARN_ON_ONCE is not right. Let us simply remove it.
Cc: Pierre Morel pmorel@linux.ibm.com Cc: Tony Krowiak akrowiak@linux.ibm.com Cc: stable@vger.kernel.org # v5.3+ Fixes: e5282de93105 ("s390: ap: kvm: add PQAP interception for AQIC") Link: https://lore.kernel.org/kvm/20200505083515.2720-1-borntraeger@de.ibm.com Reported-by: Qian Cai cailca@icloud.com Signed-off-by: Christian Borntraeger borntraeger@de.ibm.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Cornelia Huck cohuck@redhat.com Signed-off-by: Christian Borntraeger borntraeger@de.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/s390/kvm/priv.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/arch/s390/kvm/priv.c +++ b/arch/s390/kvm/priv.c @@ -626,10 +626,12 @@ static int handle_pqap(struct kvm_vcpu * * available for the guest are AQIC and TAPQ with the t bit set * since we do not set IC.3 (FIII) we currently will only intercept * the AQIC function code. + * Note: running nested under z/VM can result in intercepts for other + * function codes, e.g. PQAP(QCI). We do not support this and bail out. */ reg0 = vcpu->run->s.regs.gprs[0]; fc = (reg0 >> 24) & 0xff; - if (WARN_ON_ONCE(fc != 0x03)) + if (fc != 0x03) return -EOPNOTSUPP;
/* PQAP instruction is allowed for guest kernel only */
From: Sean Christopherson sean.j.christopherson@intel.com
commit c7cb2d650c9e78c03bd2d1c0db89891825f8c0f4 upstream.
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the RSB has always clobbered RFLAGS, its current incarnation just happens clear CF and ZF in the processs. Relying on the macro to clear CF and ZF is extremely fragile, e.g. commit 089dd8e53126e ("x86/speculation: Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such that the ZF flag is always set.
Reported-by: Qian Cai cai@lca.pw Cc: Rick Edgecombe rick.p.edgecombe@intel.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: stable@vger.kernel.org Fixes: f2fde6a5bcfcf ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit") Signed-off-by: Sean Christopherson sean.j.christopherson@intel.com Message-Id: 20200506035355.2242-1-sean.j.christopherson@intel.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/vmx/vmenter.S | 3 +++ 1 file changed, 3 insertions(+)
--- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -86,6 +86,9 @@ SYM_FUNC_START(vmx_vmexit) /* IMPORTANT: Stuff the RSB immediately after VM-Exit, before RET! */ FILL_RETURN_BUFFER %_ASM_AX, RSB_CLEAR_LOOPS, X86_FEATURE_RETPOLINE
+ /* Clear RFLAGS.CF and RFLAGS.ZF to preserve VM-Exit, i.e. !VM-Fail. */ + or $1, %_ASM_AX + pop %_ASM_AX .Lvmexit_skip_rsb: #endif
From: Marc Zyngier maz@kernel.org
commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream.
When deciding whether a guest has to be stopped we check whether this is a private interrupt or not. Unfortunately, there's an off-by-one bug here, and we fail to recognize a whole range of interrupts as being global (GICv2 SPIs 32-63).
Fix the condition from > to be >=.
Cc: stable@vger.kernel.org Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race") Reported-by: André Przywara andre.przywara@arm.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- virt/kvm/arm/vgic/vgic-mmio.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/virt/kvm/arm/vgic/vgic-mmio.c +++ b/virt/kvm/arm/vgic/vgic-mmio.c @@ -368,7 +368,7 @@ static void vgic_mmio_change_active(stru static void vgic_change_active_prepare(struct kvm_vcpu *vcpu, u32 intid) { if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 || - intid > VGIC_NR_PRIVATE_IRQS) + intid >= VGIC_NR_PRIVATE_IRQS) kvm_arm_halt_guest(vcpu->kvm); }
@@ -376,7 +376,7 @@ static void vgic_change_active_prepare(s static void vgic_change_active_finish(struct kvm_vcpu *vcpu, u32 intid) { if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 || - intid > VGIC_NR_PRIVATE_IRQS) + intid >= VGIC_NR_PRIVATE_IRQS) kvm_arm_resume_guest(vcpu->kvm); }
From: Marc Zyngier maz@kernel.org
commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream.
In the unlikely event that a 32bit vcpu traps into the hypervisor on an instruction that is located right at the end of the 32bit range, the emulation of that instruction is going to increment PC past the 32bit range. This isn't great, as userspace can then observe this value and get a bit confused.
Conversly, userspace can do things like (in the context of a 64bit guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0, set PC to a 64bit value, change PSTATE to AArch32-USR, and observe that PC hasn't been truncated. More confusion.
Fix both by: - truncating PC increments for 32bit guests - sanitizing all 32bit regs every time a core reg is changed by userspace, and that PSTATE indicates a 32bit mode.
Cc: stable@vger.kernel.org Acked-by: Will Deacon will@kernel.org Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm64/kvm/guest.c | 7 +++++++ virt/kvm/arm/hyp/aarch32.c | 8 ++++++-- 2 files changed, 13 insertions(+), 2 deletions(-)
--- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -201,6 +201,13 @@ static int set_core_reg(struct kvm_vcpu }
memcpy((u32 *)regs + off, valp, KVM_REG_SIZE(reg->id)); + + if (*vcpu_cpsr(vcpu) & PSR_MODE32_BIT) { + int i; + + for (i = 0; i < 16; i++) + *vcpu_reg32(vcpu, i) = (u32)*vcpu_reg32(vcpu, i); + } out: return err; } --- a/virt/kvm/arm/hyp/aarch32.c +++ b/virt/kvm/arm/hyp/aarch32.c @@ -125,12 +125,16 @@ static void __hyp_text kvm_adjust_itstat */ void __hyp_text kvm_skip_instr32(struct kvm_vcpu *vcpu, bool is_wide_instr) { + u32 pc = *vcpu_pc(vcpu); bool is_thumb;
is_thumb = !!(*vcpu_cpsr(vcpu) & PSR_AA32_T_BIT); if (is_thumb && !is_wide_instr) - *vcpu_pc(vcpu) += 2; + pc += 2; else - *vcpu_pc(vcpu) += 4; + pc += 4; + + *vcpu_pc(vcpu) = pc; + kvm_adjust_itstate(vcpu); }
From: Mark Rutland mark.rutland@arm.com
commit 027d0c7101f50cf03aeea9eebf484afd4920c8d3 upstream.
The static analyzer in GCC 10 spotted that in huge_pte_alloc() we may pass a NULL pmdp into pte_alloc_map() when pmd_alloc() returns NULL:
| CC arch/arm64/mm/pageattr.o | CC arch/arm64/mm/hugetlbpage.o | from arch/arm64/mm/hugetlbpage.c:10: | arch/arm64/mm/hugetlbpage.c: In function ‘huge_pte_alloc’: | ./arch/arm64/include/asm/pgtable-types.h:28:24: warning: dereference of NULL ‘pmdp’ [CWE-690] [-Wanalyzer-null-dereference] | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’ | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’ | |arch/arm64/mm/hugetlbpage.c:232:10: | |./arch/arm64/include/asm/pgtable-types.h:28:24: | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’ | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’
This can only occur when the kernel cannot allocate a page, and so is unlikely to happen in practice before other systems start failing.
We can avoid this by bailing out if pmd_alloc() fails, as we do earlier in the function if pud_alloc() fails.
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Signed-off-by: Mark Rutland mark.rutland@arm.com Reported-by: Kyrill Tkachov kyrylo.tkachov@arm.com Cc: stable@vger.kernel.org # 4.5.x- Cc: Will Deacon will@kernel.org Signed-off-by: Catalin Marinas catalin.marinas@arm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm64/mm/hugetlbpage.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -230,6 +230,8 @@ pte_t *huge_pte_alloc(struct mm_struct * ptep = (pte_t *)pudp; } else if (sz == (CONT_PTE_SIZE)) { pmdp = pmd_alloc(mm, pudp, addr); + if (!pmdp) + return NULL;
WARN_ON(addr & (sz - 1)); /*
From: Ulf Hansson ulf.hansson@linaro.org
commit 9495b7e92f716ab2bd6814fab5e97ab4a39adfdd upstream.
It's currently the platform driver's responsibility to initialize the pointer, dma_parms, for its corresponding struct device. The benefit with this approach allows us to avoid the initialization and to not waste memory for the struct device_dma_parameters, as this can be decided on a case by case basis.
However, it has turned out that this approach is not very practical. Not only does it lead to open coding, but also to real errors. In principle callers of dma_set_max_seg_size() doesn't check the error code, but just assumes it succeeds.
For these reasons, let's do the initialization from the common platform bus at the device registration point. This also follows the way the PCI devices are being managed, see pci_device_add().
Suggested-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org Tested-by: Haibo Chen haibo.chen@nxp.com Reviewed-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Reviewed-by: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20200422100954.31211-1-ulf.hansson@linaro.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/base/platform.c | 2 ++ include/linux/platform_device.h | 1 + 2 files changed, 3 insertions(+)
--- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -361,6 +361,8 @@ struct platform_object { */ static void setup_pdev_dma_masks(struct platform_device *pdev) { + pdev->dev.dma_parms = &pdev->dma_parms; + if (!pdev->dev.coherent_dma_mask) pdev->dev.coherent_dma_mask = DMA_BIT_MASK(32); if (!pdev->dev.dma_mask) { --- a/include/linux/platform_device.h +++ b/include/linux/platform_device.h @@ -25,6 +25,7 @@ struct platform_device { bool id_auto; struct device dev; u64 platform_dma_mask; + struct device_dma_parameters dma_parms; u32 num_resources; struct resource *resource;
From: Ulf Hansson ulf.hansson@linaro.org
commit f458488425f1cc9a396aa1d09bb00c48783936da upstream.
It's currently the amba driver's responsibility to initialize the pointer, dma_parms, for its corresponding struct device. The benefit with this approach allows us to avoid the initialization and to not waste memory for the struct device_dma_parameters, as this can be decided on a case by case basis.
However, it has turned out that this approach is not very practical. Not only does it lead to open coding, but also to real errors. In principle callers of dma_set_max_seg_size() doesn't check the error code, but just assumes it succeeds.
For these reasons, let's do the initialization from the common amba bus at the device registration point. This also follows the way the PCI devices are being managed, see pci_device_add().
Suggested-by: Christoph Hellwig hch@lst.de Cc: Russell King linux@armlinux.org.uk Cc: stable@vger.kernel.org Tested-by: Haibo Chen haibo.chen@nxp.com Reviewed-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Reviewed-by: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20200422101013.31267-1-ulf.hansson@linaro.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/amba/bus.c | 1 + include/linux/amba/bus.h | 1 + 2 files changed, 2 insertions(+)
--- a/drivers/amba/bus.c +++ b/drivers/amba/bus.c @@ -645,6 +645,7 @@ static void amba_device_initialize(struc dev->dev.release = amba_device_release; dev->dev.bus = &amba_bustype; dev->dev.dma_mask = &dev->dev.coherent_dma_mask; + dev->dev.dma_parms = &dev->dma_parms; dev->res.name = dev_name(&dev->dev); }
--- a/include/linux/amba/bus.h +++ b/include/linux/amba/bus.h @@ -65,6 +65,7 @@ struct amba_device { struct device dev; struct resource res; struct clk *pclk; + struct device_dma_parameters dma_parms; unsigned int periphid; unsigned int cid; struct amba_cs_uci_id uci;
From: Tomas Winkler tomas.winkler@intel.com
commit d76bc8200f9cf8b6746e66b37317ba477eda25c4 upstream.
Disable the MEI driver on LBG SPS (server) platforms, some corner flows such as recovery mode does not work, and the driver doesn't have working use cases.
Cc: stable@vger.kernel.org Signed-off-by: Tomas Winkler tomas.winkler@intel.com Link: https://lore.kernel.org/r/20200428211200.12200-1-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/misc/mei/hw-me.c | 8 ++++++++ drivers/misc/mei/hw-me.h | 4 ++++ drivers/misc/mei/pci-me.c | 2 +- 3 files changed, 13 insertions(+), 1 deletion(-)
--- a/drivers/misc/mei/hw-me.c +++ b/drivers/misc/mei/hw-me.c @@ -1465,6 +1465,13 @@ static const struct mei_cfg mei_me_pch12 MEI_CFG_DMA_128, };
+/* LBG with quirk for SPS Firmware exclusion */ +static const struct mei_cfg mei_me_pch12_sps_cfg = { + MEI_CFG_PCH8_HFS, + MEI_CFG_FW_VER_SUPP, + MEI_CFG_FW_SPS, +}; + /* Tiger Lake and newer devices */ static const struct mei_cfg mei_me_pch15_cfg = { MEI_CFG_PCH8_HFS, @@ -1487,6 +1494,7 @@ static const struct mei_cfg *const mei_c [MEI_ME_PCH8_CFG] = &mei_me_pch8_cfg, [MEI_ME_PCH8_SPS_CFG] = &mei_me_pch8_sps_cfg, [MEI_ME_PCH12_CFG] = &mei_me_pch12_cfg, + [MEI_ME_PCH12_SPS_CFG] = &mei_me_pch12_sps_cfg, [MEI_ME_PCH15_CFG] = &mei_me_pch15_cfg, };
--- a/drivers/misc/mei/hw-me.h +++ b/drivers/misc/mei/hw-me.h @@ -80,6 +80,9 @@ struct mei_me_hw { * servers platforms with quirk for * SPS firmware exclusion. * @MEI_ME_PCH12_CFG: Platform Controller Hub Gen12 and newer + * @MEI_ME_PCH12_SPS_CFG: Platform Controller Hub Gen12 and newer + * servers platforms with quirk for + * SPS firmware exclusion. * @MEI_ME_PCH15_CFG: Platform Controller Hub Gen15 and newer * @MEI_ME_NUM_CFG: Upper Sentinel. */ @@ -93,6 +96,7 @@ enum mei_cfg_idx { MEI_ME_PCH8_CFG, MEI_ME_PCH8_SPS_CFG, MEI_ME_PCH12_CFG, + MEI_ME_PCH12_SPS_CFG, MEI_ME_PCH15_CFG, MEI_ME_NUM_CFG, }; --- a/drivers/misc/mei/pci-me.c +++ b/drivers/misc/mei/pci-me.c @@ -79,7 +79,7 @@ static const struct pci_device_id mei_me {MEI_PCI_DEVICE(MEI_DEV_ID_SPT_2, MEI_ME_PCH8_CFG)}, {MEI_PCI_DEVICE(MEI_DEV_ID_SPT_H, MEI_ME_PCH8_SPS_CFG)}, {MEI_PCI_DEVICE(MEI_DEV_ID_SPT_H_2, MEI_ME_PCH8_SPS_CFG)}, - {MEI_PCI_DEVICE(MEI_DEV_ID_LBG, MEI_ME_PCH12_CFG)}, + {MEI_PCI_DEVICE(MEI_DEV_ID_LBG, MEI_ME_PCH12_SPS_CFG)},
{MEI_PCI_DEVICE(MEI_DEV_ID_BXT_M, MEI_ME_PCH8_CFG)}, {MEI_PCI_DEVICE(MEI_DEV_ID_APL_I, MEI_ME_PCH8_CFG)},
From: H. Nikolaus Schaller hns@goldelico.com
commit c59359a02d14a7256cd508a4886b7d2012df2363 upstream.
so that the driver can load by matching the device tree if compiled as module.
Cc: stable@vger.kernel.org # v5.3+ Fixes: 90b86fcc47b4 ("DRM: Add KMS driver for the Ingenic JZ47xx SoCs") Signed-off-by: H. Nikolaus Schaller hns@goldelico.com Signed-off-by: Paul Cercueil paul@crapouillou.net Link: https://patchwork.freedesktop.org/patch/msgid/1694a29b7a3449b6b662cec33d1b33... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/gpu/drm/ingenic/ingenic-drm.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/gpu/drm/ingenic/ingenic-drm.c +++ b/drivers/gpu/drm/ingenic/ingenic-drm.c @@ -843,6 +843,7 @@ static const struct of_device_id ingenic { .compatible = "ingenic,jz4770-lcd", .data = &jz4770_soc_info }, { /* sentinel */ }, }; +MODULE_DEVICE_TABLE(of, ingenic_drm_of_match);
static struct platform_driver ingenic_drm_driver = { .driver = {
From: Daniel Kolesa daniel@octaforge.org
commit 59dfb0c64d3853d20dc84f4561f28d4f5a2ddc7d upstream.
The dcn20_validate_bandwidth function would have code touching the incorrect registers emitted outside of the boundaries of the DC_FP_START/END macros, at least on ppc64le. Work around the problem by wrapping the whole function instead.
Signed-off-by: Daniel Kolesa daniel@octaforge.org Signed-off-by: Alex Deucher alexander.deucher@amd.com Cc: stable@vger.kernel.org # 5.6.x Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c | 31 +++++++++++++----- 1 file changed, 23 insertions(+), 8 deletions(-)
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c +++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c @@ -3034,25 +3034,32 @@ validate_out: return out; }
- -bool dcn20_validate_bandwidth(struct dc *dc, struct dc_state *context, - bool fast_validate) +/* + * This must be noinline to ensure anything that deals with FP registers + * is contained within this call; previously our compiling with hard-float + * would result in fp instructions being emitted outside of the boundaries + * of the DC_FP_START/END macros, which makes sense as the compiler has no + * idea about what is wrapped and what is not + * + * This is largely just a workaround to avoid breakage introduced with 5.6, + * ideally all fp-using code should be moved into its own file, only that + * should be compiled with hard-float, and all code exported from there + * should be strictly wrapped with DC_FP_START/END + */ +static noinline bool dcn20_validate_bandwidth_fp(struct dc *dc, + struct dc_state *context, bool fast_validate) { bool voltage_supported = false; bool full_pstate_supported = false; bool dummy_pstate_supported = false; double p_state_latency_us;
- DC_FP_START(); p_state_latency_us = context->bw_ctx.dml.soc.dram_clock_change_latency_us; context->bw_ctx.dml.soc.disable_dram_clock_change_vactive_support = dc->debug.disable_dram_clock_change_vactive_support;
if (fast_validate) { - voltage_supported = dcn20_validate_bandwidth_internal(dc, context, true); - - DC_FP_END(); - return voltage_supported; + return dcn20_validate_bandwidth_internal(dc, context, true); }
// Best case, we support full UCLK switch latency @@ -3081,7 +3088,15 @@ bool dcn20_validate_bandwidth(struct dc
restore_dml_state: context->bw_ctx.dml.soc.dram_clock_change_latency_us = p_state_latency_us; + return voltage_supported; +}
+bool dcn20_validate_bandwidth(struct dc *dc, struct dc_state *context, + bool fast_validate) +{ + bool voltage_supported = false; + DC_FP_START(); + voltage_supported = dcn20_validate_bandwidth_fp(dc, context, fast_validate); DC_FP_END(); return voltage_supported; }
From: Oleg Nesterov oleg@redhat.com
commit b5f2006144c6ae941726037120fa1001ddede784 upstream.
Commit cc731525f26a ("signal: Remove kernel interal si_code magic") changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info() to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h> #include <mqueue.h> #include <unistd.h> #include <sys/wait.h> #include <assert.h>
static int notified;
static void sigh(int sig) { notified = 1; }
int main(void) { signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL); assert(fd >= 0);
struct sigevent se = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGIO, }; assert(mq_notify(fd, &se) == 0);
if (!fork()) { assert(setuid(1) == 0); mq_send(fd, "",1,0); return 0; }
wait(NULL); mq_unlink("/mq"); assert(notified); return 0; }
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere] Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic") Reported-by: Yoji yoji.fujihar.min@gmail.com Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Manfred Spraul manfred@colorfullife.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: "Eric W. Biederman" ebiederm@xmission.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Markus Elfring elfring@users.sourceforge.net Cc: 1vier1@web.de Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.c... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- ipc/mqueue.c | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-)
--- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -142,6 +142,7 @@ struct mqueue_inode_info {
struct sigevent notify; struct pid *notify_owner; + u32 notify_self_exec_id; struct user_namespace *notify_user_ns; struct user_struct *user; /* user who created, for accounting */ struct sock *notify_sock; @@ -774,28 +775,44 @@ static void __do_notify(struct mqueue_in * synchronously. */ if (info->notify_owner && info->attr.mq_curmsgs == 1) { - struct kernel_siginfo sig_i; switch (info->notify.sigev_notify) { case SIGEV_NONE: break; - case SIGEV_SIGNAL: - /* sends signal */ + case SIGEV_SIGNAL: { + struct kernel_siginfo sig_i; + struct task_struct *task; + + /* do_mq_notify() accepts sigev_signo == 0, why?? */ + if (!info->notify.sigev_signo) + break;
clear_siginfo(&sig_i); sig_i.si_signo = info->notify.sigev_signo; sig_i.si_errno = 0; sig_i.si_code = SI_MESGQ; sig_i.si_value = info->notify.sigev_value; - /* map current pid/uid into info->owner's namespaces */ rcu_read_lock(); + /* map current pid/uid into info->owner's namespaces */ sig_i.si_pid = task_tgid_nr_ns(current, ns_of_pid(info->notify_owner)); - sig_i.si_uid = from_kuid_munged(info->notify_user_ns, current_uid()); + sig_i.si_uid = from_kuid_munged(info->notify_user_ns, + current_uid()); + /* + * We can't use kill_pid_info(), this signal should + * bypass check_kill_permission(). It is from kernel + * but si_fromuser() can't know this. + * We do check the self_exec_id, to avoid sending + * signals to programs that don't expect them. + */ + task = pid_task(info->notify_owner, PIDTYPE_TGID); + if (task && task->self_exec_id == + info->notify_self_exec_id) { + do_send_sig_info(info->notify.sigev_signo, + &sig_i, task, PIDTYPE_TGID); + } rcu_read_unlock(); - - kill_pid_info(info->notify.sigev_signo, - &sig_i, info->notify_owner); break; + } case SIGEV_THREAD: set_cookie(info->notify_cookie, NOTIFY_WOKENUP); netlink_sendskb(info->notify_sock, info->notify_cookie); @@ -1384,6 +1401,7 @@ retry: info->notify.sigev_signo = notification->sigev_signo; info->notify.sigev_value = notification->sigev_value; info->notify.sigev_notify = SIGEV_SIGNAL; + info->notify_self_exec_id = current->self_exec_id; break; }
From: Roman Penyaev rpenyaev@suse.de
commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream.
This patch does two things:
- fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
- improves performance for events delivery.
The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in ep_poll() function.
This can lead to lost wakeups, because thread, which was woken up, can handle not all the events in ->rdllist. (in better words the problem is described here: https://lkml.org/lkml/2019/10/7/905)
The idea of the current patch is to use init_wait() instead of init_waitqueue_entry().
Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list, thus the next coming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups.
Problem is very well reproduced by the epoll60 test case [1].
Wait entry removal on wakeup has also performance benefits, because there is no need to take a ep->lock and remove wait entry from the queue after the successful wakeup. Here is the timing output of the epoll60 test case:
With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373):
real 0m6.970s user 0m49.786s sys 0m0.113s
After this patch:
real 0m5.220s user 0m36.879s sys 0m0.019s
The other testcase is the stress-epoll [2], where one thread consumes all the events and other threads produce many events:
With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373):
threads events/ms run-time ms 8 5427 1474 16 6163 2596 32 6824 4689 64 7060 9064 128 6991 18309
After this patch:
threads events/ms run-time ms 8 5598 1429 16 7073 2262 32 7502 4265 64 7640 8376 128 7634 16767
(number of "events/ms" represents event bandwidth, thus higher is better; number of "run-time ms" represents overall time spent doing the benchmark, thus lower is better)
[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
Signed-off-by: Roman Penyaev rpenyaev@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Jason Baron jbaron@akamai.com Cc: Khazhismel Kumykov khazhy@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Heiher r@hev.cc Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/eventpoll.c | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1800,7 +1800,6 @@ static int ep_poll(struct eventpoll *ep, { int res = 0, eavail, timed_out = 0; u64 slack = 0; - bool waiter = false; wait_queue_entry_t wait; ktime_t expires, *to = NULL;
@@ -1845,21 +1844,23 @@ fetch_events: */ ep_reset_busy_poll_napi_id(ep);
- /* - * We don't have any available event to return to the caller. We need - * to sleep here, and we will be woken by ep_poll_callback() when events - * become available. - */ - if (!waiter) { - waiter = true; - init_waitqueue_entry(&wait, current); - + do { + /* + * Internally init_wait() uses autoremove_wake_function(), + * thus wait entry is removed from the wait queue on each + * wakeup. Why it is important? In case of several waiters + * each new wakeup will hit the next waiter, giving it the + * chance to harvest new event. Otherwise wakeup can be + * lost. This is also good performance-wise, because on + * normal wakeup path no need to call __remove_wait_queue() + * explicitly, thus ep->lock is not taken, which halts the + * event delivery. + */ + init_wait(&wait); write_lock_irq(&ep->lock); __add_wait_queue_exclusive(&ep->wq, &wait); write_unlock_irq(&ep->lock); - }
- for (;;) { /* * We don't want to sleep if the ep_poll_callback() sends us * a wakeup in between. That's why we set the task state @@ -1889,10 +1890,20 @@ fetch_events: timed_out = 1; break; } - } + + /* We were woken up, thus go and try to harvest some events */ + eavail = 1; + + } while (0);
__set_current_state(TASK_RUNNING);
+ if (!list_empty_careful(&wait.entry)) { + write_lock_irq(&ep->lock); + __remove_wait_queue(&ep->wq, &wait); + write_unlock_irq(&ep->lock); + } + send_events: /* * Try to transfer events to user space. In case we get 0 events and @@ -1903,12 +1914,6 @@ send_events: !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events;
- if (waiter) { - write_lock_irq(&ep->lock); - __remove_wait_queue(&ep->wq, &wait); - write_unlock_irq(&ep->lock); - } - return res; }
From: Khazhismel Kumykov khazhy@google.com
commit 0c54a6a44bf3d41e76ce3f583a6ece267618df2e upstream.
In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch.
[khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman] Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Khazhismel Kumykov khazhy@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Roman Penyaev rpenyaev@suse.de Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Roman Penyaev rpenyaev@suse.de Cc: Heiher r@hev.cc Cc: Jason Baron jbaron@akamai.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/eventpoll.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1149,6 +1149,10 @@ static inline bool chain_epi_lockless(st { struct eventpoll *ep = epi->ep;
+ /* Fast preliminary check */ + if (epi->next != EP_UNACTIVE_PTR) + return false; + /* Check that the same epi has not been just chained from another CPU */ if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR) return false; @@ -1215,16 +1219,12 @@ static int ep_poll_callback(wait_queue_e * chained in ep->ovflist and requeued later on. */ if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) { - if (epi->next == EP_UNACTIVE_PTR && - chain_epi_lockless(epi)) + if (chain_epi_lockless(epi)) + ep_pm_stay_awake_rcu(epi); + } else if (!ep_is_linked(epi)) { + /* In the usual case, add event to ready list. */ + if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) ep_pm_stay_awake_rcu(epi); - goto out_unlock; - } - - /* If this file is already in the ready list we exit soon */ - if (!ep_is_linked(epi) && - list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) { - ep_pm_stay_awake_rcu(epi); }
/*
From: David Hildenbrand david@redhat.com
commit e84fe99b68ce353c37ceeecc95dce9696c976556 upstream.
Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up.
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40
The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger.
Signed-off-by: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Pavel Tatashin pasha.tatashin@soleen.com Reviewed-by: Pankaj Gupta pankaj.gupta.linux@gmail.com Reviewed-by: Baoquan He bhe@redhat.com Reviewed-by: Shile Zhang shile.zhang@linux.alibaba.com Acked-by: Michal Hocko mhocko@suse.com Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Shile Zhang shile.zhang@linux.alibaba.com Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michal Hocko mhocko@kernel.org Cc: Alexander Duyck alexander.duyck@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Oscar Salvador osalvador@suse.de Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 1 + 1 file changed, 1 insertion(+)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1555,6 +1555,7 @@ void set_zone_contiguous(struct zone *zo if (!__pageblock_pfn_to_page(block_start_pfn, block_end_pfn, zone)) return; + cond_resched(); }
/* We confirm that there is no hole */
From: Henry Willard henry.willard@oracle.com
commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.
Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") adds a boost_watermark() function which increases the min watermark in a zone by at least pageblock_nr_pages or the number of pages in a page block.
On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or 512M. It does this regardless of the number of managed pages managed in the zone or the likelihood of success.
This can put the zone immediately under water in terms of allocating pages from the zone, and can cause a small machine to fail immediately due to OoM. Unlike set_recommended_min_free_kbytes(), which substantially increases min_free_kbytes and is tied to THP, boost_watermark() can be called even if THP is not active.
The problem is most likely to appear on architectures such as Arm64 where pageblock_nr_pages is very large.
It is desirable to run the kdump capture kernel in as small a space as possible to avoid wasting memory. In some architectures, such as Arm64, there are restrictions on where the capture kernel can run, and therefore, the space available. A capture kernel running in 768M can fail due to OoM immediately after boost_watermark() sets the min in zone DMA32, where most of the memory is, to 512M. It fails even though there is over 500M of free memory. With boost_watermark() suppressed, the capture kernel can run successfully in 448M.
This patch limits boost_watermark() to boosting a zone's min watermark only when there are enough pages that the boost will produce positive results. In this case that is estimated to be four times as many pages as pageblock_nr_pages.
Mel said:
: There is no harm in marking it stable. Clearly it does not happen very : often but it's not impossible. 32-bit x86 is a lot less common now : which would previously have been vulnerable to triggering this easily. : ppc64 has a larger base page size but typically only has one zone. : arm64 is likely the most vulnerable, particularly when CMA is : configured with a small movable zone.
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Henry Willard henry.willard@oracle.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: David Hildenbrand david@redhat.com Acked-by: Mel Gorman mgorman@techsingularity.net Cc: Vlastimil Babka vbabka@suse.cz Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@orac... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2351,6 +2351,14 @@ static inline void boost_watermark(struc
if (!watermark_boost_factor) return; + /* + * Don't bother in zones that are unlikely to produce results. + * On small machines, including kdump capture kernels running + * in a small area, boosting the watermark can cause an out of + * memory situation immediately. + */ + if ((pageblock_nr_pages * 4) > zone_managed_pages(zone)) + return;
max_boost = mult_frac(zone->_watermark[WMARK_HIGH], watermark_boost_factor, 10000);
From: Jeff Layton jlayton@kernel.org
commit 0fa8263367db9287aa0632f96c1a5f93cc478150 upstream.
Eduard reported a problem mounting cephfs on s390 arch. The feature mask sent by the MDS is little-endian, so we need to convert it before storing and testing against it.
Cc: stable@vger.kernel.org Reported-and-Tested-by: Eduard Shishkin edward6@linux.ibm.com Signed-off-by: Jeff Layton jlayton@kernel.org Reviewed-by: "Yan, Zheng" zyan@redhat.com Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ceph/mds_client.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
--- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -3116,8 +3116,7 @@ static void handle_session(struct ceph_m void *end = p + msg->front.iov_len; struct ceph_mds_session_head *h; u32 op; - u64 seq; - unsigned long features = 0; + u64 seq, features = 0; int wake = 0; bool blacklisted = false;
@@ -3136,9 +3135,8 @@ static void handle_session(struct ceph_m goto bad; /* version >= 3, feature bits */ ceph_decode_32_safe(&p, end, len, bad); - ceph_decode_need(&p, end, len, bad); - memcpy(&features, p, min_t(size_t, len, sizeof(features))); - p += len; + ceph_decode_64_safe(&p, end, features, bad); + p += len - sizeof(features); }
mutex_lock(&mdsc->mutex);
From: Luis Henriques lhenriques@suse.com
commit 12ae44a40a1be891bdc6463f8c7072b4ede746ef upstream.
A misconfigured cephx can easily result in having the kernel client flooding the logs with:
ceph: Can't lookup inode 1 (err: -13)
Change this message to debug level.
Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/44546 Signed-off-by: Luis Henriques lhenriques@suse.com Reviewed-by: Jeff Layton jlayton@kernel.org Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ceph/quota.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/ceph/quota.c +++ b/fs/ceph/quota.c @@ -159,8 +159,8 @@ static struct inode *lookup_quotarealm_i }
if (IS_ERR(in)) { - pr_warn("Can't lookup inode %llx (err: %ld)\n", - realm->ino, PTR_ERR(in)); + dout("Can't lookup inode %llx (err: %ld)\n", + realm->ino, PTR_ERR(in)); qri->timeout = jiffies + msecs_to_jiffies(60 * 1000); /* XXX */ } else { qri->timeout = 0;
From: Oscar Carter oscar.carter@gmx.com
commit 769acc3656d93aaacada814939743361d284fd87 upstream.
Check the return value of gasket_get_bar_index function as it can return a negative one (-EINVAL). If this happens, a negative index is used in the "gasket_dev->bar_data" array.
Addresses-Coverity-ID: 1438542 ("Negative array index read") Fixes: 9a69f5087ccc2 ("drivers/staging: Gasket driver framework + Apex driver") Signed-off-by: Oscar Carter oscar.carter@gmx.com Cc: stable stable@vger.kernel.org Reviewed-by: Richard Yeh rcy@google.com Link: https://lore.kernel.org/r/20200501155118.13380-1-oscar.carter@gmx.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/staging/gasket/gasket_core.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -926,6 +926,10 @@ do_map_region(const struct gasket_dev *g gasket_get_bar_index(gasket_dev, (vma->vm_pgoff << PAGE_SHIFT) + driver_desc->legacy_mmap_address_offset); + + if (bar_index < 0) + return DO_MAP_REGION_INVALID; + phys_base = gasket_dev->bar_data[bar_index].phys_base + phys_offset; while (mapped_bytes < map_length) { /*
From: Luis Chamberlain mcgrof@kernel.org
commit 3740d93e37902b31159a82da2d5c8812ed825404 upstream.
Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did.
This has been reported by Sergey when one triggers a a coredump with the following configuration:
``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ```
The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file.
[ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ```
Cc: stable@vger.kernel.org # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon chainsaw@gentoo.org Reported-by: Sergey Kvachonok ravenexp@gmail.com Tested-by: Sergei Trofimovich slyfox@gentoo.org Signed-off-by: Luis Chamberlain mcgrof@kernel.org Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/coredump.c | 8 ++++++++ kernel/umh.c | 5 +++++ 2 files changed, 13 insertions(+)
--- a/fs/coredump.c +++ b/fs/coredump.c @@ -788,6 +788,14 @@ void do_coredump(const kernel_siginfo_t if (displaced) put_files_struct(displaced); if (!dump_interrupted()) { + /* + * umh disabled with CONFIG_STATIC_USERMODEHELPER_PATH="" would + * have this set to NULL. + */ + if (!cprm.file) { + pr_info("Core dump to |%s disabled\n", cn.corename); + goto close_fail; + } file_start_write(cprm.file); core_dumped = binfmt->core_dump(&cprm); file_end_write(cprm.file); --- a/kernel/umh.c +++ b/kernel/umh.c @@ -544,6 +544,11 @@ EXPORT_SYMBOL_GPL(fork_usermode_blob); * Runs a user-space application. The application is started * asynchronously if wait is not set, and runs as a child of system workqueues. * (ie. it runs with full root capabilities and optimized affinity). + * + * Note: successful return value does not guarantee the helper was called at + * all. You can't rely on sub_info->{init,cleanup} being called even for + * UMH_WAIT_* wait modes as STATIC_USERMODEHELPER_PATH="" turns all helpers + * into a successful no-op. */ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) {
From: Vincent Chen vincent.chen@sifive.com
commit c749bb2d554825e007cbc43b791f54e124dadfce upstream.
The current max_pfn equals to zero. In this case, I found it caused users cannot get some page information through /proc such as kpagecount in v5.6 kernel because of new sanity checks. The following message is displayed by stress-ng test suite with the command "stress-ng --verbose --physpage 1 -t 1" on HiFive unleashed board.
# stress-ng --verbose --physpage 1 -t 1 stress-ng: debug: [109] 4 processors online, 4 processors configured stress-ng: info: [109] dispatching hogs: 1 physpage stress-ng: debug: [109] cache allocate: reducing cache level from L3 (too high) to L0 stress-ng: debug: [109] get_cpu_cache: invalid cache_level: 0 stress-ng: info: [109] cache allocate: using built-in defaults as no suitable cache found stress-ng: debug: [109] cache allocate: default cache size: 2048K stress-ng: debug: [109] starting stressors stress-ng: debug: [109] 1 stressor spawned stress-ng: debug: [110] stress-ng-physpage: started [110] (instance 0) stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd34de000 in /proc/kpagecount, errno=0 (Success) stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd32db078 in /proc/kpagecount, errno=0 (Success) ... stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd32db078 in /proc/kpagecount, errno=0 (Success) stress-ng: debug: [110] stress-ng-physpage: exited [110] (instance 0) stress-ng: debug: [109] process [110] terminated stress-ng: info: [109] successful run completed in 1.00s #
After applying this patch, the kernel can pass the test.
# stress-ng --verbose --physpage 1 -t 1 stress-ng: debug: [104] 4 processors online, 4 processors configured stress-ng: info: [104] dispatching hogs: 1 physpage stress-ng: info: [104] cache allocate: using defaults, can't determine cache details from sysfs stress-ng: debug: [104] cache allocate: default cache size: 2048K stress-ng: debug: [104] starting stressors stress-ng: debug: [104] 1 stressor spawned stress-ng: debug: [105] stress-ng-physpage: started [105] (instance 0) stress-ng: debug: [105] stress-ng-physpage: exited [105] (instance 0) stress-ng: debug: [104] process [105] terminated stress-ng: info: [104] successful run completed in 1.01s #
Cc: stable@vger.kernel.org Signed-off-by: Vincent Chen vincent.chen@sifive.com Reviewed-by: Anup Patel anup@brainfault.org Reviewed-by: Yash Shah yash.shah@sifive.com Tested-by: Yash Shah yash.shah@sifive.com Signed-off-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/riscv/mm/init.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -149,7 +149,8 @@ void __init setup_bootmem(void) memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
set_max_mapnr(PFN_DOWN(mem_size)); - max_low_pfn = PFN_DOWN(memblock_end_of_DRAM()); + max_pfn = PFN_DOWN(memblock_end_of_DRAM()); + max_low_pfn = max_pfn;
#ifdef CONFIG_BLK_DEV_INITRD setup_initrd();
From: Tejun Heo tj@kernel.org
commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation.
However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen.
1. The iocg is in the process of being deactivated by the periodic timer.
2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active.
3. The iocg is deactivated.
4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay.
5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs.
This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups.
The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path.
This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: Tejun Heo tj@kernel.org Reported-by: Vlad Dmitriev vvd@fb.com Cc: stable@vger.kernel.org # v5.4+ Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- block/blk-iocost.c | 117 ++++++++++++++++++++++++----------------- tools/cgroup/iocost_monitor.py | 7 ++ 2 files changed, 77 insertions(+), 47 deletions(-)
--- a/block/blk-iocost.c +++ b/block/blk-iocost.c @@ -469,7 +469,7 @@ struct ioc_gq { */ atomic64_t vtime; atomic64_t done_vtime; - atomic64_t abs_vdebt; + u64 abs_vdebt; u64 last_vtime;
/* @@ -1145,7 +1145,7 @@ static void iocg_kick_waitq(struct ioc_g struct iocg_wake_ctx ctx = { .iocg = iocg }; u64 margin_ns = (u64)(ioc->period_us * WAITQ_TIMER_MARGIN_PCT / 100) * NSEC_PER_USEC; - u64 abs_vdebt, vdebt, vshortage, expires, oexpires; + u64 vdebt, vshortage, expires, oexpires; s64 vbudget; u32 hw_inuse;
@@ -1155,18 +1155,15 @@ static void iocg_kick_waitq(struct ioc_g vbudget = now->vnow - atomic64_read(&iocg->vtime);
/* pay off debt */ - abs_vdebt = atomic64_read(&iocg->abs_vdebt); - vdebt = abs_cost_to_cost(abs_vdebt, hw_inuse); + vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse); if (vdebt && vbudget > 0) { u64 delta = min_t(u64, vbudget, vdebt); u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse), - abs_vdebt); + iocg->abs_vdebt);
atomic64_add(delta, &iocg->vtime); atomic64_add(delta, &iocg->done_vtime); - atomic64_sub(abs_delta, &iocg->abs_vdebt); - if (WARN_ON_ONCE(atomic64_read(&iocg->abs_vdebt) < 0)) - atomic64_set(&iocg->abs_vdebt, 0); + iocg->abs_vdebt -= abs_delta; }
/* @@ -1222,12 +1219,18 @@ static bool iocg_kick_delay(struct ioc_g u64 expires, oexpires; u32 hw_inuse;
+ lockdep_assert_held(&iocg->waitq.lock); + /* debt-adjust vtime */ current_hweight(iocg, NULL, &hw_inuse); - vtime += abs_cost_to_cost(atomic64_read(&iocg->abs_vdebt), hw_inuse); + vtime += abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
- /* clear or maintain depending on the overage */ - if (time_before_eq64(vtime, now->vnow)) { + /* + * Clear or maintain depending on the overage. Non-zero vdebt is what + * guarantees that @iocg is online and future iocg_kick_delay() will + * clear use_delay. Don't leave it on when there's no vdebt. + */ + if (!iocg->abs_vdebt || time_before_eq64(vtime, now->vnow)) { blkcg_clear_delay(blkg); return false; } @@ -1261,9 +1264,12 @@ static enum hrtimer_restart iocg_delay_t { struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer); struct ioc_now now; + unsigned long flags;
+ spin_lock_irqsave(&iocg->waitq.lock, flags); ioc_now(iocg->ioc, &now); iocg_kick_delay(iocg, &now, 0); + spin_unlock_irqrestore(&iocg->waitq.lock, flags);
return HRTIMER_NORESTART; } @@ -1371,14 +1377,13 @@ static void ioc_timer_fn(struct timer_li * should have woken up in the last period and expire idle iocgs. */ list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) { - if (!waitqueue_active(&iocg->waitq) && - !atomic64_read(&iocg->abs_vdebt) && !iocg_is_idle(iocg)) + if (!waitqueue_active(&iocg->waitq) && iocg->abs_vdebt && + !iocg_is_idle(iocg)) continue;
spin_lock(&iocg->waitq.lock);
- if (waitqueue_active(&iocg->waitq) || - atomic64_read(&iocg->abs_vdebt)) { + if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt) { /* might be oversleeping vtime / hweight changes, kick */ iocg_kick_waitq(iocg, &now); iocg_kick_delay(iocg, &now, 0); @@ -1721,28 +1726,49 @@ static void ioc_rqos_throttle(struct rq_ * tests are racy but the races aren't systemic - we only miss once * in a while which is fine. */ - if (!waitqueue_active(&iocg->waitq) && - !atomic64_read(&iocg->abs_vdebt) && + if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt && time_before_eq64(vtime + cost, now.vnow)) { iocg_commit_bio(iocg, bio, cost); return; }
/* - * We're over budget. If @bio has to be issued regardless, - * remember the abs_cost instead of advancing vtime. - * iocg_kick_waitq() will pay off the debt before waking more IOs. + * We activated above but w/o any synchronization. Deactivation is + * synchronized with waitq.lock and we won't get deactivated as long + * as we're waiting or has debt, so we're good if we're activated + * here. In the unlikely case that we aren't, just issue the IO. + */ + spin_lock_irq(&iocg->waitq.lock); + + if (unlikely(list_empty(&iocg->active_list))) { + spin_unlock_irq(&iocg->waitq.lock); + iocg_commit_bio(iocg, bio, cost); + return; + } + + /* + * We're over budget. If @bio has to be issued regardless, remember + * the abs_cost instead of advancing vtime. iocg_kick_waitq() will pay + * off the debt before waking more IOs. + * * This way, the debt is continuously paid off each period with the - * actual budget available to the cgroup. If we just wound vtime, - * we would incorrectly use the current hw_inuse for the entire - * amount which, for example, can lead to the cgroup staying - * blocked for a long time even with substantially raised hw_inuse. + * actual budget available to the cgroup. If we just wound vtime, we + * would incorrectly use the current hw_inuse for the entire amount + * which, for example, can lead to the cgroup staying blocked for a + * long time even with substantially raised hw_inuse. + * + * An iocg with vdebt should stay online so that the timer can keep + * deducting its vdebt and [de]activate use_delay mechanism + * accordingly. We don't want to race against the timer trying to + * clear them and leave @iocg inactive w/ dangling use_delay heavily + * penalizing the cgroup and its descendants. */ if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) { - atomic64_add(abs_cost, &iocg->abs_vdebt); + iocg->abs_vdebt += abs_cost; if (iocg_kick_delay(iocg, &now, cost)) blkcg_schedule_throttle(rqos->q, (bio->bi_opf & REQ_SWAP) == REQ_SWAP); + spin_unlock_irq(&iocg->waitq.lock); return; }
@@ -1759,20 +1785,6 @@ static void ioc_rqos_throttle(struct rq_ * All waiters are on iocg->waitq and the wait states are * synchronized using waitq.lock. */ - spin_lock_irq(&iocg->waitq.lock); - - /* - * We activated above but w/o any synchronization. Deactivation is - * synchronized with waitq.lock and we won't get deactivated as - * long as we're waiting, so we're good if we're activated here. - * In the unlikely case that we are deactivated, just issue the IO. - */ - if (unlikely(list_empty(&iocg->active_list))) { - spin_unlock_irq(&iocg->waitq.lock); - iocg_commit_bio(iocg, bio, cost); - return; - } - init_waitqueue_func_entry(&wait.wait, iocg_wake_fn); wait.wait.private = current; wait.bio = bio; @@ -1804,6 +1816,7 @@ static void ioc_rqos_merge(struct rq_qos struct ioc_now now; u32 hw_inuse; u64 abs_cost, cost; + unsigned long flags;
/* bypass if disabled or for root cgroup */ if (!ioc->enabled || !iocg->level) @@ -1823,15 +1836,28 @@ static void ioc_rqos_merge(struct rq_qos iocg->cursor = bio_end;
/* - * Charge if there's enough vtime budget and the existing request - * has cost assigned. Otherwise, account it as debt. See debt - * handling in ioc_rqos_throttle() for details. + * Charge if there's enough vtime budget and the existing request has + * cost assigned. */ if (rq->bio && rq->bio->bi_iocost_cost && - time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) + time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) { iocg_commit_bio(iocg, bio, cost); - else - atomic64_add(abs_cost, &iocg->abs_vdebt); + return; + } + + /* + * Otherwise, account it as debt if @iocg is online, which it should + * be for the vast majority of cases. See debt handling in + * ioc_rqos_throttle() for details. + */ + spin_lock_irqsave(&iocg->waitq.lock, flags); + if (likely(!list_empty(&iocg->active_list))) { + iocg->abs_vdebt += abs_cost; + iocg_kick_delay(iocg, &now, cost); + } else { + iocg_commit_bio(iocg, bio, cost); + } + spin_unlock_irqrestore(&iocg->waitq.lock, flags); }
static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio) @@ -2001,7 +2027,6 @@ static void ioc_pd_init(struct blkg_poli iocg->ioc = ioc; atomic64_set(&iocg->vtime, now.vnow); atomic64_set(&iocg->done_vtime, now.vnow); - atomic64_set(&iocg->abs_vdebt, 0); atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period)); INIT_LIST_HEAD(&iocg->active_list); iocg->hweight_active = HWEIGHT_WHOLE; --- a/tools/cgroup/iocost_monitor.py +++ b/tools/cgroup/iocost_monitor.py @@ -159,7 +159,12 @@ class IocgStat: else: self.inflight_pct = 0
- self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 + # vdebt used to be an atomic64_t and is now u64, support both + try: + self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 + except: + self.debt_ms = iocg.abs_vdebt.value_() / VTIME_PER_USEC / 1000 + self.use_delay = blkg.use_delay.counter.value_() self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000
From: George Spelvin lkml@sdf.org
commit fd0c42c4dea54335967c5a86f15fc064235a2797 upstream.
and change to pseudorandom numbers, as this is a traffic dithering operation that doesn't need crypto-grade.
The previous code operated in 4 steps:
1. Generate a random byte 0 <= rand_tq <= 255 2. Multiply it by BATADV_TQ_MAX_VALUE - tq 3. Divide by 255 (= BATADV_TQ_MAX_VALUE) 4. Return BATADV_TQ_MAX_VALUE - rand_tq
This would apperar to scale (BATADV_TQ_MAX_VALUE - tq) by a random value between 0/255 and 255/255.
But! The intermediate value between steps 3 and 4 is stored in a u8 variable. So it's truncated, and most of the time, is less than 255, after which the division produces 0. Specifically, if tq is odd, the product is always even, and can never be 255. If tq is even, there's exactly one random byte value that will produce a product byte of 255.
Thus, the return value is 255 (511/512 of the time) or 254 (1/512 of the time).
If we assume that the truncation is a bug, and the code is meant to scale the input, a simpler way of looking at it is that it's returning a random value between tq and BATADV_TQ_MAX_VALUE, inclusive.
Well, we have an optimized function for doing just that.
Fixes: 3c12de9a5c75 ("batman-adv: network coding - code and transmit packets if possible") Signed-off-by: George Spelvin lkml@sdf.org Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/network-coding.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
--- a/net/batman-adv/network-coding.c +++ b/net/batman-adv/network-coding.c @@ -1009,15 +1009,8 @@ static struct batadv_nc_path *batadv_nc_ */ static u8 batadv_nc_random_weight_tq(u8 tq) { - u8 rand_val, rand_tq; - - get_random_bytes(&rand_val, sizeof(rand_val)); - /* randomize the estimated packet loss (max TQ - estimated TQ) */ - rand_tq = rand_val * (BATADV_TQ_MAX_VALUE - tq); - - /* normalize the randomized packet loss */ - rand_tq /= BATADV_TQ_MAX_VALUE; + u8 rand_tq = prandom_u32_max(BATADV_TQ_MAX_VALUE + 1 - tq);
/* convert to (randomized) estimated tq again */ return BATADV_TQ_MAX_VALUE - rand_tq;
From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit f872de8185acf1b48b954ba5bd8f9bc0a0d14016 upstream.
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(), which gets a batadv_hard_iface object from net_dev with increased refcnt and its reference is assigned to a local pointer 'hard_iface'.
When batadv_show_throughput_override() returns, "hard_iface" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The issue happens in the normal path of batadv_show_throughput_override(), which forgets to decrease the refcnt increased by batadv_hardif_get_by_netdev() before the function returns, causing a refcnt leak.
Fix this issue by calling batadv_hardif_put() before the batadv_show_throughput_override() returns in the normal path.
Fixes: 0b5ecc6811bd ("batman-adv: add throughput override attribute to hard_ifaces") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/sysfs.c | 1 + 1 file changed, 1 insertion(+)
--- a/net/batman-adv/sysfs.c +++ b/net/batman-adv/sysfs.c @@ -1190,6 +1190,7 @@ static ssize_t batadv_show_throughput_ov
tp_override = atomic_read(&hard_iface->bat_v.throughput_override);
+ batadv_hardif_put(hard_iface); return sprintf(buff, "%u.%u MBit\n", tp_override / 10, tp_override % 10); }
From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit 6107c5da0fca8b50b4d3215e94d619d38cc4a18c upstream.
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(), which gets a batadv_hard_iface object from net_dev with increased refcnt and its reference is assigned to a local pointer 'hard_iface'.
When batadv_store_throughput_override() returns, "hard_iface" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The issue happens in one error path of batadv_store_throughput_override(). When batadv_parse_throughput() returns NULL, the refcnt increased by batadv_hardif_get_by_netdev() is not decreased, causing a refcnt leak.
Fix this issue by jumping to "out" label when batadv_parse_throughput() returns NULL.
Fixes: 0b5ecc6811bd ("batman-adv: add throughput override attribute to hard_ifaces") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/batman-adv/sysfs.c +++ b/net/batman-adv/sysfs.c @@ -1150,7 +1150,7 @@ static ssize_t batadv_store_throughput_o ret = batadv_parse_throughput(net_dev, buff, "throughput_override", &tp_override); if (!ret) - return count; + goto out;
old_tp_override = atomic_read(&hard_iface->bat_v.throughput_override); if (old_tp_override == tp_override)
From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit 6f91a3f7af4186099dd10fa530dd7e0d9c29747d upstream.
batadv_v_ogm_process() invokes batadv_hardif_neigh_get(), which returns a reference of the neighbor object to "hardif_neigh" with increased refcount.
When batadv_v_ogm_process() returns, "hardif_neigh" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling paths of batadv_v_ogm_process(). When batadv_v_ogm_orig_get() fails to get the orig node and returns NULL, the refcnt increased by batadv_hardif_neigh_get() is not decreased, causing a refcnt leak.
Fix this issue by jumping to "out" label when batadv_v_ogm_orig_get() fails to get the orig node.
Fixes: 9323158ef9f4 ("batman-adv: OGMv2 - implement originators logic") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/bat_v_ogm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/batman-adv/bat_v_ogm.c +++ b/net/batman-adv/bat_v_ogm.c @@ -893,7 +893,7 @@ static void batadv_v_ogm_process(const s
orig_node = batadv_v_ogm_orig_get(bat_priv, ogm_packet->orig); if (!orig_node) - return; + goto out;
neigh_node = batadv_neigh_node_get_or_create(orig_node, if_incoming, ethhdr->h_source);
From: Rick Edgecombe rick.p.edgecombe@intel.com
commit ab5130186d7476dcee0d4e787d19a521ca552ce9 upstream.
As an optimization, cpa_flush() was changed to optionally only flush the range in @cpa if it was small enough. However, this range does not include any direct map aliases changed in cpa_process_alias(). So small set_memory_() calls that touch that alias don't get the direct map changes flushed. This situation can happen when the virtual address taking variants are passed an address in vmalloc or modules space.
In these cases, force a full TLB flush.
Note this issue does not extend to cases where the set_memory_() calls are passed a direct map address, or page array, etc, as the primary target. In those cases the direct map would be flushed.
Fixes: 935f5839827e ("x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation") Signed-off-by: Rick Edgecombe rick.p.edgecombe@intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200424105343.GA20730@hirez.programming.kicks-ass... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/mm/pat/set_memory.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
--- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -42,7 +42,8 @@ struct cpa_data { unsigned long pfn; unsigned int flags; unsigned int force_split : 1, - force_static_prot : 1; + force_static_prot : 1, + force_flush_all : 1; struct page **pages; };
@@ -352,10 +353,10 @@ static void cpa_flush(struct cpa_data *d return; }
- if (cpa->numpages <= tlb_single_page_flush_ceiling) - on_each_cpu(__cpa_flush_tlb, cpa, 1); - else + if (cpa->force_flush_all || cpa->numpages > tlb_single_page_flush_ceiling) flush_tlb_all(); + else + on_each_cpu(__cpa_flush_tlb, cpa, 1);
if (!cache) return; @@ -1595,6 +1596,8 @@ static int cpa_process_alias(struct cpa_ alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY); alias_cpa.curpage = 0;
+ cpa->force_flush_all = 1; + ret = __change_page_attr_set_clr(&alias_cpa, 0); if (ret) return ret; @@ -1615,6 +1618,7 @@ static int cpa_process_alias(struct cpa_ alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY); alias_cpa.curpage = 0;
+ cpa->force_flush_all = 1; /* * The high mapping range is imprecise, so ignore the * return value.
From: Josh Poimboeuf jpoimboe@redhat.com
commit 06a9750edcffa808494d56da939085c35904e618 upstream.
The PUSH_AND_CLEAR_REGS macro zeroes each register immediately after pushing it. If an NMI or exception hits after a register is cleared, but before the UNWIND_HINT_REGS annotation, the ORC unwinder will wrongly think the previous value of the register was zero. This can confuse the unwinding process and cause it to exit early.
Because ORC is simpler than DWARF, there are a limited number of unwind annotation states, so it's not possible to add an individual unwind hint after each push/clear combination. Instead, the register clearing instructions need to be consolidated and moved to after the UNWIND_HINT_REGS annotation.
Fixes: 3f01daecd545 ("x86/entry/64: Introduce the PUSH_AND_CLEAN_REGS macro") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/68fd3d0bc92ae2d62ff7879d15d3684217d51f08.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/calling.h | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-)
--- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -98,13 +98,6 @@ For 32-bit we have the following convent #define SIZEOF_PTREGS 21*8
.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0 - /* - * Push registers and sanitize registers of values that a - * speculation attack might otherwise want to exploit. The - * lower registers are likely clobbered well before they - * could be put to use in a speculative execution gadget. - * Interleave XOR with PUSH for better uop scheduling: - */ .if \save_ret pushq %rsi /* pt_regs->si */ movq 8(%rsp), %rsi /* temporarily store the return address in %rsi */ @@ -114,34 +107,43 @@ For 32-bit we have the following convent pushq %rsi /* pt_regs->si */ .endif pushq \rdx /* pt_regs->dx */ - xorl %edx, %edx /* nospec dx */ pushq %rcx /* pt_regs->cx */ - xorl %ecx, %ecx /* nospec cx */ pushq \rax /* pt_regs->ax */ pushq %r8 /* pt_regs->r8 */ - xorl %r8d, %r8d /* nospec r8 */ pushq %r9 /* pt_regs->r9 */ - xorl %r9d, %r9d /* nospec r9 */ pushq %r10 /* pt_regs->r10 */ - xorl %r10d, %r10d /* nospec r10 */ pushq %r11 /* pt_regs->r11 */ - xorl %r11d, %r11d /* nospec r11*/ pushq %rbx /* pt_regs->rbx */ - xorl %ebx, %ebx /* nospec rbx*/ pushq %rbp /* pt_regs->rbp */ - xorl %ebp, %ebp /* nospec rbp*/ pushq %r12 /* pt_regs->r12 */ - xorl %r12d, %r12d /* nospec r12*/ pushq %r13 /* pt_regs->r13 */ - xorl %r13d, %r13d /* nospec r13*/ pushq %r14 /* pt_regs->r14 */ - xorl %r14d, %r14d /* nospec r14*/ pushq %r15 /* pt_regs->r15 */ - xorl %r15d, %r15d /* nospec r15*/ UNWIND_HINT_REGS + .if \save_ret pushq %rsi /* return address on top of stack */ .endif + + /* + * Sanitize registers of values that a speculation attack might + * otherwise want to exploit. The lower registers are likely clobbered + * well before they could be put to use in a speculative execution + * gadget. + */ + xorl %edx, %edx /* nospec dx */ + xorl %ecx, %ecx /* nospec cx */ + xorl %r8d, %r8d /* nospec r8 */ + xorl %r9d, %r9d /* nospec r9 */ + xorl %r10d, %r10d /* nospec r10 */ + xorl %r11d, %r11d /* nospec r11 */ + xorl %ebx, %ebx /* nospec rbx */ + xorl %ebp, %ebp /* nospec rbp */ + xorl %r12d, %r12d /* nospec r12 */ + xorl %r13d, %r13d /* nospec r13 */ + xorl %r14d, %r14d /* nospec r14 */ + xorl %r15d, %r15d /* nospec r15 */ + .endm
.macro POP_REGS pop_rdi=1 skip_r11rcx=0
From: Josh Poimboeuf jpoimboe@redhat.com
commit 1fb143634a38095b641a3a21220774799772dc4c upstream.
In swapgs_restore_regs_and_return_to_usermode, after the stack is switched to the trampoline stack, the existing UNWIND_HINT_REGS hint is no longer valid, which can result in the following ORC unwinder warning:
WARNING: can't dereference registers at 000000003aeb0cdd for ip swapgs_restore_regs_and_return_to_usermode+0x93/0xa0
For full correctness, we could try to add complicated unwind hints so the unwinder could continue to find the registers, but when when it's this close to kernel exit, unwind hints aren't really needed anymore and it's fine to just use an empty hint which tells the unwinder to stop.
For consistency, also move the UNWIND_HINT_EMPTY in entry_SYSCALL_64_after_hwframe to a similar location.
Fixes: 3e3b9293d392 ("x86/entry/64: Return to userspace from the trampoline stack") Reported-by: Vince Weaver vincent.weaver@maine.edu Reported-by: Dave Jones dsj@fb.com Reported-by: Dr. David Alan Gilbert dgilbert@redhat.com Reported-by: Joe Mario jmario@redhat.com Reported-by: Jann Horn jannh@google.com Reported-by: Linus Torvalds torvalds@linux-foundation.org Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/60ea8f562987ed2d9ace2977502fe481c0d7c9a0.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -249,7 +249,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_h */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - UNWIND_HINT_EMPTY POP_REGS pop_rdi=0 skip_r11rcx=1
/* @@ -258,6 +257,7 @@ syscall_return_via_sysret: */ movq %rsp, %rdi movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp + UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */ pushq (%rdi) /* RDI */ @@ -637,6 +637,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_ */ movq %rsp, %rdi movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp + UNWIND_HINT_EMPTY
/* Copy the IRET frame to the trampoline stack. */ pushq 6*8(%rdi) /* SS */
From: Josh Poimboeuf jpoimboe@redhat.com
commit 96c64806b4bf35f5edb465cafa6cec490e424a30 upstream.
UNWIND_HINT_FUNC has some limitations: specifically, it doesn't reset all the registers to undefined. This causes objtool to get confused about the RBP push in __switch_to_asm(), resulting in bad ORC data.
While __switch_to_asm() does do some stack magic, it's otherwise a normal callable-from-C function, so just annotate it as a function, which makes objtool happy and allows it to produces the correct hints automatically.
Fixes: 8c1f75587a18 ("x86/entry/64: Add unwind hint annotations") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/03d0411920d10f7418f2e909210d8e9a3b2ab081.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -279,8 +279,7 @@ SYM_CODE_END(entry_SYSCALL_64) * %rdi: prev task * %rsi: next task */ -SYM_CODE_START(__switch_to_asm) - UNWIND_HINT_FUNC +SYM_FUNC_START(__switch_to_asm) /* * Save callee-saved registers * This must match the order in inactive_task_frame @@ -321,7 +320,7 @@ SYM_CODE_START(__switch_to_asm) popq %rbp
jmp __switch_to -SYM_CODE_END(__switch_to_asm) +SYM_FUNC_END(__switch_to_asm)
/* * A newly forked process directly context switches into this address.
From: Jann Horn jannh@google.com
commit f977df7b7ca45a4ac4b66d30a8931d0434c394b1 upstream.
The LEAQ instruction in rewind_stack_do_exit() moves the stack pointer directly below the pt_regs at the top of the task stack before calling do_exit(). Tell the unwinder to expect pt_regs.
Fixes: 8c1f75587a18 ("x86/entry/64: Add unwind hint annotations") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/68c33e17ae5963854916a46f522624f8e1d264f2.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -1739,7 +1739,7 @@ SYM_CODE_START(rewind_stack_do_exit)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rax leaq -PTREGS_SIZE(%rax), %rsp - UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE + UNWIND_HINT_REGS
call do_exit SYM_CODE_END(rewind_stack_do_exit)
From: Miroslav Benes mbenes@suse.cz
commit f1d9a2abff66aa8156fbc1493abed468db63ea48 upstream.
When unwinding an inactive task, the ORC unwinder skips the first frame by default. If both the 'regs' and 'first_frame' parameters of unwind_start() are NULL, 'state->sp' and 'first_frame' are later initialized to the same value for an inactive task. Given there is a "less than or equal to" comparison used at the end of __unwind_start() for skipping stack frames, the first frame is skipped.
Drop the equal part of the comparison and make the behavior equivalent to the frame pointer unwinder.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/7f08db872ab59e807016910acdbe82f744de7065.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -651,7 +651,7 @@ void __unwind_start(struct unwind_state /* Otherwise, skip ahead to the user-specified starting frame: */ while (!unwind_done(state) && (!on_stack(&state->stack_info, first_frame, sizeof(long)) || - state->sp <= (unsigned long)first_frame)) + state->sp < (unsigned long)first_frame)) unwind_next_frame(state);
return;
From: Josh Poimboeuf jpoimboe@redhat.com
commit 98d0c8ebf77e0ba7c54a9ae05ea588f0e9e3f46e upstream.
If the unwinder is called before the ORC data has been initialized, orc_find() returns NULL, and it tries to fall back to using frame pointers. This can cause some unexpected warnings during boot.
Move the 'orc_init' check from orc_find() to __unwind_init(), so that it doesn't even try to unwind from an uninitialized state.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/069d1499ad606d85532eb32ce39b2441679667d5.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -142,9 +142,6 @@ static struct orc_entry *orc_find(unsign { static struct orc_entry *orc;
- if (!orc_init) - return NULL; - if (ip == 0) return &null_orc_entry;
@@ -585,6 +582,9 @@ EXPORT_SYMBOL_GPL(unwind_next_frame); void __unwind_start(struct unwind_state *state, struct task_struct *task, struct pt_regs *regs, unsigned long *first_frame) { + if (!orc_init) + goto done; + memset(state, 0, sizeof(*state)); state->task = task;
From: Josh Poimboeuf jpoimboe@redhat.com
commit a0f81bf26888048100bf017fadf438a5bdffa8d8 upstream.
If the ORC entry type is unknown, nothing else can be done other than reporting an error. Exit the function instead of breaking out of the switch statement.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/a7fa668ca6eabbe81ab18b2424f15adbbfdc810a.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -531,7 +531,7 @@ bool unwind_next_frame(struct unwind_sta default: orc_warn("unknown .orc_unwind entry type %d for ip %pB\n", orc->type, (void *)orig_ip); - break; + goto err; }
/* Find BP: */
From: Josh Poimboeuf jpoimboe@redhat.com
commit 81b67439d147677d844d492fcbd03712ea438f42 upstream.
The following execution path is possible:
fsnotify() [ realign the stack and store previous SP in R10 ] <IRQ> [ only IRET regs saved ] common_interrupt() interrupt_entry() <NMI> [ full pt_regs saved ] ... [ unwind stack ]
When the unwinder goes through the NMI and the IRQ on the stack, and then sees fsnotify(), it doesn't have access to the value of R10, because it only has the five IRET registers. So the unwind stops prematurely.
However, because the interrupt_entry() code is careful not to clobber R10 before saving the full regs, the unwinder should be able to read R10 from the previously saved full pt_regs associated with the NMI.
Handle this case properly. When encountering an IRET regs frame immediately after a full pt_regs frame, use the pt_regs as a backup which can be used to get the C register values.
Also, note that a call frame resets the 'prev_regs' value, because a function is free to clobber the registers. For this fix to work, the IRET and full regs frames must be adjacent, with no FUNC frames in between. So replace the FUNC hint in interrupt_entry() with an IRET_REGS hint.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/97a408167cc09f1cfa0de31a7b70dd88868d743f.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 4 +-- arch/x86/include/asm/unwind.h | 2 - arch/x86/kernel/unwind_orc.c | 51 ++++++++++++++++++++++++++++++++---------- 3 files changed, 43 insertions(+), 14 deletions(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -511,7 +511,7 @@ SYM_CODE_END(spurious_entries_start) * +----------------------------------------------------+ */ SYM_CODE_START(interrupt_entry) - UNWIND_HINT_FUNC + UNWIND_HINT_IRET_REGS offset=16 ASM_CLAC cld
@@ -543,9 +543,9 @@ SYM_CODE_START(interrupt_entry) pushq 5*8(%rdi) /* regs->eflags */ pushq 4*8(%rdi) /* regs->cs */ pushq 3*8(%rdi) /* regs->ip */ + UNWIND_HINT_IRET_REGS pushq 2*8(%rdi) /* regs->orig_ax */ pushq 8(%rdi) /* return address */ - UNWIND_HINT_FUNC
movq (%rdi), %rdi jmp 2f --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -19,7 +19,7 @@ struct unwind_state { #if defined(CONFIG_UNWINDER_ORC) bool signal, full_regs; unsigned long sp, bp, ip; - struct pt_regs *regs; + struct pt_regs *regs, *prev_regs; #elif defined(CONFIG_UNWINDER_FRAME_POINTER) bool got_irq; unsigned long *bp, *orig_sp, ip; --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -378,9 +378,38 @@ static bool deref_stack_iret_regs(struct return true; }
+/* + * If state->regs is non-NULL, and points to a full pt_regs, just get the reg + * value from state->regs. + * + * Otherwise, if state->regs just points to IRET regs, and the previous frame + * had full regs, it's safe to get the value from the previous regs. This can + * happen when early/late IRQ entry code gets interrupted by an NMI. + */ +static bool get_reg(struct unwind_state *state, unsigned int reg_off, + unsigned long *val) +{ + unsigned int reg = reg_off/8; + + if (!state->regs) + return false; + + if (state->full_regs) { + *val = ((unsigned long *)state->regs)[reg]; + return true; + } + + if (state->prev_regs) { + *val = ((unsigned long *)state->prev_regs)[reg]; + return true; + } + + return false; +} + bool unwind_next_frame(struct unwind_state *state) { - unsigned long ip_p, sp, orig_ip = state->ip, prev_sp = state->sp; + unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; enum stack_type prev_type = state->stack_info.type; struct orc_entry *orc; bool indirect = false; @@ -442,39 +471,35 @@ bool unwind_next_frame(struct unwind_sta break;
case ORC_REG_R10: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, r10), &sp)) { orc_warn("missing regs for base reg R10 at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->r10; break;
case ORC_REG_R13: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, r13), &sp)) { orc_warn("missing regs for base reg R13 at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->r13; break;
case ORC_REG_DI: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, di), &sp)) { orc_warn("missing regs for base reg DI at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->di; break;
case ORC_REG_DX: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, dx), &sp)) { orc_warn("missing regs for base reg DX at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->dx; break;
default: @@ -501,6 +526,7 @@ bool unwind_next_frame(struct unwind_sta
state->sp = sp; state->regs = NULL; + state->prev_regs = NULL; state->signal = false; break;
@@ -512,6 +538,7 @@ bool unwind_next_frame(struct unwind_sta }
state->regs = (struct pt_regs *)sp; + state->prev_regs = NULL; state->full_regs = true; state->signal = true; break; @@ -523,6 +550,8 @@ bool unwind_next_frame(struct unwind_sta goto err; }
+ if (state->full_regs) + state->prev_regs = state->regs; state->regs = (void *)sp - IRET_FRAME_OFFSET; state->full_regs = false; state->signal = true; @@ -537,8 +566,8 @@ bool unwind_next_frame(struct unwind_sta /* Find BP: */ switch (orc->bp_reg) { case ORC_REG_UNDEFINED: - if (state->regs && state->full_regs) - state->bp = state->regs->bp; + if (get_reg(state, offsetof(struct pt_regs, bp), &tmp)) + state->bp = tmp; break;
case ORC_REG_PREV_SP:
From: Suravee Suthikulpanit suravee.suthikulpanit@amd.com
commit 637543a8d61c6afe4e9be64bfb43c78701a83375 upstream.
Current logic incorrectly uses the enum ioapic_irq_destination_types to check the posted interrupt destination types. However, the value was set using APIC_DM_XXX macros, which are left-shifted by 8 bits.
Fixes by using the APIC_DM_FIXED and APIC_DM_LOWEST instead.
Fixes: (fdcf75621375 'KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes') Cc: Alexander Graf graf@amazon.com Signed-off-by: Suravee Suthikulpanit suravee.suthikulpanit@amd.com Message-Id: 1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com Reviewed-by: Maxim Levitsky mlevitsk@redhat.com Tested-by: Maxim Levitsky mlevitsk@redhat.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/include/asm/kvm_host.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1664,8 +1664,8 @@ void kvm_set_msi_irq(struct kvm *kvm, st static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq) { /* We can only post Fixed and LowPrio IRQs */ - return (irq->delivery_mode == dest_Fixed || - irq->delivery_mode == dest_LowestPrio); + return (irq->delivery_mode == APIC_DM_FIXED || + irq->delivery_mode == APIC_DM_LOWEST); }
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
From: Janakarajan Natarajan Janakarajan.Natarajan@amd.com
commit 996ed22c7a5251d76dcdfe5026ef8230e90066d9 upstream.
When trying to lock read-only pages, sev_pin_memory() fails because FOLL_WRITE is used as the flag for get_user_pages_fast().
Commit 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") updated the get_user_pages_fast() call sites to use flags, but incorrectly updated the call in sev_pin_memory(). As the original coding of this call was correct, revert the change made by that commit.
Fixes: 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") Signed-off-by: Janakarajan Natarajan Janakarajan.Natarajan@amd.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Ira Weiny ira.weiny@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Sean Christopherson sean.j.christopherson@intel.com Cc: Vitaly Kuznetsov vkuznets@redhat.com Cc: Wanpeng Li wanpengli@tencent.com Cc: Jim Mattson jmattson@google.com Cc: Joerg Roedel joro@8bytes.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@redhat.com Cc: Borislav Petkov bp@alien8.de Cc: "H . Peter Anvin" hpa@zytor.com Cc: Mike Marshall hubcap@omnibond.com Cc: Brijesh Singh brijesh.singh@amd.com Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.co... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1886,7 +1886,7 @@ static struct page **sev_pin_memory(stru return NULL;
/* Pin the user virtual address. */ - npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); + npinned = get_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages); if (npinned != npages) { pr_err("SEV: Failure locking %lu pages.\n", npages); goto err;
From: Guillaume Nault gnault@redhat.com
commit ea64d8d6c675c0bb712689b13810301de9d8f77a upstream.
If the UDP header of a local VXLAN endpoint is NAT-ed, and the VXLAN device has disabled UDP checksums and enabled Tx checksum offloading, then the skb passed to udp_manip_pkt() has hdr->check == 0 (outer checksum disabled) and skb->ip_summed == CHECKSUM_PARTIAL (inner packet checksum offloaded).
Because of the ->ip_summed value, udp_manip_pkt() tries to update the outer checksum with the new address and port, leading to an invalid checksum sent on the wire, as the original null checksum obviously didn't take the old address and port into account.
So, we can't take ->ip_summed into account in udp_manip_pkt(), as it might not refer to the checksum we're acting on. Instead, we can base the decision to update the UDP checksum entirely on the value of hdr->check, because it's null if and only if checksum is disabled:
* A fully computed checksum can't be 0, since a 0 checksum is represented by the CSUM_MANGLED_0 value instead.
* A partial checksum can't be 0, since the pseudo-header always adds at least one non-zero value (the UDP protocol type 0x11) and adding more values to the sum can't make it wrap to 0 as the carry is then added to the wrapped number.
* A disabled checksum uses the special value 0.
The problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented.
Fixes: 5b1158e909ec ("[NETFILTER]: Add NAT support for nf_conntrack") Signed-off-by: Guillaume Nault gnault@redhat.com Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/netfilter/nf_nat_proto.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
--- a/net/netfilter/nf_nat_proto.c +++ b/net/netfilter/nf_nat_proto.c @@ -68,15 +68,13 @@ static bool udp_manip_pkt(struct sk_buff enum nf_nat_manip_type maniptype) { struct udphdr *hdr; - bool do_csum;
if (skb_ensure_writable(skb, hdroff + sizeof(*hdr))) return false;
hdr = (struct udphdr *)(skb->data + hdroff); - do_csum = hdr->check || skb->ip_summed == CHECKSUM_PARTIAL; + __udp_manip_pkt(skb, iphdroff, hdr, tuple, maniptype, !!hdr->check);
- __udp_manip_pkt(skb, iphdroff, hdr, tuple, maniptype, do_csum); return true; }
From: Arnd Bergmann arnd@arndb.de
commit c165d57b552aaca607fa5daf3fb524a6efe3c5a3 upstream.
gcc-10 points out that a code path exists where a pointer to a stack variable may be passed back to the caller:
net/netfilter/nfnetlink_osf.c: In function 'nf_osf_hdr_ctx_init': cc1: warning: function may return address of local variable [-Wreturn-local-addr] net/netfilter/nfnetlink_osf.c:171:16: note: declared here 171 | struct tcphdr _tcph; | ^~~~~
I am not sure whether this can happen in practice, but moving the variable declaration into the callers avoids the problem.
Fixes: 31a9c29210e2 ("netfilter: nf_osf: add struct nf_osf_hdr_ctx") Signed-off-by: Arnd Bergmann arnd@arndb.de Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/netfilter/nfnetlink_osf.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
--- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -165,12 +165,12 @@ static bool nf_osf_match_one(const struc static const struct tcphdr *nf_osf_hdr_ctx_init(struct nf_osf_hdr_ctx *ctx, const struct sk_buff *skb, const struct iphdr *ip, - unsigned char *opts) + unsigned char *opts, + struct tcphdr *_tcph) { const struct tcphdr *tcp; - struct tcphdr _tcph;
- tcp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(struct tcphdr), &_tcph); + tcp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(struct tcphdr), _tcph); if (!tcp) return NULL;
@@ -205,10 +205,11 @@ nf_osf_match(const struct sk_buff *skb, int fmatch = FMATCH_WRONG; struct nf_osf_hdr_ctx ctx; const struct tcphdr *tcp; + struct tcphdr _tcph;
memset(&ctx, 0, sizeof(ctx));
- tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts); + tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts, &_tcph); if (!tcp) return false;
@@ -265,10 +266,11 @@ bool nf_osf_find(const struct sk_buff *s const struct nf_osf_finger *kf; struct nf_osf_hdr_ctx ctx; const struct tcphdr *tcp; + struct tcphdr _tcph;
memset(&ctx, 0, sizeof(ctx));
- tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts); + tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts, &_tcph); if (!tcp) return false;
From: Paolo Bonzini pbonzini@redhat.com
commit 8be8f932e3db5fe4ed178b8892eeffeab530273a upstream.
Commit f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") introduces the following infinite loop:
BUG: stack guard page was hit at 000000008f595917 \ (stack is 00000000bdefe5a4..00000000ae2b06f5) kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI RIP: 0010:kvm_set_irq+0x51/0x160 [kvm] Call Trace: irqfd_resampler_ack+0x32/0x90 [kvm] kvm_notify_acked_irq+0x62/0xd0 [kvm] kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm] ioapic_set_irq+0x20e/0x240 [kvm] kvm_ioapic_set_irq+0x5c/0x80 [kvm] kvm_set_irq+0xbb/0x160 [kvm] ? kvm_hv_set_sint+0x20/0x20 [kvm] irqfd_resampler_ack+0x32/0x90 [kvm] kvm_notify_acked_irq+0x62/0xd0 [kvm] kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm] ioapic_set_irq+0x20e/0x240 [kvm] kvm_ioapic_set_irq+0x5c/0x80 [kvm] kvm_set_irq+0xbb/0x160 [kvm] ? kvm_hv_set_sint+0x20/0x20 [kvm] ....
The re-entrancy happens because the irq state is the OR of the interrupt state and the resamplefd state. That is, we don't want to show the state as 0 until we've had a chance to set the resamplefd. But if the interrupt has _not_ gone low then ioapic_set_irq is invoked again, causing an infinite loop.
This can only happen for a level-triggered interrupt, otherwise irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID high and then low. Fortunately, in the case of level-triggered interrupts the VMEXIT already happens because TMR is set. Thus, fix the bug by restricting the lazy invocation of the ack notifier to edge-triggered interrupts, the only ones that need it.
Tested-by: Suravee Suthikulpanit suravee.suthikulpanit@amd.com Reported-by: borisvk@bstnet.org Suggested-by: Paolo Bonzini pbonzini@redhat.com Link: https://www.spinics.net/lists/kvm/msg213512.html Fixes: f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489 Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/ioapic.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
--- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -225,12 +225,12 @@ static int ioapic_set_irq(struct kvm_ioa }
/* - * AMD SVM AVIC accelerate EOI write and do not trap, - * in-kernel IOAPIC will not be able to receive the EOI. - * In this case, we do lazy update of the pending EOI when - * trying to set IOAPIC irq. + * AMD SVM AVIC accelerate EOI write iff the interrupt is edge + * triggered, in which case the in-kernel IOAPIC will not be able + * to receive the EOI. In this case, we do a lazy update of the + * pending EOI when trying to set IOAPIC irq. */ - if (kvm_apicv_activated(ioapic->kvm)) + if (edge && kvm_apicv_activated(ioapic->kvm)) ioapic_lazy_update_eoi(ioapic, irq);
/*
From: Josh Poimboeuf jpoimboe@redhat.com
commit d8dd25a461e4eec7190cb9d66616aceacc5110ad upstream.
When the current frame address (CFA) is stored on the stack (i.e., cfa->base == CFI_SP_INDIRECT), objtool neglects to adjust the stack offset when there are subsequent pushes or pops. This results in bad ORC data at the end of the ENTER_IRQ_STACK macro, when it puts the previous stack pointer on the stack and does a subsequent push.
This fixes the following unwinder warning:
WARNING: can't dereference registers at 00000000f0a6bdba for ip interrupt_entry+0x9f/0xa0
Fixes: 627fce14809b ("objtool: Add ORC unwind table generation") Reported-by: Vince Weaver vincent.weaver@maine.edu Reported-by: Dave Jones dsj@fb.com Reported-by: Steven Rostedt rostedt@goodmis.org Reported-by: Vegard Nossum vegard.nossum@oracle.com Reported-by: Joe Mario jmario@redhat.com Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/853d5d691b29e250333332f09b8e27410b2d9924.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1403,7 +1403,7 @@ static int update_insn_state_regs(struct struct cfi_reg *cfa = &state->cfa; struct stack_op *op = &insn->stack_op;
- if (cfa->base != CFI_SP) + if (cfa->base != CFI_SP && cfa->base != CFI_SP_INDIRECT) return 0;
/* push */
From: Julia Lawall Julia.Lawall@inria.fr
commit fb3637a113349f53830f7d6ca45891b7192cd28f upstream.
Elsewhere in the file, there is a list_for_each_entry with &vdev->resv_regions as the second argument, suggesting that &vdev->resv_regions is the list head. So exchange the arguments on the list_add call to put the list head in the second argument.
Fixes: 2a5a31487445 ("iommu/virtio: Add probe request") Signed-off-by: Julia Lawall Julia.Lawall@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Reviewed-by: Jean-Philippe Brucker jean-philippe@linaro.org Link: https://lore.kernel.org/r/1588704467-13431-1-git-send-email-Julia.Lawall@inr... Signed-off-by: Joerg Roedel jroedel@suse.de
--- drivers/iommu/virtio-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -453,7 +453,7 @@ static int viommu_add_resv_mem(struct vi if (!region) return -ENOMEM;
- list_add(&vdev->resv_regions, ®ion->list); + list_add(®ion->list, &vdev->resv_regions); return 0; }
From: Ivan Delalande colona@arista.com
commit e08df079b23e2e982df15aa340bfbaf50f297504 upstream.
If the trapping instruction contains a ':', for a memory access through segment registers for example, the sed substitution will insert the '*' marker in the middle of the instruction instead of the line address:
2b: 65 48 0f c7 0f cmpxchg16b %gs:*(%rdi) <-- trapping instruction
I started to think I had forgotten some quirk of the assembly syntax before noticing that it was actually coming from the script. Fix it to add the address marker at the right place for these instructions:
28: 49 8b 06 mov (%r14),%rax 2b:* 65 48 0f c7 0f cmpxchg16b %gs:(%rdi) <-- trapping instruction 30: 0f 94 c0 sete %al
Fixes: 18ff44b189e2 ("scripts/decodecode: make faulting insn ptr more robust") Signed-off-by: Ivan Delalande colona@arista.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Borislav Petkov bp@suse.de Link: http://lkml.kernel.org/r/20200419223653.GA31248@visor Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- scripts/decodecode | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/scripts/decodecode +++ b/scripts/decodecode @@ -126,7 +126,7 @@ faultlinenum=$(( $(wc -l $T.oo | cut -d faultline=`cat $T.dis | head -1 | cut -d":" -f2-` faultline=`echo "$faultline" | sed -e 's/[/\[/g; s/]/\]/g'`
-cat $T.oo | sed -e "${faultlinenum}s/^(.*:)(.*)/\1*\2\t\t<-- trapping instruction/" +cat $T.oo | sed -e "${faultlinenum}s/^([^:]*:)(.*)/\1*\2\t\t<-- trapping instruction/" echo cat $T.aa cleanup
From: Yafang Shao laoar.shao@gmail.com
commit 11d6761218d19ca06ae5387f4e3692c4fa9e7493 upstream.
When I run my memcg testcase which creates lots of memcgs, I found there're unexpected out of memory logs while there're still enough available free memory. The error log is
mkdir: cannot create directory 'foo.65533': Cannot allocate memory
The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs, an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right errno should be -ENOSPC "No space left on device", which is an appropriate errno for userspace's failed mkdir.
As the errno really misled me, we should make it right. After this patch, the error log will be
mkdir: cannot create directory 'foo.65533': No space left on device
[akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Suggested-by: Matthew Wilcox willy@infradead.org Signed-off-by: Yafang Shao laoar.shao@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Michal Hocko mhocko@kernel.org Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/memcontrol.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)
--- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4977,19 +4977,22 @@ static struct mem_cgroup *mem_cgroup_all unsigned int size; int node; int __maybe_unused i; + long error = -ENOMEM;
size = sizeof(struct mem_cgroup); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
memcg = kzalloc(size, GFP_KERNEL); if (!memcg) - return NULL; + return ERR_PTR(error);
memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX, GFP_KERNEL); - if (memcg->id.id < 0) + if (memcg->id.id < 0) { + error = memcg->id.id; goto fail; + }
memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); if (!memcg->vmstats_local) @@ -5033,7 +5036,7 @@ static struct mem_cgroup *mem_cgroup_all fail: mem_cgroup_id_remove(memcg); __mem_cgroup_free(memcg); - return NULL; + return ERR_PTR(error); }
static struct cgroup_subsys_state * __ref @@ -5044,8 +5047,8 @@ mem_cgroup_css_alloc(struct cgroup_subsy long error = -ENOMEM;
memcg = mem_cgroup_alloc(); - if (!memcg) - return ERR_PTR(error); + if (IS_ERR(memcg)) + return ERR_CAST(memcg);
memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; @@ -5095,7 +5098,7 @@ mem_cgroup_css_alloc(struct cgroup_subsy fail: mem_cgroup_id_remove(memcg); mem_cgroup_free(memcg); - return ERR_PTR(-ENOMEM); + return ERR_PTR(error); }
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
From: Christoph Hellwig hch@lst.de
[ Upstream commit eb7ae5e06bb6e6ac6bb86872d27c43ebab92f6b2 ]
bdi_dev_name is not a fast path function, move it out of line. This prepares for using it from modular callers without having to export an implementation detail like bdi_unknown_name.
Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/backing-dev.h | 9 +-------- mm/backing-dev.c | 10 +++++++++- 2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index f88197c1ffc2d..c9ad5c3b7b4b2 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -505,13 +505,6 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi) (1 << WB_async_congested)); }
-extern const char *bdi_unknown_name; - -static inline const char *bdi_dev_name(struct backing_dev_info *bdi) -{ - if (!bdi || !bdi->dev) - return bdi_unknown_name; - return dev_name(bdi->dev); -} +const char *bdi_dev_name(struct backing_dev_info *bdi);
#endif /* _LINUX_BACKING_DEV_H */ diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 62f05f605fb5b..680e5028d0fc5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -21,7 +21,7 @@ struct backing_dev_info noop_backing_dev_info = { EXPORT_SYMBOL_GPL(noop_backing_dev_info);
static struct class *bdi_class; -const char *bdi_unknown_name = "(unknown)"; +static const char *bdi_unknown_name = "(unknown)";
/* * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU @@ -1043,6 +1043,14 @@ void bdi_put(struct backing_dev_info *bdi) } EXPORT_SYMBOL(bdi_put);
+const char *bdi_dev_name(struct backing_dev_info *bdi) +{ + if (!bdi || !bdi->dev) + return bdi_unknown_name; + return dev_name(bdi->dev); +} +EXPORT_SYMBOL_GPL(bdi_dev_name); + static wait_queue_head_t congestion_wqh[2] = { __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]), __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
From: Christoph Hellwig hch@lst.de
[ Upstream commit 6bd87eec23cbc9ed222bed0f5b5b02bf300e9a8d ]
Cache a copy of the name for the life time of the backing_dev_info structure so that we can reference it even after unregistering.
Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears") Reported-by: Yufen Yu yuyufen@huawei.com Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/backing-dev-defs.h | 1 + mm/backing-dev.c | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 4fc87dee005ab..2849bdbb3acbe 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -220,6 +220,7 @@ struct backing_dev_info { wait_queue_head_t wb_waitq;
struct device *dev; + char dev_name[64]; struct device *owner;
struct timer_list laptop_mode_wb_timer; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 680e5028d0fc5..3f2480e4c5af3 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -938,7 +938,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (bdi->dev) /* The driver needs to use separate queues per device */ return 0;
- dev = device_create_vargs(bdi_class, NULL, MKDEV(0, 0), bdi, fmt, args); + vsnprintf(bdi->dev_name, sizeof(bdi->dev_name), fmt, args); + dev = device_create(bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name); if (IS_ERR(dev)) return PTR_ERR(dev);
@@ -1047,7 +1048,7 @@ const char *bdi_dev_name(struct backing_dev_info *bdi) { if (!bdi || !bdi->dev) return bdi_unknown_name; - return dev_name(bdi->dev); + return bdi->dev_name; } EXPORT_SYMBOL_GPL(bdi_dev_name);
From: Max Kellermann mk@cm4all.com
Based on commit 63ff822358b276137059520cf16e587e8073e80f upstream.
If an operation's flag `needs_file` is set, the function io_req_set_file() calls io_file_get() to obtain a `struct file*`.
This fails for `O_PATH` file descriptors, because io_file_get() calls fget(), which rejects `O_PATH` file descriptors. To support `O_PATH`, fdget_raw() must be used (like path_init() in `fs/namei.c` does). This rejection causes io_req_set_file() to throw `-EBADF`. This breaks the operations `openat`, `openat2` and `statx`, where `O_PATH` file descriptors are commonly used.
This could be solved by adding support for `O_PATH` file descriptors with another `io_op_def` flag, but since those three operations don't need the `struct file*` but operate directly on the numeric file descriptors, the best solution here is to simply remove `needs_file` (and the accompanying flag `fd_non_reg`).
Cc: stable@vger.kernel.org Signed-off-by: Max Kellermann mk@cm4all.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/io_uring.c | 6 ------ 1 file changed, 6 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -696,8 +696,6 @@ static const struct io_op_def io_op_defs .needs_file = 1, }, [IORING_OP_OPENAT] = { - .needs_file = 1, - .fd_non_neg = 1, .file_table = 1, .needs_fs = 1, }, @@ -711,8 +709,6 @@ static const struct io_op_def io_op_defs }, [IORING_OP_STATX] = { .needs_mm = 1, - .needs_file = 1, - .fd_non_neg = 1, .needs_fs = 1, .file_table = 1, }, @@ -743,8 +739,6 @@ static const struct io_op_def io_op_defs .unbound_nonreg_file = 1, }, [IORING_OP_OPENAT2] = { - .needs_file = 1, - .fd_non_neg = 1, .file_table = 1, .needs_fs = 1, },
From: Amir Goldstein amir73il@gmail.com
[ Upstream commit dfc2d2594e4a79204a3967585245f00644b8f838 ]
The event inode field is used only for comparison in queue merges and cannot be dereferenced after handle_event(), because it does not hold a refcount on the inode.
Replace it with an abstract id to do the same thing.
Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- fs/notify/fanotify/fanotify.c | 4 ++-- fs/notify/inotify/inotify_fsnotify.c | 4 ++-- fs/notify/inotify/inotify_user.c | 2 +- include/linux/fsnotify_backend.h | 7 +++---- 4 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 5778d1347b351..14d0ac4664595 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -26,7 +26,7 @@ static bool should_merge(struct fsnotify_event *old_fsn, old = FANOTIFY_E(old_fsn); new = FANOTIFY_E(new_fsn);
- if (old_fsn->inode != new_fsn->inode || old->pid != new->pid || + if (old_fsn->objectid != new_fsn->objectid || old->pid != new->pid || old->fh_type != new->fh_type || old->fh_len != new->fh_len) return false;
@@ -314,7 +314,7 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, if (!event) goto out; init: __maybe_unused - fsnotify_init_event(&event->fse, inode); + fsnotify_init_event(&event->fse, (unsigned long)inode); event->mask = mask; if (FAN_GROUP_FLAG(group, FAN_REPORT_TID)) event->pid = get_pid(task_pid(current)); diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c index d510223d302ca..589dee9629938 100644 --- a/fs/notify/inotify/inotify_fsnotify.c +++ b/fs/notify/inotify/inotify_fsnotify.c @@ -39,7 +39,7 @@ static bool event_compare(struct fsnotify_event *old_fsn, if (old->mask & FS_IN_IGNORED) return false; if ((old->mask == new->mask) && - (old_fsn->inode == new_fsn->inode) && + (old_fsn->objectid == new_fsn->objectid) && (old->name_len == new->name_len) && (!old->name_len || !strcmp(old->name, new->name))) return true; @@ -118,7 +118,7 @@ int inotify_handle_event(struct fsnotify_group *group, mask &= ~IN_ISDIR;
fsn_event = &event->fse; - fsnotify_init_event(fsn_event, inode); + fsnotify_init_event(fsn_event, (unsigned long)inode); event->mask = mask; event->wd = i_mark->wd; event->sync_cookie = cookie; diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c index 107537a543fd8..81ffc8629fc4b 100644 --- a/fs/notify/inotify/inotify_user.c +++ b/fs/notify/inotify/inotify_user.c @@ -635,7 +635,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events) return ERR_PTR(-ENOMEM); } group->overflow_event = &oevent->fse; - fsnotify_init_event(group->overflow_event, NULL); + fsnotify_init_event(group->overflow_event, 0); oevent->mask = FS_Q_OVERFLOW; oevent->wd = -1; oevent->sync_cookie = 0; diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 1915bdba2fad9..64cfb5446f4d4 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -133,8 +133,7 @@ struct fsnotify_ops { */ struct fsnotify_event { struct list_head list; - /* inode may ONLY be dereferenced during handle_event(). */ - struct inode *inode; /* either the inode the event happened to or its parent */ + unsigned long objectid; /* identifier for queue merges */ };
/* @@ -500,10 +499,10 @@ extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info); extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
static inline void fsnotify_init_event(struct fsnotify_event *event, - struct inode *inode) + unsigned long objectid) { INIT_LIST_HEAD(&event->list); - event->inode = inode; + event->objectid = objectid; }
#else
From: Amir Goldstein amir73il@gmail.com
[ Upstream commit f367a62a7cad2447d835a9f14fc63997a9137246 ]
With inotify, when a watch is set on a directory and on its child, an event on the child is reported twice, once with wd of the parent watch and once with wd of the child watch without the filename.
With fanotify, when a watch is set on a directory and on its child, an event on the child is reported twice, but it has the exact same information - either an open file descriptor of the child or an encoded fid of the child.
The reason that the two identical events are not merged is because the object id used for merging events in the queue is the child inode in one event and parent inode in the other.
For events with path or dentry data, use the victim inode instead of the watched inode as the object id for event merging, so that the event reported on parent will be merged with the event reported on the child.
Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- fs/notify/fanotify/fanotify.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 14d0ac4664595..f5d30573f4a99 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -314,7 +314,12 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, if (!event) goto out; init: __maybe_unused - fsnotify_init_event(&event->fse, (unsigned long)inode); + /* + * Use the victim inode instead of the watching inode as the id for + * event queue, so event reported on parent is merged with event + * reported on child when both directory and child watches exist. + */ + fsnotify_init_event(&event->fse, (unsigned long)id); event->mask = mask; if (FAN_GROUP_FLAG(group, FAN_REPORT_TID)) event->pid = get_pid(task_pid(current));
On 13/05/2020 10:43, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
All tests are passing for Tegra ...
Test results for stable-v5.6: 13 builds: 13 pass, 0 fail 26 boots: 26 pass, 0 fail 42 tests: 42 pass, 0 fail
Linux version: 5.6.13-rc1-gf1d28d1c7608 Boards tested: tegra124-jetson-tk1, tegra186-p2771-0000, tegra194-p2972-0000, tegra20-ventana, tegra210-p2371-2180, tegra210-p3450-0000, tegra30-cardhu-a04
Cheers Jon
On Wed, May 13, 2020 at 02:46:31PM +0100, Jon Hunter wrote:
On 13/05/2020 10:43, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
All tests are passing for Tegra ...
Test results for stable-v5.6: 13 builds: 13 pass, 0 fail 26 boots: 26 pass, 0 fail 42 tests: 42 pass, 0 fail
Linux version: 5.6.13-rc1-gf1d28d1c7608 Boards tested: tegra124-jetson-tk1, tegra186-p2771-0000, tegra194-p2972-0000, tegra20-ventana, tegra210-p2371-2180, tegra210-p3450-0000, tegra30-cardhu-a04
Wonderful, thanks for testing all of these and letting me know.
greg k-h
On Wed, May 13, 2020 at 11:43:39AM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
Build results: total: 155 pass: 155 fail: 0 Qemu test results: total: 431 pass: 431 fail: 0
Guenter
On Wed, May 13, 2020 at 10:04:12AM -0700, Guenter Roeck wrote:
On Wed, May 13, 2020 at 11:43:39AM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
Build results: total: 155 pass: 155 fail: 0 Qemu test results: total: 431 pass: 431 fail: 0
Thanks for testing them all.
greg k-h
On Wed, 13 May 2020 at 15:22, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Summary ------------------------------------------------------------------------
kernel: 5.6.13-rc1 git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git branch: linux-5.6.y git commit: f1d28d1c7608478dd10b7a36c40f2375bcc1648e git describe: v5.6.12-119-gf1d28d1c7608 Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-5.6-oe/build/v5.6.12-119-...
No regressions (compared to build v5.6.12)
No fixes (compared to build v5.6.12)
Ran 35302 total tests in the following environments and test suites.
Environments -------------- - dragonboard-410c - hi6220-hikey - i386 - juno-r2 - juno-r2-compat - juno-r2-kasan - nxp-ls2088 - qemu_arm - qemu_arm64 - qemu_i386 - qemu_x86_64 - x15 - x86 - x86-kasan
Test Suites ----------- * build * install-android-platform-tools-r2600 * install-android-platform-tools-r2800 * kselftest * kselftest/drivers * kselftest/filesystems * libgpiod * linux-log-parser * ltp-containers-tests * ltp-cve-tests * ltp-fcntl-locktests-tests * ltp-filecaps-tests * ltp-fs_bind-tests * ltp-fs_perms_simple-tests * ltp-fsx-tests * ltp-ipc-tests * ltp-sched-tests * ltp-syscalls-tests * perf * ltp-commands-tests * ltp-dio-tests * ltp-io-tests * ltp-math-tests * ltp-nptl-tests * ltp-pty-tests * ltp-securebits-tests * network-basic-tests * kselftest/net * kselftest/networking * libhugetlbfs * ltp-cap_bounds-tests * ltp-cpuhotplug-tests * ltp-crypto-tests * ltp-fs-tests * ltp-hugetlb-tests * ltp-mm-tests * ltp-open-posix-tests * v4l2-compliance * kvm-unit-tests * kselftest-vsyscall-mode-native * kselftest-vsyscall-mode-native/drivers * kselftest-vsyscall-mode-native/filesystems * kselftest-vsyscall-mode-native/net * kselftest-vsyscall-mode-native/networking * kselftest-vsyscall-mode-none * kselftest-vsyscall-mode-none/drivers * kselftest-vsyscall-mode-none/filesystems * kselftest-vsyscall-mode-none/net * kselftest-vsyscall-mode-none/networking
On Wed, May 13, 2020 at 10:56:21PM +0530, Naresh Kamboju wrote:
On Wed, 13 May 2020 at 15:22, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Thanks for testing all of these trees.
greg k-h
On 5/13/20 3:43 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
thanks, -- Shuah
On Wed, May 13, 2020 at 04:59:52PM -0600, shuah wrote:
On 5/13/20 3:43 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.6.13 release. There are 118 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.6.13-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.6.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
Thanks for testing all of these and letting me know.
greg k-h
linux-stable-mirror@lists.linaro.org