 
            This is the start of the stable review cycle for the 5.4.41 release. There are 90 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.41-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.4.41-rc1
Amir Goldstein amir73il@gmail.com fanotify: merge duplicate events on parent and child
Amir Goldstein amir73il@gmail.com fsnotify: replace inode pointer with an object id
Christoph Hellwig hch@lst.de bdi: add a ->dev_name field to struct backing_dev_info
Christoph Hellwig hch@lst.de bdi: move bdi_dev_name out of line
Yafang Shao laoar.shao@gmail.com mm, memcg: fix error return value of mem_cgroup_css_alloc()
Ivan Delalande colona@arista.com scripts/decodecode: fix trapping instruction formatting
Julia Lawall Julia.Lawall@inria.fr iommu/virtio: Reverse arguments to list_add
Josh Poimboeuf jpoimboe@redhat.com objtool: Fix stack offset tracking for indirect CFAs
Arnd Bergmann arnd@arndb.de netfilter: nf_osf: avoid passing pointer to local var
Guillaume Nault gnault@redhat.com netfilter: nat: never update the UDP checksum when it's 0
Janakarajan Natarajan Janakarajan.Natarajan@amd.com arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
Suravee Suthikulpanit suravee.suthikulpanit@amd.com KVM: x86: Fixes posted interrupt check for IRQs delivery modes
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Fix premature unwind stoppage due to IRET frames
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Fix error path for bad ORC entry type
Josh Poimboeuf jpoimboe@redhat.com x86/unwind/orc: Prevent unwinding before ORC initialization
Miroslav Benes mbenes@suse.cz x86/unwind/orc: Don't skip the first frame for inactive tasks
Jann Horn jannh@google.com x86/entry/64: Fix unwind hints in rewind_stack_do_exit()
Josh Poimboeuf jpoimboe@redhat.com x86/entry/64: Fix unwind hints in kernel exit path
Josh Poimboeuf jpoimboe@redhat.com x86/entry/64: Fix unwind hints in register clearing code
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_v_ogm_process
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_store_throughput_override
Xiyu Yang xiyuyang19@fudan.edu.cn batman-adv: Fix refcnt leak in batadv_show_throughput_override
George Spelvin lkml@sdf.org batman-adv: fix batadv_nc_random_weight_tq
Tejun Heo tj@kernel.org iocost: protect iocg->abs_vdebt with iocg->waitq.lock
Vincent Chen vincent.chen@sifive.com riscv: set max_pfn to the PFN of the last page
Luis Chamberlain mcgrof@kernel.org coredump: fix crash when umh is disabled
Oscar Carter oscar.carter@gmx.com staging: gasket: Check the return value of gasket_get_bar_index()
Luis Henriques lhenriques@suse.com ceph: demote quotarealm lookup warning to a debug message
Jeff Layton jlayton@kernel.org ceph: fix endianness bug when handling MDS session feature bits
Henry Willard henry.willard@oracle.com mm: limit boost_watermark on small zones
David Hildenbrand david@redhat.com mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
Khazhismel Kumykov khazhy@google.com eventpoll: fix missing wakeup for ovflist in ep_poll_callback
Roman Penyaev rpenyaev@suse.de epoll: atomically remove wait entry on wake up
Oleg Nesterov oleg@redhat.com ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
H. Nikolaus Schaller hns@goldelico.com drm: ingenic-drm: add MODULE_DEVICE_TABLE
Mark Rutland mark.rutland@arm.com arm64: hugetlb: avoid potential NULL dereference
Marc Zyngier maz@kernel.org KVM: arm64: Fix 32bit PC wrap-around
Marc Zyngier maz@kernel.org KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVER
Sean Christopherson sean.j.christopherson@intel.com KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path
Christian Borntraeger borntraeger@de.ibm.com KVM: s390: Remove false WARN_ON_ONCE for the PQAP instruction
Jason A. Donenfeld Jason@zx2c4.com crypto: arch/nhpoly1305 - process in explicit 4k chunks
Steven Rostedt (VMware) rostedt@goodmis.org tracing: Add a vmalloc_sync_mappings() for safe measure
Oliver Neukum oneukum@suse.com USB: serial: garmin_gps: add sanity checking for data length
Bryan O'Donoghue bryan.odonoghue@linaro.org usb: chipidea: msm: Ensure proper controller reset using role switch API
Oliver Neukum oneukum@suse.com USB: uas: add quirk for LaCie 2Big Quadra
Jason Gerecke killertofu@gmail.com HID: wacom: Report 2nd-gen Intuos Pro S center button status over BT
Alan Stern stern@rowland.harvard.edu HID: usbhid: Fix race between usbhid_close() and usbhid_stop()
Jason Gerecke killertofu@gmail.com Revert "HID: wacom: generic: read the number of expected touches on a per collection basis"
Jere Leppänen jere.leppanen@nokia.com sctp: Fix bundling of SHUTDOWN with COOKIE-ACK
Jason Gerecke jason.gerecke@wacom.com HID: wacom: Read HID_DG_CONTACTMAX directly for non-generic devices
Dan Carpenter dan.carpenter@oracle.com net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del()
Dan Carpenter dan.carpenter@oracle.com net: mvpp2: prevent buffer overflow in mvpp22_rss_ctx()
Moshe Shemesh moshe@mellanox.com net/mlx5: Fix command entry leak in Internal Error State
Moshe Shemesh moshe@mellanox.com net/mlx5: Fix forced completion access non initialized command entry
Erez Shitrit erezsh@mellanox.com net/mlx5: DR, On creation set CQ's arm_db member to right value
Michael Chan michael.chan@broadcom.com bnxt_en: Fix VLAN acceleration handling in bnxt_fix_features().
Michael Chan michael.chan@broadcom.com bnxt_en: Return error when allocating zero size context memory.
Michael Chan michael.chan@broadcom.com bnxt_en: Improve AER slot reset.
Vasundhara Volam vasundhara-v.volam@broadcom.com bnxt_en: Reduce BNXT_MSIX_VEC_MAX value to supported CQs per PF.
Michael Chan michael.chan@broadcom.com bnxt_en: Fix VF anti-spoof filter setup.
Toke Høiland-Jørgensen toke@redhat.com tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040
Tuong Lien tuong.t.lien@dektech.com.au tipc: fix partial topology connection closure
Eric Dumazet edumazet@google.com sch_sfq: validate silly quantum values
Eric Dumazet edumazet@google.com sch_choke: avoid potential panic in choke_reset()
Qiushi Wu wu000273@umn.edu nfp: abm: fix a memory leak bug
Matt Jolly Kangie@footclan.ninja net: usb: qmi_wwan: add support for DW5816e
Xiyu Yang xiyuyang19@fudan.edu.cn net/tls: Fix sk_psock refcnt leak when in tls_data_ready()
Xiyu Yang xiyuyang19@fudan.edu.cn net/tls: Fix sk_psock refcnt leak in bpf_exec_tx_verdict()
Anthony Felice tony.felice@timesys.com net: tc35815: Fix phydev supported/advertising mask
Willem de Bruijn willemb@google.com net: stricter validation of untrusted gso packets
Eric Dumazet edumazet@google.com net_sched: sch_skbprio: add message validation to skbprio_change()
Tariq Toukan tariqt@mellanox.com net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc()
Scott Dial scott@scottdial.com net: macsec: preserve ingress frame ordering
Dejin Zheng zhengdejin5@gmail.com net: macb: fix an issue about leak related system resources
Florian Fainelli f.fainelli@gmail.com net: dsa: Do not leave DSA master with NULL netdev_ops
Roman Mashak mrv@mojatatu.com neigh: send protocol value in neighbor create notification
Jiri Pirko jiri@mellanox.com mlxsw: spectrum_acl_tcam: Position vchunk in a vregion list properly
David Ahern dsahern@kernel.org ipv6: Use global sernum for dst validation with nexthop objects
Eric Dumazet edumazet@google.com fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks
Julia Lawall Julia.Lawall@inria.fr dp83640: reverse arguments to list_add_tail
Jakub Kicinski kuba@kernel.org devlink: fix return value after hitting end in region read
Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com tty: xilinx_uartps: Fix missing id assignment to the console
Nicolas Pitre nico@fluxnic.net vt: fix unicode console freeing with a common interface
Evan Quan evan.quan@amd.com drm/amdgpu: drop redundant cg/pg ungate on runpm enter
Evan Quan evan.quan@amd.com drm/amdgpu: move kfd suspend after ip_suspend_phase1
Andy Shevchenko andriy.shevchenko@linux.intel.com net: macb: Fix runtime PM refcounting
Masami Hiramatsu mhiramat@kernel.org tracing/kprobes: Fix a double initialization typo
Sagi Grimberg sagi@grimberg.me nvme: fix possible hang when ns scanning fails during error recovery
Christoph Hellwig hch@lst.de nvme: refactor nvme_identify_ns_descs error handling
Matt Jolly Kangie@footclan.ninja USB: serial: qcserial: Add DW5816e support
-------------
Diffstat:
Makefile | 4 +- arch/arm/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/kvm/guest.c | 7 ++ arch/arm64/mm/hugetlbpage.c | 2 + arch/riscv/mm/init.c | 3 +- arch/s390/kvm/priv.c | 4 +- arch/x86/crypto/nhpoly1305-avx2-glue.c | 2 +- arch/x86/crypto/nhpoly1305-sse2-glue.c | 2 +- arch/x86/entry/calling.h | 40 +++---- arch/x86/entry/entry_64.S | 9 +- arch/x86/include/asm/kvm_host.h | 4 +- arch/x86/include/asm/unwind.h | 2 +- arch/x86/kernel/unwind_orc.c | 61 ++++++++--- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx/vmenter.S | 3 + block/blk-iocost.c | 117 +++++++++++++-------- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +- drivers/gpu/drm/ingenic/ingenic-drm.c | 1 + drivers/hid/usbhid/hid-core.c | 37 +++++-- drivers/hid/usbhid/usbhid.h | 1 + drivers/hid/wacom_sys.c | 4 +- drivers/hid/wacom_wac.c | 88 ++++------------ drivers/iommu/virtio-iommu.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 20 ++-- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 - drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 10 +- drivers/net/ethernet/cadence/macb_main.c | 24 ++--- drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c | 3 + drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 + drivers/net/ethernet/mellanox/mlx4/main.c | 4 +- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 6 +- .../ethernet/mellanox/mlx5/core/steering/dr_send.c | 14 ++- .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c | 12 ++- drivers/net/ethernet/netronome/nfp/abm/main.c | 1 + drivers/net/ethernet/toshiba/tc35815.c | 2 +- drivers/net/macsec.c | 3 +- drivers/net/phy/dp83640.c | 2 +- drivers/net/usb/qmi_wwan.c | 1 + drivers/nvme/host/core.c | 28 +++-- drivers/staging/gasket/gasket_core.c | 4 + drivers/tty/serial/xilinx_uartps.c | 1 + drivers/tty/vt/vt.c | 9 +- drivers/usb/chipidea/ci_hdrc_msm.c | 2 +- drivers/usb/serial/garmin_gps.c | 4 +- drivers/usb/serial/qcserial.c | 1 + drivers/usb/storage/unusual_uas.h | 7 ++ fs/ceph/mds_client.c | 8 +- fs/ceph/quota.c | 4 +- fs/coredump.c | 8 ++ fs/eventpoll.c | 61 ++++++----- fs/notify/fanotify/fanotify.c | 9 +- fs/notify/inotify/inotify_fsnotify.c | 4 +- fs/notify/inotify/inotify_user.c | 2 +- include/linux/backing-dev-defs.h | 1 + include/linux/backing-dev.h | 9 +- include/linux/fsnotify_backend.h | 7 +- include/linux/virtio_net.h | 26 ++++- include/net/inet_ecn.h | 57 +++++++++- include/net/ip6_fib.h | 4 + include/net/net_namespace.h | 7 ++ ipc/mqueue.c | 34 ++++-- kernel/trace/trace.c | 13 +++ kernel/trace/trace_kprobe.c | 2 +- kernel/umh.c | 5 + mm/backing-dev.c | 13 ++- mm/memcontrol.c | 15 +-- mm/page_alloc.c | 9 ++ net/batman-adv/bat_v_ogm.c | 2 +- net/batman-adv/network-coding.c | 9 +- net/batman-adv/sysfs.c | 3 +- net/core/devlink.c | 5 + net/core/neighbour.c | 6 +- net/dsa/master.c | 3 +- net/ipv6/route.c | 25 +++++ net/netfilter/nf_nat_proto.c | 4 +- net/netfilter/nfnetlink_osf.c | 12 ++- net/sched/sch_choke.c | 3 +- net/sched/sch_fq_codel.c | 2 +- net/sched/sch_sfq.c | 9 ++ net/sched/sch_skbprio.c | 3 + net/sctp/sm_statefuns.c | 6 +- net/tipc/topsrv.c | 5 +- net/tls/tls_sw.c | 7 +- scripts/decodecode | 2 +- tools/cgroup/iocost_monitor.py | 7 +- tools/objtool/check.c | 2 +- virt/kvm/arm/hyp/aarch32.c | 8 +- virt/kvm/arm/vgic/vgic-mmio.c | 4 +- 90 files changed, 648 insertions(+), 346 deletions(-)
 
            From: Matt Jolly Kangie@footclan.ninja
commit 78d6de3cfbd342918d31cf68d0d2eda401338aef upstream.
Add support for Dell Wireless 5816e to drivers/usb/serial/qcserial.c
Signed-off-by: Matt Jolly Kangie@footclan.ninja Cc: stable stable@vger.kernel.org Signed-off-by: Johan Hovold johan@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/serial/qcserial.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/usb/serial/qcserial.c +++ b/drivers/usb/serial/qcserial.c @@ -173,6 +173,7 @@ static const struct usb_device_id id_tab {DEVICE_SWI(0x413c, 0x81b3)}, /* Dell Wireless 5809e Gobi(TM) 4G LTE Mobile Broadband Card (rev3) */ {DEVICE_SWI(0x413c, 0x81b5)}, /* Dell Wireless 5811e QDL */ {DEVICE_SWI(0x413c, 0x81b6)}, /* Dell Wireless 5811e QDL */ + {DEVICE_SWI(0x413c, 0x81cc)}, /* Dell Wireless 5816e */ {DEVICE_SWI(0x413c, 0x81cf)}, /* Dell Wireless 5819 */ {DEVICE_SWI(0x413c, 0x81d0)}, /* Dell Wireless 5819 */ {DEVICE_SWI(0x413c, 0x81d1)}, /* Dell Wireless 5818 */
 
            From: Christoph Hellwig hch@lst.de
[ Upstream commit fb314eb0cbb2e11540d1ae1a7b28346397f621ef ]
Move the handling of an error into the function from the caller, and only do it for an actual error on the admin command itself, not the command parsing, as that should be enough to deal with devices claiming a bogus version compliance.
Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Keith Busch kbusch@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/core.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 31b7dcd791c20..66147df86d883 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1071,8 +1071,17 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
status = nvme_submit_sync_cmd(ctrl->admin_q, &c, data, NVME_IDENTIFY_DATA_SIZE); - if (status) + if (status) { + dev_warn(ctrl->device, + "Identify Descriptors failed (%d)\n", status); + /* + * Don't treat an error as fatal, as we potentially already + * have a NGUID or EUI-64. + */ + if (status > 0) + status = 0; goto free_data; + }
for (pos = 0; pos < NVME_IDENTIFY_DATA_SIZE; pos += len) { struct nvme_ns_id_desc *cur = data + pos; @@ -1730,26 +1739,15 @@ static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns) static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, struct nvme_id_ns *id, struct nvme_ns_ids *ids) { - int ret = 0; - memset(ids, 0, sizeof(*ids));
if (ctrl->vs >= NVME_VS(1, 1, 0)) memcpy(ids->eui64, id->eui64, sizeof(id->eui64)); if (ctrl->vs >= NVME_VS(1, 2, 0)) memcpy(ids->nguid, id->nguid, sizeof(id->nguid)); - if (ctrl->vs >= NVME_VS(1, 3, 0)) { - /* Don't treat error as fatal we potentially - * already have a NGUID or EUI-64 - */ - ret = nvme_identify_ns_descs(ctrl, nsid, ids); - if (ret) - dev_warn(ctrl->device, - "Identify Descriptors failed (%d)\n", ret); - if (ret > 0) - ret = 0; - } - return ret; + if (ctrl->vs >= NVME_VS(1, 3, 0)) + return nvme_identify_ns_descs(ctrl, nsid, ids); + return 0; }
static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
 
            From: Sagi Grimberg sagi@grimberg.me
[ Upstream commit 59c7c3caaaf8750df4ec3255082f15eb4e371514 ]
When the controller is reconnecting, the host fails I/O and admin commands as the host cannot reach the controller. ns scanning may revalidate namespaces during that period and it is wrong to remove namespaces due to these failures as we may hang (see 205da2434301).
One command that may fail is nvme_identify_ns_descs. Since we return success due to having ns identify descriptor list optional, we continue to compare ns identifiers in nvme_revalidate_disk, obviously fail and return -ENODEV to nvme_validate_ns, which will remove the namespace.
Exactly what we don't want to happen.
Fixes: 22802bf742c2 ("nvme: Namepace identification descriptor list is optional") Tested-by: Anton Eidelman anton@lightbitslabs.com Signed-off-by: Sagi Grimberg sagi@grimberg.me Reviewed-by: Keith Busch kbusch@kernel.org Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 66147df86d883..f0e0af3aa714e 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1078,7 +1078,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid, * Don't treat an error as fatal, as we potentially already * have a NGUID or EUI-64. */ - if (status > 0) + if (status > 0 && !(status & NVME_SC_DNR)) status = 0; goto free_data; }
 
            From: Masami Hiramatsu mhiramat@kernel.org
[ Upstream commit dcbd21c9fca5e954fd4e3d91884907eb6d47187e ]
Fix a typo that resulted in an unnecessary double initialization to addr.
Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote...
Cc: Tom Zanussi zanussi@kernel.org Cc: Ingo Molnar mingo@kernel.org Cc: stable@vger.kernel.org Fixes: c7411a1a126f ("tracing/kprobe: Check whether the non-suffixed symbol is notrace") Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 2f0f7fcee73e6..fba4b48451f6c 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -454,7 +454,7 @@ static bool __within_notrace_func(unsigned long addr)
static bool within_notrace_func(struct trace_kprobe *tk) { - unsigned long addr = addr = trace_kprobe_address(tk); + unsigned long addr = trace_kprobe_address(tk); char symname[KSYM_NAME_LEN], *p;
if (!__within_notrace_func(addr))
 
            From: Andy Shevchenko andriy.shevchenko@linux.intel.com
[ Upstream commit 0ce205d4660c312cdeb4a81066616dcc6f3799c4 ]
The commit e6a41c23df0d, while trying to fix an issue,
("net: macb: ensure interface is not suspended on at91rm9200")
introduced a refcounting regression, because in error case refcounter must be balanced. Fix it by calling pm_runtime_put_noidle() in error case.
While here, fix the same mistake in other couple of places.
Fixes: e6a41c23df0d ("net: macb: ensure interface is not suspended on at91rm9200") Cc: Alexandre Belloni alexandre.belloni@bootlin.com Cc: Claudiu Beznea claudiu.beznea@microchip.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/cadence/macb_main.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index 234c13ebbc41b..dd2a605c5c2ed 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -334,8 +334,10 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum) int status;
status = pm_runtime_get_sync(&bp->pdev->dev); - if (status < 0) + if (status < 0) { + pm_runtime_put_noidle(&bp->pdev->dev); goto mdio_pm_exit; + }
status = macb_mdio_wait_for_idle(bp); if (status < 0) @@ -367,8 +369,10 @@ static int macb_mdio_write(struct mii_bus *bus, int mii_id, int regnum, int status;
status = pm_runtime_get_sync(&bp->pdev->dev); - if (status < 0) + if (status < 0) { + pm_runtime_put_noidle(&bp->pdev->dev); goto mdio_pm_exit; + }
status = macb_mdio_wait_for_idle(bp); if (status < 0) @@ -3691,8 +3695,10 @@ static int at91ether_open(struct net_device *dev) int ret;
ret = pm_runtime_get_sync(&lp->pdev->dev); - if (ret < 0) + if (ret < 0) { + pm_runtime_put_noidle(&lp->pdev->dev); return ret; + }
/* Clear internal statistics */ ctl = macb_readl(lp, NCR);
 
            From: Evan Quan evan.quan@amd.com
[ Upstream commit c457a273e118bb96e1db8d1825f313e6cafe4258 ]
This sequence change should be safe as what did in ip_suspend_phase1 is to suspend DCE only. And this is a prerequisite for coming redundant cg/pg ungate dropping.
Fixes: 487eca11a321ef ("drm/amdgpu: fix gfx hang during suspend with video playback (v2)") Signed-off-by: Evan Quan evan.quan@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 630e8342d1625..ca2a0770aad2e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3073,12 +3073,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool suspend, bool fbcon) amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
- amdgpu_amdkfd_suspend(adev); - amdgpu_ras_suspend(adev);
r = amdgpu_device_ip_suspend_phase1(adev);
+ amdgpu_amdkfd_suspend(adev); + /* evict vram memory */ amdgpu_bo_evict_vram(adev);
 
            From: Evan Quan evan.quan@amd.com
[ Upstream commit f7b52890daba570bc8162d43c96b5583bbdd4edd ]
CG/PG ungate is already performed in ip_suspend_phase1. Otherwise, the CG/PG ungate will be performed twice. That will cause gfxoff disablement is performed twice also on runpm enter while gfxoff enablemnt once on rump exit. That will put gfxoff into disabled state.
Fixes: b2a7e9735ab286 ("drm/amdgpu: fix the hw hang during perform system reboot and reset") Signed-off-by: Evan Quan evan.quan@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index ca2a0770aad2e..5e1dce4241547 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3070,9 +3070,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool suspend, bool fbcon) } }
- amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); - amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); - amdgpu_ras_suspend(adev);
r = amdgpu_device_ip_suspend_phase1(adev);
 
            From: Nicolas Pitre nico@fluxnic.net
[ Upstream commit 57d38f26d81e4275748b69372f31df545dcd9b71 ]
By directly using kfree() in different places we risk missing one if it is switched to using vfree(), especially if the corresponding vmalloc() is hidden away within a common abstraction.
Oh wait, that's exactly what happened here.
So let's fix this by creating a common abstraction for the free case as well.
Signed-off-by: Nicolas Pitre nico@fluxnic.net Reported-by: syzbot+0bfda3ade1ee9288a1be@syzkaller.appspotmail.com Fixes: 9a98e7a80f95 ("vt: don't use kmalloc() for the unicode screen buffer") Cc: stable@vger.kernel.org Reviewed-by: Sam Ravnborg sam@ravnborg.org Link: https://lore.kernel.org/r/nycvar.YSQ.7.76.2005021043110.2671@knanqh.ubzr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/tty/vt/vt.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c index 8b3ecef50394a..fd0361d72738b 100644 --- a/drivers/tty/vt/vt.c +++ b/drivers/tty/vt/vt.c @@ -365,9 +365,14 @@ static struct uni_screen *vc_uniscr_alloc(unsigned int cols, unsigned int rows) return uniscr; }
+static void vc_uniscr_free(struct uni_screen *uniscr) +{ + vfree(uniscr); +} + static void vc_uniscr_set(struct vc_data *vc, struct uni_screen *new_uniscr) { - vfree(vc->vc_uni_screen); + vc_uniscr_free(vc->vc_uni_screen); vc->vc_uni_screen = new_uniscr; }
@@ -1230,7 +1235,7 @@ static int vc_do_resize(struct tty_struct *tty, struct vc_data *vc, err = resize_screen(vc, new_cols, new_rows, user); if (err) { kfree(newscreen); - kfree(new_uniscr); + vc_uniscr_free(new_uniscr); return err; }
 
            From: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com
[ Upstream commit 2ae11c46d5fdc46cb396e35911c713d271056d35 ]
When serial console has been assigned to ttyPS1 (which is serial1 alias) console index is not updated property and pointing to index -1 (statically initialized) which ends up in situation where nothing has been printed on the port.
The commit 18cc7ac8a28e ("Revert "serial: uartps: Register own uart console and driver structures"") didn't contain this line which was removed by accident.
Fixes: 18cc7ac8a28e ("Revert "serial: uartps: Register own uart console and driver structures"") Signed-off-by: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com Cc: stable stable@vger.kernel.org Signed-off-by: Michal Simek michal.simek@xilinx.com Link: https://lore.kernel.org/r/ed3111533ef5bd342ee5ec504812240b870f0853.158860244... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/tty/serial/xilinx_uartps.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/tty/serial/xilinx_uartps.c b/drivers/tty/serial/xilinx_uartps.c index fe098cf14e6a2..3cb9aacfe0b2a 100644 --- a/drivers/tty/serial/xilinx_uartps.c +++ b/drivers/tty/serial/xilinx_uartps.c @@ -1445,6 +1445,7 @@ static int cdns_uart_probe(struct platform_device *pdev) cdns_uart_uart_driver.nr = CDNS_UART_NR_PORTS; #ifdef CONFIG_SERIAL_XILINX_PS_UART_CONSOLE cdns_uart_uart_driver.cons = &cdns_uart_console; + cdns_uart_console.index = id; #endif
rc = uart_register_driver(&cdns_uart_uart_driver);
 
            From: Jakub Kicinski kuba@kernel.org
[ Upstream commit 610a9346c138b9c2c93d38bf5f3728e74ae9cbd5 ]
Commit d5b90e99e1d5 ("devlink: report 0 after hitting end in region read") fixed region dump, but region read still returns a spurious error:
$ devlink region read netdevsim/netdevsim1/dummy snapshot 0 addr 0 len 128 0000000000000000 a6 f4 c4 1c 21 35 95 a6 9d 34 c3 5b 87 5b 35 79 0000000000000010 f3 a0 d7 ee 4f 2f 82 7f c6 dd c4 f6 a5 c3 1b ae 0000000000000020 a4 fd c8 62 07 59 48 03 70 3b c7 09 86 88 7f 68 0000000000000030 6f 45 5d 6d 7d 0e 16 38 a9 d0 7a 4b 1e 1e 2e a6 0000000000000040 e6 1d ae 06 d6 18 00 85 ca 62 e8 7e 11 7e f6 0f 0000000000000050 79 7e f7 0f f3 94 68 bd e6 40 22 85 b6 be 6f b1 0000000000000060 af db ef 5e 34 f0 98 4b 62 9a e3 1b 8b 93 fc 17 devlink answers: Invalid argument 0000000000000070 61 e8 11 11 66 10 a5 f7 b1 ea 8d 40 60 53 ed 12
This is a minimal fix, I'll follow up with a restructuring so we don't have two checks for the same condition.
Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read") Signed-off-by: Jakub Kicinski kuba@kernel.org Reviewed-by: Jacob Keller jacob.e.keller@intel.com Reviewed-by: Jiri Pirko jiri@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/devlink.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -3907,6 +3907,11 @@ static int devlink_nl_cmd_region_read_du end_offset = nla_get_u64(attrs[DEVLINK_ATTR_REGION_CHUNK_ADDR]); end_offset += nla_get_u64(attrs[DEVLINK_ATTR_REGION_CHUNK_LEN]); dump = false; + + if (start_offset == end_offset) { + err = 0; + goto nla_put_failure; + } }
err = devlink_nl_region_read_snapshot_fill(skb, devlink,
 
            From: Julia Lawall Julia.Lawall@inria.fr
[ Upstream commit 865308373ed49c9fb05720d14cbf1315349b32a9 ]
In this code, it appears that phyter_clocks is a list head, based on the previous list_for_each, and that clock->list is intended to be a list element, given that it has just been initialized in dp83640_clock_init. Accordingly, switch the arguments to list_add_tail, which takes the list head as the second argument.
Fixes: cb646e2b02b27 ("ptp: Added a clock driver for the National Semiconductor PHYTER.") Signed-off-by: Julia Lawall Julia.Lawall@inria.fr Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/phy/dp83640.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/phy/dp83640.c +++ b/drivers/net/phy/dp83640.c @@ -1119,7 +1119,7 @@ static struct dp83640_clock *dp83640_clo goto out; } dp83640_clock_init(clock, bus); - list_add_tail(&phyter_clocks, &clock->list); + list_add_tail(&clock->list, &phyter_clocks); out: mutex_unlock(&phyter_clocks_lock);
 
            From: Eric Dumazet edumazet@google.com
[ Upstream commit 14695212d4cd8b0c997f6121b6df8520038ce076 ]
My intent was to not let users set a zero drop_batch_size, it seems I once again messed with min()/max().
Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()") Signed-off-by: Eric Dumazet edumazet@google.com Acked-by: Toke Høiland-Jørgensen toke@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_fq_codel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/sched/sch_fq_codel.c +++ b/net/sched/sch_fq_codel.c @@ -417,7 +417,7 @@ static int fq_codel_change(struct Qdisc q->quantum = max(256U, nla_get_u32(tb[TCA_FQ_CODEL_QUANTUM]));
if (tb[TCA_FQ_CODEL_DROP_BATCH_SIZE]) - q->drop_batch_size = min(1U, nla_get_u32(tb[TCA_FQ_CODEL_DROP_BATCH_SIZE])); + q->drop_batch_size = max(1U, nla_get_u32(tb[TCA_FQ_CODEL_DROP_BATCH_SIZE]));
if (tb[TCA_FQ_CODEL_MEMORY_LIMIT]) q->memory_limit = min(1U << 31, nla_get_u32(tb[TCA_FQ_CODEL_MEMORY_LIMIT]));
 
            From: David Ahern dsahern@kernel.org
[ Upstream commit 8f34e53b60b337e559f1ea19e2780ff95ab2fa65 ]
Nik reported a bug with pcpu dst cache when nexthop objects are used illustrated by the following: $ ip netns add foo $ ip -netns foo li set lo up $ ip -netns foo addr add 2001:db8:11::1/128 dev lo $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1 $ ip li add veth1 type veth peer name veth2 $ ip li set veth1 up $ ip addr add 2001:db8:10::1/64 dev veth1 $ ip li set dev veth2 netns foo $ ip -netns foo li set veth2 up $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2 $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1 $ ip -6 route add 2001:db8:11::1/128 nhid 100
Create a pcpu entry on cpu 0: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
Re-add the route entry: $ ip -6 ro del 2001:db8:11::1 $ ip -6 route add 2001:db8:11::1/128 nhid 100
Route get on cpu 0 returns the stale pcpu: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 RTNETLINK answers: Network is unreachable
While cpu 1 works: $ taskset -a -c 1 ip -6 route get 2001:db8:11::1 2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
Conversion of FIB entries to work with external nexthop objects missed an important difference between IPv4 and IPv6 - how dst entries are invalidated when the FIB changes. IPv4 has a per-network namespace generation id (rt_genid) that is bumped on changes to the FIB. Checking if a dst_entry is still valid means comparing rt_genid in the rtable to the current value of rt_genid for the namespace.
IPv6 also has a per network namespace counter, fib6_sernum, but the count is saved per fib6_node. With the per-node counter only dst_entries based on fib entries under the node are invalidated when changes are made to the routes - limiting the scope of invalidations. IPv6 uses a reference in the rt6_info, 'from', to track the corresponding fib entry used to create the dst_entry. When validating a dst_entry, the 'from' is used to backtrack to the fib6_node and check the sernum of it to the cookie passed to the dst_check operation.
With the inline format (nexthop definition inline with the fib6_info), dst_entries cached in the fib6_nh have a 1:1 correlation between fib entries, nexthop data and dst_entries. With external nexthops, IPv6 looks more like IPv4 which means multiple fib entries across disparate fib6_nodes can all reference the same fib6_nh. That means validation of dst_entries based on external nexthops needs to use the IPv4 format - the per-network namespace counter.
Add sernum to rt6_info and set it when creating a pcpu dst entry. Update rt6_get_cookie to return sernum if it is set and update dst_check for IPv6 to look for sernum set and based the check on it if so. Finally, rt6_get_pcpu_route needs to validate the cached entry before returning a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and __mkroute_output for IPv4).
This problem only affects routes using the new, external nexthops.
Thanks to the kbuild test robot for catching the IS_ENABLED needed around rt_genid_ipv6 before I sent this out.
Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects") Reported-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David Ahern dsahern@kernel.org Reviewed-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Tested-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/ip6_fib.h | 4 ++++ include/net/net_namespace.h | 7 +++++++ net/ipv6/route.c | 25 +++++++++++++++++++++++++ 3 files changed, 36 insertions(+)
--- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -177,6 +177,7 @@ struct fib6_info { struct rt6_info { struct dst_entry dst; struct fib6_info __rcu *from; + int sernum;
struct rt6key rt6i_dst; struct rt6key rt6i_src; @@ -260,6 +261,9 @@ static inline u32 rt6_get_cookie(const s struct fib6_info *from; u32 cookie = 0;
+ if (rt->sernum) + return rt->sernum; + rcu_read_lock();
from = rcu_dereference(rt->from); --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -428,6 +428,13 @@ static inline int rt_genid_ipv4(struct n return atomic_read(&net->ipv4.rt_genid); }
+#if IS_ENABLED(CONFIG_IPV6) +static inline int rt_genid_ipv6(const struct net *net) +{ + return atomic_read(&net->ipv6.fib6_sernum); +} +#endif + static inline void rt_genid_bump_ipv4(struct net *net) { atomic_inc(&net->ipv4.rt_genid); --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1388,9 +1388,18 @@ static struct rt6_info *ip6_rt_pcpu_allo } ip6_rt_copy_init(pcpu_rt, res); pcpu_rt->rt6i_flags |= RTF_PCPU; + + if (f6i->nh) + pcpu_rt->sernum = rt_genid_ipv6(dev_net(dev)); + return pcpu_rt; }
+static bool rt6_is_valid(const struct rt6_info *rt6) +{ + return rt6->sernum == rt_genid_ipv6(dev_net(rt6->dst.dev)); +} + /* It should be called with rcu_read_lock() acquired */ static struct rt6_info *rt6_get_pcpu_route(const struct fib6_result *res) { @@ -1398,6 +1407,19 @@ static struct rt6_info *rt6_get_pcpu_rou
pcpu_rt = this_cpu_read(*res->nh->rt6i_pcpu);
+ if (pcpu_rt && pcpu_rt->sernum && !rt6_is_valid(pcpu_rt)) { + struct rt6_info *prev, **p; + + p = this_cpu_ptr(res->nh->rt6i_pcpu); + prev = xchg(p, NULL); + if (prev) { + dst_dev_put(&prev->dst); + dst_release(&prev->dst); + } + + pcpu_rt = NULL; + } + return pcpu_rt; }
@@ -2599,6 +2621,9 @@ static struct dst_entry *ip6_dst_check(s
rt = container_of(dst, struct rt6_info, dst);
+ if (rt->sernum) + return rt6_is_valid(rt) ? dst : NULL; + rcu_read_lock();
/* All IPV6 dsts are created with ->obsolete set to the value
 
            From: Jiri Pirko jiri@mellanox.com
[ Upstream commit 6ef4889fc0b3aa6ab928e7565935ac6f762cee6e ]
Vregion helpers to get min and max priority depend on the correct ordering of vchunks in the vregion list. However, the current code always adds new chunk to the end of the list, no matter what the priority is. Fix this by finding the correct place in the list and put vchunk there.
Fixes: 22a677661f56 ("mlxsw: spectrum: Introduce ACL core with simple TCAM implementation") Signed-off-by: Jiri Pirko jiri@mellanox.com Signed-off-by: Ido Schimmel idosch@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c @@ -986,8 +986,9 @@ mlxsw_sp_acl_tcam_vchunk_create(struct m unsigned int priority, struct mlxsw_afk_element_usage *elusage) { + struct mlxsw_sp_acl_tcam_vchunk *vchunk, *vchunk2; struct mlxsw_sp_acl_tcam_vregion *vregion; - struct mlxsw_sp_acl_tcam_vchunk *vchunk; + struct list_head *pos; int err;
if (priority == MLXSW_SP_ACL_TCAM_CATCHALL_PRIO) @@ -1025,7 +1026,14 @@ mlxsw_sp_acl_tcam_vchunk_create(struct m }
mlxsw_sp_acl_tcam_rehash_ctx_vregion_changed(vregion); - list_add_tail(&vchunk->list, &vregion->vchunk_list); + + /* Position the vchunk inside the list according to priority */ + list_for_each(pos, &vregion->vchunk_list) { + vchunk2 = list_entry(pos, typeof(*vchunk2), list); + if (vchunk2->priority > priority) + break; + } + list_add_tail(&vchunk->list, pos); mutex_unlock(&vregion->lock);
return vchunk;
 
            From: Roman Mashak mrv@mojatatu.com
[ Upstream commit 38212bb31fe923d0a2c6299bd2adfbb84cddef2a ]
When a new neighbor entry has been added, event is generated but it does not include protocol, because its value is assigned after the event notification routine has run, so move protocol assignment code earlier.
Fixes: df9b0e30d44c ("neighbor: Add protocol attribute") Cc: David Ahern dsahern@gmail.com Signed-off-by: Roman Mashak mrv@mojatatu.com Reviewed-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/neighbour.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1954,6 +1954,9 @@ static int neigh_add(struct sk_buff *skb NEIGH_UPDATE_F_OVERRIDE_ISROUTER); }
+ if (protocol) + neigh->protocol = protocol; + if (ndm->ndm_flags & NTF_EXT_LEARNED) flags |= NEIGH_UPDATE_F_EXT_LEARNED;
@@ -1967,9 +1970,6 @@ static int neigh_add(struct sk_buff *skb err = __neigh_update(neigh, lladdr, ndm->ndm_state, flags, NETLINK_CB(skb).portid, extack);
- if (protocol) - neigh->protocol = protocol; - neigh_release(neigh);
out:
 
            From: Florian Fainelli f.fainelli@gmail.com
[ Upstream commit 050569fc8384c8056bacefcc246bcb2dfe574936 ]
When ndo_get_phys_port_name() for the CPU port was added we introduced an early check for when the DSA master network device in dsa_master_ndo_setup() already implements ndo_get_phys_port_name(). When we perform the teardown operation in dsa_master_ndo_teardown() we would not be checking that cpu_dp->orig_ndo_ops was successfully allocated and non-NULL initialized.
With network device drivers such as virtio_net, this leads to a NPD as soon as the DSA switch hanging off of it gets torn down because we are now assigning the virtio_net device's netdev_ops a NULL pointer.
Fixes: da7b9e9b00d4 ("net: dsa: Add ndo_get_phys_port_name() for CPU port") Reported-by: Allen Pais allen.pais@oracle.com Signed-off-by: Florian Fainelli f.fainelli@gmail.com Tested-by: Allen Pais allen.pais@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/dsa/master.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/dsa/master.c +++ b/net/dsa/master.c @@ -259,7 +259,8 @@ static void dsa_master_ndo_teardown(stru { struct dsa_port *cpu_dp = dev->dsa_ptr;
- dev->netdev_ops = cpu_dp->orig_ndo_ops; + if (cpu_dp->orig_ndo_ops) + dev->netdev_ops = cpu_dp->orig_ndo_ops; cpu_dp->orig_ndo_ops = NULL; }
 
            From: Dejin Zheng zhengdejin5@gmail.com
[ Upstream commit b959c77dac09348955f344104c6a921ebe104753 ]
A call of the function macb_init() can fail in the function fu540_c000_init. The related system resources were not released then. use devm_platform_ioremap_resource() to replace ioremap() to fix it.
Fixes: c218ad559020ff9 ("macb: Add support for SiFive FU540-C000") Cc: Andy Shevchenko andy.shevchenko@gmail.com Reviewed-by: Yash Shah yash.shah@sifive.com Suggested-by: Nicolas Ferre nicolas.ferre@microchip.com Suggested-by: Andy Shevchenko andy.shevchenko@gmail.com Signed-off-by: Dejin Zheng zhengdejin5@gmail.com Acked-by: Nicolas Ferre nicolas.ferre@microchip.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/cadence/macb_main.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-)
--- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -4054,15 +4054,9 @@ static int fu540_c000_clk_init(struct pl
static int fu540_c000_init(struct platform_device *pdev) { - struct resource *res; - - res = platform_get_resource(pdev, IORESOURCE_MEM, 1); - if (!res) - return -ENODEV; - - mgmt->reg = ioremap(res->start, resource_size(res)); - if (!mgmt->reg) - return -ENOMEM; + mgmt->reg = devm_platform_ioremap_resource(pdev, 1); + if (IS_ERR(mgmt->reg)) + return PTR_ERR(mgmt->reg);
return macb_init(pdev); }
 
            From: Scott Dial scott@scottdial.com
[ Upstream commit ab046a5d4be4c90a3952a0eae75617b49c0cb01b ]
MACsec decryption always occurs in a softirq context. Since the FPU may not be usable in the softirq context, the call to decrypt may be scheduled on the cryptd work queue. The cryptd work queue does not provide ordering guarantees. Therefore, preserving order requires masking out ASYNC implementations of gcm(aes).
For instance, an Intel CPU with AES-NI makes available the generic-gcm-aesni driver from the aesni_intel module to implement gcm(aes). However, this implementation requires the FPU, so it is not always available to use from a softirq context, and will fallback to the cryptd work queue, which does not preserve frame ordering. With this change, such a system would select gcm_base(ctr(aes-aesni),ghash-generic). While the aes-aesni implementation prefers to use the FPU, it will fallback to the aes-asm implementation if unavailable.
By using a synchronous version of gcm(aes), the decryption will complete before returning from crypto_aead_decrypt(). Therefore, the macsec_decrypt_done() callback will be called before returning from macsec_decrypt(). Thus, the order of calls to macsec_post_decrypt() for the frames is preserved.
While it's presumable that the pure AES-NI version of gcm(aes) is more performant, the hybrid solution is capable of gigabit speeds on modest hardware. Regardless, preserving the order of frames is paramount for many network protocols (e.g., triggering TCP retries). Within the MACsec driver itself, the replay protection is tripped by the out-of-order frames, and can cause frames to be dropped.
This bug has been present in this code since it was added in v4.6, however it may not have been noticed since not all CPUs have FPU offload available. Additionally, the bug manifests as occasional out-of-order packets that are easily misattributed to other network phenomena.
When this code was added in v4.6, the crypto/gcm.c code did not restrict selection of the ghash function based on the ASYNC flag. For instance, x86 CPUs with PCLMULQDQ would select the ghash-clmulni driver instead of ghash-generic, which submits to the cryptd work queue if the FPU is busy. However, this bug was was corrected in v4.8 by commit b30bdfa86431afbafe15284a3ad5ac19b49b88e3, and was backported all the way back to the v3.14 stable branch, so this patch should be applicable back to the v4.6 stable branch.
Signed-off-by: Scott Dial scott@scottdial.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/macsec.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -1309,7 +1309,8 @@ static struct crypto_aead *macsec_alloc_ struct crypto_aead *tfm; int ret;
- tfm = crypto_alloc_aead("gcm(aes)", 0, 0); + /* Pick a sync gcm(aes) cipher to ensure order is preserved. */ + tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm)) return tfm;
 
            From: Tariq Toukan tariqt@mellanox.com
[ Upstream commit 40e473071dbad04316ddc3613c3a3d1c75458299 ]
When ENOSPC is set the idx is still valid and gets set to the global MLX4_SINK_COUNTER_INDEX. However gcc's static analysis cannot tell that ENOSPC is impossible from mlx4_cmd_imm() and gives this warning:
drivers/net/ethernet/mellanox/mlx4/main.c:2552:28: warning: 'idx' may be used uninitialized in this function [-Wmaybe-uninitialized] 2552 | priv->def_counter[port] = idx;
Also, when ENOSPC is returned mlx4_allocate_default_counters should not fail.
Fixes: 6de5f7f6a1fa ("net/mlx4_core: Allocate default counter per port") Signed-off-by: Jason Gunthorpe jgg@mellanox.com Signed-off-by: Tariq Toukan tariqt@mellanox.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -2550,6 +2550,7 @@ static int mlx4_allocate_default_counter
if (!err || err == -ENOSPC) { priv->def_counter[port] = idx; + err = 0; } else if (err == -ENOENT) { err = 0; continue; @@ -2600,7 +2601,8 @@ int mlx4_counter_alloc(struct mlx4_dev * MLX4_CMD_TIME_CLASS_A, MLX4_CMD_WRAPPED); if (!err) *idx = get_param_l(&out_param); - + if (WARN_ON(err == -ENOSPC)) + err = -EINVAL; return err; } return __mlx4_counter_alloc(dev, idx);
 
            From: Eric Dumazet edumazet@google.com
[ Upstream commit 2761121af87de45951989a0adada917837d8fa82 ]
Do not assume the attribute has the right size.
Fixes: aea5f654e6b7 ("net/sched: add skbprio scheduler") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_skbprio.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/net/sched/sch_skbprio.c +++ b/net/sched/sch_skbprio.c @@ -169,6 +169,9 @@ static int skbprio_change(struct Qdisc * { struct tc_skbprio_qopt *ctl = nla_data(opt);
+ if (opt->nla_len != nla_attr_size(sizeof(*ctl))) + return -EINVAL; + sch->limit = ctl->limit; return 0; }
 
            From: Willem de Bruijn willemb@google.com
[ Upstream commit 9274124f023b5c56dc4326637d4f787968b03607 ]
Syzkaller again found a path to a kernel crash through bad gso input: a packet with transport header extending beyond skb_headlen(skb).
Tighten validation at kernel entry:
- Verify that the transport header lies within the linear section.
To avoid pulling linux/tcp.h, verify just sizeof tcphdr. tcp_gso_segment will call pskb_may_pull (th->doff * 4) before use.
- Match the gso_type against the ip_proto found by the flow dissector.
Fixes: bfd5f4a3d605 ("packet: Add GSO/csum offload support.") Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/virtio_net.h | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-)
--- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -3,6 +3,8 @@ #define _LINUX_VIRTIO_NET_H
#include <linux/if_vlan.h> +#include <uapi/linux/tcp.h> +#include <uapi/linux/udp.h> #include <uapi/linux/virtio_net.h>
static inline int virtio_net_hdr_set_proto(struct sk_buff *skb, @@ -28,17 +30,25 @@ static inline int virtio_net_hdr_to_skb( bool little_endian) { unsigned int gso_type = 0; + unsigned int thlen = 0; + unsigned int ip_proto;
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { case VIRTIO_NET_HDR_GSO_TCPV4: gso_type = SKB_GSO_TCPV4; + ip_proto = IPPROTO_TCP; + thlen = sizeof(struct tcphdr); break; case VIRTIO_NET_HDR_GSO_TCPV6: gso_type = SKB_GSO_TCPV6; + ip_proto = IPPROTO_TCP; + thlen = sizeof(struct tcphdr); break; case VIRTIO_NET_HDR_GSO_UDP: gso_type = SKB_GSO_UDP; + ip_proto = IPPROTO_UDP; + thlen = sizeof(struct udphdr); break; default: return -EINVAL; @@ -57,16 +67,22 @@ static inline int virtio_net_hdr_to_skb(
if (!skb_partial_csum_set(skb, start, off)) return -EINVAL; + + if (skb_transport_offset(skb) + thlen > skb_headlen(skb)) + return -EINVAL; } else { /* gso packets without NEEDS_CSUM do not set transport_offset. * probe and drop if does not match one of the above types. */ if (gso_type && skb->network_header) { + struct flow_keys_basic keys; + if (!skb->protocol) virtio_net_hdr_set_proto(skb, hdr); retry: - skb_probe_transport_header(skb); - if (!skb_transport_header_was_set(skb)) { + if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, + NULL, 0, 0, 0, + 0)) { /* UFO does not specify ipv4 or 6: try both */ if (gso_type & SKB_GSO_UDP && skb->protocol == htons(ETH_P_IP)) { @@ -75,6 +91,12 @@ retry: } return -EINVAL; } + + if (keys.control.thoff + thlen > skb_headlen(skb) || + keys.basic.ip_proto != ip_proto) + return -EINVAL; + + skb_set_transport_header(skb, keys.control.thoff); } }
 
            From: Anthony Felice tony.felice@timesys.com
[ Upstream commit 4b5b71f770e2edefbfe74203777264bfe6a9927c ]
Commit 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode") updated ethernet drivers to use a linkmode bitmap. It mistakenly dropped a bitwise negation in the tc35815 ethernet driver on a bitmask to set the supported/advertising flags.
Found by Anthony via code inspection, not tested as I do not have the required hardware.
Fixes: 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode") Signed-off-by: Anthony Felice tony.felice@timesys.com Reviewed-by: Akshay Bhat akshay.bhat@timesys.com Reviewed-by: Heiner Kallweit hkallweit1@gmail.com Reviewed-by: Andrew Lunn andrew@lunn.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/toshiba/tc35815.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/toshiba/tc35815.c +++ b/drivers/net/ethernet/toshiba/tc35815.c @@ -644,7 +644,7 @@ static int tc_mii_probe(struct net_devic linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT, mask); linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT, mask); } - linkmode_and(phydev->supported, phydev->supported, mask); + linkmode_andnot(phydev->supported, phydev->supported, mask); linkmode_copy(phydev->advertising, phydev->supported);
lp->link = 0;
 
            From: Xiyu Yang xiyuyang19@fudan.edu.cn
[ Upstream commit 095f5614bfe16e5b3e191b34ea41b10d6fdd4ced ]
bpf_exec_tx_verdict() invokes sk_psock_get(), which returns a reference of the specified sk_psock object to "psock" with increased refcnt.
When bpf_exec_tx_verdict() returns, local variable "psock" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of bpf_exec_tx_verdict(). When "policy" equals to NULL but "psock" is not NULL, the function forgets to decrease the refcnt increased by sk_psock_get(), causing a refcnt leak.
Fix this issue by calling sk_psock_put() on this error path before bpf_exec_tx_verdict() returns.
Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_sw.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -797,6 +797,8 @@ static int bpf_exec_tx_verdict(struct sk *copied -= sk_msg_free(sk, msg); tls_free_open_rec(sk); } + if (psock) + sk_psock_put(sk, psock); return err; } more_data:
 
            From: Xiyu Yang xiyuyang19@fudan.edu.cn
[ Upstream commit 62b4011fa7bef9fa00a6aeec26e69685dc1cc21e ]
tls_data_ready() invokes sk_psock_get(), which returns a reference of the specified sk_psock object to "psock" with increased refcnt.
When tls_data_ready() returns, local variable "psock" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of tls_data_ready(). When "psock->ingress_msg" is empty but "psock" is not NULL, the function forgets to decrease the refcnt increased by sk_psock_get(), causing a refcnt leak.
Fix this issue by calling sk_psock_put() on all paths when "psock" is not NULL.
Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_sw.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -2078,8 +2078,9 @@ static void tls_data_ready(struct sock * strp_data_ready(&ctx->strp);
psock = sk_psock_get(sk); - if (psock && !list_empty(&psock->ingress_msg)) { - ctx->saved_data_ready(sk); + if (psock) { + if (!list_empty(&psock->ingress_msg)) + ctx->saved_data_ready(sk); sk_psock_put(sk, psock); } }
 
            From: Matt Jolly Kangie@footclan.ninja
[ Upstream commit 57c7f2bd758eed867295c81d3527fff4fab1ed74 ]
Add support for Dell Wireless 5816e to drivers/net/usb/qmi_wwan.c
Signed-off-by: Matt Jolly Kangie@footclan.ninja Acked-by: Bjørn Mork bjorn@mork.no Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1359,6 +1359,7 @@ static const struct usb_device_id produc {QMI_FIXED_INTF(0x413c, 0x81b3, 8)}, /* Dell Wireless 5809e Gobi(TM) 4G LTE Mobile Broadband Card (rev3) */ {QMI_FIXED_INTF(0x413c, 0x81b6, 8)}, /* Dell Wireless 5811e */ {QMI_FIXED_INTF(0x413c, 0x81b6, 10)}, /* Dell Wireless 5811e */ + {QMI_FIXED_INTF(0x413c, 0x81cc, 8)}, /* Dell Wireless 5816e */ {QMI_FIXED_INTF(0x413c, 0x81d7, 0)}, /* Dell Wireless 5821e */ {QMI_FIXED_INTF(0x413c, 0x81d7, 1)}, /* Dell Wireless 5821e preproduction config */ {QMI_FIXED_INTF(0x413c, 0x81e0, 0)}, /* Dell Wireless 5821e with eSIM support*/
 
            From: Qiushi Wu wu000273@umn.edu
[ Upstream commit bd4af432cc71b5fbfe4833510359a6ad3ada250d ]
In function nfp_abm_vnic_set_mac, pointer nsp is allocated by nfp_nsp_open. But when nfp_nsp_has_hwinfo_lookup fail, the pointer is not released, which can lead to a memory leak bug. Fix this issue by adding nfp_nsp_close(nsp) in the error path.
Fixes: f6e71efdf9fb1 ("nfp: abm: look up MAC addresses via management FW") Signed-off-by: Qiushi Wu wu000273@umn.edu Acked-by: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/netronome/nfp/abm/main.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/net/ethernet/netronome/nfp/abm/main.c +++ b/drivers/net/ethernet/netronome/nfp/abm/main.c @@ -283,6 +283,7 @@ nfp_abm_vnic_set_mac(struct nfp_pf *pf, if (!nfp_nsp_has_hwinfo_lookup(nsp)) { nfp_warn(pf->cpp, "NSP doesn't support PF MAC generation\n"); eth_hw_addr_random(nn->dp.netdev); + nfp_nsp_close(nsp); return; }
 
            From: Eric Dumazet edumazet@google.com
[ Upstream commit 8738c85c72b3108c9b9a369a39868ba5f8e10ae0 ]
If choke_init() could not allocate q->tab, we would crash later in choke_reset().
BUG: KASAN: null-ptr-deref in memset include/linux/string.h:366 [inline] BUG: KASAN: null-ptr-deref in choke_reset+0x208/0x340 net/sched/sch_choke.c:326 Write of size 8 at addr 0000000000000000 by task syz-executor822/7022
CPU: 1 PID: 7022 Comm: syz-executor822 Not tainted 5.7.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 __kasan_report.cold+0x5/0x4d mm/kasan/report.c:515 kasan_report+0x33/0x50 mm/kasan/common.c:625 check_memory_region_inline mm/kasan/generic.c:187 [inline] check_memory_region+0x141/0x190 mm/kasan/generic.c:193 memset+0x20/0x40 mm/kasan/common.c:85 memset include/linux/string.h:366 [inline] choke_reset+0x208/0x340 net/sched/sch_choke.c:326 qdisc_reset+0x6b/0x520 net/sched/sch_generic.c:910 dev_deactivate_queue.constprop.0+0x13c/0x240 net/sched/sch_generic.c:1138 netdev_for_each_tx_queue include/linux/netdevice.h:2197 [inline] dev_deactivate_many+0xe2/0xba0 net/sched/sch_generic.c:1195 dev_deactivate+0xf8/0x1c0 net/sched/sch_generic.c:1233 qdisc_graft+0xd25/0x1120 net/sched/sch_api.c:1051 tc_modify_qdisc+0xbab/0x1a00 net/sched/sch_api.c:1670 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5454 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
Fixes: 77e62da6e60c ("sch_choke: drop all packets in queue during reset") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Cc: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_choke.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/sched/sch_choke.c +++ b/net/sched/sch_choke.c @@ -323,7 +323,8 @@ static void choke_reset(struct Qdisc *sc
sch->q.qlen = 0; sch->qstats.backlog = 0; - memset(q->tab, 0, (q->tab_mask + 1) * sizeof(struct sk_buff *)); + if (q->tab) + memset(q->tab, 0, (q->tab_mask + 1) * sizeof(struct sk_buff *)); q->head = q->tail = 0; red_restart(&q->vars); }
 
            From: Eric Dumazet edumazet@google.com
[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]
syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue()
More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range.
Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000.
Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/sch_sfq.c | 9 +++++++++ 1 file changed, 9 insertions(+)
--- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
 
            From: Tuong Lien tuong.t.lien@dektech.com.au
[ Upstream commit 980d69276f3048af43a045be2925dacfb898a7be ]
When an application connects to the TIPC topology server and subscribes to some services, a new connection is created along with some objects - 'tipc_subscription' to store related data correspondingly... However, there is one omission in the connection handling that when the connection or application is orderly shutdown (e.g. via SIGQUIT, etc.), the connection is not closed in kernel, the 'tipc_subscription' objects are not freed too. This results in: - The maximum number of subscriptions (65535) will be reached soon, new subscriptions will be rejected; - TIPC module cannot be removed (unless the objects are somehow forced to release first);
The commit fixes the issue by closing the connection if the 'recvmsg()' returns '0' i.e. when the peer is shutdown gracefully. It also includes the other unexpected cases.
Acked-by: Jon Maloy jmaloy@redhat.com Acked-by: Ying Xue ying.xue@windriver.com Signed-off-by: Tuong Lien tuong.t.lien@dektech.com.au Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tipc/topsrv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -402,10 +402,11 @@ static int tipc_conn_rcv_from_sock(struc read_lock_bh(&sk->sk_callback_lock); ret = tipc_conn_rcv_sub(srv, con, &s); read_unlock_bh(&sk->sk_callback_lock); + if (!ret) + return 0; } - if (ret < 0) - tipc_conn_close(con);
+ tipc_conn_close(con); return ret; }
 
            From: "Toke H�iland-J�rgensen" toke@redhat.com
[ Upstream commit b723748750ece7d844cdf2f52c01d37f83387208 ]
RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header to the inner header if that inner header is already marked as ECT(0). When RFC 6040 decapsulation was implemented, this case of propagation was not added. This simply appears to be an oversight, so let's fix that.
Fixes: eccc1bb8d4b4 ("tunnel: drop packet if ECN present with not-ECT") Reported-by: Bob Briscoe ietf@bobbriscoe.net Reported-by: Olivier Tilmans olivier.tilmans@nokia-bell-labs.com Cc: Dave Taht dave.taht@gmail.com Cc: Stephen Hemminger stephen@networkplumber.org Signed-off-by: Toke Høiland-Jørgensen toke@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/inet_ecn.h | 57 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-)
--- a/include/net/inet_ecn.h +++ b/include/net/inet_ecn.h @@ -99,6 +99,20 @@ static inline int IP_ECN_set_ce(struct i return 1; }
+static inline int IP_ECN_set_ect1(struct iphdr *iph) +{ + u32 check = (__force u32)iph->check; + + if ((iph->tos & INET_ECN_MASK) != INET_ECN_ECT_0) + return 0; + + check += (__force u16)htons(0x100); + + iph->check = (__force __sum16)(check + (check>=0xFFFF)); + iph->tos ^= INET_ECN_MASK; + return 1; +} + static inline void IP_ECN_clear(struct iphdr *iph) { iph->tos &= ~INET_ECN_MASK; @@ -134,6 +148,22 @@ static inline int IP6_ECN_set_ce(struct return 1; }
+static inline int IP6_ECN_set_ect1(struct sk_buff *skb, struct ipv6hdr *iph) +{ + __be32 from, to; + + if ((ipv6_get_dsfield(iph) & INET_ECN_MASK) != INET_ECN_ECT_0) + return 0; + + from = *(__be32 *)iph; + to = from ^ htonl(INET_ECN_MASK << 20); + *(__be32 *)iph = to; + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_add(csum_sub(skb->csum, (__force __wsum)from), + (__force __wsum)to); + return 1; +} + static inline void ipv6_copy_dscp(unsigned int dscp, struct ipv6hdr *inner) { dscp &= ~INET_ECN_MASK; @@ -159,6 +189,25 @@ static inline int INET_ECN_set_ce(struct return 0; }
+static inline int INET_ECN_set_ect1(struct sk_buff *skb) +{ + switch (skb->protocol) { + case cpu_to_be16(ETH_P_IP): + if (skb_network_header(skb) + sizeof(struct iphdr) <= + skb_tail_pointer(skb)) + return IP_ECN_set_ect1(ip_hdr(skb)); + break; + + case cpu_to_be16(ETH_P_IPV6): + if (skb_network_header(skb) + sizeof(struct ipv6hdr) <= + skb_tail_pointer(skb)) + return IP6_ECN_set_ect1(skb, ipv6_hdr(skb)); + break; + } + + return 0; +} + /* * RFC 6040 4.2 * To decapsulate the inner header at the tunnel egress, a compliant @@ -208,8 +257,12 @@ static inline int INET_ECN_decapsulate(s int rc;
rc = __INET_ECN_decapsulate(outer, inner, &set_ce); - if (!rc && set_ce) - INET_ECN_set_ce(skb); + if (!rc) { + if (set_ce) + INET_ECN_set_ce(skb); + else if ((outer & INET_ECN_MASK) == INET_ECN_ECT_1) + INET_ECN_set_ect1(skb); + }
return rc; }
 
            From: Michael Chan michael.chan@broadcom.com
[ Upstream commit c71c4e49afe173823a2a85b0cabc9b3f1176ffa2 ]
Fix the logic that sets the enable/disable flag for the source MAC filter according to firmware spec 1.7.1.
In the original firmware spec. before 1.7.1, the VF spoof check flags were not latched after making the HWRM_FUNC_CFG call, so there was a need to keep the func_flags so that subsequent calls would perserve the VF spoof check setting. A change was made in the 1.7.1 spec so that the flags became latched. So we now set or clear the anti- spoof setting directly without retrieving the old settings in the stored vf->func_flags which are no longer valid. We also remove the unneeded vf->func_flags.
Fixes: 8eb992e876a8 ("bnxt_en: Update firmware interface spec to 1.7.6.2.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 - drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 10 ++-------- 2 files changed, 2 insertions(+), 9 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -1058,7 +1058,6 @@ struct bnxt_vf_info { #define BNXT_VF_LINK_FORCED 0x4 #define BNXT_VF_LINK_UP 0x8 #define BNXT_VF_TRUST 0x10 - u32 func_flags; /* func cfg flags */ u32 min_tx_rate; u32 max_tx_rate; void *hwrm_cmd_req_addr; --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c @@ -85,11 +85,10 @@ int bnxt_set_vf_spoofchk(struct net_devi if (old_setting == setting) return 0;
- func_flags = vf->func_flags; if (setting) - func_flags |= FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_ENABLE; + func_flags = FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_ENABLE; else - func_flags |= FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_DISABLE; + func_flags = FUNC_CFG_REQ_FLAGS_SRC_MAC_ADDR_CHECK_DISABLE; /*TODO: if the driver supports VLAN filter on guest VLAN, * the spoof check should also include vlan anti-spoofing */ @@ -98,7 +97,6 @@ int bnxt_set_vf_spoofchk(struct net_devi req.flags = cpu_to_le32(func_flags); rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); if (!rc) { - vf->func_flags = func_flags; if (setting) vf->flags |= BNXT_VF_SPOOFCHK; else @@ -230,7 +228,6 @@ int bnxt_set_vf_mac(struct net_device *d memcpy(vf->mac_addr, mac, ETH_ALEN); bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_MAC_ADDR); memcpy(req.dflt_mac_addr, mac, ETH_ALEN); return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); @@ -268,7 +265,6 @@ int bnxt_set_vf_vlan(struct net_device *
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.dflt_vlan = cpu_to_le16(vlan_tag); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_VLAN); rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT); @@ -307,7 +303,6 @@ int bnxt_set_vf_bw(struct net_device *de return 0; bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags); req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_MAX_BW); req.max_bw = cpu_to_le32(max_tx_rate); req.enables |= cpu_to_le32(FUNC_CFG_REQ_ENABLES_MIN_BW); @@ -479,7 +474,6 @@ static void __bnxt_set_vf_params(struct vf = &bp->pf.vf[vf_id]; bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1); req.fid = cpu_to_le16(vf->fw_fid); - req.flags = cpu_to_le32(vf->func_flags);
if (is_valid_ether_addr(vf->mac_addr)) { req.enables |= cpu_to_le32(FUNC_CFG_REQ_ENABLES_DFLT_MAC_ADDR);
 
            From: Vasundhara Volam vasundhara-v.volam@broadcom.com
[ Upstream commit 9e68cb0359b20f99c7b070f1d3305e5e0a9fae6d ]
Broadcom adapters support only maximum of 512 CQs per PF. If user sets MSIx vectors more than supported CQs, firmware is setting incorrect value for msix_vec_per_pf_max parameter. Fix it by reducing the BNXT_MSIX_VEC_MAX value to 512, even though the maximum # of MSIx vectors supported by adapter are 1280.
Fixes: f399e8497826 ("bnxt_en: Use msix_vec_per_pf_max and msix_vec_per_pf_min devlink params.") Signed-off-by: Vasundhara Volam vasundhara-v.volam@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h @@ -39,7 +39,7 @@ static inline void bnxt_link_bp_to_dl(st #define NVM_OFF_DIS_GRE_VER_CHECK 171 #define NVM_OFF_ENABLE_SRIOV 401
-#define BNXT_MSIX_VEC_MAX 1280 +#define BNXT_MSIX_VEC_MAX 512 #define BNXT_MSIX_VEC_MIN_MAX 128
enum bnxt_nvm_dir_type {
 
            From: Michael Chan michael.chan@broadcom.com
[ Upstream commit bae361c54fb6ac6eba3b4762f49ce14beb73ef13 ]
Improve the slot reset sequence by disabling the device to prevent bad DMAs if slot reset fails. Return the proper result instead of always PCI_ERS_RESULT_RECOVERED to the caller.
Fixes: 6316ea6db93d ("bnxt_en: Enable AER support.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -12066,12 +12066,15 @@ static pci_ers_result_t bnxt_io_slot_res } }
- if (result != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) - dev_close(netdev); + if (result != PCI_ERS_RESULT_RECOVERED) { + if (netif_running(netdev)) + dev_close(netdev); + pci_disable_device(pdev); + }
rtnl_unlock();
- return PCI_ERS_RESULT_RECOVERED; + return result; }
/**
 
            From: Michael Chan michael.chan@broadcom.com
[ Upstream commit bbf211b1ecb891c7e0cc7888834504183fc8b534 ]
bnxt_alloc_ctx_pg_tbls() should return error when the memory size of the context memory to set up is zero. By returning success (0), the caller may proceed normally and may crash later when it tries to set up the memory.
Fixes: 08fe9d181606 ("bnxt_en: Add Level 2 context memory paging support.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -6649,7 +6649,7 @@ static int bnxt_alloc_ctx_pg_tbls(struct int rc;
if (!mem_size) - return 0; + return -EINVAL;
ctx_pg->nr_pages = DIV_ROUND_UP(mem_size, BNXT_PAGE_SIZE); if (ctx_pg->nr_pages > MAX_CTX_TOTAL_PAGES) {
 
            From: Michael Chan michael.chan@broadcom.com
[ Upstream commit c72cb303aa6c2ae7e4184f0081c6d11bf03fb96b ]
The current logic in bnxt_fix_features() will inadvertently turn on both CTAG and STAG VLAN offload if the user tries to disable both. Fix it by checking that the user is trying to enable CTAG or STAG before enabling both. The logic is supposed to enable or disable both CTAG and STAG together.
Fixes: 5a9f6b238e59 ("bnxt_en: Enable and disable RX CTAG and RX STAG VLAN acceleration together.") Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -9755,6 +9755,7 @@ static netdev_features_t bnxt_fix_featur netdev_features_t features) { struct bnxt *bp = netdev_priv(dev); + netdev_features_t vlan_features;
if ((features & NETIF_F_NTUPLE) && !bnxt_rfs_capable(bp)) features &= ~NETIF_F_NTUPLE; @@ -9771,12 +9772,14 @@ static netdev_features_t bnxt_fix_featur /* Both CTAG and STAG VLAN accelaration on the RX side have to be * turned on or off together. */ - if ((features & (NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX)) != - (NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX)) { + vlan_features = features & (NETIF_F_HW_VLAN_CTAG_RX | + NETIF_F_HW_VLAN_STAG_RX); + if (vlan_features != (NETIF_F_HW_VLAN_CTAG_RX | + NETIF_F_HW_VLAN_STAG_RX)) { if (dev->features & NETIF_F_HW_VLAN_CTAG_RX) features &= ~(NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX); - else + else if (vlan_features) features |= NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX; }
 
            From: Erez Shitrit erezsh@mellanox.com
[ Upstream commit 8075411d93b6efe143d9f606f6531077795b7fbf ]
In polling mode, set arm_db member to a value that will avoid CQ event recovery by the HW. Otherwise we might get event without completion function. In addition,empty completion function to was added to protect from unexpected events.
Fixes: 297cccebdc5a ("net/mlx5: DR, Expose an internal API to issue RDMA operations") Signed-off-by: Erez Shitrit erezsh@mellanox.com Reviewed-by: Tariq Toukan tariqt@mellanox.com Reviewed-by: Alex Vesker valex@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c | 14 ++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c @@ -689,6 +689,12 @@ static void dr_cq_event(struct mlx5_core pr_info("CQ event %u on CQ #%u\n", event, mcq->cqn); }
+static void dr_cq_complete(struct mlx5_core_cq *mcq, + struct mlx5_eqe *eqe) +{ + pr_err("CQ completion CQ: #%u\n", mcq->cqn); +} + static struct mlx5dr_cq *dr_create_cq(struct mlx5_core_dev *mdev, struct mlx5_uars_page *uar, size_t ncqe) @@ -750,6 +756,7 @@ static struct mlx5dr_cq *dr_create_cq(st mlx5_fill_page_frag_array(&cq->wq_ctrl.buf, pas);
cq->mcq.event = dr_cq_event; + cq->mcq.comp = dr_cq_complete;
err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out)); kvfree(in); @@ -761,7 +768,12 @@ static struct mlx5dr_cq *dr_create_cq(st cq->mcq.set_ci_db = cq->wq_ctrl.db.db; cq->mcq.arm_db = cq->wq_ctrl.db.db + 1; *cq->mcq.set_ci_db = 0; - *cq->mcq.arm_db = 0; + + /* set no-zero value, in order to avoid the HW to run db-recovery on + * CQ that used in polling mode. + */ + *cq->mcq.arm_db = cpu_to_be32(2 << 28); + cq->mcq.vector = 0; cq->mcq.irqn = irqn; cq->mcq.uar = uar;
 
            From: Moshe Shemesh moshe@mellanox.com
[ Upstream commit f3cb3cebe26ed4c8036adbd9448b372129d3c371 ]
mlx5_cmd_flush() will trigger forced completions to all valid command entries. Triggered by an asynch event such as fast teardown it can happen at any stage of the command, including command initialization. It will trigger forced completion and that can lead to completion on an uninitialized command entry.
Setting MLX5_CMD_ENT_STATE_PENDING_COMP only after command entry is initialized will ensure force completion is treated only if command entry is initialized.
Fixes: 73dd3a4839c1 ("net/mlx5: Avoid using pending command interface slots") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Eran Ben Elisha eranbe@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -888,7 +888,6 @@ static void cmd_work_handler(struct work }
cmd->ent_arr[ent->idx] = ent; - set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state); lay = get_inst(cmd, ent->idx); ent->lay = lay; memset(lay, 0, sizeof(*lay)); @@ -910,6 +909,7 @@ static void cmd_work_handler(struct work
if (ent->callback) schedule_delayed_work(&ent->cb_timeout_work, cb_timeout); + set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
/* Skip sending command to fw if internal error */ if (pci_channel_offline(dev->pdev) ||
 
            From: Moshe Shemesh moshe@mellanox.com
[ Upstream commit cece6f432cca9f18900463ed01b97a152a03600a ]
Processing commands by cmd_work_handler() while already in Internal Error State will result in entry leak, since the handler process force completion without doorbell. Forced completion doesn't release the entry and event completion will never arrive, so entry should be released.
Fixes: 73dd3a4839c1 ("net/mlx5: Avoid using pending command interface slots") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Eran Ben Elisha eranbe@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -922,6 +922,10 @@ static void cmd_work_handler(struct work MLX5_SET(mbox_out, ent->out, syndrome, drv_synd);
mlx5_cmd_comp_handler(dev, 1UL << ent->idx, true); + /* no doorbell, no need to keep the entry */ + free_ent(cmd, ent->idx); + if (ent->callback) + free_cmd(ent); return; }
 
            From: Dan Carpenter dan.carpenter@oracle.com
[ Upstream commit 39bd16df7c31bb8cf5dfd0c88e42abd5ae10029d ]
The "rss_context" variable comes from the user via ethtool_get_rxfh(). It can be any u32 value except zero. Eventually it gets passed to mvpp22_rss_ctx() and if it is over MVPP22_N_RSS_TABLES (8) then it results in an array overflow.
Fixes: 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -4319,6 +4319,8 @@ static int mvpp2_ethtool_get_rxfh_contex
if (!mvpp22_rss_is_supported()) return -EOPNOTSUPP; + if (rss_context >= MVPP22_N_RSS_TABLES) + return -EINVAL;
if (hfunc) *hfunc = ETH_RSS_HASH_CRC32;
 
            From: Dan Carpenter dan.carpenter@oracle.com
[ Upstream commit 722c0f00d4feea77475a5dc943b53d60824a1e4e ]
The "info->fs.location" is a u32 that comes from the user via the ethtool_set_rxnfc() function. We need to check for invalid values to prevent a buffer overflow.
I copy and pasted this check from the mvpp2_ethtool_cls_rule_ins() function.
Fixes: 90b509b39ac9 ("net: mvpp2: cls: Add Classification offload support") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c @@ -1422,6 +1422,9 @@ int mvpp2_ethtool_cls_rule_del(struct mv struct mvpp2_ethtool_fs *efs; int ret;
+ if (info->fs.location >= MVPP2_N_RFS_ENTRIES_PER_FLOW) + return -EINVAL; + efs = port->rfs_rules[info->fs.location]; if (!efs) return -EINVAL;
 
            From: Jason Gerecke jason.gerecke@wacom.com
commit 778fbf4179991e7652e97d7f1ca1f657ef828422 upstream.
We've recently switched from extracting the value of HID_DG_CONTACTMAX at a fixed offset (which may not be correct for all tablets) to injecting the report into the driver for the generic codepath to handle. Unfortunately, this change was made for *all* tablets, even those which aren't generic. Because `wacom_wac_report` ignores reports from non- generic devices, the contact count never gets initialized. Ultimately this results in the touch device itself failing to probe, and thus the loss of touch input.
This commit adds back the fixed-offset extraction for non-generic devices.
Link: https://github.com/linuxwacom/input-wacom/issues/155 Fixes: 184eccd40389 ("HID: wacom: generic: read HID_DG_CONTACTMAX from any feature report") Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Reviewed-by: Aaron Armstrong Skomra aaron.skomra@wacom.com CC: stable@vger.kernel.org # 5.3+ Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Cc: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_sys.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/hid/wacom_sys.c +++ b/drivers/hid/wacom_sys.c @@ -319,9 +319,11 @@ static void wacom_feature_mapping(struct data[0] = field->report->id; ret = wacom_get_report(hdev, HID_FEATURE_REPORT, data, n, WAC_CMD_RETRIES); - if (ret == n) { + if (ret == n && features->type == HID_GENERIC) { ret = hid_report_raw_event(hdev, HID_FEATURE_REPORT, data, n, 0); + } else if (ret == 2 && features->type != HID_GENERIC) { + features->touch_max = data[1]; } else { features->touch_max = 16; hid_warn(hdev, "wacom_feature_mapping: "
 
            From: Jere Leppänen jere.leppanen@nokia.com
commit 145cb2f7177d94bc54563ed26027e952ee0ae03c upstream.
When we start shutdown in sctp_sf_do_dupcook_a(), we want to bundle the SHUTDOWN with the COOKIE-ACK to ensure that the peer receives them at the same time and in the correct order. This bundling was broken by commit 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc"), which assigns a transport for the COOKIE-ACK, but not for the SHUTDOWN.
Fix this by passing a reference to the COOKIE-ACK chunk as an argument to sctp_sf_do_9_2_start_shutdown() and onward to sctp_make_shutdown(). This way the SHUTDOWN chunk is assigned the same transport as the COOKIE-ACK chunk, which allows them to be bundled.
In sctp_sf_do_9_2_start_shutdown(), the void *arg parameter was previously unused. Now that we're taking it into use, it must be a valid pointer to a chunk, or NULL. There is only one call site where it's not, in sctp_sf_autoclose_timer_expire(). Fix that too.
Fixes: 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc") Signed-off-by: Jere Leppänen jere.leppanen@nokia.com Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Cc: Guenter Roeck linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/sctp/sm_statefuns.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/sctp/sm_statefuns.c +++ b/net/sctp/sm_statefuns.c @@ -1865,7 +1865,7 @@ static enum sctp_disposition sctp_sf_do_ */ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl)); return sctp_sf_do_9_2_start_shutdown(net, ep, asoc, - SCTP_ST_CHUNK(0), NULL, + SCTP_ST_CHUNK(0), repl, commands); } else { sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE, @@ -5470,7 +5470,7 @@ enum sctp_disposition sctp_sf_do_9_2_sta * in the Cumulative TSN Ack field the last sequential TSN it * has received from the peer. */ - reply = sctp_make_shutdown(asoc, NULL); + reply = sctp_make_shutdown(asoc, arg); if (!reply) goto nomem;
@@ -6068,7 +6068,7 @@ enum sctp_disposition sctp_sf_autoclose_ disposition = SCTP_DISPOSITION_CONSUME; if (sctp_outq_is_empty(&asoc->outqueue)) { disposition = sctp_sf_do_9_2_start_shutdown(net, ep, asoc, type, - arg, commands); + NULL, commands); }
return disposition;
 
            From: Jason Gerecke killertofu@gmail.com
commit b43f977dd281945960c26b3ef67bba0fa07d39d9 upstream.
This reverts commit 15893fa40109f5e7c67eeb8da62267d0fdf0be9d.
The referenced commit broke pen and touch input for a variety of devices such as the Cintiq Pro 32. Affected devices may appear to work normally for a short amount of time, but eventually loose track of actual touch state and can leave touch arbitration enabled which prevents the pen from working. The commit is not itself required for any currently-available Bluetooth device, and so we revert it to correct the behavior of broken devices.
This breakage occurs due to a mismatch between the order of collections and the order of usages on some devices. This commit tries to read the contact count before processing events, but will fail if the contact count does not occur prior to the first logical finger collection. This is the case for devices like the Cintiq Pro 32 which place the contact count at the very end of the report.
Without the contact count set, touches will only be partially processed. The `wacom_wac_finger_slot` function will not open any slots since the number of contacts seen is greater than the expectation of 0, but we will still end up calling `input_mt_sync_frame` for each finger anyway. This can cause problems for userspace separate from the issue currently taking place in the kernel. Only once all of the individual finger collections have been processed do we finally get to the enclosing collection which contains the contact count. The value ends up being used for the *next* report, however.
This delayed use of the contact count can cause the driver to loose track of the actual touch state and believe that there are contacts down when there aren't. This leaves touch arbitration enabled and prevents the pen from working. It can also cause userspace to incorrectly treat single- finger input as gestures.
Link: https://github.com/linuxwacom/input-wacom/issues/146 Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Reviewed-by: Aaron Armstrong Skomra aaron.skomra@wacom.com Fixes: 15893fa40109 ("HID: wacom: generic: read the number of expected touches on a per collection basis") Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_wac.c | 79 +++++++++--------------------------------------- 1 file changed, 16 insertions(+), 63 deletions(-)
--- a/drivers/hid/wacom_wac.c +++ b/drivers/hid/wacom_wac.c @@ -2637,9 +2637,25 @@ static void wacom_wac_finger_pre_report( case HID_DG_TIPSWITCH: hid_data->last_slot_field = equivalent_usage; break; + case HID_DG_CONTACTCOUNT: + hid_data->cc_report = report->id; + hid_data->cc_index = i; + hid_data->cc_value_index = j; + break; } } } + + if (hid_data->cc_report != 0 && + hid_data->cc_index >= 0) { + struct hid_field *field = report->field[hid_data->cc_index]; + int value = field->value[hid_data->cc_value_index]; + if (value) + hid_data->num_expected = value; + } + else { + hid_data->num_expected = wacom_wac->features.touch_max; + } }
static void wacom_wac_finger_report(struct hid_device *hdev, @@ -2649,7 +2665,6 @@ static void wacom_wac_finger_report(stru struct wacom_wac *wacom_wac = &wacom->wacom_wac; struct input_dev *input = wacom_wac->touch_input; unsigned touch_max = wacom_wac->features.touch_max; - struct hid_data *hid_data = &wacom_wac->hid_data;
/* If more packets of data are expected, give us a chance to * process them rather than immediately syncing a partial @@ -2663,7 +2678,6 @@ static void wacom_wac_finger_report(stru
input_sync(input); wacom_wac->hid_data.num_received = 0; - hid_data->num_expected = 0;
/* keep touch state for pen event */ wacom_wac->shared->touch_down = wacom_wac_finger_count_touches(wacom_wac); @@ -2738,73 +2752,12 @@ static void wacom_report_events(struct h } }
-static void wacom_set_num_expected(struct hid_device *hdev, - struct hid_report *report, - int collection_index, - struct hid_field *field, - int field_index) -{ - struct wacom *wacom = hid_get_drvdata(hdev); - struct wacom_wac *wacom_wac = &wacom->wacom_wac; - struct hid_data *hid_data = &wacom_wac->hid_data; - unsigned int original_collection_level = - hdev->collection[collection_index].level; - bool end_collection = false; - int i; - - if (hid_data->num_expected) - return; - - // find the contact count value for this segment - for (i = field_index; i < report->maxfield && !end_collection; i++) { - struct hid_field *field = report->field[i]; - unsigned int field_level = - hdev->collection[field->usage[0].collection_index].level; - unsigned int j; - - if (field_level != original_collection_level) - continue; - - for (j = 0; j < field->maxusage; j++) { - struct hid_usage *usage = &field->usage[j]; - - if (usage->collection_index != collection_index) { - end_collection = true; - break; - } - if (wacom_equivalent_usage(usage->hid) == HID_DG_CONTACTCOUNT) { - hid_data->cc_report = report->id; - hid_data->cc_index = i; - hid_data->cc_value_index = j; - - if (hid_data->cc_report != 0 && - hid_data->cc_index >= 0) { - - struct hid_field *field = - report->field[hid_data->cc_index]; - int value = - field->value[hid_data->cc_value_index]; - - if (value) - hid_data->num_expected = value; - } - } - } - } - - if (hid_data->cc_report == 0 || hid_data->cc_index < 0) - hid_data->num_expected = wacom_wac->features.touch_max; -} - static int wacom_wac_collection(struct hid_device *hdev, struct hid_report *report, int collection_index, struct hid_field *field, int field_index) { struct wacom *wacom = hid_get_drvdata(hdev);
- if (WACOM_FINGER_FIELD(field)) - wacom_set_num_expected(hdev, report, collection_index, field, - field_index); wacom_report_events(hdev, report, collection_index, field_index);
/*
 
            From: Alan Stern stern@rowland.harvard.edu
commit 0ed08faded1da03eb3def61502b27f81aef2e615 upstream.
The syzbot fuzzer discovered a bad race between in the usbhid driver between usbhid_stop() and usbhid_close(). In particular, usbhid_stop() does:
usb_free_urb(usbhid->urbin); ... usbhid->urbin = NULL; /* don't mess up next start */
and usbhid_close() does:
usb_kill_urb(usbhid->urbin);
with no mutual exclusion. If the two routines happen to run concurrently so that usb_kill_urb() is called in between the usb_free_urb() and the NULL assignment, it will access the deallocated urb structure -- a use-after-free bug.
This patch adds a mutex to the usbhid private structure and uses it to enforce mutual exclusion of the usbhid_start(), usbhid_stop(), usbhid_open() and usbhid_close() callbacks.
Reported-and-tested-by: syzbot+7bf5a7b0f0a1f9446f4c@syzkaller.appspotmail.com Signed-off-by: Alan Stern stern@rowland.harvard.edu CC: stable@vger.kernel.org Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/usbhid/hid-core.c | 37 +++++++++++++++++++++++++++++-------- drivers/hid/usbhid/usbhid.h | 1 + 2 files changed, 30 insertions(+), 8 deletions(-)
--- a/drivers/hid/usbhid/hid-core.c +++ b/drivers/hid/usbhid/hid-core.c @@ -682,16 +682,21 @@ static int usbhid_open(struct hid_device struct usbhid_device *usbhid = hid->driver_data; int res;
+ mutex_lock(&usbhid->mutex); + set_bit(HID_OPENED, &usbhid->iofl);
- if (hid->quirks & HID_QUIRK_ALWAYS_POLL) - return 0; + if (hid->quirks & HID_QUIRK_ALWAYS_POLL) { + res = 0; + goto Done; + }
res = usb_autopm_get_interface(usbhid->intf); /* the device must be awake to reliably request remote wakeup */ if (res < 0) { clear_bit(HID_OPENED, &usbhid->iofl); - return -EIO; + res = -EIO; + goto Done; }
usbhid->intf->needs_remote_wakeup = 1; @@ -725,6 +730,9 @@ static int usbhid_open(struct hid_device msleep(50);
clear_bit(HID_RESUME_RUNNING, &usbhid->iofl); + + Done: + mutex_unlock(&usbhid->mutex); return res; }
@@ -732,6 +740,8 @@ static void usbhid_close(struct hid_devi { struct usbhid_device *usbhid = hid->driver_data;
+ mutex_lock(&usbhid->mutex); + /* * Make sure we don't restart data acquisition due to * a resumption we no longer care about by avoiding racing @@ -743,12 +753,13 @@ static void usbhid_close(struct hid_devi clear_bit(HID_IN_POLLING, &usbhid->iofl); spin_unlock_irq(&usbhid->lock);
- if (hid->quirks & HID_QUIRK_ALWAYS_POLL) - return; + if (!(hid->quirks & HID_QUIRK_ALWAYS_POLL)) { + hid_cancel_delayed_stuff(usbhid); + usb_kill_urb(usbhid->urbin); + usbhid->intf->needs_remote_wakeup = 0; + }
- hid_cancel_delayed_stuff(usbhid); - usb_kill_urb(usbhid->urbin); - usbhid->intf->needs_remote_wakeup = 0; + mutex_unlock(&usbhid->mutex); }
/* @@ -1057,6 +1068,8 @@ static int usbhid_start(struct hid_devic unsigned int n, insize = 0; int ret;
+ mutex_lock(&usbhid->mutex); + clear_bit(HID_DISCONNECTED, &usbhid->iofl);
usbhid->bufsize = HID_MIN_BUFFER_SIZE; @@ -1177,6 +1190,8 @@ static int usbhid_start(struct hid_devic usbhid_set_leds(hid); device_set_wakeup_enable(&dev->dev, 1); } + + mutex_unlock(&usbhid->mutex); return 0;
fail: @@ -1187,6 +1202,7 @@ fail: usbhid->urbout = NULL; usbhid->urbctrl = NULL; hid_free_buffers(dev, hid); + mutex_unlock(&usbhid->mutex); return ret; }
@@ -1202,6 +1218,8 @@ static void usbhid_stop(struct hid_devic usbhid->intf->needs_remote_wakeup = 0; }
+ mutex_lock(&usbhid->mutex); + clear_bit(HID_STARTED, &usbhid->iofl); spin_lock_irq(&usbhid->lock); /* Sync with error and led handlers */ set_bit(HID_DISCONNECTED, &usbhid->iofl); @@ -1222,6 +1240,8 @@ static void usbhid_stop(struct hid_devic usbhid->urbout = NULL;
hid_free_buffers(hid_to_usb_dev(hid), hid); + + mutex_unlock(&usbhid->mutex); }
static int usbhid_power(struct hid_device *hid, int lvl) @@ -1382,6 +1402,7 @@ static int usbhid_probe(struct usb_inter INIT_WORK(&usbhid->reset_work, hid_reset); timer_setup(&usbhid->io_retry, hid_retry_timeout, 0); spin_lock_init(&usbhid->lock); + mutex_init(&usbhid->mutex);
ret = hid_add_device(hid); if (ret) { --- a/drivers/hid/usbhid/usbhid.h +++ b/drivers/hid/usbhid/usbhid.h @@ -80,6 +80,7 @@ struct usbhid_device { dma_addr_t outbuf_dma; /* Output buffer dma */ unsigned long last_out; /* record of last output for timeouts */
+ struct mutex mutex; /* start/stop/open/close */ spinlock_t lock; /* fifo spinlock */ unsigned long iofl; /* I/O flags (CTRL_RUNNING, OUT_RUNNING) */ struct timer_list io_retry; /* Retry timer */
 
            From: Jason Gerecke killertofu@gmail.com
commit dcce8ef8f70a8e38e6c47c1bae8b312376c04420 upstream.
The state of the center button was not reported to userspace for the 2nd-gen Intuos Pro S when used over Bluetooth due to the pad handling code not being updated to support its reduced number of buttons. This patch uses the actual number of buttons present on the tablet to assemble a button state bitmap.
Link: https://github.com/linuxwacom/xf86-input-wacom/issues/112 Fixes: cd47de45b855 ("HID: wacom: Add 2nd gen Intuos Pro Small support") Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/hid/wacom_wac.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/hid/wacom_wac.c +++ b/drivers/hid/wacom_wac.c @@ -1427,11 +1427,13 @@ static void wacom_intuos_pro2_bt_pad(str { struct input_dev *pad_input = wacom->pad_input; unsigned char *data = wacom->data; + int nbuttons = wacom->features.numbered_buttons;
- int buttons = data[282] | ((data[281] & 0x40) << 2); + int expresskeys = data[282]; + int center = (data[281] & 0x40) >> 6; int ring = data[285] & 0x7F; bool ringstatus = data[285] & 0x80; - bool prox = buttons || ringstatus; + bool prox = expresskeys || center || ringstatus;
/* Fix touchring data: userspace expects 0 at left and increasing clockwise */ ring = 71 - ring; @@ -1439,7 +1441,8 @@ static void wacom_intuos_pro2_bt_pad(str if (ring > 71) ring -= 72;
- wacom_report_numbered_buttons(pad_input, 9, buttons); + wacom_report_numbered_buttons(pad_input, nbuttons, + expresskeys | (center << (nbuttons - 1)));
input_report_abs(pad_input, ABS_WHEEL, ringstatus ? ring : 0);
 
            From: Oliver Neukum oneukum@suse.com
commit 9f04db234af691007bb785342a06abab5fb34474 upstream.
This device needs US_FL_NO_REPORT_OPCODES to avoid going through prolonged error handling on enumeration.
Signed-off-by: Oliver Neukum oneukum@suse.com Reported-by: Julian Groß julian.g@posteo.de Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20200429155218.7308-1-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/storage/unusual_uas.h | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/drivers/usb/storage/unusual_uas.h +++ b/drivers/usb/storage/unusual_uas.h @@ -28,6 +28,13 @@ * and don't forget to CC: the USB development list linux-usb@vger.kernel.org */
+/* Reported-by: Julian Groß julian.g@posteo.de */ +UNUSUAL_DEV(0x059f, 0x105f, 0x0000, 0x9999, + "LaCie", + "2Big Quadra USB3", + USB_SC_DEVICE, USB_PR_DEVICE, NULL, + US_FL_NO_REPORT_OPCODES), + /* * Apricorn USB3 dongle sometimes returns "USBSUSBSUSBS" in response to SCSI * commands in UAS mode. Observed with the 1.28 firmware; are there others?
 
            From: Bryan O'Donoghue bryan.odonoghue@linaro.org
commit 91edf63d5022bd0464788ffb4acc3d5febbaf81d upstream.
Currently we check to make sure there is no error state on the extcon handle for VBUS when writing to the HS_PHY_GENCONFIG_2 register. When using the USB role-switch API we still need to write to this register absent an extcon handle.
This patch makes the appropriate update to ensure the write happens if role-switching is true.
Fixes: 05559f10ed79 ("usb: chipidea: add role switch class support") Cc: stable stable@vger.kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Philipp Zabel p.zabel@pengutronix.de Cc: linux-usb@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Stephen Boyd swboyd@chromium.org Signed-off-by: Bryan O'Donoghue bryan.odonoghue@linaro.org Signed-off-by: Peter Chen peter.chen@nxp.com Link: https://lore.kernel.org/r/20200507004918.25975-2-peter.chen@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/chipidea/ci_hdrc_msm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/usb/chipidea/ci_hdrc_msm.c +++ b/drivers/usb/chipidea/ci_hdrc_msm.c @@ -114,7 +114,7 @@ static int ci_hdrc_msm_notify_event(stru hw_write_id_reg(ci, HS_PHY_GENCONFIG_2, HS_PHY_ULPI_TX_PKT_EN_CLR_FIX, 0);
- if (!IS_ERR(ci->platdata->vbus_extcon.edev)) { + if (!IS_ERR(ci->platdata->vbus_extcon.edev) || ci->role_switch) { hw_write_id_reg(ci, HS_PHY_GENCONFIG_2, HS_PHY_SESS_VLD_CTRL_EN, HS_PHY_SESS_VLD_CTRL_EN);
 
            From: Oliver Neukum oneukum@suse.com
commit e9b3c610a05c1cdf8e959a6d89c38807ff758ee6 upstream.
We must not process packets shorter than a packet ID
Signed-off-by: Oliver Neukum oneukum@suse.com Reported-and-tested-by: syzbot+d29e9263e13ce0b9f4fd@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable stable@vger.kernel.org Signed-off-by: Johan Hovold johan@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/usb/serial/garmin_gps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/usb/serial/garmin_gps.c +++ b/drivers/usb/serial/garmin_gps.c @@ -1138,8 +1138,8 @@ static void garmin_read_process(struct g send it directly to the tty port */ if (garmin_data_p->flags & FLAGS_QUEUING) { pkt_add(garmin_data_p, data, data_length); - } else if (bulk_data || - getLayerId(data) == GARMIN_LAYERID_APPL) { + } else if (bulk_data || (data_length >= sizeof(u32) && + getLayerId(data) == GARMIN_LAYERID_APPL)) {
spin_lock_irqsave(&garmin_data_p->lock, flags); garmin_data_p->flags |= APP_RESP_SEEN;
 
            From: Steven Rostedt (VMware) rostedt@goodmis.org
commit 11f5efc3ab66284f7aaacc926e9351d658e2577b upstream.
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults.
Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear.
Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home
Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" tz.stoyanov@gmail.com Suggested-by: Joerg Roedel jroedel@suse.de Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/trace/trace.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
--- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8318,6 +8318,19 @@ static int allocate_trace_buffers(struct */ allocate_snapshot = false; #endif + + /* + * Because of some magic with the way alloc_percpu() works on + * x86_64, we need to synchronize the pgd of all the tables, + * otherwise the trace events that happen in x86_64 page fault + * handlers can't cope with accessing the chance that a + * alloc_percpu()'d memory might be touched in the page fault trace + * event. Oh, and we need to audit all other alloc_percpu() and vmalloc() + * calls in tracing, because something might get triggered within a + * page fault trace event! + */ + vmalloc_sync_mappings(); + return 0; }
 
            From: Jason A. Donenfeld Jason@zx2c4.com
commit a9a8ba90fa5857c2c8a0e32eef2159cec717da11 upstream.
Rather than chunking via PAGE_SIZE, this commit changes the arch implementations to chunk in explicit 4k parts, so that calculations on maximum acceptable latency don't suddenly become invalid on platforms where PAGE_SIZE isn't 4k, such as arm64.
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305") Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Reviewed-by: Eric Biggers ebiggers@google.com Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +- arch/x86/crypto/nhpoly1305-avx2-glue.c | 2 +- arch/x86/crypto/nhpoly1305-sse2-glue.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-)
--- a/arch/arm/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); --- a/arch/arm64/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); --- a/arch/x86/crypto/nhpoly1305-avx2-glue.c +++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2); --- a/arch/x86/crypto/nhpoly1305-sse2-glue.c +++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
 
            From: Christian Borntraeger borntraeger@de.ibm.com
commit 5615e74f48dcc982655543e979b6c3f3f877e6f6 upstream.
In LPAR we will only get an intercept for FC==3 for the PQAP instruction. Running nested under z/VM can result in other intercepts as well as ECA_APIE is an effective bit: If one hypervisor layer has turned this bit off, the end result will be that we will get intercepts for all function codes. Usually the first one will be a query like PQAP(QCI). So the WARN_ON_ONCE is not right. Let us simply remove it.
Cc: Pierre Morel pmorel@linux.ibm.com Cc: Tony Krowiak akrowiak@linux.ibm.com Cc: stable@vger.kernel.org # v5.3+ Fixes: e5282de93105 ("s390: ap: kvm: add PQAP interception for AQIC") Link: https://lore.kernel.org/kvm/20200505083515.2720-1-borntraeger@de.ibm.com Reported-by: Qian Cai cailca@icloud.com Signed-off-by: Christian Borntraeger borntraeger@de.ibm.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Cornelia Huck cohuck@redhat.com Signed-off-by: Christian Borntraeger borntraeger@de.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/s390/kvm/priv.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/arch/s390/kvm/priv.c +++ b/arch/s390/kvm/priv.c @@ -626,10 +626,12 @@ static int handle_pqap(struct kvm_vcpu * * available for the guest are AQIC and TAPQ with the t bit set * since we do not set IC.3 (FIII) we currently will only intercept * the AQIC function code. + * Note: running nested under z/VM can result in intercepts for other + * function codes, e.g. PQAP(QCI). We do not support this and bail out. */ reg0 = vcpu->run->s.regs.gprs[0]; fc = (reg0 >> 24) & 0xff; - if (WARN_ON_ONCE(fc != 0x03)) + if (fc != 0x03) return -EOPNOTSUPP;
/* PQAP instruction is allowed for guest kernel only */
 
            From: Sean Christopherson sean.j.christopherson@intel.com
commit c7cb2d650c9e78c03bd2d1c0db89891825f8c0f4 upstream.
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the RSB has always clobbered RFLAGS, its current incarnation just happens clear CF and ZF in the processs. Relying on the macro to clear CF and ZF is extremely fragile, e.g. commit 089dd8e53126e ("x86/speculation: Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such that the ZF flag is always set.
Reported-by: Qian Cai cai@lca.pw Cc: Rick Edgecombe rick.p.edgecombe@intel.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: stable@vger.kernel.org Fixes: f2fde6a5bcfcf ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit") Signed-off-by: Sean Christopherson sean.j.christopherson@intel.com Message-Id: 20200506035355.2242-1-sean.j.christopherson@intel.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/vmx/vmenter.S | 3 +++ 1 file changed, 3 insertions(+)
--- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -86,6 +86,9 @@ ENTRY(vmx_vmexit) /* IMPORTANT: Stuff the RSB immediately after VM-Exit, before RET! */ FILL_RETURN_BUFFER %_ASM_AX, RSB_CLEAR_LOOPS, X86_FEATURE_RETPOLINE
+ /* Clear RFLAGS.CF and RFLAGS.ZF to preserve VM-Exit, i.e. !VM-Fail. */ + or $1, %_ASM_AX + pop %_ASM_AX .Lvmexit_skip_rsb: #endif
 
            From: Marc Zyngier maz@kernel.org
commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream.
When deciding whether a guest has to be stopped we check whether this is a private interrupt or not. Unfortunately, there's an off-by-one bug here, and we fail to recognize a whole range of interrupts as being global (GICv2 SPIs 32-63).
Fix the condition from > to be >=.
Cc: stable@vger.kernel.org Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race") Reported-by: André Przywara andre.przywara@arm.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- virt/kvm/arm/vgic/vgic-mmio.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/virt/kvm/arm/vgic/vgic-mmio.c +++ b/virt/kvm/arm/vgic/vgic-mmio.c @@ -389,7 +389,7 @@ static void vgic_mmio_change_active(stru static void vgic_change_active_prepare(struct kvm_vcpu *vcpu, u32 intid) { if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 || - intid > VGIC_NR_PRIVATE_IRQS) + intid >= VGIC_NR_PRIVATE_IRQS) kvm_arm_halt_guest(vcpu->kvm); }
@@ -397,7 +397,7 @@ static void vgic_change_active_prepare(s static void vgic_change_active_finish(struct kvm_vcpu *vcpu, u32 intid) { if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 || - intid > VGIC_NR_PRIVATE_IRQS) + intid >= VGIC_NR_PRIVATE_IRQS) kvm_arm_resume_guest(vcpu->kvm); }
 
            From: Marc Zyngier maz@kernel.org
commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream.
In the unlikely event that a 32bit vcpu traps into the hypervisor on an instruction that is located right at the end of the 32bit range, the emulation of that instruction is going to increment PC past the 32bit range. This isn't great, as userspace can then observe this value and get a bit confused.
Conversly, userspace can do things like (in the context of a 64bit guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0, set PC to a 64bit value, change PSTATE to AArch32-USR, and observe that PC hasn't been truncated. More confusion.
Fix both by: - truncating PC increments for 32bit guests - sanitizing all 32bit regs every time a core reg is changed by userspace, and that PSTATE indicates a 32bit mode.
Cc: stable@vger.kernel.org Acked-by: Will Deacon will@kernel.org Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm64/kvm/guest.c | 7 +++++++ virt/kvm/arm/hyp/aarch32.c | 8 ++++++-- 2 files changed, 13 insertions(+), 2 deletions(-)
--- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -202,6 +202,13 @@ static int set_core_reg(struct kvm_vcpu }
memcpy((u32 *)regs + off, valp, KVM_REG_SIZE(reg->id)); + + if (*vcpu_cpsr(vcpu) & PSR_MODE32_BIT) { + int i; + + for (i = 0; i < 16; i++) + *vcpu_reg32(vcpu, i) = (u32)*vcpu_reg32(vcpu, i); + } out: return err; } --- a/virt/kvm/arm/hyp/aarch32.c +++ b/virt/kvm/arm/hyp/aarch32.c @@ -125,12 +125,16 @@ static void __hyp_text kvm_adjust_itstat */ void __hyp_text kvm_skip_instr32(struct kvm_vcpu *vcpu, bool is_wide_instr) { + u32 pc = *vcpu_pc(vcpu); bool is_thumb;
is_thumb = !!(*vcpu_cpsr(vcpu) & PSR_AA32_T_BIT); if (is_thumb && !is_wide_instr) - *vcpu_pc(vcpu) += 2; + pc += 2; else - *vcpu_pc(vcpu) += 4; + pc += 4; + + *vcpu_pc(vcpu) = pc; + kvm_adjust_itstate(vcpu); }
 
            From: Mark Rutland mark.rutland@arm.com
commit 027d0c7101f50cf03aeea9eebf484afd4920c8d3 upstream.
The static analyzer in GCC 10 spotted that in huge_pte_alloc() we may pass a NULL pmdp into pte_alloc_map() when pmd_alloc() returns NULL:
| CC arch/arm64/mm/pageattr.o | CC arch/arm64/mm/hugetlbpage.o | from arch/arm64/mm/hugetlbpage.c:10: | arch/arm64/mm/hugetlbpage.c: In function ‘huge_pte_alloc’: | ./arch/arm64/include/asm/pgtable-types.h:28:24: warning: dereference of NULL ‘pmdp’ [CWE-690] [-Wanalyzer-null-dereference] | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’ | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’ | |arch/arm64/mm/hugetlbpage.c:232:10: | |./arch/arm64/include/asm/pgtable-types.h:28:24: | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’ | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’
This can only occur when the kernel cannot allocate a page, and so is unlikely to happen in practice before other systems start failing.
We can avoid this by bailing out if pmd_alloc() fails, as we do earlier in the function if pud_alloc() fails.
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Signed-off-by: Mark Rutland mark.rutland@arm.com Reported-by: Kyrill Tkachov kyrylo.tkachov@arm.com Cc: stable@vger.kernel.org # 4.5.x- Cc: Will Deacon will@kernel.org Signed-off-by: Catalin Marinas catalin.marinas@arm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/arm64/mm/hugetlbpage.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -230,6 +230,8 @@ pte_t *huge_pte_alloc(struct mm_struct * ptep = (pte_t *)pudp; } else if (sz == (CONT_PTE_SIZE)) { pmdp = pmd_alloc(mm, pudp, addr); + if (!pmdp) + return NULL;
WARN_ON(addr & (sz - 1)); /*
 
            From: H. Nikolaus Schaller hns@goldelico.com
commit c59359a02d14a7256cd508a4886b7d2012df2363 upstream.
so that the driver can load by matching the device tree if compiled as module.
Cc: stable@vger.kernel.org # v5.3+ Fixes: 90b86fcc47b4 ("DRM: Add KMS driver for the Ingenic JZ47xx SoCs") Signed-off-by: H. Nikolaus Schaller hns@goldelico.com Signed-off-by: Paul Cercueil paul@crapouillou.net Link: https://patchwork.freedesktop.org/patch/msgid/1694a29b7a3449b6b662cec33d1b33... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/gpu/drm/ingenic/ingenic-drm.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/gpu/drm/ingenic/ingenic-drm.c +++ b/drivers/gpu/drm/ingenic/ingenic-drm.c @@ -824,6 +824,7 @@ static const struct of_device_id ingenic { .compatible = "ingenic,jz4725b-lcd", .data = &jz4725b_soc_info }, { /* sentinel */ }, }; +MODULE_DEVICE_TABLE(of, ingenic_drm_of_match);
static struct platform_driver ingenic_drm_driver = { .driver = {
 
            From: Oleg Nesterov oleg@redhat.com
commit b5f2006144c6ae941726037120fa1001ddede784 upstream.
Commit cc731525f26a ("signal: Remove kernel interal si_code magic") changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info() to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h> #include <mqueue.h> #include <unistd.h> #include <sys/wait.h> #include <assert.h>
static int notified;
static void sigh(int sig) { notified = 1; }
int main(void) { signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL); assert(fd >= 0);
struct sigevent se = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGIO, }; assert(mq_notify(fd, &se) == 0);
if (!fork()) { assert(setuid(1) == 0); mq_send(fd, "",1,0); return 0; }
wait(NULL); mq_unlink("/mq"); assert(notified); return 0; }
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere] Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic") Reported-by: Yoji yoji.fujihar.min@gmail.com Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Manfred Spraul manfred@colorfullife.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: "Eric W. Biederman" ebiederm@xmission.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Markus Elfring elfring@users.sourceforge.net Cc: 1vier1@web.de Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.c... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- ipc/mqueue.c | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-)
--- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -82,6 +82,7 @@ struct mqueue_inode_info {
struct sigevent notify; struct pid *notify_owner; + u32 notify_self_exec_id; struct user_namespace *notify_user_ns; struct user_struct *user; /* user who created, for accounting */ struct sock *notify_sock; @@ -709,28 +710,44 @@ static void __do_notify(struct mqueue_in * synchronously. */ if (info->notify_owner && info->attr.mq_curmsgs == 1) { - struct kernel_siginfo sig_i; switch (info->notify.sigev_notify) { case SIGEV_NONE: break; - case SIGEV_SIGNAL: - /* sends signal */ + case SIGEV_SIGNAL: { + struct kernel_siginfo sig_i; + struct task_struct *task; + + /* do_mq_notify() accepts sigev_signo == 0, why?? */ + if (!info->notify.sigev_signo) + break;
clear_siginfo(&sig_i); sig_i.si_signo = info->notify.sigev_signo; sig_i.si_errno = 0; sig_i.si_code = SI_MESGQ; sig_i.si_value = info->notify.sigev_value; - /* map current pid/uid into info->owner's namespaces */ rcu_read_lock(); + /* map current pid/uid into info->owner's namespaces */ sig_i.si_pid = task_tgid_nr_ns(current, ns_of_pid(info->notify_owner)); - sig_i.si_uid = from_kuid_munged(info->notify_user_ns, current_uid()); + sig_i.si_uid = from_kuid_munged(info->notify_user_ns, + current_uid()); + /* + * We can't use kill_pid_info(), this signal should + * bypass check_kill_permission(). It is from kernel + * but si_fromuser() can't know this. + * We do check the self_exec_id, to avoid sending + * signals to programs that don't expect them. + */ + task = pid_task(info->notify_owner, PIDTYPE_TGID); + if (task && task->self_exec_id == + info->notify_self_exec_id) { + do_send_sig_info(info->notify.sigev_signo, + &sig_i, task, PIDTYPE_TGID); + } rcu_read_unlock(); - - kill_pid_info(info->notify.sigev_signo, - &sig_i, info->notify_owner); break; + } case SIGEV_THREAD: set_cookie(info->notify_cookie, NOTIFY_WOKENUP); netlink_sendskb(info->notify_sock, info->notify_cookie); @@ -1315,6 +1332,7 @@ retry: info->notify.sigev_signo = notification->sigev_signo; info->notify.sigev_value = notification->sigev_value; info->notify.sigev_notify = SIGEV_SIGNAL; + info->notify_self_exec_id = current->self_exec_id; break; }
 
            From: Roman Penyaev rpenyaev@suse.de
commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream.
This patch does two things:
- fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
- improves performance for events delivery.
The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in ep_poll() function.
This can lead to lost wakeups, because thread, which was woken up, can handle not all the events in ->rdllist. (in better words the problem is described here: https://lkml.org/lkml/2019/10/7/905)
The idea of the current patch is to use init_wait() instead of init_waitqueue_entry().
Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list, thus the next coming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups.
Problem is very well reproduced by the epoll60 test case [1].
Wait entry removal on wakeup has also performance benefits, because there is no need to take a ep->lock and remove wait entry from the queue after the successful wakeup. Here is the timing output of the epoll60 test case:
With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373):
real 0m6.970s user 0m49.786s sys 0m0.113s
After this patch:
real 0m5.220s user 0m36.879s sys 0m0.019s
The other testcase is the stress-epoll [2], where one thread consumes all the events and other threads produce many events:
With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373):
threads events/ms run-time ms 8 5427 1474 16 6163 2596 32 6824 4689 64 7060 9064 128 6991 18309
After this patch:
threads events/ms run-time ms 8 5598 1429 16 7073 2262 32 7502 4265 64 7640 8376 128 7634 16767
(number of "events/ms" represents event bandwidth, thus higher is better; number of "run-time ms" represents overall time spent doing the benchmark, thus lower is better)
[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
Signed-off-by: Roman Penyaev rpenyaev@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Jason Baron jbaron@akamai.com Cc: Khazhismel Kumykov khazhy@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Heiher r@hev.cc Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/eventpoll.c | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1827,7 +1827,6 @@ static int ep_poll(struct eventpoll *ep, { int res = 0, eavail, timed_out = 0; u64 slack = 0; - bool waiter = false; wait_queue_entry_t wait; ktime_t expires, *to = NULL;
@@ -1872,21 +1871,23 @@ fetch_events: */ ep_reset_busy_poll_napi_id(ep);
- /* - * We don't have any available event to return to the caller. We need - * to sleep here, and we will be woken by ep_poll_callback() when events - * become available. - */ - if (!waiter) { - waiter = true; - init_waitqueue_entry(&wait, current); - + do { + /* + * Internally init_wait() uses autoremove_wake_function(), + * thus wait entry is removed from the wait queue on each + * wakeup. Why it is important? In case of several waiters + * each new wakeup will hit the next waiter, giving it the + * chance to harvest new event. Otherwise wakeup can be + * lost. This is also good performance-wise, because on + * normal wakeup path no need to call __remove_wait_queue() + * explicitly, thus ep->lock is not taken, which halts the + * event delivery. + */ + init_wait(&wait); write_lock_irq(&ep->lock); __add_wait_queue_exclusive(&ep->wq, &wait); write_unlock_irq(&ep->lock); - }
- for (;;) { /* * We don't want to sleep if the ep_poll_callback() sends us * a wakeup in between. That's why we set the task state @@ -1916,10 +1917,20 @@ fetch_events: timed_out = 1; break; } - } + + /* We were woken up, thus go and try to harvest some events */ + eavail = 1; + + } while (0);
__set_current_state(TASK_RUNNING);
+ if (!list_empty_careful(&wait.entry)) { + write_lock_irq(&ep->lock); + __remove_wait_queue(&ep->wq, &wait); + write_unlock_irq(&ep->lock); + } + send_events: /* * Try to transfer events to user space. In case we get 0 events and @@ -1930,12 +1941,6 @@ send_events: !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events;
- if (waiter) { - write_lock_irq(&ep->lock); - __remove_wait_queue(&ep->wq, &wait); - write_unlock_irq(&ep->lock); - } - return res; }
 
            From: Khazhismel Kumykov khazhy@google.com
commit 0c54a6a44bf3d41e76ce3f583a6ece267618df2e upstream.
In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch.
[khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman] Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Khazhismel Kumykov khazhy@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Roman Penyaev rpenyaev@suse.de Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Roman Penyaev rpenyaev@suse.de Cc: Heiher r@hev.cc Cc: Jason Baron jbaron@akamai.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/eventpoll.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
--- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1176,6 +1176,10 @@ static inline bool chain_epi_lockless(st { struct eventpoll *ep = epi->ep;
+ /* Fast preliminary check */ + if (epi->next != EP_UNACTIVE_PTR) + return false; + /* Check that the same epi has not been just chained from another CPU */ if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR) return false; @@ -1242,16 +1246,12 @@ static int ep_poll_callback(wait_queue_e * chained in ep->ovflist and requeued later on. */ if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) { - if (epi->next == EP_UNACTIVE_PTR && - chain_epi_lockless(epi)) + if (chain_epi_lockless(epi)) + ep_pm_stay_awake_rcu(epi); + } else if (!ep_is_linked(epi)) { + /* In the usual case, add event to ready list. */ + if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) ep_pm_stay_awake_rcu(epi); - goto out_unlock; - } - - /* If this file is already in the ready list we exit soon */ - if (!ep_is_linked(epi) && - list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) { - ep_pm_stay_awake_rcu(epi); }
/*
 
            From: David Hildenbrand david@redhat.com
commit e84fe99b68ce353c37ceeecc95dce9696c976556 upstream.
Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up.
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40
The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger.
Signed-off-by: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Pavel Tatashin pasha.tatashin@soleen.com Reviewed-by: Pankaj Gupta pankaj.gupta.linux@gmail.com Reviewed-by: Baoquan He bhe@redhat.com Reviewed-by: Shile Zhang shile.zhang@linux.alibaba.com Acked-by: Michal Hocko mhocko@suse.com Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Shile Zhang shile.zhang@linux.alibaba.com Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michal Hocko mhocko@kernel.org Cc: Alexander Duyck alexander.duyck@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Oscar Salvador osalvador@suse.de Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 1 + 1 file changed, 1 insertion(+)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1555,6 +1555,7 @@ void set_zone_contiguous(struct zone *zo if (!__pageblock_pfn_to_page(block_start_pfn, block_end_pfn, zone)) return; + cond_resched(); }
/* We confirm that there is no hole */
 
            From: Henry Willard henry.willard@oracle.com
commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.
Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") adds a boost_watermark() function which increases the min watermark in a zone by at least pageblock_nr_pages or the number of pages in a page block.
On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or 512M. It does this regardless of the number of managed pages managed in the zone or the likelihood of success.
This can put the zone immediately under water in terms of allocating pages from the zone, and can cause a small machine to fail immediately due to OoM. Unlike set_recommended_min_free_kbytes(), which substantially increases min_free_kbytes and is tied to THP, boost_watermark() can be called even if THP is not active.
The problem is most likely to appear on architectures such as Arm64 where pageblock_nr_pages is very large.
It is desirable to run the kdump capture kernel in as small a space as possible to avoid wasting memory. In some architectures, such as Arm64, there are restrictions on where the capture kernel can run, and therefore, the space available. A capture kernel running in 768M can fail due to OoM immediately after boost_watermark() sets the min in zone DMA32, where most of the memory is, to 512M. It fails even though there is over 500M of free memory. With boost_watermark() suppressed, the capture kernel can run successfully in 448M.
This patch limits boost_watermark() to boosting a zone's min watermark only when there are enough pages that the boost will produce positive results. In this case that is estimated to be four times as many pages as pageblock_nr_pages.
Mel said:
: There is no harm in marking it stable. Clearly it does not happen very : often but it's not impossible. 32-bit x86 is a lot less common now : which would previously have been vulnerable to triggering this easily. : ppc64 has a larger base page size but typically only has one zone. : arm64 is likely the most vulnerable, particularly when CMA is : configured with a small movable zone.
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Henry Willard henry.willard@oracle.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: David Hildenbrand david@redhat.com Acked-by: Mel Gorman mgorman@techsingularity.net Cc: Vlastimil Babka vbabka@suse.cz Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@orac... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2351,6 +2351,14 @@ static inline void boost_watermark(struc
if (!watermark_boost_factor) return; + /* + * Don't bother in zones that are unlikely to produce results. + * On small machines, including kdump capture kernels running + * in a small area, boosting the watermark can cause an out of + * memory situation immediately. + */ + if ((pageblock_nr_pages * 4) > zone_managed_pages(zone)) + return;
max_boost = mult_frac(zone->_watermark[WMARK_HIGH], watermark_boost_factor, 10000);
 
            From: Jeff Layton jlayton@kernel.org
commit 0fa8263367db9287aa0632f96c1a5f93cc478150 upstream.
Eduard reported a problem mounting cephfs on s390 arch. The feature mask sent by the MDS is little-endian, so we need to convert it before storing and testing against it.
Cc: stable@vger.kernel.org Reported-and-Tested-by: Eduard Shishkin edward6@linux.ibm.com Signed-off-by: Jeff Layton jlayton@kernel.org Reviewed-by: "Yan, Zheng" zyan@redhat.com Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ceph/mds_client.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
--- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -3072,8 +3072,7 @@ static void handle_session(struct ceph_m void *end = p + msg->front.iov_len; struct ceph_mds_session_head *h; u32 op; - u64 seq; - unsigned long features = 0; + u64 seq, features = 0; int wake = 0; bool blacklisted = false;
@@ -3092,9 +3091,8 @@ static void handle_session(struct ceph_m goto bad; /* version >= 3, feature bits */ ceph_decode_32_safe(&p, end, len, bad); - ceph_decode_need(&p, end, len, bad); - memcpy(&features, p, min_t(size_t, len, sizeof(features))); - p += len; + ceph_decode_64_safe(&p, end, features, bad); + p += len - sizeof(features); }
mutex_lock(&mdsc->mutex);
 
            From: Luis Henriques lhenriques@suse.com
commit 12ae44a40a1be891bdc6463f8c7072b4ede746ef upstream.
A misconfigured cephx can easily result in having the kernel client flooding the logs with:
ceph: Can't lookup inode 1 (err: -13)
Change this message to debug level.
Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/44546 Signed-off-by: Luis Henriques lhenriques@suse.com Reviewed-by: Jeff Layton jlayton@kernel.org Signed-off-by: Ilya Dryomov idryomov@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ceph/quota.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/ceph/quota.c +++ b/fs/ceph/quota.c @@ -159,8 +159,8 @@ static struct inode *lookup_quotarealm_i }
if (IS_ERR(in)) { - pr_warn("Can't lookup inode %llx (err: %ld)\n", - realm->ino, PTR_ERR(in)); + dout("Can't lookup inode %llx (err: %ld)\n", + realm->ino, PTR_ERR(in)); qri->timeout = jiffies + msecs_to_jiffies(60 * 1000); /* XXX */ } else { qri->timeout = 0;
 
            From: Oscar Carter oscar.carter@gmx.com
commit 769acc3656d93aaacada814939743361d284fd87 upstream.
Check the return value of gasket_get_bar_index function as it can return a negative one (-EINVAL). If this happens, a negative index is used in the "gasket_dev->bar_data" array.
Addresses-Coverity-ID: 1438542 ("Negative array index read") Fixes: 9a69f5087ccc2 ("drivers/staging: Gasket driver framework + Apex driver") Signed-off-by: Oscar Carter oscar.carter@gmx.com Cc: stable stable@vger.kernel.org Reviewed-by: Richard Yeh rcy@google.com Link: https://lore.kernel.org/r/20200501155118.13380-1-oscar.carter@gmx.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/staging/gasket/gasket_core.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -926,6 +926,10 @@ do_map_region(const struct gasket_dev *g gasket_get_bar_index(gasket_dev, (vma->vm_pgoff << PAGE_SHIFT) + driver_desc->legacy_mmap_address_offset); + + if (bar_index < 0) + return DO_MAP_REGION_INVALID; + phys_base = gasket_dev->bar_data[bar_index].phys_base + phys_offset; while (mapped_bytes < map_length) { /*
 
            From: Luis Chamberlain mcgrof@kernel.org
commit 3740d93e37902b31159a82da2d5c8812ed825404 upstream.
Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did.
This has been reported by Sergey when one triggers a a coredump with the following configuration:
``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ```
The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file.
[ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ```
Cc: stable@vger.kernel.org # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon chainsaw@gentoo.org Reported-by: Sergey Kvachonok ravenexp@gmail.com Tested-by: Sergei Trofimovich slyfox@gentoo.org Signed-off-by: Luis Chamberlain mcgrof@kernel.org Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/coredump.c | 8 ++++++++ kernel/umh.c | 5 +++++ 2 files changed, 13 insertions(+)
--- a/fs/coredump.c +++ b/fs/coredump.c @@ -788,6 +788,14 @@ void do_coredump(const kernel_siginfo_t if (displaced) put_files_struct(displaced); if (!dump_interrupted()) { + /* + * umh disabled with CONFIG_STATIC_USERMODEHELPER_PATH="" would + * have this set to NULL. + */ + if (!cprm.file) { + pr_info("Core dump to |%s disabled\n", cn.corename); + goto close_fail; + } file_start_write(cprm.file); core_dumped = binfmt->core_dump(&cprm); file_end_write(cprm.file); --- a/kernel/umh.c +++ b/kernel/umh.c @@ -544,6 +544,11 @@ EXPORT_SYMBOL_GPL(fork_usermode_blob); * Runs a user-space application. The application is started * asynchronously if wait is not set, and runs as a child of system workqueues. * (ie. it runs with full root capabilities and optimized affinity). + * + * Note: successful return value does not guarantee the helper was called at + * all. You can't rely on sub_info->{init,cleanup} being called even for + * UMH_WAIT_* wait modes as STATIC_USERMODEHELPER_PATH="" turns all helpers + * into a successful no-op. */ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) {
 
            From: Vincent Chen vincent.chen@sifive.com
commit c749bb2d554825e007cbc43b791f54e124dadfce upstream.
The current max_pfn equals to zero. In this case, I found it caused users cannot get some page information through /proc such as kpagecount in v5.6 kernel because of new sanity checks. The following message is displayed by stress-ng test suite with the command "stress-ng --verbose --physpage 1 -t 1" on HiFive unleashed board.
# stress-ng --verbose --physpage 1 -t 1 stress-ng: debug: [109] 4 processors online, 4 processors configured stress-ng: info: [109] dispatching hogs: 1 physpage stress-ng: debug: [109] cache allocate: reducing cache level from L3 (too high) to L0 stress-ng: debug: [109] get_cpu_cache: invalid cache_level: 0 stress-ng: info: [109] cache allocate: using built-in defaults as no suitable cache found stress-ng: debug: [109] cache allocate: default cache size: 2048K stress-ng: debug: [109] starting stressors stress-ng: debug: [109] 1 stressor spawned stress-ng: debug: [110] stress-ng-physpage: started [110] (instance 0) stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd34de000 in /proc/kpagecount, errno=0 (Success) stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd32db078 in /proc/kpagecount, errno=0 (Success) ... stress-ng: error: [110] stress-ng-physpage: cannot read page count for address 0x3fd32db078 in /proc/kpagecount, errno=0 (Success) stress-ng: debug: [110] stress-ng-physpage: exited [110] (instance 0) stress-ng: debug: [109] process [110] terminated stress-ng: info: [109] successful run completed in 1.00s #
After applying this patch, the kernel can pass the test.
# stress-ng --verbose --physpage 1 -t 1 stress-ng: debug: [104] 4 processors online, 4 processors configured stress-ng: info: [104] dispatching hogs: 1 physpage stress-ng: info: [104] cache allocate: using defaults, can't determine cache details from sysfs stress-ng: debug: [104] cache allocate: default cache size: 2048K stress-ng: debug: [104] starting stressors stress-ng: debug: [104] 1 stressor spawned stress-ng: debug: [105] stress-ng-physpage: started [105] (instance 0) stress-ng: debug: [105] stress-ng-physpage: exited [105] (instance 0) stress-ng: debug: [104] process [105] terminated stress-ng: info: [104] successful run completed in 1.01s #
Cc: stable@vger.kernel.org Signed-off-by: Vincent Chen vincent.chen@sifive.com Reviewed-by: Anup Patel anup@brainfault.org Reviewed-by: Yash Shah yash.shah@sifive.com Tested-by: Yash Shah yash.shah@sifive.com Signed-off-by: Palmer Dabbelt palmerdabbelt@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/riscv/mm/init.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -116,7 +116,8 @@ void __init setup_bootmem(void) memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
set_max_mapnr(PFN_DOWN(mem_size)); - max_low_pfn = PFN_DOWN(memblock_end_of_DRAM()); + max_pfn = PFN_DOWN(memblock_end_of_DRAM()); + max_low_pfn = max_pfn;
#ifdef CONFIG_BLK_DEV_INITRD setup_initrd();
 
            From: Tejun Heo tj@kernel.org
commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation.
However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen.
1. The iocg is in the process of being deactivated by the periodic timer.
2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active.
3. The iocg is deactivated.
4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay.
5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs.
This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups.
The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path.
This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: Tejun Heo tj@kernel.org Reported-by: Vlad Dmitriev vvd@fb.com Cc: stable@vger.kernel.org # v5.4+ Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- block/blk-iocost.c | 117 ++++++++++++++++++++++++----------------- tools/cgroup/iocost_monitor.py | 7 ++ 2 files changed, 77 insertions(+), 47 deletions(-)
--- a/block/blk-iocost.c +++ b/block/blk-iocost.c @@ -469,7 +469,7 @@ struct ioc_gq { */ atomic64_t vtime; atomic64_t done_vtime; - atomic64_t abs_vdebt; + u64 abs_vdebt; u64 last_vtime;
/* @@ -1145,7 +1145,7 @@ static void iocg_kick_waitq(struct ioc_g struct iocg_wake_ctx ctx = { .iocg = iocg }; u64 margin_ns = (u64)(ioc->period_us * WAITQ_TIMER_MARGIN_PCT / 100) * NSEC_PER_USEC; - u64 abs_vdebt, vdebt, vshortage, expires, oexpires; + u64 vdebt, vshortage, expires, oexpires; s64 vbudget; u32 hw_inuse;
@@ -1155,18 +1155,15 @@ static void iocg_kick_waitq(struct ioc_g vbudget = now->vnow - atomic64_read(&iocg->vtime);
/* pay off debt */ - abs_vdebt = atomic64_read(&iocg->abs_vdebt); - vdebt = abs_cost_to_cost(abs_vdebt, hw_inuse); + vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse); if (vdebt && vbudget > 0) { u64 delta = min_t(u64, vbudget, vdebt); u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse), - abs_vdebt); + iocg->abs_vdebt);
atomic64_add(delta, &iocg->vtime); atomic64_add(delta, &iocg->done_vtime); - atomic64_sub(abs_delta, &iocg->abs_vdebt); - if (WARN_ON_ONCE(atomic64_read(&iocg->abs_vdebt) < 0)) - atomic64_set(&iocg->abs_vdebt, 0); + iocg->abs_vdebt -= abs_delta; }
/* @@ -1222,12 +1219,18 @@ static bool iocg_kick_delay(struct ioc_g u64 expires, oexpires; u32 hw_inuse;
+ lockdep_assert_held(&iocg->waitq.lock); + /* debt-adjust vtime */ current_hweight(iocg, NULL, &hw_inuse); - vtime += abs_cost_to_cost(atomic64_read(&iocg->abs_vdebt), hw_inuse); + vtime += abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
- /* clear or maintain depending on the overage */ - if (time_before_eq64(vtime, now->vnow)) { + /* + * Clear or maintain depending on the overage. Non-zero vdebt is what + * guarantees that @iocg is online and future iocg_kick_delay() will + * clear use_delay. Don't leave it on when there's no vdebt. + */ + if (!iocg->abs_vdebt || time_before_eq64(vtime, now->vnow)) { blkcg_clear_delay(blkg); return false; } @@ -1261,9 +1264,12 @@ static enum hrtimer_restart iocg_delay_t { struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer); struct ioc_now now; + unsigned long flags;
+ spin_lock_irqsave(&iocg->waitq.lock, flags); ioc_now(iocg->ioc, &now); iocg_kick_delay(iocg, &now, 0); + spin_unlock_irqrestore(&iocg->waitq.lock, flags);
return HRTIMER_NORESTART; } @@ -1371,14 +1377,13 @@ static void ioc_timer_fn(struct timer_li * should have woken up in the last period and expire idle iocgs. */ list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) { - if (!waitqueue_active(&iocg->waitq) && - !atomic64_read(&iocg->abs_vdebt) && !iocg_is_idle(iocg)) + if (!waitqueue_active(&iocg->waitq) && iocg->abs_vdebt && + !iocg_is_idle(iocg)) continue;
spin_lock(&iocg->waitq.lock);
- if (waitqueue_active(&iocg->waitq) || - atomic64_read(&iocg->abs_vdebt)) { + if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt) { /* might be oversleeping vtime / hweight changes, kick */ iocg_kick_waitq(iocg, &now); iocg_kick_delay(iocg, &now, 0); @@ -1721,28 +1726,49 @@ static void ioc_rqos_throttle(struct rq_ * tests are racy but the races aren't systemic - we only miss once * in a while which is fine. */ - if (!waitqueue_active(&iocg->waitq) && - !atomic64_read(&iocg->abs_vdebt) && + if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt && time_before_eq64(vtime + cost, now.vnow)) { iocg_commit_bio(iocg, bio, cost); return; }
/* - * We're over budget. If @bio has to be issued regardless, - * remember the abs_cost instead of advancing vtime. - * iocg_kick_waitq() will pay off the debt before waking more IOs. + * We activated above but w/o any synchronization. Deactivation is + * synchronized with waitq.lock and we won't get deactivated as long + * as we're waiting or has debt, so we're good if we're activated + * here. In the unlikely case that we aren't, just issue the IO. + */ + spin_lock_irq(&iocg->waitq.lock); + + if (unlikely(list_empty(&iocg->active_list))) { + spin_unlock_irq(&iocg->waitq.lock); + iocg_commit_bio(iocg, bio, cost); + return; + } + + /* + * We're over budget. If @bio has to be issued regardless, remember + * the abs_cost instead of advancing vtime. iocg_kick_waitq() will pay + * off the debt before waking more IOs. + * * This way, the debt is continuously paid off each period with the - * actual budget available to the cgroup. If we just wound vtime, - * we would incorrectly use the current hw_inuse for the entire - * amount which, for example, can lead to the cgroup staying - * blocked for a long time even with substantially raised hw_inuse. + * actual budget available to the cgroup. If we just wound vtime, we + * would incorrectly use the current hw_inuse for the entire amount + * which, for example, can lead to the cgroup staying blocked for a + * long time even with substantially raised hw_inuse. + * + * An iocg with vdebt should stay online so that the timer can keep + * deducting its vdebt and [de]activate use_delay mechanism + * accordingly. We don't want to race against the timer trying to + * clear them and leave @iocg inactive w/ dangling use_delay heavily + * penalizing the cgroup and its descendants. */ if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) { - atomic64_add(abs_cost, &iocg->abs_vdebt); + iocg->abs_vdebt += abs_cost; if (iocg_kick_delay(iocg, &now, cost)) blkcg_schedule_throttle(rqos->q, (bio->bi_opf & REQ_SWAP) == REQ_SWAP); + spin_unlock_irq(&iocg->waitq.lock); return; }
@@ -1759,20 +1785,6 @@ static void ioc_rqos_throttle(struct rq_ * All waiters are on iocg->waitq and the wait states are * synchronized using waitq.lock. */ - spin_lock_irq(&iocg->waitq.lock); - - /* - * We activated above but w/o any synchronization. Deactivation is - * synchronized with waitq.lock and we won't get deactivated as - * long as we're waiting, so we're good if we're activated here. - * In the unlikely case that we are deactivated, just issue the IO. - */ - if (unlikely(list_empty(&iocg->active_list))) { - spin_unlock_irq(&iocg->waitq.lock); - iocg_commit_bio(iocg, bio, cost); - return; - } - init_waitqueue_func_entry(&wait.wait, iocg_wake_fn); wait.wait.private = current; wait.bio = bio; @@ -1804,6 +1816,7 @@ static void ioc_rqos_merge(struct rq_qos struct ioc_now now; u32 hw_inuse; u64 abs_cost, cost; + unsigned long flags;
/* bypass if disabled or for root cgroup */ if (!ioc->enabled || !iocg->level) @@ -1823,15 +1836,28 @@ static void ioc_rqos_merge(struct rq_qos iocg->cursor = bio_end;
/* - * Charge if there's enough vtime budget and the existing request - * has cost assigned. Otherwise, account it as debt. See debt - * handling in ioc_rqos_throttle() for details. + * Charge if there's enough vtime budget and the existing request has + * cost assigned. */ if (rq->bio && rq->bio->bi_iocost_cost && - time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) + time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) { iocg_commit_bio(iocg, bio, cost); - else - atomic64_add(abs_cost, &iocg->abs_vdebt); + return; + } + + /* + * Otherwise, account it as debt if @iocg is online, which it should + * be for the vast majority of cases. See debt handling in + * ioc_rqos_throttle() for details. + */ + spin_lock_irqsave(&iocg->waitq.lock, flags); + if (likely(!list_empty(&iocg->active_list))) { + iocg->abs_vdebt += abs_cost; + iocg_kick_delay(iocg, &now, cost); + } else { + iocg_commit_bio(iocg, bio, cost); + } + spin_unlock_irqrestore(&iocg->waitq.lock, flags); }
static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio) @@ -2001,7 +2027,6 @@ static void ioc_pd_init(struct blkg_poli iocg->ioc = ioc; atomic64_set(&iocg->vtime, now.vnow); atomic64_set(&iocg->done_vtime, now.vnow); - atomic64_set(&iocg->abs_vdebt, 0); atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period)); INIT_LIST_HEAD(&iocg->active_list); iocg->hweight_active = HWEIGHT_WHOLE; --- a/tools/cgroup/iocost_monitor.py +++ b/tools/cgroup/iocost_monitor.py @@ -159,7 +159,12 @@ class IocgStat: else: self.inflight_pct = 0
- self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 + # vdebt used to be an atomic64_t and is now u64, support both + try: + self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 + except: + self.debt_ms = iocg.abs_vdebt.value_() / VTIME_PER_USEC / 1000 + self.use_delay = blkg.use_delay.counter.value_() self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000
 
            From: George Spelvin lkml@sdf.org
commit fd0c42c4dea54335967c5a86f15fc064235a2797 upstream.
and change to pseudorandom numbers, as this is a traffic dithering operation that doesn't need crypto-grade.
The previous code operated in 4 steps:
1. Generate a random byte 0 <= rand_tq <= 255 2. Multiply it by BATADV_TQ_MAX_VALUE - tq 3. Divide by 255 (= BATADV_TQ_MAX_VALUE) 4. Return BATADV_TQ_MAX_VALUE - rand_tq
This would apperar to scale (BATADV_TQ_MAX_VALUE - tq) by a random value between 0/255 and 255/255.
But! The intermediate value between steps 3 and 4 is stored in a u8 variable. So it's truncated, and most of the time, is less than 255, after which the division produces 0. Specifically, if tq is odd, the product is always even, and can never be 255. If tq is even, there's exactly one random byte value that will produce a product byte of 255.
Thus, the return value is 255 (511/512 of the time) or 254 (1/512 of the time).
If we assume that the truncation is a bug, and the code is meant to scale the input, a simpler way of looking at it is that it's returning a random value between tq and BATADV_TQ_MAX_VALUE, inclusive.
Well, we have an optimized function for doing just that.
Fixes: 3c12de9a5c75 ("batman-adv: network coding - code and transmit packets if possible") Signed-off-by: George Spelvin lkml@sdf.org Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/network-coding.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
--- a/net/batman-adv/network-coding.c +++ b/net/batman-adv/network-coding.c @@ -1009,15 +1009,8 @@ static struct batadv_nc_path *batadv_nc_ */ static u8 batadv_nc_random_weight_tq(u8 tq) { - u8 rand_val, rand_tq; - - get_random_bytes(&rand_val, sizeof(rand_val)); - /* randomize the estimated packet loss (max TQ - estimated TQ) */ - rand_tq = rand_val * (BATADV_TQ_MAX_VALUE - tq); - - /* normalize the randomized packet loss */ - rand_tq /= BATADV_TQ_MAX_VALUE; + u8 rand_tq = prandom_u32_max(BATADV_TQ_MAX_VALUE + 1 - tq);
/* convert to (randomized) estimated tq again */ return BATADV_TQ_MAX_VALUE - rand_tq;
 
            From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit f872de8185acf1b48b954ba5bd8f9bc0a0d14016 upstream.
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(), which gets a batadv_hard_iface object from net_dev with increased refcnt and its reference is assigned to a local pointer 'hard_iface'.
When batadv_show_throughput_override() returns, "hard_iface" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The issue happens in the normal path of batadv_show_throughput_override(), which forgets to decrease the refcnt increased by batadv_hardif_get_by_netdev() before the function returns, causing a refcnt leak.
Fix this issue by calling batadv_hardif_put() before the batadv_show_throughput_override() returns in the normal path.
Fixes: 0b5ecc6811bd ("batman-adv: add throughput override attribute to hard_ifaces") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/sysfs.c | 1 + 1 file changed, 1 insertion(+)
--- a/net/batman-adv/sysfs.c +++ b/net/batman-adv/sysfs.c @@ -1190,6 +1190,7 @@ static ssize_t batadv_show_throughput_ov
tp_override = atomic_read(&hard_iface->bat_v.throughput_override);
+ batadv_hardif_put(hard_iface); return sprintf(buff, "%u.%u MBit\n", tp_override / 10, tp_override % 10); }
 
            From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit 6107c5da0fca8b50b4d3215e94d619d38cc4a18c upstream.
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(), which gets a batadv_hard_iface object from net_dev with increased refcnt and its reference is assigned to a local pointer 'hard_iface'.
When batadv_store_throughput_override() returns, "hard_iface" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The issue happens in one error path of batadv_store_throughput_override(). When batadv_parse_throughput() returns NULL, the refcnt increased by batadv_hardif_get_by_netdev() is not decreased, causing a refcnt leak.
Fix this issue by jumping to "out" label when batadv_parse_throughput() returns NULL.
Fixes: 0b5ecc6811bd ("batman-adv: add throughput override attribute to hard_ifaces") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/batman-adv/sysfs.c +++ b/net/batman-adv/sysfs.c @@ -1150,7 +1150,7 @@ static ssize_t batadv_store_throughput_o ret = batadv_parse_throughput(net_dev, buff, "throughput_override", &tp_override); if (!ret) - return count; + goto out;
old_tp_override = atomic_read(&hard_iface->bat_v.throughput_override); if (old_tp_override == tp_override)
 
            From: Xiyu Yang xiyuyang19@fudan.edu.cn
commit 6f91a3f7af4186099dd10fa530dd7e0d9c29747d upstream.
batadv_v_ogm_process() invokes batadv_hardif_neigh_get(), which returns a reference of the neighbor object to "hardif_neigh" with increased refcount.
When batadv_v_ogm_process() returns, "hardif_neigh" becomes invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling paths of batadv_v_ogm_process(). When batadv_v_ogm_orig_get() fails to get the orig node and returns NULL, the refcnt increased by batadv_hardif_neigh_get() is not decreased, causing a refcnt leak.
Fix this issue by jumping to "out" label when batadv_v_ogm_orig_get() fails to get the orig node.
Fixes: 9323158ef9f4 ("batman-adv: OGMv2 - implement originators logic") Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Sven Eckelmann sven@narfation.org Signed-off-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/batman-adv/bat_v_ogm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/batman-adv/bat_v_ogm.c +++ b/net/batman-adv/bat_v_ogm.c @@ -897,7 +897,7 @@ static void batadv_v_ogm_process(const s
orig_node = batadv_v_ogm_orig_get(bat_priv, ogm_packet->orig); if (!orig_node) - return; + goto out;
neigh_node = batadv_neigh_node_get_or_create(orig_node, if_incoming, ethhdr->h_source);
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit 06a9750edcffa808494d56da939085c35904e618 upstream.
The PUSH_AND_CLEAR_REGS macro zeroes each register immediately after pushing it. If an NMI or exception hits after a register is cleared, but before the UNWIND_HINT_REGS annotation, the ORC unwinder will wrongly think the previous value of the register was zero. This can confuse the unwinding process and cause it to exit early.
Because ORC is simpler than DWARF, there are a limited number of unwind annotation states, so it's not possible to add an individual unwind hint after each push/clear combination. Instead, the register clearing instructions need to be consolidated and moved to after the UNWIND_HINT_REGS annotation.
Fixes: 3f01daecd545 ("x86/entry/64: Introduce the PUSH_AND_CLEAN_REGS macro") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/68fd3d0bc92ae2d62ff7879d15d3684217d51f08.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/calling.h | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-)
--- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -98,13 +98,6 @@ For 32-bit we have the following convent #define SIZEOF_PTREGS 21*8
.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0 - /* - * Push registers and sanitize registers of values that a - * speculation attack might otherwise want to exploit. The - * lower registers are likely clobbered well before they - * could be put to use in a speculative execution gadget. - * Interleave XOR with PUSH for better uop scheduling: - */ .if \save_ret pushq %rsi /* pt_regs->si */ movq 8(%rsp), %rsi /* temporarily store the return address in %rsi */ @@ -114,34 +107,43 @@ For 32-bit we have the following convent pushq %rsi /* pt_regs->si */ .endif pushq \rdx /* pt_regs->dx */ - xorl %edx, %edx /* nospec dx */ pushq %rcx /* pt_regs->cx */ - xorl %ecx, %ecx /* nospec cx */ pushq \rax /* pt_regs->ax */ pushq %r8 /* pt_regs->r8 */ - xorl %r8d, %r8d /* nospec r8 */ pushq %r9 /* pt_regs->r9 */ - xorl %r9d, %r9d /* nospec r9 */ pushq %r10 /* pt_regs->r10 */ - xorl %r10d, %r10d /* nospec r10 */ pushq %r11 /* pt_regs->r11 */ - xorl %r11d, %r11d /* nospec r11*/ pushq %rbx /* pt_regs->rbx */ - xorl %ebx, %ebx /* nospec rbx*/ pushq %rbp /* pt_regs->rbp */ - xorl %ebp, %ebp /* nospec rbp*/ pushq %r12 /* pt_regs->r12 */ - xorl %r12d, %r12d /* nospec r12*/ pushq %r13 /* pt_regs->r13 */ - xorl %r13d, %r13d /* nospec r13*/ pushq %r14 /* pt_regs->r14 */ - xorl %r14d, %r14d /* nospec r14*/ pushq %r15 /* pt_regs->r15 */ - xorl %r15d, %r15d /* nospec r15*/ UNWIND_HINT_REGS + .if \save_ret pushq %rsi /* return address on top of stack */ .endif + + /* + * Sanitize registers of values that a speculation attack might + * otherwise want to exploit. The lower registers are likely clobbered + * well before they could be put to use in a speculative execution + * gadget. + */ + xorl %edx, %edx /* nospec dx */ + xorl %ecx, %ecx /* nospec cx */ + xorl %r8d, %r8d /* nospec r8 */ + xorl %r9d, %r9d /* nospec r9 */ + xorl %r10d, %r10d /* nospec r10 */ + xorl %r11d, %r11d /* nospec r11 */ + xorl %ebx, %ebx /* nospec rbx */ + xorl %ebp, %ebp /* nospec rbp */ + xorl %r12d, %r12d /* nospec r12 */ + xorl %r13d, %r13d /* nospec r13 */ + xorl %r14d, %r14d /* nospec r14 */ + xorl %r15d, %r15d /* nospec r15 */ + .endm
.macro POP_REGS pop_rdi=1 skip_r11rcx=0
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit 1fb143634a38095b641a3a21220774799772dc4c upstream.
In swapgs_restore_regs_and_return_to_usermode, after the stack is switched to the trampoline stack, the existing UNWIND_HINT_REGS hint is no longer valid, which can result in the following ORC unwinder warning:
WARNING: can't dereference registers at 000000003aeb0cdd for ip swapgs_restore_regs_and_return_to_usermode+0x93/0xa0
For full correctness, we could try to add complicated unwind hints so the unwinder could continue to find the registers, but when when it's this close to kernel exit, unwind hints aren't really needed anymore and it's fine to just use an empty hint which tells the unwinder to stop.
For consistency, also move the UNWIND_HINT_EMPTY in entry_SYSCALL_64_after_hwframe to a similar location.
Fixes: 3e3b9293d392 ("x86/entry/64: Return to userspace from the trampoline stack") Reported-by: Vince Weaver vincent.weaver@maine.edu Reported-by: Dave Jones dsj@fb.com Reported-by: Dr. David Alan Gilbert dgilbert@redhat.com Reported-by: Joe Mario jmario@redhat.com Reported-by: Jann Horn jannh@google.com Reported-by: Linus Torvalds torvalds@linux-foundation.org Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/60ea8f562987ed2d9ace2977502fe481c0d7c9a0.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -249,7 +249,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe) */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - UNWIND_HINT_EMPTY POP_REGS pop_rdi=0 skip_r11rcx=1
/* @@ -258,6 +257,7 @@ syscall_return_via_sysret: */ movq %rsp, %rdi movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp + UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */ pushq (%rdi) /* RDI */ @@ -637,6 +637,7 @@ GLOBAL(swapgs_restore_regs_and_return_to */ movq %rsp, %rdi movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp + UNWIND_HINT_EMPTY
/* Copy the IRET frame to the trampoline stack. */ pushq 6*8(%rdi) /* SS */
 
            From: Jann Horn jannh@google.com
commit f977df7b7ca45a4ac4b66d30a8931d0434c394b1 upstream.
The LEAQ instruction in rewind_stack_do_exit() moves the stack pointer directly below the pt_regs at the top of the task stack before calling do_exit(). Tell the unwinder to expect pt_regs.
Fixes: 8c1f75587a18 ("x86/entry/64: Add unwind hint annotations") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/68c33e17ae5963854916a46f522624f8e1d264f2.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -1740,7 +1740,7 @@ ENTRY(rewind_stack_do_exit)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rax leaq -PTREGS_SIZE(%rax), %rsp - UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE + UNWIND_HINT_REGS
call do_exit END(rewind_stack_do_exit)
 
            From: Miroslav Benes mbenes@suse.cz
commit f1d9a2abff66aa8156fbc1493abed468db63ea48 upstream.
When unwinding an inactive task, the ORC unwinder skips the first frame by default. If both the 'regs' and 'first_frame' parameters of unwind_start() are NULL, 'state->sp' and 'first_frame' are later initialized to the same value for an inactive task. Given there is a "less than or equal to" comparison used at the end of __unwind_start() for skipping stack frames, the first frame is skipped.
Drop the equal part of the comparison and make the behavior equivalent to the frame pointer unwinder.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/7f08db872ab59e807016910acdbe82f744de7065.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -648,7 +648,7 @@ void __unwind_start(struct unwind_state /* Otherwise, skip ahead to the user-specified starting frame: */ while (!unwind_done(state) && (!on_stack(&state->stack_info, first_frame, sizeof(long)) || - state->sp <= (unsigned long)first_frame)) + state->sp < (unsigned long)first_frame)) unwind_next_frame(state);
return;
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit 98d0c8ebf77e0ba7c54a9ae05ea588f0e9e3f46e upstream.
If the unwinder is called before the ORC data has been initialized, orc_find() returns NULL, and it tries to fall back to using frame pointers. This can cause some unexpected warnings during boot.
Move the 'orc_init' check from orc_find() to __unwind_init(), so that it doesn't even try to unwind from an uninitialized state.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/069d1499ad606d85532eb32ce39b2441679667d5.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -142,9 +142,6 @@ static struct orc_entry *orc_find(unsign { static struct orc_entry *orc;
- if (!orc_init) - return NULL; - if (ip == 0) return &null_orc_entry;
@@ -582,6 +579,9 @@ EXPORT_SYMBOL_GPL(unwind_next_frame); void __unwind_start(struct unwind_state *state, struct task_struct *task, struct pt_regs *regs, unsigned long *first_frame) { + if (!orc_init) + goto done; + memset(state, 0, sizeof(*state)); state->task = task;
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit a0f81bf26888048100bf017fadf438a5bdffa8d8 upstream.
If the ORC entry type is unknown, nothing else can be done other than reporting an error. Exit the function instead of breaking out of the switch statement.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/a7fa668ca6eabbe81ab18b2424f15adbbfdc810a.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kernel/unwind_orc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -528,7 +528,7 @@ bool unwind_next_frame(struct unwind_sta default: orc_warn("unknown .orc_unwind entry type %d for ip %pB\n", orc->type, (void *)orig_ip); - break; + goto err; }
/* Find BP: */
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit 81b67439d147677d844d492fcbd03712ea438f42 upstream.
The following execution path is possible:
fsnotify() [ realign the stack and store previous SP in R10 ] <IRQ> [ only IRET regs saved ] common_interrupt() interrupt_entry() <NMI> [ full pt_regs saved ] ... [ unwind stack ]
When the unwinder goes through the NMI and the IRQ on the stack, and then sees fsnotify(), it doesn't have access to the value of R10, because it only has the five IRET registers. So the unwind stops prematurely.
However, because the interrupt_entry() code is careful not to clobber R10 before saving the full regs, the unwinder should be able to read R10 from the previously saved full pt_regs associated with the NMI.
Handle this case properly. When encountering an IRET regs frame immediately after a full pt_regs frame, use the pt_regs as a backup which can be used to get the C register values.
Also, note that a call frame resets the 'prev_regs' value, because a function is free to clobber the registers. For this fix to work, the IRET and full regs frames must be adjacent, with no FUNC frames in between. So replace the FUNC hint in interrupt_entry() with an IRET_REGS hint.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Dave Jones dsj@fb.com Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lore.kernel.org/r/97a408167cc09f1cfa0de31a7b70dd88868d743f.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/entry/entry_64.S | 4 +-- arch/x86/include/asm/unwind.h | 2 - arch/x86/kernel/unwind_orc.c | 51 ++++++++++++++++++++++++++++++++---------- 3 files changed, 43 insertions(+), 14 deletions(-)
--- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -512,7 +512,7 @@ END(spurious_entries_start) * +----------------------------------------------------+ */ ENTRY(interrupt_entry) - UNWIND_HINT_FUNC + UNWIND_HINT_IRET_REGS offset=16 ASM_CLAC cld
@@ -544,9 +544,9 @@ ENTRY(interrupt_entry) pushq 5*8(%rdi) /* regs->eflags */ pushq 4*8(%rdi) /* regs->cs */ pushq 3*8(%rdi) /* regs->ip */ + UNWIND_HINT_IRET_REGS pushq 2*8(%rdi) /* regs->orig_ax */ pushq 8(%rdi) /* return address */ - UNWIND_HINT_FUNC
movq (%rdi), %rdi jmp 2f --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -19,7 +19,7 @@ struct unwind_state { #if defined(CONFIG_UNWINDER_ORC) bool signal, full_regs; unsigned long sp, bp, ip; - struct pt_regs *regs; + struct pt_regs *regs, *prev_regs; #elif defined(CONFIG_UNWINDER_FRAME_POINTER) bool got_irq; unsigned long *bp, *orig_sp, ip; --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -375,9 +375,38 @@ static bool deref_stack_iret_regs(struct return true; }
+/* + * If state->regs is non-NULL, and points to a full pt_regs, just get the reg + * value from state->regs. + * + * Otherwise, if state->regs just points to IRET regs, and the previous frame + * had full regs, it's safe to get the value from the previous regs. This can + * happen when early/late IRQ entry code gets interrupted by an NMI. + */ +static bool get_reg(struct unwind_state *state, unsigned int reg_off, + unsigned long *val) +{ + unsigned int reg = reg_off/8; + + if (!state->regs) + return false; + + if (state->full_regs) { + *val = ((unsigned long *)state->regs)[reg]; + return true; + } + + if (state->prev_regs) { + *val = ((unsigned long *)state->prev_regs)[reg]; + return true; + } + + return false; +} + bool unwind_next_frame(struct unwind_state *state) { - unsigned long ip_p, sp, orig_ip = state->ip, prev_sp = state->sp; + unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; enum stack_type prev_type = state->stack_info.type; struct orc_entry *orc; bool indirect = false; @@ -439,39 +468,35 @@ bool unwind_next_frame(struct unwind_sta break;
case ORC_REG_R10: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, r10), &sp)) { orc_warn("missing regs for base reg R10 at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->r10; break;
case ORC_REG_R13: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, r13), &sp)) { orc_warn("missing regs for base reg R13 at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->r13; break;
case ORC_REG_DI: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, di), &sp)) { orc_warn("missing regs for base reg DI at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->di; break;
case ORC_REG_DX: - if (!state->regs || !state->full_regs) { + if (!get_reg(state, offsetof(struct pt_regs, dx), &sp)) { orc_warn("missing regs for base reg DX at ip %pB\n", (void *)state->ip); goto err; } - sp = state->regs->dx; break;
default: @@ -498,6 +523,7 @@ bool unwind_next_frame(struct unwind_sta
state->sp = sp; state->regs = NULL; + state->prev_regs = NULL; state->signal = false; break;
@@ -509,6 +535,7 @@ bool unwind_next_frame(struct unwind_sta }
state->regs = (struct pt_regs *)sp; + state->prev_regs = NULL; state->full_regs = true; state->signal = true; break; @@ -520,6 +547,8 @@ bool unwind_next_frame(struct unwind_sta goto err; }
+ if (state->full_regs) + state->prev_regs = state->regs; state->regs = (void *)sp - IRET_FRAME_OFFSET; state->full_regs = false; state->signal = true; @@ -534,8 +563,8 @@ bool unwind_next_frame(struct unwind_sta /* Find BP: */ switch (orc->bp_reg) { case ORC_REG_UNDEFINED: - if (state->regs && state->full_regs) - state->bp = state->regs->bp; + if (get_reg(state, offsetof(struct pt_regs, bp), &tmp)) + state->bp = tmp; break;
case ORC_REG_PREV_SP:
 
            From: Suravee Suthikulpanit suravee.suthikulpanit@amd.com
commit 637543a8d61c6afe4e9be64bfb43c78701a83375 upstream.
Current logic incorrectly uses the enum ioapic_irq_destination_types to check the posted interrupt destination types. However, the value was set using APIC_DM_XXX macros, which are left-shifted by 8 bits.
Fixes by using the APIC_DM_FIXED and APIC_DM_LOWEST instead.
Fixes: (fdcf75621375 'KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes') Cc: Alexander Graf graf@amazon.com Signed-off-by: Suravee Suthikulpanit suravee.suthikulpanit@amd.com Message-Id: 1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com Reviewed-by: Maxim Levitsky mlevitsk@redhat.com Tested-by: Maxim Levitsky mlevitsk@redhat.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/include/asm/kvm_host.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1608,8 +1608,8 @@ void kvm_set_msi_irq(struct kvm *kvm, st static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq) { /* We can only post Fixed and LowPrio IRQs */ - return (irq->delivery_mode == dest_Fixed || - irq->delivery_mode == dest_LowestPrio); + return (irq->delivery_mode == APIC_DM_FIXED || + irq->delivery_mode == APIC_DM_LOWEST); }
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
 
            From: Janakarajan Natarajan Janakarajan.Natarajan@amd.com
commit 996ed22c7a5251d76dcdfe5026ef8230e90066d9 upstream.
When trying to lock read-only pages, sev_pin_memory() fails because FOLL_WRITE is used as the flag for get_user_pages_fast().
Commit 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") updated the get_user_pages_fast() call sites to use flags, but incorrectly updated the call in sev_pin_memory(). As the original coding of this call was correct, revert the change made by that commit.
Fixes: 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") Signed-off-by: Janakarajan Natarajan Janakarajan.Natarajan@amd.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Ira Weiny ira.weiny@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Sean Christopherson sean.j.christopherson@intel.com Cc: Vitaly Kuznetsov vkuznets@redhat.com Cc: Wanpeng Li wanpengli@tencent.com Cc: Jim Mattson jmattson@google.com Cc: Joerg Roedel joro@8bytes.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@redhat.com Cc: Borislav Petkov bp@alien8.de Cc: "H . Peter Anvin" hpa@zytor.com Cc: Mike Marshall hubcap@omnibond.com Cc: Brijesh Singh brijesh.singh@amd.com Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.co... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1861,7 +1861,7 @@ static struct page **sev_pin_memory(stru return NULL;
/* Pin the user virtual address. */ - npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); + npinned = get_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages); if (npinned != npages) { pr_err("SEV: Failure locking %lu pages.\n", npages); goto err;
 
            From: Guillaume Nault gnault@redhat.com
commit ea64d8d6c675c0bb712689b13810301de9d8f77a upstream.
If the UDP header of a local VXLAN endpoint is NAT-ed, and the VXLAN device has disabled UDP checksums and enabled Tx checksum offloading, then the skb passed to udp_manip_pkt() has hdr->check == 0 (outer checksum disabled) and skb->ip_summed == CHECKSUM_PARTIAL (inner packet checksum offloaded).
Because of the ->ip_summed value, udp_manip_pkt() tries to update the outer checksum with the new address and port, leading to an invalid checksum sent on the wire, as the original null checksum obviously didn't take the old address and port into account.
So, we can't take ->ip_summed into account in udp_manip_pkt(), as it might not refer to the checksum we're acting on. Instead, we can base the decision to update the UDP checksum entirely on the value of hdr->check, because it's null if and only if checksum is disabled:
* A fully computed checksum can't be 0, since a 0 checksum is represented by the CSUM_MANGLED_0 value instead.
* A partial checksum can't be 0, since the pseudo-header always adds at least one non-zero value (the UDP protocol type 0x11) and adding more values to the sum can't make it wrap to 0 as the carry is then added to the wrapped number.
* A disabled checksum uses the special value 0.
The problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented.
Fixes: 5b1158e909ec ("[NETFILTER]: Add NAT support for nf_conntrack") Signed-off-by: Guillaume Nault gnault@redhat.com Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/netfilter/nf_nat_proto.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
--- a/net/netfilter/nf_nat_proto.c +++ b/net/netfilter/nf_nat_proto.c @@ -68,15 +68,13 @@ static bool udp_manip_pkt(struct sk_buff enum nf_nat_manip_type maniptype) { struct udphdr *hdr; - bool do_csum;
if (skb_ensure_writable(skb, hdroff + sizeof(*hdr))) return false;
hdr = (struct udphdr *)(skb->data + hdroff); - do_csum = hdr->check || skb->ip_summed == CHECKSUM_PARTIAL; + __udp_manip_pkt(skb, iphdroff, hdr, tuple, maniptype, !!hdr->check);
- __udp_manip_pkt(skb, iphdroff, hdr, tuple, maniptype, do_csum); return true; }
 
            From: Arnd Bergmann arnd@arndb.de
commit c165d57b552aaca607fa5daf3fb524a6efe3c5a3 upstream.
gcc-10 points out that a code path exists where a pointer to a stack variable may be passed back to the caller:
net/netfilter/nfnetlink_osf.c: In function 'nf_osf_hdr_ctx_init': cc1: warning: function may return address of local variable [-Wreturn-local-addr] net/netfilter/nfnetlink_osf.c:171:16: note: declared here 171 | struct tcphdr _tcph; | ^~~~~
I am not sure whether this can happen in practice, but moving the variable declaration into the callers avoids the problem.
Fixes: 31a9c29210e2 ("netfilter: nf_osf: add struct nf_osf_hdr_ctx") Signed-off-by: Arnd Bergmann arnd@arndb.de Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- net/netfilter/nfnetlink_osf.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
--- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -165,12 +165,12 @@ static bool nf_osf_match_one(const struc static const struct tcphdr *nf_osf_hdr_ctx_init(struct nf_osf_hdr_ctx *ctx, const struct sk_buff *skb, const struct iphdr *ip, - unsigned char *opts) + unsigned char *opts, + struct tcphdr *_tcph) { const struct tcphdr *tcp; - struct tcphdr _tcph;
- tcp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(struct tcphdr), &_tcph); + tcp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(struct tcphdr), _tcph); if (!tcp) return NULL;
@@ -205,10 +205,11 @@ nf_osf_match(const struct sk_buff *skb, int fmatch = FMATCH_WRONG; struct nf_osf_hdr_ctx ctx; const struct tcphdr *tcp; + struct tcphdr _tcph;
memset(&ctx, 0, sizeof(ctx));
- tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts); + tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts, &_tcph); if (!tcp) return false;
@@ -265,10 +266,11 @@ bool nf_osf_find(const struct sk_buff *s const struct nf_osf_finger *kf; struct nf_osf_hdr_ctx ctx; const struct tcphdr *tcp; + struct tcphdr _tcph;
memset(&ctx, 0, sizeof(ctx));
- tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts); + tcp = nf_osf_hdr_ctx_init(&ctx, skb, ip, opts, &_tcph); if (!tcp) return false;
 
            From: Josh Poimboeuf jpoimboe@redhat.com
commit d8dd25a461e4eec7190cb9d66616aceacc5110ad upstream.
When the current frame address (CFA) is stored on the stack (i.e., cfa->base == CFI_SP_INDIRECT), objtool neglects to adjust the stack offset when there are subsequent pushes or pops. This results in bad ORC data at the end of the ENTER_IRQ_STACK macro, when it puts the previous stack pointer on the stack and does a subsequent push.
This fixes the following unwinder warning:
WARNING: can't dereference registers at 00000000f0a6bdba for ip interrupt_entry+0x9f/0xa0
Fixes: 627fce14809b ("objtool: Add ORC unwind table generation") Reported-by: Vince Weaver vincent.weaver@maine.edu Reported-by: Dave Jones dsj@fb.com Reported-by: Steven Rostedt rostedt@goodmis.org Reported-by: Vegard Nossum vegard.nossum@oracle.com Reported-by: Joe Mario jmario@redhat.com Reviewed-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Andy Lutomirski luto@kernel.org Cc: Jann Horn jannh@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/853d5d691b29e250333332f09b8e27410b2d9924.158780874... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1402,7 +1402,7 @@ static int update_insn_state_regs(struct struct cfi_reg *cfa = &state->cfa; struct stack_op *op = &insn->stack_op;
- if (cfa->base != CFI_SP) + if (cfa->base != CFI_SP && cfa->base != CFI_SP_INDIRECT) return 0;
/* push */
 
            From: Julia Lawall Julia.Lawall@inria.fr
commit fb3637a113349f53830f7d6ca45891b7192cd28f upstream.
Elsewhere in the file, there is a list_for_each_entry with &vdev->resv_regions as the second argument, suggesting that &vdev->resv_regions is the list head. So exchange the arguments on the list_add call to put the list head in the second argument.
Fixes: 2a5a31487445 ("iommu/virtio: Add probe request") Signed-off-by: Julia Lawall Julia.Lawall@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Reviewed-by: Jean-Philippe Brucker jean-philippe@linaro.org Link: https://lore.kernel.org/r/1588704467-13431-1-git-send-email-Julia.Lawall@inr... Signed-off-by: Joerg Roedel jroedel@suse.de
--- drivers/iommu/virtio-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -454,7 +454,7 @@ static int viommu_add_resv_mem(struct vi if (!region) return -ENOMEM;
- list_add(&vdev->resv_regions, ®ion->list); + list_add(®ion->list, &vdev->resv_regions); return 0; }
 
            From: Ivan Delalande colona@arista.com
commit e08df079b23e2e982df15aa340bfbaf50f297504 upstream.
If the trapping instruction contains a ':', for a memory access through segment registers for example, the sed substitution will insert the '*' marker in the middle of the instruction instead of the line address:
2b: 65 48 0f c7 0f cmpxchg16b %gs:*(%rdi) <-- trapping instruction
I started to think I had forgotten some quirk of the assembly syntax before noticing that it was actually coming from the script. Fix it to add the address marker at the right place for these instructions:
28: 49 8b 06 mov (%r14),%rax 2b:* 65 48 0f c7 0f cmpxchg16b %gs:(%rdi) <-- trapping instruction 30: 0f 94 c0 sete %al
Fixes: 18ff44b189e2 ("scripts/decodecode: make faulting insn ptr more robust") Signed-off-by: Ivan Delalande colona@arista.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Borislav Petkov bp@suse.de Link: http://lkml.kernel.org/r/20200419223653.GA31248@visor Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- scripts/decodecode | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/scripts/decodecode +++ b/scripts/decodecode @@ -126,7 +126,7 @@ faultlinenum=$(( $(wc -l $T.oo | cut -d faultline=`cat $T.dis | head -1 | cut -d":" -f2-` faultline=`echo "$faultline" | sed -e 's/[/\[/g; s/]/\]/g'`
-cat $T.oo | sed -e "${faultlinenum}s/^(.*:)(.*)/\1*\2\t\t<-- trapping instruction/" +cat $T.oo | sed -e "${faultlinenum}s/^([^:]*:)(.*)/\1*\2\t\t<-- trapping instruction/" echo cat $T.aa cleanup
 
            From: Yafang Shao laoar.shao@gmail.com
commit 11d6761218d19ca06ae5387f4e3692c4fa9e7493 upstream.
When I run my memcg testcase which creates lots of memcgs, I found there're unexpected out of memory logs while there're still enough available free memory. The error log is
mkdir: cannot create directory 'foo.65533': Cannot allocate memory
The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs, an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right errno should be -ENOSPC "No space left on device", which is an appropriate errno for userspace's failed mkdir.
As the errno really misled me, we should make it right. After this patch, the error log will be
mkdir: cannot create directory 'foo.65533': No space left on device
[akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Suggested-by: Matthew Wilcox willy@infradead.org Signed-off-by: Yafang Shao laoar.shao@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Michal Hocko mhocko@kernel.org Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/memcontrol.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)
--- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5101,19 +5101,22 @@ static struct mem_cgroup *mem_cgroup_all unsigned int size; int node; int __maybe_unused i; + long error = -ENOMEM;
size = sizeof(struct mem_cgroup); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
memcg = kzalloc(size, GFP_KERNEL); if (!memcg) - return NULL; + return ERR_PTR(error);
memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX, GFP_KERNEL); - if (memcg->id.id < 0) + if (memcg->id.id < 0) { + error = memcg->id.id; goto fail; + }
memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); if (!memcg->vmstats_local) @@ -5158,7 +5161,7 @@ static struct mem_cgroup *mem_cgroup_all fail: mem_cgroup_id_remove(memcg); __mem_cgroup_free(memcg); - return NULL; + return ERR_PTR(error); }
static struct cgroup_subsys_state * __ref @@ -5169,8 +5172,8 @@ mem_cgroup_css_alloc(struct cgroup_subsy long error = -ENOMEM;
memcg = mem_cgroup_alloc(); - if (!memcg) - return ERR_PTR(error); + if (IS_ERR(memcg)) + return ERR_CAST(memcg);
memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; @@ -5220,7 +5223,7 @@ mem_cgroup_css_alloc(struct cgroup_subsy fail: mem_cgroup_id_remove(memcg); mem_cgroup_free(memcg); - return ERR_PTR(-ENOMEM); + return ERR_PTR(error); }
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 
            From: Christoph Hellwig hch@lst.de
[ Upstream commit eb7ae5e06bb6e6ac6bb86872d27c43ebab92f6b2 ]
bdi_dev_name is not a fast path function, move it out of line. This prepares for using it from modular callers without having to export an implementation detail like bdi_unknown_name.
Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/backing-dev.h | 9 +-------- mm/backing-dev.c | 10 +++++++++- 2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index f88197c1ffc2d..c9ad5c3b7b4b2 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -505,13 +505,6 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi) (1 << WB_async_congested)); }
-extern const char *bdi_unknown_name; - -static inline const char *bdi_dev_name(struct backing_dev_info *bdi) -{ - if (!bdi || !bdi->dev) - return bdi_unknown_name; - return dev_name(bdi->dev); -} +const char *bdi_dev_name(struct backing_dev_info *bdi);
#endif /* _LINUX_BACKING_DEV_H */ diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 62f05f605fb5b..680e5028d0fc5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -21,7 +21,7 @@ struct backing_dev_info noop_backing_dev_info = { EXPORT_SYMBOL_GPL(noop_backing_dev_info);
static struct class *bdi_class; -const char *bdi_unknown_name = "(unknown)"; +static const char *bdi_unknown_name = "(unknown)";
/* * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU @@ -1043,6 +1043,14 @@ void bdi_put(struct backing_dev_info *bdi) } EXPORT_SYMBOL(bdi_put);
+const char *bdi_dev_name(struct backing_dev_info *bdi) +{ + if (!bdi || !bdi->dev) + return bdi_unknown_name; + return dev_name(bdi->dev); +} +EXPORT_SYMBOL_GPL(bdi_dev_name); + static wait_queue_head_t congestion_wqh[2] = { __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]), __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 
            From: Christoph Hellwig hch@lst.de
[ Upstream commit 6bd87eec23cbc9ed222bed0f5b5b02bf300e9a8d ]
Cache a copy of the name for the life time of the backing_dev_info structure so that we can reference it even after unregistering.
Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears") Reported-by: Yufen Yu yuyufen@huawei.com Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/backing-dev-defs.h | 1 + mm/backing-dev.c | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 4fc87dee005ab..2849bdbb3acbe 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -220,6 +220,7 @@ struct backing_dev_info { wait_queue_head_t wb_waitq;
struct device *dev; + char dev_name[64]; struct device *owner;
struct timer_list laptop_mode_wb_timer; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 680e5028d0fc5..3f2480e4c5af3 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -938,7 +938,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (bdi->dev) /* The driver needs to use separate queues per device */ return 0;
- dev = device_create_vargs(bdi_class, NULL, MKDEV(0, 0), bdi, fmt, args); + vsnprintf(bdi->dev_name, sizeof(bdi->dev_name), fmt, args); + dev = device_create(bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name); if (IS_ERR(dev)) return PTR_ERR(dev);
@@ -1047,7 +1048,7 @@ const char *bdi_dev_name(struct backing_dev_info *bdi) { if (!bdi || !bdi->dev) return bdi_unknown_name; - return dev_name(bdi->dev); + return bdi->dev_name; } EXPORT_SYMBOL_GPL(bdi_dev_name);
 
            From: Amir Goldstein amir73il@gmail.com
[ Upstream commit dfc2d2594e4a79204a3967585245f00644b8f838 ]
The event inode field is used only for comparison in queue merges and cannot be dereferenced after handle_event(), because it does not hold a refcount on the inode.
Replace it with an abstract id to do the same thing.
Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- fs/notify/fanotify/fanotify.c | 4 ++-- fs/notify/inotify/inotify_fsnotify.c | 4 ++-- fs/notify/inotify/inotify_user.c | 2 +- include/linux/fsnotify_backend.h | 7 +++---- 4 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 5778d1347b351..14d0ac4664595 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -26,7 +26,7 @@ static bool should_merge(struct fsnotify_event *old_fsn, old = FANOTIFY_E(old_fsn); new = FANOTIFY_E(new_fsn);
- if (old_fsn->inode != new_fsn->inode || old->pid != new->pid || + if (old_fsn->objectid != new_fsn->objectid || old->pid != new->pid || old->fh_type != new->fh_type || old->fh_len != new->fh_len) return false;
@@ -314,7 +314,7 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, if (!event) goto out; init: __maybe_unused - fsnotify_init_event(&event->fse, inode); + fsnotify_init_event(&event->fse, (unsigned long)inode); event->mask = mask; if (FAN_GROUP_FLAG(group, FAN_REPORT_TID)) event->pid = get_pid(task_pid(current)); diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c index d510223d302ca..589dee9629938 100644 --- a/fs/notify/inotify/inotify_fsnotify.c +++ b/fs/notify/inotify/inotify_fsnotify.c @@ -39,7 +39,7 @@ static bool event_compare(struct fsnotify_event *old_fsn, if (old->mask & FS_IN_IGNORED) return false; if ((old->mask == new->mask) && - (old_fsn->inode == new_fsn->inode) && + (old_fsn->objectid == new_fsn->objectid) && (old->name_len == new->name_len) && (!old->name_len || !strcmp(old->name, new->name))) return true; @@ -118,7 +118,7 @@ int inotify_handle_event(struct fsnotify_group *group, mask &= ~IN_ISDIR;
fsn_event = &event->fse; - fsnotify_init_event(fsn_event, inode); + fsnotify_init_event(fsn_event, (unsigned long)inode); event->mask = mask; event->wd = i_mark->wd; event->sync_cookie = cookie; diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c index 107537a543fd8..81ffc8629fc4b 100644 --- a/fs/notify/inotify/inotify_user.c +++ b/fs/notify/inotify/inotify_user.c @@ -635,7 +635,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events) return ERR_PTR(-ENOMEM); } group->overflow_event = &oevent->fse; - fsnotify_init_event(group->overflow_event, NULL); + fsnotify_init_event(group->overflow_event, 0); oevent->mask = FS_Q_OVERFLOW; oevent->wd = -1; oevent->sync_cookie = 0; diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 1915bdba2fad9..64cfb5446f4d4 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -133,8 +133,7 @@ struct fsnotify_ops { */ struct fsnotify_event { struct list_head list; - /* inode may ONLY be dereferenced during handle_event(). */ - struct inode *inode; /* either the inode the event happened to or its parent */ + unsigned long objectid; /* identifier for queue merges */ };
/* @@ -500,10 +499,10 @@ extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info); extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
static inline void fsnotify_init_event(struct fsnotify_event *event, - struct inode *inode) + unsigned long objectid) { INIT_LIST_HEAD(&event->list); - event->inode = inode; + event->objectid = objectid; }
#else
 
            From: Amir Goldstein amir73il@gmail.com
[ Upstream commit f367a62a7cad2447d835a9f14fc63997a9137246 ]
With inotify, when a watch is set on a directory and on its child, an event on the child is reported twice, once with wd of the parent watch and once with wd of the child watch without the filename.
With fanotify, when a watch is set on a directory and on its child, an event on the child is reported twice, but it has the exact same information - either an open file descriptor of the child or an encoded fid of the child.
The reason that the two identical events are not merged is because the object id used for merging events in the queue is the child inode in one event and parent inode in the other.
For events with path or dentry data, use the victim inode instead of the watched inode as the object id for event merging, so that the event reported on parent will be merged with the event reported on the child.
Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org --- fs/notify/fanotify/fanotify.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 14d0ac4664595..f5d30573f4a99 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -314,7 +314,12 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, if (!event) goto out; init: __maybe_unused - fsnotify_init_event(&event->fse, (unsigned long)inode); + /* + * Use the victim inode instead of the watching inode as the id for + * event queue, so event reported on parent is merged with event + * reported on child when both directory and child watches exist. + */ + fsnotify_init_event(&event->fse, (unsigned long)id); event->mask = mask; if (FAN_GROUP_FLAG(group, FAN_REPORT_TID)) event->pid = get_pid(task_pid(current));
 
            On 13/05/2020 10:43, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.41 release. There are 90 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.41-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
All tests are passing for Tegra ...
Test results for stable-v5.4: 13 builds: 13 pass, 0 fail 26 boots: 26 pass, 0 fail 42 tests: 42 pass, 0 fail
Linux version: 5.4.41-rc1-g132220af41e6 Boards tested: tegra124-jetson-tk1, tegra186-p2771-0000, tegra194-p2972-0000, tegra20-ventana, tegra210-p2371-2180, tegra210-p3450-0000, tegra30-cardhu-a04
Cheers Jon
 
            On Wed, May 13, 2020 at 11:43:56AM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.41 release. There are 90 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
Build results: total: 157 pass: 157 fail: 0 Qemu test results: total: 430 pass: 430 fail: 0
Guenter
 
            On Wed, 13 May 2020 at 15:19, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.4.41 release. There are 90 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.41-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
NOTE: While running libhugetlbfs fallocate_stress.sh on stable-rc 5.4 branch kernel on arm64 hikey device. The following kernel Internal error: Oops: found. https://lore.kernel.org/stable/CA+G9fYvvDjA5t+zi0Zyn2F6D=7aE-Gu-m13o47LXYYfC...
Summary ------------------------------------------------------------------------
kernel: 5.4.41-rc1 git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git branch: linux-5.4.y git commit: 132220af41e6fd872e8c8d08d7b4e3a1b674f843 git describe: v5.4.40-91-g132220af41e6 Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-5.4-oe/build/v5.4.40-91-g...
No regressions (compared to build v5.4.40)
No fixes (compared to build v5.4.40)
Ran 33743 total tests in the following environments and test suites.
Environments -------------- - dragonboard-410c - hi6220-hikey - i386 - juno-r2 - juno-r2-compat - juno-r2-kasan - nxp-ls2088 - qemu_arm - qemu_arm64 - qemu_i386 - qemu_x86_64 - x15 - x86 - x86-kasan
Test Suites ----------- * build * install-android-platform-tools-r2600 * install-android-platform-tools-r2800 * kselftest * kselftest/drivers * kselftest/filesystems * kselftest/net * kselftest/networking * libgpiod * libhugetlbfs * linux-log-parser * ltp-fcntl-locktests-tests * ltp-filecaps-tests * ltp-fs_bind-tests * ltp-fs_perms_simple-tests * ltp-fsx-tests * ltp-hugetlb-tests * ltp-mm-tests * ltp-nptl-tests * ltp-pty-tests * ltp-securebits-tests * perf * v4l2-compliance * ltp-cap_bounds-tests * ltp-commands-tests * ltp-containers-tests * ltp-cpuhotplug-tests * ltp-crypto-tests * ltp-cve-tests * ltp-dio-tests * ltp-fs-tests * ltp-io-tests * ltp-ipc-tests * ltp-math-tests * ltp-sched-tests * ltp-syscalls-tests * network-basic-tests * ltp-open-posix-tests * kvm-unit-tests * kselftest-vsyscall-mode-native * kselftest-vsyscall-mode-native/drivers * kselftest-vsyscall-mode-native/filesystems * kselftest-vsyscall-mode-native/net * kselftest-vsyscall-mode-native/networking * kselftest-vsyscall-mode-none * kselftest-vsyscall-mode-none/drivers * kselftest-vsyscall-mode-none/filesystems * kselftest-vsyscall-mode-none/net * kselftest-vsyscall-mode-none/networking
 
            On 5/13/20 3:43 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.41 release. There are 90 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 15 May 2020 09:41:20 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.41-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
thanks, -- Shuah
linux-stable-mirror@lists.linaro.org




