Note, this will be the LAST 5.1.y kernel release. Everyone should move to the 5.2.y series at this point in time.
This is the start of the stable review cycle for the 5.1.21 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun 28 Jul 2019 03:21:13 PM UTC. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.1.21-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.1.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.1.21-rc1
Kuo-Hsin Yang vovoy@chromium.org mm: vmscan: scan anonymous pages on file refaults
Damien Le Moal damien.lemoal@wdc.com block: Limit zone array allocation size
Damien Le Moal damien.lemoal@wdc.com sd_zbc: Fix report zones buffer allocation
Paolo Bonzini pbonzini@redhat.com Revert "kvm: x86: Use task structs fpu field for user"
Jan Kiszka jan.kiszka@siemens.com KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
Paolo Bonzini pbonzini@redhat.com KVM: nVMX: do not use dangling shadow VMCS after guest reset
Theodore Ts'o tytso@mit.edu ext4: allow directory holes
Ross Zwisler zwisler@chromium.org ext4: use jbd2_inode dirty range scoping
Ross Zwisler zwisler@chromium.org jbd2: introduce jbd2_inode dirty range scoping
Ross Zwisler zwisler@chromium.org mm: add filemap_fdatawait_range_keep_errors()
Theodore Ts'o tytso@mit.edu ext4: enforce the immutable flag on open files
Darrick J. Wong darrick.wong@oracle.com ext4: don't allow any modifications to an immutable file
Peter Zijlstra peterz@infradead.org perf/core: Fix race between close() and fork()
Alexander Shishkin alexander.shishkin@linux.intel.com perf/core: Fix exclusive events' grouping
Song Liu songliubraving@fb.com perf script: Assume native_arch for pipe mode
Paul Cercueil paul@crapouillou.net MIPS: lb60: Fix pin mappings
Keerthy j-keerthy@ti.com gpio: davinci: silence error prints in case of EPROBE_DEFER
Nishka Dasgupta nishkadg.linux@gmail.com gpiolib: of: fix a memory leak in of_gpio_flags_quirks()
Chris Wilson chris@chris-wilson.co.uk dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc
Jérôme Glisse jglisse@redhat.com dma-buf: balance refcount inbalance
Aya Levin ayal@mellanox.com net/mlx5e: Fix error flow in tx reporter diagnose
Aya Levin ayal@mellanox.com net/mlx5e: Fix return value from timeout recover function
Saeed Mahameed saeedm@mellanox.com net/mlx5e: Rx, Fix checksum calculation for new hardware
Eli Britstein elibr@mellanox.com net/mlx5e: Fix port tunnel GRE entropy control
Jakub Kicinski jakub.kicinski@netronome.com net/tls: reject offload of TLS 1.3
Jakub Kicinski jakub.kicinski@netronome.com net/tls: fix poll ignoring partially copied records
Frank de Brabander debrabander@gmail.com selftests: txring_overwrite: fix incorrect test of mmap() return value
Cong Wang xiyou.wangcong@gmail.com netrom: hold sock when setting skb->destructor
Cong Wang xiyou.wangcong@gmail.com netrom: fix a memory leak in nr_rx_frame()
Andreas Steinmetz ast@domdv.de macsec: fix checksumming after decryption
Andreas Steinmetz ast@domdv.de macsec: fix use-after-free of skb during RX
Nikolay Aleksandrov nikolay@cumulusnetworks.com net: bridge: stp: don't cache eth dest pointer before skb pull
Nikolay Aleksandrov nikolay@cumulusnetworks.com net: bridge: don't cache ether dest pointer on input
Nikolay Aleksandrov nikolay@cumulusnetworks.com net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
Nikolay Aleksandrov nikolay@cumulusnetworks.com net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
Aya Levin ayal@mellanox.com net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
Peter Kosyh p.kosyh@gmail.com vrf: make sure skb->data contains ip header to make routing
Christoph Paasch cpaasch@apple.com tcp: Reset bytes_acked and bytes_received when disconnecting
Eric Dumazet edumazet@google.com tcp: fix tcp_set_congestion_control() use from bpf hook
Eric Dumazet edumazet@google.com tcp: be more careful in tcp_fragment()
Takashi Iwai tiwai@suse.de sky2: Disable MSI on ASUS P6T
Xin Long lucien.xin@gmail.com sctp: not bind the socket in sctp_connect
Marcelo Ricardo Leitner marcelo.leitner@gmail.com sctp: fix error handling on stream scheduler initialization
David Howells dhowells@redhat.com rxrpc: Fix send on a connected, but unbound socket
Heiner Kallweit hkallweit1@gmail.com r8169: fix issue with confused RX unit after PHY power-down on RTL8411b
Yang Wei albin_yang@163.com nfc: fix potential illegal memory access
Jakub Kicinski jakub.kicinski@netronome.com net/tls: make sure offload also gets the keys wiped
Jose Abreu Jose.Abreu@synopsys.com net: stmmac: Re-work the queue selection for TSO packets
Cong Wang xiyou.wangcong@gmail.com net_sched: unset TCQ_F_CAN_BYPASS when adding filters
Andrew Lunn andrew@lunn.ch net: phy: sfp: hwmon: Fix scaling of RX power
John Hurley john.hurley@netronome.com net: openvswitch: fix csum updates for MPLS actions
Lorenzo Bianconi lorenzo.bianconi@redhat.com net: neigh: fix multiple neigh timer scheduling
Florian Westphal fw@strlen.de net: make skb_dst_force return true when dst is refcounted
Baruch Siach baruch@tkos.co.il net: dsa: mv88e6xxx: wait after reset deactivation
Justin Chen justinpopo6@gmail.com net: bcmgenet: use promisc for unsupported filters
Ido Schimmel idosch@mellanox.com ipv6: Unlink sibling route in case of failure
David Ahern dsahern@gmail.com ipv6: rt6_check should return NULL if 'from' is NULL
Matteo Croce mcroce@redhat.com ipv4: don't set IPv6 only flags to IPv4 addresses
Eric Dumazet edumazet@google.com igmp: fix memory leak in igmpv3_del_delrec()
Haiyang Zhang haiyangz@microsoft.com hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
Taehee Yoo ap420073@gmail.com caif-hsi: fix possible deadlock in cfhsi_exit_module()
Brian King brking@linux.vnet.ibm.com bnx2x: Prevent load reordering in tx completion processing
-------------
Diffstat:
Makefile | 4 +- arch/mips/jz4740/board-qi_lb60.c | 16 +-- arch/x86/include/asm/kvm_host.h | 7 +- arch/x86/kvm/vmx/nested.c | 10 +- arch/x86/kvm/x86.c | 4 +- block/blk-zoned.c | 46 ++++--- drivers/dma-buf/dma-buf.c | 1 + drivers/dma-buf/reservation.c | 4 + drivers/gpio/gpio-davinci.c | 5 +- drivers/gpio/gpiolib-of.c | 1 + drivers/net/caif/caif_hsi.c | 2 +- drivers/net/dsa/mv88e6xxx/chip.c | 2 + drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 + drivers/net/ethernet/broadcom/genet/bcmgenet.c | 57 ++++----- drivers/net/ethernet/marvell/sky2.c | 7 ++ drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 + .../ethernet/mellanox/mlx5/core/en/reporter_tx.c | 10 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 +- .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 9 +- .../net/ethernet/mellanox/mlx5/core/lib/port_tun.c | 23 +--- drivers/net/ethernet/realtek/r8169.c | 137 +++++++++++++++++++++ drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 29 +++-- drivers/net/hyperv/netvsc_drv.c | 1 - drivers/net/macsec.c | 6 +- drivers/net/phy/sfp.c | 2 +- drivers/net/vrf.c | 58 +++++---- drivers/scsi/sd_zbc.c | 104 +++++++++++----- fs/ext4/dir.c | 19 ++- fs/ext4/ext4_jbd2.h | 12 +- fs/ext4/file.c | 4 + fs/ext4/inode.c | 24 +++- fs/ext4/ioctl.c | 46 ++++++- fs/ext4/move_extent.c | 3 +- fs/ext4/namei.c | 45 +++++-- fs/jbd2/commit.c | 23 +++- fs/jbd2/journal.c | 4 + fs/jbd2/transaction.c | 49 ++++---- include/linux/blkdev.h | 5 + include/linux/fs.h | 2 + include/linux/jbd2.h | 22 ++++ include/linux/mlx5/mlx5_ifc.h | 3 +- include/linux/perf_event.h | 5 + include/net/dst.h | 5 +- include/net/tcp.h | 8 +- include/net/tls.h | 1 + kernel/events/core.c | 83 ++++++++++--- mm/filemap.c | 22 ++++ mm/vmscan.c | 6 +- net/bridge/br_input.c | 8 +- net/bridge/br_multicast.c | 23 ++-- net/bridge/br_stp_bpdu.c | 3 +- net/core/filter.c | 2 +- net/core/neighbour.c | 2 + net/ipv4/devinet.c | 8 ++ net/ipv4/igmp.c | 8 +- net/ipv4/tcp.c | 6 +- net/ipv4/tcp_cong.c | 6 +- net/ipv4/tcp_output.c | 13 +- net/ipv6/ip6_fib.c | 18 ++- net/ipv6/route.c | 2 +- net/netfilter/nf_queue.c | 6 +- net/netrom/af_netrom.c | 4 +- net/nfc/nci/data.c | 2 +- net/openvswitch/actions.c | 6 +- net/rxrpc/af_rxrpc.c | 4 +- net/sched/cls_api.c | 1 + net/sched/sch_fq_codel.c | 2 - net/sched/sch_sfq.c | 2 - net/sctp/socket.c | 24 +--- net/sctp/stream.c | 9 +- net/tls/tls_device.c | 10 +- net/tls/tls_main.c | 4 +- net/tls/tls_sw.c | 3 +- tools/perf/builtin-script.c | 3 +- tools/testing/selftests/net/txring_overwrite.c | 2 +- 76 files changed, 816 insertions(+), 315 deletions(-)
From: Brian King brking@linux.vnet.ibm.com
[ Upstream commit ea811b795df24644a8eb760b493c43fba4450677 ]
This patch fixes an issue seen on Power systems with bnx2x which results in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.
Signed-off-by: Brian King brking@linux.vnet.ibm.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struc hw_cons = le16_to_cpu(*txdata->tx_cons_sb); sw_cons = txdata->tx_pkt_cons;
+ /* Ensure subsequent loads occur after hw_cons */ + smp_rmb(); + while (sw_cons != hw_cons) { u16 pkt_cons;
From: Taehee Yoo ap420073@gmail.com
[ Upstream commit fdd258d49e88a9e0b49ef04a506a796f1c768a8e ]
cfhsi_exit_module() calls unregister_netdev() under rtnl_lock(). but unregister_netdev() internally calls rtnl_lock(). So deadlock would occur.
Fixes: c41254006377 ("caif-hsi: Add rtnl support") Signed-off-by: Taehee Yoo ap420073@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/caif/caif_hsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/caif/caif_hsi.c +++ b/drivers/net/caif/caif_hsi.c @@ -1455,7 +1455,7 @@ static void __exit cfhsi_exit_module(voi rtnl_lock(); list_for_each_safe(list_node, n, &cfhsi_list) { cfhsi = list_entry(list_node, struct cfhsi, list); - unregister_netdev(cfhsi->ndev); + unregister_netdevice(cfhsi->ndev); } rtnl_unlock(); }
From: Haiyang Zhang haiyangz@microsoft.com
[ Upstream commit be4363bdf0ce9530f15aa0a03d1060304d116b15 ]
There is an extra rcu_read_unlock left in netvsc_recv_callback(), after a previous patch that removes RCU from this function. This patch removes the extra RCU unlock.
Fixes: 345ac08990b8 ("hv_netvsc: pass netvsc_device to receive callback") Signed-off-by: Haiyang Zhang haiyangz@microsoft.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/hyperv/netvsc_drv.c | 1 - 1 file changed, 1 deletion(-)
--- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -849,7 +849,6 @@ int netvsc_recv_callback(struct net_devi
if (unlikely(!skb)) { ++net_device_ctx->eth_stats.rx_no_memory; - rcu_read_unlock(); return NVSP_STAT_FAIL; }
From: Eric Dumazet edumazet@google.com
[ Upstream commit e5b1c6c6277d5a283290a8c033c72544746f9b5b ]
im->tomb and/or im->sources might not be NULL, but we currently overwrite their values blindly.
Using swap() will make sure the following call to kfree_pmc(pmc) will properly free the psf structures.
Tested with the C repro provided by syzbot, which basically does :
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3 setsockopt(3, SOL_IP, IP_ADD_MEMBERSHIP, "\340\0\0\2\177\0\0\1\0\0\0\0", 12) = 0 ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=0}) = 0 setsockopt(3, SOL_IP, IP_MSFILTER, "\340\0\0\2\177\0\0\1\1\0\0\0\1\0\0\0\377\377\377\377", 20) = 0 ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP}) = 0 exit_group(0) = ?
BUG: memory leak unreferenced object 0xffff88811450f140 (size 64): comm "softirq", pid 0, jiffies 4294942448 (age 32.070s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 ................ 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ backtrace: [<00000000c7bad083>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<00000000c7bad083>] slab_post_alloc_hook mm/slab.h:439 [inline] [<00000000c7bad083>] slab_alloc mm/slab.c:3326 [inline] [<00000000c7bad083>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<000000009acc4151>] kmalloc include/linux/slab.h:547 [inline] [<000000009acc4151>] kzalloc include/linux/slab.h:742 [inline] [<000000009acc4151>] ip_mc_add1_src net/ipv4/igmp.c:1976 [inline] [<000000009acc4151>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2100 [<000000004ac14566>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2484 [<0000000052d8f995>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:959 [<000000004ee1e21f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1248 [<0000000066cdfe74>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2618 [<000000009383a786>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3126 [<00000000d8ac0c94>] __sys_setsockopt+0x98/0x120 net/socket.c:2072 [<000000001b1e9666>] __do_sys_setsockopt net/socket.c:2083 [inline] [<000000001b1e9666>] __se_sys_setsockopt net/socket.c:2080 [inline] [<000000001b1e9666>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080 [<00000000420d395e>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 [<000000007fd83a4b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when set link down") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Hangbin Liu liuhangbin@gmail.com Reported-by: syzbot+6ca1abd0db68b5173a4f@syzkaller.appspotmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/ipv4/igmp.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
--- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -1232,12 +1232,8 @@ static void igmpv3_del_delrec(struct in_ if (pmc) { im->interface = pmc->interface; if (im->sfmode == MCAST_INCLUDE) { - im->tomb = pmc->tomb; - pmc->tomb = NULL; - - im->sources = pmc->sources; - pmc->sources = NULL; - + swap(im->tomb, pmc->tomb); + swap(im->sources, pmc->sources); for (psf = im->sources; psf; psf = psf->sf_next) psf->sf_crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv; } else {
From: Matteo Croce mcroce@redhat.com
[ Upstream commit 2e60546368165c2449564d71f6005dda9205b5fb ]
Avoid the situation where an IPV6 only flag is applied to an IPv4 address:
# ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute # ip -4 addr show dev dummy0 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global noprefixroute dummy0 valid_lft forever preferred_lft forever
Or worse, by sending a malicious netlink command:
# ip -4 addr show dev dummy0 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global nodad optimistic dadfailed home tentative mngtmpaddr noprefixroute stable-privacy dummy0 valid_lft forever preferred_lft forever
Signed-off-by: Matteo Croce mcroce@redhat.com Reviewed-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/ipv4/devinet.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -66,6 +66,11 @@ #include <net/net_namespace.h> #include <net/addrconf.h>
+#define IPV6ONLY_FLAGS \ + (IFA_F_NODAD | IFA_F_OPTIMISTIC | IFA_F_DADFAILED | \ + IFA_F_HOMEADDRESS | IFA_F_TENTATIVE | \ + IFA_F_MANAGETEMPADDR | IFA_F_STABLE_PRIVACY) + static struct ipv4_devconf ipv4_devconf = { .data = { [IPV4_DEVCONF_ACCEPT_REDIRECTS - 1] = 1, @@ -472,6 +477,9 @@ static int __inet_insert_ifa(struct in_i ifa->ifa_flags &= ~IFA_F_SECONDARY; last_primary = &in_dev->ifa_list;
+ /* Don't set IPv6 only flags to IPv4 addresses */ + ifa->ifa_flags &= ~IPV6ONLY_FLAGS; + for (ifap = &in_dev->ifa_list; (ifa1 = *ifap) != NULL; ifap = &ifa1->ifa_next) { if (!(ifa1->ifa_flags & IFA_F_SECONDARY) &&
From: David Ahern dsahern@gmail.com
[ Upstream commit 49d05fe2c9d1b4a27761c9807fec39b8155bef9e ]
Paul reported that l2tp sessions were broken after the commit referenced in the Fixes tag. Prior to this commit rt6_check returned NULL if the rt6_info 'from' was NULL - ie., the dst_entry was disconnected from a FIB entry. Restore that behavior.
Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: Paul Donohue linux-kernel@PaulSD.com Tested-by: Paul Donohue linux-kernel@PaulSD.com Signed-off-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/ipv6/route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2183,7 +2183,7 @@ static struct dst_entry *rt6_check(struc { u32 rt_cookie = 0;
- if ((from && !fib6_get_cookie_safe(from, &rt_cookie)) || + if (!from || !fib6_get_cookie_safe(from, &rt_cookie) || rt_cookie != cookie) return NULL;
From: Ido Schimmel idosch@mellanox.com
[ Upstream commit 54851aa90cf27041d64b12f65ac72e9f97bd90fd ]
When a route needs to be appended to an existing multipath route, fib6_add_rt2node() first appends it to the siblings list and increments the number of sibling routes on each sibling.
Later, the function notifies the route via call_fib6_entry_notifiers(). In case the notification is vetoed, the route is not unlinked from the siblings list, which can result in a use-after-free.
Fix this by unlinking the route from the siblings list before returning an error.
Audited the rest of the call sites from which the FIB notification chain is called and could not find more problems.
Fixes: 2233000cba40 ("net/ipv6: Move call_fib6_entry_notifiers up for route adds") Signed-off-by: Ido Schimmel idosch@mellanox.com Reported-by: Alexander Petrovskiy alexpe@mellanox.com Reviewed-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/ipv6/ip6_fib.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-)
--- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1113,8 +1113,24 @@ add: err = call_fib6_entry_notifiers(info->nl_net, FIB_EVENT_ENTRY_ADD, rt, extack); - if (err) + if (err) { + struct fib6_info *sibling, *next_sibling; + + /* If the route has siblings, then it first + * needs to be unlinked from them. + */ + if (!rt->fib6_nsiblings) + return err; + + list_for_each_entry_safe(sibling, next_sibling, + &rt->fib6_siblings, + fib6_siblings) + sibling->fib6_nsiblings--; + rt->fib6_nsiblings = 0; + list_del_init(&rt->fib6_siblings); + rt6_multipath_rebalance(next_sibling); return err; + }
rcu_assign_pointer(rt->fib6_next, iter); atomic_inc(&rt->fib6_ref);
From: Justin Chen justinpopo6@gmail.com
[ Upstream commit 35cbef9863640f06107144687bd13151bc2e8ce3 ]
Currently we silently ignore filters if we cannot meet the filter requirements. This will lead to the MAC dropping packets that are expected to pass. A better solution would be to set the NIC to promisc mode when the required filters cannot be met.
Also correct the number of MDF filters supported. It should be 17, not 16.
Signed-off-by: Justin Chen justinpopo6@gmail.com Reviewed-by: Florian Fainelli f.fainelli@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 57 +++++++++++-------------- 1 file changed, 26 insertions(+), 31 deletions(-)
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c @@ -3086,39 +3086,42 @@ static void bcmgenet_timeout(struct net_ netif_tx_wake_all_queues(dev); }
-#define MAX_MC_COUNT 16 +#define MAX_MDF_FILTER 17
static inline void bcmgenet_set_mdf_addr(struct bcmgenet_priv *priv, unsigned char *addr, - int *i, - int *mc) + int *i) { - u32 reg; - bcmgenet_umac_writel(priv, addr[0] << 8 | addr[1], UMAC_MDF_ADDR + (*i * 4)); bcmgenet_umac_writel(priv, addr[2] << 24 | addr[3] << 16 | addr[4] << 8 | addr[5], UMAC_MDF_ADDR + ((*i + 1) * 4)); - reg = bcmgenet_umac_readl(priv, UMAC_MDF_CTRL); - reg |= (1 << (MAX_MC_COUNT - *mc)); - bcmgenet_umac_writel(priv, reg, UMAC_MDF_CTRL); *i += 2; - (*mc)++; }
static void bcmgenet_set_rx_mode(struct net_device *dev) { struct bcmgenet_priv *priv = netdev_priv(dev); struct netdev_hw_addr *ha; - int i, mc; + int i, nfilter; u32 reg;
netif_dbg(priv, hw, dev, "%s: %08X\n", __func__, dev->flags);
- /* Promiscuous mode */ + /* Number of filters needed */ + nfilter = netdev_uc_count(dev) + netdev_mc_count(dev) + 2; + + /* + * Turn on promicuous mode for three scenarios + * 1. IFF_PROMISC flag is set + * 2. IFF_ALLMULTI flag is set + * 3. The number of filters needed exceeds the number filters + * supported by the hardware. + */ reg = bcmgenet_umac_readl(priv, UMAC_CMD); - if (dev->flags & IFF_PROMISC) { + if ((dev->flags & (IFF_PROMISC | IFF_ALLMULTI)) || + (nfilter > MAX_MDF_FILTER)) { reg |= CMD_PROMISC; bcmgenet_umac_writel(priv, reg, UMAC_CMD); bcmgenet_umac_writel(priv, 0, UMAC_MDF_CTRL); @@ -3128,32 +3131,24 @@ static void bcmgenet_set_rx_mode(struct bcmgenet_umac_writel(priv, reg, UMAC_CMD); }
- /* UniMac doesn't support ALLMULTI */ - if (dev->flags & IFF_ALLMULTI) { - netdev_warn(dev, "ALLMULTI is not supported\n"); - return; - } - /* update MDF filter */ i = 0; - mc = 0; /* Broadcast */ - bcmgenet_set_mdf_addr(priv, dev->broadcast, &i, &mc); + bcmgenet_set_mdf_addr(priv, dev->broadcast, &i); /* my own address.*/ - bcmgenet_set_mdf_addr(priv, dev->dev_addr, &i, &mc); - /* Unicast list*/ - if (netdev_uc_count(dev) > (MAX_MC_COUNT - mc)) - return; + bcmgenet_set_mdf_addr(priv, dev->dev_addr, &i);
- if (!netdev_uc_empty(dev)) - netdev_for_each_uc_addr(ha, dev) - bcmgenet_set_mdf_addr(priv, ha->addr, &i, &mc); - /* Multicast */ - if (netdev_mc_empty(dev) || netdev_mc_count(dev) >= (MAX_MC_COUNT - mc)) - return; + /* Unicast */ + netdev_for_each_uc_addr(ha, dev) + bcmgenet_set_mdf_addr(priv, ha->addr, &i);
+ /* Multicast */ netdev_for_each_mc_addr(ha, dev) - bcmgenet_set_mdf_addr(priv, ha->addr, &i, &mc); + bcmgenet_set_mdf_addr(priv, ha->addr, &i); + + /* Enable filters */ + reg = GENMASK(MAX_MDF_FILTER - 1, MAX_MDF_FILTER - nfilter); + bcmgenet_umac_writel(priv, reg, UMAC_MDF_CTRL); }
/* Set the hardware MAC address. */
From: Baruch Siach baruch@tkos.co.il
[ Upstream commit 7b75e49de424ceb53d13e60f35d0a73765626fda ]
Add a 1ms delay after reset deactivation. Otherwise the chip returns bogus ID value. This is observed with 88E6390 (Peridot) chip.
Signed-off-by: Baruch Siach baruch@tkos.co.il Reviewed-by: Andrew Lunn andrew@lunn.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/dsa/mv88e6xxx/chip.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -4910,6 +4910,8 @@ static int mv88e6xxx_probe(struct mdio_d err = PTR_ERR(chip->reset); goto out; } + if (chip->reset) + usleep_range(1000, 2000);
err = mv88e6xxx_detect(chip); if (err)
From: Florian Westphal fw@strlen.de
[ Upstream commit b60a77386b1d4868f72f6353d35dabe5fbe981f2 ]
netfilter did not expect that skb_dst_force() can cause skb to lose its dst entry.
I got a bug report with a skb->dst NULL dereference in netfilter output path. The backtrace contains nf_reinject(), so the dst might have been cleared when skb got queued to userspace.
Other users were fixed via if (skb_dst(skb)) { skb_dst_force(skb); if (!skb_dst(skb)) goto handle_err; }
But I think its preferable to make the 'dst might be cleared' part of the function explicit.
In netfilter case, skb with a null dst is expected when queueing in prerouting hook, so drop skb for the other hooks.
v2: v1 of this patch returned true in case skb had no dst entry. Eric said: Say if we have two skb_dst_force() calls for some reason on the same skb, only the first one will return false.
This now returns false even when skb had no dst, as per Erics suggestion, so callers might need to check skb_dst() first before skb_dst_force().
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/dst.h | 5 ++++- net/netfilter/nf_queue.c | 6 +++++- 2 files changed, 9 insertions(+), 2 deletions(-)
--- a/include/net/dst.h +++ b/include/net/dst.h @@ -313,8 +313,9 @@ static inline bool dst_hold_safe(struct * @skb: buffer * * If dst is not yet refcounted and not destroyed, grab a ref on it. + * Returns true if dst is refcounted. */ -static inline void skb_dst_force(struct sk_buff *skb) +static inline bool skb_dst_force(struct sk_buff *skb) { if (skb_dst_is_noref(skb)) { struct dst_entry *dst = skb_dst(skb); @@ -325,6 +326,8 @@ static inline void skb_dst_force(struct
skb->_skb_refdst = (unsigned long)dst; } + + return skb->_skb_refdst != 0UL; }
--- a/net/netfilter/nf_queue.c +++ b/net/netfilter/nf_queue.c @@ -190,6 +190,11 @@ static int __nf_queue(struct sk_buff *sk goto err; }
+ if (!skb_dst_force(skb) && state->hook != NF_INET_PRE_ROUTING) { + status = -ENETDOWN; + goto err; + } + *entry = (struct nf_queue_entry) { .skb = skb, .state = *state, @@ -198,7 +203,6 @@ static int __nf_queue(struct sk_buff *sk };
nf_queue_entry_get_refs(entry); - skb_dst_force(skb);
switch (entry->state.pf) { case AF_INET:
From: Lorenzo Bianconi lorenzo.bianconi@redhat.com
[ Upstream commit 071c37983d99da07797294ea78e9da1a6e287144 ]
Neigh timer can be scheduled multiple times from userspace adding multiple neigh entries and forcing the neigh timer scheduling passing NTF_USE in the netlink requests. This will result in a refcount leak and in the following dump stack:
[ 32.465295] NEIGH: BUG, double timer add, state is 8 [ 32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65 [ 32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014 [ 32.465313] Call Trace: [ 32.465318] dump_stack+0x7c/0xc0 [ 32.465323] __neigh_event_send+0x20c/0x880 [ 32.465326] ? ___neigh_create+0x846/0xfb0 [ 32.465329] ? neigh_lookup+0x2a9/0x410 [ 32.465332] ? neightbl_fill_info.constprop.0+0x800/0x800 [ 32.465334] neigh_add+0x4f8/0x5e0 [ 32.465337] ? neigh_xmit+0x620/0x620 [ 32.465341] ? find_held_lock+0x85/0xa0 [ 32.465345] rtnetlink_rcv_msg+0x204/0x570 [ 32.465348] ? rtnl_dellink+0x450/0x450 [ 32.465351] ? mark_held_locks+0x90/0x90 [ 32.465354] ? match_held_lock+0x1b/0x230 [ 32.465357] netlink_rcv_skb+0xc4/0x1d0 [ 32.465360] ? rtnl_dellink+0x450/0x450 [ 32.465363] ? netlink_ack+0x420/0x420 [ 32.465366] ? netlink_deliver_tap+0x115/0x560 [ 32.465369] ? __alloc_skb+0xc9/0x2f0 [ 32.465372] netlink_unicast+0x270/0x330 [ 32.465375] ? netlink_attachskb+0x2f0/0x2f0 [ 32.465378] netlink_sendmsg+0x34f/0x5a0 [ 32.465381] ? netlink_unicast+0x330/0x330 [ 32.465385] ? move_addr_to_kernel.part.0+0x20/0x20 [ 32.465388] ? netlink_unicast+0x330/0x330 [ 32.465391] sock_sendmsg+0x91/0xa0 [ 32.465394] ___sys_sendmsg+0x407/0x480 [ 32.465397] ? copy_msghdr_from_user+0x200/0x200 [ 32.465401] ? _raw_spin_unlock_irqrestore+0x37/0x40 [ 32.465404] ? lockdep_hardirqs_on+0x17d/0x250 [ 32.465407] ? __wake_up_common_lock+0xcb/0x110 [ 32.465410] ? __wake_up_common+0x230/0x230 [ 32.465413] ? netlink_bind+0x3e1/0x490 [ 32.465416] ? netlink_setsockopt+0x540/0x540 [ 32.465420] ? __fget_light+0x9c/0xf0 [ 32.465423] ? sockfd_lookup_light+0x8c/0xb0 [ 32.465426] __sys_sendmsg+0xa5/0x110 [ 32.465429] ? __ia32_sys_shutdown+0x30/0x30 [ 32.465432] ? __fd_install+0xe1/0x2c0 [ 32.465435] ? lockdep_hardirqs_off+0xb5/0x100 [ 32.465438] ? mark_held_locks+0x24/0x90 [ 32.465441] ? do_syscall_64+0xf/0x270 [ 32.465444] do_syscall_64+0x63/0x270 [ 32.465448] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER' receiving a netlink request with NTF_USE flag set
Reported-by: Marek Majkowski marek@cloudflare.com Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table") Signed-off-by: Lorenzo Bianconi lorenzo.bianconi@redhat.com Reviewed-by: David Ahern dsahern@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/neighbour.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1126,6 +1126,7 @@ int __neigh_event_send(struct neighbour
atomic_set(&neigh->probes, NEIGH_VAR(neigh->parms, UCAST_PROBES)); + neigh_del_timer(neigh); neigh->nud_state = NUD_INCOMPLETE; neigh->updated = now; next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME), @@ -1142,6 +1143,7 @@ int __neigh_event_send(struct neighbour } } else if (neigh->nud_state & NUD_STALE) { neigh_dbg(2, "neigh %p is delayed\n", neigh); + neigh_del_timer(neigh); neigh->nud_state = NUD_DELAY; neigh->updated = jiffies; neigh_add_timer(neigh, jiffies +
From: John Hurley john.hurley@netronome.com
[ Upstream commit 0e3183cd2a64843a95b62f8bd4a83605a4cf0615 ]
Skbs may have their checksum value populated by HW. If this is a checksum calculated over the entire packet then the CHECKSUM_COMPLETE field is marked. Changes to the data pointer on the skb throughout the network stack still try to maintain this complete csum value if it is required through functions such as skb_postpush_rcsum.
The MPLS actions in Open vSwitch modify a CHECKSUM_COMPLETE value when changes are made to packet data without a push or a pull. This occurs when the ethertype of the MAC header is changed or when MPLS lse fields are modified.
The modification is carried out using the csum_partial function to get the csum of a buffer and add it into the larger checksum. The buffer is an inversion of the data to be removed followed by the new data. Because the csum is calculated over 16 bits and these values align with 16 bits, the effect is the removal of the old value from the CHECKSUM_COMPLETE and addition of the new value.
However, the csum fed into the function and the outcome of the calculation are also inverted. This would only make sense if it was the new value rather than the old that was inverted in the input buffer.
Fix the issue by removing the bit inverts in the csum_partial calculation.
The bug was verified and the fix tested by comparing the folded value of the updated CHECKSUM_COMPLETE value with the folded value of a full software checksum calculation (reset skb->csum to 0 and run skb_checksum_complete(skb)). Prior to the fix the outcomes differed but after they produce the same result.
Fixes: 25cd9ba0abc0 ("openvswitch: Add basic MPLS support to kernel") Fixes: bc7cc5999fd3 ("openvswitch: update checksum in {push,pop}_mpls") Signed-off-by: John Hurley john.hurley@netronome.com Reviewed-by: Jakub Kicinski jakub.kicinski@netronome.com Reviewed-by: Simon Horman simon.horman@netronome.com Acked-by: Pravin B Shelar pshelar@ovn.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/openvswitch/actions.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
--- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -175,8 +175,7 @@ static void update_ethertype(struct sk_b if (skb->ip_summed == CHECKSUM_COMPLETE) { __be16 diff[] = { ~(hdr->h_proto), ethertype };
- skb->csum = ~csum_partial((char *)diff, sizeof(diff), - ~skb->csum); + skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum); }
hdr->h_proto = ethertype; @@ -268,8 +267,7 @@ static int set_mpls(struct sk_buff *skb, if (skb->ip_summed == CHECKSUM_COMPLETE) { __be32 diff[] = { ~(stack->label_stack_entry), lse };
- skb->csum = ~csum_partial((char *)diff, sizeof(diff), - ~skb->csum); + skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum); }
stack->label_stack_entry = lse;
From: Andrew Lunn andrew@lunn.ch
[ Upstream commit 0cea0e1148fe134a4a3aaf0b1496f09241fb943a ]
The RX power read from the SFP uses units of 0.1uW. This must be scaled to units of uW for HWMON. This requires a divide by 10, not the current 100.
With this change in place, sensors(1) and ethtool -m agree:
sff2-isa-0000 Adapter: ISA adapter in0: +3.23 V temp1: +33.1 C power1: 270.00 uW power2: 200.00 uW curr1: +0.01 A
Laser output power : 0.2743 mW / -5.62 dBm Receiver signal average optical power : 0.2014 mW / -6.96 dBm
Reported-by: chris.healy@zii.aero Signed-off-by: Andrew Lunn andrew@lunn.ch Fixes: 1323061a018a ("net: phy: sfp: Add HWMON support for module sensors") Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/phy/sfp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/phy/sfp.c +++ b/drivers/net/phy/sfp.c @@ -515,7 +515,7 @@ static int sfp_hwmon_read_sensor(struct
static void sfp_hwmon_to_rx_power(long *value) { - *value = DIV_ROUND_CLOSEST(*value, 100); + *value = DIV_ROUND_CLOSEST(*value, 10); }
static void sfp_hwmon_calibrate(struct sfp *sfp, unsigned int slope, int offset,
From: Cong Wang xiyou.wangcong@gmail.com
[ Upstream commit 3f05e6886a595c9a29a309c52f45326be917823c ]
For qdisc's that support TC filters and set TCQ_F_CAN_BYPASS, notably fq_codel, it makes no sense to let packets bypass the TC filters we setup in any scenario, otherwise our packets steering policy could not be enforced.
This can be reproduced easily with the following script:
ip li add dev dummy0 type dummy ifconfig dummy0 up tc qd add dev dummy0 root fq_codel tc filter add dev dummy0 parent 8001: protocol arp basic action mirred egress redirect dev lo tc filter add dev dummy0 parent 8001: protocol ip basic action mirred egress redirect dev lo ping -I dummy0 192.168.112.1
Without this patch, packets are sent directly to dummy0 without hitting any of the filters. With this patch, packets are redirected to loopback as expected.
This fix is not perfect, it only unsets the flag but does not set it back because we have to save the information somewhere in the qdisc if we really want that. Note, both fq_codel and sfq clear this flag in their ->bind_tcf() but this is clearly not sufficient when we don't use any class ID.
Fixes: 23624935e0c4 ("net_sched: TCQ_F_CAN_BYPASS generalization") Cc: Eric Dumazet edumazet@google.com Signed-off-by: Cong Wang xiyou.wangcong@gmail.com Reviewed-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sched/cls_api.c | 1 + net/sched/sch_fq_codel.c | 2 -- net/sched/sch_sfq.c | 2 -- 3 files changed, 1 insertion(+), 4 deletions(-)
--- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -2162,6 +2162,7 @@ replay: tfilter_notify(net, skb, n, tp, block, q, parent, fh, RTM_NEWTFILTER, false, rtnl_held); tfilter_put(tp, fh); + q->flags &= ~TCQ_F_CAN_BYPASS; }
errout: --- a/net/sched/sch_fq_codel.c +++ b/net/sched/sch_fq_codel.c @@ -600,8 +600,6 @@ static unsigned long fq_codel_find(struc static unsigned long fq_codel_bind(struct Qdisc *sch, unsigned long parent, u32 classid) { - /* we cannot bypass queue discipline anymore */ - sch->flags &= ~TCQ_F_CAN_BYPASS; return 0; }
--- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -828,8 +828,6 @@ static unsigned long sfq_find(struct Qdi static unsigned long sfq_bind(struct Qdisc *sch, unsigned long parent, u32 classid) { - /* we cannot bypass queue discipline anymore */ - sch->flags &= ~TCQ_F_CAN_BYPASS; return 0; }
From: Jose Abreu Jose.Abreu@synopsys.com
[ Upstream commit 4993e5b37e8bcb55ac90f76eb6d2432647273747 ]
Ben Hutchings says: "This is the wrong place to change the queue mapping. stmmac_xmit() is called with a specific TX queue locked, and accessing a different TX queue results in a data race for all of that queue's state.
I think this commit should be reverted upstream and in all stable branches. Instead, the driver should implement the ndo_select_queue operation and override the queue mapping there."
Fixes: c5acdbee22a1 ("net: stmmac: Send TSO packets always from Queue 0") Suggested-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Jose Abreu joabreu@synopsys.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 29 ++++++++++++++-------- 1 file changed, 19 insertions(+), 10 deletions(-)
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -3058,17 +3058,8 @@ static netdev_tx_t stmmac_xmit(struct sk
/* Manage oversized TCP frames for GMAC4 device */ if (skb_is_gso(skb) && priv->tso) { - if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) { - /* - * There is no way to determine the number of TSO - * capable Queues. Let's use always the Queue 0 - * because if TSO is supported then at least this - * one will be capable. - */ - skb_set_queue_mapping(skb, 0); - + if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) return stmmac_tso_xmit(skb, dev); - } }
if (unlikely(stmmac_tx_avail(priv, queue) < nfrags + 1)) { @@ -3885,6 +3876,23 @@ static int stmmac_setup_tc(struct net_de } }
+static u16 stmmac_select_queue(struct net_device *dev, struct sk_buff *skb, + struct net_device *sb_dev, + select_queue_fallback_t fallback) +{ + if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) { + /* + * There is no way to determine the number of TSO + * capable Queues. Let's use always the Queue 0 + * because if TSO is supported then at least this + * one will be capable. + */ + return 0; + } + + return fallback(dev, skb, NULL) % dev->real_num_tx_queues; +} + static int stmmac_set_mac_address(struct net_device *ndev, void *addr) { struct stmmac_priv *priv = netdev_priv(ndev); @@ -4101,6 +4109,7 @@ static const struct net_device_ops stmma .ndo_tx_timeout = stmmac_tx_timeout, .ndo_do_ioctl = stmmac_ioctl, .ndo_setup_tc = stmmac_setup_tc, + .ndo_select_queue = stmmac_select_queue, #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller = stmmac_poll_controller, #endif
From: Jakub Kicinski jakub.kicinski@netronome.com
[ Upstream commit acd3e96d53a24d219f720ed4012b62723ae05da1 ]
Commit 86029d10af18 ("tls: zero the crypto information from tls_context before freeing") added memzero_explicit() calls to clear the key material before freeing struct tls_context, but it missed tls_device.c has its own way of freeing this structure. Replace the missing free.
Fixes: 86029d10af18 ("tls: zero the crypto information from tls_context before freeing") Signed-off-by: Jakub Kicinski jakub.kicinski@netronome.com Reviewed-by: Dirk van der Merwe dirk.vandermerwe@netronome.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/tls.h | 1 + net/tls/tls_device.c | 2 +- net/tls/tls_main.c | 4 ++-- 3 files changed, 4 insertions(+), 3 deletions(-)
--- a/include/net/tls.h +++ b/include/net/tls.h @@ -285,6 +285,7 @@ struct tls_offload_context_rx { (ALIGN(sizeof(struct tls_offload_context_rx), sizeof(void *)) + \ TLS_DRIVER_STATE_SIZE)
+void tls_ctx_free(struct tls_context *ctx); int wait_on_pending_writer(struct sock *sk, long *timeo); int tls_sk_query(struct sock *sk, int optname, char __user *optval, int __user *optlen); --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -61,7 +61,7 @@ static void tls_device_free_ctx(struct t if (ctx->rx_conf == TLS_HW) kfree(tls_offload_ctx_rx(ctx));
- kfree(ctx); + tls_ctx_free(ctx); }
static void tls_device_gc_task(struct work_struct *work) --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -251,7 +251,7 @@ static void tls_write_space(struct sock ctx->sk_write_space(sk); }
-static void tls_ctx_free(struct tls_context *ctx) +void tls_ctx_free(struct tls_context *ctx) { if (!ctx) return; @@ -638,7 +638,7 @@ static void tls_hw_sk_destruct(struct so
ctx->sk_destruct(sk); /* Free ctx */ - kfree(ctx); + tls_ctx_free(ctx); icsk->icsk_ulp_data = NULL; }
From: Yang Wei albin_yang@163.com
[ Upstream commit dd006fc434e107ef90f7de0db9907cbc1c521645 ]
The frags_q is not properly initialized, it may result in illegal memory access when conn_info is NULL. The "goto free_exit" should be replaced by "goto exit".
Signed-off-by: Yang Wei albin_yang@163.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/nfc/nci/data.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/nfc/nci/data.c +++ b/net/nfc/nci/data.c @@ -119,7 +119,7 @@ static int nci_queue_tx_data_frags(struc conn_info = nci_get_conn_info_by_conn_id(ndev, conn_id); if (!conn_info) { rc = -EPROTO; - goto free_exit; + goto exit; }
__skb_queue_head_init(&frags_q);
From: Heiner Kallweit hkallweit1@gmail.com
[ Upstream commit fe4e8db0392a6c2e795eb89ef5fcd86522e66248 ]
On RTL8411b the RX unit gets confused if the PHY is powered-down. This was reported in [0] and confirmed by Realtek. Realtek provided a sequence to fix the RX unit after PHY wakeup.
The issue itself seems to have been there longer, the Fixes tag refers to where the fix applies properly.
[0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
Fixes: a99790bf5c7f ("r8169: Reinstate ASPM Support") Tested-by: Ionut Radu ionut.radu@gmail.com Signed-off-by: Heiner Kallweit hkallweit1@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/realtek/r8169.c | 137 +++++++++++++++++++++++++++++++++++ 1 file changed, 137 insertions(+)
--- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -5241,6 +5241,143 @@ static void rtl_hw_start_8411_2(struct r /* disable aspm and clock request before access ephy */ rtl_hw_aspm_clkreq_enable(tp, false); rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2)); + + /* The following Realtek-provided magic fixes an issue with the RX unit + * getting confused after the PHY having been powered-down. + */ + r8168_mac_ocp_write(tp, 0xFC28, 0x0000); + r8168_mac_ocp_write(tp, 0xFC2A, 0x0000); + r8168_mac_ocp_write(tp, 0xFC2C, 0x0000); + r8168_mac_ocp_write(tp, 0xFC2E, 0x0000); + r8168_mac_ocp_write(tp, 0xFC30, 0x0000); + r8168_mac_ocp_write(tp, 0xFC32, 0x0000); + r8168_mac_ocp_write(tp, 0xFC34, 0x0000); + r8168_mac_ocp_write(tp, 0xFC36, 0x0000); + mdelay(3); + r8168_mac_ocp_write(tp, 0xFC26, 0x0000); + + r8168_mac_ocp_write(tp, 0xF800, 0xE008); + r8168_mac_ocp_write(tp, 0xF802, 0xE00A); + r8168_mac_ocp_write(tp, 0xF804, 0xE00C); + r8168_mac_ocp_write(tp, 0xF806, 0xE00E); + r8168_mac_ocp_write(tp, 0xF808, 0xE027); + r8168_mac_ocp_write(tp, 0xF80A, 0xE04F); + r8168_mac_ocp_write(tp, 0xF80C, 0xE05E); + r8168_mac_ocp_write(tp, 0xF80E, 0xE065); + r8168_mac_ocp_write(tp, 0xF810, 0xC602); + r8168_mac_ocp_write(tp, 0xF812, 0xBE00); + r8168_mac_ocp_write(tp, 0xF814, 0x0000); + r8168_mac_ocp_write(tp, 0xF816, 0xC502); + r8168_mac_ocp_write(tp, 0xF818, 0xBD00); + r8168_mac_ocp_write(tp, 0xF81A, 0x074C); + r8168_mac_ocp_write(tp, 0xF81C, 0xC302); + r8168_mac_ocp_write(tp, 0xF81E, 0xBB00); + r8168_mac_ocp_write(tp, 0xF820, 0x080A); + r8168_mac_ocp_write(tp, 0xF822, 0x6420); + r8168_mac_ocp_write(tp, 0xF824, 0x48C2); + r8168_mac_ocp_write(tp, 0xF826, 0x8C20); + r8168_mac_ocp_write(tp, 0xF828, 0xC516); + r8168_mac_ocp_write(tp, 0xF82A, 0x64A4); + r8168_mac_ocp_write(tp, 0xF82C, 0x49C0); + r8168_mac_ocp_write(tp, 0xF82E, 0xF009); + r8168_mac_ocp_write(tp, 0xF830, 0x74A2); + r8168_mac_ocp_write(tp, 0xF832, 0x8CA5); + r8168_mac_ocp_write(tp, 0xF834, 0x74A0); + r8168_mac_ocp_write(tp, 0xF836, 0xC50E); + r8168_mac_ocp_write(tp, 0xF838, 0x9CA2); + r8168_mac_ocp_write(tp, 0xF83A, 0x1C11); + r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0); + r8168_mac_ocp_write(tp, 0xF83E, 0xE006); + r8168_mac_ocp_write(tp, 0xF840, 0x74F8); + r8168_mac_ocp_write(tp, 0xF842, 0x48C4); + r8168_mac_ocp_write(tp, 0xF844, 0x8CF8); + r8168_mac_ocp_write(tp, 0xF846, 0xC404); + r8168_mac_ocp_write(tp, 0xF848, 0xBC00); + r8168_mac_ocp_write(tp, 0xF84A, 0xC403); + r8168_mac_ocp_write(tp, 0xF84C, 0xBC00); + r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2); + r8168_mac_ocp_write(tp, 0xF850, 0x0C0A); + r8168_mac_ocp_write(tp, 0xF852, 0xE434); + r8168_mac_ocp_write(tp, 0xF854, 0xD3C0); + r8168_mac_ocp_write(tp, 0xF856, 0x49D9); + r8168_mac_ocp_write(tp, 0xF858, 0xF01F); + r8168_mac_ocp_write(tp, 0xF85A, 0xC526); + r8168_mac_ocp_write(tp, 0xF85C, 0x64A5); + r8168_mac_ocp_write(tp, 0xF85E, 0x1400); + r8168_mac_ocp_write(tp, 0xF860, 0xF007); + r8168_mac_ocp_write(tp, 0xF862, 0x0C01); + r8168_mac_ocp_write(tp, 0xF864, 0x8CA5); + r8168_mac_ocp_write(tp, 0xF866, 0x1C15); + r8168_mac_ocp_write(tp, 0xF868, 0xC51B); + r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0); + r8168_mac_ocp_write(tp, 0xF86C, 0xE013); + r8168_mac_ocp_write(tp, 0xF86E, 0xC519); + r8168_mac_ocp_write(tp, 0xF870, 0x74A0); + r8168_mac_ocp_write(tp, 0xF872, 0x48C4); + r8168_mac_ocp_write(tp, 0xF874, 0x8CA0); + r8168_mac_ocp_write(tp, 0xF876, 0xC516); + r8168_mac_ocp_write(tp, 0xF878, 0x74A4); + r8168_mac_ocp_write(tp, 0xF87A, 0x48C8); + r8168_mac_ocp_write(tp, 0xF87C, 0x48CA); + r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4); + r8168_mac_ocp_write(tp, 0xF880, 0xC512); + r8168_mac_ocp_write(tp, 0xF882, 0x1B00); + r8168_mac_ocp_write(tp, 0xF884, 0x9BA0); + r8168_mac_ocp_write(tp, 0xF886, 0x1B1C); + r8168_mac_ocp_write(tp, 0xF888, 0x483F); + r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2); + r8168_mac_ocp_write(tp, 0xF88C, 0x1B04); + r8168_mac_ocp_write(tp, 0xF88E, 0xC508); + r8168_mac_ocp_write(tp, 0xF890, 0x9BA0); + r8168_mac_ocp_write(tp, 0xF892, 0xC505); + r8168_mac_ocp_write(tp, 0xF894, 0xBD00); + r8168_mac_ocp_write(tp, 0xF896, 0xC502); + r8168_mac_ocp_write(tp, 0xF898, 0xBD00); + r8168_mac_ocp_write(tp, 0xF89A, 0x0300); + r8168_mac_ocp_write(tp, 0xF89C, 0x051E); + r8168_mac_ocp_write(tp, 0xF89E, 0xE434); + r8168_mac_ocp_write(tp, 0xF8A0, 0xE018); + r8168_mac_ocp_write(tp, 0xF8A2, 0xE092); + r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20); + r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0); + r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F); + r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4); + r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3); + r8168_mac_ocp_write(tp, 0xF8AE, 0xF007); + r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0); + r8168_mac_ocp_write(tp, 0xF8B2, 0xF103); + r8168_mac_ocp_write(tp, 0xF8B4, 0xC607); + r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00); + r8168_mac_ocp_write(tp, 0xF8B8, 0xC606); + r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00); + r8168_mac_ocp_write(tp, 0xF8BC, 0xC602); + r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00); + r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C); + r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28); + r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C); + r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00); + r8168_mac_ocp_write(tp, 0xF8C8, 0xC707); + r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00); + r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2); + r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1); + r8168_mac_ocp_write(tp, 0xF8D0, 0xC502); + r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00); + r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA); + r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0); + r8168_mac_ocp_write(tp, 0xF8D8, 0xC502); + r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00); + r8168_mac_ocp_write(tp, 0xF8DC, 0x0132); + + r8168_mac_ocp_write(tp, 0xFC26, 0x8000); + + r8168_mac_ocp_write(tp, 0xFC2A, 0x0743); + r8168_mac_ocp_write(tp, 0xFC2C, 0x0801); + r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9); + r8168_mac_ocp_write(tp, 0xFC30, 0x02FD); + r8168_mac_ocp_write(tp, 0xFC32, 0x0C25); + r8168_mac_ocp_write(tp, 0xFC34, 0x00A9); + r8168_mac_ocp_write(tp, 0xFC36, 0x012D); + rtl_hw_aspm_clkreq_enable(tp, true); }
From: David Howells dhowells@redhat.com
[ Upstream commit e835ada07091f40dcfb1bc735082bd0a7c005e59 ]
If sendmsg() or sendmmsg() is called on a connected socket that hasn't had bind() called on it, then an oops will occur when the kernel tries to connect the call because no local endpoint has been allocated.
Fix this by implicitly binding the socket if it is in the RXRPC_CLIENT_UNBOUND state, just like it does for the RXRPC_UNBOUND state.
Further, the state should be transitioned to RXRPC_CLIENT_BOUND after this to prevent further attempts to bind it.
This can be tested with:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/socket.h> #include <arpa/inet.h> #include <linux/rxrpc.h> static const unsigned char inet6_addr[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0xac, 0x14, 0x14, 0xaa }; int main(void) { struct sockaddr_rxrpc srx; struct cmsghdr *cm; struct msghdr msg; unsigned char control[16]; int fd; memset(&srx, 0, sizeof(srx)); srx.srx_family = 0x21; srx.srx_service = 0; srx.transport_type = AF_INET; srx.transport_len = 0x1c; srx.transport.sin6.sin6_family = AF_INET6; srx.transport.sin6.sin6_port = htons(0x4e22); srx.transport.sin6.sin6_flowinfo = htons(0x4e22); srx.transport.sin6.sin6_scope_id = htons(0xaa3b); memcpy(&srx.transport.sin6.sin6_addr, inet6_addr, 16); cm = (struct cmsghdr *)control; cm->cmsg_len = CMSG_LEN(sizeof(unsigned long)); cm->cmsg_level = SOL_RXRPC; cm->cmsg_type = RXRPC_USER_CALL_ID; *(unsigned long *)CMSG_DATA(cm) = 0; msg.msg_name = NULL; msg.msg_namelen = 0; msg.msg_iov = NULL; msg.msg_iovlen = 0; msg.msg_control = control; msg.msg_controllen = cm->cmsg_len; msg.msg_flags = 0; fd = socket(AF_RXRPC, SOCK_DGRAM, AF_INET); connect(fd, (struct sockaddr *)&srx, sizeof(srx)); sendmsg(fd, &msg, 0); return 0; }
Leading to the following oops:
BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page ... RIP: 0010:rxrpc_connect_call+0x42/0xa01 ... Call Trace: ? mark_held_locks+0x47/0x59 ? __local_bh_enable_ip+0xb6/0xba rxrpc_new_client_call+0x3b1/0x762 ? rxrpc_do_sendmsg+0x3c0/0x92e rxrpc_do_sendmsg+0x3c0/0x92e rxrpc_sendmsg+0x16b/0x1b5 sock_sendmsg+0x2d/0x39 ___sys_sendmsg+0x1a4/0x22a ? release_sock+0x19/0x9e ? reacquire_held_locks+0x136/0x160 ? release_sock+0x19/0x9e ? find_held_lock+0x2b/0x6e ? __lock_acquire+0x268/0xf73 ? rxrpc_connect+0xdd/0xe4 ? __local_bh_enable_ip+0xb6/0xba __sys_sendmsg+0x5e/0x94 do_syscall_64+0x7d/0x1bf entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 2341e0775747 ("rxrpc: Simplify connect() implementation and simplify sendmsg() op") Reported-by: syzbot+7966f2a0b2c7da8939b4@syzkaller.appspotmail.com Signed-off-by: David Howells dhowells@redhat.com Reviewed-by: Marc Dionne marc.dionne@auristor.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/rxrpc/af_rxrpc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -521,6 +521,7 @@ static int rxrpc_sendmsg(struct socket *
switch (rx->sk.sk_state) { case RXRPC_UNBOUND: + case RXRPC_CLIENT_UNBOUND: rx->srx.srx_family = AF_RXRPC; rx->srx.srx_service = 0; rx->srx.transport_type = SOCK_DGRAM; @@ -545,10 +546,9 @@ static int rxrpc_sendmsg(struct socket * }
rx->local = local; - rx->sk.sk_state = RXRPC_CLIENT_UNBOUND; + rx->sk.sk_state = RXRPC_CLIENT_BOUND; /* Fall through */
- case RXRPC_CLIENT_UNBOUND: case RXRPC_CLIENT_BOUND: if (!m->msg_name && test_bit(RXRPC_SOCK_CONNECTED, &rx->flags)) {
From: Marcelo Ricardo Leitner marcelo.leitner@gmail.com
[ Upstream commit 4d1415811e492d9a8238f8a92dd0d51612c788e9 ]
It allocates the extended area for outbound streams only on sendmsg calls, if they are not yet allocated. When using the priority stream scheduler, this initialization may imply into a subsequent allocation, which may fail. In this case, it was aborting the stream scheduler initialization but leaving the ->ext pointer (allocated) in there, thus in a partially initialized state. On a subsequent call to sendmsg, it would notice the ->ext pointer in there, and trip on uninitialized stuff when trying to schedule the data chunk.
The fix is undo the ->ext initialization if the stream scheduler initialization fails and avoid the partially initialized state.
Although syzkaller bisected this to commit 4ff40b86262b ("sctp: set chunk transport correctly when it's a new asoc"), this bug was actually introduced on the commit I marked below.
Reported-by: syzbot+c1a380d42b190ad1e559@syzkaller.appspotmail.com Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Tested-by: Xin Long lucien.xin@gmail.com Signed-off-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Acked-by: Neil Horman nhorman@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sctp/stream.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
--- a/net/sctp/stream.c +++ b/net/sctp/stream.c @@ -168,13 +168,20 @@ out: int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid) { struct sctp_stream_out_ext *soute; + int ret;
soute = kzalloc(sizeof(*soute), GFP_KERNEL); if (!soute) return -ENOMEM; SCTP_SO(stream, sid)->ext = soute;
- return sctp_sched_init_sid(stream, sid, GFP_KERNEL); + ret = sctp_sched_init_sid(stream, sid, GFP_KERNEL); + if (ret) { + kfree(SCTP_SO(stream, sid)->ext); + SCTP_SO(stream, sid)->ext = NULL; + } + + return ret; }
void sctp_stream_free(struct sctp_stream *stream)
From: Xin Long lucien.xin@gmail.com
[ Upstream commit 9b6c08878e23adb7cc84bdca94d8a944b03f099e ]
Now when sctp_connect() is called with a wrong sa_family, it binds to a port but doesn't set bp->port, then sctp_get_af_specific will return NULL and sctp_connect() returns -EINVAL.
Then if sctp_bind() is called to bind to another port, the last port it has bound will leak due to bp->port is NULL by then.
sctp_connect() doesn't need to bind ports, as later __sctp_connect will do it if bp->port is NULL. So remove it from sctp_connect(). While at it, remove the unnecessary sockaddr.sa_family len check as it's already done in sctp_inet_connect.
Fixes: 644fbdeacf1d ("sctp: fix the issue that flags are ignored when using kernel_connect") Reported-by: syzbot+079bf326b38072f849d9@syzkaller.appspotmail.com Signed-off-by: Xin Long lucien.xin@gmail.com Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/sctp/socket.c | 24 +++--------------------- 1 file changed, 3 insertions(+), 21 deletions(-)
--- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -4828,35 +4828,17 @@ out_nounlock: static int sctp_connect(struct sock *sk, struct sockaddr *addr, int addr_len, int flags) { - struct inet_sock *inet = inet_sk(sk); struct sctp_af *af; - int err = 0; + int err = -EINVAL;
lock_sock(sk); - pr_debug("%s: sk:%p, sockaddr:%p, addr_len:%d\n", __func__, sk, addr, addr_len);
- /* We may need to bind the socket. */ - if (!inet->inet_num) { - if (sk->sk_prot->get_port(sk, 0)) { - release_sock(sk); - return -EAGAIN; - } - inet->inet_sport = htons(inet->inet_num); - } - /* Validate addr_len before calling common connect/connectx routine. */ - af = addr_len < offsetofend(struct sockaddr, sa_family) ? NULL : - sctp_get_af_specific(addr->sa_family); - if (!af || addr_len < af->sockaddr_len) { - err = -EINVAL; - } else { - /* Pass correct addr len to common routine (so it knows there - * is only one address being passed. - */ + af = sctp_get_af_specific(addr->sa_family); + if (af && addr_len >= af->sockaddr_len) err = __sctp_connect(sk, addr, af->sockaddr_len, flags, NULL); - }
release_sock(sk); return err;
From: Takashi Iwai tiwai@suse.de
[ Upstream commit a261e3797506bd561700be643fe1a85bf81e9661 ]
The onboard sky2 NIC on ASUS P6T WS PRO doesn't work after PM resume due to the infamous IRQ problem. Disabling MSI works around it, so let's add it to the blacklist.
Unfortunately the BIOS on the machine doesn't fill the standard DMI_SYS_* entry, so we pick up DMI_BOARD_* entries instead.
BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1142496 Reported-and-tested-by: Marcus Seyfarth m.seyfarth@gmail.com Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/marvell/sky2.c | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -4933,6 +4933,13 @@ static const struct dmi_system_id msi_bl DMI_MATCH(DMI_PRODUCT_NAME, "P-79"), }, }, + { + .ident = "ASUS P6T", + .matches = { + DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."), + DMI_MATCH(DMI_BOARD_NAME, "P6T"), + }, + }, {} };
From: Eric Dumazet edumazet@google.com
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]
Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented.
We should allow these flows to make progress.
This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit.
It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15
Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like :
static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk);
return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); }
Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: Andrew Prout aprout@ll.mit.edu Tested-by: Andrew Prout aprout@ll.mit.edu Tested-by: Jonathan Lemon jonathan.lemon@gmail.com Tested-by: Michal Kubecek mkubecek@suse.cz Acked-by: Neal Cardwell ncardwell@google.com Acked-by: Yuchung Cheng ycheng@google.com Acked-by: Christoph Paasch cpaasch@apple.com Cc: Jonathan Looney jtl@netflix.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/tcp.h | 5 +++++ net/ipv4/tcp_output.c | 13 +++++++++++-- 2 files changed, 16 insertions(+), 2 deletions(-)
--- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1679,6 +1679,11 @@ static inline struct sk_buff *tcp_rtx_qu return skb_rb_first(&sk->tcp_rtx_queue); }
+static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) +{ + return skb_rb_last(&sk->tcp_rtx_queue); +} + static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk) { return skb_peek(&sk->sk_write_queue); --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1289,6 +1289,7 @@ int tcp_fragment(struct sock *sk, enum t struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *buff; int nsize, old_factor; + long limit; int nlen; u8 flags;
@@ -1299,8 +1300,16 @@ int tcp_fragment(struct sock *sk, enum t if (nsize < 0) nsize = 0;
- if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf && - tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) { + /* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb. + * We need some allowance to not penalize applications setting small + * SO_SNDBUF values. + * Also allow first and last skb in retransmit queue to be split. + */ + limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE); + if (unlikely((sk->sk_wmem_queued >> 1) > limit && + tcp_queue != TCP_FRAG_IN_WRITE_QUEUE && + skb != tcp_rtx_queue_head(sk) && + skb != tcp_rtx_queue_tail(sk))) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); return -ENOMEM; }
From: Eric Dumazet edumazet@google.com
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]
Neal reported incorrect use of ns_capable() from bpf hook.
bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1)
Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context.
As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site.
The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context.
Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Lawrence Brakmo brakmo@fb.com Reported-by: Neal Cardwell ncardwell@google.com Acked-by: Neal Cardwell ncardwell@google.com Acked-by: Lawrence Brakmo brakmo@fb.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/tcp.h | 3 ++- net/core/filter.c | 2 +- net/ipv4/tcp.c | 4 +++- net/ipv4/tcp_cong.c | 6 +++--- 4 files changed, 9 insertions(+), 6 deletions(-)
--- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1067,7 +1067,8 @@ void tcp_get_default_congestion_control( void tcp_get_available_congestion_control(char *buf, size_t len); void tcp_get_allowed_congestion_control(char *buf, size_t len); int tcp_set_allowed_congestion_control(char *allowed); -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit); +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, + bool reinit, bool cap_net_admin); u32 tcp_slow_start(struct tcp_sock *tp, u32 acked); void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
--- a/net/core/filter.c +++ b/net/core/filter.c @@ -4211,7 +4211,7 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_so TCP_CA_NAME_MAX-1)); name[TCP_CA_NAME_MAX-1] = 0; ret = tcp_set_congestion_control(sk, name, false, - reinit); + reinit, true); } else { struct tcp_sock *tp = tcp_sk(sk);
--- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2784,7 +2784,9 @@ static int do_tcp_setsockopt(struct sock name[val] = 0;
lock_sock(sk); - err = tcp_set_congestion_control(sk, name, true, true); + err = tcp_set_congestion_control(sk, name, true, true, + ns_capable(sock_net(sk)->user_ns, + CAP_NET_ADMIN)); release_sock(sk); return err; } --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -332,7 +332,8 @@ out: * tcp_reinit_congestion_control (if the current congestion control was * already initialized. */ -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit) +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, + bool reinit, bool cap_net_admin) { struct inet_connection_sock *icsk = inet_csk(sk); const struct tcp_congestion_ops *ca; @@ -368,8 +369,7 @@ int tcp_set_congestion_control(struct so } else { err = -EBUSY; } - } else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || - ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) { + } else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) { err = -EPERM; } else if (!try_module_get(ca->owner)) { err = -EBUSY;
From: Christoph Paasch cpaasch@apple.com
[ Upstream commit e858faf556d4e14c750ba1e8852783c6f9520a0e ]
If an app is playing tricks to reuse a socket via tcp_disconnect(), bytes_acked/received needs to be reset to 0. Otherwise tcp_info will report the sum of the current and the old connection..
Cc: Eric Dumazet edumazet@google.com Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: bdd1f9edacb5 ("tcp: add tcpi_bytes_received to tcp_info") Signed-off-by: Christoph Paasch cpaasch@apple.com Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/ipv4/tcp.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2630,6 +2630,8 @@ int tcp_disconnect(struct sock *sk, int tcp_saved_syn_free(tp); tp->compressed_ack = 0; tp->bytes_sent = 0; + tp->bytes_acked = 0; + tp->bytes_received = 0; tp->bytes_retrans = 0; tp->duplicate_sack[0].start_seq = 0; tp->duplicate_sack[0].end_seq = 0;
From: Peter Kosyh p.kosyh@gmail.com
[ Upstream commit 107e47cc80ec37cb332bd41b22b1c7779e22e018 ]
vrf_process_v4_outbound() and vrf_process_v6_outbound() do routing using ip/ipv6 addresses, but don't make sure the header is available in skb->data[] (skb_headlen() is less then header size).
Case:
1) igb driver from intel. 2) Packet size is greater then 255. 3) MPLS forwards to VRF device.
So, patch adds pskb_may_pull() calls in vrf_process_v4/v6_outbound() functions.
Signed-off-by: Peter Kosyh p.kosyh@gmail.com Reviewed-by: David Ahern dsa@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/vrf.c | 58 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 35 insertions(+), 23 deletions(-)
--- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -169,23 +169,29 @@ static int vrf_ip6_local_out(struct net static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, struct net_device *dev) { - const struct ipv6hdr *iph = ipv6_hdr(skb); + const struct ipv6hdr *iph; struct net *net = dev_net(skb->dev); - struct flowi6 fl6 = { - /* needed to match OIF rule */ - .flowi6_oif = dev->ifindex, - .flowi6_iif = LOOPBACK_IFINDEX, - .daddr = iph->daddr, - .saddr = iph->saddr, - .flowlabel = ip6_flowinfo(iph), - .flowi6_mark = skb->mark, - .flowi6_proto = iph->nexthdr, - .flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF, - }; + struct flowi6 fl6; int ret = NET_XMIT_DROP; struct dst_entry *dst; struct dst_entry *dst_null = &net->ipv6.ip6_null_entry->dst;
+ if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr))) + goto err; + + iph = ipv6_hdr(skb); + + memset(&fl6, 0, sizeof(fl6)); + /* needed to match OIF rule */ + fl6.flowi6_oif = dev->ifindex; + fl6.flowi6_iif = LOOPBACK_IFINDEX; + fl6.daddr = iph->daddr; + fl6.saddr = iph->saddr; + fl6.flowlabel = ip6_flowinfo(iph); + fl6.flowi6_mark = skb->mark; + fl6.flowi6_proto = iph->nexthdr; + fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF; + dst = ip6_route_output(net, NULL, &fl6); if (dst == dst_null) goto err; @@ -241,21 +247,27 @@ static int vrf_ip_local_out(struct net * static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb, struct net_device *vrf_dev) { - struct iphdr *ip4h = ip_hdr(skb); + struct iphdr *ip4h; int ret = NET_XMIT_DROP; - struct flowi4 fl4 = { - /* needed to match OIF rule */ - .flowi4_oif = vrf_dev->ifindex, - .flowi4_iif = LOOPBACK_IFINDEX, - .flowi4_tos = RT_TOS(ip4h->tos), - .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_SKIP_NH_OIF, - .flowi4_proto = ip4h->protocol, - .daddr = ip4h->daddr, - .saddr = ip4h->saddr, - }; + struct flowi4 fl4; struct net *net = dev_net(vrf_dev); struct rtable *rt;
+ if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr))) + goto err; + + ip4h = ip_hdr(skb); + + memset(&fl4, 0, sizeof(fl4)); + /* needed to match OIF rule */ + fl4.flowi4_oif = vrf_dev->ifindex; + fl4.flowi4_iif = LOOPBACK_IFINDEX; + fl4.flowi4_tos = RT_TOS(ip4h->tos); + fl4.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_SKIP_NH_OIF; + fl4.flowi4_proto = ip4h->protocol; + fl4.daddr = ip4h->daddr; + fl4.saddr = ip4h->saddr; + rt = ip_route_output_flow(net, &fl4, NULL); if (IS_ERR(rt)) goto err;
From: Aya Levin ayal@mellanox.com
[ Upstream commit ef1ce7d7b67b46661091c7ccc0396186b7a247ef ]
Check return value from mlx5e_attach_netdev, add error path on failure.
Fixes: 48935bbb7ae8 ("net/mlx5e: IPoIB, Add netdevice profile skeleton") Signed-off-by: Aya Levin ayal@mellanox.com Reviewed-by: Feras Daoud ferasda@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c @@ -698,7 +698,9 @@ static int mlx5_rdma_setup_rn(struct ib_
prof->init(mdev, netdev, prof, ipriv);
- mlx5e_attach_netdev(epriv); + err = mlx5e_attach_netdev(epriv); + if (err) + goto detach; netif_carrier_off(netdev);
/* set rdma_netdev func pointers */ @@ -714,6 +716,11 @@ static int mlx5_rdma_setup_rn(struct ib_
return 0;
+detach: + prof->cleanup(epriv); + if (ipriv->sub_interface) + return err; + mlx5e_destroy_mdev_resources(mdev); destroy_ht: mlx5i_pkey_qpn_ht_cleanup(netdev); return err;
From: Nikolay Aleksandrov nikolay@cumulusnetworks.com
[ Upstream commit e57f61858b7cf478ed6fa23ed4b3876b1c9625c4 ]
We take a pointer to grec prior to calling pskb_may_pull and use it afterwards to get nsrcs so record nsrcs before the pull when handling igmp3 and we get a pointer to nsrcs and call pskb_may_pull when handling mld2 which again could lead to reading 2 bytes out-of-bounds.
================================================================== BUG: KASAN: use-after-free in br_multicast_rcv+0x480c/0x4ad0 [bridge] Read of size 2 at addr ffff8880421302b4 by task ksoftirqd/1/16
CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G OE 5.2.0-rc6+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x71/0xab print_address_description+0x6a/0x280 ? br_multicast_rcv+0x480c/0x4ad0 [bridge] __kasan_report+0x152/0x1aa ? br_multicast_rcv+0x480c/0x4ad0 [bridge] ? br_multicast_rcv+0x480c/0x4ad0 [bridge] kasan_report+0xe/0x20 br_multicast_rcv+0x480c/0x4ad0 [bridge] ? br_multicast_disable_port+0x150/0x150 [bridge] ? ktime_get_with_offset+0xb4/0x150 ? __kasan_kmalloc.constprop.6+0xa6/0xf0 ? __netif_receive_skb+0x1b0/0x1b0 ? br_fdb_update+0x10e/0x6e0 [bridge] ? br_handle_frame_finish+0x3c6/0x11d0 [bridge] br_handle_frame_finish+0x3c6/0x11d0 [bridge] ? br_pass_frame_up+0x3a0/0x3a0 [bridge] ? virtnet_probe+0x1c80/0x1c80 [virtio_net] br_handle_frame+0x731/0xd90 [bridge] ? select_idle_sibling+0x25/0x7d0 ? br_handle_frame_finish+0x11d0/0x11d0 [bridge] __netif_receive_skb_core+0xced/0x2d70 ? virtqueue_get_buf_ctx+0x230/0x1130 [virtio_ring] ? do_xdp_generic+0x20/0x20 ? virtqueue_napi_complete+0x39/0x70 [virtio_net] ? virtnet_poll+0x94d/0xc78 [virtio_net] ? receive_buf+0x5120/0x5120 [virtio_net] ? __netif_receive_skb_one_core+0x97/0x1d0 __netif_receive_skb_one_core+0x97/0x1d0 ? __netif_receive_skb_core+0x2d70/0x2d70 ? _raw_write_trylock+0x100/0x100 ? __queue_work+0x41e/0xbe0 process_backlog+0x19c/0x650 ? _raw_read_lock_irq+0x40/0x40 net_rx_action+0x71e/0xbc0 ? __switch_to_asm+0x40/0x70 ? napi_complete_done+0x360/0x360 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 ? __schedule+0x85e/0x14d0 __do_softirq+0x1db/0x5f9 ? takeover_tasklets+0x5f0/0x5f0 run_ksoftirqd+0x26/0x40 smpboot_thread_fn+0x443/0x680 ? sort_range+0x20/0x20 ? schedule+0x94/0x210 ? __kthread_parkme+0x78/0xf0 ? sort_range+0x20/0x20 kthread+0x2ae/0x3a0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x35/0x40
The buggy address belongs to the page: page:ffffea0001084c00 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 flags: 0xffffc000000000() raw: 00ffffc000000000 ffffea0000cfca08 ffffea0001098608 0000000000000000 raw: 0000000000000000 0000000000000003 00000000ffffff7f 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff888042130180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888042130200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888042130280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^ ffff888042130300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888042130380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint
Fixes: bc8c20acaea1 ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave") Reported-by: Martin Weinelt martin@linuxlounge.net Signed-off-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Tested-by: Martin Weinelt martin@linuxlounge.net Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bridge/br_multicast.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-)
--- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -934,6 +934,7 @@ static int br_ip4_multicast_igmp3_report int type; int err = 0; __be32 group; + u16 nsrcs;
ih = igmpv3_report_hdr(skb); num = ntohs(ih->ngrec); @@ -947,8 +948,9 @@ static int br_ip4_multicast_igmp3_report grec = (void *)(skb->data + len - sizeof(*grec)); group = grec->grec_mca; type = grec->grec_type; + nsrcs = ntohs(grec->grec_nsrcs);
- len += ntohs(grec->grec_nsrcs) * 4; + len += nsrcs * 4; if (!ip_mc_may_pull(skb, len)) return -EINVAL;
@@ -969,7 +971,7 @@ static int br_ip4_multicast_igmp3_report src = eth_hdr(skb)->h_source; if ((type == IGMPV3_CHANGE_TO_INCLUDE || type == IGMPV3_MODE_IS_INCLUDE) && - ntohs(grec->grec_nsrcs) == 0) { + nsrcs == 0) { br_ip4_multicast_leave_group(br, port, group, vid, src); } else { err = br_ip4_multicast_add_group(br, port, group, vid, @@ -1006,7 +1008,8 @@ static int br_ip6_multicast_mld2_report( len = skb_transport_offset(skb) + sizeof(*icmp6h);
for (i = 0; i < num; i++) { - __be16 *nsrcs, _nsrcs; + __be16 *_nsrcs, __nsrcs; + u16 nsrcs;
nsrcs_offset = len + offsetof(struct mld2_grec, grec_nsrcs);
@@ -1014,12 +1017,13 @@ static int br_ip6_multicast_mld2_report( nsrcs_offset + sizeof(_nsrcs)) return -EINVAL;
- nsrcs = skb_header_pointer(skb, nsrcs_offset, - sizeof(_nsrcs), &_nsrcs); - if (!nsrcs) + _nsrcs = skb_header_pointer(skb, nsrcs_offset, + sizeof(__nsrcs), &__nsrcs); + if (!_nsrcs) return -EINVAL;
- grec_len = struct_size(grec, grec_src, ntohs(*nsrcs)); + nsrcs = ntohs(*_nsrcs); + grec_len = struct_size(grec, grec_src, nsrcs);
if (!ipv6_mc_may_pull(skb, len + grec_len)) return -EINVAL; @@ -1044,7 +1048,7 @@ static int br_ip6_multicast_mld2_report( src = eth_hdr(skb)->h_source; if ((grec->grec_type == MLD2_CHANGE_TO_INCLUDE || grec->grec_type == MLD2_MODE_IS_INCLUDE) && - ntohs(*nsrcs) == 0) { + nsrcs == 0) { br_ip6_multicast_leave_group(br, port, &grec->grec_mca, vid, src); } else {
From: Nikolay Aleksandrov nikolay@cumulusnetworks.com
[ Upstream commit 3b26a5d03d35d8f732d75951218983c0f7f68dff ]
We get a pointer to the ipv6 hdr in br_ip6_multicast_query but we may call pskb_may_pull afterwards and end up using a stale pointer. So use the header directly, it's just 1 place where it's needed.
Fixes: 08b202b67264 ("bridge br_multicast: IPv6 MLD support.") Signed-off-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Tested-by: Martin Weinelt martin@linuxlounge.net Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bridge/br_multicast.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -1302,7 +1302,6 @@ static int br_ip6_multicast_query(struct u16 vid) { unsigned int transport_len = ipv6_transport_len(skb); - const struct ipv6hdr *ip6h = ipv6_hdr(skb); struct mld_msg *mld; struct net_bridge_mdb_entry *mp; struct mld2_query *mld2q; @@ -1346,7 +1345,7 @@ static int br_ip6_multicast_query(struct
if (is_general_query) { saddr.proto = htons(ETH_P_IPV6); - saddr.u.ip6 = ip6h->saddr; + saddr.u.ip6 = ipv6_hdr(skb)->saddr;
br_multicast_query_received(br, port, &br->ip6_other_query, &saddr, max_delay);
From: Nikolay Aleksandrov nikolay@cumulusnetworks.com
[ Upstream commit 3d26eb8ad1e9b906433903ce05f775cf038e747f ]
We would cache ether dst pointer on input in br_handle_frame_finish but after the neigh suppress code that could lead to a stale pointer since both ipv4 and ipv6 suppress code do pskb_may_pull. This means we have to always reload it after the suppress code so there's no point in having it cached just retrieve it directly.
Fixes: 057658cb33fbf ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports") Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports") Signed-off-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bridge/br_input.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
--- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -79,7 +79,6 @@ int br_handle_frame_finish(struct net *n struct net_bridge_fdb_entry *dst = NULL; struct net_bridge_mdb_entry *mdst; bool local_rcv, mcast_hit = false; - const unsigned char *dest; struct net_bridge *br; u16 vid = 0;
@@ -97,10 +96,9 @@ int br_handle_frame_finish(struct net *n br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, false);
local_rcv = !!(br->dev->flags & IFF_PROMISC); - dest = eth_hdr(skb)->h_dest; - if (is_multicast_ether_addr(dest)) { + if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) { /* by definition the broadcast is also a multicast address */ - if (is_broadcast_ether_addr(dest)) { + if (is_broadcast_ether_addr(eth_hdr(skb)->h_dest)) { pkt_type = BR_PKT_BROADCAST; local_rcv = true; } else { @@ -150,7 +148,7 @@ int br_handle_frame_finish(struct net *n } break; case BR_PKT_UNICAST: - dst = br_fdb_find_rcu(br, dest, vid); + dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, vid); default: break; }
From: Nikolay Aleksandrov nikolay@cumulusnetworks.com
[ Upstream commit 2446a68ae6a8cee6d480e2f5b52f5007c7c41312 ]
Don't cache eth dest pointer before calling pskb_may_pull.
Fixes: cf0f02d04a83 ("[BRIDGE]: use llc for receiving STP packets") Signed-off-by: Nikolay Aleksandrov nikolay@cumulusnetworks.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bridge/br_stp_bpdu.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/net/bridge/br_stp_bpdu.c +++ b/net/bridge/br_stp_bpdu.c @@ -147,7 +147,6 @@ void br_send_tcn_bpdu(struct net_bridge_ void br_stp_rcv(const struct stp_proto *proto, struct sk_buff *skb, struct net_device *dev) { - const unsigned char *dest = eth_hdr(skb)->h_dest; struct net_bridge_port *p; struct net_bridge *br; const unsigned char *buf; @@ -176,7 +175,7 @@ void br_stp_rcv(const struct stp_proto * if (p->state == BR_STATE_DISABLED) goto out;
- if (!ether_addr_equal(dest, br->group_addr)) + if (!ether_addr_equal(eth_hdr(skb)->h_dest, br->group_addr)) goto out;
if (p->flags & BR_BPDU_GUARD) {
From: Andreas Steinmetz ast@domdv.de
[ Upstream commit 095c02da80a41cf6d311c504d8955d6d1c2add10 ]
Fix use-after-free of skb when rx_handler returns RX_HANDLER_PASS.
Signed-off-by: Andreas Steinmetz ast@domdv.de Acked-by: Willem de Bruijn willemb@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/macsec.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
--- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -1103,10 +1103,9 @@ static rx_handler_result_t macsec_handle }
skb = skb_unshare(skb, GFP_ATOMIC); - if (!skb) { - *pskb = NULL; + *pskb = skb; + if (!skb) return RX_HANDLER_CONSUMED; - }
pulled_sci = pskb_may_pull(skb, macsec_extra_len(true)); if (!pulled_sci) {
From: Andreas Steinmetz ast@domdv.de
[ Upstream commit 7d8b16b9facb0dd81d1469808dd9a575fa1d525a ]
Fix checksumming after decryption.
Signed-off-by: Andreas Steinmetz ast@domdv.de Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/macsec.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -869,6 +869,7 @@ static void macsec_reset_skb(struct sk_b
static void macsec_finalize_skb(struct sk_buff *skb, u8 icv_len, u8 hdr_len) { + skb->ip_summed = CHECKSUM_NONE; memmove(skb->data + hdr_len, skb->data, 2 * ETH_ALEN); skb_pull(skb, hdr_len); pskb_trim_unique(skb, skb->len - icv_len);
From: Cong Wang xiyou.wangcong@gmail.com
[ Upstream commit c8c8218ec5af5d2598381883acbefbf604e56b5e ]
When the skb is associated with a new sock, just assigning it to skb->sk is not sufficient, we have to set its destructor to free the sock properly too.
Reported-by: syzbot+d6636a36d3c34bd88938@syzkaller.appspotmail.com Signed-off-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netrom/af_netrom.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/netrom/af_netrom.c +++ b/net/netrom/af_netrom.c @@ -872,7 +872,7 @@ int nr_rx_frame(struct sk_buff *skb, str unsigned short frametype, flags, window, timeout; int ret;
- skb->sk = NULL; /* Initially we don't know who it's for */ + skb_orphan(skb);
/* * skb->data points to the netrom frame start @@ -971,6 +971,7 @@ int nr_rx_frame(struct sk_buff *skb, str window = skb->data[20];
skb->sk = make; + skb->destructor = sock_efree; make->sk_state = TCP_ESTABLISHED;
/* Fill in his circuit details */
From: Cong Wang xiyou.wangcong@gmail.com
[ Upstream commit 4638faac032756f7eab5524be7be56bee77e426b ]
sock_efree() releases the sock refcnt, if we don't hold this refcnt when setting skb->destructor to it, the refcnt would not be balanced. This leads to several bug reports from syzbot.
I have checked other users of sock_efree(), all of them hold the sock refcnt.
Fixes: c8c8218ec5af ("netrom: fix a memory leak in nr_rx_frame()") Reported-and-tested-by: syzbot+622bdabb128acc33427d@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+6eaef7158b19e3fec3a0@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+9399c158fcc09b21d0d2@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+a34e5f3d0300163f0c87@syzkaller.appspotmail.com Cc: Ralf Baechle ralf@linux-mips.org Signed-off-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netrom/af_netrom.c | 1 + 1 file changed, 1 insertion(+)
--- a/net/netrom/af_netrom.c +++ b/net/netrom/af_netrom.c @@ -970,6 +970,7 @@ int nr_rx_frame(struct sk_buff *skb, str
window = skb->data[20];
+ sock_hold(make); skb->sk = make; skb->destructor = sock_efree; make->sk_state = TCP_ESTABLISHED;
From: Frank de Brabander debrabander@gmail.com
[ Upstream commit cecaa76b2919aac2aa584ce476e9fcd5b084add5 ]
If mmap() fails it returns MAP_FAILED, which is defined as ((void *) -1). The current if-statement incorrectly tests if *ring is NULL.
Fixes: 358be656406d ("selftests/net: add txring_overwrite") Signed-off-by: Frank de Brabander debrabander@gmail.com Acked-by: Willem de Bruijn willemb@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- tools/testing/selftests/net/txring_overwrite.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/tools/testing/selftests/net/txring_overwrite.c +++ b/tools/testing/selftests/net/txring_overwrite.c @@ -113,7 +113,7 @@ static int setup_tx(char **ring)
*ring = mmap(0, req.tp_block_size * req.tp_block_nr, PROT_READ | PROT_WRITE, MAP_SHARED, fdt, 0); - if (!*ring) + if (*ring == MAP_FAILED) error(1, errno, "mmap");
return fdt;
From: Jakub Kicinski jakub.kicinski@netronome.com
[ Upstream commit 13aecb17acabc2a92187d08f7ca93bb8aad62c6f ]
David reports that RPC applications which use epoll() occasionally get stuck, and that TLS ULP causes the kernel to not wake applications, even though read() will return data.
This is indeed true. The ctx->rx_list which holds partially copied records is not consulted when deciding whether socket is readable.
Note that SO_RCVLOWAT with epoll() is and has always been broken for kernel TLS. We'd need to parse all records from the TCP layer, instead of just the first one.
Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records") Reported-by: David Beckett david.beckett@netronome.com Signed-off-by: Jakub Kicinski jakub.kicinski@netronome.com Reviewed-by: Dirk van der Merwe dirk.vandermerwe@netronome.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_sw.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1931,7 +1931,8 @@ bool tls_sw_stream_read(const struct soc ingress_empty = list_empty(&psock->ingress_msg); rcu_read_unlock();
- return !ingress_empty || ctx->recv_pkt; + return !ingress_empty || ctx->recv_pkt || + !skb_queue_empty(&ctx->rx_list); }
static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
From: Jakub Kicinski jakub.kicinski@netronome.com
[ Upstream commit 618bac45937a3dc6126ac0652747481e97000f99 ]
Neither drivers nor the tls offload code currently supports TLS version 1.3. Check the TLS version when installing connection state. TLS 1.3 will just fallback to the kernel crypto for now.
Fixes: 130b392c6cd6 ("net: tls: Add tls 1.3 support") Signed-off-by: Jakub Kicinski jakub.kicinski@netronome.com Reviewed-by: Dirk van der Merwe dirk.vandermerwe@netronome.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/tls/tls_device.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -746,6 +746,11 @@ int tls_set_device_offload(struct sock * }
crypto_info = &ctx->crypto_send.info; + if (crypto_info->version != TLS_1_2_VERSION) { + rc = -EOPNOTSUPP; + goto free_offload_ctx; + } + switch (crypto_info->cipher_type) { case TLS_CIPHER_AES_GCM_128: nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE; @@ -880,6 +885,9 @@ int tls_set_device_offload_rx(struct soc struct net_device *netdev; int rc = 0;
+ if (ctx->crypto_recv.info.version != TLS_1_2_VERSION) + return -EOPNOTSUPP; + /* We support starting offload on multiple sockets * concurrently, so we only need a read lock here. * This lock must precede get_netdev_for_sock to prevent races between
From: Eli Britstein elibr@mellanox.com
[ Upstream commit 914adbb1bcf89478ac138318d28b302704564d59 ]
GRE entropy calculation is a single bit per card, and not per port. Force disable GRE entropy calculation upon the first GRE encap rule, and release the force at the last GRE encap rule removal. This is done per port.
Fixes: 97417f6182f8 ("net/mlx5e: Fix GRE key by controlling port tunnel entropy calculation") Signed-off-by: Eli Britstein elibr@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c | 23 ++--------------- 1 file changed, 4 insertions(+), 19 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c @@ -100,27 +100,12 @@ static int mlx5_set_entropy(struct mlx5_ */ if (entropy_flags.gre_calc_supported && reformat_type == MLX5_REFORMAT_TYPE_L2_TO_NVGRE) { - /* Other applications may change the global FW entropy - * calculations settings. Check that the current entropy value - * is the negative of the updated value. - */ - if (entropy_flags.force_enabled && - enable == entropy_flags.gre_calc_enabled) { - mlx5_core_warn(tun_entropy->mdev, - "Unexpected GRE entropy calc setting - expected %d", - !entropy_flags.gre_calc_enabled); - return -EOPNOTSUPP; - } - err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev, enable, - entropy_flags.force_supported); + if (!entropy_flags.force_supported) + return 0; + err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev, + enable, !enable); if (err) return err; - /* if we turn on the entropy we don't need to force it anymore */ - if (entropy_flags.force_supported && enable) { - err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev, 1, 0); - if (err) - return err; - } } else if (entropy_flags.calc_supported) { /* Other applications may change the global FW entropy * calculations settings. Check that the current entropy value
From: Saeed Mahameed saeedm@mellanox.com
[ Upstream commit db849faa9bef993a1379dc510623f750a72fa7ce ]
CQE checksum full mode in new HW, provides a full checksum of rx frame. Covering bytes starting from eth protocol up to last byte in the received frame (frame_size - ETH_HLEN), as expected by the stack.
Fixing up skb->csum by the driver is not required in such case. This fix is to avoid wrong checksum calculation in drivers which already support the new hardware with the new checksum mode.
Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++ drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 ++++++- include/linux/mlx5/mlx5_ifc.h | 3 ++- 4 files changed, 12 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -294,6 +294,7 @@ enum { MLX5E_RQ_STATE_ENABLED, MLX5E_RQ_STATE_AM, MLX5E_RQ_STATE_NO_CSUM_COMPLETE, + MLX5E_RQ_STATE_CSUM_FULL, /* cqe_csum_full hw bit is set */ };
struct mlx5e_cq { --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -948,6 +948,9 @@ static int mlx5e_open_rq(struct mlx5e_ch if (err) goto err_destroy_rq;
+ if (MLX5_CAP_ETH(c->mdev, cqe_checksum_full)) + __set_bit(MLX5E_RQ_STATE_CSUM_FULL, &c->rq.state); + if (params->rx_dim_enabled) __set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -829,8 +829,14 @@ static inline void mlx5e_handle_csum(str if (unlikely(get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP)) goto csum_unnecessary;
+ stats->csum_complete++; skb->ip_summed = CHECKSUM_COMPLETE; skb->csum = csum_unfold((__force __sum16)cqe->check_sum); + + if (test_bit(MLX5E_RQ_STATE_CSUM_FULL, &rq->state)) + return; /* CQE csum covers all received bytes */ + + /* csum might need some fixups ...*/ if (network_depth > ETH_HLEN) /* CQE csum is calculated from the IP header and does * not cover VLAN headers (if present). This will add @@ -841,7 +847,6 @@ static inline void mlx5e_handle_csum(str skb->csum);
mlx5e_skb_padding_csum(skb, network_depth, proto, stats); - stats->csum_complete++; return; }
--- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -716,7 +716,8 @@ struct mlx5_ifc_per_protocol_networking_ u8 swp[0x1]; u8 swp_csum[0x1]; u8 swp_lso[0x1]; - u8 reserved_at_23[0xd]; + u8 cqe_checksum_full[0x1]; + u8 reserved_at_24[0xc]; u8 max_vxlan_udp_ports[0x8]; u8 reserved_at_38[0x6]; u8 max_geneve_opt_len[0x1];
From: Aya Levin ayal@mellanox.com
[ Upstream commit 39825350ae2a52f8513741b36e42118bd80dd689 ]
Fix timeout recover function to return a meaningful return value. When an interrupt was not sent by the FW, return IO error instead of 'true'.
Fixes: c7981bea48fb ("net/mlx5e: Fix return status of TX reporter timeout recover") Signed-off-by: Aya Levin ayal@mellanox.com Acked-by: Jiri Pirko jiri@mellanox.com Reviewed-by: Tariq Toukan tariqt@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c @@ -142,22 +142,20 @@ static int mlx5e_tx_reporter_timeout_rec { struct mlx5_eq_comp *eq = sq->cq.mcq.eq; u32 eqe_count; - int ret;
netdev_err(sq->channel->netdev, "EQ 0x%x: Cons = 0x%x, irqn = 0x%x\n", eq->core.eqn, eq->core.cons_index, eq->core.irqn);
eqe_count = mlx5_eq_poll_irq_disabled(eq); - ret = eqe_count ? false : true; if (!eqe_count) { clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state); - return ret; + return -EIO; }
netdev_err(sq->channel->netdev, "Recover %d eqes on EQ 0x%x\n", eqe_count, eq->core.eqn); sq->channel->stats->eq_rearm++; - return ret; + return 0; }
int mlx5e_tx_reporter_timeout(struct mlx5e_txqsq *sq)
From: Aya Levin ayal@mellanox.com
[ Upstream commit 99d31cbd8953c6929da978bf049ab0f0b4e503d9 ]
Fix tx reporter's diagnose callback. Propagate error when failing to gather diagnostics information or failing to print diagnostic data per queue.
Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support") Signed-off-by: Aya Levin ayal@mellanox.com Reviewed-by: Tariq Toukan tariqt@mellanox.com Acked-by: Jiri Pirko jiri@mellanox.com Signed-off-by: Saeed Mahameed saeedm@mellanox.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c @@ -262,13 +262,13 @@ static int mlx5e_tx_reporter_diagnose(st
err = mlx5_core_query_sq_state(priv->mdev, sq->sqn, &state); if (err) - break; + goto unlock;
err = mlx5e_tx_reporter_build_diagnose_output(fmsg, sq->sqn, state, netif_xmit_stopped(sq->txq)); if (err) - break; + goto unlock; } err = devlink_fmsg_arr_pair_nest_end(fmsg); if (err)
From: Jérôme Glisse jglisse@redhat.com
commit 5e383a9798990c69fc759a4930de224bb497e62c upstream.
The debugfs take reference on fence without dropping them.
Signed-off-by: Jérôme Glisse jglisse@redhat.com Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Sumit Semwal sumit.semwal@linaro.org Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: Stéphane Marchesin marcheu@chromium.org Cc: stable@vger.kernel.org Reviewed-by: Christian König christian.koenig@amd.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org Link: https://patchwork.freedesktop.org/patch/msgid/20181206161840.6578-1-jglisse@... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/dma-buf/dma-buf.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1068,6 +1068,7 @@ static int dma_buf_debug_show(struct seq fence->ops->get_driver_name(fence), fence->ops->get_timeline_name(fence), dma_fence_is_signaled(fence) ? "" : "un"); + dma_fence_put(fence); } rcu_read_unlock();
From: Chris Wilson chris@chris-wilson.co.uk
commit f5b07b04e5f090a85d1e96938520f2b2b58e4a8e upstream.
If we have to drop the seqcount & rcu lock to perform a krealloc, we have to restart the loop. In doing so, be careful not to lose track of the already acquired exclusive fence.
Fixes: fedf54132d24 ("dma-buf: Restart reservation_object_get_fences_rcu() after writes") Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Cc: Alex Deucher alexander.deucher@amd.com Cc: Sumit Semwal sumit.semwal@linaro.org Cc: stable@vger.kernel.org #v4.10 Reviewed-by: Christian König christian.koenig@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20190604125323.21396-1-chris@c... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/dma-buf/reservation.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -357,6 +357,10 @@ int reservation_object_get_fences_rcu(st GFP_NOWAIT | __GFP_NOWARN); if (!nshared) { rcu_read_unlock(); + + dma_fence_put(fence_excl); + fence_excl = NULL; + nshared = krealloc(shared, sz, GFP_KERNEL); if (nshared) { shared = nshared;
From: Nishka Dasgupta nishkadg.linux@gmail.com
commit 89fea04c85e85f21ef4937611055abce82330d48 upstream.
Each iteration of for_each_child_of_node puts the previous node, but in the case of a break from the middle of the loop, there is no put, thus causing a memory leak. Hence add an of_node_put before the break. Issue found with Coccinelle.
Cc: stable@vger.kernel.org Signed-off-by: Nishka Dasgupta nishkadg.linux@gmail.com [Bartosz: tweaked the commit message] Signed-off-by: Bartosz Golaszewski bgolaszewski@baylibre.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/gpio/gpiolib-of.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/gpio/gpiolib-of.c +++ b/drivers/gpio/gpiolib-of.c @@ -155,6 +155,7 @@ static void of_gpio_flags_quirks(struct of_node_full_name(child)); *flags |= OF_GPIO_ACTIVE_LOW; } + of_node_put(child); break; } }
From: Keerthy j-keerthy@ti.com
commit 541e4095f388c196685685633c950cb9b97f8039 upstream.
Silence error prints in case of EPROBE_DEFER. This avoids multiple/duplicate defer prints during boot.
Cc: stable@vger.kernel.org Signed-off-by: Keerthy j-keerthy@ti.com Signed-off-by: Bartosz Golaszewski bgolaszewski@baylibre.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/gpio/gpio-davinci.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/drivers/gpio/gpio-davinci.c +++ b/drivers/gpio/gpio-davinci.c @@ -242,8 +242,9 @@ static int davinci_gpio_probe(struct pla for (i = 0; i < nirq; i++) { chips->irqs[i] = platform_get_irq(pdev, i); if (chips->irqs[i] < 0) { - dev_info(dev, "IRQ not populated, err = %d\n", - chips->irqs[i]); + if (chips->irqs[i] != -EPROBE_DEFER) + dev_info(dev, "IRQ not populated, err = %d\n", + chips->irqs[i]); return chips->irqs[i]; } }
From: Paul Cercueil paul@crapouillou.net
commit 1323c3b72a987de57141cabc44bf9cd83656bc70 upstream.
The pin mappings introduced in commit 636f8ba67fb6 ("MIPS: JZ4740: Qi LB60: Add pinctrl configuration for several drivers") are completely wrong. The pinctrl driver name is incorrect, and the function and group fields are swapped.
Fixes: 636f8ba67fb6 ("MIPS: JZ4740: Qi LB60: Add pinctrl configuration for several drivers") Cc: stable@vger.kernel.org Signed-off-by: Paul Cercueil paul@crapouillou.net Reviewed-by: Linus Walleij linus.walleij@linaro.org Signed-off-by: Paul Burton paul.burton@mips.com Cc: Ralf Baechle ralf@linux-mips.org Cc: James Hogan jhogan@kernel.org Cc: od@zcrc.me Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/mips/jz4740/board-qi_lb60.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
--- a/arch/mips/jz4740/board-qi_lb60.c +++ b/arch/mips/jz4740/board-qi_lb60.c @@ -469,27 +469,27 @@ static unsigned long pin_cfg_bias_disabl static struct pinctrl_map pin_map[] __initdata = { /* NAND pin configuration */ PIN_MAP_MUX_GROUP_DEFAULT("jz4740-nand", - "10010000.jz4740-pinctrl", "nand", "nand-cs1"), + "10010000.pin-controller", "nand-cs1", "nand"),
/* fbdev pin configuration */ PIN_MAP_MUX_GROUP("jz4740-fb", PINCTRL_STATE_DEFAULT, - "10010000.jz4740-pinctrl", "lcd", "lcd-8bit"), + "10010000.pin-controller", "lcd-8bit", "lcd"), PIN_MAP_MUX_GROUP("jz4740-fb", PINCTRL_STATE_SLEEP, - "10010000.jz4740-pinctrl", "lcd", "lcd-no-pins"), + "10010000.pin-controller", "lcd-no-pins", "lcd"),
/* MMC pin configuration */ PIN_MAP_MUX_GROUP_DEFAULT("jz4740-mmc.0", - "10010000.jz4740-pinctrl", "mmc", "mmc-1bit"), + "10010000.pin-controller", "mmc-1bit", "mmc"), PIN_MAP_MUX_GROUP_DEFAULT("jz4740-mmc.0", - "10010000.jz4740-pinctrl", "mmc", "mmc-4bit"), + "10010000.pin-controller", "mmc-4bit", "mmc"), PIN_MAP_CONFIGS_PIN_DEFAULT("jz4740-mmc.0", - "10010000.jz4740-pinctrl", "PD0", pin_cfg_bias_disable), + "10010000.pin-controller", "PD0", pin_cfg_bias_disable), PIN_MAP_CONFIGS_PIN_DEFAULT("jz4740-mmc.0", - "10010000.jz4740-pinctrl", "PD2", pin_cfg_bias_disable), + "10010000.pin-controller", "PD2", pin_cfg_bias_disable),
/* PWM pin configuration */ PIN_MAP_MUX_GROUP_DEFAULT("jz4740-pwm", - "10010000.jz4740-pinctrl", "pwm4", "pwm4"), + "10010000.pin-controller", "pwm4", "pwm4"), };
From: Song Liu songliubraving@fb.com
commit 9d49169c5958e429ffa6874fbef734ae7502ad65 upstream.
In pipe mode, session->header.env.arch is not populated until the events are processed. Therefore, the following command crashes:
perf record -o - | perf script
(gdb) bt
It fails when we try to compare env.arch against uts.machine:
if (!strcmp(uts.machine, session->header.env.arch) || (!strcmp(uts.machine, "x86_64") && !strcmp(session->header.env.arch, "i386"))) native_arch = true;
In pipe mode, it is tricky to find env.arch at this stage. To keep it simple, let's just assume native_arch is always true for pipe mode.
Reported-by: David Carrillo Cisneros davidca@fb.com Signed-off-by: Song Liu songliubraving@fb.com Tested-by: Arnaldo Carvalho de Melo acme@redhat.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Cc: Namhyung Kim namhyung@kernel.org Cc: kernel-team@fb.com Cc: stable@vger.kernel.org #v5.1+ Fixes: 3ab481a1cfe1 ("perf script: Support insn output for normal samples") Link: http://lkml.kernel.org/r/20190621014438.810342-1-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- tools/perf/builtin-script.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -3669,7 +3669,8 @@ int cmd_script(int argc, const char **ar goto out_delete;
uname(&uts); - if (!strcmp(uts.machine, session->header.env.arch) || + if (data.is_pipe || /* assume pipe_mode indicates native_arch */ + !strcmp(uts.machine, session->header.env.arch) || (!strcmp(uts.machine, "x86_64") && !strcmp(session->header.env.arch, "i386"))) native_arch = true;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
So far, we tried to disallow grouping exclusive events for the fear of complications they would cause with moving between contexts. Specifically, moving a software group to a hardware context would violate the exclusivity rules if both groups contain matching exclusive events.
This attempt was, however, unsuccessful: the check that we have in the perf_event_open() syscall is both wrong (looks at wrong PMU) and insufficient (group leader may still be exclusive), as can be illustrated by running:
$ perf record -e '{intel_pt//,cycles}' uname $ perf record -e '{cycles,intel_pt//}' uname
ultimately successfully.
Furthermore, we are completely free to trigger the exclusivity violation by:
perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
even though the helpful perf record will not allow that, the ABI will.
The warning later in the perf_event_open() path will also not trigger, because it's also wrong.
Fix all this by validating the original group before moving, getting rid of broken safeguards and placing a useful one to perf_install_in_context().
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: mathieu.poirier@linaro.org Cc: will.deacon@arm.com Fixes: bed5b25ad9c8a ("perf: Add a pmu capability for "exclusive" events") Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.in... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- include/linux/perf_event.h | 5 +++++ kernel/events/core.c | 34 ++++++++++++++++++++++------------ 2 files changed, 27 insertions(+), 12 deletions(-)
--- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1044,6 +1044,11 @@ static inline int in_software_context(st return event->ctx->pmu->task_ctx_nr == perf_sw_context; }
+static inline int is_exclusive_pmu(struct pmu *pmu) +{ + return pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE; +} + extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64); --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2543,6 +2543,9 @@ unlock: return ret; }
+static bool exclusive_event_installable(struct perf_event *event, + struct perf_event_context *ctx); + /* * Attach a performance event to a context. * @@ -2557,6 +2560,8 @@ perf_install_in_context(struct perf_even
lockdep_assert_held(&ctx->mutex);
+ WARN_ON_ONCE(!exclusive_event_installable(event, ctx)); + if (event->cpu != -1) event->cpu = cpu;
@@ -4348,7 +4353,7 @@ static int exclusive_event_init(struct p { struct pmu *pmu = event->pmu;
- if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE)) + if (!is_exclusive_pmu(pmu)) return 0;
/* @@ -4379,7 +4384,7 @@ static void exclusive_event_destroy(stru { struct pmu *pmu = event->pmu;
- if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE)) + if (!is_exclusive_pmu(pmu)) return;
/* see comment in exclusive_event_init() */ @@ -4399,14 +4404,15 @@ static bool exclusive_event_match(struct return false; }
-/* Called under the same ctx::mutex as perf_install_in_context() */ static bool exclusive_event_installable(struct perf_event *event, struct perf_event_context *ctx) { struct perf_event *iter_event; struct pmu *pmu = event->pmu;
- if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE)) + lockdep_assert_held(&ctx->mutex); + + if (!is_exclusive_pmu(pmu)) return true;
list_for_each_entry(iter_event, &ctx->event_list, event_entry) { @@ -10899,11 +10905,6 @@ SYSCALL_DEFINE5(perf_event_open, goto err_alloc; }
- if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) { - err = -EBUSY; - goto err_context; - } - /* * Look up the group leader (we will attach this event to it): */ @@ -10991,6 +10992,18 @@ SYSCALL_DEFINE5(perf_event_open, move_group = 0; } } + + /* + * Failure to create exclusive events returns -EBUSY. + */ + err = -EBUSY; + if (!exclusive_event_installable(group_leader, ctx)) + goto err_locked; + + for_each_sibling_event(sibling, group_leader) { + if (!exclusive_event_installable(sibling, ctx)) + goto err_locked; + } } else { mutex_lock(&ctx->mutex); } @@ -11027,9 +11040,6 @@ SYSCALL_DEFINE5(perf_event_open, * because we need to serialize with concurrent event creation. */ if (!exclusive_event_installable(event, ctx)) { - /* exclusive and group stuff are assumed mutually exclusive */ - WARN_ON_ONCE(move_group); - err = -EBUSY; goto err_locked; }
From: Peter Zijlstra peterz@infradead.org
commit 1cf8dfe8a661f0462925df943140e9f6d1ea5233 upstream.
Syzcaller reported the following Use-after-Free bug:
close() clone()
copy_process() perf_event_init_task() perf_event_init_context() mutex_lock(parent_ctx->mutex) inherit_task_group() inherit_group() inherit_event() mutex_lock(event->child_mutex) // expose event on child list list_add_tail() mutex_unlock(event->child_mutex) mutex_unlock(parent_ctx->mutex)
... goto bad_fork_*
bad_fork_cleanup_perf: perf_event_free_task()
perf_release() perf_event_release_kernel() list_for_each_entry() mutex_lock(ctx->mutex) mutex_lock(event->child_mutex) // event is from the failing inherit // on the other CPU perf_remove_from_context() list_move() mutex_unlock(event->child_mutex) mutex_unlock(ctx->mutex)
mutex_lock(ctx->mutex) list_for_each_entry_safe() // event already stolen mutex_unlock(ctx->mutex)
delayed_free_task() free_task()
list_for_each_entry_safe() list_del() free_event() _free_event() // and so event->hw.target // is the already freed failed clone() if (event->hw.target) put_task_struct(event->hw.target) // WHOOPSIE, already quite dead
Which puts the lie to the the comment on perf_event_free_task(): 'unexposed, unused context' not so much.
Which is a 'fun' confluence of fail; copy_process() doing an unconditional free_task() and not respecting refcounts, and perf having creative locking. In particular:
82d94856fa22 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
seems to have overlooked this 'fun' parade.
Solve it by using the fact that detached events still have a reference count on their (previous) context. With this perf_event_free_task() can detect when events have escaped and wait for their destruction.
Debugged-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Mark Rutland mark.rutland@arm.com Cc: stable@vger.kernel.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Fixes: 82d94856fa22 ("perf/core: Fix lock inversion between perf,trace,cpuhp") Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- kernel/events/core.c | 49 +++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 41 insertions(+), 8 deletions(-)
--- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4459,12 +4459,20 @@ static void _free_event(struct perf_even if (event->destroy) event->destroy(event);
- if (event->ctx) - put_ctx(event->ctx); - + /* + * Must be after ->destroy(), due to uprobe_perf_close() using + * hw.target. + */ if (event->hw.target) put_task_struct(event->hw.target);
+ /* + * perf_event_free_task() relies on put_ctx() being 'last', in particular + * all task references must be cleaned up. + */ + if (event->ctx) + put_ctx(event->ctx); + exclusive_event_destroy(event); module_put(event->pmu->module);
@@ -4644,8 +4652,17 @@ again: mutex_unlock(&event->child_mutex);
list_for_each_entry_safe(child, tmp, &free_list, child_list) { + void *var = &child->ctx->refcount; + list_del(&child->child_list); free_event(child); + + /* + * Wake any perf_event_free_task() waiting for this event to be + * freed. + */ + smp_mb(); /* pairs with wait_var_event() */ + wake_up_var(var); }
no_ctx: @@ -11506,11 +11523,11 @@ static void perf_free_event(struct perf_ }
/* - * Free an unexposed, unused context as created by inheritance by - * perf_event_init_task below, used by fork() in case of fail. + * Free a context as created by inheritance by perf_event_init_task() below, + * used by fork() in case of fail. * - * Not all locks are strictly required, but take them anyway to be nice and - * help out with the lockdep assertions. + * Even though the task has never lived, the context and events have been + * exposed through the child_list, so we must take care tearing it all down. */ void perf_event_free_task(struct task_struct *task) { @@ -11540,7 +11557,23 @@ void perf_event_free_task(struct task_st perf_free_event(event, ctx);
mutex_unlock(&ctx->mutex); - put_ctx(ctx); + + /* + * perf_event_release_kernel() could've stolen some of our + * child events and still have them on its free_list. In that + * case we must wait for these events to have been freed (in + * particular all their references to this task must've been + * dropped). + * + * Without this copy_process() will unconditionally free this + * task (irrespective of its reference count) and + * _free_event()'s put_task_struct(event->hw.target) will be a + * use-after-free. + * + * Wait for all events to drop their context reference. + */ + wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1); + put_ctx(ctx); /* must be last */ } }
From: Darrick J. Wong darrick.wong@oracle.com
commit 2e53840362771c73eb0a5ff71611507e64e8eecd upstream.
Don't allow any modifications to a file that's marked immutable, which means that we have to flush all the writable pages to make the readonly and we have to check the setattr/setflags parameters more closely.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ext4/ioctl.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 45 insertions(+), 1 deletion(-)
--- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -269,6 +269,29 @@ static int uuid_is_zero(__u8 u[16]) } #endif
+/* + * If immutable is set and we are not clearing it, we're not allowed to change + * anything else in the inode. Don't error out if we're only trying to set + * immutable on an immutable file. + */ +static int ext4_ioctl_check_immutable(struct inode *inode, __u32 new_projid, + unsigned int flags) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + unsigned int oldflags = ei->i_flags; + + if (!(oldflags & EXT4_IMMUTABLE_FL) || !(flags & EXT4_IMMUTABLE_FL)) + return 0; + + if ((oldflags & ~EXT4_IMMUTABLE_FL) != (flags & ~EXT4_IMMUTABLE_FL)) + return -EPERM; + if (ext4_has_feature_project(inode->i_sb) && + __kprojid_val(ei->i_projid) != new_projid) + return -EPERM; + + return 0; +} + static int ext4_ioctl_setflags(struct inode *inode, unsigned int flags) { @@ -322,6 +345,20 @@ static int ext4_ioctl_setflags(struct in goto flags_out; }
+ /* + * Wait for all pending directio and then flush all the dirty pages + * for this file. The flush marks all the pages readonly, so any + * subsequent attempt to write to the file (particularly mmap pages) + * will come through the filesystem and fail. + */ + if (S_ISREG(inode->i_mode) && !IS_IMMUTABLE(inode) && + (flags & EXT4_IMMUTABLE_FL)) { + inode_dio_wait(inode); + err = filemap_write_and_wait(inode->i_mapping); + if (err) + goto flags_out; + } + handle = ext4_journal_start(inode, EXT4_HT_INODE, 1); if (IS_ERR(handle)) { err = PTR_ERR(handle); @@ -751,7 +788,11 @@ long ext4_ioctl(struct file *filp, unsig return err;
inode_lock(inode); - err = ext4_ioctl_setflags(inode, flags); + err = ext4_ioctl_check_immutable(inode, + from_kprojid(&init_user_ns, ei->i_projid), + flags); + if (!err) + err = ext4_ioctl_setflags(inode, flags); inode_unlock(inode); mnt_drop_write_file(filp); return err; @@ -1121,6 +1162,9 @@ resizefs_out: goto out; flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) | (flags & EXT4_FL_XFLAG_VISIBLE); + err = ext4_ioctl_check_immutable(inode, fa.fsx_projid, flags); + if (err) + goto out; err = ext4_ioctl_setflags(inode, flags); if (err) goto out;
From: Theodore Ts'o tytso@mit.edu
commit 02b016ca7f99229ae6227e7b2fc950c4e140d74a upstream.
According to the chattr man page, "a file with the 'i' attribute cannot be modified..." Historically, this was only enforced when the file was opened, per the rest of the description, "... and the file can not be opened in write mode".
There is general agreement that we should standardize all file systems to prevent modifications even for files that were opened at the time the immutable flag is set. Eventually, a change to enforce this at the VFS layer should be landing in mainline. Until then, enforce this at the ext4 level to prevent xfstests generic/553 from failing.
Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: "Darrick J. Wong" darrick.wong@oracle.com Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ext4/file.c | 4 ++++ fs/ext4/inode.c | 11 +++++++++++ 2 files changed, 15 insertions(+)
--- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -165,6 +165,10 @@ static ssize_t ext4_write_checks(struct ret = generic_write_checks(iocb, from); if (ret <= 0) return ret; + + if (unlikely(IS_IMMUTABLE(inode))) + return -EPERM; + /* * If we have encountered a bitmap-format file, the size limit * is smaller than s_maxbytes, which is for extent-mapped files. --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5514,6 +5514,14 @@ int ext4_setattr(struct dentry *dentry, if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) return -EIO;
+ if (unlikely(IS_IMMUTABLE(inode))) + return -EPERM; + + if (unlikely(IS_APPEND(inode) && + (ia_valid & (ATTR_MODE | ATTR_UID | + ATTR_GID | ATTR_TIMES_SET)))) + return -EPERM; + error = setattr_prepare(dentry, attr); if (error) return error; @@ -6184,6 +6192,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_f get_block_t *get_block; int retries = 0;
+ if (unlikely(IS_IMMUTABLE(inode))) + return VM_FAULT_SIGBUS; + sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file);
From: Ross Zwisler zwisler@chromium.org
commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
In the spirit of filemap_fdatawait_range() and filemap_fdatawait_keep_errors(), introduce filemap_fdatawait_range_keep_errors() which both takes a range upon which to wait and does not clear errors from the address space.
Signed-off-by: Ross Zwisler zwisler@google.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Jan Kara jack@suse.cz Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- include/linux/fs.h | 2 ++ mm/filemap.c | 22 ++++++++++++++++++++++ 2 files changed, 24 insertions(+)
--- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2703,6 +2703,8 @@ extern int filemap_flush(struct address_ extern int filemap_fdatawait_keep_errors(struct address_space *mapping); extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, loff_t lend); +extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping, + loff_t start_byte, loff_t end_byte);
static inline int filemap_fdatawait(struct address_space *mapping) { --- a/mm/filemap.c +++ b/mm/filemap.c @@ -548,6 +548,28 @@ int filemap_fdatawait_range(struct addre EXPORT_SYMBOL(filemap_fdatawait_range);
/** + * filemap_fdatawait_range_keep_errors - wait for writeback to complete + * @mapping: address space structure to wait for + * @start_byte: offset in bytes where the range starts + * @end_byte: offset in bytes where the range ends (inclusive) + * + * Walk the list of under-writeback pages of the given address space in the + * given range and wait for all of them. Unlike filemap_fdatawait_range(), + * this function does not clear error status of the address space. + * + * Use this function if callers don't handle errors themselves. Expected + * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), + * fsfreeze(8) + */ +int filemap_fdatawait_range_keep_errors(struct address_space *mapping, + loff_t start_byte, loff_t end_byte) +{ + __filemap_fdatawait_range(mapping, start_byte, end_byte); + return filemap_check_and_keep_errors(mapping); +} +EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors); + +/** * file_fdatawait_range - wait for writeback to complete * @file: file pointing to address space structure to wait for * @start_byte: offset in bytes where the range starts
From: Ross Zwisler zwisler@chromium.org
commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
Currently both journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() operate on the entire address space of each of the inodes associated with a given journal entry. The consequence of this is that if we have an inode where we are constantly appending dirty pages we can end up waiting for an indefinite amount of time in journal_finish_inode_data_buffers() while we wait for all the pages under writeback to be written out.
The easiest way to cause this type of workload is do just dd from /dev/zero to a file until it fills the entire filesystem. This can cause journal_finish_inode_data_buffers() to wait for the duration of the entire dd operation.
We can improve this situation by scoping each of the inode dirty ranges associated with a given transaction. We do this via the jbd2_inode structure so that the scoping is contained within jbd2 and so that it follows the lifetime and locking rules for that structure.
This allows us to limit the writeback & wait in journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() respectively to the dirty range for a given struct jdb2_inode, keeping us from waiting forever if the inode in question is still being appended to.
Signed-off-by: Ross Zwisler zwisler@google.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Jan Kara jack@suse.cz Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/jbd2/commit.c | 23 +++++++++++++++++------ fs/jbd2/journal.c | 4 ++++ fs/jbd2/transaction.c | 49 ++++++++++++++++++++++++++++--------------------- include/linux/jbd2.h | 22 ++++++++++++++++++++++ 4 files changed, 71 insertions(+), 27 deletions(-)
--- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -187,14 +187,15 @@ static int journal_wait_on_commit_record * use writepages() because with dealyed allocation we may be doing * block allocation in writepages(). */ -static int journal_submit_inode_data_buffers(struct address_space *mapping) +static int journal_submit_inode_data_buffers(struct address_space *mapping, + loff_t dirty_start, loff_t dirty_end) { int ret; struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = mapping->nrpages * 2, - .range_start = 0, - .range_end = i_size_read(mapping->host), + .range_start = dirty_start, + .range_end = dirty_end, };
ret = generic_writepages(mapping, &wbc); @@ -218,6 +219,9 @@ static int journal_submit_data_buffers(j
spin_lock(&journal->j_list_lock); list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { + loff_t dirty_start = jinode->i_dirty_start; + loff_t dirty_end = jinode->i_dirty_end; + if (!(jinode->i_flags & JI_WRITE_DATA)) continue; mapping = jinode->i_vfs_inode->i_mapping; @@ -230,7 +234,8 @@ static int journal_submit_data_buffers(j * only allocated blocks here. */ trace_jbd2_submit_inode_data(jinode->i_vfs_inode); - err = journal_submit_inode_data_buffers(mapping); + err = journal_submit_inode_data_buffers(mapping, dirty_start, + dirty_end); if (!ret) ret = err; spin_lock(&journal->j_list_lock); @@ -257,12 +262,16 @@ static int journal_finish_inode_data_buf /* For locking, see the comment in journal_submit_data_buffers() */ spin_lock(&journal->j_list_lock); list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { + loff_t dirty_start = jinode->i_dirty_start; + loff_t dirty_end = jinode->i_dirty_end; + if (!(jinode->i_flags & JI_WAIT_DATA)) continue; jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); - err = filemap_fdatawait_keep_errors( - jinode->i_vfs_inode->i_mapping); + err = filemap_fdatawait_range_keep_errors( + jinode->i_vfs_inode->i_mapping, dirty_start, + dirty_end); if (!ret) ret = err; spin_lock(&journal->j_list_lock); @@ -282,6 +291,8 @@ static int journal_finish_inode_data_buf &jinode->i_transaction->t_inode_list); } else { jinode->i_transaction = NULL; + jinode->i_dirty_start = 0; + jinode->i_dirty_end = 0; } } spin_unlock(&journal->j_list_lock); --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -94,6 +94,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_b EXPORT_SYMBOL(jbd2_journal_force_commit); EXPORT_SYMBOL(jbd2_journal_inode_add_write); EXPORT_SYMBOL(jbd2_journal_inode_add_wait); +EXPORT_SYMBOL(jbd2_journal_inode_ranged_write); +EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait); EXPORT_SYMBOL(jbd2_journal_init_jbd_inode); EXPORT_SYMBOL(jbd2_journal_release_jbd_inode); EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate); @@ -2574,6 +2576,8 @@ void jbd2_journal_init_jbd_inode(struct jinode->i_next_transaction = NULL; jinode->i_vfs_inode = inode; jinode->i_flags = 0; + jinode->i_dirty_start = 0; + jinode->i_dirty_end = 0; INIT_LIST_HEAD(&jinode->i_list); }
--- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -2565,7 +2565,7 @@ void jbd2_journal_refile_buffer(journal_ * File inode in the inode list of the handle's transaction */ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode, - unsigned long flags) + unsigned long flags, loff_t start_byte, loff_t end_byte) { transaction_t *transaction = handle->h_transaction; journal_t *journal; @@ -2577,26 +2577,17 @@ static int jbd2_journal_file_inode(handl jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino, transaction->t_tid);
- /* - * First check whether inode isn't already on the transaction's - * lists without taking the lock. Note that this check is safe - * without the lock as we cannot race with somebody removing inode - * from the transaction. The reason is that we remove inode from the - * transaction only in journal_release_jbd_inode() and when we commit - * the transaction. We are guarded from the first case by holding - * a reference to the inode. We are safe against the second case - * because if jinode->i_transaction == transaction, commit code - * cannot touch the transaction because we hold reference to it, - * and if jinode->i_next_transaction == transaction, commit code - * will only file the inode where we want it. - */ - if ((jinode->i_transaction == transaction || - jinode->i_next_transaction == transaction) && - (jinode->i_flags & flags) == flags) - return 0; - spin_lock(&journal->j_list_lock); jinode->i_flags |= flags; + + if (jinode->i_dirty_end) { + jinode->i_dirty_start = min(jinode->i_dirty_start, start_byte); + jinode->i_dirty_end = max(jinode->i_dirty_end, end_byte); + } else { + jinode->i_dirty_start = start_byte; + jinode->i_dirty_end = end_byte; + } + /* Is inode already attached where we need it? */ if (jinode->i_transaction == transaction || jinode->i_next_transaction == transaction) @@ -2631,12 +2622,28 @@ done: int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode) { return jbd2_journal_file_inode(handle, jinode, - JI_WRITE_DATA | JI_WAIT_DATA); + JI_WRITE_DATA | JI_WAIT_DATA, 0, LLONG_MAX); }
int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode) { - return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA); + return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA, 0, + LLONG_MAX); +} + +int jbd2_journal_inode_ranged_write(handle_t *handle, + struct jbd2_inode *jinode, loff_t start_byte, loff_t length) +{ + return jbd2_journal_file_inode(handle, jinode, + JI_WRITE_DATA | JI_WAIT_DATA, start_byte, + start_byte + length - 1); +} + +int jbd2_journal_inode_ranged_wait(handle_t *handle, struct jbd2_inode *jinode, + loff_t start_byte, loff_t length) +{ + return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA, + start_byte, start_byte + length - 1); }
/* --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -454,6 +454,22 @@ struct jbd2_inode { * @i_flags: Flags of inode [j_list_lock] */ unsigned long i_flags; + + /** + * @i_dirty_start: + * + * Offset in bytes where the dirty range for this inode starts. + * [j_list_lock] + */ + loff_t i_dirty_start; + + /** + * @i_dirty_end: + * + * Inclusive offset in bytes where the dirty range for this inode + * ends. [j_list_lock] + */ + loff_t i_dirty_end; };
struct jbd2_revoke_table_s; @@ -1400,6 +1416,12 @@ extern int jbd2_journal_force_commit( extern int jbd2_journal_force_commit_nested(journal_t *); extern int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode); extern int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode); +extern int jbd2_journal_inode_ranged_write(handle_t *handle, + struct jbd2_inode *inode, loff_t start_byte, + loff_t length); +extern int jbd2_journal_inode_ranged_wait(handle_t *handle, + struct jbd2_inode *inode, loff_t start_byte, + loff_t length); extern int jbd2_journal_begin_ordered_truncate(journal_t *journal, struct jbd2_inode *inode, loff_t new_size); extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
From: Ross Zwisler zwisler@chromium.org
commit 73131fbb003b3691cfcf9656f234b00da497fcd6 upstream.
Use the newly introduced jbd2_inode dirty range scoping to prevent us from waiting forever when trying to complete a journal transaction.
Signed-off-by: Ross Zwisler zwisler@google.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Jan Kara jack@suse.cz Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ext4/ext4_jbd2.h | 12 ++++++------ fs/ext4/inode.c | 13 ++++++++++--- fs/ext4/move_extent.c | 3 ++- 3 files changed, 18 insertions(+), 10 deletions(-)
--- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -361,20 +361,20 @@ static inline int ext4_journal_force_com }
static inline int ext4_jbd2_inode_add_write(handle_t *handle, - struct inode *inode) + struct inode *inode, loff_t start_byte, loff_t length) { if (ext4_handle_valid(handle)) - return jbd2_journal_inode_add_write(handle, - EXT4_I(inode)->jinode); + return jbd2_journal_inode_ranged_write(handle, + EXT4_I(inode)->jinode, start_byte, length); return 0; }
static inline int ext4_jbd2_inode_add_wait(handle_t *handle, - struct inode *inode) + struct inode *inode, loff_t start_byte, loff_t length) { if (ext4_handle_valid(handle)) - return jbd2_journal_inode_add_wait(handle, - EXT4_I(inode)->jinode); + return jbd2_journal_inode_ranged_wait(handle, + EXT4_I(inode)->jinode, start_byte, length); return 0; }
--- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -727,10 +727,16 @@ out_sem: !(flags & EXT4_GET_BLOCKS_ZERO) && !ext4_is_quota_file(inode) && ext4_should_order_data(inode)) { + loff_t start_byte = + (loff_t)map->m_lblk << inode->i_blkbits; + loff_t length = (loff_t)map->m_len << inode->i_blkbits; + if (flags & EXT4_GET_BLOCKS_IO_SUBMIT) - ret = ext4_jbd2_inode_add_wait(handle, inode); + ret = ext4_jbd2_inode_add_wait(handle, inode, + start_byte, length); else - ret = ext4_jbd2_inode_add_write(handle, inode); + ret = ext4_jbd2_inode_add_write(handle, inode, + start_byte, length); if (ret) return ret; } @@ -4081,7 +4087,8 @@ static int __ext4_block_zero_page_range( err = 0; mark_buffer_dirty(bh); if (ext4_should_order_data(inode)) - err = ext4_jbd2_inode_add_write(handle, inode); + err = ext4_jbd2_inode_add_write(handle, inode, from, + length); }
unlock: --- a/fs/ext4/move_extent.c +++ b/fs/ext4/move_extent.c @@ -390,7 +390,8 @@ data_copy:
/* Even in case of data=writeback it is reasonable to pin * inode to transaction, to prevent unexpected data loss */ - *err = ext4_jbd2_inode_add_write(handle, orig_inode); + *err = ext4_jbd2_inode_add_write(handle, orig_inode, + (loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
unlock_pages: unlock_page(pagep[0]);
From: Theodore Ts'o tytso@mit.edu
commit 4e19d6b65fb4fc42e352ce9883649e049da14743 upstream.
The largedir feature was intended to allow ext4 directories to have unmapped directory blocks (e.g., directory holes). And so the released e2fsprogs no longer enforces this for largedir file systems; however, the corresponding change to the kernel-side code was not made.
This commit fixes this oversight.
Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- fs/ext4/dir.c | 19 +++++++++---------- fs/ext4/namei.c | 45 +++++++++++++++++++++++++++++++++++++-------- 2 files changed, 46 insertions(+), 18 deletions(-)
--- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -108,7 +108,6 @@ static int ext4_readdir(struct file *fil struct inode *inode = file_inode(file); struct super_block *sb = inode->i_sb; struct buffer_head *bh = NULL; - int dir_has_error = 0; struct fscrypt_str fstr = FSTR_INIT(NULL, 0);
if (IS_ENCRYPTED(inode)) { @@ -144,8 +143,6 @@ static int ext4_readdir(struct file *fil return err; }
- offset = ctx->pos & (sb->s_blocksize - 1); - while (ctx->pos < inode->i_size) { struct ext4_map_blocks map;
@@ -154,9 +151,18 @@ static int ext4_readdir(struct file *fil goto errout; } cond_resched(); + offset = ctx->pos & (sb->s_blocksize - 1); map.m_lblk = ctx->pos >> EXT4_BLOCK_SIZE_BITS(sb); map.m_len = 1; err = ext4_map_blocks(NULL, inode, &map, 0); + if (err == 0) { + /* m_len should never be zero but let's avoid + * an infinite loop if it somehow is */ + if (map.m_len == 0) + map.m_len = 1; + ctx->pos += map.m_len * sb->s_blocksize; + continue; + } if (err > 0) { pgoff_t index = map.m_pblk >> (PAGE_SHIFT - inode->i_blkbits); @@ -175,13 +181,6 @@ static int ext4_readdir(struct file *fil }
if (!bh) { - if (!dir_has_error) { - EXT4_ERROR_FILE(file, 0, - "directory contains a " - "hole at offset %llu", - (unsigned long long) ctx->pos); - dir_has_error = 1; - } /* corrupt size? Maybe no more blocks to read */ if (ctx->pos > inode->i_blocks << 9) break; --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -81,8 +81,18 @@ static struct buffer_head *ext4_append(h static int ext4_dx_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent);
+/* + * Hints to ext4_read_dirblock regarding whether we expect a directory + * block being read to be an index block, or a block containing + * directory entries (and if the latter, whether it was found via a + * logical block in an htree index block). This is used to control + * what sort of sanity checkinig ext4_read_dirblock() will do on the + * directory block read from the storage device. EITHER will means + * the caller doesn't know what kind of directory block will be read, + * so no specific verification will be done. + */ typedef enum { - EITHER, INDEX, DIRENT + EITHER, INDEX, DIRENT, DIRENT_HTREE } dirblock_type_t;
#define ext4_read_dirblock(inode, block, type) \ @@ -108,11 +118,14 @@ static struct buffer_head *__ext4_read_d
return bh; } - if (!bh) { + if (!bh && (type == INDEX || type == DIRENT_HTREE)) { ext4_error_inode(inode, func, line, block, - "Directory hole found"); + "Directory hole found for htree %s block", + (type == INDEX) ? "index" : "leaf"); return ERR_PTR(-EFSCORRUPTED); } + if (!bh) + return NULL; dirent = (struct ext4_dir_entry *) bh->b_data; /* Determine whether or not we have an index block */ if (is_dx(inode)) { @@ -979,7 +992,7 @@ static int htree_dirblock_to_tree(struct
dxtrace(printk(KERN_INFO "In htree dirblock_to_tree: block %lu\n", (unsigned long)block)); - bh = ext4_read_dirblock(dir, block, DIRENT); + bh = ext4_read_dirblock(dir, block, DIRENT_HTREE); if (IS_ERR(bh)) return PTR_ERR(bh);
@@ -1509,7 +1522,7 @@ static struct buffer_head * ext4_dx_find return (struct buffer_head *) frame; do { block = dx_get_block(frame->at); - bh = ext4_read_dirblock(dir, block, DIRENT); + bh = ext4_read_dirblock(dir, block, DIRENT_HTREE); if (IS_ERR(bh)) goto errout;
@@ -2079,6 +2092,11 @@ static int ext4_add_entry(handle_t *hand blocks = dir->i_size >> sb->s_blocksize_bits; for (block = 0; block < blocks; block++) { bh = ext4_read_dirblock(dir, block, DIRENT); + if (bh == NULL) { + bh = ext4_bread(handle, dir, block, + EXT4_GET_BLOCKS_CREATE); + goto add_to_new_block; + } if (IS_ERR(bh)) { retval = PTR_ERR(bh); bh = NULL; @@ -2099,6 +2117,7 @@ static int ext4_add_entry(handle_t *hand brelse(bh); } bh = ext4_append(handle, dir, &block); +add_to_new_block: if (IS_ERR(bh)) { retval = PTR_ERR(bh); bh = NULL; @@ -2143,7 +2162,7 @@ again: return PTR_ERR(frame); entries = frame->entries; at = frame->at; - bh = ext4_read_dirblock(dir, dx_get_block(frame->at), DIRENT); + bh = ext4_read_dirblock(dir, dx_get_block(frame->at), DIRENT_HTREE); if (IS_ERR(bh)) { err = PTR_ERR(bh); bh = NULL; @@ -2691,7 +2710,10 @@ bool ext4_empty_dir(struct inode *inode) EXT4_ERROR_INODE(inode, "invalid size"); return true; } - bh = ext4_read_dirblock(inode, 0, EITHER); + /* The first directory block must not be a hole, + * so treat it as DIRENT_HTREE + */ + bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE); if (IS_ERR(bh)) return true;
@@ -2713,6 +2735,10 @@ bool ext4_empty_dir(struct inode *inode) brelse(bh); lblock = offset >> EXT4_BLOCK_SIZE_BITS(sb); bh = ext4_read_dirblock(inode, lblock, EITHER); + if (bh == NULL) { + offset += sb->s_blocksize; + continue; + } if (IS_ERR(bh)) return true; de = (struct ext4_dir_entry_2 *) bh->b_data; @@ -3256,7 +3282,10 @@ static struct buffer_head *ext4_get_firs struct buffer_head *bh;
if (!ext4_has_inline_data(inode)) { - bh = ext4_read_dirblock(inode, 0, EITHER); + /* The first directory block must not be a hole, so + * treat it as DIRENT_HTREE + */ + bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE); if (IS_ERR(bh)) { *retval = PTR_ERR(bh); return NULL;
From: Paolo Bonzini pbonzini@redhat.com
commit 88dddc11a8d6b09201b4db9d255b3394d9bc9e57 upstream.
If a KVM guest is reset while running a nested guest, free_nested will disable the shadow VMCS execution control in the vmcs01. However, on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync the VMCS12 to the shadow VMCS which has since been freed.
This causes a vmptrld of a NULL pointer on my machime, but Jan reports the host to hang altogether. Let's see how much this trivial patch fixes.
Reported-by: Jan Kiszka jan.kiszka@siemens.com Cc: Liran Alon liran.alon@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/vmx/nested.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
--- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -184,6 +184,7 @@ static void vmx_disable_shadow_vmcs(stru { vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_SHADOW_VMCS); vmcs_write64(VMCS_LINK_POINTER, -1ull); + vmx->nested.need_vmcs12_sync = false; }
static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) @@ -1328,6 +1329,9 @@ static void copy_shadow_to_vmcs12(struct u64 field_value; struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;
+ if (WARN_ON(!shadow_vmcs)) + return; + preempt_disable();
vmcs_load(shadow_vmcs); @@ -1366,6 +1370,9 @@ static void copy_vmcs12_to_shadow(struct u64 field_value = 0; struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;
+ if (WARN_ON(!shadow_vmcs)) + return; + vmcs_load(shadow_vmcs);
for (q = 0; q < ARRAY_SIZE(fields); q++) { @@ -4336,7 +4343,6 @@ static inline void nested_release_vmcs12 /* copy to memory all shadowed fields in case they were modified */ copy_shadow_to_vmcs12(vmx); - vmx->nested.need_vmcs12_sync = false; vmx_disable_shadow_vmcs(vmx); } vmx->nested.posted_intr_nv = -1;
From: Jan Kiszka jan.kiszka@siemens.com
commit cf64527bb33f6cec2ed50f89182fc4688d0056b6 upstream.
Letting this pend may cause nested_get_vmcs12_pages to run against an invalid state, corrupting the effective vmcs of L1.
This was triggerable in QEMU after a guest corruption in L2, followed by a L1 reset.
Signed-off-by: Jan Kiszka jan.kiszka@siemens.com Reviewed-by: Liran Alon liran.alon@oracle.com Cc: stable@vger.kernel.org Fixes: 7f7f1ba33cf2 ("KVM: x86: do not load vmcs12 pages while still in SMM") Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/kvm/vmx/nested.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -212,6 +212,8 @@ static void free_nested(struct kvm_vcpu if (!vmx->nested.vmxon && !vmx->nested.smm.vmxon) return;
+ kvm_clear_request(KVM_REQ_GET_VMCS12_PAGES, vcpu); + vmx->nested.vmxon = false; vmx->nested.smm.vmxon = false; free_vpid(vmx->nested.vpid02);
From: Paolo Bonzini pbonzini@redhat.com
commit ec269475cba7bcdd1eb8fdf8e87f4c6c81a376fe upstream.
This reverts commit 240c35a3783ab9b3a0afaba0dde7291295680a6b ("kvm: x86: Use task structs fpu field for user", 2018-11-06). The commit is broken and causes QEMU's FPU state to be destroyed when KVM_RUN is preempted.
Fixes: 240c35a3783a ("kvm: x86: Use task structs fpu field for user") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- arch/x86/include/asm/kvm_host.h | 7 ++++--- arch/x86/kvm/x86.c | 4 ++-- 2 files changed, 6 insertions(+), 5 deletions(-)
--- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -609,15 +609,16 @@ struct kvm_vcpu_arch {
/* * QEMU userspace and the guest each have their own FPU state. - * In vcpu_run, we switch between the user, maintained in the - * task_struct struct, and guest FPU contexts. While running a VCPU, - * the VCPU thread will have the guest FPU context. + * In vcpu_run, we switch between the user and guest FPU contexts. + * While running a VCPU, the VCPU thread will have the guest FPU + * context. * * Note that while the PKRU state lives inside the fpu registers, * it is switched out separately at VMENTER and VMEXIT time. The * "guest_fpu" state here contains the guest FPU context, with the * host PRKU bits. */ + struct fpu user_fpu; struct fpu *guest_fpu;
u64 xcr0; --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8172,7 +8172,7 @@ static int complete_emulated_mmio(struct static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) { preempt_disable(); - copy_fpregs_to_fpstate(¤t->thread.fpu); + copy_fpregs_to_fpstate(&vcpu->arch.user_fpu); /* PKRU is separately restored in kvm_x86_ops->run. */ __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state, ~XFEATURE_MASK_PKRU); @@ -8185,7 +8185,7 @@ static void kvm_put_guest_fpu(struct kvm { preempt_disable(); copy_fpregs_to_fpstate(vcpu->arch.guest_fpu); - copy_kernel_to_fpregs(¤t->thread.fpu.state); + copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state); preempt_enable(); ++vcpu->stat.fpu_reload; trace_kvm_fpu(0);
From: Damien Le Moal damien.lemoal@wdc.com
commit b091ac616846a1da75b1f2566b41255ce7f0e0a6 upstream.
During disk scan and revalidation done with sd_revalidate(), the zones of a zoned disk are checked using the helper function blk_revalidate_disk_zones() if a configuration change is detected (change in the number of zones or zone size). The function blk_revalidate_disk_zones() issues report_zones calls that are very large, that is, to obtain zone information for all zones of the disk with a single command. The size of the report zones command buffer necessary for such large request generally is lower than the disk max_hw_sectors and KMALLOC_MAX_SIZE (4MB) and succeeds on boot (no memory fragmentation), but often fail at run time (e.g. hot-plug event). This causes the disk revalidation to fail and the disk capacity to be changed to 0.
This problem can be avoided by using vmalloc() instead of kmalloc() for the buffer allocation. To limit the amount of memory to be allocated, this patch also introduces the arbitrary SD_ZBC_REPORT_MAX_ZONES maximum number of zones to report with a single report zones command. This limit may be lowered further to satisfy the disk max_hw_sectors limit. Finally, to ensure that the vmalloc-ed buffer can always be mapped in a request, the buffer size is further limited to at most queue_max_segments() pages, allowing successful mapping of the buffer even in the worst case scenario where none of the buffer pages are contiguous.
Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- drivers/scsi/sd_zbc.c | 104 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 75 insertions(+), 29 deletions(-)
--- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -23,6 +23,8 @@ */
#include <linux/blkdev.h> +#include <linux/vmalloc.h> +#include <linux/sched/mm.h>
#include <asm/unaligned.h>
@@ -64,7 +66,7 @@ static void sd_zbc_parse_report(struct s /** * sd_zbc_do_report_zones - Issue a REPORT ZONES scsi command. * @sdkp: The target disk - * @buf: Buffer to use for the reply + * @buf: vmalloc-ed buffer to use for the reply * @buflen: the buffer size * @lba: Start LBA of the report * @partial: Do partial report @@ -93,7 +95,6 @@ static int sd_zbc_do_report_zones(struct put_unaligned_be32(buflen, &cmd[10]); if (partial) cmd[14] = ZBC_REPORT_ZONE_PARTIAL; - memset(buf, 0, buflen);
result = scsi_execute_req(sdp, cmd, DMA_FROM_DEVICE, buf, buflen, &sshdr, @@ -117,6 +118,53 @@ static int sd_zbc_do_report_zones(struct return 0; }
+/* + * Maximum number of zones to get with one report zones command. + */ +#define SD_ZBC_REPORT_MAX_ZONES 8192U + +/** + * Allocate a buffer for report zones reply. + * @sdkp: The target disk + * @nr_zones: Maximum number of zones to report + * @buflen: Size of the buffer allocated + * + * Try to allocate a reply buffer for the number of requested zones. + * The size of the buffer allocated may be smaller than requested to + * satify the device constraint (max_hw_sectors, max_segments, etc). + * + * Return the address of the allocated buffer and update @buflen with + * the size of the allocated buffer. + */ +static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp, + unsigned int nr_zones, size_t *buflen) +{ + struct request_queue *q = sdkp->disk->queue; + size_t bufsize; + void *buf; + + /* + * Report zone buffer size should be at most 64B times the number of + * zones requested plus the 64B reply header, but should be at least + * SECTOR_SIZE for ATA devices. + * Make sure that this size does not exceed the hardware capabilities. + * Furthermore, since the report zone command cannot be split, make + * sure that the allocated buffer can always be mapped by limiting the + * number of pages allocated to the HBA max segments limit. + */ + nr_zones = min(nr_zones, SD_ZBC_REPORT_MAX_ZONES); + bufsize = roundup((nr_zones + 1) * 64, 512); + bufsize = min_t(size_t, bufsize, + queue_max_hw_sectors(q) << SECTOR_SHIFT); + bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT); + + buf = vzalloc(bufsize); + if (buf) + *buflen = bufsize; + + return buf; +} + /** * sd_zbc_report_zones - Disk report zones operation. * @disk: The target disk @@ -132,30 +180,23 @@ int sd_zbc_report_zones(struct gendisk * gfp_t gfp_mask) { struct scsi_disk *sdkp = scsi_disk(disk); - unsigned int i, buflen, nrz = *nr_zones; + unsigned int i, nrz = *nr_zones; unsigned char *buf; - size_t offset = 0; + size_t buflen = 0, offset = 0; int ret = 0;
if (!sd_is_zoned(sdkp)) /* Not a zoned device */ return -EOPNOTSUPP;
- /* - * Get a reply buffer for the number of requested zones plus a header, - * without exceeding the device maximum command size. For ATA disks, - * buffers must be aligned to 512B. - */ - buflen = min(queue_max_hw_sectors(disk->queue) << 9, - roundup((nrz + 1) * 64, 512)); - buf = kmalloc(buflen, gfp_mask); + buf = sd_zbc_alloc_report_buffer(sdkp, nrz, &buflen); if (!buf) return -ENOMEM;
ret = sd_zbc_do_report_zones(sdkp, buf, buflen, sectors_to_logical(sdkp->device, sector), true); if (ret) - goto out_free_buf; + goto out;
nrz = min(nrz, get_unaligned_be32(&buf[0]) / 64); for (i = 0; i < nrz; i++) { @@ -166,8 +207,8 @@ int sd_zbc_report_zones(struct gendisk *
*nr_zones = nrz;
-out_free_buf: - kfree(buf); +out: + kvfree(buf);
return ret; } @@ -301,8 +342,6 @@ static int sd_zbc_check_zoned_characteri return 0; }
-#define SD_ZBC_BUF_SIZE 131072U - /** * sd_zbc_check_zones - Check the device capacity and zone sizes * @sdkp: Target disk @@ -318,22 +357,28 @@ static int sd_zbc_check_zoned_characteri */ static int sd_zbc_check_zones(struct scsi_disk *sdkp, u32 *zblocks) { + size_t bufsize, buflen; + unsigned int noio_flag; u64 zone_blocks = 0; sector_t max_lba, block = 0; unsigned char *buf; unsigned char *rec; - unsigned int buf_len; - unsigned int list_length; int ret; u8 same;
+ /* Do all memory allocations as if GFP_NOIO was specified */ + noio_flag = memalloc_noio_save(); + /* Get a buffer */ - buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL); - if (!buf) - return -ENOMEM; + buf = sd_zbc_alloc_report_buffer(sdkp, SD_ZBC_REPORT_MAX_ZONES, + &bufsize); + if (!buf) { + ret = -ENOMEM; + goto out; + }
/* Do a report zone to get max_lba and the same field */ - ret = sd_zbc_do_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, 0, false); + ret = sd_zbc_do_report_zones(sdkp, buf, bufsize, 0, false); if (ret) goto out_free;
@@ -369,12 +414,12 @@ static int sd_zbc_check_zones(struct scs do {
/* Parse REPORT ZONES header */ - list_length = get_unaligned_be32(&buf[0]) + 64; + buflen = min_t(size_t, get_unaligned_be32(&buf[0]) + 64, + bufsize); rec = buf + 64; - buf_len = min(list_length, SD_ZBC_BUF_SIZE);
/* Parse zone descriptors */ - while (rec < buf + buf_len) { + while (rec < buf + buflen) { u64 this_zone_blocks = get_unaligned_be64(&rec[8]);
if (zone_blocks == 0) { @@ -390,8 +435,8 @@ static int sd_zbc_check_zones(struct scs }
if (block < sdkp->capacity) { - ret = sd_zbc_do_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, - block, true); + ret = sd_zbc_do_report_zones(sdkp, buf, bufsize, block, + true); if (ret) goto out_free; } @@ -422,7 +467,8 @@ out: }
out_free: - kfree(buf); + memalloc_noio_restore(noio_flag); + kvfree(buf);
return ret; }
From: Damien Le Moal damien.lemoal@wdc.com
commit 26202928fafad8bda8b478edb7e62c885be623d7 upstream.
Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelyhood of such failures by using kvcalloc() (that is vmalloc()) instead of allocating contiguous pages with alloc_pages().
Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- block/blk-zoned.c | 46 ++++++++++++++++++++++++++++++---------------- include/linux/blkdev.h | 5 +++++ 2 files changed, 35 insertions(+), 16 deletions(-)
--- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -13,6 +13,9 @@ #include <linux/rbtree.h> #include <linux/blkdev.h> #include <linux/blk-mq.h> +#include <linux/mm.h> +#include <linux/vmalloc.h> +#include <linux/sched/mm.h>
#include "blk.h"
@@ -372,22 +375,25 @@ static inline unsigned long *blk_alloc_z * Allocate an array of struct blk_zone to get nr_zones zone information. * The allocated array may be smaller than nr_zones. */ -static struct blk_zone *blk_alloc_zones(int node, unsigned int *nr_zones) +static struct blk_zone *blk_alloc_zones(unsigned int *nr_zones) { - size_t size = *nr_zones * sizeof(struct blk_zone); - struct page *page; - int order; - - for (order = get_order(size); order >= 0; order--) { - page = alloc_pages_node(node, GFP_NOIO | __GFP_ZERO, order); - if (page) { - *nr_zones = min_t(unsigned int, *nr_zones, - (PAGE_SIZE << order) / sizeof(struct blk_zone)); - return page_address(page); - } + struct blk_zone *zones; + size_t nrz = min(*nr_zones, BLK_ZONED_REPORT_MAX_ZONES); + + /* + * GFP_KERNEL here is meaningless as the caller task context has + * the PF_MEMALLOC_NOIO flag set in blk_revalidate_disk_zones() + * with memalloc_noio_save(). + */ + zones = kvcalloc(nrz, sizeof(struct blk_zone), GFP_KERNEL); + if (!zones) { + *nr_zones = 0; + return NULL; }
- return NULL; + *nr_zones = nrz; + + return zones; }
void blk_queue_free_zone_bitmaps(struct request_queue *q) @@ -414,6 +420,7 @@ int blk_revalidate_disk_zones(struct gen unsigned long *seq_zones_wlock = NULL, *seq_zones_bitmap = NULL; unsigned int i, rep_nr_zones = 0, z = 0, nrz; struct blk_zone *zones = NULL; + unsigned int noio_flag; sector_t sector = 0; int ret = 0;
@@ -426,6 +433,12 @@ int blk_revalidate_disk_zones(struct gen return 0; }
+ /* + * Ensure that all memory allocations in this context are done as + * if GFP_NOIO was specified. + */ + noio_flag = memalloc_noio_save(); + if (!blk_queue_is_zoned(q) || !nr_zones) { nr_zones = 0; goto update; @@ -442,7 +455,7 @@ int blk_revalidate_disk_zones(struct gen
/* Get zone information and initialize seq_zones_bitmap */ rep_nr_zones = nr_zones; - zones = blk_alloc_zones(q->node, &rep_nr_zones); + zones = blk_alloc_zones(&rep_nr_zones); if (!zones) goto out;
@@ -479,8 +492,9 @@ update: blk_mq_unfreeze_queue(q);
out: - free_pages((unsigned long)zones, - get_order(rep_nr_zones * sizeof(struct blk_zone))); + memalloc_noio_restore(noio_flag); + + kvfree(zones); kfree(seq_zones_wlock); kfree(seq_zones_bitmap);
--- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -344,6 +344,11 @@ struct queue_limits {
#ifdef CONFIG_BLK_DEV_ZONED
+/* + * Maximum number of zones to report with a single report zones command. + */ +#define BLK_ZONED_REPORT_MAX_ZONES 8192U + extern unsigned int blkdev_nr_zones(struct block_device *bdev); extern int blkdev_report_zones(struct block_device *bdev, sector_t sector, struct blk_zone *zones,
From: Kuo-Hsin Yang vovoy@chromium.org
commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa upstream.
When file refaults are detected and there are many inactive file pages, the system never reclaim anonymous pages, the file pages are dropped aggressively when there are still a lot of cold anonymous pages and system thrashes. This issue impacts the performance of applications with large executable, e.g. chrome.
With this patch, when file refault is detected, inactive_list_is_low() always returns true for file pages in get_scan_count() to enable scanning anonymous pages.
The problem can be reproduced by the following test program.
---8<--- void fallocate_file(const char *filename, off_t size) { struct stat st; int fd;
if (!stat(filename, &st) && st.st_size >= size) return;
fd = open(filename, O_WRONLY | O_CREAT, 0600); if (fd < 0) { perror("create file"); exit(1); } if (posix_fallocate(fd, 0, size)) { perror("fallocate"); exit(1); } close(fd); }
long *alloc_anon(long size) { long *start = malloc(size); memset(start, 1, size); return start; }
long access_file(const char *filename, long size, long rounds) { int fd, i; volatile char *start1, *end1, *start2; const int page_size = getpagesize(); long sum = 0;
fd = open(filename, O_RDONLY); if (fd == -1) { perror("open"); exit(1); }
/* * Some applications, e.g. chrome, use a lot of executable file * pages, map some of the pages with PROT_EXEC flag to simulate * the behavior. */ start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0); if (start1 == MAP_FAILED) { perror("mmap"); exit(1); } end1 = start1 + size / 2;
start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2); if (start2 == MAP_FAILED) { perror("mmap"); exit(1); }
for (i = 0; i < rounds; ++i) { struct timeval before, after; volatile char *ptr1 = start1, *ptr2 = start2; gettimeofday(&before, NULL); for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size) sum += *ptr1 + *ptr2; gettimeofday(&after, NULL); printf("File access time, round %d: %f (sec) ", i, (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000000.0); } return sum; }
int main(int argc, char *argv[]) { const long MB = 1024 * 1024; long anon_mb, file_mb, file_rounds; const char filename[] = "large"; long *ret1; long ret2;
if (argc != 4) { printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS "); exit(0); } anon_mb = atoi(argv[1]); file_mb = atoi(argv[2]); file_rounds = atoi(argv[3]);
fallocate_file(filename, file_mb * MB); printf("Allocate %ld MB anonymous pages ", anon_mb); ret1 = alloc_anon(anon_mb * MB); printf("Access %ld MB file pages ", file_mb); ret2 = access_file(filename, file_mb * MB, file_rounds); printf("Print result to prevent optimization: %ld ", *ret1 + ret2); return 0; } ---8<---
Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program fills ram with 2048 MB memory, access a 200 MB file for 10 times. Without this patch, the file cache is dropped aggresively and every access to the file is from disk.
$ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.489316 (sec) File access time, round 1: 2.581277 (sec) File access time, round 2: 2.487624 (sec) File access time, round 3: 2.449100 (sec) File access time, round 4: 2.420423 (sec) File access time, round 5: 2.343411 (sec) File access time, round 6: 2.454833 (sec) File access time, round 7: 2.483398 (sec) File access time, round 8: 2.572701 (sec) File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.475189 (sec) File access time, round 1: 2.440777 (sec) File access time, round 2: 2.411671 (sec) File access time, round 3: 1.955267 (sec) File access time, round 4: 0.029924 (sec) File access time, round 5: 0.000808 (sec) File access time, round 6: 0.000771 (sec) File access time, round 7: 0.000746 (sec) File access time, round 8: 0.000738 (sec) File access time, round 9: 0.000747 (sec)
Checked the swap out stats during the test [1], 19006 pages swapped out with this patch, 3418 pages swapped out without this patch. There are more swap out, but I think it's within reasonable range when file backed data set doesn't fit into the memory.
$ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB) pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages Access 2100 MB regular file pages File access time, round 0: 12.165, (sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec) active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec) active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec) active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec) active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
With 4 processes accessing non-overlapping parts of a large file, 30316 pages swapped out with this patch, 5152 pages swapped out without this patch. The swapout number is small comparing to pgpgin.
[1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty") Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty") Signed-off-by: Kuo-Hsin Yang vovoy@chromium.org Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Sonny Rao sonnyrao@chromium.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Rik van Riel riel@redhat.com Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Minchan Kim minchan@kernel.org Cc: stable@vger.kernel.org [4.12+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [backported to 4.14.y, 4.19.y, 5.1.y: adjust context] Signed-off-by: Kuo-Hsin Yang vovoy@chromium.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
--- mm/vmscan.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2176,7 +2176,7 @@ static void shrink_active_list(unsigned * 10TB 320 32GB */ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, - struct scan_control *sc, bool actual_reclaim) + struct scan_control *sc, bool trace) { enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE; struct pglist_data *pgdat = lruvec_pgdat(lruvec); @@ -2202,7 +2202,7 @@ static bool inactive_list_is_low(struct * rid of the stale workingset quickly. */ refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE); - if (file && actual_reclaim && lruvec->refaults != refaults) { + if (file && lruvec->refaults != refaults) { inactive_ratio = 0; } else { gb = (inactive + active) >> (30 - PAGE_SHIFT); @@ -2212,7 +2212,7 @@ static bool inactive_list_is_low(struct inactive_ratio = 1; }
- if (actual_reclaim) + if (trace) trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx, lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive, lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
On 7/26/19 9:24 AM, Greg Kroah-Hartman wrote:
Note, this will be the LAST 5.1.y kernel release. Everyone should move to the 5.2.y series at this point in time.
This is the start of the stable review cycle for the 5.1.21 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun 28 Jul 2019 03:21:13 PM UTC. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.1.21-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.1.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
thanks, -- Shuah
stable-rc/linux-5.1.y boot: 127 boots: 2 failed, 81 passed with 44 offline (v5.1.20-63-gf878628d8f1e)
Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-5.1.y/kernel/v5.1.2... Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-5.1.y/kernel/v5.1.20-63-gf...
Tree: stable-rc Branch: linux-5.1.y Git Describe: v5.1.20-63-gf878628d8f1e Git Commit: f878628d8f1efc883e9bd6f9f81173194b4a01dd Git URL: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git Tested: 74 unique boards, 27 SoC families, 17 builds out of 209
Boot Failures Detected:
arm64: defconfig: gcc-8: meson-g12a-x96-max: 1 failed lab
arm: multi_v7_defconfig: gcc-8: bcm4708-smartrg-sr400ac: 1 failed lab
Offline Platforms:
arm64:
defconfig: gcc-8 meson-axg-s400: 1 offline lab meson-g12a-u200: 1 offline lab meson-g12a-x96-max: 1 offline lab meson-gxbb-odroidc2: 1 offline lab meson-gxl-s905d-p230: 1 offline lab meson-gxl-s905x-libretech-cc: 1 offline lab meson-gxl-s905x-nexbox-a95x: 1 offline lab meson-gxl-s905x-p212: 1 offline lab meson-gxm-nexbox-a1: 1 offline lab rk3399-firefly: 1 offline lab sun50i-a64-pine64-plus: 1 offline lab
mips:
pistachio_defconfig: gcc-8 pistachio_marduk: 1 offline lab
arm:
exynos_defconfig: gcc-8 exynos5250-arndale: 1 offline lab exynos5420-arndale-octa: 1 offline lab exynos5800-peach-pi: 1 offline lab
multi_v7_defconfig: gcc-8 bcm72521-bcm97252sffe: 1 offline lab bcm7445-bcm97445c: 1 offline lab exynos5250-arndale: 1 offline lab exynos5420-arndale-octa: 1 offline lab exynos5800-peach-pi: 1 offline lab imx6dl-wandboard_dual: 1 offline lab imx6dl-wandboard_solo: 1 offline lab imx6q-wandboard: 1 offline lab imx7s-warp: 1 offline lab meson8b-odroidc1: 1 offline lab omap3-beagle: 1 offline lab omap4-panda: 1 offline lab qcom-apq8064-ifc6410: 1 offline lab stih410-b2120: 1 offline lab sun4i-a10-cubieboard: 1 offline lab sun7i-a20-bananapi: 1 offline lab vf610-colibri-eval-v3: 1 offline lab
omap2plus_defconfig: gcc-8 omap3-beagle: 1 offline lab omap4-panda: 1 offline lab
qcom_defconfig: gcc-8 qcom-apq8064-ifc6410: 1 offline lab
davinci_all_defconfig: gcc-8 da850-evm: 1 offline lab dm365evm,legacy: 1 offline lab
imx_v6_v7_defconfig: gcc-8 imx6dl-wandboard_dual: 1 offline lab imx6dl-wandboard_solo: 1 offline lab imx6q-wandboard: 1 offline lab imx7s-warp: 1 offline lab vf610-colibri-eval-v3: 1 offline lab
sunxi_defconfig: gcc-8 sun4i-a10-cubieboard: 1 offline lab sun7i-a20-bananapi: 1 offline lab
--- For more info write to info@kernelci.org
On Fri, 26 Jul 2019 at 20:59, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
Note, this will be the LAST 5.1.y kernel release. Everyone should move to the 5.2.y series at this point in time.
This is the start of the stable review cycle for the 5.1.21 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun 28 Jul 2019 03:21:13 PM UTC. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.1.21-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.1.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Summary ------------------------------------------------------------------------
kernel: 5.1.21-rc1 git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git branch: linux-5.1.y git commit: f878628d8f1efc883e9bd6f9f81173194b4a01dd git describe: v5.1.20-63-gf878628d8f1e Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-5.1-oe/build/v5.1.20-63-g...
No regressions (compared to build v5.1.20)
No fixes (compared to build v5.1.20)
Ran 21561 total tests in the following environments and test suites.
Environments -------------- - dragonboard-410c - hi6220-hikey - i386 - juno-r2 - qemu_arm - qemu_arm64 - qemu_i386 - qemu_x86_64 - x15 - x86
Test Suites ----------- * build * install-android-platform-tools-r2600 * kselftest * libgpiod * libhugetlbfs * ltp-cap_bounds-tests * ltp-commands-tests * ltp-containers-tests * ltp-cpuhotplug-tests * ltp-cve-tests * ltp-dio-tests * ltp-fcntl-locktests-tests * ltp-filecaps-tests * ltp-fs-tests * ltp-fs_bind-tests * ltp-fs_perms_simple-tests * ltp-fsx-tests * ltp-hugetlb-tests * ltp-io-tests * ltp-ipc-tests * ltp-math-tests * ltp-mm-tests * ltp-nptl-tests * ltp-pty-tests * ltp-sched-tests * ltp-securebits-tests * ltp-syscalls-tests * ltp-timers-tests * network-basic-tests * perf * spectre-meltdown-checker-test * v4l2-compliance * ltp-open-posix-tests * kvm-unit-tests * kselftest-vsyscall-mode-native * kselftest-vsyscall-mode-none
On 7/26/19 8:24 AM, Greg Kroah-Hartman wrote:
Note, this will be the LAST 5.1.y kernel release. Everyone should move to the 5.2.y series at this point in time.
This is the start of the stable review cycle for the 5.1.21 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun 28 Jul 2019 03:21:13 PM UTC. Anything received after that time might be too late.
Build results: total: 159 pass: 159 fail: 0 Qemu test results: total: 364 pass: 364 fail: 0
Guenter
On 26/07/2019 16:24, Greg Kroah-Hartman wrote:
Note, this will be the LAST 5.1.y kernel release. Everyone should move to the 5.2.y series at this point in time.
This is the start of the stable review cycle for the 5.1.21 release. There are 62 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun 28 Jul 2019 03:21:13 PM UTC. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.1.21-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.1.y and the diffstat can be found below.
thanks,
greg k-h
All tests are passing for Tegra ...
Test results for stable-v5.1: 12 builds: 12 pass, 0 fail 22 boots: 22 pass, 0 fail 32 tests: 32 pass, 0 fail
Linux version: 5.1.21-g4a9b1eb8bc3b Boards tested: tegra124-jetson-tk1, tegra186-p2771-0000, tegra194-p2972-0000, tegra20-ventana, tegra210-p2371-2180, tegra30-cardhu-a04
Cheers Jon
linux-stable-mirror@lists.linaro.org