On 21/11/2017 15:22, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 12:44:10PM +0200, Mark Bloch wrote:
>> Hi,
>>
>> On 21/11/2017 12:26, Leon Romanovsky wrote:
>>> From: Daniel Jurgens <danielj(a)mellanox.com>
>>>
>>> For now the only LSM security enforcement mechanism available is
>>> specific to InfiniBand. Bypass enforcement for non-IB link types.
>>> This fixes a regression where modify_qp fails for iWARP because
>>> querying the PKEY returns -EINVAL.
>>>
>>> Cc: Paul Moore <paul(a)paul-moore.com>
>>> Cc: Don Dutile <ddutile(a)redhat.com>
>>> Cc: stable(a)vger.kernel.org
>>> Reported-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
>>> Fixes: d291f1a65232("IB/core: Enforce PKey security on QPs")
>>> Fixes: 47a2b338fe63("IB/core: Enforce security on management datagrams")
>>> Signed-off-by: Daniel Jurgens <danielj(a)mellanox.com>
>>> Reviewed-by: Parav Pandit <parav(a)mellanox.com>
>>> Tested-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
>>> Signed-off-by: Leon Romanovsky <leon(a)kernel.org>
>>> ---
>>> drivers/infiniband/core/security.c | 9 +++++++++
>>> 1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c
>>> index 23278ed5be45..314bf1137c7b 100644
>>> --- a/drivers/infiniband/core/security.c
>>> +++ b/drivers/infiniband/core/security.c
>>> @@ -417,8 +417,17 @@ void ib_close_shared_qp_security(struct ib_qp_security *sec)
>>>
>>> int ib_create_qp_security(struct ib_qp *qp, struct ib_device *dev)
>>> {
>>> + u8 i = rdma_start_port(dev);
>>> + bool is_ib = false;
>>> int ret;
>>>
>>> + while (i <= rdma_end_port(dev) && !is_ib)
>>> + is_ib = rdma_protocol_ib(dev, i++);
>>> +
>>
>> What happens if we have mixed port types?
>
> We will have is_ib set and qp_sec will be allocated on device layer and
> not on port level, but because pkeys are IB specific term (at least, I
> didn't find any mentioning in RoCE spec), the modify_qp won't query
> for PKEYS.
>
What is port level for qp_sec?
I just gave mlx4 as an example, but I was talking more about the ability of the RDMA
subsystem to support mixed port types, so in this case, if in the future a vendor
will come with an ib_device with 2 ports, one is IB and one is iWARP bad things will happen.
>> I believe mlx4 can expose two ports where each port uses a different ll protocol.
>> Was that changed?
>
> It is still true.
>
>>
>>> + /* If this isn't an IB device don't create the security context */
>>> + if (!is_ib)
>>> + return 0;
>>> +
>>> qp->qp_sec = kzalloc(sizeof(*qp->qp_sec), GFP_KERNEL);
>>> if (!qp->qp_sec)
>>> return -ENOMEM;
>>>
>>
>> Mark.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo(a)vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
Mark
From: Yazen Ghannam <yazen.ghannam(a)amd.com>
[Upstream commit d65dfc81bb3894fdb68cbc74bbf5fb48d2354071]
The AMD severity grading function was introduced in kernel 4.1. The
current logic can possibly give MCE_AR_SEVERITY for uncorrectable
errors in kernel context. The system may then get stuck in a loop as
memory_failure() will try to handle the bad kernel memory and find it
busy.
Return MCE_PANIC_SEVERITY for all UC errors IN_KERNEL context on AMD
systems.
After:
b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
was accepted in v4.6, this issue was masked because of the tail-end attempt
at kernel mode recovery in the #MC handler.
However, uncorrectable errors IN_KERNEL context should always be considered
unrecoverable and cause a panic.
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Cc: <stable(a)vger.kernel.org> # 4.1.x, 4.4.x
Fixes: bf80bbd7dcf5 (x86/mce: Add an AMD severities-grading function)
---
Same as
d65dfc81bb38 ("x86/MCE/AMD: Always give panic severity for UC errors in kernel context")
but fixed up to apply to v4.1 and v4.4 stable branches.
arch/x86/kernel/cpu/mcheck/mce-severity.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c222071..a9287f0f06f2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -200,6 +200,9 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
if (m->status & MCI_STATUS_UC) {
+ if (ctx == IN_KERNEL)
+ return MCE_PANIC_SEVERITY;
+
/*
* On older systems where overflow_recov flag is not present, we
* should simply panic if an error overflow occurs. If
@@ -207,10 +210,6 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
* to at least kill process to prolong system operation.
*/
if (mce_flags.overflow_recov) {
- /* software can try to contain */
- if (!(m->mcgstatus & MCG_STATUS_RIPV) && (ctx == IN_KERNEL))
- return MCE_PANIC_SEVERITY;
-
/* kill current process */
return MCE_AR_SEVERITY;
} else {
--
2.14.1
This is a note to let you know that I've just added the patch titled
vlan: fix a use-after-free in vlan_device_event()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vlan-fix-a-use-after-free-in-vlan_device_event.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Cong Wang <xiyou.wangcong(a)gmail.com>
Date: Thu, 9 Nov 2017 16:43:13 -0800
Subject: vlan: fix a use-after-free in vlan_device_event()
From: Cong Wang <xiyou.wangcong(a)gmail.com>
[ Upstream commit 052d41c01b3a2e3371d66de569717353af489d63 ]
After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:
RCU_INIT_POINTER(dev->vlan_info, NULL);
call_rcu(&vlan_info->rcu, vlan_info_rcu_free);
However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
goto out;
grp = &vlan_info->grp;
Depends on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right following this vlan_vid_del().
Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().
Reported-by: Fengguang Wu <fengguang.wu(a)intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Tested-by: Fengguang Wu <fengguang.wu(a)intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/8021q/vlan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -376,6 +376,9 @@ static int vlan_device_event(struct noti
dev->name);
vlan_vid_add(dev, htons(ETH_P_8021Q), 0);
}
+ if (event == NETDEV_DOWN &&
+ (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER))
+ vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
@@ -420,9 +423,6 @@ static int vlan_device_event(struct noti
break;
case NETDEV_DOWN:
- if (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
- vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
-
/* Put all VLANs for this dev in the down state too. */
vlan_group_for_each_dev(grp, i, vlandev) {
flgs = vlandev->flags;
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-3.18/ipv6-dccp-do-not-inherit-ipv6_mc_list-from-parent.patch
queue-3.18/vlan-fix-a-use-after-free-in-vlan_device_event.patch
This is a note to let you know that I've just added the patch titled
tcp: do not mangle skb->cb[] in tcp_make_synack()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Eric Dumazet <edumazet(a)google.com>
Date: Thu, 2 Nov 2017 12:30:25 -0700
Subject: tcp: do not mangle skb->cb[] in tcp_make_synack()
From: Eric Dumazet <edumazet(a)google.com>
[ Upstream commit 3b11775033dc87c3d161996c54507b15ba26414a ]
Christoph Paasch sent a patch to address the following issue :
tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()
tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.
tcp_make_synack() should not use tcp_init_nondata_skb() :
tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())
This patch fixes the issue and should even save few cpu cycles ;)
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Reviewed-by: Christoph Paasch <cpaasch(a)apple.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp_output.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2911,13 +2911,8 @@ struct sk_buff *tcp_make_synack(struct s
tcp_ecn_make_synack(req, th, sk);
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
- /* Setting of flags are superfluous here for callers (and ECE is
- * not even correctly set)
- */
- tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
- TCPHDR_SYN | TCPHDR_ACK);
-
- th->seq = htonl(TCP_SKB_CB(skb)->seq);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ th->seq = htonl(tcp_rsk(req)->snt_isn);
/* XXX data is queued and acked as is. No buffer/window check */
th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
Patches currently in stable-queue which might be from edumazet(a)google.com are
queue-3.18/tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch
queue-3.18/ipv6-dccp-do-not-inherit-ipv6_mc_list-from-parent.patch
This is a note to let you know that I've just added the patch titled
sctp: do not peel off an assoc from one netns to another one
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sctp-do-not-peel-off-an-assoc-from-one-netns-to-another-one.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Xin Long <lucien.xin(a)gmail.com>
Date: Tue, 17 Oct 2017 23:26:10 +0800
Subject: sctp: do not peel off an assoc from one netns to another one
From: Xin Long <lucien.xin(a)gmail.com>
[ Upstream commit df80cd9b28b9ebaa284a41df611dbf3a2d05ca74 ]
Now when peeling off an association to the sock in another netns, all
transports in this assoc are not to be rehashed and keep use the old
key in hashtable.
As a transport uses sk->net as the hash key to insert into hashtable,
it would miss removing these transports from hashtable due to the new
netns when closing the sock and all transports are being freeed, then
later an use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue since very beginning, ChunYu found it with
syzkaller fuzz testing with this series:
socket$inet6_sctp()
bind$inet6()
sendto$inet6()
unshare(0x40000000)
getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST()
getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF()
This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transport would not
go out-sync with the key in hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
Reported-by: ChunYu Wang <chunwang(a)redhat.com>
Signed-off-by: Xin Long <lucien.xin(a)gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner(a)gmail.com>
Acked-by: Neil Horman <nhorman(a)tuxdriver.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/sctp/socket.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4466,6 +4466,10 @@ int sctp_do_peeloff(struct sock *sk, sct
struct socket *sock;
int err = 0;
+ /* Do not peel off from one netns to another one. */
+ if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
+ return -EINVAL;
+
if (!asoc)
return -EINVAL;
Patches currently in stable-queue which might be from lucien.xin(a)gmail.com are
queue-3.18/sctp-do-not-peel-off-an-assoc-from-one-netns-to-another-one.patch
This is a note to let you know that I've just added the patch titled
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
netfilter-ipvs-clear-ipvs_property-flag-when-skb-net-namespace-changed.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Ye Yin <hustcat(a)gmail.com>
Date: Thu, 26 Oct 2017 16:57:05 +0800
Subject: netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
From: Ye Yin <hustcat(a)gmail.com>
[ Upstream commit 2b5ec1a5f9738ee7bf8f5ec0526e75e00362c48f ]
When run ipvs in two different network namespace at the same host, and one
ipvs transport network traffic to the other network namespace ipvs.
'ipvs_property' flag will make the second ipvs take no effect. So we should
clear 'ipvs_property' when SKB network namespace changed.
Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Ye Yin <hustcat(a)gmail.com>
Signed-off-by: Wei Zhou <chouryzhou(a)gmail.com>
Signed-off-by: Julian Anastasov <ja(a)ssi.bg>
Signed-off-by: Simon Horman <horms(a)verge.net.au>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/skbuff.h | 7 +++++++
net/core/skbuff.c | 1 +
2 files changed, 8 insertions(+)
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3117,6 +3117,13 @@ static inline void nf_reset_trace(struct
#endif
}
+static inline void ipvs_reset(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_IP_VS)
+ skb->ipvs_property = 0;
+#endif
+}
+
/* Note: This doesn't put any conntrack and bridge info in dst. */
static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
bool copy)
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4069,6 +4069,7 @@ void skb_scrub_packet(struct sk_buff *sk
if (!xnet)
return;
+ ipvs_reset(skb);
skb_orphan(skb);
skb->mark = 0;
}
Patches currently in stable-queue which might be from hustcat(a)gmail.com are
queue-3.18/netfilter-ipvs-clear-ipvs_property-flag-when-skb-net-namespace-changed.patch
This is a note to let you know that I've just added the patch titled
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
net-sctp-always-set-scope_id-in-sctp_inet6_skb_msgname.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:28:16 CET 2017
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
Date: Wed, 15 Nov 2017 22:17:48 -0600
Subject: net/sctp: Always set scope_id in sctp_inet6_skb_msgname
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 7c8a61d9ee1df0fb4747879fa67a99614eb62fec ]
Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.
Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.
With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.
That is almost reasonable as scope_id's are meaniningful only for link
local addresses. Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.
There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.
Fixes: 372f525b495c ("SCTP: Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko <glider(a)google.com>
Tested-by: Alexander Potapenko <glider(a)google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/sctp/ipv6.c | 2 ++
1 file changed, 2 insertions(+)
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -805,6 +805,8 @@ static void sctp_inet6_skb_msgname(struc
if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
struct sctp_ulpevent *ev = sctp_skb2event(skb);
addr->v6.sin6_scope_id = ev->iif;
+ } else {
+ addr->v6.sin6_scope_id = 0;
}
}
Patches currently in stable-queue which might be from ebiederm(a)xmission.com are
queue-3.18/net-sctp-always-set-scope_id-in-sctp_inet6_skb_msgname.patch
This is a note to let you know that I've just added the patch titled
fealnx: Fix building error on MIPS
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
fealnx-fix-building-error-on-mips.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Huacai Chen <chenhc(a)lemote.com>
Date: Thu, 16 Nov 2017 11:07:15 +0800
Subject: fealnx: Fix building error on MIPS
From: Huacai Chen <chenhc(a)lemote.com>
[ Upstream commit cc54c1d32e6a4bb3f116721abf900513173e4d02 ]
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the LONG macro, which conflicts with the LONG enum
in drivers/net/ethernet/fealnx.c.
Signed-off-by: Huacai Chen <chenhc(a)lemote.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/net/ethernet/fealnx.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
RXFSD = 0x00000800, /* first descriptor */
RXLSD = 0x00000400, /* last descriptor */
ErrorSummary = 0x80, /* error summary */
- RUNT = 0x40, /* runt packet received */
- LONG = 0x20, /* long packet received */
+ RUNTPKT = 0x40, /* runt packet received */
+ LONGPKT = 0x20, /* long packet received */
FAE = 0x10, /* frame align error */
CRC = 0x08, /* crc error */
RXER = 0x04, /* receive error */
@@ -1633,7 +1633,7 @@ static int netdev_rx(struct net_device *
dev->name, rx_status);
dev->stats.rx_errors++; /* end of a packet. */
- if (rx_status & (LONG | RUNT))
+ if (rx_status & (LONGPKT | RUNTPKT))
dev->stats.rx_length_errors++;
if (rx_status & RXER)
dev->stats.rx_frame_errors++;
Patches currently in stable-queue which might be from chenhc(a)lemote.com are
queue-3.18/fealnx-fix-building-error-on-mips.patch
This is a note to let you know that I've just added the patch titled
af_netlink: ensure that NLMSG_DONE never fails in dumps
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
af_netlink-ensure-that-nlmsg_done-never-fails-in-dumps.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 9 Nov 2017 13:04:44 +0900
Subject: af_netlink: ensure that NLMSG_DONE never fails in dumps
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
[ Upstream commit 0642840b8bb008528dbdf929cec9f65ac4231ad0 ]
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.
However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:
nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.
In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.
This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/netlink/af_netlink.c | 17 +++++++++++------
net/netlink/af_netlink.h | 1 +
2 files changed, 12 insertions(+), 6 deletions(-)
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1927,7 +1927,7 @@ static int netlink_dump(struct sock *sk)
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
struct module *module;
- int len, err = -ENOBUFS;
+ int err = -ENOBUFS;
int alloc_size;
mutex_lock(nlk->cb_mutex);
@@ -1965,9 +1965,11 @@ static int netlink_dump(struct sock *sk)
goto errout_skb;
netlink_skb_set_owner_r(skb, sk);
- len = cb->dump(skb, cb);
+ if (nlk->dump_done_errno > 0)
+ nlk->dump_done_errno = cb->dump(skb, cb);
- if (len > 0) {
+ if (nlk->dump_done_errno > 0 ||
+ skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
mutex_unlock(nlk->cb_mutex);
if (sk_filter(sk, skb))
@@ -1977,13 +1979,15 @@ static int netlink_dump(struct sock *sk)
return 0;
}
- nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
- if (!nlh)
+ nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
+ sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+ if (WARN_ON(!nlh))
goto errout_skb;
nl_dump_check_consistent(cb, nlh);
- memcpy(nlmsg_data(nlh), &len, sizeof(len));
+ memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
+ sizeof(nlk->dump_done_errno));
if (sk_filter(sk, skb))
kfree_skb(skb);
@@ -2048,6 +2052,7 @@ int __netlink_dump_start(struct sock *ss
cb->skb = skb;
nlk->cb_running = true;
+ nlk->dump_done_errno = INT_MAX;
mutex_unlock(nlk->cb_mutex);
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -35,6 +35,7 @@ struct netlink_sock {
size_t max_recvmsg_len;
wait_queue_head_t wait;
bool cb_running;
+ int dump_done_errno;
struct netlink_callback cb;
struct mutex *cb_mutex;
struct mutex cb_def_mutex;
Patches currently in stable-queue which might be from Jason(a)zx2c4.com are
queue-3.18/af_netlink-ensure-that-nlmsg_done-never-fails-in-dumps.patch