The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d9032dba5a2b2bbf0fdce67c8795300ec9923b43 Mon Sep 17 00:00:00 2001
From: Guangbin Huang <huangguangbin2(a)huawei.com>
Date: Sat, 27 Feb 2021 11:05:58 +0800
Subject: [PATCH] net: phy: fix save wrong speed and duplex problem if autoneg
is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.
And duplex setting has same problem too.
However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.
Fixes: 51e2a3846eab ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2(a)huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong(a)huawei.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d314..fc2e7cb5b2e5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -276,14 +276,16 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev,
phydev->autoneg = autoneg;
- phydev->speed = speed;
+ if (autoneg == AUTONEG_DISABLE) {
+ phydev->speed = speed;
+ phydev->duplex = duplex;
+ }
linkmode_copy(phydev->advertising, advertising);
linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
phydev->advertising, autoneg == AUTONEG_ENABLE);
- phydev->duplex = duplex;
phydev->master_slave_set = cmd->base.master_slave_cfg;
phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d9032dba5a2b2bbf0fdce67c8795300ec9923b43 Mon Sep 17 00:00:00 2001
From: Guangbin Huang <huangguangbin2(a)huawei.com>
Date: Sat, 27 Feb 2021 11:05:58 +0800
Subject: [PATCH] net: phy: fix save wrong speed and duplex problem if autoneg
is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.
And duplex setting has same problem too.
However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.
Fixes: 51e2a3846eab ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2(a)huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong(a)huawei.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d314..fc2e7cb5b2e5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -276,14 +276,16 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev,
phydev->autoneg = autoneg;
- phydev->speed = speed;
+ if (autoneg == AUTONEG_DISABLE) {
+ phydev->speed = speed;
+ phydev->duplex = duplex;
+ }
linkmode_copy(phydev->advertising, advertising);
linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
phydev->advertising, autoneg == AUTONEG_ENABLE);
- phydev->duplex = duplex;
phydev->master_slave_set = cmd->base.master_slave_cfg;
phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d9032dba5a2b2bbf0fdce67c8795300ec9923b43 Mon Sep 17 00:00:00 2001
From: Guangbin Huang <huangguangbin2(a)huawei.com>
Date: Sat, 27 Feb 2021 11:05:58 +0800
Subject: [PATCH] net: phy: fix save wrong speed and duplex problem if autoneg
is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.
And duplex setting has same problem too.
However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.
Fixes: 51e2a3846eab ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2(a)huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong(a)huawei.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d314..fc2e7cb5b2e5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -276,14 +276,16 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev,
phydev->autoneg = autoneg;
- phydev->speed = speed;
+ if (autoneg == AUTONEG_DISABLE) {
+ phydev->speed = speed;
+ phydev->duplex = duplex;
+ }
linkmode_copy(phydev->advertising, advertising);
linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
phydev->advertising, autoneg == AUTONEG_ENABLE);
- phydev->duplex = duplex;
phydev->master_slave_set = cmd->base.master_slave_cfg;
phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d9032dba5a2b2bbf0fdce67c8795300ec9923b43 Mon Sep 17 00:00:00 2001
From: Guangbin Huang <huangguangbin2(a)huawei.com>
Date: Sat, 27 Feb 2021 11:05:58 +0800
Subject: [PATCH] net: phy: fix save wrong speed and duplex problem if autoneg
is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.
And duplex setting has same problem too.
However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.
Fixes: 51e2a3846eab ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2(a)huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong(a)huawei.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d314..fc2e7cb5b2e5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -276,14 +276,16 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev,
phydev->autoneg = autoneg;
- phydev->speed = speed;
+ if (autoneg == AUTONEG_DISABLE) {
+ phydev->speed = speed;
+ phydev->duplex = duplex;
+ }
linkmode_copy(phydev->advertising, advertising);
linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
phydev->advertising, autoneg == AUTONEG_ENABLE);
- phydev->duplex = duplex;
phydev->master_slave_set = cmd->base.master_slave_cfg;
phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d9032dba5a2b2bbf0fdce67c8795300ec9923b43 Mon Sep 17 00:00:00 2001
From: Guangbin Huang <huangguangbin2(a)huawei.com>
Date: Sat, 27 Feb 2021 11:05:58 +0800
Subject: [PATCH] net: phy: fix save wrong speed and duplex problem if autoneg
is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.
And duplex setting has same problem too.
However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.
Fixes: 51e2a3846eab ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2(a)huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong(a)huawei.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d314..fc2e7cb5b2e5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -276,14 +276,16 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev,
phydev->autoneg = autoneg;
- phydev->speed = speed;
+ if (autoneg == AUTONEG_DISABLE) {
+ phydev->speed = speed;
+ phydev->duplex = duplex;
+ }
linkmode_copy(phydev->advertising, advertising);
linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
phydev->advertising, autoneg == AUTONEG_ENABLE);
- phydev->duplex = duplex;
phydev->master_slave_set = cmd->base.master_slave_cfg;
phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
(cherry picked from commit c119565a15a628efdfa51352f9f6c5186e506a1c)
On book3s/32, page protection is defined by the PP bits in the PTE
which provide the following protection depending on the access
keys defined in the matching segment register:
- PP 00 means RW with key 0 and N/A with key 1.
- PP 01 means RW with key 0 and RO with key 1.
- PP 10 means RW with both key 0 and key 1.
- PP 11 means RO with both key 0 and key 1.
Since the implementation of kernel userspace access protection,
PP bits have been set as follows:
- PP00 for pages without _PAGE_USER
- PP01 for pages with _PAGE_USER and _PAGE_RW
- PP11 for pages with _PAGE_USER and without _PAGE_RW
For kernelspace segments, kernel accesses are performed with key 0
and user accesses are performed with key 1. As PP00 is used for
non _PAGE_USER pages, user can't access kernel pages not flagged
_PAGE_USER while kernel can.
For userspace segments, both kernel and user accesses are performed
with key 0, therefore pages not flagged _PAGE_USER are still
accessible to the user.
This shouldn't be an issue, because userspace is expected to be
accessible to the user. But unlike most other architectures, powerpc
implements PROT_NONE protection by removing _PAGE_USER flag instead of
flagging the page as not valid. This means that pages in userspace
that are not flagged _PAGE_USER shall remain inaccessible.
To get the expected behaviour, just mimic other architectures in the
TLB miss handler by checking _PAGE_USER permission on userspace
accesses as if it was the _PAGE_PRESENT bit.
Note that this problem only is only for 603 cores. The 604+ have
an hash table, and hash_page() function already implement the
verification of _PAGE_USER permission on userspace pages.
Fixes: f342adca3afc ("powerpc/32s: Prepare Kernel Userspace Access Protection")
Change-Id: I68bc5e5ff4542bdfcdcd12923fa96a5811707475
Cc: stable(a)vger.kernel.org # v5.2+
Reported-by: Christoph Plattner <christoph.plattner(a)thalesgroup.com>
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/4a0c6e3bb8f0c162457bf54d9bc6fd8d7b55129f.16121609…
---
arch/powerpc/kernel/head_32.S | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 126ba5438430..edaab1142498 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -418,10 +418,11 @@ InstructionTLBMiss:
cmplw 0,r1,r3
#endif
mfspr r2, SPRN_SPRG_PGDIR
- li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
+ li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC | _PAGE_USER
#if defined(CONFIG_MODULES) || defined(CONFIG_DEBUG_PAGEALLOC)
bge- 112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */
+ li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
addi r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l /* kernel page table */
#endif
112: rlwimi r2,r3,12,20,29 /* insert top 10 bits of address */
@@ -480,9 +481,10 @@ DataLoadTLBMiss:
lis r1,PAGE_OFFSET@h /* check if kernel address */
cmplw 0,r1,r3
mfspr r2, SPRN_SPRG_PGDIR
- li r1, _PAGE_PRESENT | _PAGE_ACCESSED
+ li r1, _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
bge- 112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */
+ li r1, _PAGE_PRESENT | _PAGE_ACCESSED
addi r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l /* kernel page table */
112: rlwimi r2,r3,12,20,29 /* insert top 10 bits of address */
lwz r2,0(r2) /* get pmd entry */
@@ -556,9 +558,10 @@ DataStoreTLBMiss:
lis r1,PAGE_OFFSET@h /* check if kernel address */
cmplw 0,r1,r3
mfspr r2, SPRN_SPRG_PGDIR
- li r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
+ li r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
bge- 112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */
+ li r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
addi r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l /* kernel page table */
112: rlwimi r2,r3,12,20,29 /* insert top 10 bits of address */
lwz r2,0(r2) /* get pmd entry */
--
2.25.0
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 03a3ca37e4c6478e3a84f04c8429dd5889e107fd Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw(a)strlen.de>
Date: Wed, 24 Feb 2021 17:23:19 +0100
Subject: [PATCH] netfilter: nf_nat: undo erroneous tcp edemux lookup
Under extremely rare conditions TCP early demux will retrieve the wrong
socket.
1. local machine establishes a connection to a remote server, S, on port
p.
This gives:
laddr:lport -> S:p
... both in tcp and conntrack.
2. local machine establishes a connection to host H, on port p2.
2a. TCP stack choses same laddr:lport, so we have
laddr:lport -> H:p2 from TCP point of view.
2b). There is a destination NAT rewrite in place, translating
H:p2 to S:p. This results in following conntrack entries:
I) laddr:lport -> S:p (origin) S:p -> laddr:lport (reply)
II) laddr:lport -> H:p2 (origin) S:p -> laddr:lport2 (reply)
NAT engine has rewritten laddr:lport to laddr:lport2 to map
the reply packet to the correct origin.
When server sends SYN/ACK to laddr:lport2, the PREROUTING hook
will undo-the SNAT transformation, rewriting IP header to
S:p -> laddr:lport
This causes TCP early demux to associate the skb with the TCP socket
of the first connection.
The INPUT hook will then reverse the DNAT transformation, rewriting
the IP header to H:p2 -> laddr:lport.
Because packet ends up with the wrong socket, the new connection
never completes: originator stays in SYN_SENT and conntrack entry
remains in SYN_RECV until timeout, and responder retransmits SYN/ACK
until it gives up.
To resolve this, orphan the skb after the input rewrite:
Because the source IP address changed, the socket must be incorrect.
We can't move the DNAT undo to prerouting due to backwards
compatibility, doing so will make iptables/nftables rules to no longer
match the way they did.
After orphan, the packet will be handed to the next protocol layer
(tcp, udp, ...) and that will repeat the socket lookup just like as if
early demux was disabled.
Fixes: 41063e9dd1195 ("ipv4: Early TCP socket demux.")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427
Signed-off-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index e87b6bd6b3cd..4731d21fc3ad 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -646,8 +646,8 @@ nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
}
static unsigned int
-nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
- const struct nf_hook_state *state)
+nf_nat_ipv4_pre_routing(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
{
unsigned int ret;
__be32 daddr = ip_hdr(skb)->daddr;
@@ -659,6 +659,23 @@ nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
return ret;
}
+static unsigned int
+nf_nat_ipv4_local_in(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
+{
+ __be32 saddr = ip_hdr(skb)->saddr;
+ struct sock *sk = skb->sk;
+ unsigned int ret;
+
+ ret = nf_nat_ipv4_fn(priv, skb, state);
+
+ if (ret == NF_ACCEPT && sk && saddr != ip_hdr(skb)->saddr &&
+ !inet_sk_transparent(sk))
+ skb_orphan(skb); /* TCP edemux obtained wrong socket */
+
+ return ret;
+}
+
static unsigned int
nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
@@ -736,7 +753,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
/* Before packet filtering, change destination */
{
- .hook = nf_nat_ipv4_in,
+ .hook = nf_nat_ipv4_pre_routing,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_PRE_ROUTING,
.priority = NF_IP_PRI_NAT_DST,
@@ -757,7 +774,7 @@ static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
},
/* After packet filtering, change source */
{
- .hook = nf_nat_ipv4_fn,
+ .hook = nf_nat_ipv4_local_in,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC,
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 03a3ca37e4c6478e3a84f04c8429dd5889e107fd Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw(a)strlen.de>
Date: Wed, 24 Feb 2021 17:23:19 +0100
Subject: [PATCH] netfilter: nf_nat: undo erroneous tcp edemux lookup
Under extremely rare conditions TCP early demux will retrieve the wrong
socket.
1. local machine establishes a connection to a remote server, S, on port
p.
This gives:
laddr:lport -> S:p
... both in tcp and conntrack.
2. local machine establishes a connection to host H, on port p2.
2a. TCP stack choses same laddr:lport, so we have
laddr:lport -> H:p2 from TCP point of view.
2b). There is a destination NAT rewrite in place, translating
H:p2 to S:p. This results in following conntrack entries:
I) laddr:lport -> S:p (origin) S:p -> laddr:lport (reply)
II) laddr:lport -> H:p2 (origin) S:p -> laddr:lport2 (reply)
NAT engine has rewritten laddr:lport to laddr:lport2 to map
the reply packet to the correct origin.
When server sends SYN/ACK to laddr:lport2, the PREROUTING hook
will undo-the SNAT transformation, rewriting IP header to
S:p -> laddr:lport
This causes TCP early demux to associate the skb with the TCP socket
of the first connection.
The INPUT hook will then reverse the DNAT transformation, rewriting
the IP header to H:p2 -> laddr:lport.
Because packet ends up with the wrong socket, the new connection
never completes: originator stays in SYN_SENT and conntrack entry
remains in SYN_RECV until timeout, and responder retransmits SYN/ACK
until it gives up.
To resolve this, orphan the skb after the input rewrite:
Because the source IP address changed, the socket must be incorrect.
We can't move the DNAT undo to prerouting due to backwards
compatibility, doing so will make iptables/nftables rules to no longer
match the way they did.
After orphan, the packet will be handed to the next protocol layer
(tcp, udp, ...) and that will repeat the socket lookup just like as if
early demux was disabled.
Fixes: 41063e9dd1195 ("ipv4: Early TCP socket demux.")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427
Signed-off-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index e87b6bd6b3cd..4731d21fc3ad 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -646,8 +646,8 @@ nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
}
static unsigned int
-nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
- const struct nf_hook_state *state)
+nf_nat_ipv4_pre_routing(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
{
unsigned int ret;
__be32 daddr = ip_hdr(skb)->daddr;
@@ -659,6 +659,23 @@ nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
return ret;
}
+static unsigned int
+nf_nat_ipv4_local_in(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
+{
+ __be32 saddr = ip_hdr(skb)->saddr;
+ struct sock *sk = skb->sk;
+ unsigned int ret;
+
+ ret = nf_nat_ipv4_fn(priv, skb, state);
+
+ if (ret == NF_ACCEPT && sk && saddr != ip_hdr(skb)->saddr &&
+ !inet_sk_transparent(sk))
+ skb_orphan(skb); /* TCP edemux obtained wrong socket */
+
+ return ret;
+}
+
static unsigned int
nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
@@ -736,7 +753,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
/* Before packet filtering, change destination */
{
- .hook = nf_nat_ipv4_in,
+ .hook = nf_nat_ipv4_pre_routing,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_PRE_ROUTING,
.priority = NF_IP_PRI_NAT_DST,
@@ -757,7 +774,7 @@ static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
},
/* After packet filtering, change source */
{
- .hook = nf_nat_ipv4_fn,
+ .hook = nf_nat_ipv4_local_in,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC,
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 03a3ca37e4c6478e3a84f04c8429dd5889e107fd Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw(a)strlen.de>
Date: Wed, 24 Feb 2021 17:23:19 +0100
Subject: [PATCH] netfilter: nf_nat: undo erroneous tcp edemux lookup
Under extremely rare conditions TCP early demux will retrieve the wrong
socket.
1. local machine establishes a connection to a remote server, S, on port
p.
This gives:
laddr:lport -> S:p
... both in tcp and conntrack.
2. local machine establishes a connection to host H, on port p2.
2a. TCP stack choses same laddr:lport, so we have
laddr:lport -> H:p2 from TCP point of view.
2b). There is a destination NAT rewrite in place, translating
H:p2 to S:p. This results in following conntrack entries:
I) laddr:lport -> S:p (origin) S:p -> laddr:lport (reply)
II) laddr:lport -> H:p2 (origin) S:p -> laddr:lport2 (reply)
NAT engine has rewritten laddr:lport to laddr:lport2 to map
the reply packet to the correct origin.
When server sends SYN/ACK to laddr:lport2, the PREROUTING hook
will undo-the SNAT transformation, rewriting IP header to
S:p -> laddr:lport
This causes TCP early demux to associate the skb with the TCP socket
of the first connection.
The INPUT hook will then reverse the DNAT transformation, rewriting
the IP header to H:p2 -> laddr:lport.
Because packet ends up with the wrong socket, the new connection
never completes: originator stays in SYN_SENT and conntrack entry
remains in SYN_RECV until timeout, and responder retransmits SYN/ACK
until it gives up.
To resolve this, orphan the skb after the input rewrite:
Because the source IP address changed, the socket must be incorrect.
We can't move the DNAT undo to prerouting due to backwards
compatibility, doing so will make iptables/nftables rules to no longer
match the way they did.
After orphan, the packet will be handed to the next protocol layer
(tcp, udp, ...) and that will repeat the socket lookup just like as if
early demux was disabled.
Fixes: 41063e9dd1195 ("ipv4: Early TCP socket demux.")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427
Signed-off-by: Florian Westphal <fw(a)strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index e87b6bd6b3cd..4731d21fc3ad 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -646,8 +646,8 @@ nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
}
static unsigned int
-nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
- const struct nf_hook_state *state)
+nf_nat_ipv4_pre_routing(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
{
unsigned int ret;
__be32 daddr = ip_hdr(skb)->daddr;
@@ -659,6 +659,23 @@ nf_nat_ipv4_in(void *priv, struct sk_buff *skb,
return ret;
}
+static unsigned int
+nf_nat_ipv4_local_in(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
+{
+ __be32 saddr = ip_hdr(skb)->saddr;
+ struct sock *sk = skb->sk;
+ unsigned int ret;
+
+ ret = nf_nat_ipv4_fn(priv, skb, state);
+
+ if (ret == NF_ACCEPT && sk && saddr != ip_hdr(skb)->saddr &&
+ !inet_sk_transparent(sk))
+ skb_orphan(skb); /* TCP edemux obtained wrong socket */
+
+ return ret;
+}
+
static unsigned int
nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
@@ -736,7 +753,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
/* Before packet filtering, change destination */
{
- .hook = nf_nat_ipv4_in,
+ .hook = nf_nat_ipv4_pre_routing,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_PRE_ROUTING,
.priority = NF_IP_PRI_NAT_DST,
@@ -757,7 +774,7 @@ static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
},
/* After packet filtering, change source */
{
- .hook = nf_nat_ipv4_fn,
+ .hook = nf_nat_ipv4_local_in,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC,