* [net] ipv4: Fix broken PMTUD when using L4 multipath hash
From: Nabil S. Alramli @ 2023-10-12 23:40 UTC (permalink / raw)
To: sbhogavilli, davem, dsahern, edumazet, kuba, pabeni, netdev,
linux-kernel
Cc: srao, dev
From: Suresh Bhogavilli <[email protected]>
On a node with multiple network interfaces, if we enable the layer 4
hash policy with net.ipv4.fib_multipath_hash_policy=1, path MTU
discovery is broken and TCP connections do not make progress unless the
incoming ICMP Fragmentation Needed (type 3, code 4) message is received
on the egress interface of the nexthop selected for the socket.
This is because build_sk_flow_key() does not provide the sport and
dport from the socket when calling flowi4_init_output(). This appears
to be a copy/paste error from the build_skb_flow_key() ->
__build_flow_key() -> flowi4_init_output() path used for packet
forwarding, where an skb is present: the skb is later passed to
fib_multipath_hash(), which can extract both sport and dport from it
when the L4 hash policy is in use.
In the socket write case, fib_multipath_hash() does not get an skb, so
it expects the fl4 to have sport and dport populated when L4 hashing is
in use. Not populating them results in a nexthop exception entry being
created against a nexthop that may not be the one used by the socket.
The exception is therefore not matched later when
inet_csk_rebuild_route() is called to update the cached dst entry in
the socket, so TCP does not lower its MSS and the connection does not
make progress.
Fix this by providing the source and destination ports to the
flowi4_init_output() call in build_sk_flow_key().
Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Suresh Bhogavilli <[email protected]>
---
net/ipv4/route.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e2bf4602b559..2517eb12b7ef 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -557,7 +557,8 @@ static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
inet_test_bit(HDRINCL, sk) ?
IPPROTO_RAW : sk->sk_protocol,
inet_sk_flowi_flags(sk),
- daddr, inet->inet_saddr, 0, 0, sk->sk_uid);
+ daddr, inet->inet_saddr, inet->inet_dport, inet->inet_sport,
+ sk->sk_uid);
rcu_read_unlock();
}
--
2.31.1
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
From: David Ahern @ 2023-10-13 16:19 UTC (permalink / raw)
To: Nabil S. Alramli, sbhogavilli, davem, edumazet, kuba, pabeni,
netdev, linux-kernel
Cc: srao, dev
On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
> From: Suresh Bhogavilli <[email protected]>
>
> On a node with multiple network interfaces, if we enable the layer 4
> hash policy with net.ipv4.fib_multipath_hash_policy=1, path MTU
> discovery is broken and TCP connections do not make progress unless the
> incoming ICMP Fragmentation Needed (type 3, code 4) message is received
> on the egress interface of the nexthop selected for the socket.
known problem.
>
> This is because build_sk_flow_key() does not provide the sport and
> dport from the socket when calling flowi4_init_output(). This appears
> to be a copy/paste error from the build_skb_flow_key() ->
> __build_flow_key() -> flowi4_init_output() path used for packet
> forwarding, where an skb is present: the skb is later passed to
> fib_multipath_hash(), which can extract both sport and dport from it
> when the L4 hash policy is in use.
are you sure?
As I recall the problem is that the ICMP can be received on a different
path. When it is processed, the exception is added to the ingress device
of the ICMP and not the device the original packet egressed. I have
scripts that somewhat reliably reproduced the problem; I started working
on a fix and got distracted.
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
From: Nabil S. Alramli @ 2023-10-16 18:51 UTC (permalink / raw)
To: David Ahern, sbhogavilli, davem, edumazet, kuba, pabeni, netdev,
linux-kernel
Cc: jdamato, srao, dev
Hi David,
Thank you for your quick response.
On 10/13/2023 12:19 PM, David Ahern wrote:
> On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
>> From: Suresh Bhogavilli <[email protected]>
>>
>> On a node with multiple network interfaces, if we enable the layer 4
>> hash policy with net.ipv4.fib_multipath_hash_policy=1, path MTU
>> discovery is broken and TCP connections do not make progress unless the
>> incoming ICMP Fragmentation Needed (type 3, code 4) message is received
>> on the egress interface of the nexthop selected for the socket.
>
> known problem.
>
>>
>> This is because build_sk_flow_key() does not provide the sport and
>> dport from the socket when calling flowi4_init_output(). This appears
>> to be a copy/paste error from the build_skb_flow_key() ->
>> __build_flow_key() -> flowi4_init_output() path used for packet
>> forwarding, where an skb is present: the skb is later passed to
>> fib_multipath_hash(), which can extract both sport and dport from it
>> when the L4 hash policy is in use.
>
> are you sure?
>
> As I recall the problem is that the ICMP can be received on a different
> path. When it is processed, the exception is added to the ingress device
> of the ICMP and not the device the original packet egressed. I have
> scripts that somewhat reliably reproduced the problem; I started working
> on a fix and got distracted.
With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
ICMP Packet Too Big (PTB) message is received on an interface different
from the socket's egress interface, we see a cache entry added on the
ICMP ingress interface, but with parameters matching the route entry
rather than the MTU reported in the ICMP message.
On the node below, ICMP PTB messages arrive on an interface named
vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
the path from this node to 139.162.188.91 is via another interface
named vlan200.
When the ICMP PTB message arrives on vlan100, an exception entry does
get added to vlan200 and the socket's cached MTU gets updated too. The
TCP connection makes progress (not shown).
sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache expires 363sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache expires 363sec mtu 905 advmss 1460
With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
traffic egresses over vlan200 (with the ICMP PTB message still arriving
on vlan100), the cache entry still shows an MTU of 1500 on the TCP
egress interface vlan200. No exception entry gets added to vlan100, as
you noted:
sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache mtu 1500 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache mtu 1500 advmss 1460
In this case, the TCP connection does not make progress, ultimately
timing out.
If we retry TCP connections until one egresses via vlan100, the
exception entry does get added, with an MTU matching that reported in
the ICMP PTB message:
sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
cache expires 153sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache mtu 1500 advmss 1460
In this case the TCP connection over vlan100 does make progress.
With the proposed patch applied, an exception entry does get created
on the socket's egress interface even when it is different from the
ICMP PTB ingress interface. Below is the output after different TCP
connections have used both of this node's interfaces:
sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
cache expires 565sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
cache expires 562sec mtu 905 advmss 1460
Thank you.
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
From: Suresh Bhogavilli @ 2024-02-09 17:11 UTC (permalink / raw)
To: Nabil S. Alramli, David Ahern, davem, edumazet, kuba, pabeni,
netdev, linux-kernel
Cc: jdamato, srao, dev
Hi David,
On 10/16/23 2:51 PM, Nabil S. Alramli wrote:
> On 10/13/2023 12:19 PM, David Ahern wrote:
>>> On a node with multiple network interfaces, if we enable the layer 4
>>> hash policy with net.ipv4.fib_multipath_hash_policy=1, path MTU
>>> discovery is broken and TCP connections do not make progress unless the
>>> incoming ICMP Fragmentation Needed (type 3, code 4) message is received
>>> on the egress interface of the nexthop selected for the socket.
>> known problem.
>>
>>> This is because build_sk_flow_key() does not provide the sport and
>>> dport from the socket when calling flowi4_init_output(). This appears
>>> to be a copy/paste error from the build_skb_flow_key() ->
>>> __build_flow_key() -> flowi4_init_output() path used for packet
>>> forwarding, where an skb is present: the skb is later passed to
>>> fib_multipath_hash(), which can extract both sport and dport from it
>>> when the L4 hash policy is in use.
>> are you sure?
>>
>> As I recall the problem is that the ICMP can be received on a different
>> path. When it is processed, the exception is added to the ingress device
>> of the ICMP and not the device the original packet egressed. I have
>> scripts that somewhat reliably reproduced the problem; I started working
>> on a fix and got distracted.
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
> ICMP Packet Too Big (PTB) message is received on an interface different
> from the socket's egress interface, we see a cache entry added on the
> ICMP ingress interface, but with parameters matching the route entry
> rather than the MTU reported in the ICMP message.
>
> On the node below, ICMP PTB messages arrive on an interface named
> vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
> the path from this node to 139.162.188.91 is via another interface
> named vlan200.
>
> When the ICMP PTB message arrives on vlan100, an exception entry does
> get added to vlan200 and the socket's cached MTU gets updated too. The
> TCP connection makes progress (not shown).
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache expires 363sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache expires 363sec mtu 905 advmss 1460
>
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
> traffic egresses over vlan200 (with the ICMP PTB message still arriving
> on vlan100), the cache entry still shows an MTU of 1500 on the TCP
> egress interface vlan200. No exception entry gets added to vlan100, as
> you noted:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache mtu 1500 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache mtu 1500 advmss 1460
>
> In this case, the TCP connection does not make progress, ultimately
> timing out.
>
> If we retry TCP connections until one egresses via vlan100, the
> exception entry does get added, with an MTU matching that reported in
> the ICMP PTB message:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
> cache expires 153sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache mtu 1500 advmss 1460
>
> In this case the TCP connection over vlan100 does make progress.
>
> With the proposed patch applied, an exception entry does get created
> on the socket's egress interface even when it is different from the
> ICMP PTB ingress interface. Below is the output after different TCP
> connections have used both of this node's interfaces:
>
> sbhogavilli@node20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
> cache expires 565sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
> cache expires 562sec mtu 905 advmss 1460
>
> Thank you.
Does that answer your question? Do you need me to make any changes to
get your Reviewed-by?
Best regards,
Suresh
* Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
From: David Ahern @ 2024-02-09 22:27 UTC (permalink / raw)
To: Suresh Bhogavilli, Nabil S. Alramli, davem, edumazet, kuba,
pabeni, netdev, linux-kernel
Cc: jdamato, srao, dev
On 2/9/24 10:11 AM, Suresh Bhogavilli wrote:
>
> Does that answer your question? Do you need me to make any changes to
> get your Reviewed-by?
Re-send the patch, enhancing the commit message with what happens for
the different policies.
It would be good to get a test added to
tools/testing/selftests/net/pmtu.sh (IPv4 and IPv6).