So you think you understand IP fragmentation?

Posted Feb 7, 2024 18:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
Parent article: So you think you understand IP fragmentation?

Ugh. MTU is something that should have been added to the IP layer in IPv6, but they completely dropped the spaghetti.

In my perfect world, I'd have added two fields in the IP header, one for the forward MTU and one for the reflected MTU. Each router inspects the forward MTU and replaces it with its own MTU, if it's less than the one that is already there. Then the target host simply copies the resulting value into the "reflected MTU" field and sends it back with the next reply.

Done. MTU can be easily discovered within just one RTT.
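To make the proposal concrete, here is a minimal simulation sketch. The `Packet` class, its two field names, and the hop MTUs are all hypothetical, since no such header fields exist in IPv4 or IPv6; the code only models the "take the minimum at each hop, then echo it back" behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    forward_mtu: int        # lowered by routers along the path
    reflected_mtu: int = 0  # filled in by the peer on its reply


def forward(packet: Packet, hop_mtus: list[int]) -> Packet:
    """Each router replaces forward_mtu with its own MTU if that is smaller."""
    for mtu in hop_mtus:
        packet.forward_mtu = min(packet.forward_mtu, mtu)
    return packet


def reflect(received: Packet, own_mtu: int) -> Packet:
    """The target copies the value it saw into reflected_mtu of its reply."""
    return Packet(forward_mtu=own_mtu, reflected_mtu=received.forward_mtu)


# Hypothetical path MTUs between two hosts that both have 9000-byte links.
path = [9000, 1500, 1400, 9000]
probe = forward(Packet(forward_mtu=9000), path)
reply = forward(reflect(probe, own_mtu=9000), list(reversed(path)))

print(reply.reflected_mtu)  # 1400: path MTU from sender to target
print(reply.forward_mtu)    # 1400: path MTU from target back to sender
```

After one round trip the sender knows the MTU in its own direction, and it has also measured the reverse direction, which it can reflect on its next packet.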

Also, instead of fragmenting the packet or sending ETOOBIG, the routers should just truncate the packet and let it reach the destination. No need for ICMP.

Sadly, this is now all just random musings. IPv6 is a failure set in stone.

Posted Feb 7, 2024 19:01 UTC (Wed) by vadim (subscriber, #35271)

> Also, instead of fragmenting the packet or sending ETOOBIG, the routers should just truncate the packet and let it reach the destination. No need for ICMP.

That sounds like a recipe for breaking almost everything.

Old-fashioned protocols like cleartext SMTP will suddenly have bizarre failures, and parts of emails will randomly vanish into the ether.

Other protocols like SSL and SSH will return obscure errors nobody has seen before, or exhibit behaviors like waiting forever for data that doesn't arrive.

Downloads will be randomly corrupted.

IoT and other restricted devices will malfunction in very hard-to-debug ways.

Posted Feb 7, 2024 20:30 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Nope. Anything that runs over TCP will be fine: truncated packets will be caught at the IP level, completely transparently to the protocols above it.

On the other hand, in-band MTU signalling would allow VPN protocols to more easily identify the flow that needs corrective actions (MTU clamping).

Posted Feb 7, 2024 22:04 UTC (Wed) by shemminger (subscriber, #5739)

Truncated packets will cause checksum and other errors.
And truncated packets show up as other types of errors in counters.

Posted Feb 7, 2024 22:07 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

That's kinda the idea. Maybe also add a flag "packet intentionally truncated due to MTU overflow" to make sure stats can be easily broken down by truncation type.

Posted Feb 7, 2024 23:04 UTC (Wed) by jengelh (subscriber, #33263)

> Truncated packets will cause checksum errors

And it so happens IPv6 has done away with not only (v4-style auto)fragmentation, but also with the IP-level checksum, so YMMV.

Posted Feb 8, 2024 0:44 UTC (Thu) by pizza (subscriber, #46)

> And it so happens IPv6 has done away with not only (v4-style auto)fragmentation, but also with the IP-level checksum, so YMMV.

Unless you truncate the IPv6 packet smaller than its header length, truncating the IPv6 packet isn't going to cause processing problems.

Meanwhile, the IP payload (e.g. TCP or UDP) already provides its own checksum that will fail if it gets truncated.

Either way, truncating the packet isn't going to allow the application to receive garbage data.

Posted Feb 9, 2024 13:52 UTC (Fri) by smurf (subscriber, #17840)

No, the checksum will not always fail. The checksum protects against one-bit changes and similar shenanigans. It's not safe for truncation. One in 65536 truncated packets will pass the checksum test. That's way too much for a protocol that's supposed to be reliable.

However, the UDP header also contains … surprise … a length. As long as you don't send IPv6 jumbograms (length word: zero) you're thus still safe there.
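A small sketch of why the checksum alone is not enough. It computes the 16-bit one's-complement Internet checksum over a made-up payload and over a truncated copy. As a simplification, the pseudo-header and the transport header are left out, so this illustrates the arithmetic rather than modelling real UDP or TCP verification.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = b"hello world!" + bytes(8)   # hypothetical payload ending in zeros
truncated = payload[:-8]               # the tail is cut off in transit

# Zero words add nothing to the sum, so the checksum cannot see the cut.
same_checksum = internet_checksum(payload) == internet_checksum(truncated)

# A length field carried in the header still exposes the truncation.
declared_length = len(payload)
length_mismatch = declared_length != len(truncated)

print(same_checksum, length_mismatch)  # True True
```

Trailing zeros are the easy case; for arbitrary data the lost words only need to sum to zero in one's-complement arithmetic, which is where the roughly one-in-65536 figure comes from.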

Posted Feb 9, 2024 14:20 UTC (Fri) by farnz (subscriber, #17727)

Even IPv6 jumbograms have a length field, in the Jumbo Payload hop-by-hop option. This is a 32-bit number instead of a 16-bit number, but if you have the full IPv6 header, you'll get a length field to work with (either 16 bits, if not jumbo-sized, or 32 bits, if jumbo-sized).
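As a rough sketch of where a receiver would find that length, the function below reads the 16-bit Payload Length from the fixed header and, when it is zero and a hop-by-hop header follows, looks for the Jumbo Payload option (type 0xC2, 4 bytes of data). The function name and the sample packets are made up for illustration, and a real stack performs many more checks.

```python
import struct

HOP_BY_HOP = 0        # Next Header value of the hop-by-hop options header
JUMBO_PAYLOAD = 0xC2  # option type of the Jumbo Payload option


def declared_payload_length(packet: bytes) -> int:
    """Return the payload length that an IPv6 packet claims to carry."""
    if len(packet) < 40:
        raise ValueError("truncated inside the fixed IPv6 header")
    payload_len, next_header = struct.unpack_from("!HB", packet, 4)
    if payload_len != 0 or next_header != HOP_BY_HOP:
        return payload_len                    # ordinary 16-bit length
    if len(packet) < 42:
        raise ValueError("truncated inside the hop-by-hop header")
    ext_len = (packet[41] + 1) * 8            # hop-by-hop header size in bytes
    options = packet[42:40 + ext_len]
    i = 0
    while i + 1 < len(options):
        opt_type, opt_len = options[i], options[i + 1]
        if opt_type == 0:                     # Pad1 is a single byte
            i += 1
            continue
        if opt_type == JUMBO_PAYLOAD and opt_len == 4 and i + 6 <= len(options):
            return struct.unpack_from("!I", options, i + 2)[0]
        i += 2 + opt_len
    raise ValueError("zero payload length but no usable Jumbo Payload option")


# Hand-built sample headers (addresses left as zeros).
normal = bytes([0x60, 0, 0, 0]) + struct.pack("!HBB", 1200, 17, 64) + bytes(32)
jumbo = (bytes([0x60, 0, 0, 0]) + struct.pack("!HBB", 0, HOP_BY_HOP, 64)
         + bytes(32) + bytes([17, 0, JUMBO_PAYLOAD, 4])
         + struct.pack("!I", 100000))

print(declared_payload_length(normal))  # 1200
print(declared_payload_length(jumbo))   # 100000
```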

Posted Feb 8, 2024 8:55 UTC (Thu) by intelfx (subscriber, #130118)

> In my perfect world, I'd have added two fields in the IP header, one for the forward MTU and one for the reflected MTU. Each router inspects the forward MTU and replaces it with its own MTU, if it's less than the one that is already there. Then the target host simply copies the resulting value into the "reflected MTU" field and sends it back with the next reply.

It’s a neat idea, but does it justify paying the extra space cost in each and every packet?

Posted Feb 8, 2024 13:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Yes. The extra space is just 4 bytes, and we're losing 20 bits on an unused "flow label" in IPv6 anyway. And some bits can be saved by using multiples of 16, instead of an exact MTU.
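A quick sketch of the "multiples of 16" encoding. The 12-bit field width is an assumption chosen for illustration; the comment above does not specify one. Rounding down keeps the advertised value from ever exceeding the real MTU.

```python
UNIT = 16  # granularity suggested above


def encode_mtu(mtu: int) -> int:
    """Round down so the encoded value never overstates the real MTU."""
    return mtu // UNIT


def decode_mtu(field: int) -> int:
    return field * UNIT


print(decode_mtu(encode_mtu(1500)))  # 1488
print(decode_mtu(encode_mtu(9000)))  # 8992
print(decode_mtu(0xFFF))             # 65520, the largest value in 12 bits
```

The cost is at most 15 bytes of usable MTU per packet, in exchange for 4 fewer bits per field.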

Correctly implementing this mechanism would also unlock larger packets. We no longer would be limited by just 1500 bytes.

Posted Feb 8, 2024 13:20 UTC (Thu) by paulj (subscriber, #341)

Flow label is used in IPv6. Particularly in data-centre environments.

Posted Feb 8, 2024 21:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

I'm honestly not familiar with that. I know that a lot of use-cases were proposed throughout the years (QoS, routing hints, etc.) but I'm not aware of any real uptake. Load balancers can't depend on it, because it's controlled by clients.

Posted Feb 9, 2024 11:09 UTC (Fri) by paulj (subscriber, #341)

Some of the most popular L3 switching/routing ASICs by default now use the flow-label for IPv6 ECMP flow grouping, *not* the (address,port) 4-tuple (which is the flow-ID for v4). E.g. more modern Broadcom Tridents. Anything that cares about ECMP is going to have to set the flow-label. I'm sure popular TCP stacks must do already - given ASICs using it by default. I can't double-check or find an authoritative reference right now, but I'm pretty sure Linux assigns a random flow-label to a TCP connection if the app hasn't set one. So apps generally don't need to care - the kernel has it covered.
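A rough sketch of what flow-label-based ECMP grouping means in practice. The function name and the addresses are made up, and SHA-256 merely stands in for whatever vendor-specific hash a real ASIC uses. The point is only that the next hop is chosen from (source, destination, flow label) without looking at ports.

```python
import hashlib
import ipaddress


def pick_next_hop(src: str, dst: str, flow_label: int,
                  next_hops: list[str]) -> str:
    """Choose an ECMP next hop from the (src, dst, flow label) 3-tuple."""
    key = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + (flow_label & 0xFFFFF).to_bytes(3, "big")  # flow label is 20 bits
    )
    digest = hashlib.sha256(key).digest()            # stand-in hash function
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


hops = ["fe80::1", "fe80::2", "fe80::3", "fe80::4"]  # hypothetical next hops
first = pick_next_hop("2001:db8::1", "2001:db8::2", 0x12345, hops)
again = pick_next_hop("2001:db8::1", "2001:db8::2", 0x12345, hops)
print(first == again)  # True: one flow always takes one path
```

Two connections between the same pair of hosts only spread across paths if the sender gives them different flow labels, which is why it matters that the sending stack sets one.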

Posted Feb 9, 2024 11:50 UTC (Fri) by intelfx (subscriber, #130118)

> We no longer would be limited by just 1500 bytes

I thought that we are limited by 1500 bytes because the Internet equipment does not support / is not configured to support larger packets. Even if you implement perfect discovery, that won't magically fix the equipment, no?

Posted Feb 9, 2024 12:22 UTC (Fri) by farnz (subscriber, #17727)

We have a chicken-and-egg situation; the equipment is not configured to support larger packets because path MTU detection is not reliable, fragmentation is not reliable, and the failure case if both of those don't work is random failures of higher level protocols like TCP. This means that there is no reason for anyone to support a larger MTU on Internet-facing equipment, since you're likely to have issues where a required path between two points drops large MTU packets.

In addition, if your MTU is too large, you will frequently experience an RTT delay where something on the path sends an ICMP Too Big your way, and you have to reduce the detected path MTU; in Cyberax's proposal, you can determine the current path MTU with small packets (like those used to establish a TCP connection), and thus not pay that penalty unless the path is changing from larger MTU to smaller MTU during the lifetime of a single connection.

And it's worth noting that we already have examples of devices where there's a large MTU at PHY level, and we aggregate MAC level packets to fill a single PHY packet; it's called WiFi. Having a good way to handle variable MTU (which would include WiFi APs being able to change the path MTU on you, because the client has moved) would reduce overheads. But this needs not just Cyberax's idea, but also a change from switching Ethernet frames to routing IP packets everywhere, so that the WiFi AP is expected to know about per-device MTUs.

Posted Feb 9, 2024 16:06 UTC (Fri) by paulj (subscriber, #341)

Fragmentation was reliable, actually. Until we deprecated it and encouraged vendors to degrade/remove support. Way more reliable than what was meant to supersede it, at least.

Posted Feb 9, 2024 17:12 UTC (Fri) by farnz (subscriber, #17727)

I last saw fragmentation being reliable in the 1980s; my experience ever since then with IPv4 is that it's hugely unreliable, because all sorts of entities use middleboxes that drop all fragments (rather than forwarding them, or reassembling then forwarding), and thus it's basically useless.

Posted Feb 9, 2024 17:21 UTC (Fri) by paulj (subscriber, #341)

Well, yeah, firewalls always suck. They were still uncommon even in the 90s, though, when I was first online, IIRC - I don't know about the 80s. The idiotic middle-boxes only seemed to really become common in the very late 90s and 00s, with the Internet going mainstream. Again, my vague, hand-wavy recollection.

The data-plane level support was fine though, until IETF moved to deprecate, and then vendors of course did.

Posted Feb 9, 2024 17:34 UTC (Fri) by farnz (subscriber, #17727)

My experience was that they were always present, and became more and more of an issue throughout the 90s, until they were basically making fragmentation unusable unless the path was either between two academic institutions or between my ISP at the time and an academic institution.

Additionally, long before middleboxes became widespread, the dataplane support already sucked; there were plenty of Cisco routers that could do forwarding in hardware, but did fragmentation in software on a slow path. Not a problem from home, where my modem was the bottleneck, but a very noticeable issue when at an academic institution where the "wrong" MTU could bring speeds down from megabits per second to tens of kilobits per second.

Posted Feb 9, 2024 17:39 UTC (Fri) by paulj (subscriber, #341)

Well, fragmentation being slow is not a problem. It's expected that frags will be slow-path - but at least communication is still possible. Piece-meal upgrades of the common MTU are at least /possible/ then.

Slow but working beats the mess we have today: We will never be able to default to >1500 MTUs, and even then we still don't have reliable networking (VPNs, etc.), and because of that the awesome networking tool of encapsulation is restricted in utility.

Posted Feb 9, 2024 17:40 UTC (Fri) by paulj (subscriber, #341)

And that's not to say we should go back to data-plane fragmentation. Impossible, and it might not be technically the best solution anyway. But the current situation is a mess and unfortunate.

Posted Feb 9, 2024 17:58 UTC (Fri) by farnz (subscriber, #17727)

By 1990, it was already the case IME that communication was not possible if there were smaller MTUs in the path, unless you were lucky enough to have a path where everything was run by sensible netadmins (usually true of academia), or you were on dial-up (where you had the bottleneck MTU).

And one of the many issues back then was routers with multi-MTU paths that were configured explicitly to not fragment packets because it could overload the CPU; packets were either pre-fragmented, or were dropped. Add in people configuring routers to drop fragments "because security" (which got worse after the ping of death vulnerability was discovered, since that depended on buggy fragment handling), and fragmentation became useless.

The IETF, by limiting fragmentation to the endpoints, were reacting to the state of play in 1990, where many routers already didn't fragment, but dropped packets that were too big.

Posted Feb 9, 2024 16:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> But this needs not just Cyberax's idea, but also a change from switching Ethernet frames to routing IP packets everywhere, so that the WiFi AP is expected to know about per-device MTUs.

Not necessarily. There are two ways this can be done without significant changes:

1. APs can just update the "forward MTU" field in IP packets, they don't need to be full routers for this. Yeah, it's a layering violation, but I doubt that people care too much about that sort of thing anymore.

2. MTU can be added to ARP/ND directly. So the sender will discover the L2 MTU of the destination when it does the initial L2 discovery. WiFi APs are responsible for ARP/ND already, so it even fits in well into the "proper" layered model.

Also, how do APs handle jumbo frames? I need to do some experiments...

Posted Feb 9, 2024 17:03 UTC (Fri) by pizza (subscriber, #46)

> Also, how do APs handle jumbo frames? I need to do some experiments...

Based on my (admittedly not recent) experience, they don't handle it well. Indeed, despite WiFi nominally supporting 2300ish-byte MTUs, APs routinely fail with anything over 1500 bytes. Because reasons.

(That is the main reason why I reverted back to 1500-byte MTUs on my home networks....)

Posted Feb 9, 2024 17:22 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

Ha. My AP synthesizes ICMP Too Big packets. Layering violation, you say?

I have 9k MTU within my home network (and I get 10500 megabits over 10GbE connections), and so far my WiFi has been behaving pretty well. I can get 1.5 Gbps download and 2.5 Gbps upload.

Posted Feb 9, 2024 16:05 UTC (Fri) by paulj (subscriber, #341)

Nearly all equipment has supported at least 9k MTU for a long while. The problem now is the network protocols we use have no reliable way to figure out how to use larger MTUs.

See my blog link in another comment on this article. It has quotes from an early paper on TCP/IP from Kahn and Cerf explaining why it is important to have a reliable network mechanism to allow networks with different MTUs to inter-operate. Unfortunately, we - collectively - failed to heed their wise words.

A reliable mechanism needs to be in-band. E.g., data-plane fragmentation. Side-band end-host solutions - i.e., relying on ICMP messages - have proven to be fragile. Pure end-host probing (i.e. Path-MTU Discovery, in protocol or out) is also inefficient, temporally unreliable, and fragile.
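To put a rough number on the cost of pure end-host probing, here is a sketch of a binary search for the path MTU. `send_probe` is a hypothetical callback that reports whether a probe of a given size got through; in this sketch the network is simulated by a lambda, whereas on a real path each failed probe typically costs a timeout.

```python
def probe_path_mtu(send_probe, low: int = 1280, high: int = 9000):
    """Binary-search the largest size for which send_probe(size) succeeds.

    Returns (mtu, probes_sent). Assumes a probe of size `low` always
    gets through (1280 is the IPv6 minimum link MTU).
    """
    probes = 0
    while low < high:
        mid = (low + high + 1) // 2
        probes += 1
        if send_probe(mid):
            low = mid        # mid fits, so look for something larger
        else:
            high = mid - 1   # mid was dropped, so look below it
    return low, probes


# Simulated path whose narrowest link carries 1400 bytes.
mtu, probes = probe_path_mtu(lambda size: size <= 1400)
print(mtu, probes)
```

Searching between 1280 and 9000 bytes takes about a dozen probes, and the answer is only valid until the path changes, compared with a single round trip for an in-band mechanism.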

Posted Feb 9, 2024 17:00 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> Nearly all equipment has supported at least 9k MTU for a long while

Except for WiFi. Its PHY MTU is just 2300 bytes.

Posted Feb 9, 2024 17:40 UTC (Fri) by farnz (subscriber, #17727)

That's the MAC MTU; the PHY MTU can be as large as 2,097,148 bytes in 802.11ac networks (noting that the PHY MTU depends not only on static parameters like channel width, but also on dynamic parameters like the time it will take to transmit the frame). For 802.11ax, the PHY MTU is permitted to go as high as 6,500,631 bytes. Even as early as 802.11n (in 2009), the PHY MTU was allowed to go as large as 65,536 bytes under good conditions.

This is made useful with a much smaller MAC MTU by having aggregation options, so that a single PHY frame contains many MAC frames; the downside is that there is overhead for each and every MAC frame in the PHY frame, which would go down if the MAC frames were larger. There would still be overhead mapping MAC frames into PHY frames, so you wouldn't have as large a MAC MTU as the PHY MTUs, but there would be large MTUs involved.
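A back-of-the-envelope sketch of that per-frame overhead. The 40 bytes of overhead per MAC frame is an assumed round number for illustration, not a figure taken from the 802.11 specifications.

```python
def payload_fraction(mac_mtu: int, per_frame_overhead: int) -> float:
    """Share of the bytes in an aggregated MAC frame that are payload."""
    return mac_mtu / (mac_mtu + per_frame_overhead)


OVERHEAD = 40  # assumed bytes of header, delimiter and FCS per MAC frame

for mac_mtu in (1500, 2300, 9000):
    print(mac_mtu, round(payload_fraction(mac_mtu, OVERHEAD), 3))
# 1500 0.974
# 2300 0.983
# 9000 0.996
```

Under this assumption, moving from 1500-byte to 9000-byte MAC frames cuts the per-frame overhead from about 2.6% to about 0.4% of the bytes in the aggregate.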