So you think you understand IP fragmentation?

February 7, 2024

This article was contributed by Valerie Aurora

What is IP fragmentation, why is it important, and do people understand it? The answer to that last question is "not as well as they think". This article will also answer the rest of those questions and introduce fragquiz, a game that I wrote to allow players to guess how IP packets will behave when they are too large for the network. As evidence that IP fragmentation is not well-understood, a room full of networking experts played fragquiz and got a score that was nowhere close to perfect. In addition, I will describe a new algorithm for fragmentation avoidance, which some colleagues and I developed, that helped motivate development of fragquiz.

Why care?

IP fragmentation is when an IP (Internet Protocol) packet is split into smaller pieces before it is sent to another computer. TCP and UDP, along with a lot of other network protocols, are implemented on top of IP. Many networking experts think they know when IP fragmentation will happen, and I thought I did too, until I had to implement a path MTU discovery algorithm for a VPN. That's when I learned that, like me, a lot of other networking experts are quite bad at predicting when a packet will be split into pieces. To explain why, we start with what IP fragmentation is.

An IP packet is a building block of the internet: a little chunk of application data with a header describing what it contains, where to send it, and what intermediate routers are allowed to do to it, among other things. Each router on the path between the source and destination host reads the IP header, changes it slightly, consults the routing tables, and (hopefully) sends the packet on to the next router in the path.

Each network link has a maximum size of IP packet that can be sent over it: the Maximum Transmission Unit (MTU). The path MTU (PMTU) is the minimum of all of the MTUs on the path between two hosts. The path can change over time, however, based on congestion, outages, and other network changes.
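
As a small illustration of that definition, the sketch below computes a path MTU from a list of per-link MTUs. The function name and the example link values are made up for illustration; real hosts usually cannot see the MTU of every link, which is why discovery is needed at all.

```python
def path_mtu(link_mtus):
    """Return the path MTU: the smallest MTU of any link on the path."""
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    return min(link_mtus)


# Hypothetical path: a jumbo-frame LAN, ordinary Ethernet, then a tunnel.
example_path = [9000, 1500, 1420]
print(path_mtu(example_path))  # 1420
```

If the path changes and the 1420-byte link drops out of it, the same calculation yields 1500, which is why a path MTU estimate has to be refreshed over time.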

IP fragmentation happens when IP packets get split up into smaller IP packets, each with their own header, so that they can fit into the MTU of the network path. In IPv4 and IPv6, fragmentation can occur at the source, the computer where the packet is coming from. In IPv4, packets can also be fragmented by any router on the path between the source and the destination.

Generally speaking, IP fragmentation is bad for performance in just about every dimension: throughput, latency, CPU usage, memory usage, and network congestion. To see why, imagine a typical IPv4 packet of 20 bytes of IP metadata and 1480 bytes of data that has been fragmented into packets that each contain only eight bytes or fewer of data for a total of 1480/8 = 185 packets. (This is possible but unlikely to ever happen in reality; usually packets are only split into two pieces.)

To send 1480 bytes of data in eight-byte fragments, the source must send 185*20 = 3700 bytes of metadata instead of just 20 bytes in the unfragmented case. Processing the packet header costs a certain amount of CPU time, which will happen 185 times at every host in the path. The destination can't pass the data up the networking stack until it receives all of the fragments, so the latency is the worst case of 185 packets. The destination must also reserve memory for assembling the fragmented packet, which it will throw away if it does not receive even one of the fragments after waiting for a reasonable time.
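
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes a 20-byte IPv4 header with no options, and the function name is invented for this example. It also enforces the IPv4 rule that every fragment except the last carries a multiple of eight bytes of data.

```python
IPV4_HEADER_BYTES = 20  # assumes an IPv4 header without options


def fragment_cost(payload_bytes, data_per_fragment):
    """Return (fragment count, total header bytes) for one IPv4 payload."""
    if data_per_fragment <= 0 or data_per_fragment % 8 != 0:
        # Fragment offsets are counted in 8-byte units.
        raise ValueError("fragment data size must be a positive multiple of 8")
    fragments = -(-payload_bytes // data_per_fragment)  # ceiling division
    return fragments, fragments * IPV4_HEADER_BYTES


print(fragment_cost(1480, 8))     # (185, 3700)
print(fragment_cost(1480, 1480))  # (1, 20)
```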

Worse, fragments are more likely to be lost. Many routers and firewalls treat fragments as a security risk because they don't include the information from higher-level protocols like TCP or UDP and can't be filtered based on port, so they drop all IP fragments. Also, load-balancing systems might route fragments to different hosts, where they can never be reassembled.

Even when an IP packet is only split into two pieces, it usually causes a noticeable degradation of connection performance due to the doubling of the per-packet overhead. Sometimes IP fragmentation results in a network "black hole" if a router is configured to drop fragments. The small packets that initiate a connection get through, but the larger packets containing the data are fragmented, so they are all dropped. This is why network programmers really really want to prevent IP fragmentation.

Prevention

IP fragmentation is prevented by only sending packets that are equal to or smaller than the path MTU between two hosts. But how do we find the path MTU? This is called path MTU discovery (PMTUD) and there are a variety of methods to do this, depending on the networking protocol and the characteristics of the network. One reliable way to find the path MTU is to send IP packets of a known size that are not allowed to be fragmented. If the source gets confirmation that a packet arrived at the destination, then the path MTU is at least as large as that packet.

So, to prevent IP fragmentation, you must understand IP fragmentation well enough to predict two things: the size of the IP packet as sent by the source host, and whether any intermediate routers are permitted to fragment the packet into smaller pieces. This depends on, among other things:

the MTU of the local interface
the IP version (IPv4 or IPv6)
the options in the IP packet header
the protocol (TCP/UDP/ICMP/etc.)
the socket options
any system-wide PMTUD-related settings
any relevant PMTU-cache entries

If the sender tries to send an IP packet that is bigger than the MTU on any part of the path to the receiving host, there are three possibilities: the send() system call returns EMSGSIZE, the packet is fragmented, and/or the packet is dropped. (The last two may happen on either the source host or an intermediate router, depending on the packet type and options.) When I say that someone "understands IP fragmentation", I mean that they can predict which of those things might happen to a given packet.

Well-understood?

If you'd asked me a year ago if most networking experts could predict the size and fragmentation status of an IP packet, I would have confidently said "yes". Then I had to implement DPLPMTUD for a VPN. (Yes, that's a real acronym, for real software, from a real RFC. It stands for Datagram Packetization Layer Path Maximum Transmission Unit Discovery.)

Initially, it seemed like it would be easy. My colleagues were networking experts with a lot of experience working on the application, which is a WireGuard-based VPN using IPv4 and IPv6. Together, we came up with a fast, simple path MTU discovery algorithm. They were confident that the software already only sent packets that couldn't be fragmented, so all we had to do is send the right size of probe packets, using a built-in ping feature, and record the response. Imagine our surprise when the packet captures turned out to be full of fragmented packets.

As I searched for ways to disable IP fragmentation, I found a lot of misleading and unhelpful answers on Stack Overflow. Sometimes the best answer would be down-voted. The official documentation either didn't exist (macOS) or was hard to understand (Linux). We all thought the probe packets should be sent on a socket with IP_PMTUDISC_DO set on Linux, but it took a few weeks to realize that we actually wanted IP_PMTUDISC_PROBE. Eventually I figured out all the correct settings for Linux and macOS, but it took much longer than it should have.
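
For readers who want to experiment, here is a minimal, Linux-only sketch of setting that option on a UDP socket from Python. It is not the VPN's code. The numeric constants are copied from Linux's <linux/in.h> because Python's socket module may not export them, and the helper name is an assumption of this example. Per ip(7), IP_PMTUDISC_DO sets the don't-fragment bit and refuses datagrams larger than the known path MTU, while IP_PMTUDISC_PROBE sets the don't-fragment bit but ignores the cached path MTU, which is what a probe needs.

```python
import socket
import sys

# Values from Linux's <linux/in.h>; other systems use different mechanisms.
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DONT = 0   # never set the don't-fragment bit
IP_PMTUDISC_WANT = 1   # use per-route hints
IP_PMTUDISC_DO = 2     # always set DF; oversized sends fail with EMSGSIZE
IP_PMTUDISC_PROBE = 3  # set DF but ignore the cached path MTU


def make_udp_socket(mode=IP_PMTUDISC_PROBE):
    """Create an IPv4 UDP socket with the given PMTUD mode (Linux only)."""
    if not sys.platform.startswith("linux"):
        raise OSError("IP_MTU_DISCOVER as used here is Linux-specific")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, mode)
    except OSError:
        sock.close()
        raise
    return sock
```

A caller would create the socket with make_udp_socket() and then send probe datagrams of the sizes it wants to test; nothing is sent by the code above.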

I wanted to share what I learned with other people, but now I faced an even harder problem: How do you teach people something they think they already know? People were confidently wrong about IP fragmentation everywhere I looked, including in the mirror. Also, let's face it, IP fragmentation is kind of boring.

Introducing fragquiz

I decided to write a game to help people learn IP fragmentation. The program would send packets that were larger than the MTU of the local network connection (the gateway interface), while changing the IP version (IPv6 or IPv4), the transport-layer protocol (TCP or UDP), and the socket-fragmentation options (do/don't fragment on macOS, four different PMTUD options from the ip(7) man page for Linux). It would then report whether the packet was sent, what the packet's fragmentation setting was, and if it was fragmented en route—but first it would make the user guess what would happen. At the end, it would tell them their score and encourage them to send their score and a link to the program to someone else, Wordle-style.

I had a few requirements:

Works on macOS and Linux
Easy to run (no superuser, no separate server, no configuration)
No virtualization, tunnels, or loopback interface since they often have bugs related to MTU
No host packet tracing because fragmentation/reassembly often happens on the network interface

I decided to use a traceroute-style solution. The default mode in traceroute sends packets with a small time-to-live (TTL), or hop limit for IPv6. When a router receives a packet, it subtracts one from the TTL; if the TTL is now zero and the packet isn't for the router itself, it will throw away the packet and send an ICMP Time Exceeded message back to the source. Traceroute then reads the IP address of the sending router from the Time Exceeded message and prints that out. It continues sending packets with increasing TTLs to find the IP address of routers that are increasingly close to the destination.
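
The TTL mechanism can be modeled without touching the network. The following toy simulation is only a sketch: routers are represented by a list of addresses (documentation-range addresses chosen for this example), and the function names are hypothetical.

```python
def send_probe(routers, ttl):
    """Return the router that reports Time Exceeded, or None if delivered."""
    if ttl < 1:
        raise ValueError("TTL must be at least 1")
    for router in routers:
        ttl -= 1          # each router subtracts one from the TTL
        if ttl == 0:      # expired here: this router answers the source
            return router
    return None           # the packet got past every router


def trace(routers, max_hops=30):
    """Collect reporting routers by sending probes with increasing TTLs."""
    hops = []
    for ttl in range(1, max_hops + 1):
        reporter = send_probe(routers, ttl)
        if reporter is None:
            break
        hops.append(reporter)
    return hops


print(trace(["192.0.2.1", "198.51.100.1", "203.0.113.1"]))
# ['192.0.2.1', '198.51.100.1', '203.0.113.1']
```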

Fragquiz uses the same TTL technique, sending each packet with a small TTL, and reading the ICMP Time Exceeded packet sent by the router. The Time Exceeded message includes the header of the packet that triggered the message, which includes the packet size and fragmentation status. On macOS and Linux, an unprivileged user can read (and send) a restricted subset of ICMP messages using the unprivileged ICMP socket type.
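
To make the idea concrete, here is a sketch of pulling the size and fragmentation fields out of an ICMPv4 Time Exceeded message. It is not fragquiz's code. It assumes the buffer starts at the ICMP header (some socket types also hand back the outer IP header first), and the function names and sample addresses are invented for this example.

```python
import struct

ICMP_TIME_EXCEEDED = 11
FLAG_DF = 0x4000      # don't fragment
FLAG_MF = 0x2000      # more fragments follow
OFFSET_MASK = 0x1FFF  # fragment offset, in 8-byte units


def parse_time_exceeded(icmp_message):
    """Extract size and fragmentation status from the quoted IPv4 header."""
    icmp_type, _code = struct.unpack_from("!BB", icmp_message, 0)
    if icmp_type != ICMP_TIME_EXCEEDED:
        raise ValueError("not an ICMP Time Exceeded message")
    quoted = icmp_message[8:]  # skip type, code, checksum, 4 unused bytes
    if len(quoted) < 20:
        raise ValueError("quoted IPv4 header is truncated")
    version_ihl, _tos, total_length, _ident, flags_frag = struct.unpack_from(
        "!BBHHH", quoted, 0)
    if version_ihl >> 4 != 4:
        raise ValueError("quoted packet is not IPv4")
    offset = (flags_frag & OFFSET_MASK) * 8
    more = bool(flags_frag & FLAG_MF)
    return {
        "total_length": total_length,
        "dont_fragment": bool(flags_frag & FLAG_DF),
        "is_fragment": more or offset > 0,
        "fragment_offset": offset,
    }


def build_sample(total_length, flags_frag):
    """Build a synthetic Time Exceeded message for experimenting offline."""
    quoted = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_length, 0x1234,
                         flags_frag, 1, 17, 0,
                         bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]))
    return struct.pack("!BBHI", ICMP_TIME_EXCEEDED, 0, 0, 0) + quoted + bytes(8)


print(parse_time_exceeded(build_sample(1500, FLAG_MF)))
```

A message quoting a first fragment has the more-fragments flag set, while one quoting an unfragmented packet sent with fragmentation disabled has only the don't-fragment flag.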

It worked, but there were a few surprises along the way.

Initially, I assumed that routers would not reassemble a packet with a TTL of one, since they would just decrement the TTL and throw it away as soon as they finished. But the first router that I tested, my home WiFi access point, did exactly that. I added code to automatically probe the network with larger-than-MTU packets with increasing TTL values until it received a Time Exceeded message for a fragment instead of the whole packet, signaling that the packets reached a router that did not reassemble the packet before sending a Time Exceeded message. Then I used that value for the TTL for the packets testing IP fragmentation. Usually the necessary TTL is one or two; the largest TTL I've seen in practice is six, meaning that routers one through five all reassembled fragments before sending Time Exceeded responses.
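
The TTL search described above can be sketched as a simple loop. This is a simulation rather than fragquiz's actual code: each router is reduced to a boolean saying whether it reassembles fragments before replying, and the function name and max_ttl limit are assumptions of this example.

```python
def find_probe_ttl(router_reassembles, max_ttl=30):
    """Return the smallest TTL whose router reports on a raw fragment.

    router_reassembles[i] is True if the router i + 1 hops away reassembles
    fragments before sending Time Exceeded. Returns None if no suitable
    router is found within max_ttl hops.
    """
    limit = min(max_ttl, len(router_reassembles))
    for ttl in range(1, limit + 1):
        if not router_reassembles[ttl - 1]:
            return ttl
    return None


# The worst case mentioned above: routers one through five all reassemble.
print(find_probe_ttl([True, True, True, True, True, False]))  # 6
```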

Some networks simply don't send Time Exceeded messages. I found this out the hard way when fragquiz suddenly stopped working and I spent a few minutes frantically trying to figure out how I'd broken the code. Then I realized that turning off my VPN made fragquiz work again. In my experience, it is rare for a network to correctly generate Time Exceeded messages for both IPv4 and IPv6.

While the unprivileged ICMP listener allowed non-superusers to read Time Exceeded messages on macOS, that code only worked for the superuser on Linux. According to the initial commit message for ICMP sockets, an ICMP Time Exceeded message can only be read by an unprivileged user using IP_RECVERR on the sending socket. I didn't implement that, so currently the Linux version only works as root.

Both macOS and Linux keep a cache of path MTUs discovered by the operating system. The cached path MTU will affect the IP-fragmentation behavior in some cases, which made testing a pain since I had to wait for the cached path MTU to expire. I would like to add an option to clear the path MTU cache in the future. Also, note that the prototype version has a kind of placeholder license, for now, but I plan to release a version with an open license in the future.

I presented fragquiz at RIPE 87, which is a conference for network operators and internet service providers. At the end of the talk, I had the audience play fragquiz by voting with raised hands. Almost every question had people voting both "yes" and "no". Collectively, their score was just below 80%. That means an audience full of professional network engineers and researchers working together didn't even get a "B" on the assignment. I think we can safely conclude that understanding IP fragmentation is hard.

A novel (?) algorithm

Finally, I promised to explain the new algorithm, which I co-created with Salman Aljammaz and James Tucker.

Most path MTU discovery algorithms test one path MTU at a time. They send a packet of a certain size and see whether it got through, then decide what to do next: send a bigger packet, send a smaller packet, or decide that the current estimate of the path MTU is good enough and terminate the search algorithm. This can take several round trips to find the best path MTU.
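
To see where the round trips go, here is a sketch of one such one-probe-at-a-time search, using a plain binary search as a stand-in (real implementations differ in how they pick the next size). The network is simulated by comparing each probe with a known path MTU, and all names are hypothetical.

```python
def sequential_search(true_path_mtu, low, high, tolerance=8):
    """Binary-search for the path MTU, one probe per round trip.

    Assumes low <= true_path_mtu <= high. Returns (estimate, round_trips),
    where the estimate is at most `tolerance` bytes below the true value.
    """
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1")
    round_trips = 0
    while high - low > tolerance:
        probe = (low + high) // 2
        round_trips += 1              # wait for an acknowledgment or timeout
        if probe <= true_path_mtu:    # simulated: the probe got through
            low = probe
        else:                         # simulated: the probe was lost
            high = probe - 1
    return low, round_trips


estimate, trips = sequential_search(1420, low=1280, high=9000)
print(estimate, trips)
```

Each iteration has to wait for the previous probe's result before choosing the next size, so the round trips cannot be overlapped.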

Our first insight was that, as shown by Custura, et al., in the real world there are a small number of likely packet sizes, less than ten. We aren't the first to realize that; in fact, RFC 8899 says: "Implementations could optimize the search procedure by selecting step sizes from a table of common PMTU sizes."

What we did differently is this: we sent ALL of the possible packet sizes at the same time. So if the local MTU is 9000 bytes, then we send packets with sizes of 1280, 1400, 1500, 8000, and 9000 bytes all at the same time. The other end sends an acknowledgment for every packet it sees. Then we set the path MTU to the largest packet size that was acknowledged. It's okay if it's off by a few bytes; most PMTU search algorithms stop probing when they get "close enough."
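
A minimal sketch of that selection step follows. The table holds only the sizes named in the example above, and the `acked` callback is a hypothetical stand-in for "the other end acknowledged a probe of this size"; a real implementation would send the probes concurrently and collect acknowledgments until a timeout.

```python
# Sizes taken from the example in the text; a real table may differ.
COMMON_SIZES = (1280, 1400, 1500, 8000, 9000)


def probe_all(local_mtu, acked, sizes=COMMON_SIZES):
    """Probe every candidate size at once; return the largest one acked.

    Returns None if no probe was acknowledged.
    """
    candidates = [size for size in sizes if size <= local_mtu]
    acknowledged = [size for size in candidates if acked(size)]
    return max(acknowledged) if acknowledged else None


# Simulated network whose real path MTU is 1420 bytes.
print(probe_all(9000, acked=lambda size: size <= 1420))  # 1400
```

In this simulated case the result is 1400 rather than 1420, which is the "off by a few bytes" trade-off mentioned above.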

Every ten minutes, we reprobe the path MTU by sending a packet that is the next MTU size up. If we get an acknowledgment for the larger MTU size, then we know the path MTU has changed, and we reprobe with all of the packet sizes larger than that and smaller than the local MTU. Otherwise we use the current path MTU for another ten minutes. If we start losing packets for any reason, including the path MTU shrinking, we renegotiate the connection from scratch.
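
The periodic step might look like the sketch below. It reads "smaller than the local MTU" as "up to and including the local MTU", which is an assumption, as are the function name and the simulated `acked` callback. The ten-minute timer and the renegotiation on packet loss are left out.

```python
SIZES = (1280, 1400, 1500, 8000, 9000)  # sizes from the example in the text


def reprobe(current_pmtu, local_mtu, acked, sizes=SIZES):
    """Return the path MTU to use until the next reprobe."""
    larger = sorted(s for s in sizes if current_pmtu < s <= local_mtu)
    if not larger:
        return current_pmtu          # already at the largest usable size
    next_up = larger[0]
    if not acked(next_up):
        return current_pmtu          # nothing changed; keep the estimate
    # The path MTU grew: try all remaining larger sizes in one go.
    also_acked = [s for s in larger[1:] if acked(s)]
    return max([next_up] + also_acked)


# The path MTU grew from about 1420 to 8000 since the last probe.
print(reprobe(1400, 9000, acked=lambda size: size <= 8000))  # 8000
```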

This algorithm has a latency of one RTT (round trip time) and is extremely simple: one timer, one static table, and one variable to hold the current path MTU. The downside is that it might use more bandwidth than other path MTU search algorithms if they can find the path MTU with fewer packets.

Reader challenge

I hope you can now state proudly that you also don't understand IP fragmentation. If you're still not sure, here's a fun closing challenge: Download fragquiz and run the following on either Linux or macOS with the standard configuration. (If you've made a TCP connection to bing.com in the last 10 minutes, replace it with a domain you haven't connected to recently. If you're on macOS, you don't need the sudo.)

    $ sudo ./fragquiz -p udp4 -f default -a bing.com:80
    $ sudo ./fragquiz -p tcp4 -f default -a bing.com:80
    $ sudo ./fragquiz -p udp4 -f default -a bing.com:80

Do you get the same answer on the first and third command? Why or why not? Hint: consult the Linux ip(7) man page linked above.

[ Valerie Aurora is a software consultant who enjoys writing ridiculous hacks and solving difficult systems problems. ]

So you think you understand IP fragmentation?

Posted Feb 7, 2024 16:53 UTC (Wed) by ju3Ceemi (subscriber, #102464) [Link] (6 responses)

Somehow, IP fragmentation is easy to understand, for a network architect
The only thing to know: it does not work, so don't care about it

Routers most likely won't fragment IP on the dataplane, so the controlplane will have protections to aggressively ratelimit (or just disable) IP fragmentation

Hence the absence of support in ipv6

So you think you understand IP fragmentation?

Posted Feb 8, 2024 9:16 UTC (Thu) by Sesse (subscriber, #53779) [Link] (4 responses)

Unfortunately, PMTU discovery doesn't actually work anymore either (it's hard to get the return packet through IP-based load balancers), so without fragmentation, you're pretty much stuck between a rock and a hard place.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 5:58 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (3 responses)

Is this because the load balancer doesn’t look at the part of the packet that was included in the ICMP response?

So you think you understand IP fragmentation?

Posted Feb 9, 2024 8:02 UTC (Fri) by Sesse (subscriber, #53779) [Link] (2 responses)

Most OSes and/or routers don't include enough of the original packet in the ICMP response to do that, unfortunately. So you don't know which backend to send the ICMP packet to, and you don't really want to flood it to all of them.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 14:13 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Is that an OS/router bug?

So you think you understand IP fragmentation?

Posted Feb 9, 2024 14:22 UTC (Fri) by Sesse (subscriber, #53779) [Link]

I guess a combination of hardware limitations and the specs not being clear about how much of the packet to return. In any case, it's just a fact of life now.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 17:42 UTC (Thu) by vaurora (subscriber, #38407) [Link]

Just a pedantic note for other readers that IPv6 does support fragmentation - at the source only. This is to avoid breaking applications that, for example, send UDP packets that are larger than even the local MTU and expect them to be fragmented by the OS.

Preventing fragmentation in IPv6 is harder than you think if you have any form of encapsulation going on. Sure, you can rely on a 1280 byte minimum MTU, but if you're tunneling and adding another few bytes to every packet, your applications still expect to be able to send a 1280 byte packet without fragmentation. But your encapsulated packet is now 1288 bytes (or 1320 or whatever).

So you think you understand IP fragmentation?

Posted Feb 7, 2024 18:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (28 responses)

Ugh. MTU is something that should have been added to the IP layer in IPv6, but they completely dropped the spaghetti.

In my perfect world, I'd have added two fields in the IP header, one for the forward MTU and one for the reflected MTU. Each router inspects the forward MTU and replaces it with its own MTU, if it's less than the one that is already there. Then the target host simply copies the resulting value into the "reflected MTU" field and sends it back with the next reply.

Done. MTU can be easily discovered within just one RTT.

Also, instead of fragmenting the packet or sending ETOOBIG, the routers should just truncate the packet and let it reach the destination. No need for ICMP.

Sadly, this is now all just random musings. IPv6 is a failure set in stone.

So you think you understand IP fragmentation?

Posted Feb 7, 2024 19:01 UTC (Wed) by vadim (subscriber, #35271) [Link] (7 responses)

> Also, instead of fragmenting the packet or sending ETOOBIG, the routers should just truncate the packet and let it reach the destination. No need for ICMP.

That sounds like a recipe for breaking almost everything.

Old fashioned protocols like cleartext SMTP will suddenly have bizarre failures, and parts of emails will randomly vanish into the ether.

Other protocols like SSL and SSH will return obscure errors nobody has seen before, or exhibit behaviors like waiting forever for data that doesn't arrive.

Downloads will be randomly corrupted.

IOT and other restricted devices will malfunction in very hard to debug ways.

So you think you understand IP fragmentation?

Posted Feb 7, 2024 20:30 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Nope. Anything that deals with TCP will be fine, truncated packets will be caught inside the IP level, completely transparently to the protocols above it.

On the other hand, in-band MTU signalling would allow VPN protocols to more easily identify the flow that needs corrective actions (MTU clamping).

So you think you understand IP fragmentation?

Posted Feb 7, 2024 22:04 UTC (Wed) by shemminger (subscriber, #5739) [Link] (5 responses)

Truncated packets will cause checksum and other errors.
And truncated packets show up as other types of errors in counters.

So you think you understand IP fragmentation?

Posted Feb 7, 2024 22:07 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

That's kinda the idea. Maybe also add a flag "packet intentionally truncated due to MTU overflow" to make sure stats can be easily broken down by truncation type.

So you think you understand IP fragmentation?

Posted Feb 7, 2024 23:04 UTC (Wed) by jengelh (subscriber, #33263) [Link] (3 responses)

>Truncated packets will cause checksum errors

And it so happens IPv6 has done away with not only (v4-style auto)fragmentation, but also with the IP-level checksum, so YMMV.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 0:44 UTC (Thu) by pizza (subscriber, #46) [Link] (2 responses)

> And it so happens IPv6 has done away with not only (v4-style auto)fragmentation, but also with the IP-level checksum, so YMMV.

Unless you truncate the IPv6 packet smaller than its header length, truncating the IPv6 packet isn't going to cause processing problems.

Meanwhile, the IP payload (eg TCP or UDP) already provides its own checksum that will fail if it gets truncated.

Either way, truncating the packet isn't going to allow the application to receive garbage data.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 13:52 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

No, the checksum will not always fail. The checksum protects against one-bit changes and similar shenanigans. It's not safe for truncation. One in 65536 truncated packets will pass the checksum test. That's way too much for a protocol that's supposed to be reliable.

However, the UDP header also contains … surprise … a length. As long as you don't send IPv6 jumbograms (length word: zero) you're thus still safe there.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 14:20 UTC (Fri) by farnz (subscriber, #17727) [Link]

Even IPv6 jumbograms have a length field, in the Jumbo Payload hop-by-hop header. This is a 32-bit number instead of a 16 bit number, but if you have the full IPv6 header, you'll get a length field to work with (either 16 bits, if not jumbo-sized, or 32 if jumbo-sized).

So you think you understand IP fragmentation?

Posted Feb 8, 2024 8:55 UTC (Thu) by intelfx (subscriber, #130118) [Link] (19 responses)

> In my perfect world, I'd have added two fields in the IP header, one for the forward MTU and one for the reflected MTU. Each router inspects the forward MTU and replaces it with its own MTU, if it's less than the one that is already there. Then the target host simply copies the resulting value into the "reflected MTU" field and sends it back with the next reply.

It’s a neat idea, but does it justify paying the extra space cost in each and every packet?

So you think you understand IP fragmentation?

Posted Feb 8, 2024 13:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (18 responses)

Yes. The extra space is just 4 bytes, and we're losing 20 bits on an unused "flow label" in IPv6 anyway. And some bits can be saved by using multiples of 16, instead of an exact MTU.

Correctly implementing this mechanism would also unlock larger packets. We no longer would be limited by just 1500 bytes.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 13:20 UTC (Thu) by paulj (subscriber, #341) [Link] (2 responses)

Flow label is used in IPv6. Particularly in data-centre environments.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 21:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I'm honestly not familiar with that. I know that a lot of use-cases were proposed throughout the years (QoS, routing hints, etc.) but I'm not aware of any real uptake. Load balancers can't depend on it, because it's controlled by clients.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 11:09 UTC (Fri) by paulj (subscriber, #341) [Link]

Some of the most popular L3 switching/routing ASICs by default now use the flow-label for IPv6 ECMP flow grouping, *not* the (address,port) 4-tuple (which is the flow-ID for v4). E.g. more modern Broadcom Tridents. Anything that cares about ECMP is going to have to set the flow-label. I'm sure popular TCP stacks must do already - given ASICs using it by default. I can't double-check or find an authoritative reference right now, but I'm pretty sure Linux assigns a random flow-label to a TCP connection if the app hasn't set one. So apps generally don't need to care - the kernel has it covered.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 11:50 UTC (Fri) by intelfx (subscriber, #130118) [Link] (14 responses)

> We no longer would be limited by just 1500 bytes

I thought that we are limited by 1500 bytes because the Internet equipment does not support / is not configured to support larger packets. Even if you implement perfect discovery, that won't magically fix the equipment, no?

So you think you understand IP fragmentation?

Posted Feb 9, 2024 12:22 UTC (Fri) by farnz (subscriber, #17727) [Link] (10 responses)

We have a chicken-and-egg situation; the equipment is not configured to support larger packets because path MTU detection is not reliable, fragmentation is not reliable, and the failure case if both of those don't work is random failures of higher level protocols like TCP. This means that there is no reason for anyone to support a larger MTU on Internet-facing equipment, since you're likely to have issues where a required path between two points drops large MTU packets.

In addition, if your MTU is too large, you will frequently experience an RTT delay where something on the path sends an ICMP Too Big your way, and you have to reduce the detected path MTU; in Cyberax's proposal, you can determine the current path MTU with small packets (like those used to establish a TCP connection), and thus not pay that penalty unless the path is changing from larger MTU to smaller MTU during the lifetime of a single connection.

And it's worth noting that we already have examples of devices where there's a large MTU at PHY level, and we aggregate MAC level packets to fill a single PHY packet; it's called WiFi. Having a good way to handle variable MTU (which would include WiFi APs being able to change the path MTU on you, because the client has moved) would reduce overheads. But this needs not just Cyberax's idea, but also a change from switching Ethernet frames to routing IP packets everywhere, so that the WiFi AP is expected to know about per-device MTUs.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 16:06 UTC (Fri) by paulj (subscriber, #341) [Link] (6 responses)

Fragmentation was reliable actually. Until we deprecated it and encouraged vendors to degrade/remove support. Way more reliable than what was meant to supersede it at least.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:12 UTC (Fri) by farnz (subscriber, #17727) [Link] (5 responses)

I last saw fragmentation being reliable in the 1980s; my experience ever since then with IPv4 is that it's hugely unreliable, because all sorts of entities use middleboxes that drop all fragments (rather than forwarding them, or reassembling then forwarding), and thus it's basically useless.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:21 UTC (Fri) by paulj (subscriber, #341) [Link] (4 responses)

Well, yeah, firewalls always suck. They were still uncommon even in the 90s though when I was first online IIRC - I don't know about the 80s. The idiotic middle-boxes only seemed to really become common in the v late 90s and 00s, with the Internet going mainstream. Again, my vague, hand-wavy, recollection.

The data-plane level support was fine though, until IETF moved to deprecate, and then vendors of course did.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:34 UTC (Fri) by farnz (subscriber, #17727) [Link] (3 responses)

My experience was that they were always present, and become more and more of an issue throughout the 90s, until they were basically making fragmentation unusable unless the path was either between two academic institutions or between my ISP at the time and an academic institution.

Additionally, long before middleboxes became widespread, the dataplane support already sucked; there were plenty of Cisco routers that could do forwarding in hardware, but did fragmentation in software on a slow path. Not a problem from home, where my modem was the bottleneck, but a very noticeable issue when at an academic institution where the "wrong" MTU could bring speeds down from megabits per second to tens of kilobits per second.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:39 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

Well, fragmentation being slow is not a problem. It's expected that frags will be slow-path - but at least communication is still possible. Piece-meal upgrades of the common MTU are at least /possible/ then.

Slow but working beats the mess we have today: We will never be able to default to >1500 MTUs, and even then we still don't have reliable networking (VPNs, etc.), and because of that the awesome networking tool of encapsulation is restricted in utility.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:40 UTC (Fri) by paulj (subscriber, #341) [Link]

And that's not to say we should go back to data-plane fragmentation. Impossible, and it might not be technically the best solution anyway. But the current situation is a mess and unfortunate.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:58 UTC (Fri) by farnz (subscriber, #17727) [Link]

By 1990, it was already the case IME that communication was not possible if there were smaller MTUs in the path, unless you were lucky enough to have a path where everything was run by sensible netadmins (usually true of academia), or you were on dial-up (where you had the bottleneck MTU).

And one of the many issues back then was routers with multi-MTU paths that were configured explicitly to not fragment packets because it could overload the CPU; packets were either pre-fragmented, or were dropped. Add in people configuring routers to drop fragments "because security" (which got worse after the ping of death vulnerability was discovered, since that depended on buggy fragment handling), and fragmentation became useless.

The IETF, by limiting fragmentation to the endpoints, were reacting to the state of play in 1990, where many routers already didn't fragment, but dropped packets that were too big.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 16:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> But this needs not just Cyberax's idea, but also a change from switching Ethernet frames to routing IP packets everywhere, so that the WiFi AP is expected to know about per-device MTUs.

Not necessarily. There are two ways this can be done without significant changes:

1. APs can just update the "forward MTU" field in IP packets, they don't need to be full routers for this. Yeah, it's a layering violation, but I doubt that people care too much about that sort of thing anymore.

2. MTU can be added to ARP/ND directly. So the sender will discover the L2 MTU of the destination when it does the initial L2 discovery. WiFi APs are responsible for ARP/ND already, so it even fits in well into the "proper" layered model.

Also, how do APs handle jumbo frames? I need to do some experiments...

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:03 UTC (Fri) by pizza (subscriber, #46) [Link] (1 responses)

> Also, how do APs handle jumbo frames? I need to do some experiments...

Based on my (admittedly not recent) experience, they don't handle it well. Indeed, despite wifi nominally supporting 2300ish byte MTUs, APs routinely fail with anything over 1500 bytes. Because reasons.

(That is the main reason why I reverted back to 1500 byte MTUs on my home networks....)

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:22 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Ha. My AP synthesizes ICMP Too Big packets. Layering violation, you say?

I have 9k MTU within my home network (and I get 10500 megabits over 10GB connections), and so far my WiFi has been behaving pretty well. I can get 1.5GBps download and 2.5GBps upload.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 16:05 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

Nearly all equipment has supported at least 9k MTU for a long while. The problem now is the network protocols we use have no reliable way to figure out how to use larger MTUs.

See my blog link in another comment in this article. It has quotes from an early paper on TCPIP from Kahn and Cerf explaining why it is important to have a reliable network mechanism to allow different MTU networks to inter-op. Unfortunately, we - collectively - failed to heed their wise words.

A reliable mechanism needs to be in-band. E.g., data-plane fragmentation. Side-band end-host solutions - i.e., relying on ICMP messages - have proven to be fragile. Pure end-host probing (i.e. Path-MTU Discovery, in protocol or out) is also inefficient, temporally unreliable, and fragile.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:00 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Nearly all equipment has supported at least 9k MTU for a long while

Except for WiFi. Its PHY MTU is just 2300 bytes.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 17:40 UTC (Fri) by farnz (subscriber, #17727) [Link]

That's the MAC MTU; the PHY MTU can be as large as 2,097,148 bytes in 802.11ac networks (noting that the PHY MTU depends both on static parameters like channel width and on dynamic parameters like the time it will take to transmit the frame). For 802.11ax, the PHY MTU is permitted to go as high as 6,500,631 bytes. Even as early as 802.11n (in 2009), the PHY MTU was allowed to go as large as 65,536 bytes under good conditions.

This is made useful with a much smaller MAC MTU by having aggregation options, so that a single PHY frame contains many MAC frames; the downside is that there is overhead for each and every MAC frame in the PHY frame, which would go down if the MAC frames were larger. There would still be overhead mapping MAC frames into PHY frames, so you wouldn't have as large a MAC MTU as the PHY MTUs, but there would be large MTUs involved.

So you think you understand IP fragmentation?

Posted Feb 7, 2024 18:52 UTC (Wed) by rrolls (subscriber, #151126) [Link] (2 responses)

May I suggest that fragquiz.html advises the reader that:

1. Go is required
2. they should not `tar -xzf` the file in their downloads directory, and have to clean up a mess, like I did :)

Personally, I have no interest in installing Go, so I will give this program a miss, but I enjoyed reading the article nonetheless!

So you think you understand IP fragmentation?

Posted Feb 8, 2024 17:36 UTC (Thu) by vaurora (subscriber, #38407) [Link] (1 responses)

Thanks for the bug reports! Fixed.

Do you have suggestions for what language you would like to see fragquiz written in? Annoyingly I need to set some pretty low-level OS-specific socket options so that needs to be possible.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 11:31 UTC (Fri) by LtWorf (subscriber, #124958) [Link]

Python is generally easier I guess for this sort of thing. It is installed on all Linux systems I presume.

It should be possible to do using just the standard library too.

And it has setsockopt and all of that, so it should work. If it isn't enough you can just do libc calls easily.

But personally I would not rewrite it.
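
As a rough illustration of the setsockopt point, here is a minimal sketch of asking Linux to set the DF bit on a UDP socket using only the Python standard library. It is not taken from fragquiz. The helper name `make_df_udp_socket` is made up for this example, and the numeric fallbacks (10 and 2) are the Linux values of `IP_MTU_DISCOVER` and `IP_PMTUDISC_DO`, used in case the `socket` module does not expose those names. Other operating systems need different options, which this sketch does not attempt.

```python
import socket
import sys

# Linux values from <linux/in.h>, used as fallbacks because not every Python
# build exposes these names on the socket module.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)


def make_df_udp_socket() -> socket.socket:
    """Return an IPv4 UDP socket whose outgoing packets carry DF (Linux only)."""
    if not sys.platform.startswith("linux"):
        raise OSError("this sketch only knows the Linux socket option")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # IP_PMTUDISC_DO: always set DF and never fragment locally.
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    return sock
```

With this option set, sending a datagram larger than the path MTU the kernel currently knows about should fail with EMSGSIZE rather than being fragmented locally.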

So you think you understand IP fragmentation?

Posted Feb 7, 2024 23:04 UTC (Wed) by wahern (subscriber, #37304) [Link]

> The official documentation either didn't exist (macOS) or was hard to understand (Linux).

There are lies, damned lies, and documentation. Documentation is best used to more quickly identify the relevant source code to read. (See, also, lies, damned lies, and comments.)

So you think you understand IP fragmentation?

Posted Feb 8, 2024 9:53 UTC (Thu) by timon (guest, #152974) [Link] (1 responses)

> Some networks simply don’t send Time Exceeded messages. I found this out the hard way

I’m confused by this. Don’t people usually find this out the easy way, as most traceroute outputs contain interspersed asterisks where a hop didn’t return anything?

Am I missing something, or is the author just one of the “lucky” ten thousand here?
https://xkcd.com/1053/

So you think you understand IP fragmentation?

Posted Feb 8, 2024 18:15 UTC (Thu) by vaurora (subscriber, #38407) [Link]

The "hard way" refers to how I found out that (a) I had left my VPN on, (b) my VPN did not send Time Exceeded. I'm not fond of the "I didn't change anything, so how did I break it???" feeling.

However, given that the purpose of this article is to demonstrate that people often don't understand networking as well as they think, including me, it's a reasonable conclusion! My main takeaway is that all kinds of networking oddities are happening in the background and I'm blissfully unaware of them unless I write code to check it.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 10:24 UTC (Thu) by paulj (subscriber, #341) [Link] (9 responses)

Interesting article. Reliably determining Path MTU in a continuous fashion is a difficult problem.

If routers *are* fragmenting packets that have the DF bit set that is... wow. WTF? That's a pretty serious bug. Has the author been able to figure out which vendors' products are doing this?

Also, there is an argument to be made that removing fragmentation was a bad idea. It turns one problem "fragmenting is slow, and should be avoided" into a worse one: "Internet randomly blackholes communications between 2 places, for unfathomable and potentially varying reasons - in a way that is almost impossible to fix". Worse, we are now stuck on 1500 MTU for the Internet, forever. The creators of TCPIP warned of this:

https://paul.jakma.org/2011/06/28/cerf-and-kahn-on-why-yo...

So you think you understand IP fragmentation?

Posted Feb 8, 2024 11:33 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

When the CRA comes into effect, all we need is a serious security bug (heartbleed?) caused by ICMP filtering, and all these router vendors will be scrambling to fix their routers to work properly :-)

Cheers,
Wol

So you think you understand IP fragmentation?

Posted Feb 23, 2024 15:48 UTC (Fri) by sammythesnake (guest, #17693) [Link]

Colour me sceptical on that idea. I imagine that in such a case there would be some vigorously partisan "discussions" of exactly where responsibility lies - is the router responsible for how the source/destination behaves when packets go missing?

I think there's a good argument that occasional missing packets is a normally expected behaviour of "the internet" - a whole lot of the specs for things like TCP/IP exist specifically because of that fact. When it happens unnecessarily, that's certainly a *performance* issue, but not a *security* issue in some random part of the internet, rather in any end-point that reacts by leaking information or whatever.

Any endpoint that can't stay as safe as a "connection failed" error really shouldn't be dealing with anything security related...

If an intermediary on the path *rewrites* stuff, that's a much harder thing to justify by this kind of argument, but even then I think the more reasonable next step is ensuring integrity/privacy via end-to-end encryption because the internet is a hostile environment full of baddies of all kinds, not just crappy middleboxen (e.g. a whole alphabet soup of state agencies who absolutely do not share my priorities with regard to my internet traffic(!))

So you think you understand IP fragmentation?

Posted Feb 8, 2024 17:52 UTC (Thu) by vaurora (subscriber, #38407) [Link] (3 responses)

> If routers *are* fragmenting packets that have the DF bit set that is... wow. WTF? That's a pretty serious bug. Has the author been able to figure out which vendors' products are doing this?

Sounds like I've written something unclear since I haven't observed that! If you let me know which part sounds like that is happening, I'll try to reword.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 18:06 UTC (Thu) by paulj (subscriber, #341) [Link] (2 responses)

Well, you sent >path-MTU packets and observed fragments, and this was surprising, right? Which means you must have had DF set on the packets - as is common, particularly if probing for pMTU? And your slide deck on the fragquiz, on slide 9 describing the solution says "Set the don't fragment socket option" - which makes sense, given what you're trying to measure (?).

And you observed fragments, despite having set DF, right? Which would be surprising, naturally.

Otherwise, you did /not/ set DF, sent various size packets and observed fragments - which is... not surprising... ?

I'm confused now. ;)
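
The distinction being drawn here (DF set versus not set on a packet larger than the next-hop MTU) can be sketched as a small model of what a spec-following IPv4 router is expected to do. This is an illustration only: the function name `forward` is invented, and it assumes a 20-byte IPv4 header with no options.

```python
IPV4_HEADER_LEN = 20  # bytes, assuming no IP options


def forward(packet_len: int, next_hop_mtu: int, df: bool) -> list[int]:
    """Return the total lengths of the packets a router would emit.

    An empty list means the packet was dropped; a router would normally
    also send an ICMP "fragmentation needed" message back to the source.
    """
    if packet_len <= next_hop_mtu:
        return [packet_len]          # fits: forwarded unchanged
    if df:
        return []                    # too big and DF set: dropped
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_payload = (next_hop_mtu - IPV4_HEADER_LEN) // 8 * 8
    if max_payload <= 0:
        raise ValueError("next-hop MTU too small to carry any payload")
    payload = packet_len - IPV4_HEADER_LEN
    sizes = []
    while payload > 0:
        chunk = min(payload, max_payload)
        sizes.append(chunk + IPV4_HEADER_LEN)
        payload -= chunk
    return sizes
```

For example, `forward(1500, 1400, df=True)` returns an empty list, while `forward(1500, 1400, df=False)` returns two fragment lengths. Seeing fragments in a capture therefore suggests that DF was not set on the original packets.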

So you think you understand IP fragmentation?

Posted Feb 8, 2024 18:29 UTC (Thu) by vaurora (subscriber, #38407) [Link] (1 responses)

I see! No, everyone I talked to *believed* that the application was already setting the DF bit on all the packets, and I believed them. It wasn't until the PMTU discovery code failed that I did the packet dump and realized the application was not setting DF.

My takeaway from that was that multiple network experts were writing code assuming that the packets weren't fragmentable when they actually were, and no one noticed for years because things just worked anyway.

The solution on slide 9 is for fragquiz, which is a packet generator, not an implementation of DPLPMTUD (which I pronounce "dpblublubblubbbbd").

So you think you understand IP fragmentation?

Posted Feb 9, 2024 11:01 UTC (Fri) by paulj (subscriber, #341) [Link]

Ah, ok. That isn't quite clear from the article. :) E.g., the bit that says:

> "They were confident that the software already only sent packets that couldn't be fragmented, so all we had to do is send the right size of probe packets, using a built-in ping feature, and record the response. Imagine our surprise when the packet captures turned out to be full of fragmented packets."

The reader here can reasonably conclude from "all we had to do is send the right size of probe" that we are talking about software under your control to do probing. And "surprise when the packet captures turned out to be full of [frags]" suggests surprise at seeing frags, which would imply the software /had/ set DF - for otherwise frags on large packets are not that surprising.

Perhaps a bit more word-smithing there? Something like "Surprise at seeing [frags], even where [condition where an expert wouldn't always predict frags]"? Or some other change there?

Great seeing detailed networking articles on LWN, and thanks for the tool! More please. :)

So you think you understand IP fragmentation?

Posted Feb 9, 2024 3:51 UTC (Fri) by shemminger (subscriber, #5739) [Link] (2 responses)

The big problem I have seen is that many environments set routers to block/drop all ICMP packets.
That includes Echo (ping), Time Exceeded, and fragmentation.
It makes dealing with these kinds of black holes hard.

So you think you understand IP fragmentation?

Posted Feb 9, 2024 12:28 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

Yep. Mentioned that also in the blog post. This used to be a terrible problem some time ago (??? 15+ years ago?). We had a number of firewall vendors who had "block all ICMP" as their default policies - either because the vendor was institutionally stupid, or because their customers were and demanded it (possibly facilitated by dumb tech industry reviewers?). It /seems/ to be a /little/ better now - i.e. firewalls have better defaults now.

However, you still have dumb IT people who buy firewalls and then click at stuff without really knowing what they're doing. "That option to block all ICMP, that must be more secure!", and some perverse security idiots too.

A lesson here for protocol designers however has to be that side-channel / out-of-band control/error channels are not a good idea.

So you think you understand IP fragmentation?

Posted Feb 22, 2024 6:50 UTC (Thu) by fest3er (guest, #60379) [Link]

Grok! I used to think that blocking all ICMP was A Good Thing™ until I read and understood that IPv4 cannot work without ICMP. Then I modified Smoothwall Express' iptables to accept ICMP messages while retaining the option to tell the kernel to block or allow PING from the internet.

So you think you understand IP fragmentation?

Posted Feb 8, 2024 17:51 UTC (Thu) by auerswal (subscriber, #119876) [Link] (3 responses)

Thanks for an interesting article!

I plan to look at the "fragquiz" program and might even try it, when it is released under some free software license and no longer requires root (on GNU/Linux).

> in the real world there are a small number of likely packet sizes

This was also mentioned in RFC 1191[1] from 1990. The actual values obviously change through time. The 1492 bytes entry from the RFC 1191 plateau table is still relevant today for some consumer-grade Internet connections.

> What we did differently is this: we sent ALL of the possible packet sizes at the same time.

I would have called this "obvious" when testing a small number of packet sizes (thus I used this idea in 2018 in a little shell script[2] for personal use, modulo a rate limit to increase the chances of obtaining useful results). ;-)

Br,
Erik

[1] https://www.rfc-editor.org/rfc/rfc1191.html
[2] https://github.com/auerswal/sft/blob/master/pmtud

So you think you understand IP fragmentation?

Posted Feb 8, 2024 18:48 UTC (Thu) by vaurora (subscriber, #38407) [Link] (2 responses)

Great shell script implementation of PMTUD! (That is not a sentence I was expecting to write.)

I do agree that the concept of "probe this pre-defined table of MTU sizes" is obvious, which is why it is in the RFC. I just haven't yet seen a PMTUD search algorithm that sends all the probes simultaneously, which teeeeeeechnically your script doesn't seem to do either. ;) I'm still excited to see the implementation.

I'm not sure why sending all the probes simultaneously is not commonly done. My theory is that people get stuck in a stream-oriented TCP-style mental model when they design PMTUD stuff. One of my colleagues thinks it's a holdover from when long-distance network links were much lower bandwidth.
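
To make the "send all the probes simultaneously" idea concrete, here is a toy sketch that fires every candidate size in one burst and keeps the largest one that gets through. It is a simulation under stated assumptions, not the VPN's actual implementation. The names `probe_all_at_once` and `simulated_path` are invented, the candidate sizes are simply figures mentioned in this thread, and the "network" is a stand-in that silently drops anything over the path MTU.

```python
from typing import Callable, Iterable, Optional

# Candidate sizes taken from figures mentioned in this thread; illustrative only.
CANDIDATE_SIZES = (1280, 1400, 1500, 8000, 9000)


def probe_all_at_once(
    sizes: Iterable[int],
    send_burst: Callable[[list[int]], set[int]],
) -> Optional[int]:
    """Send every candidate size in one burst; return the largest acknowledged."""
    acked = send_burst(sorted(sizes, reverse=True))
    return max(acked, default=None)


def simulated_path(path_mtu: int) -> Callable[[list[int]], set[int]]:
    """Stand-in for a real network: probes larger than path_mtu are lost."""
    def send_burst(probe_sizes: list[int]) -> set[int]:
        return {size for size in probe_sizes if size <= path_mtu}
    return send_burst
```

On a simulated path with a 1492-byte MTU, `probe_all_at_once(CANDIDATE_SIZES, simulated_path(1492))` yields 1400 after a single round trip. A sequential search would need one round trip (or timeout) per probe.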

So you think you understand IP fragmentation?

Posted Feb 8, 2024 22:07 UTC (Thu) by auerswal (subscriber, #119876) [Link] (1 responses)

Regarding not sending all probes in parallel in my script:

On the one hand, ICMP Echo Responses are often rate limited. Sending all probes in a burst thus likely results in some missing responses because of the rate limit, not because of requests dropped due to too small PMTU. At least that was my experience back when I started writing the script. ;-)

If the largest probes in a "plateau search" are sent first, the first arriving probe (so not yet rate limited) has a good probability of being the largest probe fitting inside the PMTU. But even that may not be true if probes take different paths.

On the other hand, I do remember times when there was little bandwidth available and do not want to send one big burst as fast as possible. Some form of packet pacing is just more considerate with respect to other users of the network (perhaps I have too often encountered situations where adding packet pacing via some network device configuration shenanigans "solved" "network" problems of applications…). I am also comfortable waiting a bit for my manually triggered PMTUD to finish, which may not be the case for some VPN product user who might complain over long startup times.

Sending (1280+1400+1500=4180) bytes in one burst is still less than the TCP IW10 burst without pacing. (1280+1400+1500+8000+9000=21180) is even over the TCP IW10 maximum initial burst, but hopefully the larger packets would not get far, e.g., stay inside a data center. Together with a cooperating target (the other side of a VPN tunnel) this seems OK. (Using a 1280 bytes floor is a sign of the times, with more and more IPv6 all around. :-) )

I wrote my PMTUD script to work around a VPN missing a reliable built-in PMTUD mechanism. Therefore I use ICMP Echo although both rate and packet size limits are commonly encountered over the Internet. I do not need a special responder, just a server I want to reach via VPN that answers pings without size limit. As such I really like that you added PMTUD to the VPN product, and also put some thought into it. :-)
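
The burst arithmetic above can be checked in a few lines. This sketch assumes an MSS of 1460 bytes (a 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers, no options), which makes the IW10 initial window 14600 bytes. The constant and function names are invented for the example.

```python
MSS = 1460            # assumed TCP payload per segment on a 1500-byte MTU path
IW10_BYTES = 10 * MSS  # initial window of ten full-sized segments

SMALL_BURST = (1280, 1400, 1500)
LARGE_BURST = (1280, 1400, 1500, 8000, 9000)


def burst_bytes(probe_sizes) -> int:
    """Total bytes put on the wire if all probes are sent back to back."""
    return sum(probe_sizes)


for name, burst in (("small", SMALL_BURST), ("large", LARGE_BURST)):
    total = burst_bytes(burst)
    relation = "under" if total < IW10_BYTES else "over"
    print(f"{name} burst: {total} bytes, {relation} IW10 ({IW10_BYTES} bytes)")
```

Under that assumption the three-probe burst stays well below an unpaced IW10 burst, while the five-probe burst exceeds it, matching the figures in the comment.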

So you think you understand IP fragmentation?

Posted Feb 9, 2024 10:34 UTC (Fri) by vaurora (subscriber, #38407) [Link]

I appreciate you spelling out your reasoning and experience with ICMP ping-based PMTUD. We came to similar conclusions. Doing DPLPMTUD inside an encrypted VPN protocol is far easier than using anything ICMP-based: intermediate routers can’t inspect and filter, we can send pad data, and we can’t get spoofed. But I am absolutely going to be using your script to double check my future work. :)

So you think you understand IP fragmentation?

Posted Feb 19, 2024 10:32 UTC (Mon) by da4089 (subscriber, #1195) [Link]

It's great to see another article from Valerie in LWN!
I particularly remember the union mounts article/work, and it's hard to believe that it was ~14 (!) years ago.