November 15, 2021

Aah, the Unix pipe. Friend (and occasional foe) of any Unix user, cornerstone of the Unix Philosophy, and universal building block of many a SysAdmin's tool chest. We all use it day in and day out without thinking much about how it works. And honestly, it's not all that difficult to understand: You create a pipe by calling, well, the pipe(2) system call, which provides you with two file handles: the read end and the write end of the pipe. To communicate with another process, you then fork(2), close the read end in one process and the write end in the other, et voilà, you can now communicate through the pipe.

Interprocess communication (IPC) through the pipe happens in kernel space, and obviously data written into the pipe is not stored in the filesystem. Pipe behavior is reasonably straightforward: reading from a pipe where the write end has been closed yields EOF; writing into a pipe where the read end has been closed raises SIGPIPE:

$ cat sigpipe.c
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void
sigpipe(int s) {
	fprintf(stderr, "Caught signal #%d: SIGPIPE!\n", s);
}

int
main() {
	int fd[2];

	if (signal(SIGPIPE, sigpipe) == SIG_ERR) {
		err(EXIT_FAILURE, "signal");
		/* NOTREACHED */
	}

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	/* Closing the read-end will trigger SIGPIPE
	 * when we then try to write. */
	(void)close(fd[0]);

	if (write(fd[1], "x", 1) < 0) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}

	(void)printf("We never get here.\n");
}
$ cc -Wall -Werror -Wextra sigpipe.c
$ ./a.out
Caught signal #13: SIGPIPE!
a.out: write: Broken pipe
$

But of course we might also encounter the situation where there is a process with the read handle open, but data is just not consumed (or not consumed as quickly as we are writing the data):

$ cat sigpipe.c | ( sleep 10; wc; )
      34      83     600
$

Now, for 10 seconds, nothing was reading the data. So... where did it go? Well, we know that pipe I/O is (by default, anyway) blocking, meaning the read(2) call on a pipe will not return until there is data available. But what about writes? In the above example, was data only written once wc began reading? It doesn't look like that's the case:

$ ( echo "message"; echo "message written" >&2; ) | cat
message written
message
$ ( echo "message"; echo "message written" >&2; ) | ( sleep 3; cat; )
message written
[3 seconds later:]
message
$

That is, we see that "message" was written to the pipe even though nothing was currently reading. So here, just like in the above "cat sigpipe.c" example, the data was held somewhere in between. In other words, the kernel has a buffer for the pipe, allowing the write to complete even though no process is reading from the pipe.

So... how big is this pipe buffer? It looks like we can write quite a lot of data:

$ dd if=/dev/zero bs=1024k count=10240 2>/dev/null | (sleep 10; wc -c; )
10737418240
$

That's ten gigabytes of data! But of course the kernel will not buffer ten GB for you. Rather, writes to the pipe are blocking by default, but the write does not block waiting for a reader to consume the data; it blocks based on the internal buffer size of the pipe: initial writes will immediately succeed up until the pipe buffer is filled; subsequent writes will block until the reader has consumed some data and thus freed up space in the pipe buffer. Ok, so far, so uneventful - that's how blocking I/O works. But how big is this internal pipe buffer?
Well, the answer won't surprise you: it depends. :-)

POSIX specifies that write requests to a pipe or FIFO shall be handled in the same way as a regular file, with a handful of exceptions relating to PIPE_BUF and to blocking behavior. But PIPE_BUF (either defined or retrievable via e.g., fpathconf(2)) is only relevant to how many bytes need to be written atomically to avoid interleaving with multiple writers; it says nothing about the total pipe buffer. We can, however, find out what that total is by experimentation: we simply continue to write data into a pipe without a reader until the write call would block:

$ cat pipebuf.c
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main() {
	int count = 0;
	int fd[2];

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	if (fcntl(fd[1], F_SETFL, O_NONBLOCK) < 0) {
		err(EXIT_FAILURE, "fcntl set flags");
		/* NOTREACHED */
	}

	while (1) {
		if (write(fd[1], "x", 1) < 0) {
			if (errno == EAGAIN) {
				break;
			}
			err(EXIT_FAILURE, "write");
			/* NOTREACHED */
		}
		count++;
	}

	(void)printf("%d\n", count);
}
$ cc -Wall -Werror -Wextra pipebuf.c
$ ./a.out
16384
$

Ok, so on this particular system (NetBSD/amd64 9.2), the internal pipe buffer size is 16K. It's the same on OmniOS 5.11, while on e.g., macOS 11.6.1, Linux 5.8.15 / Fedora 33, and FreeBSD 13.0 it's 64K.

NetBSD also allows you to retrieve the free space in a file descriptor's send queue via an ioctl(2) query for FIONSPACE, together with the number of bytes immediately available for reading (FIONREAD) and the number of bytes in the write queue (FIONWRITE), so we can query those as well. (FreeBSD appears to only support these for sockets and socketpairs.) And trying to write data in chunks rather than one byte at a time, we also discover that NetBSD lets us write more than 16K:
$ cat pipebuf2.c
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int
main(int argc, char **argv) {
	int fd[2];
	int count = 0;
	int size;
	char *buf;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s num\n", argv[0]);
		exit(EXIT_FAILURE);
		/* NOTREACHED */
	}

	if ((size = (int)strtol(argv[1], NULL, 10)) < 1) {
		(void)fprintf(stderr, "Please provide a number (>0).\n");
		exit(EXIT_FAILURE);
		/* NOTREACHED */
	}

	if ((buf = malloc(size)) == NULL) {
		err(EXIT_FAILURE, "malloc");
	}
	memset(buf, 'x', size);

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}
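
	/*
	 * Make the write end non-blocking so that the write(2) below
	 * returns instead of blocking once the pipe buffer is full.
	 */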
	if (fcntl(fd[1], F_SETFL, O_NONBLOCK) < 0) {
		err(EXIT_FAILURE, "fcntl set flags");
		/* NOTREACHED */
	}
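
	/*
	 * A single large non-blocking write(2) need not be all-or-nothing:
	 * it may return a short count, telling us how much data the kernel
	 * was willing to buffer in one go.
	 */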
	if ((count = write(fd[1], buf, size)) < 0) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}

	(void)printf("%d\n", count);
}
$ cc -Wall -Werror -Wextra pipebuf2.c
$ ./a.out 16384
16384
$ ./a.out 16385
16385
$ ./a.out 102400
65536
$

That's right: when we write a large chunk of data, NetBSD automatically bumps the pipe buffer size up to 64K - a so-called "big pipe". And Linux lets you increase the size of the pipe buffer via the F_SETPIPE_SZ fcntl(2) (rounded up to a power-of-two multiple of the page size, and capped at /proc/sys/fs/pipe-max-size for normal users).

With these differences already observed, let's compare the behavior of the different IPC buffers across the different OS, since what holds for pipes should conceptually be similar for the other forms of on-system IPC, such as FIFOs, sockets in the PF_LOCAL (or AF_UNIX) domain, and socketpairs. To do that, I wrote a program that helps us test the different conditions: ipcbuf. The tool has two modes: looping and chunking. In looping mode, we write incrementally sized chunks of data in a loop; in chunking mode, we write a few consecutive chunks.

Sample invocations using looping mode on NetBSD 9.2:

$ ipcbuf
Testing pipe buffer size in loop mode.
Loop starting with 1 byte and doubling each iteration.
PIPE_BUF : 512
_PC_PIPE_BUF : 512
FIONSPACE : 16384
Wrote 1 out of 1 byte. (Total: 1)
Wrote 2 out of 2 bytes. (Total: 3)
Wrote 4 out of 4 bytes. (Total: 7)
Wrote 8 out of 8 bytes. (Total: 15)
Wrote 16 out of 16 bytes. (Total: 31)
Wrote 32 out of 32 bytes. (Total: 63)
Wrote 64 out of 64 bytes. (Total: 127)
Wrote 128 out of 128 bytes. (Total: 255)
Wrote 256 out of 256 bytes. (Total: 511)
Wrote 512 out of 512 bytes. (Total: 1023)
Wrote 1024 out of 1024 bytes. (Total: 2047)
Wrote 2048 out of 2048 bytes. (Total: 4095)
Wrote 4096 out of 4096 bytes. (Total: 8191)
Wrote 8192 out of 8192 bytes. (Total: 16383)
Wrote 1 out of 16384 bytes. (Total: 16384)
FIONWRITE : 16384
Observed total : 16384
Draining...
FIONREAD : 16384
FIONREAD : 8192
FIONREAD : 0
Read : 16384

$ ipcbuf 16385
Testing pipe buffer size in loop mode.
Loop starting with 16385 bytes and doubling each iteration.
PIPE_BUF : 512
_PC_PIPE_BUF : 512
FIONSPACE : 16384
Wrote 16385 out of 16385 bytes. (Total: 16385)
Wrote 32770 out of 32770 bytes. (Total: 49155)
Wrote 16381 out of 65540 bytes. (Total: 65536)
FIONWRITE : 65536
Observed total : 65536
Draining...
FIONREAD : 65536
FIONREAD : 32766
FIONREAD : 0
Read : 65536

Since we've already observed that using different chunk sizes may yield different results, let's compare the buffer size used by pipes and FIFOs in a single-byte loop (ipcbuf -l 1 0) versus the maximum amount we can write in chunks (ipcbuf -c chunk1 chunk2):
[Comparison table of pipe and FIFO buffer sizes across the different OS versions not reproduced here; note that on NetBSD, the pipe numbers reflect the 64K "big pipe" if the initial write was larger than 16K bytes.]

socketpair(2) and socket(2)

Now what about socketpair(2) and socket(2)? For those, we can check SO_SNDBUF, but is that accurate? And is there a difference between DGRAM and STREAM sockets? We know that datagrams in the PF_INET or PF_INET6 domain don't exert this kind of back-pressure: UDP is a connectionless, unreliable protocol, so packets need not be consumed by a reader and can silently be dropped. But datagrams in the PF_LOCAL domain are reliable and thus still buffered in the kernel! And this is where things get interesting...

$ ipcbuf -R 2048 -t socketpair -l 1 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 1 byte, increasing by 0 bytes each time.
recvspace : 16384
SO_RCVBUF : 2048
SO_SNDBUF : 2560
FIONSPACE : 2560
Wrote 1 out of 1 byte. (Total: 1)
Wrote 1 out of 1 byte. (Total: 2)
Wrote 1 out of 1 byte. (Total: 3)
Wrote 1 out of 1 byte. (Total: 4)
Unable to write even a single byte: No buffer space available
Iterations : 4
FIONWRITE : 0
Observed total : 4
Draining...
FIONREAD : 12
FIONREAD : 9
FIONREAD : 6
FIONREAD : 3
FIONREAD : 0
Read : 4
$

Uhm... wait. What? We only get to write 4 bytes? Why does the receiving socket claim there are 12 bytes? And what happens if we try to write more bytes?

$ ipcbuf -R 2048 -t socketpair -l 400 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 400 bytes, increasing by 0 bytes each time.
recvspace : 16384
SO_RCVBUF : 2048
SO_SNDBUF : 2560
FIONSPACE : 2560
Wrote 400 out of 400 bytes. (Total: 400)
Wrote 400 out of 400 bytes. (Total: 800)
Wrote 400 out of 400 bytes. (Total: 1200)
Wrote 400 out of 400 bytes. (Total: 1600)
Unable to write even a single byte: No buffer space available
Iterations : 4
FIONWRITE : 0
Observed total : 1600
Draining...
FIONREAD : 1608
FIONREAD : 1206
FIONREAD : 804
FIONREAD : 402
FIONREAD : 0
Read : 1600

Looks like we can only ever perform 4 writes in sequence. Or...?
$ ipcbuf -R 2048 -t socketpair -l 856 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 856 bytes, increasing by 0 bytes each time.
recvspace : 16384
SO_RCVBUF : 2048
SO_SNDBUF : 2560
FIONSPACE : 2560
Wrote 856 out of 856 bytes. (Total: 856)
Wrote 856 out of 856 bytes. (Total: 1712)
Wrote 330 out of 856 bytes. (Total: 2042)
Unable to write even a single byte: No buffer space available
Iterations : 3
FIONWRITE : 0
Observed total : 2042
Draining...
FIONREAD : 2048
FIONREAD : 1190
FIONREAD : 332
FIONREAD : 0
Read : 2042
$

Huh, that's 3 writes, and the last one doesn't even fill its dgram! Let's try to write slightly larger chunks:
$ ipcbuf -R 2048 -t socketpair -l 857 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 857 bytes, increasing by 0 bytes each time.
recvspace : 16384
SO_RCVBUF : 2048
SO_SNDBUF : 2560
FIONSPACE : 2560
Wrote 857 out of 857 bytes. (Total: 857)
Wrote 857 out of 857 bytes. (Total: 1714)
Unable to write even a single byte: No buffer space available
Iterations : 2
FIONWRITE : 0
Observed total : 1714
Draining...
FIONREAD : 1718
FIONREAD : 859
FIONREAD : 0
Read : 1714
$

That's less data than in the previous invocation! What happens if we max out SO_SNDBUF right away?

$ ipcbuf -R 2048 -t socketpair -l 2560 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 2560 bytes, increasing by 0 bytes each time.
recvspace : 16384
SO_RCVBUF : 2048
SO_SNDBUF : 2560
FIONSPACE : 2560
Wrote 2046 out of 2560 bytes. (Total: 2046)
Iterations : 1
FIONWRITE : 0
Observed total : 2046
Draining...
FIONREAD : 2048
FIONREAD : 2048
FIONREAD : 0
Read : 2046
$

Ok, so there's a lot going on here, and honestly, it took me some time to wrap my head around these limitations. Since we're using sockets and datagrams, we now have to look at how the OS handles network traffic, and in BSD-based systems, that involves memory buffers, or mbufs. The use of mbufs is best explained in Chapter 2 of TCP/IP Illustrated, Volume 2 by W. Richard Stevens, but let me summarize in a nutshell and for our context:

For any datagram, the system allocates a number of these mbufs, a fixed-size data structure of 512 bytes (in the case of NetBSD/amd64, which I'll use as the reference for the remainder here). Within this fixed size, however, there are different types of mbufs that may hold different amounts of data. Multiple mbufs are linked into an mbuf chain; the first mbuf in the chain contains a struct pkthdr of 56 bytes, and each mbuf also contains a struct m_hdr of another 56 bytes. This leaves 512 - 56 - 56 = 400 bytes in that first mbuf for data storage. If we want to write more than 400 bytes, we allocate a second mbuf and chain it to the first. Since this is not the first mbuf in the chain, it doesn't need a pkthdr, and so can store 456 bytes. Now if we want to write more data than these 400 + 456 = 856 bytes, then instead of using a second data mbuf, we allocate a different type, an expanded mbuf, that references a separate area of memory for the data, a so-called mbuf cluster.

Simplified, these three types of mbufs might look like so:

[Diagram: three simplified mbuf layouts. An MT_HEADER mbuf carries a 56-byte m_hdr plus a 56-byte pkthdr, leaving 400 bytes for data; a plain MT_DATA mbuf carries only the m_hdr and can thus hold 456 bytes of data; an MT_DATA mbuf with the M_EXT flag stores no data itself, its mh_data pointer instead referencing a 2048-byte mbuf cluster.]

When we put together a datagram for our socket, we generate the appropriate mbuf chain and prepend an additional mbuf of type MT_SONAME. For our different sized writes, we then get:
[Diagram: for a write of up to 400 bytes, the datagram consists of an MT_SONAME mbuf holding the socket name, linked to a single MT_DATA mbuf holding the data (with its pkthdr area now unused).]
[Diagram: for a write of 401 to 856 bytes, the datagram consists of an MT_SONAME mbuf linked to two MT_DATA mbufs: the first holds 400 bytes of data (its pkthdr area now unused), the second holds the remaining bytes.]
[Diagram: for a write of more than 856 bytes, the datagram consists of an MT_SONAME mbuf linked to a single MT_DATA mbuf with the M_EXT flag, whose mh_data pointer references a 2048-byte mbuf cluster holding the data.]

Now the amount of data we can write to a socket is limited in sys/socketvar.h:

/*
 * How much space is there in a socket buffer (so->so_snd or so->so_rcv)?
 * Since the fields are unsigned, detect overflow and return 0.
 */
static __inline u_long
sbspace(const struct sockbuf *sb)
{
	KASSERT(solocked(sb->sb_so));
	if (sb->sb_hiwat <= sb->sb_cc || sb->sb_mbmax <= sb->sb_mbcnt)
		return 0;
	return lmin(sb->sb_hiwat - sb->sb_cc, sb->sb_mbmax - sb->sb_mbcnt);
}

sb_hiwat is SO_RCVBUF, while sb_mbmax is set in the sbreserve function in sys/kern/uipc_socket2.c to the minimum of 2 * SO_RCVBUF and SB_MAX (262144 on this system). In other words, the maximum amount we can write is limited either by the datagram bytes fitting into SO_RCVBUF, or by the bytes accounted for the mbufs holding the data staying below 2 * SO_RCVBUF. With this in mind, we can then see why we get the different numbers we observed above, with SO_RCVBUF = 2048:
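To make the numbers above concrete, here's a rough accounting -- my own back-of-the-envelope reading, assuming that sb_mbcnt is charged the full 512 bytes for each mbuf plus 2048 bytes for each cluster, and that every datagram on the (unbound) socketpair carries two bytes of socket name. With SO_RCVBUF = 2048, sb_mbmax = 2 * 2048 = 4096:

Writes of 1 or 400 bytes: each datagram needs an MT_SONAME mbuf plus one data mbuf, i.e., 1024 bytes of mbuf accounting, so after 4 writes sb_mbcnt hits 4096 and the next write fails -- 4 writes, no matter whether we send 1 or 400 bytes each time.

Writes of 856 bytes: each datagram needs two data mbufs (1536 bytes of accounting); after two writes sb_cc is 2 * 858 = 1716, and the third write can only squeeze 2048 - 1716 - 2 = 330 more data bytes under the sb_hiwat limit.

Writes of 857 bytes: the data no longer fits into two mbufs, so we get a cluster (512 + 512 + 2048 = 3072 bytes of accounting); two datagrams put sb_mbcnt at 6144, well past sb_mbmax, so the third write finds no space left and we stop at 2 * 857 = 1714 bytes.

Writes of 2560 bytes: the very first write is capped by sb_hiwat: 2048 bytes minus 2 bytes of socket name leaves the 2046 bytes we observed.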
Now, then, what's the smallest amount of data we can write and still get blocked? Since we can write datagrams with 0 bytes of payload, we can actually loop around and around and then be entirely limited by the mbuf accounting, but that's not meaningfully different from single-byte writes.

But what is the largest amount of data we can write into a socketpair? If we don't perform small writes (i.e., 0 to MINCLSIZE bytes), then we're not bound by sb_mbcnt, but instead by SO_RCVBUF (on the total amount of data) and SO_SNDBUF (on the maximum size of the datagram). If SO_SNDBUF = SO_RCVBUF = chunk size, we can write SO_RCVBUF - 2 bytes in a single dgram; otherwise, we'll end up losing some more data capacity to overhead.

The same behavior as for socketpairs also holds for sockets in the PF_LOCAL domain, by the way, only there we observe a dgram overhead of 106 bytes (presumably the full size of a struct sockaddr_un: 104 bytes of sun_path plus the sun_len and sun_family bytes):

$ ipcbuf -R 1024 -t socket -l 1 0
Testing PF_LOCAL DGRAM socket buffer size in loop mode.
Loop starting with 1 byte, increasing by 0 bytes each time.
recvspace : 16384
SO_SNDBUF : 2560
SO_RCVBUF : 1024
FIONSPACE : 2560
Wrote 1 out of 1 byte. (Total: 1)
Wrote 1 out of 1 byte. (Total: 2)
Unable to write even a single byte: No buffer space available
Iterations : 2
FIONWRITE : 0
Observed total : 2
Draining...
FIONREAD : 214
FIONREAD : 107
FIONREAD : 0
Read : 2
$

Finally, we can also use a socketpair(2) with SOCK_STREAM, which changes things once more just a little bit: a stream-oriented socket does not require datagrams, meaning it's not limited by SO_RCVBUF as far as the bytes to send are concerned, but it remains subject to the sb_mbcnt limitation described above.

socketpair(2) OS comparison

Now comparing this behavior across our different OS versions, we observe:
[Table comparing the socketpair(2) buffer limits across the different OS versions not reproduced here; the exact totals depend on SO_SNDBUF and chunk size.]

The difference between NetBSD, FreeBSD, and macOS is negligible: the latter two appear to use 16 bytes for the socketpair sockname. Linux and OmniOS, on the other hand, are entirely different: Linux doesn't use mbufs, but skbuffs, and I don't even know what OmniOS / Solaris uses -- and please don't tell me it's STREAMS-based or something like that.

On OmniOS, SO_SNDBUF and SO_RCVBUF appear not to matter, and I can write up to two chunks of 64K for a total of 131072 bytes. On Linux, an SO_SNDBUF value < 2304 is bumped up (possibly to a page-size multiple plus some overhead?), while for any value larger than 2304 Linux will go "Well, surely you meant twice that number, so why don't I go ahead and double the number you gave me, how about it?" (see socket(7)). Writing data appears to ignore SO_RCVBUF:

linux$ ipcbuf -R 512 -S 16384 -t socketpair -l 16384 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 16384 bytes, increasing by 0 bytes each time.
max_dgram_qlen : 512
SO_RCVBUF : 2304
SO_SNDBUF : 32768
Wrote 16384 out of 16384 bytes. (Total: 16384)
Wrote 16384 out of 16384 bytes. (Total: 32768)
Unable to write 16384 more bytes: Resource temporarily unavailable
Iterations : 2
SIOCOUTQ : 41472
Observed total : 32768
Draining...
FIONREAD : 16384
FIONREAD : 16384
FIONREAD : 0
Read : 32768
linux$

That seems quite wonky and would likely require a deep dive similar to my mbuf adventure above. Perhaps another time...

But one other interesting thing to note here about datagrams in the PF_LOCAL domain is the way the data is then marked as available on the read end. On e.g., NetBSD, FreeBSD, and macOS, FIONREAD on the read file descriptor will return to us the total number of bytes ready to be read, while on Linux, it returns bytes on a per-datagram basis.
That is, on NetBSD, FreeBSD, and macOS, each FIONREAD request tells us how many bytes in total are in the buffer, including all datagram overhead, while on Linux, FIONREAD only tells us how much actual data is available for the next read(2). Since datagrams are ordered and distinct units, each read(2) on either OS will only return exactly the number of bytes contained in the datagram written, though.

socket(2) OS comparison

A socketpair(2) uses, well, sockets. But what if we set up a socket(2) for IPC instead? We can then compare buffer sizes for PF_LOCAL, PF_INET, and PF_INET6, in SOCK_DGRAM and SOCK_STREAM each. (We can ignore datagrams in the INET and INET6 domains, since those are unreliable, as noted above.) And what we get then again varies from one OS to the next.
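For reference, here's a minimal sketch of how such a PF_LOCAL datagram socket "pair" might be set up outside of socketpair(2) -- note that this is not the actual ipcbuf code, and the socket path is just a made-up example:

#include <sys/socket.h>
#include <sys/un.h>

#include <err.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main() {
	struct sockaddr_un addr;
	int reader, writer;

	if (((reader = socket(PF_LOCAL, SOCK_DGRAM, 0)) < 0) ||
	    ((writer = socket(PF_LOCAL, SOCK_DGRAM, 0)) < 0)) {
		err(EXIT_FAILURE, "socket");
		/* NOTREACHED */
	}

	(void)memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_LOCAL;
	(void)strncpy(addr.sun_path, "/tmp/ipcbuf.sock", sizeof(addr.sun_path) - 1);
	(void)unlink(addr.sun_path);

	/* Unlike a socketpair, the reading socket needs an address in the
	 * filesystem to receive datagrams on... */
	if (bind(reader, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err(EXIT_FAILURE, "bind");
		/* NOTREACHED */
	}

	/* ...and the writing socket connects to that address, after which
	 * we can write(2) to it just like before. */
	if (connect(writer, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err(EXIT_FAILURE, "connect");
		/* NOTREACHED */
	}

	if (write(writer, "x", 1) != 1) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}

	(void)close(writer);
	(void)close(reader);
	(void)unlink(addr.sun_path);
	return EXIT_SUCCESS;
}

A SOCK_STREAM variant would additionally need a listen(2) and accept(2) on the reading side before any data can flow.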
Enough already!

Alright, let's wrap it up. As you can tell, there is significant variance across the different forms of IPC and the different OS variants, and to understand the true limits, you basically need to dive deep into the network buffers and OS-specific handling, influenced by a number of possibly system-specific variables or settings, including default maximum values, hard-coded kernel limits, run-time configurable settings such as sysctl(8)s, and so on.

If I still didn't get everything right in the above (which wouldn't entirely surprise me), please drop me a note; if you want to experiment with the different buffers on your system, feel free to grab the ipcbuf tool I ended up writing for this analysis and send me a pull request for fixes.

For now... EAGAIN.