
IPC Buffer Sizes

November 15, 2021

Aah, the Unix pipe. Friend (and occasional foe) of any Unix user, cornerstone of the Unix Philosophy, and universal building block of many a SysAdmin's tool chest. We all use it day in and day out without thinking much about how it works. And honestly, it's not all that difficult to understand:

[Illustration: creation of a Unix pipe via pipe(2) and fork(2), and the resulting file handles in parent and child.]

You create a pipe by calling, well, the pipe(2) system call, which provides you with two file handles: the read end and the write end of the pipe. To communicate with another process, you then fork(2), close the read end in one process and the write end in the other, et voilà, you can now communicate through the pipe.
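
In code, that sequence might look like the following minimal sketch (error handling in the style of the examples below; the child writes a short message that the parent then reads):

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main() {
	char buf[BUFSIZ];
	int fd[2];
	ssize_t n;

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	switch (fork()) {
	case -1:
		err(EXIT_FAILURE, "fork");
		/* NOTREACHED */
	case 0:
		/* Child: close the read end, write into the pipe. */
		(void)close(fd[0]);
		(void)write(fd[1], "hello", 5);
		exit(EXIT_SUCCESS);
		/* NOTREACHED */
	default:
		/* Parent: close the write end, read from the pipe. */
		(void)close(fd[1]);
		if ((n = read(fd[0], buf, sizeof(buf))) > 0) {
			(void)printf("%.*s\n", (int)n, buf);
		}
	}
	return EXIT_SUCCESS;
}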

As that graphic illustrates, interprocess communication (IPC) through the pipe happens in kernel space, and obviously data written into the pipe is not stored in the filesystem. Pipe behavior is reasonably straightforward: reading from a pipe where the write end has been closed yields EOF; writing into a pipe where the read end has been closed yields SIGPIPE:

$ cat sigpipe.c
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void
sigpipe(int s) {
	fprintf(stderr, "Caught signal #%d: SIGPIPE!\n", s);
}

int
main() {
	int fd[2];

	if (signal(SIGPIPE, sigpipe) == SIG_ERR) {
		err(EXIT_FAILURE, "signal");
		/* NOTREACHED */
	}

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	/* Closing the read-end will trigger SIGPIPE
	 * when we then try to write. */
	(void)close(fd[0]);
	if (write(fd[1], "x", 1) < 0) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}
	(void)printf("We never get here.\n");
}
$ cc -Wall -Werror -Wextra sigpipe.c
$ ./a.out
Caught signal #13: SIGPIPE!
a.out: write: Broken pipe
$ 

But of course we might also encounter the situation where a process holds the read end open, but just doesn't consume the data (or doesn't consume it as quickly as we are writing it):

$ cat sigpipe.c | ( sleep 10; wc; )
      34      83     600
$ 

Now for 10 seconds, nothing was reading the data. So... where did it go? Well, we know that pipe I/O is blocking (by default, anyway), meaning a read(2) call on a pipe will not return until there is data available. But what about writes? In the above example, was data only written once wc began reading? It doesn't look like that's the case:

$ ( echo "message"; echo "message written" >&2; ) | cat
message written
message
$ ( echo "message"; echo "message written" >&2; ) | ( sleep 3; cat; )
message written
[3 seconds later:]
message
$ 

That is, we see that "message" was written to the pipe even though nothing was currently reading. So here, just like in the above "cat sigpipe.c" example, the data was held somewhere in between. In other words, the kernel has a buffer for the pipe, allowing the write to complete even though no process is reading from the pipe.

So... how big is this pipe buffer? It looks like we can write quite a lot of data:

$ dd if=/dev/zero bs=1024k count=10240 2>/dev/null | (sleep 10; wc -c; )
 10737418240
$ 

That's ten gigabytes of data! But of course the kernel will not buffer ten GB for you. Rather, writes to the pipe block by default, but they do not block waiting for a reader to consume the data; they block based on the internal buffer size of the pipe: initial writes succeed immediately, up until the pipe buffer is filled; subsequent writes block until the reader has consumed data and thus freed up space in the pipe buffer.

Ok, so far, so uneventful - that's how blocking I/O works. But how big is this internal pipe buffer? Well, the answer won't surprise you: it depends. :-)

POSIX specifies the behavior of write(2) with regards to a pipe or FIFO as follows:

Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe.

PIPE_BUF (either defined or retrievable via e.g., fpathconf(2)) is only relevant to how many bytes can be written atomically (i.e., without interleaving with other writers); it says nothing about the total pipe buffer size. But we can find out what that total is by experimentation: we simply continue to write data into a pipe without a reader until the write call would block:

$ cat pipebuf.c
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main() {
	int count = 0;
	int fd[2];

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	if (fcntl(fd[1], F_SETFL, O_NONBLOCK) < 0) {
		err(EXIT_FAILURE, "fcntl set flags");
		/* NOTREACHED */
	}

	while (1) {
		if (write(fd[1], "x", 1) < 0) {
			if (errno == EAGAIN) {
				break;
			}
			err(EXIT_FAILURE, "write");
			/* NOTREACHED */
		}
		count++;
	}
	(void)printf("%d\n", count);
}
$ cc -Wall -Werror -Wextra pipebuf.c
$ ./a.out
16384
$ 

Ok, so on this particular system (NetBSD/amd64 9.2), the internal pipe buffer size is 16K. It's the same on OmniOS 5.11, while on e.g., macOS 11.6.1, Linux 5.8.15 / Fedora 33, and FreeBSD 13.0 it's 64K.

NetBSD allows you to retrieve the free space in a file descriptor's send queue via an ioctl(2) query for FIONSPACE, together with the number of bytes immediately available for reading (FIONREAD) and the number of bytes in the write queue (FIONWRITE), so we can query that as well. (FreeBSD appears to only support these for sockets and socketpairs.)
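
A minimal sketch of querying all three on a fresh pipe might look like so (FIONSPACE and FIONWRITE being the NetBSD-specific ones; FIONREAD is widely available):

#include <sys/ioctl.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main() {
	int fd[2];
	int space, nread, nwrite;

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	if (write(fd[1], "hello", 5) != 5) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}

	/* Free space in the send queue (NetBSD). */
	if (ioctl(fd[1], FIONSPACE, &space) < 0) {
		err(EXIT_FAILURE, "ioctl FIONSPACE");
		/* NOTREACHED */
	}
	/* Bytes in the write queue (NetBSD). */
	if (ioctl(fd[1], FIONWRITE, &nwrite) < 0) {
		err(EXIT_FAILURE, "ioctl FIONWRITE");
		/* NOTREACHED */
	}
	/* Bytes immediately available for reading. */
	if (ioctl(fd[0], FIONREAD, &nread) < 0) {
		err(EXIT_FAILURE, "ioctl FIONREAD");
		/* NOTREACHED */
	}
	(void)printf("FIONSPACE: %d\nFIONWRITE: %d\nFIONREAD: %d\n",
	    space, nwrite, nread);
	return EXIT_SUCCESS;
}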

And trying to write data in chunks rather than one byte at a time, we also discover that NetBSD lets us write more than 16K:

$ cat pipebuf2.c
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv) {
	int fd[2];
	int count = 0;
	int size;
	char *buf;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s num\n", argv[0]);
		exit(EXIT_FAILURE);
		/* NOTREACHED */
	}

	if ((size = (int)strtol(argv[1], NULL, 10)) < 1) {
		(void)fprintf(stderr, "Please provide a number (>0).\n");
		exit(EXIT_FAILURE);
		/* NOTREACHED */
	}

	if ((buf = malloc(size)) == NULL) {
		err(EXIT_FAILURE, "malloc");
	}
	memset(buf, 'x', size);

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	if (fcntl(fd[1], F_SETFL, O_NONBLOCK) < 0) {
		err(EXIT_FAILURE, "fcntl set flags");
		/* NOTREACHED */
	}

	if ((count = write(fd[1], buf, size)) < 0) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}
	(void)printf("%d\n", count);
}
$ cc -Wall -Werror -Wextra pipebuf2.c
$ ./a.out 16384
16384
$ ./a.out 16385
16385
$ ./a.out 102400
65536
$ 

That's right: when we write a large chunk of data, NetBSD automatically bumps the pipe buffer size up to 64K - a so-called "big pipe". And Linux lets you increase the size of the pipe buffer via the F_SETPIPE_SZ fcntl(2) (as a multiple of the page size, and up to the maximum defined in /proc/sys/fs/pipe-max-size for unprivileged users).
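
A minimal sketch of resizing a pipe buffer on Linux might look like so (F_GETPIPE_SZ and F_SETPIPE_SZ require _GNU_SOURCE):

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main() {
	int fd[2];
	int size;

	if (pipe(fd) < 0) {
		err(EXIT_FAILURE, "pipe");
		/* NOTREACHED */
	}

	if ((size = fcntl(fd[1], F_GETPIPE_SZ)) < 0) {
		err(EXIT_FAILURE, "F_GETPIPE_SZ");
		/* NOTREACHED */
	}
	(void)printf("default pipe buffer: %d\n", size);

	/* Ask for 1MB; the kernel may round the request up. */
	if (fcntl(fd[1], F_SETPIPE_SZ, 1024 * 1024) < 0) {
		err(EXIT_FAILURE, "F_SETPIPE_SZ");
		/* NOTREACHED */
	}

	if ((size = fcntl(fd[1], F_GETPIPE_SZ)) < 0) {
		err(EXIT_FAILURE, "F_GETPIPE_SZ");
		/* NOTREACHED */
	}
	(void)printf("resized pipe buffer: %d\n", size);
	return EXIT_SUCCESS;
}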

With these differences already observed, let's compare the behavior of the different IPC buffers across the different operating systems, since what holds for pipes should conceptually be similar for the other forms of on-system IPC, such as FIFOs, sockets in the PF_LOCAL (or AF_UNIX) domain, and socketpairs. To do that, I wrote a program that helps us test the different conditions: ipcbuf.

The tool has two modes: looping and chunking. In looping mode, we repeatedly write chunks of data, optionally growing the chunk size on each iteration; in chunking mode, we write a given sequence of chunk sizes. Sample invocations using looping mode on NetBSD 9.2:

$ ipcbuf
Testing pipe buffer size in loop mode.
Loop starting with 1 byte and doubling each iteration.

PIPE_BUF       :      512
_PC_PIPE_BUF   :      512
FIONSPACE      :    16384

Wrote        1 out of        1 byte.  (Total: 1)
Wrote        2 out of        2 bytes. (Total: 3)
Wrote        4 out of        4 bytes. (Total: 7)
Wrote        8 out of        8 bytes. (Total: 15)
Wrote       16 out of       16 bytes. (Total: 31)
Wrote       32 out of       32 bytes. (Total: 63)
Wrote       64 out of       64 bytes. (Total: 127)
Wrote      128 out of      128 bytes. (Total: 255)
Wrote      256 out of      256 bytes. (Total: 511)
Wrote      512 out of      512 bytes. (Total: 1023)
Wrote     1024 out of     1024 bytes. (Total: 2047)
Wrote     2048 out of     2048 bytes. (Total: 4095)
Wrote     4096 out of     4096 bytes. (Total: 8191)
Wrote     8192 out of     8192 bytes. (Total: 16383)
Wrote        1 out of    16384 bytes. (Total: 16384)
FIONWRITE      :    16384
Observed total :    16384

Draining...
FIONREAD       :    16384
FIONREAD       :     8192
FIONREAD       :        0
Read           :    16384
$ ipcbuf 16385
Testing pipe buffer size in loop mode.
Loop starting with 16385 bytes and doubling each iteration.

PIPE_BUF       :      512
_PC_PIPE_BUF   :      512
FIONSPACE      :    16384

Wrote    16385 out of    16385 bytes. (Total: 16385)
Wrote    32770 out of    32770 bytes. (Total: 49155)
Wrote    16381 out of    65540 bytes. (Total: 65536)
FIONWRITE      :    65536
Observed total :    65536

Draining...
FIONREAD       :    65536
FIONREAD       :    32766
FIONREAD       :        0
Read           :    65536

Since we've already observed that using different chunk sizes may yield different results, let's compare the buffer size used by pipes and FIFOs in a single-byte loop (ipcbuf -l 1 0) versus the maximum amount we can write in chunks (ipcbuf -c chunk1 chunk2):

OS →                   NetBSD 9.2   FreeBSD 13.0   macOS 11.6.1   Linux 5.8.15 /   OmniOS 5.11
IPC ↓                                                             Fedora 33
pipe(2) 1-byte loop    16384        65536          65536          65536            16384
pipe(2) max            65536 [1]    65536          65536          1048575 [2]      21503 [3]
mkfifo(2) 1-byte loop  8191         65536          65536          65536            16384
mkfifo(2) max          8191         65536          65536          65536            21503

[1] NetBSD "big pipe" if the initial write was larger than 16K bytes
[2] via explicit F_SETPIPE_SZ
[3] if you write 16K - 1 bytes, a subsequent write of an additional PIPE_BUF bytes succeeds

socketpair(2) and socket(2)

Now what about socketpair(2) and socket(2)? For those, we can check SO_SNDBUF, but is that accurate? And is there a difference between DGRAM and STREAM sockets?
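
Checking is easy enough -- a minimal sketch querying both buffer sizes on a PF_LOCAL datagram socketpair:

#include <sys/socket.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main() {
	int sv[2];
	int sndbuf, rcvbuf;
	socklen_t len;

	if (socketpair(PF_LOCAL, SOCK_DGRAM, 0, sv) < 0) {
		err(EXIT_FAILURE, "socketpair");
		/* NOTREACHED */
	}

	len = sizeof(sndbuf);
	if (getsockopt(sv[1], SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0) {
		err(EXIT_FAILURE, "getsockopt SO_SNDBUF");
		/* NOTREACHED */
	}

	len = sizeof(rcvbuf);
	if (getsockopt(sv[0], SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) < 0) {
		err(EXIT_FAILURE, "getsockopt SO_RCVBUF");
		/* NOTREACHED */
	}

	(void)printf("SO_SNDBUF: %d\nSO_RCVBUF: %d\n", sndbuf, rcvbuf);
	return EXIT_SUCCESS;
}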

We know that datagram connections in the PF_INET or PF_INET6 domain are completely unbuffered, as UDP is a connectionless, unreliable protocol, and so packets need not be consumed by a reader and can silently be dropped. But datagrams in the PF_LOCAL domain are reliable and thus still buffered in the kernel!

And this is where things get interesting...

$ ipcbuf -R 2048 -t socketpair -l 1 0 
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 1 byte, increasing by 0 bytes each time.

recvspace      :    16384
SO_RCVBUF      :     2048
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote        1 out of        1 byte.  (Total: 1)
Wrote        1 out of        1 byte.  (Total: 2)
Wrote        1 out of        1 byte.  (Total: 3)
Wrote        1 out of        1 byte.  (Total: 4)
Unable to write even a single byte: No buffer space available
Iterations     :        4
FIONWRITE      :        0
Observed total :        4

Draining...
FIONREAD       :       12
FIONREAD       :        9
FIONREAD       :        6
FIONREAD       :        3
FIONREAD       :        0
Read           :        4
$ 

Uhm... wait. What? We only get to write 4 bytes? Why does the receiving socket claim there are 12 bytes? And what happens if we try to write more bytes?

$ ipcbuf -R 2048 -t socketpair -l 400 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 400 bytes, increasing by 0 bytes each time.

recvspace      :    16384
SO_RCVBUF      :     2048
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote      400 out of      400 bytes. (Total: 400)
Wrote      400 out of      400 bytes. (Total: 800)
Wrote      400 out of      400 bytes. (Total: 1200)
Wrote      400 out of      400 bytes. (Total: 1600)
Unable to write even a single byte: No buffer space available
Iterations     :        4
FIONWRITE      :        0
Observed total :     1600

Draining...
FIONREAD       :     1608
FIONREAD       :     1206
FIONREAD       :      804
FIONREAD       :      402
FIONREAD       :        0
Read           :     1600

Looks like we can only ever perform 4 writes in sequence. Or...?

$ ipcbuf -R 2048 -t socketpair -l 856 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 856 bytes, increasing by 0 bytes each time.

recvspace      :    16384
SO_RCVBUF      :     2048
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote      856 out of      856 bytes. (Total: 856)
Wrote      856 out of      856 bytes. (Total: 1712)
Wrote      330 out of      856 bytes. (Total: 2042)
Unable to write even a single byte: No buffer space available
Iterations     :        3
FIONWRITE      :        0
Observed total :     2042

Draining...
FIONREAD       :     2048
FIONREAD       :     1190
FIONREAD       :      332
FIONREAD       :        0
Read           :     2042
$ 

Huh, that's 3 writes, but without even filling each dgram! Let's try slightly larger chunks:

$ ipcbuf -R 2048 -t socketpair -l 857 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 857 bytes, increasing by 0 bytes each time.

recvspace      :    16384
SO_RCVBUF      :     2048
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote      857 out of      857 bytes. (Total: 857)
Wrote      857 out of      857 bytes. (Total: 1714)
Unable to write even a single byte: No buffer space available
Iterations     :        2
FIONWRITE      :        0
Observed total :     1714

Draining...
FIONREAD       :     1718
FIONREAD       :      859
FIONREAD       :        0
Read           :     1714
$ 

That's less data than in the previous invocation! What happens if we max out SO_SNDBUF right away?

$ ipcbuf -R 2048 -t socketpair -l 2560 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 2560 bytes, increasing by 0 bytes each time.

recvspace      :    16384
SO_RCVBUF      :     2048
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote     2046 out of     2560 bytes. (Total: 2046)
Iterations     :        1
FIONWRITE      :        0
Observed total :     2046

Draining...
FIONREAD       :     2048
FIONREAD       :     2048
FIONREAD       :        0
Read           :     2046
$ 

Ok, so there's a lot going on here, and honestly, it took me some time to wrap my head around these limitations. Since we're using sockets and datagrams, we now have to look at how the OS handles network traffic, and in BSD-based systems, that involves memory buffers, or mbufs. The use of mbufs is best explained in Chapter 2 of TCP/IP Illustrated, Volume 2 by W. Richard Stevens, but let me summarize it in a nutshell for our context:

For any datagram, the system allocates a number of these mbufs, a fixed-size data structure of -- in the case of NetBSD/amd64, which I'll use as the reference for the remainder here -- 512 bytes. Within this fixed size, however, there are different types of mbufs that may hold different amounts of data. Multiple mbufs are linked into an mbuf chain; the first mbuf in the chain contains a struct pkthdr of 56 bytes, and each mbuf also contains a struct m_hdr of another 56 bytes. This leaves 512 - 56 - 56 = 400 bytes in the mbuf for data storage.

If we want to write more than 400 bytes, we allocate a second mbuf and chain it to the first. Since this is not the first mbuf in the chain, it doesn't need a pkthdr, and so can store 456 bytes.

Now if we want to write more data than these 400 + 456 = 856 bytes, then instead of chaining yet more data mbufs, we allocate a different type, an expanded mbuf, which references a separate area of memory for the data, a so-called mbuf cluster.
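
To make that arithmetic concrete, here's a back-of-the-envelope sketch; the constants are illustrative stand-ins for the kernel's MSIZE, the two header struct sizes, and the cluster size on NetBSD/amd64:

#include <stdio.h>

/* Illustrative NetBSD/amd64 numbers from the discussion above: */
#define MBUF_SIZE	512	/* MSIZE: total size of one mbuf        */
#define M_HDR_SIZE	56	/* struct m_hdr, present in every mbuf  */
#define PKTHDR_SIZE	56	/* struct pkthdr, first mbuf in a chain */
#define CLUSTER_SIZE	2048	/* one mbuf cluster                     */

int
main() {
	int first = MBUF_SIZE - M_HDR_SIZE - PKTHDR_SIZE;
	int next = MBUF_SIZE - M_HDR_SIZE;

	(void)printf("first mbuf in a chain: %d bytes\n", first);        /* 400 */
	(void)printf("subsequent mbufs:      %d bytes\n", next);         /* 456 */
	(void)printf("two-mbuf chain:        %d bytes\n", first + next); /* 856 */
	(void)printf("one mbuf cluster:      %d bytes\n", CLUSTER_SIZE);
	return 0;
}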

Simplified, these three types of mbufs might look like so:

              MT_HEADER              MT_DATA             MT_DATA M_EXT
           +-------------+       +-------------+        +-------------+
         / | mh_next     |       | mh_next     |        | mh_next     |
 56 bytes  | mh_data -------+    | mh_data -------+     | mh_data ---------> +------------+ 
         \ | mh_type     |  |    | mh_type     |  |     | mh_type     |      | mbuf       | 
           | ...         |  |    | ...         |  |     | ...         |      | cluster    | 
           +-------------+  |    +-------------+  |     +-------------+      |            | 
         / | pkthdr      |  |    | databuf     |<-+     |             |      | [2048      | 
 56 bytes  | ...         |  |    | ...         |        |             |      |    bytes]  | 
         \ |             |  |    | [456 bytes] |        |             |      |            | 
           +-------------+  |    |             |        |             |      +------------+ 
           | databuf     |<-+    |             |        |             |
         / | ...         |       |             |        |             |
400 bytes  |             |       |             |        |             |
         \ |             |       |             |        |             |
           |             |       |             |        |             |
           +-------------+       +-------------+        +-------------+

When we put together a datagram for our socket, we generate the appropriate mbuf chain and prepend an additional mbuf of type MT_SONAME. For our different sized writes, we then get:

  • two mbufs (one MT_SONAME, one MT_DATA) for data payloads between 1 and 400 bytes
   MT_SONAME                      MT_DATA
 +-------------+               +-------------+
 | mh_next ------------------->| mh_next     |
 | mh_data -------+            | mh_data -------+
 | mh_type     |  |            | mh_type     |  |
 | ...         |  |            | ...         |  |
 +-------------+  |            +-------------+  |
 | pkthdr      |  |            | pkthdr      |  |
 | ...         |  |            | now         |  |
 |             |  |            | unused      |  |
 +-------------+  |            +-------------+  |
 | sockname    |<-+            | xxxxxxx     |<-+
 |             |               | ...         |
 |             |               |             |
 |             |               |             |
 |             |               |             |
 +-------------+               +-------------+
  • three mbufs (one MT_SONAME, two MT_DATA) for data payloads between 401 and 856 bytes
   MT_SONAME                      MT_DATA                  MT_DATA
 +-------------+               +-------------+         +-------------+
 | mh_next ------------------->| mh_next ------------->| mh_next     |
 | mh_data -------+            | mh_data -------+      | mh_data -------+
 | mh_type     |  |            | mh_type     |  |      | mh_type     |  |
 | ...         |  |            | ...         |  |      | ...         |  |
 +-------------+  |            +-------------+  |      +-------------+  |
 | pkthdr      |  |            | pkthdr      |  |      | xxxxxxxxxxx |<-+
 | ...         |  |            | now         |  |      |             |
 |             |  |            | unused      |  |      |             |
 +-------------+  |            +-------------+  |      |             |
 | sockname    |<-+            | xxxxxxxxxxx |<-+      |             |
 |             |               | xxxxxxxxxxx |         |             |
 |             |               | xxxxxxxxxxx |         |             |
 |             |               | xxxxxxxxxxx |         |             |
 |             |               | xxxxxxxxxxx |         |             |
 +-------------+               +-------------+         +-------------+
  • two mbufs (one MT_SONAME, one MT_DATA with M_EXT) for data payloads between 857 and SO_SNDBUF bytes
   MT_SONAME                      MT_DATA
 +-------------+               +-------------+
 | mh_next ------------------->| mh_next     |              mbuf cluster
 | mh_data -------+            | mh_data --------------->+-------------+ 
 | mh_type     |  |            | mh_type     |           | xxxxxxxxxxx |
 | ...         |  |            | ...         |           | xxxxxxxxxxx |
 +-------------+  |            +-------------+           | xxxxxxxxxxx |
 | pkthdr      |  |            | pkthdr      |           | xxxxxxxxxxx |
 | ...         |  |            | now         |           | xxxxxxxxxxx |
 |             |  |            | unused      |           +-------------+
 +-------------+  |            +-------------+
 | sockname    |<-+            |             |
 |             |               |             |
 |             |               |             |
 |             |               |             |
 |             |               |             |
 +-------------+               +-------------+

Now the amount of data we can write to a socket is limited by the sbspace() function in sys/socketvar.h:

/*
 * How much space is there in a socket buffer (so->so_snd or so->so_rcv)?
 * Since the fields are unsigned, detect overflow and return 0.
 */
static __inline u_long
sbspace(const struct sockbuf *sb)
{
	KASSERT(solocked(sb->sb_so));
	if (sb->sb_hiwat <= sb->sb_cc || sb->sb_mbmax <= sb->sb_mbcnt)
		return 0;
	return lmin(sb->sb_hiwat - sb->sb_cc, sb->sb_mbmax - sb->sb_mbcnt);
}

sb_hiwat is SO_RCVBUF, while sb_mbmax is set in the sbreserve function in sys/kern/uipc_socket2.c to the minimum of 2 * SO_RCVBUF and SB_MAX (262144 on this system). In other words, the maximum amount we can write is limited either by the datagram bytes fitting into SO_RCVBUF, or by the total mbuf storage needed for the data staying below 2 * SO_RCVBUF.

With this in mind, we can then see why we get the different numbers we observed above, with SO_RCVBUF = 2048:

  • When writing 1 byte or 400 byte dgrams, we allocate 2 mbufs (one MT_SONAME and one MT_DATA), adding up to 1024 bytes sb_mbcnt on each write.

    4 writes means we wrote up to (4 * 400) + (4 * 2) = 1608 bytes, well below our SO_RCVBUF = 2048, so we should be able to write more data. But having used 4 * 1024 = 4096 mbuf bytes, we have reached sb_mbmax = 2 * SO_RCVBUF = 4096, so our next attempt to allocate another mbuf fails.
  • When writing 401 - 856 bytes, we allocate 3 mbufs, adding up to 1536 bytes sb_mbcnt on each write.

    After two writes, we wrote (2 * 856) + (2 * 2) = 1716 bytes; we should be able to write SO_SNDBUF - data written = 2048 - 1716 = 332 more bytes, so we can construct one more datagram of 330 bytes (+ 2 bytes dgram sockname).

    We have so far used sb_mbcnt = 6 * 512 = 3072; the 330 bytes fit into a single MT_DATA mbuf, so together with the MT_SONAME mbuf, we can indeed write these, adding 1024 to sb_mbcnt, bringing that to 4096 = sb_mbmax, which is why our next write fails.
  • When writing 857 bytes, we allocate 2 mbufs (one MT_SONAME + one mbufcluster), adding up to 1024 bytes sb_mbcnt on each write.

    After two writes, we wrote (2 * 857) + (2 * 2) = 1718 bytes, we should be able to squeeze another 330 bytes into our SO_SNDBUF, but having already used 2 * 1024 = 2048 bytes sb_mbcnt, we can't allocate another mbuf, and thus fail.
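
As a sanity check, here's a toy user-space model of the sbspace() limit using the per-write costs from the list above; note that it is a simplification in that it only models full-sized writes, so the partial 330-byte dgram in the 856-byte case is not captured:

#include <stdio.h>

#define SO_RCVBUF	2048
#define SB_MBMAX	(2 * SO_RCVBUF)	/* sbreserve(): min(2 * cc, SB_MAX) */

/* Model writing dgrams of 'payload' bytes, each costing 'mbytes' of
 * mbuf storage plus 2 bytes of sockname data, until sbspace() would
 * no longer accommodate a full write. */
static void
model(int payload, int mbytes) {
	int cc = 0, mbcnt = 0, writes = 0;

	while ((cc + payload + 2 <= SO_RCVBUF) &&
	    (mbcnt + mbytes <= SB_MBMAX)) {
		cc += payload + 2;
		mbcnt += mbytes;
		writes++;
	}
	(void)printf("%4d-byte dgrams: %d full writes (sb_cc=%d, sb_mbcnt=%d)\n",
	    payload, writes, cc, mbcnt);
}

int
main() {
	model(1, 2 * 512);	/* MT_SONAME + MT_DATA            */
	model(400, 2 * 512);	/* MT_SONAME + MT_DATA            */
	model(856, 3 * 512);	/* MT_SONAME + 2 MT_DATA          */
	model(857, 2 * 512);	/* MT_SONAME + MT_DATA with M_EXT */
	return 0;
}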

Now, then, what's the smallest amount of data we can write and still get blocked? Since we can write datagrams with 0 bytes payload, we can actually loop around and around and then be entirely limited by mbufs, but that's not meaningfully different from writing single byte writes.

But what is the largest amount of data we can write into a socketpair? If we don't perform small writes (i.e., 0 - MINCLSIZE bytes), then we're not bound by sb_mbcnt, but instead by SO_RCVBUF (on the total amount of data) and SO_SNDBUF (on the maximum size of the datagram). If SO_SNDBUF = SO_RCVBUF = chunk size, we can write SO_RCVBUF - 2 bytes in a single dgram; otherwise, we'll end up losing some more data capacity to overhead.

The same behavior as for socketpairs also holds for sockets in the PF_LOCAL domain, by the way, only there we observe a dgram overhead of 106 bytes:

$ ipcbuf -R 1024 -t socket -l 1 0
Testing PF_LOCAL DGRAM socket buffer size in loop mode.
Loop starting with 1 byte, increasing by 0 bytes each time.

recvspace      :    16384
SO_SNDBUF      :     2560
SO_RCVBUF      :     1024
FIONSPACE      :     2560

Wrote        1 out of        1 byte.  (Total: 1)
Wrote        1 out of        1 byte.  (Total: 2)
Unable to write even a single byte: No buffer space available
Iterations     :        2
FIONWRITE      :        0
Observed total :        2

Draining...
FIONREAD       :      214
FIONREAD       :      107
FIONREAD       :        0
Read           :        2
$ 

Finally, we can also use a socketpair(2) with SOCK_STREAM, which changes things once more just a little bit: a stream-oriented socket does not use datagrams, meaning the amount of data we can send is not limited by SO_RCVBUF, but it remains subject to the sb_mbcnt limitation described above.
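
If you want to observe this without the full ipcbuf tool, a minimal sketch in the spirit of pipebuf.c -- writing single bytes into a non-blocking SOCK_STREAM socketpair until the write fails -- might look like so:

#include <sys/socket.h>

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main() {
	int sv[2];
	int count = 0;

	if (socketpair(PF_LOCAL, SOCK_STREAM, 0, sv) < 0) {
		err(EXIT_FAILURE, "socketpair");
		/* NOTREACHED */
	}

	if (fcntl(sv[1], F_SETFL, O_NONBLOCK) < 0) {
		err(EXIT_FAILURE, "fcntl set flags");
		/* NOTREACHED */
	}

	/* Write single bytes until the buffer is exhausted. */
	while (write(sv[1], "x", 1) == 1) {
		count++;
	}
	if (errno != EAGAIN && errno != ENOBUFS) {
		err(EXIT_FAILURE, "write");
		/* NOTREACHED */
	}
	(void)printf("%d\n", count);
	return EXIT_SUCCESS;
}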

socketpair(2) OS comparison

Now comparing this behavior across our different OS versions, we observe:

OS →                   NetBSD 9.2      FreeBSD 13.0     macOS 11.6.1     Linux 5.8.15 /       OmniOS 5.11
IPC ↓                                                                    Fedora 33
SO_RCVBUF              16384           4096             4096             212992               5120
SO_SNDBUF              2560            2048             2048             212992               16384
socketpair(2) (DGRAM)
  overhead             2               16               16               32                   0
  1-byte loop          32              64               64               278                  582
  large chunks         SO_RCVBUF - 2   SO_RCVBUF - 16   SO_RCVBUF - 16   SO_SNDBUF - 32 [1]   131072

[1] or: up to 662912, depending on SO_SNDBUF and chunk size

The difference between NetBSD, FreeBSD, and macOS is negligible: the latter two appear to use 16 bytes for the socketpair sockname. Linux and OmniOS, on the other hand, are entirely different: Linux doesn't use mbufs, and instead uses skbuffs, and I don't even know what OmniOS / Solaris uses -- and please don't tell me it's STREAMS based or something like that.

On OmniOS, SO_SNDBUF and SO_RCVBUF appear not to matter, and I can write up to two chunks of 64K for a total of 131072 bytes; on Linux, any SO_SNDBUF value < 2304 gets bumped up (possibly to a page-size multiple plus some overhead?), while for any value larger than 2304 Linux will go "Well, surely you meant twice that number, so why don't I go ahead and double the number you gave me, how about it?" (see socket(7)). Writing data appears to ignore SO_RCVBUF:

linux$ ipcbuf -R 512 -S 16384 -t socketpair -l 16384 0
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 16384 bytes, increasing by 0 bytes each time.

max_dgram_qlen :      512
SO_RCVBUF      :     2304
SO_SNDBUF      :    32768

Wrote    16384 out of    16384 bytes. (Total: 16384)
Wrote    16384 out of    16384 bytes. (Total: 32768)
Unable to write 16384 more bytes: Resource temporarily unavailable
Iterations     :        2
SIOCOUTQ       :    41472
Observed total :    32768

Draining...
FIONREAD       :    16384
FIONREAD       :    16384
FIONREAD       :        0
Read           :    32768
linux$ 

Seems quite wonky and likely would require a deep dive similar to my mbuf adventure above. Perhaps another time...

But one other interesting thing to note here about datagrams in the PF_LOCAL domain is the way the data is then marked as available on the read end. On e.g., NetBSD, FreeBSD, and macOS, FIONREAD on the read file descriptor will return to us the total number of bytes ready to be read, while on Linux, it returns bytes on a per-datagram basis:

On NetBSD:
$ ipcbuf -R 4096 -t socketpair
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 1 byte and doubling each iteration.

recvspace      :    16384
SO_RCVBUF      :     4096
SO_SNDBUF      :     2560
FIONSPACE      :     2560

Wrote        1 out of        1 byte.  (Total: 1)
Wrote        2 out of        2 bytes. (Total: 3)
Wrote        4 out of        4 bytes. (Total: 7)
Wrote        8 out of        8 bytes. (Total: 15)
Wrote       16 out of       16 bytes. (Total: 31)
Wrote       32 out of       32 bytes. (Total: 63)
Wrote       64 out of       64 bytes. (Total: 127)
Wrote      128 out of      128 bytes. (Total: 255)
Iterations     :       13
FIONWRITE      :        0
Observed total :      255

Draining...
FIONREAD       :      271
FIONREAD       :      268
FIONREAD       :      264
FIONREAD       :      258
FIONREAD       :      248
FIONREAD       :      230
FIONREAD       :      196
FIONREAD       :      130
FIONREAD       :        0
Read           :      255
$ 

On Linux:

$ ipcbuf -S 4096 -t socketpair
Testing socketpair DGRAM buffer size in loop mode.
Loop starting with 1 byte and doubling each iteration.

max_dgram_qlen :      512
SO_RCVBUF      :   212992
SO_SNDBUF      :     8192

Wrote        1 out of        1 byte.  (Total: 1)
Wrote        2 out of        2 bytes. (Total: 3)
Wrote        4 out of        4 bytes. (Total: 7)
Wrote        8 out of        8 bytes. (Total: 15)
Wrote       16 out of       16 bytes. (Total: 31)
Wrote       32 out of       32 bytes. (Total: 63)
Wrote       64 out of       64 bytes. (Total: 127)
Wrote      128 out of      128 bytes. (Total: 255)
Wrote      256 out of      256 bytes. (Total: 511)
Wrote      512 out of      512 bytes. (Total: 1023)
Unable to write 1024 more bytes: Resource temporarily unavailable
Iterations     :       10
SIOCOUTQ       :     8704
Observed total :     1023

Draining...
FIONREAD       :        1
FIONREAD       :        2
FIONREAD       :        4
FIONREAD       :        8
FIONREAD       :       16
FIONREAD       :       32
FIONREAD       :       64
FIONREAD       :      128
FIONREAD       :      256
FIONREAD       :      512
FIONREAD       :        0
Read           :     1023
$ 

Note that on NetBSD, each FIONREAD request tells us how many bytes in total are in the buffer, including all datagram overhead, while on Linux, FIONREAD only tells us how much actual data is available for the next read(2). Since datagrams are ordered and distinct units, though, each read(2) on either OS will only return exactly the number of bytes contained in the datagram written.

socket(2) OS comparison

A socketpair(2) uses, well, sockets. But what if we set up a socket(2) for IPC instead? We can then compare buffer sizes for PF_LOCAL, PF_INET, and PF_INET6, in SOCK_DGRAM and SOCK_STREAM each. (We can ignore datagrams in the INET and INET6 domains, since those are unreliable, as noted above.) And what we get then looks like this:

OS →                    NetBSD 9.2             FreeBSD 13.0           macOS 11.6.1           Linux 5.8.15 /         OmniOS 5.11
IPC ↓                                                                                        Fedora 33

PF_LOCAL DGRAM
  SO_RCVBUF             16384                  4096                   4096                   212992                 5120
  SO_SNDBUF             2560                   2048                   2048                   212992                 16384
  DGRAM overhead        106                    106                    106                    32                     ?
  1-byte loop           32                     38                     38                     278                    132
  large chunks          SO_RCVBUF - 106        2 * (SO_SNDBUF - 106)  SO_RCVBUF - 106        SO_SNDBUF - 32         131072

PF_LOCAL STREAM
  SO_RCVBUF             8192                   8192                   8192                   212992                 5120
  SO_SNDBUF             8192                   8192                   8192                   212992                 16384
  1-byte loop           SO_SNDBUF              SO_SNDBUF              SO_SNDBUF              278                    21504
  large chunks          SO_SNDBUF              SO_SNDBUF              SO_SNDBUF              219264                 65536

PF_INET STREAM
  SO_RCVBUF             32768                  65536                  8192                   131072                 128000
  SO_SNDBUF             32768                  49032                  8192                   212992                 49152
  overhead              1024                   1024                   1024                   8192                   ?
  1-byte loop           49154                  129142                 8192                   varies [*]             130880
  large chunks          SO_SNDBUF + recvspace  SO_SNDBUF + recvspace  SO_SNDBUF              varies [*]             160720
                                               - some overhead

PF_INET6 STREAM
  SO_RCVBUF             32768                  131072                 131072                 131072                 128000
  SO_SNDBUF             32768                  48972                  146808                 212992                 49152
  overhead              1024                   1024                   1024                   8192                   ?
  1-byte loop           54082                  191878                 546854                 varies [*]             130880
  large chunks          SO_SNDBUF + recvspace  SO_SNDBUF + some       > SO_SNDBUF if         varies [*]             131072
                                               overhead               chunksize < SO_SNDBUF

[*] different results on each invocation, possibly depending on other TCP traffic on the system?

Enough already!

Alright, let's wrap it up. As you can tell, there is significant variance across the different forms of IPC and OS variants, and to understand the true limits, you basically need to dive deep into the network buffers and the OS-specific handling, influenced by a number of possibly system-specific variables or settings: default maximum values, hard-coded kernel limits, run-time configurable settings such as sysctl(8)s, and so on.

If I still didn't get everything right in the above (which wouldn't entirely surprise me), please drop me a note; if you want to experiment with the different buffers on your system, feel free to grab the ipcbuf tool I ended up writing for this analysis and send me a pull request for fixes.

For now... EAGAIN.

November 15, 2021

