
A reworked TCP zero-copy receive API

By Jonathan Corbet
May 18, 2018
In April, LWN looked at the new API for zero-copy reception of TCP data that had been merged into the net-next tree for the 4.18 development cycle. After that article was written, a couple of issues came to the fore that required some changes to the API for this feature. Those changes have been made and merged; read on for the details.

This API is intended to make it possible to read data from a TCP connection without the need to copy the data between the kernel and user space. The original version was based entirely on the mmap() system call; once a socket had been marked for zero-copy reception, an mmap() call would create a mapping containing the available data — in some circumstances, anyway. The application would use munmap() to release that data once processing was complete; see the article linked above for the details.

Two problems with this interface came to light after the feature had been merged. One was that this use of mmap() was somewhat strange; an mmap() call does not normally have side effects like consuming data from a socket. The author of this patch (Eric Dumazet) was comfortable with that aspect of the interface, but he had a harder time dealing with the locking problems that came with it. Calling network-layer operations from within mmap() inverts the normal locking order around mmap_sem; there was no easy way to fix that without separating the networking operations from the mmap() code.

So, in the version that (barring more surprises) will be merged for 4.18, the call to mmap() just sets up a range of address space into which data from the network can appear via zero-copy magic. Actually getting some data into that range requires a getsockopt() call with the TCP_ZEROCOPY_RECEIVE operation. This structure is passed into that call:

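    struct tcp_zerocopy_receive {
        __u64 address;          /* in: address of the mapping */
        __u32 length;           /* in/out: bytes to map / bytes mapped */
        __u32 recv_skip_hint;   /* out: bytes to consume with recv() */
    };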

On entry to getsockopt(), the address field contains the address of the special mapping created with mmap(), and length is the number of bytes of data to be put into that mapping. As before, address must be page-aligned (which will happen naturally since it must also be the address returned from the mmap() call), and length must be a multiple of the page size. On return, length will be set to the number of bytes actually mapped into that range. The data will remain mapped until either the range is unmapped with munmap() or another getsockopt() call replaces the data.
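As a rough illustration of the call described above, here is a minimal sketch in C. The helper names `zc_round_down()` and `zc_map()` are hypothetical and not part of any kernel or libc API; the sketch assumes a Linux system whose `<linux/tcp.h>` provides `struct tcp_zerocopy_receive` and `TCP_ZEROCOPY_RECEIVE`, and that `addr` is the page-aligned address returned by the earlier mmap() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <linux/tcp.h>    /* struct tcp_zerocopy_receive, TCP_ZEROCOPY_RECEIVE */

/* Hypothetical helper: round len down to a multiple of the page size,
 * since the length passed to the kernel must be page-granular. */
static size_t zc_round_down(size_t len, size_t page_size)
{
    /* page_size is assumed to be a nonzero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return len & ~(page_size - 1);
}

/*
 * Hypothetical helper: ask the kernel to map up to len bytes of received
 * data at addr. Returns the number of bytes actually mapped (possibly 0),
 * or -1 with errno set if getsockopt() fails.
 */
static ssize_t zc_map(int fd, void *addr, size_t len)
{
    struct tcp_zerocopy_receive zc;
    socklen_t zc_len = sizeof(zc);

    memset(&zc, 0, sizeof(zc));
    zc.address = (uint64_t)(uintptr_t)addr;  /* in: start of the mapping */
    zc.length = (uint32_t)len;               /* in: bytes requested */

    if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) < 0)
        return -1;

    return (ssize_t)zc.length;               /* out: bytes mapped */
}
```

A caller would typically pass something like `zc_round_down(window_size, sysconf(_SC_PAGESIZE))` as the length and then read the returned number of bytes directly from the mapping; those bytes stay valid only until the next `zc_map()` call or munmap().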

In the old interface, the mmap() call would fail if the available data did not fill full pages or if there was pending urgent data. The new getsockopt() call will fail in the same way in those circumstances, but with a difference: the recv_skip_hint field of the tcp_zerocopy_receive structure will be set to the amount of data the application must consume with recv() before returning to the zero-copy mode. That should make it easier for applications to recover when things don't go as planned.
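The recovery path might look something like the sketch below. The helper `consume_skip_hint()` is a hypothetical name, not a kernel or libc function; it assumes the application has already read `recv_skip_hint` from the structure after the getsockopt() call and simply drains that many bytes with ordinary recv() calls before trying zero-copy again. Exactly when the hint is filled in (on failure, or alongside a zero-length success) should be checked against the kernel in use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: consume `hint` bytes from the socket with recv(),
 * copying them into buf in chunks of at most buf_size bytes.
 * Returns the number of bytes consumed (less than hint only if the peer
 * closed the connection), or -1 with errno set on error.
 */
static ssize_t consume_skip_hint(int fd, uint32_t hint,
                                 void *buf, size_t buf_size)
{
    size_t done = 0;

    assert(buf != NULL && buf_size > 0);

    while (done < hint) {
        size_t want = hint - done;
        if (want > buf_size)
            want = buf_size;

        ssize_t n = recv(fd, buf, want, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)             /* peer closed the connection */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A real application would process each chunk rather than overwrite it; the sketch only shows how the hint bounds the amount of data taken through the copying path.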

The new interface should also perform better, especially in multi-threaded applications, because it is no longer necessary to call mmap() for each new batch of data. The implementation can also avoid making some higher-order allocations that were necessary with the old API.

The end result is an interface that is less surprising, easier to use, and perhaps even faster for some use cases. The whole episode is a clear demonstration of the benefits of wider review of new features, especially those that have user-space API components. In this case, a number of the ideas behind the new implementation came from Andy Lutomirski, who seemingly only became aware of the changes once they were discussed beyond the netdev mailing list. Having many eyes on the code really does make it better in the end.

Index entries for this article
KernelNetworking/Performance



A reworked TCP zero-copy receive API

Posted May 18, 2018 16:01 UTC (Fri) by zuzzurro (subscriber, #61118) [Link] (6 responses)

What seems frankly amazing to me is that such a change (even the latest one) could be implemented and written in stone in such a short period of time without a larger audience looking at it. I understand that too large an audience could just become an exercise in bike shedding, but still. If we had a mechanism to implement test changes not subject to the ABI stability requirement, experimenting would be a bit easier; but since no such mechanism exists, I would suggest proceeding more carefully.
In particular, what seems strange is the fact that (and I may be wrong on this) mmap has been known to be quite slow (isn't this the reason why using mmap for cp never worked better than read/write?), and therefore the initial mechanism should have been spotted as suboptimal immediately....

A reworked TCP zero-copy receive API

Posted May 18, 2018 22:21 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (5 responses)

Since `mmap()` is used to open shared libraries, I certainly would hope that it is fast to set up the page mapping.

```
strace cp foo bar |& grep mmap | wc -l
23
```

A reworked TCP zero-copy receive API

Posted May 19, 2018 11:48 UTC (Sat) by grawity (subscriber, #80596) [Link] (2 responses)

I'd say the performance requirements for mmapping a few dozen libraries once per exec don't come anywhere close to mmapping a million packets per second...

A reworked TCP zero-copy receive API

Posted May 19, 2018 13:50 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (1 responses)

The reworked API only has one mmap() call to set up the window. I'm not really grasping what the concern on this thread is.

A reworked TCP zero-copy receive API

Posted May 19, 2018 21:43 UTC (Sat) by eternaleye (guest, #67051) [Link]

You're missing that this thread isn't about a _design_ concern with the _new_ proposal, it's a _process_ concern regarding the _old_ proposal (which did call mmap() once per receive). Namely, "How did that proposal get so far along before someone noticed that issue?"

A reworked TCP zero-copy receive API

Posted May 20, 2018 15:00 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

mmap is not slow. munmap (specifically tearing down the TLBs on every CPU) is slow. You must not leave a TLB entry available to userspace when the kernel has reused the page for something else. There are various optimisations we've implemented and a few we haven't yet to speed this up, but fundamentally it's slow because you need to tell every CPU that this process has ever executed on that the mapping is no longer valid. So it gets slower on larger machines.

A reworked TCP zero-copy receive API

Posted May 20, 2018 18:03 UTC (Sun) by zuzzurro (subscriber, #61118) [Link]

This is what I'm referring to when I talk about mmap being slow:

http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html

But my question is really about the process, not about this particular instance.

Why getsockopt?

Posted May 24, 2018 22:03 UTC (Thu) by jklowden (guest, #107637) [Link] (1 responses)

getsockopt is an arcane function, and a strange choice for this purpose. If names mean anything, it's meant to receive socket metadata, not data.

The logical function to extend, of course, is read(2). If the buffer address came from mmap, the kernel can operate in zero-copy mode. Only if the user refers to the mapped memory must the data be made visible in userspace.

Why getsockopt?

Posted May 25, 2018 4:54 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

That doesn't work. The data doesn't just arrive while a recv() is ongoing, no? So either the data needs to be buffered somewhere, or you need state that lasts longer than a recv().


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds