A reworked TCP zero-copy receive API
Please consider subscribing to LWNIn April, LWN looked at the new API for zero-copy reception of TCP data that had been merged into the net-next tree for the 4.18 development cycle. After that article was written, a couple of issues came to the fore that required some changes to the API for this feature. Those changes have been made and merged; read on for the details.Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.
This API is intended to make it possible to read data from a TCP connection without the need to copy the data between the kernel and user space. The original version was based entirely on the mmap() system call; once a socket had been marked for zero-copy reception, an mmap() call would create a mapping containing the available data — in some circumstances, anyway. The application would use munmap() to release that data once processing was complete; see the article linked above for the details.
Two problems with this interface came to light after the feature had been merged. One was that this use of mmap() was somewhat strange; an mmap() call does not normally have side effects like consuming data from a socket. The author of this patch (Eric Dumazet) was comfortable with that aspect of the interface, but he had a harder time dealing with the locking problems that came with it. Calling network-layer operations from within mmap() inverts the normal locking order around mmap_sem; there was no easy way to fix that without separating the networking operations from the mmap() code.
So, in the version that (barring more surprises) will be merged for 4.18, the call to mmap() just sets up a range of address space into which data from the network can appear via zero-copy magic. Actually getting some data into that range requires a getsockopt() call with the TCP_ZEROCOPY_RECEIVE operation. This structure is passed into that call:
struct tcp_zerocopy_receive { __u64 address; __u32 length; __u32 recv_skip_hint; };
On entry to getsockopt(), the address field contains the address of the special mapping created with mmap(), and length is the number of bytes of data to be put into that mapping. As before, address must be page-aligned (which will happen naturally since it must also be the address returned from the mmap() call), and length must be a multiple of the page size. On return, length will be set to the number of bytes actually mapped into that range. The data will remain mapped until either the range is unmapped with munmap() or another getsockopt() call replaces the data.
In the old interface, the mmap() call would fail if the available data did not fill full pages or if there is pending urgent data. The new getsockopt() call will fail in the same way in those circumstances, but with a difference: the recv_skip_hint field of the tcp_zerocopy_receive structure will be set to the amount of data the application must consume with recv() before returning to the zero-copy mode. That should make it easier for applications to recover when things don't go as planned.
The new interface should also perform better, especially in multi-threaded applications, because it is no longer necessary to call mmap() for each new batch of data. The implementation can also avoid making some higher-order allocations that were necessary with the old API.
The end result is an interface that is less surprising, easier to use, and
perhaps even faster for some use cases. The whole episode is a clear
demonstration of the benefits of wider review of new features, especially
those that have user-space API components. In this case, a number of the
ideas behind the new implementation came
from Andy Lutomirski, who
seemingly only became aware of the changes once they were discussed beyond
the netdev mailing list. Having many eyes on the code really does make it
better in the end.
Index entries for this article | |
---|---|
Kernel | Networking/Performance |
Posted May 18, 2018 16:01 UTC (Fri)
by zuzzurro (subscriber, #61118)
[Link] (6 responses)
Posted May 18, 2018 22:21 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (5 responses)
```
Posted May 19, 2018 11:48 UTC (Sat)
by grawity (subscriber, #80596)
[Link] (2 responses)
Posted May 19, 2018 13:50 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted May 19, 2018 21:43 UTC (Sat)
by eternaleye (guest, #67051)
[Link]
Posted May 20, 2018 15:00 UTC (Sun)
by willy (subscriber, #9762)
[Link] (1 responses)
Posted May 20, 2018 18:03 UTC (Sun)
by zuzzurro (subscriber, #61118)
[Link]
http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html
But my question is really about the process, not about this particular instance.
Posted May 24, 2018 22:03 UTC (Thu)
by jklowden (guest, #107637)
[Link] (1 responses)
The logical function to extend, of course, is read(2). If the buffer address came from mmap, the kernel can operate in zero-copy mode. Only if the user refers to the mapped memory must the data be made visible in userspace.
Posted May 25, 2018 4:54 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link]
Posted Aug 13, 2018 9:59 UTC (Mon)
by dbkm11 (subscriber, #125598)
[Link]
A reworked TCP zero-copy receive API
In particular what seems strange is the fact that (and I may be wrong on this) mmap has been knows to be quite slow (isn't this the reason why using mmap for cp never worked better than read/write?), and therefore the initial mechanism should have been spotted as suboptimal immediately....
A reworked TCP zero-copy receive API
strace cp foo bar |& grep mmap | wc -l
23
```
I'd say the performance requirements for mmapping a few dozen libraries once per exec don't come anywhere close to mmapping a million packets per second...
A reworked TCP zero-copy receive API
A reworked TCP zero-copy receive API
A reworked TCP zero-copy receive API
A reworked TCP zero-copy receive API
A reworked TCP zero-copy receive API
Why getsockopt?
Why getsockopt?
A reworked TCP zero-copy receive API