By default, TCP sockets copy data at least once. On the sending side, the user expects to pass contiguous data in memory, but the operating system needs to prepend IP and TCP headers; on the receiving side, the kernel has to examine the packet headers before it can determine the receiving process. On top of that, the operating system can only manage whole virtual memory pages, which raises the question of how the kernel should pass messages smaller than a page to the user, or what to do if urgent data (also called out-of-band data) arrives. Moreover, hardware support is required. For example, TCP packets contain a checksum, and if the network interface controller (NIC) does not compute it automatically, then the kernel must scan the entire packet once in order to compute the checksum. Once the kernel has to read every byte anyway, it might as well copy the message payload.
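To illustrate why a software checksum forces a full pass over the data, here is a minimal sketch of the Internet checksum (the one's complement sum described in RFC 1071). The function name `internet_checksum` is made up for this example, and the sketch ignores the pseudo-header that a real TCP checksum also covers; it only shows that every byte of the buffer has to be read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (name chosen for this sketch): computes the 16-bit
 * one's complement checksum of RFC 1071 over `len` bytes. Note that the
 * loop touches every byte of the buffer exactly once. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);

    uint64_t sum = 0;
    size_t i = 0;

    /* Add up the data as big-endian 16-bit words. */
    for (; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }

    /* An odd trailing byte is padded with a zero byte on the right. */
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }

    /* Fold the carries back into the lower 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }

    return (uint16_t)~sum;
}
```

A NIC with checksum offloading performs the equivalent computation in hardware while the data is on its way to the wire, so the CPU never has to read the payload for this purpose.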
Zero-Copy Hardware Support
Modern NICs compute TCP checksums automatically (without using the CPU) and offer scatter/gather semantics, meaning network packets can be assembled from separate data sources or saved into disjoint locations, respectively. With these hardware features, it is possible to implement zero-copy sockets, and this has been done in FreeBSD, Linux, and probably other operating systems. To find out if your NIC has these features, call ethtool:
ethtool --show-offload <network device>
Example output on @cconrads' Dell 5540 laptop (output truncated):
$ ethtool --show-offload wlp59s0
Features for wlp59s0:
rx-checksumming: on [fixed]
tx-checksumming: on
    tx-checksum-ipv4: on [fixed]
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on [fixed]
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on [fixed]
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off [fixed]
    tx-tcp6-segmentation: on [fixed]
[snip]
Note that with the hardware separating message payload from message headers, there are effectively two distinct buffers in a system for handling TCP traffic. In a similar fashion, InfiniBand distinguishes between send/receive message semantics and remote memory access reads and writes.
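The same scatter/gather idea is visible from user space through the POSIX calls `writev()` and `readv()`. The sketch below is only an analogy to what the NIC does in hardware: it sends a header and a payload that live in separate buffers with one system call and receives them into two disjoint buffers. The helper names `write_gathered` and `read_scattered` are invented for this example, and the kernel may still copy the data internally.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather: write a header and a payload stored in two separate buffers
 * with a single system call, without first joining them in user space. */
ssize_t write_gathered(int fd, const void *header, size_t header_len,
                       const void *payload, size_t payload_len)
{
    assert(header != NULL && payload != NULL);

    struct iovec parts[2] = {
        { .iov_base = (void *)header,  .iov_len = header_len  },
        { .iov_base = (void *)payload, .iov_len = payload_len },
    };
    return writev(fd, parts, 2);
}

/* Scatter: read incoming bytes into two disjoint buffers. The first
 * header_len bytes land in `header`, the rest in `payload`. */
ssize_t read_scattered(int fd, void *header, size_t header_len,
                       void *payload, size_t payload_len)
{
    assert(header != NULL && payload != NULL);

    struct iovec parts[2] = {
        { .iov_base = header,  .iov_len = header_len  },
        { .iov_base = payload, .iov_len = payload_len },
    };
    return readv(fd, parts, 2);
}
```

Both helpers work on any file descriptor, for example a pipe or a connected TCP socket. As with plain `write()` and `read()`, the calls may transfer fewer bytes than requested, so production code would loop until everything is transferred.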
The socket API on Linux makes zero-copy available on the sender's side: after enabling the SO_ZEROCOPY socket option, the application calls send() with the MSG_ZEROCOPY flag. The receiver has to memory-map pages of the socket with mmap() before receiving data with getsockopt() (option TCP_ZEROCOPY_RECEIVE). An alternative Linux-specific approach might be to use ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count). This function replaces a pair of read() and write() calls and avoids the copy to and from user space. For zero-copy TCP, one could then create a dummy file with open() to get a file descriptor and memory-map the file with mmap() before sending or receiving data using the file descriptor. As of April 9, 2021, @cconrads had neither tested this nor found previous experiments.
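The sender side of the Linux zero-copy API can be sketched as follows. This is a minimal, untuned example under several assumptions: the helper names `send_zerocopy_or_copy` and `reap_completion` are made up, the fallback constant values are the ones used on common architectures in the kernel headers, and the code falls back to an ordinary copying send() when the kernel or socket type rejects SO_ZEROCOPY.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallback definitions for older libc headers (values as found in the
 * Linux kernel headers for common architectures). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical helper: try to send `buf` without copying it into the kernel.
 * Sets *used_zerocopy to 1 if SO_ZEROCOPY could be enabled, else to 0 and
 * sends with an ordinary copy. Returns the result of send(). */
ssize_t send_zerocopy_or_copy(int fd, const void *buf, size_t len,
                              int *used_zerocopy)
{
    assert(buf != NULL && used_zerocopy != NULL);

    int one = 1;
    *used_zerocopy =
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;

    /* With MSG_ZEROCOPY the kernel pins the user pages instead of copying
     * them, so `buf` must stay unmodified until the completion arrives. */
    return send(fd, buf, len, *used_zerocopy ? MSG_ZEROCOPY : 0);
}

/* Hypothetical helper: fetch one completion notification from the socket's
 * error queue. Returns 0 if one was read and -1 otherwise (for example
 * because none is pending yet). The control data holds the range of
 * completed send calls; this sketch does not parse it. */
int reap_completion(int fd)
{
    char control[128];
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    return recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 ? -1 : 0;
}
```

On a connected TCP socket, the application would call `send_zerocopy_or_copy()` and reuse or free the buffer only after `reap_completion()` has reported the corresponding notification. Since pinning pages and processing notifications has a cost of its own, zero-copy sends are generally only expected to pay off for large messages.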
ZeroMQ does not exploit zero-copy TCP sockets. This is suggested by its age (Linux gained zero-copy send semantics in 2017 and zero-copy receive semantics in 2018, whereas ZeroMQ 1.0 was released in 2009) and confirmed by reading the source code of ZeroMQ 4.3.2 in src/tcp.cpp. Furthermore, the ZeroMQ Guide clarifies in Chapter 2, Sockets and Patterns:
There is no way to do zero-copy on receive: ZeroMQ delivers you a buffer that you can store as long as you wish, but it will not write data directly into application buffers.
(This makes sense because ZeroMQ faces the same problem as the kernel: it must look at the packet headers to find out what to do with the packet.)
Thus, ZeroMQ is zero-copy only in the sense that it does not copy user-provided data within user space if the message is created with zmq_msg_init_data().
Linux and other operating systems implement zero-copy TCP sockets, but this requires hardware support, and the zero-copy semantics are only available through system-specific calls. ZeroMQ only offers zero-copy semantics in user space and only when sending messages.
- Jonathan Corbet: Zero-copy networking. 2017.
- Jonathan Corbet: Zero-copy TCP receive. 2018.
- Jonathan Corbet: A reworked TCP zero-copy receive API. 2018.
- The Geek Stuff: 9 Linux ethtool Examples to Manipulate Ethernet Card (NIC Card). 2010.
- The Linux Kernel Documentation: MSG_ZEROCOPY.
- Dragan Stancevic: Zero Copy I: User-Mode Perspective. 2003.
- Paul Grun: Introduction to InfiniBand for End Users. 2010.