[ Team LiB ] Previous Section Next Section

28.3 Raw Socket Output

Output on a raw socket is governed by the following rules:

  • Normal output is performed by calling sendto or sendmsg and specifying the destination IP address. write, writev, or send can also be called if the socket has been connected.

  • If the IP_HDRINCL option is not set, the starting address of the data for the kernel to send specifies the first byte following the IP header because the kernel will build the IP header and prepend it to the data from the process. The kernel sets the protocol field of the IPv4 header that it builds to the third argument from the call to socket.

  • If the IP_HDRINCL option is set, the starting address of the data for the kernel to send specifies the first byte of the IP header. The amount of data to write must include the size of the caller's IP header. The process builds the entire IP header, except: (i) the IPv4 identification field can be set to 0, which tells the kernel to set this value; (ii) the kernel always calculates and stores the IPv4 header checksum; and (iii) IP options may or may not be included; see Section 27.2

  • The kernel fragments raw packets that exceed the outgoing interface MTU.

    Raw sockets are documented to provide an identical interface to the one a protocol would have if it was resident in the kernel [McKusick et al. 1996] Unfortunately, this means that certain pieces of the API are dependent on the OS kernel, specifically with regard to the byte ordering of the fields in the IP header. On many Berkeley-derived kernels, all fields are in network byte order except ip_len and ip_off, which are in host byte order (pp. 233 and 1057 of TCPv2). On Linux and OpenBSD, however, all the fields must be in network byte order.

    The IP_HDRINCL socket option was introduced with 4.3BSD Reno. Before this, the only way for an application to specify its own IP header in packets sent on a raw IP socket was to apply a kernel patch that was introduced in 1988 by Van Jacobson to support traceroute. This patch required the application to create a raw IP socket specifying a protocol of IPPROTO_RAW, which has a value of 255 (and is a reserved value and must never appear as the protocol field in an IP header).

    The functions that perform input and output on raw sockets are some of the simplest in the kernel. For example, in TCPv2, each function requires about 40 lines of C code (pp. 1054–1057), compared to TCP input at about 2,000 lines and TCP output at about 700 lines.

Our description of the IP_HDRINCL socket option is for 4.4BSD. Earlier versions, such as Net/2, filled in more fields in the IP header when this option was set.

With IPv4, it is the responsibility of the user process to calculate and set any header checksums contained in whatever follows the IPv4 header. For example, in our ping program (Figure 28.14), we must calculate the ICMPv4 checksum and store it in the ICMPv4 header before calling sendto.

IPv6 Differences

There are a few differences with raw IPv6 sockets (RFC 3542 [Stevens et al. 2003]):

  • All fields in the protocol headers sent or received on a raw IPv6 socket are in network byte order.

  • There is nothing similar to the IPv4 IP_HDRINCL socket option with IPv6. Complete IPv6 packets (including the IPv6 header or extension headers) cannot be read or written on an IPv6 raw socket. Almost all fields in an IPv6 header and all extension headers are available to the application through socket options or ancillary data (see Exercise 28.1). Should an application need to read or write complete IPv6 datagrams, datalink access (described in Chapter 29) must be used.

  • Checksums on raw IPv6 sockets are handled differently, as will be described shortly.

IPV6_CHECKSUM Socket Option

For an ICMPv6 raw socket, the kernel always calculates and stores the checksum in the ICMPv6 header. This differs from an ICMPv4 raw socket, where the application must do this itself (compare Figures 28.14 and 28.16). While ICMPv4 and ICMPv6 both require the sender to calculate the checksum, ICMPv6 includes a pseudoheader in its checksum (we will discuss the concept of a pseudoheader when we calculate the UDP checksum in Figure 29.14). One of the fields in this pseudoheader is the source IPv6 address, and normally the application lets the kernel choose this value. To prevent the application from having to try to choose this address just to calculate the checksum, it is easier to let the kernel calculate the checksum.

For other raw IPv6 sockets (i.e., those created with a third argument to socket other than IPPROTO_ICMPV6), a socket option tells the kernel whether to calculate and store a checksum in outgoing packets and verify the checksum in received packets. By default, this option is disabled, and it is enabled by setting the option value to a nonnegative value, as in


int offset = 2;

if (setsockopt(sockfd, IPPROTO_IPV6, IPV6_CHECKSUM,
               &offset, sizeof(offset)) < 0)
    error

This not only enables checksums on this socket, it also tells the kernel the byte offset of the 16-bit checksum: 2 bytes from the start of the application data in this example. To disable the option, it must be set to -1. When enabled, the kernel will calculate and store the checksum for outgoing packets sent on the socket and also verify the checksums for packets received on the socket.

    [ Team LiB ] Previous Section Next Section