High-Performance Local Area Communication With Fast Sockets

Steven H. Rodrigues Thomas E. Anderson David E. Culler

Computer Science Division, University of California at Berkeley, Berkeley, CA 94720

Abstract

Modern switched networks such as ATM and Myrinet enable low-latency, high-bandwidth communication. This performance has not been realized by current applications, because of the high processing overheads imposed by existing communications software. These overheads are usually not hidden with large packets; most network traffic is small. We have developed Fast Sockets, a local-area communication layer that utilizes a high-performance protocol and exports the Berkeley Sockets programming interface. Fast Sockets realizes round-trip transfer times of 60 microseconds and maximum transfer bandwidth of 33 MB/second between two UltraSPARC 1s connected by a Myrinet network. Fast Sockets obtains performance by collapsing protocol layers, using simple buffer management strategies, and utilizing knowledge of packet destinations for direct transfer into user buffers. Using receive posting, we make the Sockets API a single-copy communications layer and enable regular Sockets programs to exploit the performance of modern networks. Fast Sockets transparently reverts to standard TCP/IP protocols for wide-area communication.

This work was supported in part by the Defense Advanced Research Projects Agency (N00600-93-C-2481, F30602-95-C-0014), the National Science Foundation (CDA 0401156), California MICRO, the AT&T Foundation, Digital Equipment Corporation, Exabyte, Hewlett Packard, Intel, IBM, Microsoft, Mitsubishi, Siemens Corporation, Sun Microsystems, and Xerox. Anderson and Culler were also supported by National Science Foundation Presidential Faculty Fellowships, and Rodrigues by a National Science Foundation Graduate Fellowship. The authors can be contacted at {steverod, tea, culler}@cs.berkeley.edu.

1 Introduction

The development and deployment of high performance local-area networks such as ATM [de Prycker 1993], Myrinet [Seitz 1994], and switched high-speed Ethernet has the potential to dramatically improve communication performance for network applications. These networks are capable of microsecond latencies and bandwidths of hundreds of megabits per second; their switched fabrics eliminate the contention seen on shared-bus networks such as traditional Ethernet. Unfortunately, this raw network capacity goes unused, due to current network communication software.

Most of these networks run the TCP/IP protocol suite [Postel 1981b, Postel 1981c, Postel 1981a]. TCP/IP is the default protocol suite for traffic, and provides inter-operability among a wide variety of computing platforms and network technologies. The TCP protocol provides the abstraction of a reliable, ordered byte stream. The UDP protocol provides an unreliable, unordered datagram service. Many other application-level protocols (such as the FTP file transfer protocol, the Sun Network File System, and the X Window System) are built upon these two basic protocols.

TCP/IP's observed performance has not scaled to the ability of modern network hardware, however. While TCP is capable of sustained bandwidth close to the rated maximum of modern networks, actual bandwidth is very much implementation-dependent. The round-trip latency of commercial TCP implementations is hundreds of microseconds higher than the minimum possible on these networks. Implementations of the simpler UDP protocol, which lack the reliability and ordering mechanisms of TCP, perform little better [Keeton et al. 1995, Kay & Pasquale 1993, von Eicken et al. 1995]. This poor performance is due to the high per-packet processing costs (processing overhead) of the protocol implementations. In local-area environments, where on-the-wire times are small, these processing costs dominate small-packet round-trip latencies.

Local-area traffic patterns exacerbate the problems posed by high processing costs. Most LAN traffic consists of small packets [Kleinrock & Naylor 1974, Schoch & Hupp 1980, Feldmeier 1986, Amer et al. 1987, Cheriton & Williamson 1987, Gusella 1990, Caceres et al. 1991, Claffy et al. 1992]. Small packets are the rule even in applications considered bandwidth-intensive: 95% of all packets in an NFS trace performed at the Berkeley Computer Science Division carried less than 192 bytes of user data [Dahlin et al. 1994]; the mean packet size in the trace was 382 bytes. Processing overhead is the dominant transport cost for packets this small, limiting NFS performance on a high-bandwidth network. This is true for other applications: most application-level protocols in local-area use today (X11, NFS, FTP, etc.) operate in a request-response, or client-server, manner: a client machine sends a small request message to a server, and awaits a response from the server. In the request-response model, processing overhead usually cannot be hidden through packet pipelining or through overlapping communication and computation, making round-trip latency a critical factor in protocol performance.

Traditionally, there are several methods of attacking the processing overhead problem: changing the application programming interface (API), changing the underlying network protocol, changing the implementation of the protocol, or some combination of these approaches. Changing the API modifies the code used by applications to access communications functionality. While this approach may yield better performance for new applications, legacy applications must be re-implemented to gain any benefit. Changing the communications protocol changes the "on-the-wire" format of data and the actions taken during a communications exchange, for example by modifying the TCP packet format. A new or modified protocol may improve communications performance, but at the price of incompatibility: applications communicating via the new protocol are unable to share data directly with applications using the old protocol. Changing the protocol implementation rewrites the software that implements a particular protocol; packet formats and protocol actions do not change, but the code that performs these actions does. While this approach provides full compatibility with existing protocols, fundamental limitations of the protocol design may limit the performance gain.

Recent systems, such as Active Messages [von Eicken et al. 1992], Remote Queues [Brewer et al. 1995], and native U-Net [von Eicken et al. 1995], have used the first two methods; they implement new protocols and new programming interfaces to obtain improved local-area network performance. The protocols and interfaces are lightweight and provide programming abstractions that are similar to the underlying hardware. All of these systems realize latencies and throughput close to the physical limits of the network. However, none of them offer compatibility with existing applications.

Other work has tried to improve performance by re-implementing TCP. Recent work includes zero-copy TCP for Solaris [Chu 1996] and a TCP interface for the U-Net interface [von Eicken et al. 1995]. These implementations can inter-operate with other TCP/IP implementations and improve throughput and latency relative to standard TCP/IP stacks. Both implementations can realize the full bandwidth of the network for large packets. However, both systems have round-trip latencies considerably higher than the raw network.

This paper presents our solution to the overhead problem: a new communications protocol and implementation for local-area networks that exports the Berkeley Sockets API, uses a low-overhead protocol for local-area use, and reverts to standard protocols for wide-area communication. The Sockets API is a widely used programming interface that treats network connections as files; application programs read and write network connections exactly as they read and write files. The Fast Sockets protocol has been designed and implemented to obtain a low-overhead data transmission/reception path. Should a Fast Sockets program attempt to connect with a program outside the local-area network, or to a non-Fast Sockets program, the software transparently reverts to standard TCP/IP sockets. These features enable high-performance communication through relinking existing application programs.

Fast Sockets achieves its performance through a number of strategies. It uses a lightweight protocol and efficient buffer management to minimize bookkeeping costs. The communication protocol and its programming interface are integrated to eliminate module-crossing costs. Fast Sockets eliminates copies within the protocol stack by using knowledge of packet memory destinations. Additionally, Fast Sockets was implemented without modifications to the kernel.

A major portion of Fast Sockets' performance is due to receive posting, a technique of utilizing information from the API about packet destinations to minimize copies. This paper describes the use of receive posting in a high-performance communications stack. It also describes the design and implementation of a low-overhead protocol for local-area communication.

The Sockets interface is usually the topmost layer, sitting above the protocol. The protocol layer may contain sub-layers: for example, the TCP protocol code sits above the IP protocol code. Below the protocol layer is the interface layer, which communicates with the network hardware. The interface layer usually has two por-

The rest of the paper is organized as follows. Section 2 describes problems of current TCP/IP and Sockets implementations and how these problems affect communication performance. Section 3 describes how the Fast Sockets design attempts to overcome these problems.
Section 4 describes the performance tions, the network programming interface, which pre- of the resultant system, and section 5 compares Fast pares outgoing data packets, and the network device Sockets to other attempts to improve communication driver, which transfers data to and from the network performance. Section 6 presents our conclusions and interface card (NIC). directions for future work in this area. This multi-layer organization enables proto- col stacks to be built from many combinations of protocols, programming interfaces, and network 2 Problems with TCP/IP devices, but this ¯exibility comes at the price of performance. Layer transitions can be costly in While TCP/IP can achieve good throughput on cur- time and programming effort. Each layer may use rently deployed networks, its round-trip latency is a different abstraction for data storage and transfer, usually poor. Further, observed bandwidth and round- requiring data transformation at every layer bound- trip latencies on next-generation network technolo- ary. Layering also restricts information transfer. gies such as Myrinet and ATM do not begin to ap- Hidden implementation details of each layer can proach the raw capabilities of these networks [Keeton cause large, unforeseen impacts on performance et al. 1995]. In this section, we describe a number of [Clark 1982, Crowcroft et al. 1992]. Mechanisms features and problems of commercial TCP implemen- have been proposed to overcome these dif®culties tations, and how these features affect communication [Clark & Tennenhouse 1990], but existing work has performance. focused on message throughput, rather than protocol latency [Abbott & Peterson 1993]. Also, the number 2.1 Built for the Wide Area of programming interfaces and protocols is small: there are two programming interfaces (Berkeley TCP/IP was originally designed, and is usually im- Sockets and the System V Transport Layer Interface) plemented, for wide-area networks. While TCP/IP and only a few data transfer protocols (TCP/IP is usable on a local-area network, it is not optimized and UDP/IP) in widespread usage. This paucity of for this domain. For example, TCP uses an in-packet distinct layer combinations means that the generality checksum for end-to-end reliability, despite the pres- of the multi-layer organization is wasted. Reducing ence of per-packet CRC's in most modern network the number of layers traversed in the communications hardware. But computing this checksum is expensive, stack should reduce or eliminate these layering costs creating a bottleneck in packet processing. IP uses for the common case of data transfer. header ®elds such as `Time-To-Live' which are only relevant in a wide-area environment. IP also supports 2.3 Complicated Memory Management internetwork routing and in-¯ight packet fragmenta- tion and reassembly, features which are not useful in Current TCP/IP implementations use a complicated a local-area environment. The TCP/IP model assumes memory management mechanism. This system ex- communication between autonomous machines that ists for a number of reasons. First, a multi-layered cooperate only minimally. However, machines on a protocol stack means packet headers are added (or re- local-area network frequently share a common admin- moved) as the packet moves downward (or upward) istrative service, a common ®le system, and a com- through the stack. This should be done easily and ef- mon user base. It should be possible to extend this ®ciently, without excessive copying. 
Second, buffer commonality and cooperation into the network com- memory inside the operating system kernel is a scarce munication software. resource; it must be managed in a space-ef®cient fash- ion. This is especially true for older systems with lim- 2.2 Multiple Layers ited physical memory. To meet these two requirements, mechanisms such Standard implementations of the Sockets interface as the Berkeley mbuf have been used. An mbuf and the TCP/IP protocol suite separate the protocol can directly hold a small amount of data, and mbufs and interface stack into multiple layers. The Sockets can be chained to manage larger data sets. Chain- ing makes adding and removing packet headers easy. volves a number of trade-offs. Active Messages has The mbuf abstraction is not cheap, however: 15% its own `on-the-wire' packet format; this makes a full of the processing time for small TCP packets is con- implementation of TCP/IP on Active Messages infea- sumed by mbuf management[Kay & Pasquale 1993]. sible, as the transmitted packets will not be compre- Additionally, to take advantage of the mbuf abstrac- hensible by other TCP/IP stacks. Instead, we elected tion, user data must be copied into and out of mbufs, to implement our own protocol for local area com- which consumes even more time in the data transfer munication, and fall back to normal TCP/IP for wide- critical path. This copying means that nearly one- area communication. Active Messages operates pri- quarter of the small-packet processing time in a com- marily at user-level; although access to the network mercial TCP/IP stack is spent on memory manage- device is granted and revoked by the operating system ment issues. Reducing the overhead of memory man- kernel, data transmission and reception is performed agement is therefore critical to improving communi- by user-level code. For maximum performance, Fast cations performance. Sockets is written as a user-level library. While this organization avoids user-kernel transitions on com- munications events (data transmission and reception), 3 Fast Sockets Design it makes the maintenance of shared and global state, such as the TCP and UDP port name spaces, dif®cult. Fast Sockets is an implementation of the Sock- Some global state can be maintained by simply using ets API that provides high-performance communi- existing facilities. For example, the port name spaces cation and inter-operability with existing programs. can use the in-kernel name management functions. It yields high-performance communication through Other shared or global state can be maintained by us- a low-overhead protocol layered on top of a low- ing a server process to store this state, as described overhead transport mechanism (Active Messages). in [Maeda & Bershad 1993b]. Finally, using Active Interoperability with existing programs is obtained by Messages limits Fast Sockets communication to the supporting most of the Sockets API and transparently local-area domain. Fast Sockets supports wide-area using existing protocols for communication with non- communication by automatically switching to stan- Fast Sockets programs. In this section, we describe dard network protocols for non-local addresses. It is the design decisions and consequent trade-offs of Fast a reasonable trade-off, as endpoint processing over- Sockets. heads are generally not the limiting factor for internet- work communication. 3.1 Built For The Local Area Active Messages does have a number of bene®ts, however. 
The handler abstraction is extremely use- Fast Sockets is targeted at local-area networks of ful. A handler executes upon message reception at workstations, where processing overhead is the pri- the destination, analogous to a network interrupt. At mary limitation on communicationsperformance. For this time, the protocol can store packet data for later low-overhead access to the network with a de®ned, use, pass the packet data to the user program, or deal portable interface, Fast Sockets uses Active Messages with exceptional conditions. Message handlers allow [von Eicken et al. 1992, Martin 1994, Culler et al. for a wide variety of control operations within a pro- 1994, Mainwaring & Culler 1995]. An active mes- tocol without slowing down the critical path of data sage is a network packet which contains the name of transmission and reception. Handler arguments en- a handler function and data for that handler. When an able the easy separation of packet data and metadata: active message arrives at its destination, the handler packet data (that is, application data) is carried as a is looked up and invoked with the data carried in the bulk transfer argument, and packet metadata (proto- message. While conceptually similar to a remote pro- col headers) are carried in the remaining word-sized cedure call [Birrell & Nelson 1984], an active mes- arguments. This is only possible if the headers can sage is constrained in the amount and types of data ®t into the number of argument words provided; for that can be carried and passed to handler functions. our local-area protocol, the 8 words supplied by Ac- These constraints enable the structuring of an Active tive Messages is suf®cient. Messages layer for high performance. Also, Active Fast Sockets further optimizes for the local area by Messages uses protected user-level access to the net- omitting features of TCP/IP unnecessary in that envi- work interface, removing the operating system kernel ronment. For example, Fast Sockets uses the check- from the critical path. Active messages are reliable, sum or CRC of the network hardware instead of one but not ordered. in the packet header; software checksums make little Using Active Messages as a network transport in- sense when packets are only traversing a single net- work. Fast Sockets has no equivalent of IP's `Time- User Application User Application To-Live' ®eld, or IP's internetwork routing support.   send()   recv() Since maximum packet sizes will not change within a  local area network, Fast Sockets does not support IP-  Fast Sockets Fast Sockets style in-¯ight packet fragmentation. NIC NIC 3.2 Collapsing Layers

To avoid the performance and structuring problems created by the multi-layered implementation of Unix Figure 1: Data transfer in Fast Sockets. A send() TCP/IP, Fast Sockets collapses the API and protocol call transmits the data directly from the user buffer layers of the communications stack together. This into the network. When it arrives at the remote des- avoids abstraction con¯icts between the programming tination, the message handler places it into the socket interface and the protocol and reduces the number buffer, and a subsequent recv() call copies it into of conversions between layer abstractions, further re- the user buffer. ducing processing overheads. The actual network device interface, Active Mes- sages, remains a distinct layer from Fast Sockets. This facilitates the portability of Fast Sockets between delivery and make data transfer to a user buffer a sim- different operating systems and network hardware. ple memory copy. The argument words of the data Active Messages implementations are available for transfer messages carry packet metadata; because the the Intel Paragon [Liu & Culler 1995], FDDI [Mar- argument words are passed separately to the handler, tin 1994], Myrinet [Mainwaring & Culler 1995], and there is no need for the memory management system ATM [von Eicken et al. 1995]. Layering costs are to strip off packet headers. kept low because Active Messages is a thin layer, and Fast Sockets eliminates send buffering. Because all of its implementation-dependentconstants (such as many user applications rely heavily on small pack- maximum packet size) are exposed to higher layers. ets and on request-response behavior, delaying packet The Fast Sockets layer stays lightweight by exploit- transmission only serves to increase user-visible la- ing Active Message handlers. Handlers allow rarely- tency. Eliminating send-side buffering reduces proto- used functionality, such as connection establishment, col overhead because there are no copies on the send to be implemented without affecting the critical path side of the protocol path Ð Active Messages already of data transmission and reception. There is no need provides reliability. to test for uncommon events when a packet arrives Ð Figure 1 shows Fast Sockets' send mechanism and this is encoded directly in the handler. Unusual data buffering techniques. transmission events, such as out-of-band data, also A possible problem with this approach is that hav- use their own handlers to keep normal transfer costs ing many Fast Sockets consumes considerably more low. memory than the global mbuf pool used in traditional Reducing the number of layers and exploiting Ac- kernel implementations. This is not a major concern, tive Message handlers lowers the protocol- and API- for two reasons. First, the memory capacity of current speci®c costs of communication. While collapsing workstations is very large; the scarce physical mem- layers means that every protocol layer±API combina- ory situations that the traditional mechanisms are de- tion has to be written anew, the number of such com- signed for is generally not a problem. Second, the binations is relatively few, and the number of distinct socket buffers are located in pageable virtual mem- operations required for each API is small. ory Ð if memory and scheduling pressures are severe enough, the buffer can be paged out. Although pag- 3.3 Simple Buffer Management ing out the buffer will lead to worse performance, we expect that this is an extremely rare occurrence. 
Fast Sockets avoids the complexities of mbuf-style A more serious problem with placing socket buffers memory management by using a single, contiguous in user virtual memory is that it becomes extremely virtual memory buffer for each socket. Data is trans- dif®cult to share the socket buffer between processes. ferred directly into this buffer via Active Message Such sharing can arise due to a fork() call, for in- data transfer messages. The message handler places stance. Currently, Fast Sockets cannot be shared be- data sequentially into the buffer to maintain in-order tween processes (see section 3.7). User Application User Application [Buzzard et al. 1996] and the SHRIMP network in-   terface [Blumrich et al. 1994]. Also, the initial ver- send()   recv() sion of Generic Active Messages [Culler et al. 1994] Fast Sockets Fast Sockets offered only sender-based memory management for data transfer. NIC NIC Sender-based memory management has a major drawback for use in byte-stream and message-passing such as Sockets. With sender-based manage- ment, the sending and receiving endpoints must syn- chronize and agree on the destination of each packet. Figure 2: Data transfer via receive posting. If a This synchronization usually takes the form of a mes- recv() call is issued prior to the arrival of the de- sage exchange, which imposes a time cost and a sired data, the message handler can directly route the resource cost (the message exchange uses network data into the user buffer, bypassing the socket buffer. bandwidth). To minimize this synchronization cost, the original version of Fast Sockets used the socket buffer as a default destination. This meant that when 3.4 Copy Avoidance a recv() call was made, data already in ¯ight had to be directed through the socket buffer, as shown in The standard Fast Sockets data transfer mechanism Figure 1. These synchronization costs lowered Fast involves two copies along the receive path, from the Sockets' throughput relative to Active Messages' on network interface to the socket buffer, and then from systems where the network was not a limiting factor, the socket buffer to the user buffer speci®ed in a such as the Myrinet network. recv() call. While the copy from the network in- Generic Active Messages 1.5 introduced a new data terface cannot be avoided, the second copy increases transfer message type that did not require sender- the processing overhead of a data packet, and con- based memory management. This message type, the sequently, round-trip latencies. A high-performance medium message, transferred data into an anonymous communications layer should bypass this second copy region of user memory and then invoked the handler whenever possible. with a pointer to the packet data. The handler was It is possible, under certain circumstances, to avoid responsible for long-term data storage, as the buffer the copy through the socket buffer. If the data's ®nal was deallocated upon handler completion. Keeping memory destination is already known upon packet ar- memory management on the receiver improves Fast rival (through a recv() call), the data can be directly Sockets performance considerably. Synchronization copied there. We call this technique receive posting. is now only required between the API and the pro- Figure 2 shows how receive posting operates in Fast tocol layers, which is simple due to their integration. Sockets. 
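The arrangement just described can be pictured with a short sketch. The structure and function names below (fs_buf, fs_buf_append, fs_buf_drain) are illustrative only and are not the actual Fast Sockets source; wrap-around and flow control are omitted.

```c
/* Sketch of a single contiguous per-socket buffer; illustrative, not the
 * actual Fast Sockets code.  Wrap-around and flow control are omitted. */
#include <stddef.h>
#include <string.h>

typedef struct fs_buf {
    char   *base;  /* one contiguous virtual-memory region (pageable) */
    size_t  size;  /* total capacity, e.g. 64 KB                      */
    size_t  head;  /* next byte to hand to recv()                     */
    size_t  tail;  /* next free byte for the data-arrival handler     */
} fs_buf;

/* Called by the Active Message data handler: append payload sequentially
 * so the byte stream stays in order. */
static size_t fs_buf_append(fs_buf *b, const void *pkt, size_t len)
{
    size_t space = b->size - b->tail;
    size_t n = len < space ? len : space;
    memcpy(b->base + b->tail, pkt, n);
    b->tail += n;
    return n;
}

/* Called by recv(): delivering buffered data is a single memory copy. */
static size_t fs_buf_drain(fs_buf *b, void *user, size_t want)
{
    size_t avail = b->tail - b->head;
    size_t n = want < avail ? want : avail;
    memcpy(user, b->base + b->head, n);
    b->head += n;
    if (b->head == b->tail)        /* empty again: reuse from the start */
        b->head = b->tail = 0;
    return n;
}
```

Because the buffer is one contiguous region, the handler's append and the recv()-side drain are each a single memcpy(), which is the property the protocol relies on.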
If the message handler determines that an The receive handler now determines the memory des- incoming packet will satisfy an outstanding recv() tination of incoming data at packet arrival time, en- call, the packet's contents are received directly into abling a recv() of in-¯ight data to bene®t from re- the user buffer. The socket buffer is never touched. ceive posting and bypass the socket buffer. The net Receive posting in Fast Sockets is possible because result is that receiver-based memory management did of the integration of the API and the protocol, and the not signi®cantly affect Fast Sockets' round-trip laten- handler facilities of Active Messages. Protocol±API cies and improved large-packet throughput substan- integration allows knowledge of the user's destina- tially, to within 10% of the throughput of raw Active tion buffer to be passed down to the packet process- Messages. ing code. Using Active Message handlers means that The use of receiver-based memory management the Fast Sockets code can decide where to place the has some trade-offs relative to sender-based systems. incoming data when it arrives. In a true zero-copy transport layer, where packet data is transferred via DMA to user memory, transfer to 3.5 Design Issues In Receive Posting an anonymous region can place an extra copy into the receive path. This is not a problem for two reasons. Many high-performancecommunications systems are First, many current I/O architectures, like that on the now using sender-based memory management, where SPARC, are limited in the memory regions that they the sending node determines the data's ®nal mem- can perform DMA operations to. Second, DMA oper- ory destination. Examples of communications lay- ations usually require memory pages to be pinned in ers with this memory management style are Hamlyn physical memory, and pinning an arbitrary page can be an expensive operation. For these reasons, cur- Sockets and non-Fast Sockets programs can connect rent versions of Active Messages move data from the to a socket that has performed a listen() call, network interface into an anonymous staging area be- there two distinct port numbers for a given socket. fore either invoking the message handler (for medium The port supplied by the user accepts connection re- messages) or placing the data in its ®nal destination quests from programs using normal protocols. The (for standard data transfer messages). Consequently, second port is derived from hashing on the user port receiver-based memory management does not impose number, and is used to accept connection requests a large cost in our current system. For a true zero-copy from Fast Sockets programs. An accept() call system, it should be possible for a handler to be re- multiplexes connection requests from both ports. sponsible for moving data from the network interface The connection establishment mechanism has card to user memory. some trade-offs. It utilizes existing name and Based on our experiences implementing Fast Sock- connection management facilities, minimizing the ets with both sender- and receiver-based memory amount of code in Fast Sockets. Using TCP/IP to management schemes, we believe that, for messaging bootstrap the connection can impose a high time cost, layers such as Sockets, the performance increase de- which limits the realizable throughput of short-lived livered by receiver-based management schemes out- connections. Using two port numbers also introduces weigh the implementation costs. 
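The decision the handler makes can be summarized in the following sketch, written in the style of an Active Message data handler. The fs_socket structure and its fields are assumptions for illustration, and the real handler also receives the protocol metadata in the remaining argument words.

```c
/* Illustrative receive-posting logic; not the actual Fast Sockets handler. */
#include <stddef.h>
#include <string.h>

typedef struct fs_socket {
    char   *posted_buf;   /* user buffer from an outstanding recv(), or NULL */
    size_t  posted_len;   /* bytes the recv() asked for                      */
    size_t  posted_got;   /* bytes delivered so far                          */
    char   *sock_buf;     /* fallback socket buffer (Section 3.3)            */
    size_t  sock_len;     /* bytes currently staged in the socket buffer     */
} fs_socket;

/* Data-arrival handler: runs when an Active Message carrying stream data
 * reaches the destination. */
void fs_data_handler(fs_socket *s, const char *payload, size_t len)
{
    if (s->posted_buf != NULL) {
        /* A recv() is outstanding: copy straight into the user's buffer
         * and bypass the socket buffer entirely. */
        size_t room = s->posted_len - s->posted_got;
        size_t n = len < room ? len : room;
        memcpy(s->posted_buf + s->posted_got, payload, n);
        s->posted_got += n;
        payload += n;
        len -= n;
    }
    if (len > 0) {
        /* No posted receive (or it filled up): stage the remainder in the
         * socket buffer for a later recv() to copy out. */
        memcpy(s->sock_buf + s->sock_len, payload, len);
        s->sock_len += len;
    }
}
```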
the potential problem of con¯icts: the Fast Sockets- generated port number could con¯ict with a user port. We do not expect this to realistically be a problem. 3.6 Other Fast Socket Operations

Fast Sockets supports the full sockets API, includ- 3.7 Fast Sockets Limitations ing socket creation, name management, and connec- tion establishment. The socket call for the for the Fast Sockets is a user-level library. This limits its full AF INET address family and the ªdefault protocolº compatibility with the Sockets abstraction. First, ap- creates a Fast Socket; this allows programs to explic- plications must be relinked to use Fast Sockets, al- itly request TCP or UDP sockets in a standard way. though no code changes are required. More seriously, Fast Sockets utilizes existing name management fa- Fast Sockets cannot currently be shared between two cilities. Every Fast Socket has a shadow socket as- processes (for example, via a fork() call), and all sociated with it; this shadow socket is of the same Fast Sockets state is lost upon an exec() or exit() type and shares the same ®le descriptor as the Fast call. This poses problems for traditional Internet Socket. Whenever a Fast Socket requests to bind() server daemons and for ªsuper-serverº daemons such to an AF INET address, the operation is ®rst per- as inetd, which depend on a fork() for each formed on the shadow socket to determine if the oper- incoming request. User-level operation also causes ation is legal and the name is available. If the shadow problems for socket termination; standard TCP/IP socket bind succeeds, the AF INET name is bound sockets are gracefully shut down on process termina- to the Fast Socket's Active Message endpoint name tion. These problems are not insurmountable. Shar- via an external name service. Other operations, such ing Fast Sockets requires an Active Messages layer as setsockopt() and getsockopt(), work in that allows endpoint sharing and either the use of a a similar fashion. dedicated server process [Maeda & Bershad 1993b] Shadow sockets are also used for connection es- or the use of shared memory for every Fast Socket's tablishment. When a connect() call is made by state. Recovering Fast Sockets state lost during an the user application, Fast Sockets determines if the exec() call can be done via a dedicated server pro- name is in the local subnet. For local addresses, cess, where the Fast Socket's state is migrated to and the shadow socket performs a connect() to a from the server before and after the exec() Ð sim- port number that is determined through a hash func- ilar to the method used by the user-level Transport tion. If this connect() succeeds, then a handshake Layer Interface [Stevens 1990]. is performed to bootstrap the connection. Should The Fast Sockets library is currently single- the connect() to the hashed port number fail, or threaded. This is problematic for current versions of the connection bootstrap process fail, then a normal Active Messages because an application must explic- connect() call is performed. This mechanism al- itly touch the network to receive messages. Since a lows a Fast Sockets program to connect to a non-Fast user application could engage in an arbitrarily long Sockets program without dif®culties. computation, it is dif®cult to implement operations An accept() call becomes more complicated as such as asynchronous I/O. While multi-threading a result of this scheme, however. Because both Fast offers one solution, it makes the library less portable, and imposes synchronization costs. 
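The connection logic of Section 3.6 can be sketched as follows. The helpers fast_port_hash() and fast_bootstrap() are hypothetical stand-ins for the port-hashing and handshake steps described above; the code is illustrative rather than the actual library implementation, and a failed attempt falls back to an ordinary TCP connection on the user-supplied port.

```c
/* Illustrative connect() path: try the hashed Fast Sockets port first,
 * then fall back to a normal TCP connection.  Helper names are hypothetical. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

extern unsigned short fast_port_hash(unsigned short user_port);
extern int fast_bootstrap(int fd);   /* exchanges Active Message endpoint names */

int fs_connect(const struct sockaddr_in *dest, int dest_is_local)
{
    int fd;

    if (dest_is_local) {
        /* Try the hashed Fast Sockets port on a shadow socket first. */
        struct sockaddr_in fast = *dest;
        fast.sin_port = htons(fast_port_hash(ntohs(dest->sin_port)));
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0 && connect(fd, (struct sockaddr *)&fast, sizeof fast) == 0
                    && fast_bootstrap(fd) == 0)
            return fd;               /* the peer speaks Fast Sockets */
        if (fd >= 0)
            close(fd);               /* hashed port or bootstrap failed */
    }
    /* Non-local address or non-Fast Sockets peer: fall back to a normal
     * TCP/IP connection on the user-supplied port. */
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0 && connect(fd, (const struct sockaddr *)dest, sizeof *dest) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```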
Layer Per-Byte t0 Actual (µsec) (µsec) Startup Fast Sockets 0.068 157.8 57.8 4 Performance Active Messages 0.129 38.9 45.0 TCP/IP (Myrinet) 0.076 736.4 533.0 This section presents our performance measurements TCP/IP (Fast Ethernet) 0.174 366.2 326.0 of Fast Sockets, both for microbenchmarks and a few Small Packets applications. The microbenchmarks assess the suc- Fast Sockets 0.124 61.4 57.8 cess of our efforts to minimize overhead and round- Active Messages 0.123 45.4 45.0 trip latency. We report round-trip times and sustained TCP/IP (Myrinet) 0.223 533.0 533.0 bandwidth available from Fast Sockets, using both TCP/IP (Fast Ethernet) 0.242 325.0 326.0 our own microbenchmarks and two publicly available benchmark programs. The true test of Fast Sockets' usefulness, however, is how well its raw performance Table 1: Least-squares analysis of the Solaris round- is exposed to applications. We present results from an trip microbenchmark. The per-byte and estimated FTP application to demonstrate the usefulness of Fast startup (t0) costs are for round-trip latency, and are Sockets in a real-world environment. measured in microseconds. The actual startup costs (for a single-byte message) are also shown. Ac- tual and estimated costs differ because round-trip la- 4.1 Experimental Setup tency is not strictly linear. Per-byte costs for Fast Fast Sockets has been implemented using Generic Ac- Sockets are lower than for Active Messages because tive Messages (GAM) [Culler et al. 1994] on both Fast Sockets bene®ts from packet pipelining; the Ac-

tive Messages test only sends a single packet at a

: : HP/UX 9:0 x and Solaris 2 5. The HP/UX platform consists of two 99Mhz HP735's interconnected by an time. ªSmall Packetsº examines protocol behaviorfor FDDI network and using the Medusa network adapter packets smaller than 1K; here, Fast Sockets and Ac- [Banks & Prudence 1993]. The Solaris platform is a tive Messages do considerably better than TCP/IP. collection of UltraSPARC 1's connected via a Myri- net network [Seitz 1994]. For all tests, there was no TCP/IP TCP NODELAY option to force packets to be other load on the network links or switches. transmitted as soon as possible (instead of the de- Our microbenchmarks were run against a variety of fault behavior, which attempts to batch small pack- TCP/IP setups. The standard HP/UX TCP/IP stack is ets together); this reduces throughput for small trans- well-tuned, but there is also a single-copy stack de- fers, but yields better round-trip times. The socket signed for use on the Medusa network interface. We buffersize was set to 64 Kbytes1, as this also improves ran our tests on both stacks. While the Solaris TCP/IP TCP/IP round-trip latency. We tested TCP/IP and Fast stack has reasonably good performance, the Myrinet Sockets for transfers up to 64 Kbytes; Active Mes- TCP/IP drivers do not. Consequently, we also ran our sages only supports 4 Kbyte messages. microbenchmarks on a 100-Mbit Ethernet, which has an extremely well-tuned driver. Figures 3 and 4 present the results of the round- We used Generic Active Messages 1.0 on the trip microbenchmark. Fast Sockets achieves low la- HP/UX platform as a base for Fast Sockets and as our tency round-trip times, especially for small packets. Active Messages layer. The Solaris tests used Generic Round-trip time scales linearly with increasing trans- Active Messages 1.5, which adds support for medium fer size, re¯ecting the time spent in moving the data messages (receiver-based memory management) and to the network card. There is a ªhiccupº at 4096 bytes client-server program images. on the Solaris platform, which is the maximum packet size for Active Messages. This occurs because Fast Sockets' fragmentation algorithm attempts to balance 4.2 Microbenchmarks packet sizes for better round-trip and bandwidth char- 4.2.1 Round-Trip Latency acteristics. Fast Sockets' overhead is low, staying rel- atively constant at 25±30 microseconds (over that of Our round-trip microbenchmark is a simple ping- Active Messages). pong test between two machines for a given transfer Table 1 shows the results of a least-squares lin- size. The ping-pong is repeated until a 95% con®- ear regression analysis of the Solaris round-trip mi- dence interval is obtained. TCP/IP and Fast Sock- crobenchmark. We show t0, the estimated cost for a ets use the same program, Active Messages uses dif- ferent code but the same algorithm. We used the 1TCP/IP is limited to a maximum buffer size of 56 Kbytes. 2500
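For reference, the following sketch shows the shape of such a ping-pong measurement. It is our reconstruction rather than the benchmark source used for the figures; the confidence-interval loop and most error handling are omitted, and the peer is assumed to echo each message back unchanged.

```c
/* Illustrative round-trip (ping-pong) microbenchmark over a connected
 * stream socket. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns the average round-trip time in microseconds for `iters`
 * ping-pongs of `size` bytes over the connected socket `fd`. */
static double ping_pong_usec(int fd, char *buf, size_t size, int iters)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < iters; i++) {
        if (write(fd, buf, size) != (ssize_t)size)      /* ping */
            return -1.0;
        for (size_t got = 0; got < size; ) {            /* wait for the pong */
            ssize_t n = read(fd, buf + got, size - got);
            if (n <= 0)
                return -1.0;
            got += (size_t)n;
        }
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    return usec / iters;
}
```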

[Figure 3 plot: round-trip time in microseconds versus transfer size, for Active Messages, Fast Sockets, standard TCP, and single-copy TCP.]

Figure 3: Round-trip performance for Fast Sockets, Active Messages, and TCP for the HP/UX/Medusa platform. Active Messages for HP/UX cannot transfer more than 8140 bytes at a time.

[Figure 4 plot: round-trip time in microseconds versus transfer size, for Fast Sockets, Active Messages, Myrinet TCP/IP, and 100 Mbps Ethernet TCP/IP.]

Figure 4: Round-trip performance for Fast Sockets, Active Messages, and TCP on the Solaris/Myrinet platform. Active Messages for Myrinet can only transfer up to 4096 bytes at a time. 0-byte packet, and the marginal cost for each addi- Layer r∞ n 1 2 tional data byte. Surprisingly, Fast Sockets' cost-per- (MB/s) (bytes) byte appears to be lower than that of Active Messages. Fast Sockets 32.9 441 This is because the Fast Sockets per-byte cost is re- Active Messages 39.2 372 ported for a 64K range of transfers while Active Mes- TCP/IP (Myrinet) 29.6 2098 sages' per-byte cost is for a 4K range. The Active TCP/IP (Fast Ethernet) 11.1 530 Messages test minimizes overhead by not implement- ing in-order delivery, which means only one packet can be outstanding at a time. Both Fast Sockets and Table 2: Least-squares regression analysis of the So- TCP/IP provide in-order delivery, which enables data laris bandwidth microbenchmark. r∞ is the maxi- packets to be pipelined through the network and thus mum bandwidth of the network and is measured in megabytes per second. The half-power point (n 1 ) achieve a lower per-byte cost. The per-byte costs of 2 Fast Sockets for small packets (less than 1 Kbyte) are is the packet size that delivers half of the maximum slightly higher than Active Messages. While TCP's throughput, and is reported in bytes. TCP/IP was run long-term per-byte cost is only about 15% higher than with the TCP NODELAY option, which attempts to that of Fast Sockets, its performance for small packets transmit packets as soon as possible rather than coa- is much worse, with per-byte costs twice that of Fast lescing data together. Sockets and startup costs 5±10 times higher. Another surprising element of the analysis is that which describes an idealized bandwidth curve. r∞ the overall t0 and t0 for small packets is very differ- ent, especially for Fast Sockets and Myrinet TCP/IP. is the theoretical maximum bandwidth realizable by the communications layer, and n 1 is the half-power Both protocols pipeline packets, which lowers round- 2 trip latencies for multi-packet transfers. This causes point of the curve. The half-power point is the trans- a non-linear round-trip latency function, yielding dif- fer size at which the communications layer can real- ferent estimates of t0 for single-packet and multi- ize half of the maximum bandwidth. A lower half- packet transfers. power point means a higher percentage of packets that can take advantage of the network's bandwidth. This 4.2.2 Bandwidth is especially important given the frequency of small messages in network traf®c. Fast Sockets' half-power Our bandwidth microbenchmark does 500 send() point is only 18% larger than that of Active Messages, calls of a given size, and then waits for a response. at 441 bytes. Myrinet TCP/IP realizes a maximum This is repeated until a 95% con®dence interval is ob- bandwidth 10% less than Fast Sockets but has a half- tained. As with the round-trip microbenchmark, the power point four times larger. Consequently, even TCP/IP and Fast Sockets measurements were derived though both protocols can realize much of the avail- from the same program and Active Messages results able network bandwidth, TCP/IP needs much larger were obtained using the same algorithm, but different packets to do so, reducing its usefulness for many ap- code. Again, we used the TCP/IP TCP NODELAY op- plications. tion to force immediate packet transmission, and a 64 Kbyte socket buffer. 
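The corresponding sender-side settings and send loop can be sketched as follows. This is a reconstruction for illustration rather than the authors' benchmark code; TCP_NODELAY disables the Nagle algorithm and the socket buffer is set to 64 Kbytes, as described above.

```c
/* Illustrative bandwidth microbenchmark sender; not the authors' code. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

enum { BURST = 500, SOCKBUF = 64 * 1024 };

static void bandwidth_sender(int fd, const char *buf, size_t size)
{
    int one = 1, bufsz = SOCKBUF;
    char ack;

    /* Transmit packets immediately rather than coalescing them (Nagle off),
     * and use a 64 KB socket buffer as in the measurements above. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof bufsz);

    /* Push a burst of sends of the given size... */
    for (int i = 0; i < BURST; i++)
        if (send(fd, buf, size, 0) != (ssize_t)size)
            return;                       /* error handling omitted */

    /* ...then wait for the receiver's short response. */
    (void)read(fd, &ack, 1);
}
```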
Results for the bandwidth microbenchmark are shown in Figures 5 and 6. Fast Sockets is able to realize most of the available bandwidth of the network. On the UltraSPARC, the SBus is the limiting factor, rather than the network, with a maximum throughput of about 45 MB/s. Of this, Active Messages exposes 35 MB/s to user applications. Fast Sockets can realize about 90% of the Active Messages bandwidth, losing the rest to memory movement. Myrinet's TCP/IP only realizes 90% of the Fast Sockets bandwidth, limited by its high processing overhead.

Table 2 presents the results of a least-squares fit of the bandwidth curves to the equation y = r∞ x / (n1/2 + x).

4.2.3 netperf and ttcp

Two commonly used microbenchmarks for evaluating network software are netperf and ttcp. Both of these benchmarks are primarily designed to test throughput, although netperf also includes a test of request-response throughput, measured in transactions/second.

We used version 1.2 of ttcp, modified to work under Solaris, and netperf 2.11 for testing. The throughput results are shown in Figure 7, and our analysis is in Table 3.

A curious result is that the half-power points for ttcp and netperf are substantially lower for TCP/IP than on our bandwidth microbenchmark. One
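Writing the fitted model out in display form makes the meaning of the half-power point explicit; the short derivation below is ours, but follows directly from the equation above.

```latex
y(x) = \frac{r_{\infty}\, x}{n_{1/2} + x},
\qquad
y(n_{1/2}) = \frac{r_{\infty}\, n_{1/2}}{n_{1/2} + n_{1/2}} = \frac{r_{\infty}}{2}.
```

For the Fast Sockets fit in Table 2 (r∞ = 32.9 MB/s, n1/2 = 441 bytes), a 441-byte transfer therefore sustains roughly 16.5 MB/s, while the same transfer size under Myrinet TCP/IP (r∞ = 29.6 MB/s, n1/2 = 2098 bytes) reaches only about 5 MB/s.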

[Figure 5 plot: bandwidth in megabytes per second versus transfer size, for Active Messages, Fast Sockets, standard TCP, and single-copy TCP.]

Figure 5: Observed bandwidth for Fast Sockets, TCP, and Active Messages on HP/UX with the Medusa FDDI network interface. The memory copy bandwidth of the HP735 is greater than the FDDI network bandwidth, so Active Messages and Fast Sockets can both realize close to the full bandwidth of the network.

[Figure 6 plot: bandwidth in megabytes per second versus transfer size, for Fast Sockets, Active Messages, Myrinet TCP/IP, and 100 Mbps Ethernet TCP/IP.]

Figure 6: Observed bandwidth for Fast Sockets, TCP, and Active Messages on Solaris using the Myrinet local-area network. The bus limits the maximum throughput to 45 MB/s. Fast Sockets is able to realize much of the available bandwidth of the network because of receive posting.

[Figure 7 plot: throughput in kilobytes per second versus transfer size, for Fast Sockets and Myrinet TCP/IP under both ttcp and netperf.]

Figure 7: Ttcp and netperf bandwidth measurements on Solaris for Fast Sockets and Myrinet TCP/IP.

Program/ r∞ n 1 behavior of protocols such as HTTP. The results of 2 Communication (MB/s) (bytes) these tests are reported in transactions per second and Layer shown in Figure 8. ttcp The request-response test shows Fast Sockets' Fast Sockets 38.57 560.7 Achilles heel: its connection mechanism. While Fast Myrinet TCP/IP 19.59 687.5 Sockets does better than TCP/IP for request-response netperf behavior with a constant connection (Figure 8(a)), in- Fast Sockets 41.57 785.7 troducing connection startup costs (Figure 8(b)) re- Myrinet TCP/IP 23.72 1189 duces or eliminates this advantage dramatically. This points out the need for an ef®cient, high-speed con- Table 3: Least-squares regression analysis of the nection mechanism. ttcp and netperf microbenchmarks. These tests were run on the Solaris 2.5.1/Myrinet platform. 4.3 Applications TCP/IP half-power point measures are lower than in Table 2 because both ttcp and netperf attempt Microbenchmarks are useful for evaluating the raw to improve small-packet bandwidth at the price of performance characteristics of a communications im- small-packet latency. plementation, but raw performance does not express the utility of a communications layer. Instead, it is important to characterize the dif®culty of integrating TCP/IP is only about 50±60% that of Fast Sockets. the communications layer with existing applications, Another reason is that TCP defaults to batching small and the performance improvements realized by those data writes in order to maximize throughput (the Na- applications. This section examines how well Fast gle algorithm) [Nagle 1984], and these tests do not Sockets supports the real-life demands of a network disable the algorithm (unlike our microbenchmarks); application. however, this behavior trades off small-packet band- width for higher round-trip latencies, as data is held at the sender in an attempt to coalesce data. 4.3.1 File Transfer The netperf microbenchmark also has a File transfer is traditionally considered a bandwidth- ªrequest-responseº test, which reports the number intensive application. However, the FTP protocol that of transactions per second a communications stack is commonly used for ®le transfer still has a request- is capable of for a given request and response size. response nature. Further, we wanted to see what im- There are two permutations of this test, one using provements in performance, if any, would be realized an existing connection and one that establishes a by using Fast Sockets for an application it was not in- connection every time; the latter closely mimics the tended for. Fast Sockets 20000 Fast Sockets 400 Myrinet TCP/IP Myrinet TCP/IP 15000 300 Fast Ethernet Fast Ethernet 10000 200 5000 100 0 0 Transactions Per Second Transactions Per Second 64 200 1,1 64, 64 100, 128, 8192 1,1 64, 200 Request-Response Pattern Request-Response 100, Pattern 128, 8192

(a) Request-Response Performance (b) Connection-Request-Response Performance

Figure 8: Netperf measurements of request-response transactions-per-second on the Solaris platform, for a variety of packet sizes. Connection costs significantly lower Fast Sockets' advantage relative to Myrinet TCP/IP, and render it slower than Fast Ethernet TCP/IP. This is because the current version of Fast Sockets uses the TCP/IP connection establishment mechanism.

We used the NcFTP ftp client (ncftp), ver- ware. sion 2.3.0, and the Washington University ftp server The VMTP protocol [Cheriton & Williamson (wu-ftpd), version 2.4.2. Because Fast Sockets 1989] attempted to provide a general-purpose proto- currently does not support fork(), we modi®ed col optimized for small packets and request-response wu-ftpd to wait for and accept incoming connec- traf®c. It performed quite well for the hardware tions rather than be started from the inetd Internet it was implemented on, but never became widely server daemon. established; VMTP's design target was request- Our FTP test involved the transfer of a number of response and bulk transfer traf®c, rather than the byte ASCII ®les, of various sizes, and reporting the elapsed stream and datagram models provided by the TCP/IP time and realized bandwidth as reported by the FTP protocol suite. In contrast, Fast Sockets provides the client. On both machines, ®les were stored in an in- same models as TCP/IP and maintains application memory ®lesystem, to avoid the bandwidth limita- compatibility. tions imposed by the disk. The relative throughput for the Fast Sockets and Other work [Clark et al. 1989, Watson & Mamrak 1987] argued that protocol implementations, rather TCP/IP versions of the FTP software is shown in Fig- ure 9. Surprisingly, Fast Sockets and TCP/IP have than protocol designs, were to blame for poor perfor- roughly comparable performance for small ®les (1 mance, and that ef®cient implementations of general- purpose protocols could do as well as or better than byte to 4K bytes). This is due to the expense of con- nection setup Ð every FTP transfer involves the cre- special-purpose protocols for most applications. The measurements made in [Kay & Pasquale 1993] lend ation and destruction of a data connection. For mid- sized transfers, between 4 Kbytes and 2 Mbytes, Fast credence to these arguments; they found that mem- Sockets obtains considerably better bandwidth than ory operations and operating system overheads played a dominant role in the cost of large packets. For normal TCP/IP. For extremely large transfers, both TCP/IP and Fast Sockets can realize a signi®cant frac- small packets, however, protocol costs were signi®- tion of the network's bandwidth. cant, amounting for up to 33% of processing time for single-byte messages. The concept of reducing infrastructure costs was 5 Related Work explored further in the x-kernel [Hutchinson & Pe- terson 1991, Peterson 1993], an operating system de- Improving communications performance has long signed for high-performance communications. The been a popular research topic. Previous work has fo- original, stand-alone version of the x-kernel per- cused on protocols, protocol and infrastructure imple- formed signi®cantly better at communication tasks mentations, and the underlying network device soft- than did BSD Unix on the same hardware (Sun 3's), 18000

[Figure 9 plot: throughput in kilobytes per second versus transfer size (1 byte to 10 MB), for Fast Sockets and TCP/IP.]

Figure 9: Relative throughput realized by Fast Sockets and TCP/IP versions of the FTP file transfer protocol. Connection setup costs dominate the transfer time for small files and the network transport serves as a limiting factor for large files. For mid-sized files, Fast Sockets is able to realize much higher bandwidth than TCP/IP.

using similar implementations of the communications achieved for large transfers (16K bytes), not the small protocols. Later work [Druschel et al. 1994, Pagels packets that make up the majority of network traf®c. et al. 1994, Druschel et al. 1993, Druschel & Pe- This work required a thorough understanding of the terson 1993] focused on hardware design issues re- Solaris virtual memory subsystem, and changes to the lating to network communication and the use of soft- operating system kernel; Fast Sockets is an entirely ware techniques to exploit hardware features. Key user-level solution. contributions from this work were the concepts of ap- An alternative to reducing internal operating sys- plication device channels (ADC), which provide pro- tem costs is to bypass the operating system altogether, tected user-level access to a network device, and fbufs, and use either in a user-level library (like Fast Sock- which provide a mechanism for rapid transfer of data ets) or a separate user-level server. Mach 3.0 used the from the network subsystem to the user application. latter approach [Forin et al. 1991], which yielded poor While Active Messages provides the equivalent of an networking performance [Maeda & Bershad 1992]. ADC for Fast Sockets, fbufs are not needed, as receive Both [Maeda & Bershad 1993a] and [Thekkath et al. posting allows for data transfer directly into the user 1993] explored building TCP into a user-level library application. linked with existing applications. Both systems, how- ever, attempted only to match in-kernel performance, Recently, the development of a zero-copy TCP rather than better it. Further, both systems utilized stack in Solaris [Chu 1996] aggressively utilized hard- in-kernel facilities for message transmission, limiting ware and operating system features such as direct the possible performance improvement. Edwards and memory access (DMA), page re-mapping, and copy- Muir [Edwards & Muir 1995] attempted to build an on-write pages to improve communications perfor- entirely user-level solution, but utilized a TCP stack mance. To take full advantage of the zero-copy stack, that had been built for the HP/UX kernel. Their solu- user applications had to use page-aligned buffers and tion replicated the organization of the kernel at user- transfer sizes larger than a page. Because of these level with worse performance than the in-kernel TCP limitations, the designers focused on improving re- stack. alized throughput, instead of small message latency, which Fast Sockets addresses. The resulting system [Kay & Pasquale 1993] showed that interfacing achieved 32 MB/s throughput on a similar network to the network card itself was a major cost for but with a slower processor. This throughput was small packets. Recent work has focused on reduc- ing this portion of the protocol cost, and on utiliz- 6 Conclusions ing the message coprocessors that are appearing on high-performance network controllers such as Myri- In this paper we have presented Fast Sockets, a com- net [Seitz 1994]. Active Messages [von Eicken et al. munications interface which provides low-overhead, 1992] is the base upon which Fast Sockets is built low-latency and high-bandwidth communication on and is discussed above. Illinois Fast Messages [Pakin local-area networks using the familiar Berkeley Sock- et al. 1995] provided an interface similar to that of ets interface. 
We discussed how current implementa- previous versions of Active Messages, but did not al- tions of the TCP/IP suite have a number of problems low processes to share the network. Remote Queues that contribute to poor latencies and mediocre band- [Brewer et al. 1995] provided low-overhead commu- width on modern high-speed networks, and how Fast nications similar to that of Active Messages, but sep- Sockets was designed to directly address these short- arated the arrival of messages from the invocation of comings of TCP/IP implementations. We showed that handlers. this design delivers performance that is signi®cantly better than TCP/IP for small transfers and at least The SHRIMP project implemented a stream sock- equivalent to TCP/IP for large transfers, and that these ets layer that uses many of the same techniques as bene®ts can carry over to real-life programs in every- Fast Sockets [Damianakis et al. 1996]. SHRIMP day usage. supports communication via shared memory and the An important contributor to Fast Socket's perfor- execution of handler functions on data arrival. The SHRIMP network had hardware latencies (4±9 µs mance is receive posting, which utilizes socket-layer information to in¯uence the delivery actions of layers one-way) much lower than the Fast Sockets Myri- farther down the protocol stack. By moving destina- net, but its maximum bandwidth (22 MB/s) was also tion information down into lower layers of the proto- lower than that of the Myrinet [Felten et al. 1996]. col stack, Fast Sockets bypasses copies that were pre- It used a custom-designed network interface for its memory-mapped communication model. The inter- viously unavoidable. face provided in-order, reliable delivery, which al- Receive posting is an effective and useful tool for lowed for extremely low overheads (7 µs over the avoiding copies, but its bene®ts vary greatly depend- hardware latency); Fast Sockets incurs substantial ing on the data transfer mechanism of the underlying overhead to ensure in-order delivery. Realized band- transport layer. Sender-based memory management width of SHRIMP sockets was about half the raw ca- schemes impose high synchronization costs on mes- pacity of the network because the SHRIMP commu- saging layers such as Sockets, which can affect real- nication model used sender-based memory manage- ized throughput. A receiver-based system reduces the ment, forcing data transfers to indirect through the synchronization costs of receive posting and enables socket buffer. The SHRIMP communication model high throughput communication without signi®cantly also deals poorly with non-word-aligned data, which affecting round-trip latency. required programming complexity to work around; In addition to receive posting, Fast Sockets also Fast Sockets transparently handles this un-aligned collapses multiple protocol layers together and re- data without extra data structures or other dif®culties duces the complexity of network buffer management. in the data path. The end result of combining these techniques is a sys- tem which provides high-performance, low-latency U-Net [von Eicken et al. 1995] is a user-level net- communication for existing applications. work interface developed at Cornell. It virtualized the network interface, allowing multiple processes to share the interface. U-Net emphasized improving the implementation of existing communications pro- tocols whereas Fast Sockets uses a new protocol just Availability for local-area use. 
U-Net [von Eicken et al. 1995] is a user-level network interface developed at Cornell. It virtualized the network interface, allowing multiple processes to share it. U-Net emphasized improving the implementation of existing communication protocols, whereas Fast Sockets uses a new protocol just for local-area use. A version of TCP (U-Net TCP) was implemented for U-Net; this protocol stack provided the full functionality of the standard TCP stack. U-Net TCP was modified for better performance, and it succeeded in delivering the full bandwidth of the underlying network, but it still imposed more than 100 microseconds of packet-processing overhead relative to the raw U-Net interface.

6 Conclusions

In this paper we have presented Fast Sockets, a communications interface which provides low-overhead, low-latency, and high-bandwidth communication on local-area networks using the familiar Berkeley Sockets interface. We discussed how current implementations of the TCP/IP suite have a number of problems that contribute to poor latencies and mediocre bandwidth on modern high-speed networks, and how Fast Sockets was designed to directly address these shortcomings of TCP/IP implementations. We showed that this design delivers performance that is significantly better than TCP/IP for small transfers and at least equivalent to TCP/IP for large transfers, and that these benefits can carry over to real-life programs in everyday usage.

An important contributor to Fast Sockets' performance is receive posting, which uses socket-layer information to influence the delivery actions of layers farther down the protocol stack. By moving destination information down into lower layers of the protocol stack, Fast Sockets bypasses copies that were previously unavoidable.
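As a concrete picture of receive posting, consider the sketch below. Every name in it (struct sock, post_receive(), packet_handler()) is invented for this example and is not the Fast Sockets implementation: the socket layer hands the destination buffer down to the transport before any data arrives, and the transport's arrival handler uses that information to deposit the payload directly into the application's buffer, falling back to a socket buffer only when no receive has been posted.

    #include <stddef.h>
    #include <string.h>

    /* Illustrative data structures only; not the Fast Sockets source. */
    struct sock {
        char  *posted_buf;      /* destination from a pending recv(), or NULL */
        size_t posted_len;
        size_t delivered;
        char   staging[65536];  /* socket buffer, used when nothing is posted */
        size_t staged;
    };

    /* Socket layer: push the destination down before any data arrives. */
    void post_receive(struct sock *s, void *buf, size_t len)
    {
        s->posted_buf = buf;
        s->posted_len = len;
        s->delivered  = 0;
    }

    /* Transport layer: invoked when a packet carrying len bytes of payload
     * arrives.  Because the destination was posted, the payload is written
     * straight into the user's buffer -- the single-copy path. */
    void packet_handler(struct sock *s, const void *payload, size_t len)
    {
        if (s->posted_buf != NULL && s->delivered + len <= s->posted_len) {
            memcpy(s->posted_buf + s->delivered, payload, len);
            s->delivered += len;
        } else if (s->staged + len <= sizeof s->staging) {
            /* No receive posted: stage the data in the socket buffer; a
             * later recv() will have to copy it out a second time. */
            memcpy(s->staging + s->staged, payload, len);
            s->staged += len;
        }
        /* Flow control to keep the socket buffer from overflowing is omitted. */
    }

Staging preserves correctness when the application has not yet posted a receive, but it reintroduces the very copy that posting avoids, which is why posting the receive early matters for throughput.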
Receive posting is an effective and useful tool for avoiding copies, but its benefits vary greatly depending on the data transfer mechanism of the underlying transport layer. Sender-based memory management schemes impose high synchronization costs on messaging layers such as Sockets, which can affect realized throughput. A receiver-based system reduces the synchronization costs of receive posting and enables high-throughput communication without significantly affecting round-trip latency.

In addition to receive posting, Fast Sockets also collapses multiple protocol layers together and reduces the complexity of network buffer management. The end result of combining these techniques is a system that provides high-performance, low-latency communication for existing applications.

Availability

Implementations of Fast Sockets are currently available for Generic Active Messages 1.0 and 1.5, and for Active Messages 2.0. The software and current information about the status of Fast Sockets can be found on the Web at http://now.cs.berkeley.edu/Fastcomm/fastcomm.html.

Acknowledgements

We would like to thank John Schimmel, our shepherd, Douglas Ghormley, Neal Cardwell, Kevin Skadron, Drew Roselli, and Eric Anderson for their advice and comments in improving the presentation of this paper. We would also like to thank Rich Martin, Remzi Arpaci, Amin Vahdat, Brent Chun, Alan Mainwaring, and the other members of the Berkeley NOW group for their suggestions, ideas, and help in putting this work together.

References

[Abbott & Peterson 1993] M. B. Abbott and L. L. Peterson. "Increasing network throughput by integrating protocol layers." IEEE/ACM Transactions on Networking, 1(5):600-610, October 1993.

[Amer et al. 1987] P. D. Amer, R. N. Kumar, R. bin Kao, J. T. Phillips, and L. N. Cassel. "Local area broadcast network measurement: Traffic characterization." In Spring COMPCON, pp. 64-70, San Francisco, California, February 1987.

[Banks & Prudence 1993] D. Banks and M. Prudence. "A high-performance network architecture for a PA-RISC workstation." IEEE Journal on Selected Areas in Communications, 11(2):191-202, February 1993.

[Birrell & Nelson 1984] A. D. Birrell and B. J. Nelson. "Implementing remote procedure calls." ACM Transactions on Computer Systems, 2(1):39-59, February 1984.

[Blumrich et al. 1994] M. A. Blumrich, K. Li, R. D. Alpert, C. Dubnicki, E. W. Felten, and J. Sandberg. "Virtual memory mapped network interface for the SHRIMP multicomputer." In Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 142-153, Chicago, IL, April 1994.

[Brewer et al. 1995] E. A. Brewer, F. T. Chong, L. T. Liu, S. D. Sharma, and J. D. Kubiatowicz. "Remote Queues: Exposing message queues for optimization and atomicity." In Proceedings of SPAA '95, Santa Barbara, CA, June 1995.

[Buzzard et al. 1996] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, and J. Wilkes. "An implementation of the Hamlyn sender-managed interface architecture." In Proceedings of the Second Symposium on Operating Systems Design and Implementation, pp. 245-259, Seattle, WA, October 1996.

[Caceres et al. 1991] R. Caceres, P. B. Danzig, S. Jamin, and D. J. Mitzel. "Characteristics of wide-area TCP/IP conversations." In Proceedings of ACM SIGCOMM '91, September 1991.

[Cheriton & Williamson 1987] D. R. Cheriton and C. L. Williamson. "Network measurement of the VMTP request-response protocol in the V distributed system." In Proceedings of SIGMETRICS '87, pp. 216-225, Banff, Alberta, Canada, May 1987.

[Cheriton & Williamson 1989] D. Cheriton and C. Williamson. "VMTP: A transport layer for high-performance distributed computing." IEEE Communications, pp. 37-44, June 1989.

[Chu 1996] H.-K. J. Chu. "Zero-copy TCP in Solaris." In Proceedings of the 1996 USENIX Annual Technical Conference, San Diego, CA, January 1996.

[Claffy et al. 1992] K. C. Claffy, G. C. Polyzos, and H.-W. Braun. "Traffic characteristics of the T1 NSFNET backbone." Technical Report CS92-270, University of California, San Diego, July 1992.

[Clark & Tennenhouse 1990] D. D. Clark and D. L. Tennenhouse. "Architectural considerations for a new generation of protocols." In Proceedings of SIGCOMM '90, pp. 200-208, Philadelphia, Pennsylvania, September 1990.

[Clark 1982] D. D. Clark. "Modularity and efficiency in protocol implementation." Request For Comments 817, IETF, July 1982.

[Clark et al. 1989] D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen. "An analysis of TCP processing overhead." IEEE Communications, pp. 23-29, June 1989.

[Crowcroft et al. 1992] J. Crowcroft, I. Wakeman, Z. Wang, and D. Sirovica. "Is layering harmful?" IEEE Network, 6(1):20-24, January 1992.

[Culler et al. 1994] D. E. Culler, K. Keeton, L. T. Liu, A. Mainwaring, R. P. Martin, S. H. Rodrigues, and K. Wright. The Generic Active Message Interface Specification, February 1994.

[Dahlin et al. 1994] M. D. Dahlin, R. Y. Wang, D. A. Patterson, and T. E. Anderson. "Cooperative caching: Using remote client memory to improve file system performance." In Proceedings of the First Conference on Operating Systems Design and Implementation (OSDI), pp. 267-280, October 1994.

[Damianakis et al. 1996] S. N. Damianakis, C. Dubnicki, and E. W. Felten. "Stream sockets on SHRIMP." Technical Report TR-513-96, Princeton University, Princeton, NJ, October 1996.

[de Prycker 1993] M. de Prycker. Asynchronous Transfer Mode: Solution for Broadband ISDN. Ellis Horwood Publishers, second edition, 1993.
[Druschel & Peterson 1993] P. Druschel and L. L. Peterson. "Fbufs: A high-bandwidth cross-domain data transfer facility." In Proceedings of the Fourteenth Annual Symposium on Operating Systems Principles, pp. 189-202, Asheville, NC, December 1993.

[Druschel et al. 1993] P. Druschel, M. B. Abbott, M. A. Pagels, and L. L. Peterson. "Network subsystem design: A case for an integrated data path." IEEE Network (Special Issue on End-System Support for High Speed Networks), 7(4):8-17, July 1993.

[Druschel et al. 1994] P. Druschel, L. L. Peterson, and B. S. Davie. "Experiences with a high-speed network adapter: A software perspective." In Proceedings of ACM SIGCOMM '94, August 1994.

[Edwards & Muir 1995] A. Edwards and S. Muir. "Experiences in implementing a high performance TCP in user-space." In ACM SIGCOMM '95, Cambridge, MA, August 1995.

[Feldmeier 1986] D. C. Feldmeier. "Traffic measurement on a token-ring network." In Proceedings of the 1986 Computer Networking Conference, pp. 236-243, November 1986.

[Felten et al. 1996] E. W. Felten, R. D. Alpert, A. Bilas, M. A. Blumrich, D. W. Clark, S. N. Damianakis, C. Dubnicki, L. Iftode, and K. Li. "Early experience with message-passing on the SHRIMP multicomputer." In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA '96), pp. 296-307, Philadelphia, PA, May 1996.

[Forin et al. 1991] A. Forin, D. Golub, and B. N. Bershad. "An I/O system for Mach 3.0." In Proceedings of the USENIX Mach Symposium, pp. 163-176, Monterey, CA, November 1991.

[Gusella 1990] R. Gusella. "A measurement study of diskless workstation traffic on an Ethernet." IEEE Transactions on Communications, 38(9):1557-1568, September 1990.

[Hutchinson & Peterson 1991] N. Hutchinson and L. L. Peterson. "The x-kernel: An architecture for implementing network protocols." IEEE Transactions on Software Engineering, 17(1):64-76, January 1991.

[Kay & Pasquale 1993] J. Kay and J. Pasquale. "The importance of non-data-touching overheads in TCP/IP." In Proceedings of the 1993 SIGCOMM, pp. 259-268, San Francisco, CA, September 1993.

[Keeton et al. 1995] K. Keeton, D. A. Patterson, and T. E. Anderson. "LogP quantified: The case for low-overhead local area networks." In Hot Interconnects III, Stanford University, Stanford, CA, August 1995.

[Kleinrock & Naylor 1974] L. Kleinrock and W. E. Naylor. "On measured behavior of the ARPA network." In AFIPS Proceedings, volume 43, pp. 767-780, 1974.

[Liu & Culler 1995] L. T. Liu and D. E. Culler. "Evaluation of the Intel Paragon on Active Message communication." In Proceedings of the 1995 Intel Supercomputer Users Group Conference, June 1995.

[Maeda & Bershad 1992] C. Maeda and B. N. Bershad. "Networking performance for microkernels." In Proceedings of the Third Workshop on Workstation Operating Systems, pp. 154-159, 1992.

[Maeda & Bershad 1993a] C. Maeda and B. Bershad. "Protocol service decomposition for high-performance networking." In Proceedings of the Fourteenth Symposium on Operating Systems Principles, pp. 244-255, Asheville, NC, December 1993.

[Maeda & Bershad 1993b] C. Maeda and B. Bershad. "Service without servers." In Workshop on Workstation Operating Systems IV, October 1993.

[Mainwaring & Culler 1995] A. Mainwaring and D. Culler. Active Messages: Organization and Applications Programming Interface, September 1995.

[Martin 1994] R. P. Martin. "HPAM: An Active Message layer for a network of HP workstations." In Hot Interconnects II, pp. 40-58, Stanford University, Stanford, CA, August 1994.

[Nagle 1984] J. Nagle. "Congestion control in IP/TCP internetworks." Request For Comments 896, Network Working Group, January 1984.

[Pagels et al. 1994] M. A. Pagels, P. Druschel, and L. L. Peterson. "Cache and TLB effectiveness in processing network I/O." Technical Report 94-08, University of Arizona, March 1994.

[Pakin et al. 1995] S. Pakin, M. Lauria, and A. Chien. "High-performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet." In Supercomputing '95, San Diego, CA, 1995.

[Peterson 1993] L. L. Peterson. "Life on the OS/network boundary." Operating Systems Review, 27(2):94-98, April 1993.

[Postel 1981a] J. Postel. "Internet protocol." Request For Comments 791, IETF Network Working Group, September 1981.

[Postel 1981b] J. Postel. "Transmission control protocol." Request For Comments 793, IETF Network Working Group, September 1981.

[Postel 1981c] J. Postel. "User datagram protocol." Request For Comments 768, IETF Network Working Group, August 1981.
[Schoch & Hupp 1980] J. F. Schoch and J. A. Hupp. "Performance of an Ethernet local network." Communications of the ACM, 23(12):711-720, December 1980.

[Seitz 1994] C. Seitz. "Myrinet: A gigabit-per-second local area network." In Hot Interconnects II, Stanford University, Stanford, CA, August 1994.

[Stevens 1990] W. R. Stevens. UNIX Network Programming. Prentice Hall, Englewood Cliffs, NJ, 1990.

[Thekkath et al. 1993] C. A. Thekkath, T. Nguyen, E. Moy, and E. D. Lazowska. "Implementing network protocols at user-level." IEEE/ACM Transactions on Networking, pp. 554-565, October 1993.

[von Eicken et al. 1992] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. "Active Messages: A mechanism for integrated communication and computation." In Proceedings of the Nineteenth ISCA, Gold Coast, Australia, May 1992.

[von Eicken et al. 1995] T. von Eicken, A. Basu, V. Buch, and W. Vogels. "U-Net: A user-level network interface for parallel and distributed computing." In Proceedings of the Fifteenth SOSP, pp. 40-53, Copper Mountain, CO, December 1995.

[Watson & Mamrak 1987] R. M. Watson and S. A. Mamrak. "Gaining efficiency in transport services by appropriate design and implementation choices." ACM Transactions on Computer Systems, 5(2):97-120, May 1987.