A Scalable and Explicit Event Delivery Mechanism for UNIX
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference, Monterey, California, USA, June 6-11, 1999.

© 1999 by The USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. For more information about the USENIX Association: Phone: 1 510 528 8649; FAX: 1 510 548 5738; Email: [email protected]; WWW: http://www.usenix.org

Gaurav Banga ([email protected])
Network Appliance Inc., 2770 San Tomas Expressway, Santa Clara, CA 95051

Jeffrey C. Mogul ([email protected])
Compaq Computer Corp. Western Research Lab., 250 University Ave., Palo Alto, CA 94301

Peter Druschel ([email protected])
Department of Computer Science, Rice University, Houston, TX 77005

Abstract

UNIX applications not wishing to block when doing I/O often use the select() system call to wait for events on multiple file descriptors. The select() mechanism works well for small-scale applications, but scales poorly as the number of file descriptors increases. Many modern applications, such as Internet servers, use hundreds or thousands of file descriptors, and suffer greatly from the poor scalability of select(). Previous work has shown that while the traditional implementation of select() can be improved, the poor scalability is inherent in the design. We present a new event-delivery mechanism, which allows the application to register interest in one or more sources of events, and to efficiently dequeue new events. We show that this mechanism, which requires only minor changes to applications, performs independently of the number of file descriptors.

1 Introduction

An application must often manage large numbers of file descriptors, representing network connections, disk files, and other devices. Inherent in the use of a file descriptor is the possibility of delay. A thread that invokes a blocking I/O call on one file descriptor, such as the UNIX read() or write() system calls, risks ignoring all of its other descriptors while it is blocked waiting for data (or for output buffer space).

UNIX supports non-blocking operation for read() and write(), but a naive use of this mechanism, in which the application polls each file descriptor to see if it might be usable, leads to excessive overheads.

Alternatively, one might allocate a single thread to each activity, allowing one activity to block on I/O without affecting the progress of others. Experience with UNIX and similar systems has shown that this scales badly as the number of threads increases, because of the costs of thread scheduling, context-switching, and thread-state storage space[6, 9]. The use of a single process per connection is even more costly.

The most efficient approach is therefore to allocate a moderate number of threads, corresponding to the amount of available parallelism (for example, one per CPU), and to use non-blocking I/O in conjunction with an efficient mechanism for deciding which descriptors are ready for processing[17]. We focus on the design of this mechanism, and in particular on its efficiency as the number of file descriptors grows very large.

Early computer applications seldom managed many file descriptors. UNIX, for example, originally supported at most 15 descriptors per process[14]. However, the growth of large client-server applications such as database servers, and especially Internet servers, has led to much larger descriptor sets.

Consider, for example, a Web server on the Internet. Typical HTTP mean connection durations have been measured in the range of 2-4 seconds[8, 13]; Figure 1 shows the distribution of HTTP connection durations measured at one of Compaq's firewall proxy servers. Internet connections last so long because of long round-trip times (RTTs), frequent packet loss, and often because of slow (modem-speed) links used for downloading large images or binaries. On the other hand, modern single-CPU servers can handle about 3000 HTTP requests per second[19], and multiprocessors considerably more (albeit in carefully controlled environments). Queueing theory shows that an Internet Web server handling 3000 connections per second, with a mean duration of 2 seconds, will have about 6000 open connections to manage at once (assuming constant interarrival time).

In a previous paper[4], we showed that the BSD UNIX event-notification mechanism, the select() system call, scales poorly with increasing connection count. We showed that large connection counts do indeed occur in actual servers, and that the traditional implementation of select() could be improved significantly. However, we also found that even our improved select() implementation accounts for an unacceptably large share of the overall CPU time. This implies that, no matter how carefully it is implemented, select() scales poorly. (Some UNIX systems use a different system call, poll(), but we believe that this call has scaling properties at least as bad as those of select(), if not worse.)
[Fig. 1: Cumulative distribution of proxy connection durations. Mean = 2.07 seconds, median = 0.20 seconds; N = 10,139,681 HTTP connections; data from 21 October 1998 through 27 October 1998.]

The key problem with the select() interface is that it requires the application to inform the kernel, on each call, of the entire set of "interesting" file descriptors: i.e., those for which the application wants to check readiness. For each event, this causes effort and data motion proportional to the number of interesting file descriptors. Since the number of file descriptors is normally proportional to the event rate, the total cost of select() activity scales roughly with the square of the event rate.

In this paper, we explain the distinction between state-based mechanisms, such as select(), which check the current status of numerous descriptors, and event-based mechanisms, which deliver explicit event notifications. We present a new UNIX event-based API (application programming interface) that an application may use, instead of select(), to wait for events on file descriptors. The API allows an application to register its interest in a file descriptor once (rather than every time it waits for events). When an event occurs on one of these interesting file descriptors, the kernel places a notification on a queue, and the API allows the application to efficiently dequeue event notifications.

We will show that this new interface is simple, easily implemented, and performs independently of the number of file descriptors. For example, with 2000 connections, our API improves maximum throughput by 28%.

2 The problem with select()

We begin by reviewing the design and implementation of the select() API. The system call is declared as:

    int select(
        int nfds,
        fd_set *readfds,
        fd_set *writefds,
        fd_set *exceptfds,
        struct timeval *timeout);

An fd_set is simply a bitmap; the maximum size (in bits) of these bitmaps is the largest legal file descriptor value, which is a system-specific parameter. The readfds, writefds, and exceptfds are in-out arguments, respectively corresponding to the sets of file descriptors that are "interesting" for reading, writing, and exceptional conditions. A given file descriptor might be in more than one of these sets. The nfds argument gives the largest bitmap index actually used. The timeout argument controls whether, and how soon, select() will return if no file descriptors become ready.

Before select() is called, the application creates one or more of the readfds, writefds, or exceptfds bitmaps, by asserting bits corresponding to the set of interesting file descriptors. On its return, select() overwrites these bitmaps with new values, corresponding to subsets of the input sets, indicating which file descriptors are available for I/O. A member of the readfds set is available if there is any available input data; a member of writefds is considered writable if the available buffer space exceeds a system-specific parameter (usually 2048 bytes, for TCP sockets). The application then scans the result bitmaps to discover the readable or writable file descriptors, and normally invokes handlers for those descriptors.

Figure 2 is an oversimplified example of how an application typically uses select(). One of us has shown[15] that the programming style used here is quite inefficient for large numbers of file descriptors, independent of the problems with select(). For example, the construction of the input bitmaps (lines 8 through 12 of Figure 2) should not be done explicitly before each call to select(); instead, the application should maintain shadow copies of the input bitmaps, and simply copy these shadows to readfds and writefds. Also, the scan of the result bitmaps, which are usually quite sparse, is best done word-by-word, rather than bit-by-bit.

Once one has eliminated these inefficiencies, however, select() is still quite costly. Part of this cost comes from the use of bitmaps, which must be created, copied into the kernel, scanned by the kernel, subsetted, copied out

    1   fd_set readfds, writefds;
    2   struct timeval timeout;
    3   int i, numready;
    4
    5   timeout.tv_sec = 1; timeout.tv_usec = 0;
    6
    7   while (TRUE) {
    8       FD_ZERO(&readfds); FD_ZERO(&writefds);
    9       for (i = 0; i <= maxfd; i++) {
    10          if (WantToReadFD(i)) FD_SET(i, &readfds);
    11          if (WantToWriteFD(i)) FD_SET(i, &writefds);
    12      }
    13      numready = select(maxfd, &readfds,
    14                        &writefds, NULL, &timeout);
    15      if (numready < 1) {
    16          DoTimeoutProcessing();
    17          continue;
    18      }
    19
    20      for (i = 0; i <= maxfd; i++) {
    21          if (FD_ISSET(i, &readfds)) InvokeReadHandler(i);
    22          if (FD_ISSET(i, &writefds)) InvokeWriteHandler(i);
    23      }
    24  }

    Fig. 2: Example of how an application typically uses select()