The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference (NO 98), New Orleans, Louisiana, June 1998.

Scalable kernel performance for Internet servers under realistic loads

Gaurav Banga, Rice University; Jeffrey C. Mogul, Digital Equipment Corp., Western Research Lab


Gaurav Banga [email protected] Department of Computer Science, Rice University, Houston, TX, 77005

Jeffrey C. Mogul [email protected] Digital Equipment Corp. Western Research Lab., 250 University Ave., Palo Alto, CA, 94301

Abstract

Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy servers to manage a large number of simultaneous connections. We also observed that the select system call implementation in most UNIX kernels scales poorly with the number of connections being managed by a process. The UNIX algorithm for allocating file descriptors also scales poorly. These algorithmic problems lead directly to the poor performance of event-driven servers.

We implemented scalable versions of the select system call and the descriptor allocation algorithm. This led to an improvement of up to 58% in Web proxy and Web server throughput, and dramatically improved the scalability of the system.

1 Introduction

Many Web servers and proxies are implemented as single-threaded event-driven processes. This approach is motivated by the belief that an event-driven architecture has some advantages over a thread-per-connection architecture [17], and that it is more efficient than process-per-connection designs, including "pre-forked" process-per-connection systems. In particular, event-driven servers have lower context-switching and synchronization overhead, especially in the context of single-processor machines.

Unfortunately, event-driven servers have been observed to perform poorly under real conditions. In a recent study of Digital's Palo Alto Web proxies, Maltzahn et al. [11] found that the Squid (formerly Harvest) proxy server [5, 22] performs no better than the older CERN proxy [10]. This is surprising, because the CERN proxy forks a new process to handle each new connection, and process creation is a moderately expensive operation. This result is also in sharp contrast with the study by Chankhunthod et al. [5], which concluded that Harvest is an order of magnitude faster than the CERN proxy.

Maltzahn et al. [11] attribute Squid's poor performance to the amount of CPU time Squid uses to implement its own memory management and non-blocking network I/O abstractions. We investigated this phenomenon in more detail, and found that the large delays typical of wide-area networks (WANs) cause Squid to have a large number of simultaneously open connections. Unfortunately, the traditional UNIX implementations of several kernel features used by event-driven single-process servers do not scale well with the number of active descriptors in a process. These are the select system call, used to support non-blocking I/O, and the kernel routine that allocates a new file descriptor. (We refer to the descriptor-allocation routine as ufalloc(), as it is named in Digital UNIX, although other UNIX variants use different names, e.g., fdalloc().) A system running the Squid server spends a large fraction of its time in these kernel routines, which is directly responsible for Squid's poor performance under real workloads.

We designed and implemented scalable versions of select() and ufalloc() in Digital UNIX, and evaluated the performance of Squid and an event-driven Web server in a simulated WAN environment. We observed throughput improvements of up to 43% for the Web server, and up to 58% for Squid. We observed dramatic reductions in CPU utilization at lower loads. We also evaluated these changes on a busy HTTP proxy server, which handles several million requests per day.

The rest of this paper is organized as follows. Section 2 gives a brief overview of the working of a typical event-driven server running on a UNIX system. We also describe the dynamics of typical implementations of select() and ufalloc(). Section 3 describes our quantitative characterization of the performance problems in select() and ufalloc(). In Section 4 we present scalable versions of select() and ufalloc(). In Sections 5 and 6 we evaluate our implementation. Finally, Section 7 covers related work, and Section 8 offers some conclusions.

2 Background

In this section we present a brief overview of the working of a typical event-driven server. We will also describe classical implementations of select() and ufalloc(). This will provide necessary background for the discussion in the following sections.

2.1 Event-driven servers

An event-driven server typically has a single thread which manages all connections to the server. The thread uses the select() system call to simultaneously wait for events on these connections.

When a call to select() returns, the server's main loop invokes event handlers for each of the ready descriptors. These handlers perform a variety of tasks depending on the nature of the particular event. For example, when a socket being used to listen for new connections becomes ready, the corresponding handler calls accept() to return a file descriptor for the new connection. Handlers invoked when a connection becomes ready for reading or writing perform the actual read or write to the appropriate descriptor. The execution of handlers may cause the addition or removal of descriptors from the set being managed by the server.

Event-driven servers are fast because they have no locking or context switching overhead. The same thread manages all connections, and all handlers are executed synchronously. A single-threaded server, however, cannot exploit any true concurrency in the stream of tasks. Thus, on multiprocessor systems, event-driven servers have as many threads as processors. Examples of event-driven servers include Squid [5, 22] and its commercial version NetCache [16], Zeus [25], thttpd [24] and several research servers [2, 8, 18].
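To make this structure concrete, the sketch below shows the skeleton of such a server's main loop. This is our illustration, not code from Squid or thttpd; the handler functions and the conn_table array are hypothetical, and error handling is omitted.

```c
/* Minimal sketch of an event-driven server loop built on select().
 * The handlers and conn_table are hypothetical; error handling omitted. */
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

extern int  listen_fd;                  /* listening socket, set up elsewhere */
extern char conn_table[FD_SETSIZE];     /* nonzero if fd is an active connection */
extern void handle_new_connection(int fd);  /* calls accept() */
extern void handle_io(int fd);              /* does the actual read or write */

void event_loop(void)
{
    fd_set read_set;
    int fd, maxfd;

    for (;;) {
        /* Rebuild the input bitmap each time, since select()
         * overwrites its value-result arguments. */
        FD_ZERO(&read_set);
        FD_SET(listen_fd, &read_set);
        maxfd = listen_fd;
        for (fd = 0; fd < FD_SETSIZE; fd++) {
            if (conn_table[fd]) {
                FD_SET(fd, &read_set);
                if (fd > maxfd)
                    maxfd = fd;
            }
        }

        if (select(maxfd + 1, &read_set, NULL, NULL, NULL) <= 0)
            continue;

        /* Dispatch a handler for each ready descriptor. */
        if (FD_ISSET(listen_fd, &read_set))
            handle_new_connection(listen_fd);
        for (fd = 0; fd < FD_SETSIZE; fd++)
            if (conn_table[fd] && FD_ISSET(fd, &read_set))
                handle_io(fd);
    }
}
```

Note that the loop does user-mode work proportional to the total number of managed connections on every iteration, both when building the input bitmaps and when scanning the returned ones; this observation becomes important in Sections 3 and 5.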
2.2 select()

The select system call allows a user process to wait for events on a set of descriptors. A process can indicate interest in three types of events on a descriptor: events that make a descriptor readable, those that make it writable, and exception events. This information is passed to the kernel using three bitmaps. In each bitmap, the k-th bit indicates interest in events of that type for the k-th descriptor. These bitmaps are value-result parameters, and the returned bitmaps indicate the sets of ready descriptors. Stevens [23] describes the select() interface in detail.

We describe the Digital UNIX implementation of select(). However, the classical BSD implementation of select() is similar to the Digital UNIX implementation; the main differences are related to the multithreaded nature of the Digital UNIX kernel. Thus our discussion is fully applicable to 4.3BSD and most BSD-derived implementations. Also, we discuss how select() works for descriptors that represent sockets, but our discussion and algorithms can be trivially extended to include descriptors that refer to other kinds of objects, such as vnodes. (Vnodes are kernel data structures used to represent files and devices.)

In Digital UNIX, the select() function in the kernel starts by creating internal data structures containing summary information about sockets that are marked in at least one input bitmap. Subsequently, select() calls do_scan(), which calls selscan() to check the status of each of the entities (vnodes or sockets) corresponding to the selected descriptors.

For each selected socket, selscan() enqueues a record referring to the current thread on the select queue of the socket. This is done so that the thread can be identified as waiting inside select() for events on the socket. selscan() then calls soo_select() for each socket, which checks to see if the condition that the process is interested in (i.e., the socket is readable, writable, or has pending exceptions) is true. If none of the conditions that the user process is selecting on are true, then do_scan() goes to sleep, waiting for any of these to become true.

Note that the linear search in selscan() covers every socket of potential interest to the selecting process, independent of how many are actually ready. Thus, the cost is proportional to the number of file descriptors involved in the call to select(), rather than to the number of events discovered by the call.

When a network packet comes in, protocol processing may cause a condition on which do_scan() is blocked to become true. The thread that performs protocol processing for an incoming packet calls select_wakeup(), which wakes up all threads that are blocked in do_scan() awaiting this condition.

A thread that is woken up in do_scan() calls selscan(), which calls soo_select() for all the sockets that the corresponding call to select() specified in its three bitmaps. do_scan() also calls undo_scan() to remove this thread from the select queues of the selected sockets.
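The bitmap encoding can be made concrete with the following sketch. It mirrors the conventional BSD fd_set layout; the names are ours, not necessarily those used in the Digital UNIX sources.

```c
/* Sketch of the conventional select() bitmap layout. One bit per
 * descriptor; descriptor k maps to bit (k % NFDBITS) of word (k / NFDBITS). */
#define MY_FD_SETSIZE 4096
typedef unsigned long fd_mask_t;
#define NFDBITS (8 * (int)sizeof(fd_mask_t))       /* bits per bitmap word */

typedef struct {
    fd_mask_t bits[(MY_FD_SETSIZE + NFDBITS - 1) / NFDBITS];
} my_fd_set;

#define MY_FD_SET(k, s)   ((s)->bits[(k) / NFDBITS] |=  (1UL << ((k) % NFDBITS)))
#define MY_FD_CLR(k, s)   ((s)->bits[(k) / NFDBITS] &= ~(1UL << ((k) % NFDBITS)))
#define MY_FD_ISSET(k, s) ((s)->bits[(k) / NFDBITS] &   (1UL << ((k) % NFDBITS)))
```

A process passes three such bitmaps (read, write, exception) to select(), and the kernel overwrites each one with the corresponding set of ready descriptors before returning.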

2.3 ufalloc()

The kernel function ufalloc() is called to allocate a new file descriptor for a process. This function is called as a result of the open(), socket(), socketpair(), dup(), dup2() and accept() system calls.

UNIX semantics for file descriptor allocation require that the kernel allocate the lowest-numbered available descriptor. This prevents the use of a straightforward scalable implementation, such as a free list. Instead, all of the UNIX variants that we know of, including BSD-derived systems such as Digital UNIX, and System V Release 4 systems such as Solaris, use a linear search of the file descriptor table. The search starts with file descriptor 0 and continues to the first NULL entry. The cost of this search is roughly proportional to the number of open file descriptors, although it might complete before checking all of the possible descriptor table slots.
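In essence, the classical allocator behaves like the following sketch. This is a simplified illustration of the search just described, not the actual Digital UNIX code; a real kernel must also grow the table on demand and honor per-process resource limits.

```c
/* Simplified sketch of the classical lowest-free-descriptor search.
 * fd_table is a hypothetical stand-in for the per-process open-file table. */
#include <stddef.h>

#define TABLE_SIZE 4096

struct file;                          /* opaque open-file object */
struct file *fd_table[TABLE_SIZE];    /* NULL means the slot is free */

int ufalloc_classical(void)
{
    int fd;

    /* O(n) in the number of descriptors already allocated: every
     * slot below the first free one is inspected on every call. */
    for (fd = 0; fd < TABLE_SIZE; fd++)
        if (fd_table[fd] == NULL)
            return fd;                /* lowest-numbered free descriptor */
    return -1;                        /* descriptor table full */
}
```

For a server holding thousands of connections, each accept() therefore pays a cost proportional to the number of descriptors already open, which is exactly the behavior quantified in the next section.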
3 Problems in select() and ufalloc()

As we observed in Section 1, Maltzahn et al. [11] found that the Squid proxy server performs no better than the older CERN proxy under real workloads, contradicting the study by Chankhunthod et al. [5], which concluded that Harvest is an order of magnitude faster than the CERN proxy. Indeed, a simple LAN-based experiment using a simulated client load does show a big performance difference between Squid and the CERN proxy.

In an attempt to explain this peculiar result, we tried to understand why Squid's performance under real load is so much worse than under ideal conditions. One factor that is different in the two scenarios is that under real load Squid manages a much larger number of simultaneous connections than in a LAN-based test scenario. This is because of the much larger delays experienced in WANs. Because WAN environments have larger round-trip times (RTTs), and are more likely to exhibit packet losses, HTTP connections tend to last much longer in WAN environments than in simple LAN environments. Therefore, for a given connection arrival rate, a WAN-based HTTP server will have more open connections than a server in a LAN environment.

Richardson's measurements of Digital's Palo Alto Web proxies [19] show between 30 and 950 simultaneously open connections, depending on time of day. Richardson's measurements also show that while the median response time is about 250 msec., the mean is 2.5 seconds: some connections stay open for a very long time. The large ratio of mean to median holds over a wide range of response sizes (although the 10:1 ratio only holds when all response sizes are considered together). This implies that at any given time, most of the open connections are cold (idle for long intervals), and only a few are hot.

Following this intuition, we tried to evaluate the effect of a large number of cold connections on Squid performance. We used DCPI [1] to profile a system running the Squid proxy under a carefully designed request load. To simulate the effect of large WAN delays, we set up a dummy HTTP client process on a client machine. This process opened a large number (100-2000) of connections to the Squid server but subsequently made no requests on these connections. We refer to this process as the load-adding client. Another process on the client machine simulated a small number (10-50) of HTTP clients, which repeatedly made HTTP requests of the proxy. Each request retrieved a 1259-byte response. We used the scalable client (S-Client) architecture from Banga and Druschel [3].

In our tests, we ran the Squid server process on an AlphaStation 500 (400MHz 21164, 8KB I-cache, 8KB D-cache, 96KB level 2 unified cache, 2MB level 3 unified cache, SPECint95 = 12.3) equipped with 192MB of physical memory. The server ran Digital UNIX 4.0B, with the latest patches that were available at the time. The client machine was a 333MHz AlphaStation 500 (same cache configuration as above, SPECint95 = 9.82) with 640MB of physical memory, running Digital UNIX 3.2C. The Squid version used was Squid-1.1.11. The client and server were connected using a 100Mbps FDDI network.

This experiment indicates that up to 53% of the system's CPU time is being spent inside select() (and its various components – selscan(), soo_select(), etc.). Up to 11% of the CPU is being spent by the user process in collating information from the bitmaps returned by select().

Our detailed results are shown in Figure 1. The x-axis represents the number of cold connections. Curves are plotted, for both 10 hot connections and 50 hot connections, showing the percentage of CPU time spent in kernel-mode functions related to select(), and the percentage of CPU time spent in the user-mode select() loop.

[Figure 1: select() costs in unmodified kernel. Percentage of CPU time in the given mode (0-80%) vs. number of cold connections (0-2000), with curves for kernel-mode and user-mode select() costs at 10 and 50 hot connections.]

Figure 1 shows that the costs of both the kernel select() implementation and the user-mode select() loop rise significantly with increasing numbers of cold connections. Also, these costs are relatively independent of the number of hot connections, up to about 1000 cold connections.

The costs are initially linear in the number of cold connections, but eventually they flatten out. As the number of cold connections increases, the system spends more CPU time in each call to select(), and so the calls to select() come less often. This causes the number of pending events returned by select() to increase (at low loads, select() usually returns just one pending event, but when called infrequently, it often returns several). The cost of each select() call is thus amortized over a larger number of interesting events. Thus, the total CPU cost of select(), which is proportional to the number of select()s per second times the cost of each select(), tends to level off.

These numbers were generated with a request load of about 100 requests/second. At higher rates, select() is still important, but ufalloc() also consumes significant CPU time, because of its linear search algorithm. A typical DCPI profile for the system above, with 750 cold connections, 50 hot connections, and 220 new connections/second, is shown in Table 1.

    CPU %     Procedure                      Mode
    21.91%    all kernel select functions    kernel
     8.31%      soo_select()                 kernel
     7.56%      selscan()                    kernel
     4.82%      undo_scan()                  kernel
     1.22%      select()                     kernel
    17.79%    ufalloc()                      kernel
     4.23%    comm_select()                  user
     1.71%    Xsyscall()                     kernel
     1.68%    doprnt()                       user
     1.32%    idle_thread()                  kernel
     1.20%    memset()                       user
     1.15%    cache_lookup()                 kernel
     1.10%    namei()                        kernel
    750 cold connections, 50 hot connections, 220 requests/second

Table 1: Example profile for unmodified kernel

In summary, the current implementations of select() and ufalloc() do not scale well with the number of open connections in a server process. Both algorithms do work that is linear in the number of connections being managed by the process, and proxies in WAN environments tend to have many open connections. In the next section we will describe our implementation of scalable versions of these functions.

4 Scalable select() and ufalloc()

In this section we describe our design for scalable versions of select() and ufalloc(). We also describe our prototype implementation of these designs in Digital UNIX.

4.1 select()

Consider an event-driven server process waiting for activity on any of a few thousand sockets. Recall from Section 2 that select() always performs a full scan through all of these sockets, either to find those few that are currently ready, or to indicate that a thread is waiting for events on each of the sockets.

A full scan is also performed after the protocol code processes an incoming packet and calls select_wakeup() to unblock a thread waiting inside select(). The full scan is performed even though only a few of the sockets are actually ready. This wasted effort is expended because, between the call to select_wakeup() and the invocation of do_scan(), we throw away the information about the identity of the socket that has become ready. selscan() then does a significant amount of work to rediscover the set of ready sockets.

The key idea of our design is to preserve information about the change in the state of a socket between select_wakeup() and do_scan(). We use this information to prune both the initial scan, and the scan after the select_wakeup(), to inspect only those sockets that need inspection. These are the sockets either about which we have no prior information, or for which we have state-change hints from the protocol-processing layer.

We changed the Digital UNIX kernel to keep track of three sets for each thread, named READY, INTERESTED, and HINTS. (The first two of these sets actually consist of three component sets, one for read-ready descriptors, one for write-ready descriptors, and one for exceptions.) The INTERESTED set is the subset of sockets that the thread is currently interested in selecting on. The READY set is a subset of the INTERESTED set and includes those sockets which the kernel thinks are ready. The kernel maintains state-change information about sockets in the INTERESTED set, rather than for the full set of sockets open for a thread. This state-change information is maintained as the HINTS set. The HINTS set includes sockets that might have become ready since the last call to select(), and is updated by the protocol layer when a packet arrives for a socket.

Each call to select() specifies a SELECTING set for the thread, which is used to compute the new values of the READY and INTERESTED sets. select() uses the HINTS and READY sets to prune its initial scan. It checks only those sockets which are in the SELECTING set and either:

1. are not in the old INTERESTED set, or
2. are in the old READY set, or
3. are in the HINTS set.

Mathematically, we can express the computation of these sets as:

$$INTERESTED_{new} = INTERESTED_{old} \cup SELECTING$$

$$READY_{new} = \mathcal{C}\left(INTERESTED_{new} \cap \left(\overline{INTERESTED_{old}} \cup READY_{old} \cup HINTS\right)\right)$$

where $\mathcal{C}$ expresses the computation of checking the status of descriptors in its argument set, and the overbar denotes set complement (capturing the descriptors that are not in the old INTERESTED set).

The computation of $\mathcal{C}$'s argument set above appears to have complexity proportional to the size of the SELECTING set. We took care to optimize this computation and its data-cache footprint. The resulting code has a very small cost relative to other parts of select().

The set returned from select() is:

$$READY_{user} = SELECTING \cap READY_{new}$$
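Treating each set as a bitmap, the computation above reduces to a few word-wise boolean operations per bitmap word. The following sketch is our illustration of that arithmetic, not the kernel code; one representative word of each bitmap is shown, and check_status() stands for the C operator.

```c
/* Sketch of the per-word set arithmetic behind the pruned scan.
 * Each set is a bitmap; one representative word is shown. */
typedef unsigned long bits_t;

/* Stands for the C operator in the text: inspect the descriptors
 * named by 'candidates' and return those found ready (hypothetical). */
extern bits_t check_status(bits_t candidates);

bits_t select_one_word(bits_t selecting, bits_t *interested,
                       bits_t *ready, bits_t *hints)
{
    bits_t old_interested = *interested;
    bits_t candidates, ready_new;

    /* INTERESTED_new = INTERESTED_old U SELECTING */
    *interested = old_interested | selecting;

    /* Check only sockets that are newly interesting, were
     * previously ready, or carry a protocol-layer hint. */
    candidates = *interested & (~old_interested | *ready | *hints);

    /* READY_new = C(candidates) */
    ready_new = check_status(candidates);
    *ready = ready_new;
    *hints = 0;        /* the hints have now been consumed */

    /* READY_user = SELECTING & READY_new */
    return selecting & ready_new;
}
```

The zeroing of the HINTS word after the scan is our assumption about the bookkeeping; the paper specifies only that HINTS accumulates protocol-layer hints between calls to select().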

A descriptor must be removed from the INTERESTED sets of all threads in a process at some point between the time that the descriptor is closed and the time that it is next allocated by any thread in the process.

For each socket, we record the set of processes that have a reference to the socket. In the protocol processing code, when a packet comes in for a socket, sowakeup() records a hint in the HINTS sets of each of the threads in the referencing processes for which this socket is present in the INTERESTED set of the thread. sowakeup() also wakes up all such threads that are blocked in select(). After a thread is woken up in select(), it scans only those sockets in its HINTS set.

4.2 ufalloc()

The existing ufalloc() implementation uses a linear search to find the lowest-numbered free descriptor. We converted this into a logarithmic-time algorithm by adding an auxiliary data structure, a two-level tree of bitmaps. The collection of all the level-1 nodes can be thought of as a single bitmap; each bit in this bitmap describes the allocation state of one file descriptor. One-valued bits in this bitmap correspond to allocated descriptors. The level-1 bitmap is stored as an array of nodes.

Each bit in the level-0 bitmap describes the state of an entire level-1 node. One-valued bits in this bitmap correspond to level-1 nodes with no zero bits; a zero-valued bit in the level-0 bitmap corresponds to a level-1 node with at least one zero bit.

Figure 2 shows an example of such a tree. For simplicity, this figure depicts the nodes as 4-bit integers, although our actual implementation uses 64-bit integers. We use the Alpha's little-endian bit-order in this example. The example tree shows that descriptors 0, 1, and 4 through 7 are allocated, while descriptors 2 and 3 are free.

[Figure 2: Two-level ufalloc bitmap. The level-0 map holds the bits "1 0"; the level-1 map holds the nodes "1111" (descriptors 7-4) and "0011" (descriptors 3-0).]

When a process wants to allocate a new file descriptor, the level-0 bitmap is searched for the first zero bit. The index of this bit is used as an index into the array of level-1 nodes, and the indexed node is then searched to find the first zero bit. Efficient algorithms exist for finding the first zero bit in a word, but we have found that a simple linear search is sufficiently fast, since the dominant cost on modern CPUs is the number of data-cache misses, not the number of instructions executed.

When a descriptor is deallocated, the appropriate bits are cleared in both bitmaps. This leads to a constant-time cost for deallocation.

With the level-1 nodes and the entire level-0 bitmap represented as 64-bit words, this algorithm directly supports 4096 descriptors per process. A straightforward generalization to a deeper tree would support an enormous number of descriptors, even if a smaller word size were used.
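The sketch below is our reconstruction of these two operations for the 64-bit, 4096-descriptor configuration described above; it is an illustration of the algorithm, not the Digital UNIX source.

```c
/* Two-level ufalloc bitmap: 64 level-1 words of 64 bits cover 4096
 * descriptors. A one bit means "allocated"; a one bit in the level-0
 * word means the corresponding level-1 word has no free slot. */
#include <stdint.h>

#define WORD_BITS 64

static uint64_t level0;               /* bit i set => level1[i] is full */
static uint64_t level1[WORD_BITS];    /* bit j of word i covers fd i*64+j */

static int first_zero_bit(uint64_t w)
{
    /* Simple linear search, as in the paper: data-cache misses,
     * not instruction counts, dominate on modern CPUs. */
    int i;
    for (i = 0; i < WORD_BITS; i++)
        if (!(w & ((uint64_t)1 << i)))
            return i;
    return -1;
}

int fd_alloc(void)                    /* tree-structured allocation */
{
    int node = first_zero_bit(level0);    /* level-1 word with a free slot */
    int bit;

    if (node < 0)
        return -1;                        /* all 4096 descriptors in use */
    bit = first_zero_bit(level1[node]);

    level1[node] |= (uint64_t)1 << bit;
    if (level1[node] == ~(uint64_t)0)     /* node became full */
        level0 |= (uint64_t)1 << node;
    return node * WORD_BITS + bit;        /* lowest-numbered free fd */
}

void fd_free(int fd)                  /* constant-time deallocation */
{
    int node = fd / WORD_BITS, bit = fd % WORD_BITS;

    level1[node] &= ~((uint64_t)1 << bit);
    level0 &= ~((uint64_t)1 << node);     /* node now has a free slot */
}
```

Because the level-0 search always finds the lowest non-full node, and the level-1 search finds the lowest clear bit within it, the allocator still returns the lowest-numbered available descriptor, preserving the UNIX semantics at far lower cost than the linear scan.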

5 Experimental Evaluation

We evaluated the effects of our implementation of select() and ufalloc() on the performance of two event-driven Internet servers: the Squid proxy, and the thttpd [24] Web server (we used a modified version of thttpd with numerous performance improvements [18]). These experiments were performed using the same server and client systems described in Section 3. We also measured the effect of our changes on the performance of Digital's Palo Alto proxies.

5.1 Scalability with respect to connection rate

The S-Client architecture introduced by Banga and Druschel [3] allows the generation of high HTTP request rates, using a small number of client machines. We used S-Clients to vary the load on the server. At the lowest load, the server is underutilized; at the higher loads, the server is the bottleneck.

For each request rate, we ran two kinds of benchmarks. In the naive benchmark, we used only enough S-Clients to generate the desired request rate. In the more realistic benchmark, we also used a load-adding client, to simulate the presence of long-delay connections. The load-adding client was run with 750 infinitely slow connections. (We show the effect of varying the number of slow connections in Section 5.2.)

[Figure 3: Squid response times – 1259-byte files. Response time (msec, 0-70) vs. requests per second (0-800), with curves for "Old ufalloc, old select", "New ufalloc, old select", "New ufalloc, new select", and the naive benchmark.]

[Figure 4: Squid idle time – 1259-byte files. Percentage of CPU idle time (0-70%) vs. requests per second (0-800), for the same kernel configurations.]

All clients, in all of the experiments, repeatedly requested a single file of a fixed size. In some experiments, we used an 8192-byte file; this is within the range of typical response sizes reported for the Web. In other experiments, we used a 1259-byte file; the shorter file size places more emphasis on per-connection overheads.

For our experiments using the Squid proxy server, we arranged things so that each request received by the proxy would generate an "If-Modified-Since" message from the proxy to the origin server, but the actual data would be served from the proxy's cache. The origin server ran on identical hardware (a 400MHz AlphaStation 500), using the thttpd server program; we ensured that the origin server was never the bottleneck.

Figure 3 shows how the response time of the Squid proxy varies with request rate, for 1259-byte files. The results for all kernels on the naive benchmark are effectively identical; for the realistic benchmark, we plot different curves for the different kernels. For each curve, the final point shows the "saturation throughput" for the given kernel; beyond this point, increasing the offered load did not increase throughput. This figure clearly shows that the presence of the load-adding client's slow connections in the realistic benchmark drastically reduces the throughput achieved with the unmodified kernel, relative to the naive benchmark. It also shows that our new implementations of select() and ufalloc() solve this performance problem. The performance of the fully modified kernel is nearly independent of the presence of many slow connections.

Figure 4 shows the effect of the new versions of select() and ufalloc() on server CPU idle time, also for 1259-byte files. At lower request rates, where the server is underutilized, our modifications greatly increase idle time for the realistic benchmark. The increase in idle time reflects the improved scalability of the system in the presence of cold connections.

[Figure 5: Squid response times – 8192-byte files. Response time (msec, 0-80) vs. requests per second (0-600), for the same kernel configurations and the naive benchmark.]

Figure 5 shows the response time of the Squid proxy for 8192-byte files. As in Figure 3, the fully modified kernel provides a higher saturation request rate than the original kernel, and yields lower response times at all request rates. However, the new kernel's performance on the realistic benchmark does not come quite as close to the performance of the naive benchmark; this may be due to data-cache collisions between the larger packets and the kernel's data structures. In these tests, the unmodified kernel showed no idle time for all request rates, while the new kernel showed some idle time up to 300 requests/sec.

We used DCPI to obtain CPU time profiles of the server. Figure 6 shows the fraction of CPU time used in select() and in ufalloc(), for various request rates, using 1259-byte files. (The results for tests using 8192-byte files are analogous.) In each group of three bars, the leftmost bar represents the unmodified kernel, the center bar represents the kernel with the new select(), and the rightmost bar represents the kernel with new versions of both select() and ufalloc(). At rates above 600 requests per second, each bar is independently labelled. The top section of each bar shows the CPU time spent in ufalloc(), and the middle section shows the CPU time spent in select(). The bottom section of each bar ("others") shows the CPU time used for all other components of the server, including user-mode code. Idle time is not shown; it corresponds to the space above the bar, if any.

[Figure 6: CPU share of ufalloc() and select(), Squid proxy – 1259-byte files. Stacked bars of % CPU time (0-100) vs. requests per second (0-1000), divided into ufalloc, select, and other sections, for the unmodified kernel, "new ufalloc/old select", and "new ufalloc/new select".]

Figure 6 shows that the new ufalloc() almost entirely eliminates the CPU costs of descriptor allocation in all of the tested configurations. The new select() also costs much less than the old select(). When the server is underutilized, at rates below about 200 requests per second, the CPU profiles show that the new select() provides an additional performance impact: although we have not changed the implementation of any code covered by the "others" part of the profile, and the total throughput has not changed, the CPU costs of the "others" components have been reduced, relative to the unmodified kernel. We attribute this to better data-cache behavior, because the new select() has a much smaller data-cache footprint than the original implementation. The modified ufalloc() may also have a similar effect on cache performance. The improved data-cache footprint of select() is probably responsible for some of the throughput gains in the server-bound configurations.

As can be seen in Figure 3, even with our kernel modifications, the realistic benchmark still causes a small performance degradation compared to the naive benchmark. We attribute this to the inherently poor scalability of the select() programming interface. This interface passes information proportional to the total number of active connections on each call to select(). Moreover, when select() returns, the user process must do work proportional to the total number of active connections to discover which descriptors have pending events. Finally, select() overwrites its input bitmaps, thus requiring additional user-mode work to create these bitmaps on each call.
These costs cannot be eliminated with the current interface. In a separate publication [4], we propose a new, scalable interface to replace select().

Table 2 shows a profile of the modified kernel, made under the same conditions as the profile of the original kernel shown in Table 1. The new kernel spends 22% of the time in the idle loop, compared to almost no idle time for the original kernel. The original kernel spent about 22% of the CPU in select() and its subroutines, and 18% of the CPU in ufalloc(). The modified kernel spends 11% of the CPU in select(), and virtually none in ufalloc(). However, the busiest function in the system is now the user-level comm_select() function, using 16% of the CPU. The almost 28% of the CPU together consumed by the kernel select() and user-mode comm_select() functions is a result of the poorly scaling bitmap-based select() programming interface.

    CPU %     Procedure                      Mode
    21.96%    all idle time                  kernel
    11.49%    all kernel select functions    kernel
    11.24%      select()                     kernel
     0.15%      new_soo_select()             kernel
     0.10%      new_selscan_one()            kernel
    16.37%    comm_select()                  user
     2.61%    tcp_slowtimo()                 kernel
     1.73%    tcp_fasttimo()                 kernel
     1.39%    doprnt()                       user
     1.21%    Xsyscall()                     kernel
     1.10%    XentInt()                      kernel
     1.00%    bcopy()                        kernel
     0.91%    read_io_port()                 kernel
     0.90%    memset()                       user
    750 cold connections, 50 hot connections, 220 requests/second

Table 2: Example profile for modified kernel

Our experiments using the thttpd [24] Web server gave similar results. Using our modified kernel (with new implementations of both select() and ufalloc()), server throughput (at server saturation) improved by 58% for 1259-byte files, as shown in Figure 7. For 8192-byte files, throughput increased by 37%; further improvement may have been limited by the available network bandwidth, rather than by the server. At lower request rates, the modified kernel showed much more idle time. For example, at 100 requests/sec. for a 1259-byte file, the unmodified kernel showed 16% idle time; the modified kernel showed 88% idle time. At 100 requests/sec. for an 8192-byte file, the unmodified kernel had no idle time, but the modified kernel still showed 73% idle time.

[Figure 7: Response time for thttpd – 1259-byte files. Response time (msec, 0-35) vs. requests per second (0-2800), with curves for the unmodified kernel, "New ufalloc, old select", the new kernel, and the naive benchmark.]

5.2 Scalability with respect to connection count

To demonstrate that our implementations of select() and ufalloc(), unlike the original code, do scale well as the number of cold connections increases, we performed another series of experiments. In these experiments, we varied the number of connections from the load-adding client, between 0 and 2000 connections, and then increased the request rate until the server was saturated.

[Figure 8: Performance of Squid proxy – scalability. Saturation throughput (connections/sec., 400-1000) vs. number of cold connections (0-2000), with curves for "New ufalloc, new select", "New ufalloc, old select", "Old ufalloc, new select", and the unmodified kernel.]

Figure 8 shows that the throughput of the original kernel drops by 44% as the number of cold connections increases from zero to 2000. The figure also shows that the kernel with our scalable ufalloc() has a somewhat smaller dependency on the number of cold connections, and for the kernel with our implementations of both select() and ufalloc(), its throughput drops by only 14% over the same range. We believe that the remaining dependency results from the user-level costs of the programming interface for select().

6 Performance of a live system

Digital Equipment Corporation operates a Web proxy system, in Palo Alto, California, that serves a large fraction of Digital's internal users. During a typical weekday, the system handles as many as 2.6 million HTTP requests, from at least 5570 individual client hosts.

We installed our modified kernel on the proxy server, a 500MHz AlphaStation 500 system (21164A processor, SPECint95 = 15.0) with 512 MBytes of RAM. We then ran the system using either the unmodified kernel or our modified kernel, each for an entire calendar day (midnight to midnight, Pacific Time), and collected extensive monitoring information.

During these tests, the proxy server used version 3.1.2c-OSF of the NetCache software [16] from Network Appliance, Inc. Like Squid, NetCache was based on the Harvest Cache software, although NetCache and Squid have since evolved separately. Because caching tends to reduce the number of simultaneous network connections, during our trials we operated this software with caching disabled. This increases the load on the system, but for various reasons does not significantly increase response time as seen by the users.

Table 3 shows some statistics for each of the trials. The "Max. alloc. fds" column shows the largest number of file descriptors allocated to a single process at any one point during the trial; the "Peak req. rate" column shows the largest number of requests logged during a single second over the course of the day.

    Date        Kernel version   Requests handled   Max. alloc. fds   Peak req. rate
    1998-04-16  old              2581113            -                 107
    1998-04-23  new              2602448            755               116

Table 3: Statistics for live tests

6.1 Effect of request rate on CPU load

The operating system maintains counts of the number of clock interrupts that occur in each system mode (user-mode, kernel-mode, and idle). During the course of each trial, we logged these counters every 15 minutes, which allowed us to reconstruct the mean time spent in each mode during the 15 minutes prior to each log entry. The proxy software creates a timestamped log entry for each HTTP request it receives, so we can also count the number of requests handled in each 15 minute period, and then compute the mean request rate over that period.
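A minimal sketch of that per-sample arithmetic follows, assuming hypothetical counter names; this is our illustration, not the actual monitoring scripts.

```c
/* Derive per-mode CPU percentages from two successive samples of
 * the kernel's per-mode clock-tick counters (hypothetical names). */
struct mode_ticks {
    unsigned long user, kernel, idle;
};

/* Percentages of time spent in each mode between two samples,
 * e.g. logged 15 minutes apart. */
void mode_percentages(const struct mode_ticks *prev,
                      const struct mode_ticks *cur, double pct[3])
{
    double du = (double)(cur->user   - prev->user);
    double dk = (double)(cur->kernel - prev->kernel);
    double di = (double)(cur->idle   - prev->idle);
    double total = du + dk + di;

    pct[0] = 100.0 * du / total;    /* user-mode   */
    pct[1] = 100.0 * dk / total;    /* kernel-mode */
    pct[2] = 100.0 * di / total;    /* idle        */
}
```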

Figure 9 shows how CPU idle time, and CPU kernel-mode time, vary as a function of the mean request rate. Each point on the scatterplot represents one 15-minute sample. The circles correspond to idle time; the squares correspond to kernel-mode time. The filled marks show performance with the old versions of both select() and ufalloc() (the trial of 1998-04-16). The open marks show the performance of the new implementations (the trial of 1998-04-23).

[Figure 9: CPU costs as a function of request rate. Scatterplot of CPU % in the given mode (0-100) vs. mean requests/second (0-60), showing idle time and kernel-mode time for the old and new kernels, with linear regression lines.]

We then computed linear regressions for each set of samples. The regression lines are shown in Figure 9; the numeric results are given in Table 4. (User-mode regressions are given in the table, but not shown in the figure.) Each sample set includes 96 points (24 hours of 15-minute samples). The correlation between kernel-mode time and request rate is quite close; the correlation for idle time is not quite as good, probably because of some outliers caused by daily "housekeeping" tasks done during periods of low request rate. Because the outliers all occur at low request rates (that is, late at night), we recalculated the regressions after excluding samples taken at rates below 20 requests/second. These regressions, shown in Table 5, show higher correlation coefficients for idle time and user-mode time.

    Date        Kernel version   CPU mode   Slope   Corr. coeff.
    1998-04-16  old              idle       -1.67   -0.96
    1998-04-23  new              idle       -1.34   -0.92
    1998-04-16  old              kernel      1.09    0.98
    1998-04-23  new              kernel      0.85    0.99
    1998-04-16  old              user        0.58    0.77
    1998-04-23  new              user        0.49    0.66
    N = 96

Table 4: Linear regressions: full 1-day data sets

    Date        Kernel version   CPU mode   Slope   Corr. coeff.
    1998-04-16  old              idle       -1.69   -0.97
    1998-04-23  new              idle       -1.46   -0.98
    1998-04-16  old              kernel      1.02    0.96
    1998-04-23  new              kernel      0.85    0.99
    1998-04-16  old              user        0.68    0.97
    1998-04-23  new              user        0.65    0.99
    N = 54

Table 5: Linear regressions: above 20 requests/second

The regressions for idle time and kernel-mode time show significantly steeper slopes for the unmodified kernel, compared to those for the new implementations of select() and ufalloc(). The regressions for user-mode time suggest that the new kernel performs slightly better, perhaps because of better data-cache utilization, but the difference might not be significant.

Although one cannot necessarily expect linear behavior at very high request rates, a linear extrapolation of the idle time regressions from the full data sets gives X-intercepts of 58 requests/sec. for the unmodified kernel, and 69 requests/sec. for the new implementation. Using the truncated data sets (Table 5), the calculated X-intercepts are 57 and 68 requests/sec., respectively. This suggests that the modified kernel might support a peak request rate about 19% higher than the unmodified kernel, in this application.

Note that our samples were averaged over 15-minute intervals. The actual one-second peak rates experienced during these trials (see Table 3) were 107 requests/sec. for the unmodified kernel, and 116 requests/sec. for the modified kernel. Clearly, the systems can support rates higher than the extrapolation of idle time implies. The main significance of our performance improvements may be not the increase in peak throughput, but the decrease in queueing delay (and response time) at high throughputs.

6.2 Profile results

We obtained CPU-time profiles, using DCPI, for the proxy server during periods of heavy load, for both the original kernel (Table 6) and our modified kernel (Table 7). Each profile covers a period of exactly one hour. The tables include all procedures accounting for at least 1% of the non-idle CPU time.

The first column in each profile shows the fraction of CPU time spent in each function or group of procedures. As the first row in each table shows, even during periods of heavy load, some time is spent in the kernel's idle thread and its children. Therefore, the second column shows the fraction of non-idle CPU time spent in all non-idle procedures; this is a more useful basis for comparing the two kernels. Note that the profiles include a mixture of kernel-mode and user-mode procedures.

    CPU %     Non-idle CPU %   Procedure              Mode
    10.77%                     all idle time          kernel
    89.23%    100.00%          all non-idle time      kernel
    35.27%     39.53%          all select functions   kernel
    13.51%     15.14%            selscan              kernel
    12.56%     14.08%            soo_select           kernel
     7.48%      8.38%            undo_scan            kernel
     1.64%      1.83%            select               kernel
    12.64%     14.17%          commSelect             user
     1.74%      1.95%          all TCP functions      kernel
     1.49%      1.67%          malloc-related #1      user
     1.39%      1.56%          malloc-related #2      user
     1.09%      1.22%          mutex_unblock          user
     1.03%      1.16%          read_io_port           kernel
     0.95%      1.07%          bcopy                  kernel
     0.94%      1.05%          memGrep                user
    Profile on 1998-04-16 from 10:00 to 11:00 PDT;
    mean load = 54 requests/sec.; peak load ca. 98 requests/sec.

Table 6: Profile of unmodified kernel on live proxy

    CPU %     Non-idle CPU %   Procedure              Mode
    16.29%                     all idle time          kernel
    83.71%    100.00%          all non-idle time      kernel
    25.11%     30.00%          all select functions   kernel
    11.23%     13.42%            new_soo_select       kernel
     7.73%      9.24%            new_selscan_one      kernel
     5.67%      6.77%            select               kernel
     0.04%      0.05%            new_undo_scan        kernel
    15.33%     18.32%          commSelect             user
     2.70%      3.23%          all TCP functions      kernel
     2.56%      3.05%          in_pcblookup           kernel
     1.09%      1.30%          mutex_unblock          user
     1.01%      1.21%          bcopy                  kernel
     1.00%      1.19%          read_io_port           kernel
     0.97%      1.16%          malloc-related #1      user
     0.93%      1.12%          memGrep                user
     0.91%      1.09%          malloc-related #2      user
    Profile on 1998-04-23 from 10:00 to 11:00 PDT;
    mean load = 55 requests/sec.; peak load ca. 116 requests/sec.

Table 7: Profile of modified kernel on live proxy

The modified kernel spends 30% of the non-idle CPU time in select() and related procedures, compared to almost 40% spent in such procedures by the unmodified kernel. However, kernel-mode select() processing is still a significant burden on the CPU. As in Table 2, considerable time is spent in the user-mode commSelect() procedure (Squid and NetCache apparently use slightly different names for the same procedure). These observations support our belief that the bitmap-based select() programming interface leads to unnecessary work, and probably to significant capacity misses in the data caches.

In experiments with simulated loads, we observed that NetCache on our kernel calls select() about 7 times as often as it does on the unmodified kernel. We believe this is because our faster select() causes a NetCache thread to return from select() with usually only one ready descriptor. (NetCache uses multiple event-driven threads, presumably for exploiting the parallelism available on SMP machines.) Before the next event arrives, other NetCache threads call select() to discover this event again. In the unmodified kernel, each call to select() takes longer, and returns multiple events. This may account for the heavy use of select() in Table 7.

In this application, even the unmodified kernel spends very little time in ufalloc() (0.20%). However, the modified kernel spends even less time in ufalloc() (0.03%). For this proxy, the total number of open file descriptors is relatively small. However, one might expect this fraction to become more significant at higher request rates.

We are not entirely sure what caused the significant increase in time that the modified kernel spends in in_pcblookup(). This may be the result of an unfortunate collision in the direct-mapped data caches.

We note that in this real-world environment, for both versions of the kernel, just over 1% of the non-idle CPU time is spent in all kernel-related data movement (the bcopy()). Even less time is spent computing checksums. A moderate amount of time (between 2% and 3%) is spent in TCP-related functions (which have been highly optimized in Digital UNIX). These measurements reinforce the emphasis placed by Kay and Pasquale [9] on "non-data touching processing overheads"; however, they failed to recognize that the poor scalability of select() would ultimately dominate the other costs.

6.3 Data cache effects

We have speculated in several places that our kernel modifications affect data cache utilization. DCPI allows us to estimate the mean cycles per instruction (CPI) for each procedure in a profile, and to estimate the fraction of dynamic stalls caused by data-cache misses. We found that the CPI for the user-mode commSelect() procedure declined from 1.69 to 1.62 as a result of our kernel changes, mostly because of fewer data-cache misses.

We also found that the CPI for in_pcblookup() increased from about 1.28 to 11.15 as an apparent result of our kernel changes, even though we did not change the code for this kernel procedure. This suggests that we somehow created a particularly unlucky collision in the data caches between the data structures for in_pcblookup() and those for select().

7 Related Work

Operating system researchers and vendors have devoted much effort to improving Internet server performance. One early experience that led to published results was the 1994 California election server [14, 15]; another early study was performed at NCSA [12]. Operating system vendors responded to complaints of performance problems by improving various kernel mechanisms, especially by replacing BSD's linear-time PCB lookup algorithm [13, 21], and by changing certain kernel parameter values. Vendors also provided tuning guides for systems being used as Web servers [6].

In response to observations about the large context-switching overhead of process-per-connection servers, recent servers [5, 16, 22, 24, 25] have used event-driven architectures. Measurements of these servers under laboratory conditions indicate an order of magnitude performance improvement [5, 20].

Maltzahn et al. [11] reported the poor performance of Squid under real conditions. Fox et al. [7], in describing the Inktomi system, also briefly mention that their event-driven front-ends spend 70% of their time in the kernel, and attribute this to the state-management overhead of a large number of simultaneous connections. However, neither of these papers analyzed the reason for this phenomenon in any detail.

8 Conclusion

We presented a detailed analysis of the effect of WAN delays on the performance of event-driven servers, and showed that linear scaling in the select() and ufalloc() implementations leads to excessive kernel CPU consumption.

We described scalable versions of select() and ufalloc(), and evaluated their impact on the performance of event-driven servers. We showed that these changes improve the performance of Web servers and proxies on realistic benchmarks, and on a live proxy, without harming performance on naive benchmarks.

Our results show the need for a new, scalable interface to replace select(). We are currently working to develop this.

Acknowledgments

We are grateful to Kathy Richardson of Digital's Network Systems Laboratory, for providing data on the performance of the Palo Alto Web proxies, and to Kathy and to Jessie Stickgold-Sarah for helping us evaluate our changes in the context of these proxies. We also thank the USENIX referees for their comments.

References

[1] J. Anderson, L. M. Berc, et al. Continuous profiling: Where have all the cycles gone? In Proceedings of the Sixteenth ACM Symposium on Operating System Principles, Saint-Malo, France, Oct. 1997.

[2] G. Banga, F. Douglis, and M. Rabinovich. Optimistic Deltas for WWW Latency Reduction. In Proceedings of the 1997 USENIX Technical Conference, Jan. 1997.

[3] G. Banga and P. Druschel. Measuring the Capacity of a Web Server. In Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems, Dec. 1997.

[4] G. Banga, P. Druschel, and J. C. Mogul. Better operating system features for faster network servers. To be presented at the Workshop on Internet Server Performance, June 1998.

[5] A. Chankhunthod, P. B. Danzig, C. Neerdaels, M. F. Schwartz, and K. J. Worrell. A Hierarchical Internet Object Cache. In Proceedings of the 1996 USENIX Technical Conference, Jan. 1996.

[6] Digital UNIX Tuning Parameters for Web Servers. http://www.digital.com/info/internet/document/ias/tuning.html.

[7] A. Fox, S. D. Gribble, Y. Chawathe, E. A. Brewer, and P. Gauthier. Cluster-based scalable network services. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles, Saint-Malo, France, Oct. 1997.

[8] M. F. Kaashoek, D. R. Engler, G. R. Ganger, and D. A. Wallach. Server Operating Systems. In 1996 SIGOPS European Workshop, Connemara, Ireland, Sept. 1996.

[9] J. Kay and J. Pasquale. The Importance of Non-Data Touching Processing Overheads in TCP/IP. In Proceedings of the ACM Communications Architectures and Protocols Conference (SIGCOMM), pages 259-268, San Francisco, CA, Sept. 1993.

[10] A. Luotonen, H. F. Nielsen, and T. Berners-Lee. CERN httpd 3.0A. http://www.w3.org/pub/WWW/Daemon/, July 1996.

[11] C. Maltzahn, K. J. Richardson, and D. Grunwald. Performance Issues of Enterprise Level Web Proxies. In Proceedings of the ACM SIGMETRICS '97 Conference, Seattle, WA, June 1997.

[12] R. E. McGrath. Performance of Several HTTP Demons on an HP 735 Workstation. http://www.ncsa.uiuc.edu/InformationServers/Performance/V1.4/report.html, Apr. 1995.

[13] P. E. McKenney and K. F. Dove. Efficient demultiplexing of incoming TCP packets. In Proceedings of the SIGCOMM '92 Conference, pages 269-280, Aug. 1992.

[14] J. C. Mogul. Network behavior of a busy Web server and its clients. Technical Report WRL 95/5, DEC Western Research Laboratory, Palo Alto, CA, 1995.

[15] J. C. Mogul. Operating system support for busy internet servers. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V), Orcas Island, WA, May 1995.

[16] Network Appliance, Inc. NetCache. http://www.netapp.com/level3/netcache/datasheet.html.

[17] J. Ousterhout. Why Threads Are A Bad Idea (for most purposes). Invited talk at the 1996 USENIX Technical Conference. http://www.scriptics.com/people/john.ousterhout/threads.ps.

[18] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. Technical Report TR97-294, Rice University, CS Dept., Houston, TX, 1997.

[19] K. J. Richardson. Personal communication, 1997.

[20] S. E. Schechter and J. Sutaria. A Study of the Effects of Context Switching and Caching on HTTP Server Performance. http://www.eecs.harvard.edu/~stuart/Tarantula/FirstPaper.html.

[21] Solaris 2 TCP/IP. http://www.sun.com/sunsoft/solaris/networking/tcpip.html.

[22] Squid. http://squid.nlanr.net/Squid/.

[23] W. Stevens. UNIX Network Programming. Prentice-Hall, Englewood Cliffs, NJ, 1990.

[24] thttpd. http://www.acme.com/software/thttpd/.

[25] Zeus. http://www.zeus.co.uk/.