Appears in the Proceedings of 14th IBM Center for Advanced Studies Conference (CASCON), October, 2004

Evaluating the Performance of User-Space and Kernel-Space Web Servers

Amol Shukla, Lily Li, Anand Subramanian, Paul A.S. Ward, Tim Brecht

University of Waterloo

{ashukla,l7li,anand,pasward,brecht}@uwaterloo.ca

Abstract

There has been much debate over the past few years about the practice of moving traditional user-space applications, such as web servers, into the kernel for better performance. Recently, the user-space µserver has shown promising performance for delivering static content. In this paper we first describe how we augmented the µserver to enable it to serve dynamic content. We then evaluate the performance of the µserver and the kernel-space TUX web server, using the SPECweb99 workload generator under a variety of static and dynamic workloads. We demonstrate that the gap in the performance of the two servers becomes less significant as the proportion of dynamic-content requests increases. In fact, for workloads with a majority of dynamic requests, the µserver outperforms TUX. We conclude that a well-designed user-space web server can compete with an in-kernel server on performance, while retaining the reliability and security benefits that come from operating in user space. The results presented in this paper will help system developers and administrators in choosing between the in-kernel and the user-space approach for web servers.

Author affiliations: School of Computer Science; Department of Electrical and Computer Engineering.

Copyright © 2004 Amol Shukla, Lily Li, Anand Subramanian, Paul A.S. Ward, and Tim Brecht. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

1 Introduction

The demand for web-based applications has grown rapidly over the past few years, which has motivated the development of faster and more efficient web servers. High-performance web servers need to multiplex between thousands of simultaneous connections without degrading the performance.

Many modern web applications require the generation of dynamic content, where the response to a user request is the result of a server-side computation. Dynamic content creation allows web pages to be personalized based on user preferences, thereby enabling a more interactive experience for the user than that possible with static-only content. On-the-fly response generation also provides a web interface to the information stored in databases. Dynamic requests are often compute-bound [1] and involve more processing than static-content requests. The on-demand creation of content can significantly impact the scalability and the performance provided by a web server [1, 2, 3].

Various techniques have been investigated to improve the performance of web servers. These include novel server architectures [4, 5] as well as modifications to operating system interfaces and mechanisms [6, 7, 8, 9, 10, 11]. An emerging trend has seen web servers being moved into the kernel for better performance. TUX [12] (now known as Red Hat Content Accelerator) is a kernel-space web server from Red Hat. By running in kernel-space, TUX avoids the overheads of context switching and event notification, and implements several optimizations to reduce redundant data reads and copies. TUX supports the delivery of both static and dynamic content.

Initial research has indicated that kernel-based web servers provide a significant performance advantage over their user-space counterparts in the delivery of static content, and the in-kernel TUX web server has been shown to be nearly twice as fast as the best user-space servers (IIS on Windows
2000 and Zeus on Linux) [13]. In this paper we revisit the kernel-space versus user-space debate for web servers. In particular, we expand it to include dynamic content workloads. Recent research has shown that a significant proportion of workloads in real-world web sites consists of dynamic requests [1, 14], with Arlitt et al. [1] reporting that over 95% of the requests in an e-commerce web site are for dynamic content.

Recently, the user-space µserver [15] has shown promising results that rival the performance of TUX on certain static-content workloads [16]. Prior to our work, the µserver did not have the ability to handle dynamic requests. We first demonstrate how to augment the µserver to enable it to serve dynamic content. We then compare the performance of the µserver and TUX (the fastest reported in-kernel web server in Linux) under a variety of static and dynamic workloads. We demonstrate that the gap in the performance of the two servers becomes less significant as the proportion of dynamic-content requests increases. Further, we show that for workloads with a majority of dynamic requests, the µserver can outperform TUX. Based on the results of our experiments, we suggest that a well-designed user-space web server can compete with an in-kernel server on performance, while retaining the reliability and security benefits that come from operating in user space.

The rest of this paper is organized as follows: Section 2 provides additional background information and summarizes some of the related work. In Section 3 we present an overview of various approaches for supporting the delivery of dynamic content in web servers, and describe how we augmented the µserver to handle dynamic requests. Section 4 includes a detailed description of our experimental environment, methodology and workloads. In Section 5 we present the results of our experiments and analyze them. In Section 6 we summarize our findings, discuss their implications and outline some ideas for future work.

2 Background and Related Work

Modern web servers require special techniques to handle a large number of simultaneous connections. In order to understand the performance characteristics of web servers, it is important to consider the steps performed by a web server in handling a single client request. Figure 1, due to Brecht et al. [16], illustrates this process.

1. Wait for and accept an incoming network connection.
2. Read the incoming request from the network.
3. Parse the request.
4. For static requests, check the cache and possibly open and read the file.
5. For dynamic requests, compute the result based on the data provided in the request and the information stored at the server.
6. Send the reply to the requesting client.
7. If required, close the network connection.

Figure 1: Steps performed by a web server to process a client request.

Many of the steps in the figure can block due to network or disk I/O. A high-performance web server has to handle thousands of such requests simultaneously without degrading the performance. Different server architectures [4, 5] have been proposed to handle this problem and are summarized by Pai et al. [4]. An event-driven approach is often implemented in high-performance network servers to multiplex a large number of requests over a few server processes. Using an event notification mechanism, such as the select system call, the server identifies connections on which forward progress can be made without blocking, and processes only those connections. By using just a few processes, event-driven servers avoid the excessive context switches required by the thread-per-connection or process-per-connection approach taken by multi-threaded or multi-process servers like Apache [4].

Other researchers have suggested modifications in operating system interfaces and mechanisms for efficient notification and delivery of network events to user-space servers [6, 7, 8, 17], reducing the amount of data copied between the kernel and the user-space [10], reducing the number of kernel-boundary crossings [9], and a combination of the above [11].

In light of the considerable demands placed on the operating system by web servers, some researchers have sought to improve the performance
of user-space web servers by migrating them into the kernel. By running in the kernel, the cost of event notification can be minimized through fewer kernel crossings, and the redundant copying and buffering of data in the user-space application can be avoided. The web server can also implement optimizations to take advantage of direct access to kernel-level data structures. In particular, a closer integration with the TCP/IP stack, the network interface, and the file-system cache is possible.

Following an in-kernel implementation of the Network File System (NFS) server in Linux, two kernel-space web servers were proposed. kHTTPd [18] runs in the kernel as a module and handles only static-content requests. TUX [12] is a kernel-based high-performance web server with the ability to serve both static and dynamic data. It implements several optimizations such as zero-copy parsing, zero-copy disk reads, zero-copy network writes, and caching complete responses in the kernel's networking layer to accelerate static content delivery. By providing support for modules that generate dynamic content and mass virtual hosting, TUX is pitched as a high-performance replacement for traditional user-space web servers.

Joubert et al. [13] demonstrated that the best performing user-space web servers, IIS on Windows 2000 and Zeus on Linux, are about two times slower than TUX on the SPECweb96 benchmark. They also reported that TUX is more than three times faster than the popular Apache web server and provides better performance than kHTTPd. However, their paper only deals with static-content requests and non-persistent (HTTP 1.0) connections. In this paper, we update their kernel versus user-space comparison on Linux by evaluating the performance of the in-kernel TUX web server with the event-driven, user-space µserver using SPECweb99 workloads, which include both static and dynamic content requests, and HTTP 1.1 persistent connections. The µserver has recently shown promising results that compare favourably against the performance of TUX on completely cached static workloads [16]. By implementing support for dynamic requests in the µserver we broaden the comparison of user-space and in-kernel web servers to include dynamic as well as out-of-cache (disk-bound) static workloads.

While the delivery of static content is a simple process, the generation of dynamic content requires server-side computation, which can affect the scalability and the performance of a web server [2, 3]. Our hypothesis was that since the generation of dynamic content is CPU-bound [1], processing these requests in the kernel might not prove to be as advantageous as is the case with static-content delivery. Recent workload-characterization studies by Arlitt et al. [1] and Wang et al. [14] report that a significant proportion of the requests handled by modern e-commerce web sites are for dynamic content. To our knowledge, no studies examining the relative benefits of the user-space versus kernel-space approach for web servers using workloads containing dynamic-content requests have been published, and our work attempts to fill this void.

3 Implementing Dynamic Content Support in the µserver

Prior to this paper, the µserver did not support the delivery of dynamic content. In this section, we describe how we modified it to support dynamic requests for evaluation against TUX on the SPECweb99 benchmark. First we provide a brief overview of existing dynamic content generation mechanisms.

The Common Gateway Interface (CGI) is a standard for running external programs which generate dynamic content from web servers. Whenever a server receives a dynamic request, it creates a new process to handle the request and uses the CGI specification to interact with that process. After the dynamic content has been generated, the external process passes the results to the web server and terminates. Launching a new process is an expensive operation in terms of system resources. The CGI approach has been identified as having poor scalability due to the process-creation overhead for every request [19, 20].

FastCGI or Servlets offer an improvement over CGI by using a pool of persistent processes (or threads) to handle dynamic requests. The processes that generate dynamic content are persistent in that they are not terminated after every request, but are reused for later requests. The web server does not launch a new process for every dynamic request but utilizes an existing process from the pool to generate content. A FastCGI or Servlet-like mechanism reduces the process-creation overhead and allows resources such as database connections to be shared between different dynamic-content handlers. A related dynamic content generation mechanism called templating is used in PHP and Java Server Pages, and involves embedding a scripting or a programming language in HTML to generate request-specific responses. However, both the CGI and FastCGI-like approaches involve the overhead of inter-process communication between the web server and the external processes while handling each dynamic request.

Various server extensions have been proposed to minimize the inter-process communication overheads. Web servers such as IIS provide APIs which allow dynamic-content generation operations to run as part of the main server process. The extension API approach involves the development of modules that use published server interfaces to handle particular types of dynamic requests. Most high-performance web servers, including TUX, have APIs to allow modules to use the services of the web server to generate dynamic content.

Modules responsible for dynamic-content generation can be loaded statically or dynamically (on-the-fly) by the web server. Dynamic loading can be conceptually viewed as follows (adapted from Millaway and Conrad [20]): the first time a user module is requested in a URI or when the server starts up, it loads the corresponding shared object (using a function similar to dlopen), and obtains an entry point (typically a function pointer) into the module (using a function similar to dlsym). The module communicates with the web server using published APIs. The server can pass on the HTTP requests to the appropriate module by maintaining a mapping between the function pointer provided and the module name. TUX provides an interface for the generation of dynamic content by trusted modules, which can be dynamically loaded. Lever et al. [21] describe in detail how support for user modules is implemented in TUX.

Since the server extension API approach places maximum emphasis on high performance, we have developed support for static modules in the µserver. In our implementation, modules responsible for dynamic-content generation have to be compiled and statically linked with the µserver. Dynamic requests are delegated by the µserver to an appropriate module, using a procedure similar to the one described above, which then generates and delivers the content. In the future, we intend to implement support for dynamic loading of modules with a cleaner API in the µserver. Note that the results of our experiments are not affected by the static or dynamic loading of modules.

For TUX, we use a freely available user-space module for handling the dynamic requests required for the SPECweb99 benchmark [22]. Our module for handling SPECweb99 requests in the µserver borrows heavily from this TUX module, with the TUX APIs replaced by those specific to the µserver. By using a common code base for the modules in TUX and the µserver, we expect to eliminate any differences in performance due to different SPECweb99 request-handler implementations, thereby providing a truer comparison between in-kernel and user-space web servers for the benchmark's dynamic requests.

4 Experimental Environment

All experiments are conducted in an environment consisting of 8 client machines and a server connected via two full-duplex Gigabit Ethernet links. All client machines have 550 MHz Pentium III processors, 256 MB of memory and run the 2.4.7-10 Linux kernel. The server is a 400 MHz Pentium II machine with 512 MB of memory and a 7200 RPM IDE disk, and runs Red Hat Linux 8 with the 2.4.22 kernel. Since the server has two network cards, we avoid network bottlenecks by partitioning the clients into two subnets. The first four clients communicate with the server's first Ethernet card, while the remaining four use a different IP address linked to the second Ethernet card. This client-server cluster is completely isolated from other network traffic. In order to ensure that we are able to generate sufficient load to drive the server without letting the clients become a bottleneck, we use the more powerful machines as our clients. For all our experiments the server kernel is run in uniprocessor mode.

4.1 SPECweb99 Workload Generator

We evaluate the performance of the user-space µserver against the in-kernel TUX web server using the SPECweb99 workload generator.

The SPECweb99 benchmark [23] has become the de-facto tool for web server performance evaluation in the industry [24]. Nahum [24] examines how well SPECweb99 captures real-world
web server workload characteristics such as request methods, request inter-arrival times, and URI popularity. Detailed specifications of the benchmark can be found at the SPECweb99 web site [23, 25].

The SPECweb99 workload generator uses multiple client systems to distribute HTTP connections made to the web server. The number of simultaneous connections is fixed for the duration of a SPECweb99 iteration. Each connection is used to make HTTP requests to the server according to the predefined workload.

The SPECweb99 clients can create workloads containing both static-content and dynamic-content requests. The model for dynamic content in SPECweb99 is based on two prevalent features of commercial web servers: advertising and user registration [23]. A portion of SPECweb99 dynamic GET requests simulate ad rotation in a commercial web site, where the ads appearing on a web page are generated and rotated on-the-fly based on the preferences of the user tracked through cookies. The dynamic POST requests in SPECweb99 model user registration at an ISP site, where client information is passed to the web server as a POST request, and the server is responsible for storing the data in a text file.

SPECweb99 specifies four classes of dynamic requests: standard dynamic GET, dynamic GET with custom ad rotation through cookies, dynamic POST, and CGI dynamic GET [23]. Since the scalability problems of CGI are well-documented [19, 20], we do not use CGI dynamic GET requests in any of our experiments. The workload mix in SPECweb99 can be configured to vary the proportion of the type of requests (static or dynamic), the HTTP method (GET or POST), and version (1.0 or 1.1) used. While we experimented with a variety of workload mixes involving both static and dynamic workloads, we report only some of the most representative results in this paper.

4.2 Methodology

Each experiment involves a distinct SPECweb99 workload mix. A workload mix is repeated for both of the web servers. Note that we do not use the 'maximum number of conforming simultaneous connections' performance metric specified by the SPECweb99 benchmark because it describes the behaviour of the web server at a single data point. Instead, our performance metric is the throughput reported for successful HTTP requests as the number of simultaneous connections is increased. Consequently, our results do not comply with the reporting guidelines of the SPECweb99 benchmark. However, we believe that our performance metric provides a better understanding of web server behaviour under varying loads, giving us a range of data points so that we can compare web server performance under light, medium and heavy load.

Each experiment consists of a series of data points, where each data point corresponds to a SPECweb99 iteration in which the number of simultaneous connections attempted is fixed. Note that the number of simultaneous connections attempted is not necessarily equal to the number of simultaneous connections established. Each iteration consists of one minute of warm-up time and two minutes of actual measurement time. The warm-up time between successive data points eliminates the cost of dynamically loading modules and allows various system resources such as the state of socket descriptors to be cleared from previous tests. Prior to running experiments, all non-essential services such as sendmail, dhcpd, and cron were shut down to prevent these daemons from confounding the experimental results. The results are depicted using graphs where the horizontal axis represents the number of simultaneous connections requested and the vertical axis shows the server throughput.

The size of the file set is determined by the number of simultaneous connections generated by the SPECweb99 workload generator and is computed using the formula described in [25]. The file set size varies from 1.7 GB for 500 simultaneous connections to 3.3 GB for 1000 simultaneous connections to 4.8 GB for 1500 simultaneous connections, and the file set cannot be completely cached in memory.

4.3 The Web Servers

We use the in-kernel TUX (kernel module version 2, user-module version 2.2.7) and µserver 0.4.5 in our experiments.

In the interest of making fair and scientific comparisons, we carefully configured TUX and the µserver to use the same resource limits. The maximum number of concurrent connections is set to 15,000 for TUX as well as the µserver. Logging is disabled in both the servers. The dynamic content modules
in both the µserver and TUX use the sendfile system call, which enables quick data transfer between file and socket descriptors. The Linux kernel versions that we examined (2.4.22, 2.6.1) silently limit the accept queue backlog set by applications (through the listen system call) to 128. By default, TUX bypasses this kernel-imposed limit in favour of a much larger value, viz., 2048. We changed the accept queue backlog in TUX to 128 to match the value imposed by the kernel on the user-space µserver. While we could have recompiled the kernel to allow the µserver to use an accept queue backlog of 2048, we opted not to use this approach because a large number of systems currently operate with the accept queue limit of 128.

For the µserver we use the Accept-Inf option [16] while accepting new connections, which causes the server to consecutively accept all currently pending connections, and the select system call as the event notification mechanism. Brecht et al. have demonstrated that using these two options the µserver can provide high performance [16]. The µserver relies on the file system cache for caching static files, but the size of the available memory (512 MB) is small compared to the size of the file set used in each data point in our experiments. TUX uses a pinned memory cache for accelerating the delivery of static content, along with a number of other high-performance optimizations such as zero-copy parsing, zero-copy disk reads and zero-copy network writes.

There is one major limitation to the current handling of requests in the µserver. Some system calls can block due to the lack of support for non-blocking or asynchronous disk reads in Linux. Some system calls responsible for writing dynamically generated content to the network can also block in our implementation of dynamic content support in the µserver. The equivalent disk reads and network writes in TUX are non-blocking because TUX uses an architecture similar to that of Flash [4] to asynchronously manage those operations [21]. We address this performance problem in the µserver by using multiple processes. Note that each µserver process behaves in event-driven fashion and multiplexes a number of connections using the select system call. All the µserver processes listen for connections on the same port, share all the machine resources, and depend on the operating system for the distribution of work. The advantage of using multiple processes kicks in only when one of them is suspended due to a disk read or network write, and the operating system can schedule another process that is ready to run. The TUX user manual recommends that the number of worker threads used in the server should not be greater than the number of CPUs available [12]. Since we use a single CPU for the experiments, we set the number of worker threads in TUX to one. By running some experiments with more than one TUX worker thread, we verified that a single worker thread results in the best performance for TUX for our static, dynamic and mixed workloads. Note that in addition to the worker thread, TUX uses a pool of kernel threads to obtain new work and asynchronously handle disk I/O tasks [21].

5 Results

Our first experiment involves a static-only workload (with no dynamic requests), which can be seen as a yardstick against which we evaluate our later workloads that contain a varying proportion of dynamic requests. Note that previous studies comparing in-kernel and user-space web servers have concentrated on static workloads that can be completely cached in memory. In comparison, the static workload used in our experiment is disk-bound. Figure 2 illustrates the relative performance of the µserver with a single process (labelled 'userver-1-process'), the µserver running 16 event-driven processes (labelled 'userver-16-processes') and TUX. The servers show similar behaviour up to 300 simultaneous connections. Many real-world web sites experience loads lighter than this and the negligible difference in the performance provided by the servers would tilt the scales in favour of the µserver, which offers the security and reliability benefits of running in the user-space. However, large commercial and informational web sites often have to handle a significantly higher number of concurrent connections. As the load increases, TUX provides considerably higher throughput than the µserver. The gap in the peak throughput (at 700 simultaneous connections) of the 16-process µserver and TUX is 40%. Past 700 simultaneous connections, the throughput of both servers deteriorates, and the gap between the two becomes as large as 61% at 1100 simultaneous connections. Although this difference is significant, it is a lot smaller than previously-reported results, where the kernel-space TUX web server was shown to be almost twice as fast as the best user-space web server [13].

[Plot: throughput (ops/sec) versus number of simultaneous connections (100 to 1900) for TUX, userver-1-process and userver-16-processes]

Figure 2: 100% Static Workload Performance: µserver vs. TUX

Using 16 processes does improve the performance of the µserver, helping it achieve higher peak throughput than the single-process µserver. With a single process, the µserver is brought to a standstill if a disk-read operation blocks, but with 16 processes even if some of them are blocked, the operating system can schedule a process that is ready to run, allowing the server to continue processing requests. It should be noted that using too many processes can actually inhibit the performance due to excessive context switches and high overhead. For the static workload using more than 16 processes resulted in a drop in performance. These results have been left out to reduce clutter in the graphs.

We now move to the evaluation of TUX and the µserver under a workload consisting exclusively of dynamic-content requests. For this experiment we use the default SPECweb99 mix for dynamic requests (without dynamic CGI GET requests, as explained earlier), which comprises 42% dynamic GET, 42% dynamic GET with cookies and 16% dynamic POST requests. The results of this experiment appear in Figure 3. Note that using more than 8 processes in the µserver for this workload resulted in degradation in performance due to excessive context-switching.

We can see that the in-kernel TUX web server is not as effective in the delivery of dynamic content and is outperformed by the µserver with 8 processes. The gap in their peak throughput (at 500 simultaneous connections for TUX and 700 simultaneous connections for the µserver) is 13%. The gap is widest at 1300 simultaneous connections, where the µserver with 8 processes outperforms TUX by as much as 42%.

For the dynamic requests workload, using multiple processes impacts the performance of the µserver significantly: the peak throughput of the µserver with 8 processes is 29% higher than that of the single-process µserver. We used the vmstat and sar utilities to gather information about system load activity at 500 simultaneous connections. The statistics indicate that the µserver using a single process is blocked (reading large files from disk or writing large responses to the network) for at least 10% of the duration of the experiment. On the other hand, with the 8-process µserver, we seldom reach a point where all eight processes are blocked, so some requests can always be serviced. If all the server processes are blocked, CPU cycles are wasted. Indeed for the single-process µserver, the CPU is idle 40% of the time. For dynamic requests, which are compute-bound, this translates to poor performance.
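The benefit of adding processes can be illustrated with a rough back-of-the-envelope model. The sketch below assumes, purely for illustration, that every process is blocked independently for the same fraction of time, so that the CPU is wasted only when all of them are blocked at once. This independence assumption and the function name are ours and are not part of the measurements reported here; processes contending for a single disk are unlikely to block independently, so the numbers should be read as an intuition aid rather than a prediction.

```python
def fraction_all_blocked(blocked_fraction: float, num_processes: int) -> float:
    """Fraction of time during which every server process is blocked.

    Hypothetical model: each process is blocked independently for
    `blocked_fraction` of the time, so all `num_processes` of them are
    blocked together for blocked_fraction ** num_processes of the time.
    """
    if not 0.0 <= blocked_fraction <= 1.0:
        raise ValueError("blocked_fraction must be between 0 and 1")
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return blocked_fraction ** num_processes


if __name__ == "__main__":
    # 0.4 mirrors the 40% idle time reported for the single-process server;
    # treating it as a per-process blocking fraction is an assumption.
    for n in (1, 2, 4, 8):
        print(n, fraction_all_blocked(0.4, n))
```

Under these assumptions, a blocking fraction of 0.4 leaves all eight processes blocked for well under 0.1% of the time, which is in line with the observation that the 8-process configuration seldom has every process blocked.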

[Plot: throughput (ops/sec) versus number of simultaneous connections (100 to 1900) for TUX, userver-1-process and userver-8-processes]

Figure 3: 100% Dynamic Workload Performance: µserver vs. TUX
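Each user-space server process in the configurations compared above is event-driven: it uses select to find the connections on which progress can be made without blocking and services only those. The following Python sketch illustrates that general idea under simplifying assumptions. It is not the server's actual implementation, which is not shown in this paper; the function names and the `handler` callback are hypothetical, and each request is assumed to fit in a single read.

```python
import select
import socket


def ready_connections(connections, timeout=0.0):
    """Return the sockets in `connections` that can be read without blocking."""
    readable, _, _ = select.select(connections, [], [], timeout)
    return readable


def serve_ready(connections, handler, timeout=0.0):
    """Run one event-loop iteration, servicing only the ready connections.

    `handler` is a hypothetical callback that maps request bytes to reply
    bytes. Returns the number of connections serviced in this iteration.
    """
    serviced = 0
    for conn in ready_connections(connections, timeout):
        # select reported this socket as readable, so recv will not block.
        request = conn.recv(4096)
        if request:
            conn.sendall(handler(request))
            serviced += 1
    return serviced
```

A process built this way never waits on an idle connection, but it can still stall inside `handler` or on a blocking disk read, which is the limitation that running several such processes is meant to mask.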

At this point we reiterate that using more than one worker thread in TUX resulted in a degradation in its performance on the workloads that we examined. Using vmstat and sar we verified that the reduction in TUX's throughput on workloads containing dynamic requests is not because it is blocked more often than the multi-process µserver. In trying to understand why TUX performs poorly compared to the µserver on the dynamic content intensive workload we have identified two possible causes.

One reason for the reduction in the performance gap between the two servers could be that dynamic requests are processor-intensive [1, 2, 3]; hence, the network and memory optimizations in TUX do not play as large a role in influencing the overall performance as they do in static content delivery.

A second factor in the drop in performance in TUX might be its strategy for accepting connections. TUX accepts new connections more aggressively, instead of making forward progress on already accepted connections [16]. We used the netstat utility to gather data about TCP queue drops in both the servers, and found that the number of queue drops in TUX was an order of magnitude lower than that in the µserver. The fewer queue drops in TUX suggest that it is very aggressive in accepting new connections. In contrast, a higher queue-drop value implies that the µserver favours the completion of pending work before accepting new connections. In investigating this further we observed that the TUX module that generates dynamic content consumes less than 25% of the CPU time on average. The overall CPU usage by TUX, including that by the dynamic content generation module, is close to 90%. We suspect that as the number of simultaneous connections increases, TUX spends too much time trying to process new connections, instead of allocating sufficient CPU time to its dynamic content generation module to complete the work on existing connections. As noted by Brecht et al. [16], there seems to be room for improvement in the connection-scheduling mechanism implemented in TUX.

So far we have evaluated TUX and the µserver on workloads comprising exclusively static or exclusively dynamic requests. However, the workload mix seen in most real-world web servers is usually a combination of dynamic and static requests. The proportion of dynamic to static requests differs based on the type of web site. In the final set of experiments we use a mixed workload where the ratio of dynamic to static content is varied. In doing so we attempt to highlight the transition in the gap in the performance of user-space and kernel-space web servers as the proportion of dynamic requests increases. We use the request mix described earlier for the dynamic-content portion of all our mixed workloads. In the following figures we only show the results of the µserver with 8 processes, which is the best-performing number of processes.

Our first mixed workload consists of 30% dynamic and 70% static requests. This ratio of dynamic to static content is recommended by the SPECweb99 benchmark. The results for this workload appear in Figure 4. TUX provides higher throughput than the µserver for this workload. However, the gap in their performance is much smaller than that in the static-only workload. The difference in their peak throughput is 13%, and this gap becomes as high as 39% at 1500 simultaneous connections.

[Plot: throughput (ops/sec) versus number of simultaneous connections (100 to 1900) for TUX and userver-8-processes]

Figure 4: Mixed Workload (30% dynamic-70% static) Performance: µserver vs. TUX

Figure 5 shows the results of the experiment with a workload containing 50% dynamic and 50% static requests. We can see that the gap in the performance of TUX and the µserver is very small. The difference in their peak throughput is under 4%, and the µserver's throughput is always within 15% of that provided by TUX.

[Plot: throughput (ops/sec) versus number of simultaneous connections (100 to 1900) for TUX and userver-8-processes]

Figure 5: Mixed Workload (50% dynamic-50% static) Performance: µserver vs. TUX

Our final mixed workload consists of 80% dynamic and 20% static requests. Figure 6 illustrates the results of this experiment. The µserver outperforms TUX at all but one data point. Its peak throughput is 3% higher than that of TUX. As reported by Arlitt et al., some e-commerce web sites do see workloads where the proportion of dynamic requests is much higher than 80% [1]. For such web sites, using a high-performance user-space web server such as the µserver might be a much better option.

[Plot: throughput (ops/sec) versus number of simultaneous connections (100 to 1900) for TUX and userver-8-processes]

Figure 6: Mixed Workload (80% dynamic-20% static) Performance: µserver vs. TUX

The results of our experiments suggest that the performance merits of in-kernel web servers diminish in the presence of dynamic-content requests. The gap in the peak throughput of TUX and the µserver drops from 40% for static-only workloads to under 4% for a workload with evenly split static and dynamic requests, and the µserver outperforms TUX by a small margin as the proportion of dynamic-content requests increases above 80%.

6 Conclusions and Future Work

The contentious issue of kernel-space versus user-space web servers has been previously investigated in the context of completely cached static workloads. We have expanded this discussion to cover dynamic workloads and disk-bound static workloads, using two high-performance web servers:

the user-space µserver and the kernel-space TUX, and the SPECweb99 workload generator.

Our results suggest that improvements to operating system interfaces, such as the sendfile system call, and novel implementation techniques, such as the multi-accept strategy [16], which amortizes the overhead of event notification, have enabled user-space web servers to close the gap with their in-kernel counterparts for static content delivery. However, for predominantly static requests, kernel-space web servers still hold a performance advantage, and further improvements to operating system interfaces and mechanisms are required to improve the performance of user-space web servers.

In-kernel web servers are not as effective in improving the performance on workloads containing dynamic requests. As the proportion of dynamic-content requests increases, the advantages of the kernel-space web servers diminish. Their performance gap with user-space web servers is negligible if most of the requests are for dynamic content.

Given that in-kernel servers do not provide the security and reliability benefits of running in user-space, we recommend that system designers and administrators of web sites with a significant portion of dynamic content examine the implications of migrating web servers to the kernel more carefully. During the course of our tests, TUX crashed several times, resulting in a kernel panic that rendered the server machine unusable. It is typically easier to implement crash-recovery mechanisms for user-space web servers. A crash in the µserver can be fixed by simply restarting it. Configuring TUX to ensure stable behaviour and high performance is a non-trivial task, and debugging problems in TUX requires some familiarity with the kernel.

In the future we intend to compare the µserver with TUX using some of the new operating system modifications that are intended to improve the scalability and performance of user-space web servers. While the µserver can match the performance of TUX on dynamic-content-intensive workloads, using a scalable event notification mechanism or an I/O mechanism such as asynchronous I/O could reduce its performance gap with TUX on static workloads.

Acknowledgements

We gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada, Hewlett Packard, and the Ontario Research and Development Challenge Fund for financial support for portions of this project.

About the Authors

Amol Shukla is an M.Math student in the School of Computer Science at the University of Waterloo. His research interests include Network Server Performance, Autonomic Computing, and System Support for Mobile and Distributed Applications.

Lily Li is an M.A.Sc. student in the Department of Electrical and Computer Engineering at the University of Waterloo. Her main research interests include Wireless Mesh Networks and Ad Hoc Networks.

Anand Subramanian is an M.Math student in the School of Computer Science at the University of Waterloo. His general areas of research are Operating Systems and Networking Protocols; he is particularly interested in operating system interfaces that improve the scalability and performance of applications including network servers.

Paul A.S. Ward is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Waterloo. His research interests are broadly in dependable distributed systems. Specifically he has worked on problems of distributed-systems observation and control, and more recently on autonomic distributed systems. Prior to pursuing his Ph.D. he worked in both the hardware and software industries, covering the range from electronic parking meter design to developing the fast parallel load utility for the DB2 database system.

Tim Brecht obtained his B.Sc. from the University of Saskatchewan in 1983, M.Math from the University of Waterloo in 1985, and Ph.D. from the University of Toronto in 1994. He is currently an Associate Professor at the University of Waterloo. He has previously held positions as an Associate Professor at York University (in Toronto), a Visiting Scientist at IBM's Center for Advanced Studies, and a Research Scientist with Hewlett Packard Labs. Current research interests include Operating Systems; Internet Systems, Services and Applications; Parallel Computing; and Performance Evaluation.

References

[1] M. Arlitt, D. Krishnamurthy, and J. Rolia. Characterizing the scalability of a large web-based shopping system. IEEE/ACM Transactions on Internet Technology, 1, June 2001.

[2] A. Iyengar, E. MacNair, and T. Nguyen. An analysis of web server performance. In Proceedings of GLOBECOM '97, 1997.

[3] V. Almeida and M. Mendes. Analyzing the impact of dynamic pages on the performance of web servers. In Computer Measurement Group Conference, December 1998.

[4] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. Flash: An efficient and portable Web server. In Proceedings of the USENIX 1999 Annual Technical Conference, 1999.

[5] M. Welsh, D. Culler, and E. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth Symposium on Operating Systems Principles, October 2001.

[6] N. Provos and C. Lever. Scalable network I/O in Linux. In Proceedings of USENIX Annual Technical Conference, FREENIX Track, June 2000.

[7] Abhishek Chandra and David Mosberger. Scalability of Linux event-dispatch mechanisms. In Proceedings of the 2001 USENIX Annual Technical Conference, pages 231–244, 2001.

[8] G. Banga, J.C. Mogul, and P. Druschel. A scalable and explicit event delivery mechanism for UNIX. In Proceedings of the 1999 USENIX Annual Technical Conference, June 1999.

[9] Erich Nahum, Tsipora Barzilai, and Dilip Kandlur. Performance issues in WWW servers. IEEE/ACM Transactions on Networking, Vol. 10, February 2002.

[10] Vivek Pai, Peter Druschel, and Willy Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. ACM Transactions on Computer Systems, Vol. 18:37–66, 2000.

[11] M. Ostrowski. A mechanism for scalable event notification and delivery in Linux. Master's thesis, Department of Computer Science, University of Waterloo, November 2000.

[12] Red Hat Inc. TUX 2.2 Reference Manual, 2002. Available at http://www.redhat.com/docs/manuals/tux/TUX-2.2-Manual.

[13] Philippe Joubert, Robert B. King, Richard Neves, Mark Russinovich, and John M. Tracey. High-performance memory-based web servers: Kernel and user-space performance. In USENIX Annual Technical Conference, General Track, pages 175–187, 2001.

[14] Qing Wang, Dwight Makaroff, H. Keith Edwards, and Ryan Thompson. Workload characterization for an E-commerce web site. In Proceedings of CASCON 2003, pages 313–327, 2003.

[15] The µserver home page. HP Labs, 2004. Available at http://hpl.hp.com/research/linux/userver.

[16] Tim Brecht, David Pariag, and Louay Gammo. accept()able strategies for improving web server performance. In Proceedings of the 2004 USENIX Annual Technical Conference, General Track, June 2004.

[17] Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag. Comparing and evaluating epoll, select, and poll event mechanisms. In Proceedings of 6th Annual Linux Symposium, July 2004.

[18] Arjan van de Ven. kHTTPd Linux HTTP accelerator. Available at http://www.fenrus.demon.nl.

[19] Giorgos Gousios and Diomidis Spinellis. A comparison of portable dynamic web content technologies for the Apache web server. In Proceedings of the 3rd International System Administration and Networking Conference SANE 2002, pages 103–119, May 2002.

[20] J. Millaway and P. Conrad. C Server Pages: An architecture for dynamic web content generation. In WWW2003, May 2003.

[21] C. Lever, M. Eriksen, and S. Molloy. An Analysis of the TUX Web Server. Technical report, University of Michigan, CITI Technical Report 00-8, Nov. 2000.

[22] SPECweb99 Dynamic API module source code. Standard Performance Evaluation Corporation. Available at http://www.specbench.org/web99/results/api-src/Dell-20030527-RHCA.tgz.

[23] SPECweb99 Benchmark. Standard Performance Evaluation Corporation. Available at http://www.specbench.org/web99/.

[24] Erich M. Nahum. Deconstructing SPECweb99. In the 7th International Workshop on Web Content Caching and Distribution (WCW), August 2001.

[25] SPECweb99 Design Document. Standard Performance Evaluation Corporation. Available at http://www.specbench.org/web99/docs/whitepaper.html.