Proceedings of the IEEE International Conference on Computer Communications and Networks, Miami, Florida, pp. 63-68, October 2002.

Implementation and Evaluation of Transparent Fault-Tolerant Web Service with Kernel-Level Support

Navid Aghdaie and Yuval Tamir
Concurrent Systems Laboratory, UCLA Computer Science Department
Los Angeles, California 90095
{navid,tamir}@cs.ucla.edu

Abstract -- Most of the techniques used for increasing the availability of web services do not provide fault tolerance for requests being processed at the time of server failure. Other schemes require deterministic servers or changes to the web client. These limitations are unacceptable for many current and future applications of the Web. We have developed an efficient implementation of a client-transparent mechanism for providing fault-tolerant web service that does not have the limitations mentioned above. The scheme is based on a hot standby backup server that maintains logs of requests and replies. The implementation includes modifications to the kernel and to the Apache web server, using their respective module mechanisms. We describe the implementation and present an evaluation of the impact of the backup scheme in terms of throughput, latency, and CPU processing cycles overhead.

I. INTRODUCTION

Web servers are increasingly used for critical applications where outages or erroneous operation are unacceptable. In most cases critical services are provided using a three-tier architecture, consisting of: client web browsers, one or more replicated front-end servers (e.g. Apache), and one or more back-end servers (e.g. a database). HTTP over TCP/IP is the predominant protocol used for communication between clients and the web server. The front-end web server is the mediator between the clients and the back-end server.

Fault tolerance techniques are often used to increase the reliability and availability of Internet services. Web servers are often stateless -- they do not maintain state information from one client request to the next. Hence, most existing web server fault tolerance schemes simply detect failures and route future requests to backup servers. Examples of such fault tolerance techniques include the use of specialized routers and load balancers [4, 5, 12, 14] and data replication [6, 28]. These methods are unable to recover in-progress requests since, while the web server is stateless between transactions, it does maintain important state from the arrival of the first packet of a request to the transmission of the last packet of the reply. With the schemes mentioned above, the client never receives complete replies to the in-progress requests and has no way to determine whether or not a requested operation has been performed [1, 15, 16] (see Figure 1).

Figure 1: Interaction between the client, the web server, and the back-end (steps 1-4). If the web server fails before sending the client reply (step 4), the client can not determine whether the failure was before or after the web server's communication with the back-end (steps 2, 3).

Some recent work does address the need for handling in-progress transactions. Client-aware solutions such as [16, 23, 26] require modifications to the clients to achieve their goals. Since many versions of the client software, the web browser, are widely distributed and they are typically developed independently of the web service, it is critical that any fault tolerance scheme used be transparent to the client. Schemes for transparent server replication [3, 7, 18, 25] sometimes require deterministic servers for reply generation or do not recover requests whose processing was in progress at the time of failure. We discuss some of these solutions in more detail in Sections II and V.

We have previously developed a scheme for client-transparent fault-tolerant web service that overcomes the disadvantages of existing schemes [1]. The scheme is based on logging of HTTP requests and replies to a hot standby backup server. Our original implementation was based on user-level proxies, required non-standard features of the Solaris raw socket interface, and was never integrated with a real web server. That implementation did not require any kernel modifications but incurred high processing overhead.

The contribution of this paper is a more efficient implementation of the scheme on Linux, based on kernel modifications and its integration with the Apache web server using Apache's module mechanism. The small modifications to the kernel are used to provide client-transparent multicast of requests to a primary server and a backup server as well as the ability to continue transmission of a reply to the client despite server failure. Our implementation is based on off-the-shelf hardware (PC, router) and software (Linux, Apache). We rely on the standard reliability features of TCP and do not make any changes to the protocol or its implementation.

In Section II we present the architecture of our scheme and key design choices. Section III discusses our implementation based on kernel and web server modules. A detailed analysis of the performance results, including throughput, latency, and consumed processing cycles, is presented in Section IV. Related work is discussed in Section V.

II. TRANSPARENT FAULT-TOLERANT WEB SERVICE

In order to provide client-transparent fault-tolerant web service, a fault-free client must receive a valid reply for every request that is viewed by the client as having been delivered. Both the request and the reply may consist of multiple TCP packets. Once a request TCP packet has been acknowledged to the client, it must not be lost. All reply TCP packets sent to the client must form consistent, correct replies to prior requests.

We assume that only a single server host at a time may fail. We further assume that hosts are fail-stop [24]. Hence, host failure is detected using standard techniques, such as periodic heartbeats. Techniques for dealing with failure modes other than fail-stop are important but are beyond the scope of this paper. We also assume that the local area network connecting the two servers as well as the Internet connection between the client and the server LAN will not suffer any permanent faults. The primary and backup hosts are connected on the same IP subnet. In practice, the reliability of the network connection to that subnet can be enhanced using multiple routers running protocols such as the Virtual Router Redundancy Protocol [19]. This can prevent the local LAN router from being a critical single point of failure.

In order to achieve the fault tolerance goals, active replication of the servers may be used, where every client request is processed by both servers. While this approach will have the best fail-over time, it suffers from several drawbacks. First, this approach has a high cost in terms of processing power, as every client request is effectively processed twice. A second drawback is that this approach only works for deterministic servers. If the servers generate replies non-deterministically, the backup may not have an identical copy of a reply and thus it can not always continue the transmission of a reply should the primary fail in the midst of sending a reply.

An alternative approach is based on logging. Specifically, request packets are acknowledged only after they are stored redundantly (logged) so that they can be obtained even after a failure of a server host [1, 3]. Since the server may be non-deterministic, none of the packets of a reply can be sent to the client unless the entire reply is safely stored (logged) so that its transmission can proceed despite a failure of a server host [1]. The logging of requests can be done at the level of TCP packets [3] or at the level of HTTP requests [1]. If request logging is done at the level of HTTP requests, the requests can be matched with logged replies so that a request will never be reprocessed following failure if the reply has already been logged [1]. This is critical in order to ensure that for each request only one reply will reach the client. If request logging is done strictly at the level of TCP packets [3], it is possible for a request to be replayed to a spare server following failure despite the fact that a reply has already been sent to the client. Since the spare server may generate a different reply, two different replies for the same request may reach the client, clearly violating the requirement for transparent fault tolerance.

We have previously proposed [1] implementing transparent fault-tolerant web service using a hot standby backup server that logs HTTP requests and replies but does not actually process requests unless the primary server fails. The error control mechanisms of TCP are used to provide reliable multicast of client requests to the primary and backup. All client request packets are logged at the backup before arriving at the primary, and the primary reliably forwards a copy of the reply to the backup before sending it to the client. Upon failure of the primary, the backup seamlessly takes over, receiving partially received requests and transmitting logged replies. The backup processes logged requests for which no reply has been logged and any new requests.

Since our scheme is client-transparent, clients communicate with a single server address (the advertised address) and are unaware of server replication [1]. The backup server receives all the packets sent to the advertised address and forwards a copy to the primary server. For client transparency, the source addresses of all packets received by the client must be the advertised address. Hence, when the primary sends packets to the clients, it ``spoofs'' the source address, using the service's advertised address instead of its own as the source address. The primary logs replies by sending them to the backup over a reliable (TCP) connection and waiting for an acknowledgment before sending them to the client. This paper uses the same basic scheme, but the focus here is on the design and evaluation of a more efficient implementation based on kernel modifications.
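To make the failover step concrete, the following sketch outlines what the backup does with its logs when it takes over from a failed primary, as described earlier in this section. It is an illustration of the scheme only, not the authors' code; the log structures and the helper functions (find_logged_reply, resume_reply_transmission, process_request_and_reply) are hypothetical.

    /* Illustrative takeover logic at the backup after primary failure.
     * All structures and helpers below are hypothetical. */
    struct conn_id { unsigned int client_ip; unsigned short client_port; };

    struct logged_request { struct conn_id id; char *http_request; struct logged_request *next; };
    struct logged_reply   { struct conn_id id; char *http_reply; unsigned long len; };

    extern struct logged_request *request_log;   /* requests logged before the primary saw them */
    extern struct logged_reply *find_logged_reply(struct conn_id *id);
    extern void resume_reply_transmission(struct conn_id *id, const char *reply, unsigned long len);
    extern void process_request_and_reply(struct logged_request *req);

    void backup_take_over(void)
    {
        struct logged_request *req;

        for (req = request_log; req != NULL; req = req->next) {
            struct logged_reply *rep = find_logged_reply(&req->id);

            if (rep != NULL) {
                /* The primary already generated and logged this reply:
                 * continue its transmission and never reprocess the request,
                 * so the client sees exactly one reply. */
                resume_reply_transmission(&rep->id, rep->http_reply, rep->len);
            } else {
                /* No logged reply: the request can safely be processed now. */
                process_request_and_reply(req);
            }
        }
        /* New requests arriving after takeover are handled as in a normal,
         * unreplicated server. */
    }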
III. IMPLEMENTATION

There are many different ways to implement the scheme described in Section II. As mentioned earlier, we have previously done this based on user-level proxies, without any kernel modifications [1]. A proxy-based implementation is simpler and potentially more portable than an implementation that requires kernel modification, but it incurs higher performance overhead (Section IV). It is also possible to implement the scheme entirely in the kernel in order to minimize the overhead [22]. However, it is generally desirable to minimize the complexity of the kernel [8, 17]. Furthermore, the more modular approach described in this paper makes it easier to port the implementation to other kernels or other web servers.

Our current implementation consists of a combination of kernel modifications and modifications to the user-level web server (Figure 2). TCP/IP packet operations are performed in the kernel and the HTTP message operations are performed in the web servers. We have not implemented the back-end portion of the three-tier structure. This can be done as a mirror image of the front-end communication [1]. Furthermore, since the transparency of the fault tolerance scheme is not critical between the web server and back-end servers, simpler and less costly schemes are possible for this section. For example, the front-end servers may include a transaction ID with each request to the back-end. If a request is retransmitted, it will include the transaction ID and the back-end can use that to avoid performing the transaction multiple times [20].
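The transaction-ID idea for the (unimplemented) back-end tier can be illustrated with a small sketch: the back-end remembers the result of each completed transaction ID, so a retransmitted request is answered from the record instead of being executed twice. The table size, run_transaction, and all names here are hypothetical.

    /* Hypothetical back-end duplicate suppression keyed by transaction ID. */
    #define TXN_TABLE_SIZE 1024

    struct txn_entry {
        unsigned long id;
        int           used;
        char          result[256];
    };

    static struct txn_entry txn_table[TXN_TABLE_SIZE];

    extern void run_transaction(const char *request, char *result, unsigned long result_size);

    const char *backend_execute(unsigned long txn_id, const char *request)
    {
        struct txn_entry *e = &txn_table[txn_id % TXN_TABLE_SIZE];

        if (e->used && e->id == txn_id)
            return e->result;                 /* retransmission: do not re-run it */

        run_transaction(request, e->result, sizeof(e->result));
        e->id = txn_id;
        e->used = 1;
        return e->result;
    }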

Figure 2: Implementation: replication using a combination of kernel and web server modules. Message paths between the client, the backup (kernel module and server module), and the primary (kernel module and server module) are shown.

A. The Kernel Module

The kernel module implements the client-transparent atomic multicast mechanism between the client and the primary/backup server pair. In addition, it facilitates the transmission of outgoing messages from the server pair to the client such that the backup can continue the transmission seamlessly if the primary fails.

The public address of the service known to clients is mapped to the backup server, so the backup will receive the client packets. After an incoming packet goes through the standard kernel operations, such as checksum checking, and just before the TCP state change operations are performed, the backup's kernel module forwards a copy of the packet to the primary. The backup's kernel then continues the standard processing of the packet, as does the primary's kernel with the forwarded packet.

Outgoing packets to the client are sent by the primary. Such packets must be presented to the client with the service public address as the source address. Hence, the primary's kernel module changes the source address of outgoing packets to the public address of the service. On the backup, the kernel processes the outgoing packet and updates the kernel's TCP state, but the kernel module intercepts and drops the packet when it reaches the device queue. TCP acknowledgments for outgoing packets are, of course, incoming packets and they are multicast to the primary and backup as above.

The key to our multicast implementation is that when the primary receives a packet, it is assured that the backup has an identical copy of the packet. The backup forwards a packet only after the packet passes through the kernel code where a packet may be dropped due to a detected error (e.g., checksum) or heavy load. If a forwarded packet is lost en route to the primary, the client does not receive an acknowledgment and thus retransmits the packet. This is because only the primary's TCP acknowledgments reach the client. TCP acknowledgments generated by the backup are dropped by the backup's kernel module.
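The following is a minimal sketch of how such hooks could be written with the netfilter framework of the Linux 2.4 kernel used in this work. It is not the authors' module: the configuration of advertised_addr, primary_dev, and is_primary, the exact hook points, and the omission of TCP checksum fixup and MAC-level forwarding details are all simplifications for illustration.

    /* Sketch only: client-transparent packet handling with netfilter hooks
     * (Linux 2.4 style). Configuration of advertised_addr, primary_dev and
     * is_primary is omitted; TCP checksum fixup is omitted. */
    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/ip.h>
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <net/checksum.h>

    static int is_primary;                  /* 1 on the primary, 0 on the backup */
    static __u32 advertised_addr;           /* service address, network byte order */
    static struct net_device *primary_dev;  /* private link to the primary */

    /* Backup: clone a packet addressed to the service and push the copy
     * toward the primary (MAC/routing details omitted). */
    static void forward_to_primary(struct sk_buff *skb)
    {
        struct sk_buff *copy = skb_clone(skb, GFP_ATOMIC);
        if (copy) {
            copy->dev = primary_dev;
            dev_queue_xmit(copy);
        }
    }

    static unsigned int backup_in_hook(unsigned int hooknum, struct sk_buff **pskb,
                                       const struct net_device *in,
                                       const struct net_device *out,
                                       int (*okfn)(struct sk_buff *))
    {
        struct iphdr *iph = (*pskb)->nh.iph;

        if (iph->daddr == advertised_addr)
            forward_to_primary(*pskb);      /* multicast to the primary */
        return NF_ACCEPT;                   /* local TCP processing continues */
    }

    /* Backup: outgoing packets (e.g. its TCP ACKs) are silently discarded so
     * that only the primary ever talks to the client. */
    static unsigned int backup_out_hook(unsigned int hooknum, struct sk_buff **pskb,
                                        const struct net_device *in,
                                        const struct net_device *out,
                                        int (*okfn)(struct sk_buff *))
    {
        return NF_DROP;                     /* TCP state was already updated */
    }

    /* Primary: rewrite the source address of outgoing packets to the
     * advertised service address ("spoofing"). */
    static unsigned int primary_out_hook(unsigned int hooknum, struct sk_buff **pskb,
                                         const struct net_device *in,
                                         const struct net_device *out,
                                         int (*okfn)(struct sk_buff *))
    {
        struct iphdr *iph = (*pskb)->nh.iph;

        iph->saddr = advertised_addr;
        iph->check = 0;
        iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
        return NF_ACCEPT;
    }

    static struct nf_hook_ops in_ops = {
        .hook = backup_in_hook, .pf = PF_INET,
        .hooknum = NF_IP_LOCAL_IN, .priority = NF_IP_PRI_FIRST,
    };
    static struct nf_hook_ops out_ops = {
        .hook = primary_out_hook, .pf = PF_INET,
        .hooknum = NF_IP_POST_ROUTING, .priority = NF_IP_PRI_FIRST,
    };

    int init_module(void)
    {
        if (!is_primary) {
            nf_register_hook(&in_ops);      /* forward incoming packets */
            out_ops.hook = backup_out_hook; /* drop the backup's own replies */
        }
        nf_register_hook(&out_ops);
        return 0;
    }

    void cleanup_module(void)
    {
        if (!is_primary)
            nf_unregister_hook(&in_ops);
        nf_unregister_hook(&out_ops);
    }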
B. The Server Module

The server module is used to handle the parts of the scheme that deal with messages at the HTTP level. The Apache module acts as a handler [27] and generates the replies that are sent to the clients. It is composed of worker, mux, and demux processes.

Figure 3: Server Structure: The mux/demux processes are used to reliably transmit a copy of the replies to the backup before they are sent to clients. The server module implements these processes and the necessary changes to the standard worker processes.

1) Worker Processes: A standard Apache web server consists of several processes handling client requests. We refer to these standard processes as worker processes. In addition to the standard handling of requests, in our scheme the worker processes also communicate with the mux/demux processes described in the next subsection.

The primary worker processes receive the client request, perform parsing and other standard operations, and then generate the reply. Other than a few new bookkeeping operations, these operations are exactly what is done in a standard web server. After generating the reply, instead of sending the reply directly to the client, the primary worker processes pass the generated reply to the primary mux process so that it can be sent to the backup. The primary worker process then waits for an indication from the primary demux process that an acknowledgment has been received from the backup, signaling that it can now send the reply to the client.

The backup worker processes perform the standard operations for receiving a request, but do not generate the reply. Upon receiving a request and performing the standard operations, the worker process just waits for a reply from the backup demux process. This is the reply that is produced by a primary worker process for the same client request.

2) Mux/Demux Processes: The mux/demux processes ensure that a copy of the reply generated by the primary is sent to and received by the backup before the transmission of the reply to the client starts. This allows the backup to seamlessly take over for the primary in the event of a failure, even if the replies are generated non-deterministically. The mux/demux processes communicate with each other over a TCP connection, and use semaphores and shared memory to communicate with worker processes on the same host (Figure 3). A connection identifier (the client's IP address and TCP port number) is sent along with the replies and acknowledgments so that the demux process on the remote host can identify the worker process with the matching request.
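A condensed sketch of the primary-side handshake between a worker, the mux process, and the backup is shown below. Real code must multiplex many concurrent workers and handle partial reads and writes as well as failures; the single reply slot, the semaphore names, and the message framing here are illustrative assumptions only.

    /* Illustrative primary-side reply replication: worker -> mux -> backup,
     * then backup ack -> demux -> worker. Single-slot version for clarity;
     * the semaphores are assumed to be initialized process-shared
     * (sem_init(..., 1, 0)) in a shared memory segment at startup. */
    #include <semaphore.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    struct conn_id {                      /* identifies the client connection */
        uint32_t client_ip;
        uint16_t client_port;
    };

    struct reply_slot {                   /* lives in a shared memory segment */
        struct conn_id id;
        uint32_t       len;
        char           data[64 * 1024];
        sem_t          reply_ready;       /* posted by worker, waited on by mux */
        sem_t          backup_acked;      /* posted by demux, waited on by worker */
    };

    /* Worker: hand the generated reply to the mux and wait until the backup
     * has acknowledged it before answering the client. */
    void worker_send_reply(struct reply_slot *slot, struct conn_id id,
                           const char *reply, uint32_t len, int client_fd)
    {
        slot->id = id;
        slot->len = len;
        memcpy(slot->data, reply, len);
        sem_post(&slot->reply_ready);
        sem_wait(&slot->backup_acked);    /* reply is now logged at the backup */
        write(client_fd, reply, len);
    }

    /* Mux: ship (connection id, length, reply) to the backup over TCP. */
    void mux_loop(struct reply_slot *slot, int backup_fd)
    {
        for (;;) {
            sem_wait(&slot->reply_ready);
            write(backup_fd, &slot->id, sizeof(slot->id));
            write(backup_fd, &slot->len, sizeof(slot->len));
            write(backup_fd, slot->data, slot->len);
        }
    }

    /* Demux: read acknowledgments (connection ids) from the backup and wake
     * the worker waiting on the matching request. */
    void demux_loop(struct reply_slot *slot, int backup_fd)
    {
        struct conn_id acked;

        while (read(backup_fd, &acked, sizeof(acked)) == sizeof(acked)) {
            /* With many workers, the connection id would be used to locate
             * the right slot; here there is only one. */
            sem_post(&slot->backup_acked);
        }
    }

On the backup, a corresponding demux process receives each (connection identifier, reply) pair, logs the reply, hands it to the waiting backup worker, and returns the acknowledgment to the primary.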

IV. PERFORMANCE EVALUATION

The evaluation of the scheme was done on 350 MHz Intel Pentium II PCs interconnected by a 100 Mb/sec switched network based on a Cisco 6509 switch. The servers were running our modified Linux 2.4.2 kernel and the Apache 1.3.23 web server with logging turned on and with our kernel and server modules installed. We used custom clients similar to those of the Wisconsin Proxy Benchmark [2] for our measurements. The clients continuously generate one outstanding HTTP request at a time with no think time. For each experiment, the requests were for files of a specific size as presented in our results. Internet traffic studies [13, 10] indicate that most web replies are less than 10-15 kbytes in size. Measurements were conducted on at least three system configurations: unreplicated, simplex, and duplex. The ``unreplicated'' system is the standard system with no kernel or web server modifications. The ``simplex'' system includes the kernel and server modifications but there is only one server, i.e., incoming packets are not really multicast and outgoing packets are not sent to a backup before transmission to the client. The extra overhead of ``simplex'' relative to ``unreplicated'' is due mainly to the packet header manipulations and bookkeeping in the kernel module. The ``duplex'' system is the full implementation of the scheme.

Figure 4: Average latency (msec) observed by a client versus reply size (kbytes) for the duplex, simplex, and unreplicated modes. The Reply Overhead line depicts the latency caused by replication of the reply in duplex mode.

A. Latency

Figure 4 shows the average latency on an unloaded server and network from the transmission of a request by the client to the receipt of the corresponding reply by the client. There is only a single client on the network and this client has a maximum of one outstanding request. The results show that the latency overhead relative to the unreplicated system increases with increasing reply size. This is due to the processing of more reply packets. The difference between the ``Reply Overhead'' line and the ``Unreplicated'' line is the time to transmit the reply from the primary to the backup and receive an acknowledgement at the primary. This time accounts for most of the duplex overhead. Note that these measurements exaggerate the relative overhead that would impact a real system since: 1) the client is on the same local network as the server, and 2) the requests are for (cached) static files. In practice, taking into account server processing and Internet communication delays, server response times of hundreds of milliseconds are common. The absolute overhead time introduced by our scheme remains the same regardless of server response times and therefore our implementation overhead is only a small fraction of the overall response time seen by clients.

B. Throughput

Figure 5 shows the peak throughput of a single pair of server hosts for different reply sizes. The throughputs of ``unreplicated'' and ``simplex'' (in Mbytes/sec) increase until the network becomes the bottleneck. However, the duplex mode throughput peaks at less than half of that amount. This is due to the fact that on the primary, the sending of the reply to the backup by the server module and the sending of the reply to the clients (Figure 2) occur over the same physical link. Hence, the throughput to the clients is reduced by half in duplex mode. To avoid this bottleneck, the transmission of the replies from the primary to the backup can be performed on a separate dedicated link. A high-speed Myrinet [9] LAN was available to us and was used for this purpose in measurements denoted by ``duplex-mi''. These measurements show a significant throughput improvement over the duplex results, as a throughput of about twice that of duplex mode with a single network interface is achieved.

C. Processing Overhead

Table 1 shows the CPU cycles used by the servers to receive one request and generate a reply. These measurements were done using the processor's performance monitoring counters [21]. For each configuration the table presents the kernel-level, user-level, and total cycles used. The cpu% column shows the CPU utilization at peak throughput, and indicates that the system becomes CPU bound as the reply size decreases. This explains the throughput results, where lower throughputs (in Mbytes/sec) were reached with smaller replies.

Based on Table 1, the duplex server (primary and backup combined) can require more than four times (for the 50 kbyte reply) as many cycles to handle a request compared with the unreplicated server. However, as noted in the previous subsection, these measurements are for replies generated by reading cached static files. In practice, for likely applications of this technology (dynamic content), replies are likely to be smaller and require significantly more processing. Hence, the actual relative processing overhead can be expected to be much lower than the factor of 4 shown in the table.

D. Comparison with a User-Level Implementation

As mentioned earlier, our original implementation of this fault tolerance scheme was based on user-level proxies, without any kernel modifications [1]. Table 2 shows a comparison of the processing overhead of the user-level proxy approach with the implementation presented in this paper. This comparison is not perfectly accurate. While both schemes were implemented on the same hardware, the user-level proxy approach runs under the Solaris operating system and could not be easily ported to Linux due to a difference in the semantics of raw sockets. In addition, the server programs are different although they do similar processing. However, the difference of almost a factor of 5 is clearly due mostly to the difference in the implementation of the scheme, not to OS differences. The large overhead of the proxy approach is caused by the extraneous system calls and message copying that are necessary for moving the messages between the two levels of proxies and the server.
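For intuition about how per-request cycle counts such as those in Tables 1 and 2 can be obtained, the fragment below reads the x86 time-stamp counter around the handling of one request. The paper's measurements use the Linux perfctr driver [21], which additionally separates user-level and kernel-level cycles; this fragment is only a simplified illustration, and handle_one_request is a placeholder.

    /* Simplified cycle accounting with the x86 time-stamp counter (rdtsc).
     * The actual measurements in the paper use the perfctr driver [21]. */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static void handle_one_request(void)
    {
        /* placeholder for the request receive / reply generation path */
    }

    int main(void)
    {
        uint64_t start = rdtsc();
        handle_one_request();
        uint64_t elapsed = rdtsc() - start;

        printf("cycles per request: %llu (%.0f thousand)\n",
               (unsigned long long)elapsed, elapsed / 1000.0);
        return 0;
    }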

Figure 5: System throughput (in requests per second and in Mbytes per second) versus reply size (kbytes) for the unreplicated, simplex, duplex-mi, and duplex modes. The duplex-mi line denotes the setting with multiple network interfaces for each server; one interface is used only for reply replication.
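A rough consistency check, assuming the nominal rate of the 100 Mb/sec test network, explains the gap between the duplex and duplex-mi curves: 100 Mb/sec is about 12.5 Mbytes/sec, so when reply replication to the backup and reply transmission to the clients share one link in duplex mode, the client-visible throughput is bounded by roughly 12.5/2, or about 6.3 Mbytes/sec, while the unreplicated and simplex configurations can approach the full link rate. This matches the duplex curve peaking at less than half of the unreplicated peak and duplex-mi roughly doubling the duplex throughput.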

TABLE 1: Breakdown of used CPU cycles (in thousands). The cpu% column indicates CPU utilization during peak throughput.

                           1 kbyte reply              10 kbyte reply             50 kbyte reply
  System Mode          user  kernel  total  cpu%   user  kernel  total  cpu%   user  kernel  total  cpu%
  Duplex (primary)      190    337     527   100    193    587     780    77    224   1548    1772    53
  Duplex (backup)       147    330     477    91    158    615     773    76    185   1790    1958    58
  Duplex-mi (primary)   192    353     545   100    198    544     742    85    225   1283    1508    85
  Duplex-mi (backup)    147    355     502    93    152    545     697    80    169   1124    1293    72
  Simplex               186    250     436   100    191    365     556    99    208    871    1079    70
  Unreplicated          165    230     395   100    166    342     508    99    178    730     908    60
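As a check of the ``more than four times'' figure quoted in the processing overhead discussion above: for the 50 kbyte reply, the duplex pair uses 1772 + 1958 = 3730 thousand cycles per request, versus 908 thousand for the unreplicated server, a ratio of about 4.1.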

TABLE 2: User-level versus kernel support -- CPU cycles (in thousands) for processing a request that generates a 1 kbyte reply.

  Implementation           Primary   Backup   Total
  User-level Proxies          1860     1370    3230
  Kernel/Server Modules        337      330     667

V. RELATED WORK

Early work in this field, such as Round Robin DNS [11] and DNS aliasing methods, focused on detecting a fault and routing future requests to available servers. Centralized schemes, such as the Magic Router [4] and Cisco Local Director [12], require request packets to travel through a central router where they are routed to the desired server. Typically the router detects server failures and does not route packets to servers that have failed. The central router is a single point of failure and a performance bottleneck since all packets must travel through it. Distributed Packet Rewriting [7] avoids having a single entry point by allowing the servers to send messages directly to clients and by implementing some of the router logic in the servers so that they can forward the requests to different servers. None of these schemes support recovering requests that were being processed when the failure occurred, nor do they deal with non-deterministic and non-idempotent requests.

There are various server replication schemes that are not client transparent. Most still do not provide recovery of requests that were partially processed. Frolund and Guerraoui [16] do recover such requests. However, the client must retransmit the request to multiple servers upon failure detection and must be aware of the addresses of all instances of the replicated servers. A consensus agreement protocol is also required for the implementation of their ``write-once registers'', which could be costly, although it allows recovery from non fail-stop failures. Our kernel module can be seen as an alternative implementation of the write-once registers which also provides client transparency. Zhao et al [29] describe a CORBA-based infrastructure for replication in three-tier systems which deals with the same issues, but again is not client-transparent.

The work by Snoeren et al [26] is another example of a solution that is not transparent to the client. A transport layer protocol with connection migration capabilities, such as SCTP or TCP with proposed extensions, is used along with a session state synchronization mechanism between servers to achieve connection-level failover. The requirement to use a specialized transport layer protocol at the client is obviously not transparent to the client.

HydraNet-FT [25] uses a scheme that is similar to ours. It is client-transparent and can recover partially processed requests. The HydraNet-FT scheme was designed to deal with server replicas that are geographically distributed. As a result, it must use specialized routers (``redirectors'') to get packets to their destinations. These redirectors introduce a single point of failure similar to the Magic Router scheme. Our scheme is based on the ability to place all server replicas on the same subnet [1]. As a result, we can use off-the-shelf routers, and multiple routers can be connected to the same subnet and configured to work together to avoid a single point of failure. Since HydraNet-FT uses active replication, it can only be used with deterministic servers, while our standby backup scheme does not have this limitation.

Alvisi et al implemented FT-TCP [3], a kernel-level TCP wrapper that transparently masks server failures from clients. While this scheme and its implementation are similar to ours, there are important differences. Instead of our hot standby spare approach, a logger running on a separate processor is used. If used for web service fault tolerance, FT-TCP requires deterministic servers (see Section II) and significantly longer recovery times. In addition, they did not evaluate their scheme in the context of web servers.

VI. CONCLUSION

We have proposed a client-transparent fault tolerance scheme for web services that correctly handles all client requests in spite of a web server failure. Our scheme is compatible with existing three-tier architectures and can work with non-deterministic and non-idempotent servers. We have implemented the scheme using a combination of Linux kernel modifications and modifications to the Apache web server. We have shown that this implementation involves significantly lower overhead than a strictly user-level proxy-based implementation of the same scheme. Our evaluation of the response time (latency) and processing overhead shows that the scheme does introduce significant overhead compared to a standard server with no fault tolerance features. However, this result only holds if generating the reply requires almost no processing. In practice, for the target application of this scheme, replies are often small and are dynamically generated (requiring significant processing). For such workloads, our results imply low relative overheads in terms of both latency and processing cycles. We have also shown that in order to achieve maximum throughput it is critical to have a dedicated network connection between the primary and backup.

REFERENCES

[1] N. Aghdaie and Y. Tamir, ``Client-Transparent Fault-Tolerant Web Service,'' Proceedings of the 20th IEEE International Performance, Computing, and Communications Conference, Phoenix, Arizona, pp. 209-216 (April 2001).
[2] J. Almeida and P. Cao, ``Wisconsin Proxy Benchmark,'' Technical Report 1373, Computer Sciences Dept, Univ. of Wisconsin-Madison (April 1998).
[3] L. Alvisi, T. C. Bressoud, A. El-Khashab, K. Marzullo, and D. Zagorodnov, ``Wrapping Server-Side TCP to Mask Connection Failures,'' Proceedings of IEEE INFOCOM, Anchorage, Alaska, pp. 329-337 (April 2001).
[4] E. Anderson, D. Patterson, and E. Brewer, ``The Magicrouter, an Application of Fast Packet Interposing,'' Class Report, UC Berkeley - http://www.cs.berkeley.edu/~eanders/projects/magicrouter/ (May 1996).
[5] D. Andresen, T. Yang, V. Holmedahl, and O. H. Ibarra, ``SWEB: Towards a Scalable World Wide Web Server on Multicomputers,'' Proceedings of the 10th International Parallel Processing Symposium, Honolulu, Hawaii, pp. 850-856 (April 1996).
[6] S. M. Baker and B. Moon, ``Distributed Cooperative Web Servers,'' The Eighth International World Wide Web Conference, Toronto, Canada, pp. 1215-1229 (May 1999).
[7] A. Bestavros, M. Crovella, J. Liu, and D. Martin, ``Distributed Packet Rewriting and its Application to Scalable Server Architectures,'' Proceedings of the International Conference on Network Protocols, Austin, Texas, pp. 290-297 (October 1998).
[8] D. L. Black, D. B. Golub, D. P. Julin, R. F. Rashid, and R. P. Draves, ``Microkernel Operating System Architecture and Mach,'' Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, Berkeley, CA, pp. 11-30 (April 1992).
[9] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, ``Myrinet: A Gigabit-per-Second Local Area Network,'' IEEE Micro 15(1), pp. 29-36 (February 1995).
[10] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, ``Web Caching and Zipf-like Distributions: Evidence and Implications,'' Proceedings of IEEE INFOCOM, New York, New York (March 1999).
[11] T. Brisco, ``DNS Support for Load Balancing,'' IETF RFC 1794 (April 1995).
[12] Cisco Systems Inc, ``Scaling the Internet Web Servers,'' Cisco Systems White Paper - http://www.ieng.com/warp/public/cc/pd/cxsr/400/tech/scale_wp.htm.
[13] C. Cunha, A. Bestavros, and M. Crovella, ``Characteristics of World Wide Web Client-based Traces,'' Technical Report TR-95-010, Boston University, CS Dept, Boston, MA 02215 (April 1995).
[14] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari, ``A Scalable and Highly Available Web Server,'' Proceedings of IEEE COMPCON '96, San Jose, California, pp. 85-92 (1996).
[15] S. Frolund and R. Guerraoui, ``CORBA Fault-Tolerance: Why It Does Not Add Up,'' Proceedings of the IEEE Workshop on Future Trends of Distributed Systems (December 1999).
[16] S. Frolund and R. Guerraoui, ``Implementing e-Transactions with Asynchronous Replication,'' IEEE International Conference on Dependable Systems and Networks, New York, New York, pp. 449-458 (June 2000).
[17] D. Golub, R. Dean, A. Forin, and R. Rashid, ``Unix as an Application Program,'' Proceedings of Summer USENIX, pp. 87-96 (June 1990).
[18] C. T. Karamanolis and J. N. Magee, ``Configurable Highly Available Distributed Services,'' Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, pp. 118-127 (September 1995).
[19] S. Knight, D. Weaver, D. Whipple, R. Hinden, D. Mitzel, P. Hunt, P. Higginson, M. Shand, and A. Lindem, ``Virtual Router Redundancy Protocol,'' RFC 2338, IETF (April 1998).
[20] Oracle Inc, Oracle8i Distributed Database Systems - Release 8.1.5, Oracle Documentation Library (1999).
[21] M. Pettersson, ``Linux x86 Performance-Monitoring Counters Driver,'' http://www.csd.uu.se/~mikpe/linux/perfctr/.
[22] Red Hat Inc, ``TUX Web Server,'' http://www.redhat.com/docs/manuals/tux/.
[23] M. Sayal, Y. Breitbart, P. Scheuermann, and R. Vingralek, ``Selection Algorithms for Replicated Web Servers,'' Performance Evaluation Review - Workshop on Internet Server Performance, Madison, Wisconsin, pp. 44-50 (June 1998).
[24] F. B. Schneider, ``Byzantine Generals in Action: Implementing Fail-Stop Processors,'' ACM Transactions on Computer Systems 2(2), pp. 145-154 (May 1984).
[25] G. Shenoy, S. K. Satapati, and R. Bettati, ``HydraNet-FT: Network Support for Dependable Services,'' Proceedings of the 20th IEEE International Conference on Distributed Computing Systems, Taipei, Taiwan, pp. 699-706 (April 2000).
[26] A. C. Snoeren, D. G. Andersen, and H. Balakrishnan, ``Fine-Grained Failover Using Connection Migration,'' Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems, San Francisco, California (March 2001).
[27] L. Stein and D. MacEachern, Writing Apache Modules with Perl and C, O'Reilly and Associates (March 1999).
[28] R. Vingralek, Y. Breitbart, M. Sayal, and P. Scheuermann, ``Web++: A System For Fast and Reliable Web Service,'' Proceedings of the USENIX Annual Technical Conference, Sydney, Australia, pp. 171-184 (June 1999).
[29] W. Zhao, L. E. Moser, and P. M. Melliar-Smith, ``Increasing the Reliability of Three-Tier Applications,'' Proceedings of the 12th International Symposium on Software Reliability Engineering, Hong Kong, pp. 138-147 (November 2001).
