GREEN CLUSTER OF LOW-POWER EMBEDDED HARDWARE SERVER

ACCELERATORS

NAVID MOHAGHEGH

A DISSERTATION SUBMITTED TO THE FACULTY OF GRADUATE

STUDIES IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTERS OF APPLIED SCIENCE AND ENGINEERING

GRADUATE PROGRAM IN COMPUTER SCIENCE AND ENGINEERING

YORK UNIVERSITY, TORONTO, ONTARIO

NOVEMBER 2011

Abstract

Power consumption is the largest operating expense in any server farm. In this thesis, we provide a cluster of low-cost and low-power embedded hardware accelerators that can perform simple application-level serving tasks (e.g. dynamic and static web hosting). The cluster can either replace powerful servers or be used as extra torque for peak traffic moments. The cluster can boot in less than 10 seconds, allowing rapid deployment into the network. It provides just enough acceleration to meet the service level agreement (SLA) during peak traffic moments, in contrast to bringing a powerful server to the network, which may be an overkill solution for a surge of traffic.

We also propose a new technique for admission control in order to enforce the SLA by dropping selected requests, instead of overloading the entire system and slowing down everybody. Matlab simulations show that our proposed scheme outperforms previously known admission control policies under the M/G/1 system assumption (i.e. memoryless arrivals with a general service-time distribution).

We also implement our system using micro-controller boards as accelerators and Linux as the operating system. We intensively tested our proposed system in order to compare it with state-of-the-art powerful servers, generating real traffic against the cluster. The result is that a tiny accelerator by itself is slower than a powerful server (7 to 11 times slower); however, it consumes only about 1-2% of the energy used by powerful Internet servers. If the objective is not minimizing the time to serve a single request but rather increasing the throughput while maintaining the required SLA, a cluster of embedded controllers can handle the same amount of traffic as a powerful state-of-the-art Internet server. The proposed accelerator cluster uses 8 times less energy than a powerful server while handling the same amount of traffic and producing response times that are only 7 to 11 times slower. Adaptive admission-based controls are also implemented to provide SLA guarantees on the quality of service (QoS) of the cluster.

Table of Contents

1: Introduction
2: Previous Work
2.1 Current State Of The Market
2.1.1 Hard Disk Drive Improvements
2.1.2 Attempts To Make More Efficient Power Supplies
2.1.3 More Cores Per CPU
2.1.4 Extensive Use of Virtual Machines
2.1.5 Using Faster CPUs
2.1.6 Prefetching and Service Differentiation
2.1.7 Using RAM for Random Access Data Intensive Operations
2.1.8 Sleep Modes Instead of Complete Shutdown
2.1.9 Dynamic Voltage Scaling
2.1.10 Dynamic Acceptance Rate on Peak Traffic Hours
2.2 Related Work
2.3 Using Embedded Devices as Server Accelerators
3: Admission Controlled Environment
3.1 Using Admission Control
3.1.1 Using PI Controller
3.1.2 Estimating the System Parameters
3.1.3 Additive Increase Multiplicative Decrease
3.2 Simulations
4: Proposed System Architecture
4.1 System Reference Architecture
4.2 Utilized Software and Protocols
4.2.1 Linux Kernel and Modules
4.2.2 File System, Hard Disk Drives and RAID Topology
4.2.3 TFTP, PXE, NFS
4.2.4 U-BOOT, INITRD, BusyBox
4.2.5 LVS and Routing
4.2.6 FastCGI, SNMP, and PHP
4.3 Admission Control and Load-Balancing
4.4 Detail View of the Proposed Accelerator Cluster
4.5 Power Consumption
5: Testing and Results
5.1 SpecWeb2009
5.2 Apache AB Benchmarker
5.3 Custom Made Benchmarker
5.4 Effectiveness of Admission Control
6: Conclusion
7: Future Work
8: References

CH.1: INTRODUCTION

The use of the Internet is spreading into every corner of our lives, from banking and shopping to the distributed databases used in large corporations [1]. A special group of machines, known as servers, is used to host Internet applications. Unfortunately, most of these servers aggressively consume electricity due to their fast speed and broad capabilities. Power consumption is the largest operating expense in any server farm. The cost of the energy consumed by an ordinary, not very powerful server during its lifetime can be far more than its initial purchase cost [2].

Reducing the energy consumption of processors or complete systems has long been of interest to mobile and wireless system designers seeking to extend the battery life of mobile devices. Recently, however, energy has become a major problem for large data centers and server farms. The cost of energy consumption includes not only the price of the electricity to drive the system but also the price of cooling it. With today's very powerful servers, small form factors, and tight packing, heat density is a major problem, and powerful, expensive cooling systems are required in order to avoid reliability problems.

It is estimated that the total cost of ownership of a rack in a data center is approximately $120K over the data center lifetime. In 2010, TelTub Inc. paid monthly fees of $275 CAD for each 15A circuit breaker and $850 for rental space and cooling of a standard rack mount (standard racks are 6.5 feet high with a capacity of 10 4U servers per rack). Each 2MB/s of bandwidth costs $50 CAD monthly. This leads to a yearly cost of $20K for a standard rack ($120K if the average lifetime of a server is estimated at 6 years) [3].

To address the energy crisis, Google built one of its data centers in Oregon next to a power generation station (The Dalles, OR) [4]. According to Google, the location has the right combination of energy infrastructure. Clearly, there is a need to lower the energy demands of servers, especially in large server farms and data centers.

Power-hungry servers have moving parts, like fans, to keep them cool. Every couple of years, servers in a server farm have to go through maintenance to replace fans or to upgrade the entire computer hardware. It is unlikely that the motherboards in these servers can be reused somewhere else, so eventually they have to go through a costly recycling process or end up in landfills.

Traditional energy-saving techniques like DVFS (Dynamic Voltage and Frequency Scaling) and server shut-off do not work as expected for large servers, for the following reasons:

1. For a complete server shut-off, it takes a long time to bring the server back on-line. Without an almost perfect prediction of the workload, it is difficult to get a noticeable reduction in energy without performance degradation (web traffic is extremely volatile).

2. Usually, a large reduction in frequency is required to see a considerable reduction in energy consumption.

3. Voltage scaling has minimal effect since modern processors work near their minimum voltage anyway.

4. Even if we can reduce the energy consumption at light load, the cooling infrastructure is conditioned for the maximum load anyway.

5. DVFS can reduce the energy consumption of CPUs, but it cannot address the energy consumption of other parts of the system (disks, RAM, ...). We need a new approach to energy saving that covers the entire system, not only the CPUs.

In this thesis we propose a cluster of low-cost and low-power embedded hardware accelerators that can perform simple application-level serving tasks (e.g. dynamic and static web hosting). The cluster can either replace powerful servers or be used as extra torque for peak traffic moments. Previous work and attempts to increase performance and/or decrease power usage are discussed in chapter 2. In chapter 3 we present a new admission control policy; Matlab simulations show that our policy outperforms previously known admission control policies while at the same time requiring much less overhead. The system reference architecture and the dynamic details of the system are discussed in chapter 4. Finally, chapter 5 presents the results of stress testing the proposed cluster using real-life scenarios and web traffic.

CH.2: PREVIOUS WORK

In this chapter we discuss the current state of the market and briefly mention related work that has been done to improve server performance and/or power consumption.

2.1 Current State Of The Market

The current trend in server design is to use clusters as much as possible to avoid wasting energy and to meet quality-of-service (QoS) requirements. Multiple software-based virtual machines are installed on a physical server to further extend the use of precious CPU cycles [5]. Below are the major techniques used to reduce power consumption without huge throughput penalties:

2.1.1 Hard Disk Drive Improvements

Rotational hard disk drives are currently affordable and widely used in the market. However, due to their mechanical and rotational nature, they consume a substantial amount of energy; storage is one of the biggest consumers of energy.

Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time it takes to spin the hard drive down and back up. This significantly limits the effectiveness of disk power management schemes without an almost perfect prediction of the incoming traffic [6]. There have been some efforts to spin hard disk drives up and down in a more granular way [7]. They all focus on reducing energy consumption while avoiding fully stopping the disk rotation. Multiple Idle States (MIS) modulates the disk RPM to optimize energy consumption during idle periods while avoiding a full stop and the costly recovery time and energy consumption of stepper rotors [6].

Recently, solid state drives (SSDs), which have absolutely no moving parts, have become more affordable and durable, and they are slowly phasing out rotational disk drives. Amdahl-balanced blades [8], which use SSDs, achieve five times the throughput of a state-of-the-art computing cluster for data-intensive applications.

2.1.2 Attempts To Make More Efficient Power Supplies

Precisely controlling power consumption is essential to avoid system failures caused by power capacity overload or overheating. However, existing work oversimplifies the problem by controlling a single server independently from the rest of the servers in the system. A cluster-level, adaptive power controller is needed to shift power among servers based on their performance while keeping the total power of the cluster below a constraint [9].

Increasingly high-density servers may lead to a greater probability of thermal failure and hence require additional energy cost for cooling. An average 300 W server could easily consume 2.6 MWh of energy per year, with an additional 1 MWh for cooling [9].

Power provisioning is commonly used in data centers, strictly enforcing various physical power limits. Many data centers are rapidly expanding the number of hosted servers while the capacity upgrades of their power distribution systems lag far behind [10]. There are many attempts to offer more efficient power supplies for data centers and power grids.

However, in this research, we are not focusing on ways to produce more efficient power supplies or on thermodynamic techniques to produce more efficient chillers and cooling towers. We are focusing on the servers in particular, not on the infrastructural parts of the data centers.

2.1.3 More Cores Per CPU

Recently, multi-core processors have become more affordable and popular; we even see commercial processors with 12 generic cores deployed in workstations [11].

There have been attempts to use reconfigurable core architectures to provide a single-chip solution for a broad range of targets. When parallelism is abundant, the system can adapt and save large amounts of energy; at the same time, when single-thread performance is the bottleneck, the system can be reconfigured to provide the necessary throughput. These solutions exploit techniques such as L1 cache clustering and heterogeneous design to achieve energy-efficient processors [12].

Often, due to a lack of software application compatibility, or even of proper hooks at the operating system level, the true capabilities of multi-core processors are not fully utilized. One example is the lack of coordination between the I/O sub-systems and CPU power throttling, especially in Windows-based virtualized environments.

2.1.4 Extensive Use Of Virtual Machines

Currently, virtual machines are extensively used to maximize server utilization. There are fewer attempts to focus on both power and performance optimization at the same time. Even techniques like Optimal Power Management (OPM) [13], which assure service-level agreements such as response time and throughput, control power and application-level performance separately and thus cannot simultaneously provide explicit guarantees on both.

Performance-oriented solutions focus on using power as a knob to meet application-level performance while reducing power consumption in a best-effort manner. Power-oriented solutions treat power as the first-class control target by adjusting hardware power states with no regard to the performance of the application services running on the servers. Simultaneous power and performance control faces several major challenges. First, in today's data centers, a power control strategy may come directly from a server vendor (e.g., IBM) and be implemented in the service processor firmware, without any knowledge of the application software running on the server. A performance controller needs to be implemented in the application software in order to monitor and control the desired application-level performance, without direct access to the system hardware. Therefore, it may not be feasible to have a single centralized controller that controls both power and performance [14].

Techniques like Co-Con [14], a cluster-level control architecture that coordinates the individual power and performance control loops for virtualized server clusters, provide more promising results when virtual machines are involved. Co-Con monitors and controls both the power and the performance factors rigorously, based on feedback control theory, which at least provides theoretically guaranteed control accuracy and system stability.

Virtual machines running on the same physical server are tightly coupled and correlated to the underlying hardware, and a simple state transition of any hardware component will affect the application performance of all the virtual machines. As a result, reducing power based solely on the performance level of one virtual machine may cause another to violate its performance specification. Solutions like PARTIC [15] provide a two-layer control architecture that utilizes CPU throttling while preserving performance. PARTIC's primary control loop adopts a multi-input multi-output control approach to maintain load-balancing among all virtual machines so that they can have approximately the same performance level relative to their allowed peak values. The secondary performance control loop then manipulates the CPU frequency for power efficiency, based on the uniform performance level achieved by the primary loop [15].

Virtualization provides easier and cheaper ways to maintain and administrate a server cluster, but it does not particularly improve power consumption, due to the overhead introduced by the extra task switching caused by multiple virtual machines (pushes and pops).

2.1.5 Using Faster CPUs

All components of a system should be capable of a proportional change in power consumption based on the offered load (for example, by supporting idle/low-power states or energy-proportional performance scaling). Currently, memory and I/O devices are not energy proportional. For example, processors and memory consume 53 percent and 28 percent, respectively, of the power budget of an IBM POWER6-based server, but they consume 41 percent and 46 percent of the power budget of a comparable POWER7-based server. The key reason for this shift is that per-byte dynamic RAM power does not scale as well as per-flop power from one generation of process technology to the next [16]. This shows that placing faster CPUs in servers to host more virtual machines does not necessarily decrease power consumption drastically.

2.1.6 Prefetching And Service Differentiation

A lot of work has been done on improving the performance of web servers and achieving a specific QoS. Earlier work in this area was mainly based either on service differentiation [17][18] or on data prefetching [19]. In service differentiation, customers (requests) are treated differently, giving priority to one type of requests (more important customers) over the others. In prefetching, data we think will be requested soon is fetched ahead of being actually requested. Both of these two techniques can improve the performance of the system (for only one group of requests in the service differentiation case), but there are no guarantees that a specific level of performance is met.

2.1.7 Using RAM For Random Access Data Intensive Operations

Recently, Oracle introduced database servers that mainly use RAM for their storage, as a result of the poor small-file random access performance of rotational-disk-based data centers [20]. Memory-based clusters are very fast but also have a very high power consumption and cost of ownership; for instance, a 2GB DIMM consumes as much power as a 1 TB hard drive! There have been attempts to use clusters of low-power embedded CPUs along with small amounts of local flash storage to provide data-intensive (not computationally intensive) computing [21]. However, the necessity of such complicated clusters is becoming less pronounced as SSDs become more affordable and provide better random access response times.

2.1.8 Sleep Modes Instead Of Complete Shutdown

Servers can be put into sleep mode to consume less power during the less busy time intervals. It is possible to build a mathematical model for solving the optimal power management (OPM) problem used by a batch power scheduler in a server farm. OPM observes the state of a server farm and decides when to switch the operation mode (i.e., active or sleep) of the servers to minimize the power consumption while the performance requirements are met. If the OPM optimization problem is solved using a Constrained Markov Decision Process (CMDP), the job waiting time will be maintained below the maximum threshold while the power consumption is reduced [13]. However, the situation becomes a bit more complex once we are dealing with servers that are running multiple virtual machines.

2.1.9 Dynamic Voltage Scaling

There have been various efforts on combinations of dynamic voltage scaling and server throttling to reduce power consumption [22]. It is not always optimal to run servers at their maximum power levels, and different techniques are used to distribute power among servers dynamically in server farms [23]. However, server consolidation is becoming ubiquitous, and consequently diverse applications are increasingly sharing resources, which drastically limits the ability to throttle hardware resources [24].

2.1.10 Dynamic Acceptance Rate On Peak Traffic Hours

Using load-balancing policies to distribute stateless requests among servers opens the possibility of reaching 10% energy savings. This is purely due to a better distribution of the load among the servers, which avoids exhausting any particular node [25]. Server over-utilization can be avoided by dynamically setting the maximum rate of accepted requests. Systems like QGuard [26] base the admission decision on network-level information such as IP addresses and port numbers. Hence, they cannot take the potential resource consumption of requests into account, but have to reduce the acceptance rate of “all” requests when one resource is over-utilized.

A more advanced approach is to have an adaptive architecture that performs admission control based on the expected resource consumption of requests. This way, individual server resources are protected from over-utilization by dynamically setting the acceptance rate of resource-intensive requests. Resource-intensive requests, and the resources they demand, can be identified by their URL in the HTTP header. Accepted requests can then be processed quickly, since server resources are not over-utilized. This leads to low response times for premium services [18].

2.2 Related Work

A lot of work has been done on improving the performance of web servers and achieving a specific QoS. Earlier work in this area was mainly based either on service differentiation [17] or on data prefetching [19]. In service differentiation, customers (requests) are treated differently, giving priority to one type of requests (e.g. more important customers) over the others. In prefetching, data that we think will be requested soon is fetched ahead of being actually requested. Both of these techniques can improve the performance of the system (for only one group of requests in the service differentiation case), but there are no guarantees that a specific level of performance is met.

Service differentiation is combined with admission control in [27]. They classified incoming requests into two categories, and admission control is based on the queue size of each category and some real-time system measurements. They tested their system using Apache with static content and some basic forms of dynamic content.

A self-tuning controller is proposed in [28]. They used a queuing model known as the processor-sharing model and a proportional-integral (PI) controller to satisfy a target response time. Their queuing model is M/G/1, where the response time is given by the equation:

    T_{rt} \;=\; \frac{E[X]}{1 - \lambda E[X]} \qquad (1)

where λ is the arrival rate (assumed to be Poisson) and E[X] is the expected value of the service time. They also linearize the model around the operating point. Equation 1, which describes the system, is valid only in the steady state and for stable queues (stable means the average arrival rate is less than the average service rate). For short time periods and in heavy traffic, the arrival rate may be much greater than the service rate. Although the objective of the controller is to avoid such a case, when it does happen the equation used to model the system is no longer valid.

The authors in [24] proposed an admission control scheme to regulate the response time of the server. In their model they used Eq. 1 to represent the system. They also proposed an adaptive control scheme where the model parameters are estimated online (using the RLS technique) and used to modify the controller parameters [29]. In [30] the authors proposed an adaptive architecture that performs admission control; their technique depends on using a TCP SYN policer and an HTTP header-based connection control in order to maintain the required response-time limit.

Malarait et al. in [31] proposed a nonlinear continuous-time model using a fluid approximation for the server. They used this model to obtain an optimal configuration of the server that achieves maximum availability under performance constraints and maximum performance under availability constraints. They also validated their design using the TPC-C benchmark.

Elnikety et al. in [32] proposed a proxy called Gatekeeper to perform admission control as well as user-level request scheduling. Their idea depends on estimating the cost of a request, then deciding whether admitting that request would exceed the system capacity (the system capacity is determined by offline profiling). They noted that since the proxy is external to the server, no server modification is required for their gatekeeper.

Guitart in [33] proposed a session-based admission control for secure environments. In their technique, they gave preference to connections that could reuse existing SSL connections on the server. They also estimated the service time of incoming requests in order to prevent overloading the server.

Blanquer et al. in [34] proposed a software solution for QoS provisioning for large-scale Internet servers. They proposed the use of traffic shaping and admission control, together with response-time monitoring and weighted fair queuing, in order to guarantee the required QoS. For an excellent review of performance management for Internet applications, the reader is referred to [35].

Almost everyone who used control theory also used the average response time as the parameter to control. One major problem with that is that the requested QoS is not stated in terms of an average response time or average delay. The QoS usually takes the form of a percentage of requests that must have a response time better than a certain amount (e.g. 99% of requests should have response times of less than 2000 ms), and guarantees on the "average response time" do not solve this problem.
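Stated formally, the guarantee of interest is a constraint on the tail of the response-time distribution rather than on its mean; in our notation (T is the response time of a random request, T_max the deadline, q the required fraction):

    \Pr\{\, T \le T_{\max} \,\} \;\ge\; q, \qquad \text{e.g. } q = 0.99,\; T_{\max} = 2000\ \text{ms}

A controller that regulates only E[T] can meet a mean target while an arbitrarily heavy tail still violates this constraint, which is why later chapters also control the percentage of conforming requests directly.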

2.3 Using Embedded Devices As Server Accelerators

Embedded computers, in comparison to powerful servers, are tiny specialized machines that are usually much slower and consume very little energy. They are used in small systems that have to draw less power and sometimes have to be smaller in size. Obviously, due to their embedded nature and size, they are harder to set up and work with.

There have always been attempts to build energy-efficient servers using small low-power devices. For instance, a static web server built using a small Intel XScale-based processor can achieve more than 1.7 times higher energy efficiency while its throughput (web pages per second) is only 20% lower than an equivalent Intel Pentium 4 based design [36]. There have also been attempts to convert network router processors into small embedded servers [37]. However, there are not many success stories, due to the substantial challenges of dealing with embedded devices. The issue becomes particularly pronounced when dealing with today's volatile and highly resource-intensive services.

Low-powered designs (e.g. Intel Atom based servers) may not help to improve the energy efficiency of server hardware (e.g. in designs with always-on and fully utilized CPU resources); the assumption that a low-power design is inherently better for energy efficiency is wrong [38]. At the same time, throwing powerful servers at peak traffic times is overkill and leads to wasted power.

In this research, we propose the use of very cheap as well as very energy-efficient micro-controllers/microprocessors as servers or server accelerators. As we show, there is still room for improvement by bringing even lower-power servers or embedded devices into the picture. We strongly believe that simple tasks like servicing web requests can be done in much simpler and more energy-efficient ways. The System Architecture chapter discusses the overall architecture and the components we used in this research. Our cluster can be booted in a few seconds, which enables fine-grained power throttling if needed.

CH.3: ADMISSION CONTROLLED ENVIRONMENT

In this chapter we discuss broadly used admission control algorithms and propose a method with very little overhead that outperforms the commonly used control techniques. Simulations are used to provide a better comparison between the commonly used techniques and the newly proposed solution.

3.1 Using Admission Control

As we mentioned before, most of the work using a control-theoretic approach to control quality of service considered the average response time as the parameter to control. The problem with that approach is that most of the required QoS is not about the average response time [28]. Controlling the average response time will not lead us to a very specific conclusion, because Internet traffic is highly volatile and unpredictable. Classic queuing theory deals with such scenarios if we know the distributions of the incoming traffic and the service time (or at least know some parameters of the underlying distributions, such as the average and the standard deviation). For example, consider the simplest type of queue, known as M/M/1. The cumulative probability distribution of the response time is [39]:

    F(T) \;=\; 1 - e^{-\mu(1-\rho)T} \qquad (2)

where μ is the service rate and ρ is the server utilization. Accordingly, the average time spent in the system for M/M/1 is 1/(μ(1−ρ)) = 1/(μ−λ), where λ is the arrival rate.
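As an illustration of why the full distribution matters, a minimal sketch (Python; the function name is ours, and the numbers reuse the Traffic A setting defined in Section 3.2) that evaluates the fraction of conforming requests directly from the M/M/1 CDF above:

    import math

    def mm1_conforming_fraction(arrival_rate, service_rate, deadline):
        """Fraction of requests finishing within `deadline` in an M/M/1 queue.

        Uses the response-time CDF F(T) = 1 - exp(-mu*(1 - rho)*T),
        which is valid only for a stable queue (rho = lambda/mu < 1).
        """
        rho = arrival_rate / service_rate
        if rho >= 1.0:
            raise ValueError("queue is unstable (utilization >= 1)")
        return 1.0 - math.exp(-service_rate * (1.0 - rho) * deadline)

    # Traffic A setting from Section 3.2: mean interarrival 55 ms,
    # mean service 35 ms, 150 ms target (all times below in ms).
    print(mm1_conforming_fraction(1 / 55.0, 1 / 35.0, 150.0))  # ~0.79

Even at a modest 64% utilization, only about 79% of requests would meet a 150 ms deadline, although the mean response time (about 96 ms) looks comfortably below the target.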

Another problem with a control-theoretic approach is how to handle the overhead of calculating and adjusting the system and controller parameters. In admission control, the system output should be monitored to collect statistics about the parameter to be controlled. Every sampling period, the collected data are used to calculate the admission probability, and the arrivals are modulated with this probability to meet the required service performance measure. The major question here is: how should we select the sampling period?

Request arrivals at a server are usually in the milliseconds range, or even less for very powerful servers. Now, what should the sampling period be? If we consider a sampling period on the order of seconds, that is good from the overhead point of view: we do not want to overload the server with control calculations, since that time is taken from serving incoming requests. However, since web traffic is highly volatile and unpredictable, what happened a few seconds ago might not have an impact on the current operation of the system. For example, 5 seconds ago we may have received a rush of requests that prolonged the response time beyond the required QoS, and we may have decided to reduce the admission probability in order to slow down the arrivals, while now there are very few arrivals at all. That leads to wasted CPU cycles because of the mismatch between the sampling rate and the traffic variations.

The second choice is a sampling time on the order of milliseconds. Although the response will be much faster than in the previous case, this imposes too much overhead on the CPU. Even the online recursive estimator in [40] requires a number of large matrix multiplications every sampling period in order to estimate the system parameters.

In this section, we simulate different techniques for QoS guarantees in a web server. First we consider a simple M/G/1 model similar to the one proposed in [24]. This model predicts the response time, so the only controllable parameter here is the average response time. Then we consider a general model for the system: we assume a second-order model and estimate the system parameters online, and a first-order controller/filter is used to control the average response time. Our third model is a variation of the previous one, where the parameter to control is the percentage of requests that failed the QoS requirements. Finally, we consider a fourth model where we use a simple variation of Additive Increase Multiplicative Decrease (AIMD), which was successfully implemented in congestion avoidance for TCP/IP protocols [41].

3.1.1 Using PI Controller

This is the method proposed in [28]. Eq. 1 is considered to represent the system, where T_rt is the response time. The schematic diagram of the controller is shown in Figure 1, where the server is represented by Eq. 1.

Figure 1: Schematic diagram of the PI controller

In Fig. 1, τ represents the response time and τ_ref is the required average response time. P_0 is the admission probability derived from Eq. 1 in order to make τ = τ_ref, λ_0 is the unmodulated arrival rate, and ΔP is the correction produced by the controller to P_0 in order to guarantee the required τ_ref. Linearizing Eq. 1 using a Taylor series around the operating point λ_0, we can solve for the PI controller. The results of this scheme, together with some comments about it, are discussed in the simulation subsection.
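The algebra of the linearization is not reproduced in full here, but a plausible reconstruction is as follows (assuming the processor-sharing form of Eq. 1, with the modulated arrival rate written as λ = P·λ_0):

    T(\lambda) \;=\; \frac{E[X]}{1 - \lambda E[X]},
    \qquad
    \Delta T \;\approx\; \left.\frac{\partial T}{\partial \lambda}\right|_{\lambda_0} \lambda_0 \, \Delta P
    \;=\; \frac{\lambda_0 \, E[X]^2}{\bigl(1 - \lambda_0 E[X]\bigr)^2} \, \Delta P

This small-signal gain is what the PI controller is tuned against; it also suggests why the design degrades in overload, since as λ_0 E[X] approaches 1 the gain blows up and Eq. 1 no longer describes the queue.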

3.1.2 Estimating The System Parameters

Similar to [24], we assume no knowledge of the system under control (e.g. the server). By monitoring the input and output of the server, we derive the model parameters. Here the system under control is the server, with input λ_0 and output either the average response time or the percentage of requests that conform to the required QoS. We tried several models for the system and found that the best fit is a second-order system. In this part, we consider two solutions: one that monitors the average response time and one that monitors the percentage of requests conforming to the QoS. In either case, we assume that the system output y (no matter what the output is; it could be the response time, or the percentage of packets that missed the service-time threshold) can be represented as a second-order system, where the input u is the arrival rate, as follows:

    \frac{y}{u} \;=\; \frac{1 + a_1 Z^{-1} + a_2 Z^{-2}}{b_0 + b_1 Z^{-1} + b_2 Z^{-2}} \qquad (3)

where Z^{-1} is the delay operator. The parameters a_i and b_i are estimated online by measuring the output y and the input u and averaging them over the sampling period. We use a well-known recursive least squares estimator [40].
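A minimal sketch of the estimator update (Python with NumPy; the regression rearrangement, names and initial values are our own illustration, not the Matlab code used for the thesis experiments):

    import numpy as np

    def rls_update(theta, P, phi, y, lam=0.98):
        """One recursive least squares step with forgetting factor `lam`.

        theta: current parameter estimate; P: covariance matrix;
        phi: regressor vector; y: newly measured (averaged) output.
        """
        P_phi = P @ phi
        gain = P_phi / (lam + phi @ P_phi)
        theta = theta + gain * (y - phi @ theta)   # correct by prediction error
        P = (P - np.outer(gain, P_phi)) / lam      # shrink covariance
        return theta, P

    # Eq. 3 rearranged into regression form (dividing through by b0):
    #   y[k] = -c1*y[k-1] - c2*y[k-2] + d0*u[k] + d1*u[k-1] + d2*u[k-2]
    # so theta = [c1, c2, d0, d1, d2] and
    #   phi[k] = [-y[k-1], -y[k-2], u[k], u[k-1], u[k-2]].
    theta, P = np.zeros(5), 1e3 * np.eye(5)   # large P: "no prior knowledge"

The forgetting factor discounts old samples, which matters here because the traffic, and hence the local model, drifts over time.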

Figure 2: Controller using Eq. 3

    G \;=\; K_0\,\frac{\alpha_0 + \alpha_1 Z^{-1}}{\beta_0 + \beta_1 Z^{-1}} \qquad (4)

The controller was designed as a PI controller of the form of Eq. 4, where α_0, α_1, β_0, β_1 and K_0 are chosen in order to achieve a reasonable overshoot, a settling time within the sampling period, and proper tracking of the output through the gain K_0.

3.1.3 Additive Increase Multiplicative Decrease

This idea comes from the sliding-window control in TCP [41]. In TCP, the window size is decreased (by a multiplicative factor) if there is a lost packet and is increased (by an additive factor) on successful transmission of a packet [42].

The proposed scheme works as follows. If the queue size grows beyond a high threshold N_high, the probability of accepting a new packet is multiplied by γ, where 0 < γ < 1. If the queue size drops below N_low, the admission probability is increased by q, where 0 < q < 1.

The obvious question is how to choose the values of N_high, N_low, γ, and q. The proper choice of these parameters depends on the service-time and inter-arrival-time distributions. In our simulation, we tried different values for the parameters and found that the best choice for N_high is the target delay divided by the average service time, while N_low = N_high/2. We also found the best value of q to be 0.4, and the best γ to be between 0.8 and 0.9. Clearly, the optimal values of these parameters depend on the arrival and service distributions and can be fine-tuned online.
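A sketch of this control loop (Python; the class and method names are ours):

    import random

    class AIMDAdmission:
        """Queue-driven AIMD admission control as described above.

        Crossing n_high cuts the admission probability multiplicatively
        by gamma; dropping below n_low = n_high/2 raises it additively
        by q (capped at 1.0).
        """

        def __init__(self, target_delay, avg_service_time, q=0.4, gamma=0.85):
            self.n_high = target_delay / avg_service_time
            self.n_low = self.n_high / 2.0
            self.q, self.gamma = q, gamma
            self.p = 1.0  # admission probability

        def on_queue_sample(self, queue_len):
            if queue_len > self.n_high:
                self.p *= self.gamma                 # multiplicative decrease
            elif queue_len < self.n_low:
                self.p = min(1.0, self.p + self.q)   # additive increase

        def admit(self):
            return random.random() < self.p

With the 150 ms target and 35 ms mean service time used in the next section, n_high works out to roughly 4.3 queued requests. Note that no per-sample matrix arithmetic is needed, which is the source of the scheme's low overhead.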

3.2 Simulations

In this section we show the results of our Matlab simulations for the cases proposed in the previous section (PI, estimating the system parameters, and AIMD). For every experiment, we ran the simulation for 1600 seconds and collected the percentage of requests accepted, the percentage of requests served in less than 150 msec, and the average response time. For every experiment, we considered two traffic scenarios (a small simulation sketch follows the list):

• Traffic A: This is the baseline system; we assumed an average exponential interarrival time of 55 msec and an average exponential service time of 35 msec (64% utilization). The target response time is 150 msec.

• Traffic B: In this scenario, we start as in Traffic A. At the simulation midpoint (after 800 seconds) we increase the arrival rate by decreasing the interarrival time to 45 msec (utilization of 78%). The objective here is to see how the controller reacts to the increased arrival rate in order to satisfy the QoS requirements.

Fig. 3 shows the response time for a 1600-second simulated run under traffic B. We can see that after 800 seconds the average response time increases. The main function of any controller is to adjust the admitting probability in order to avoid such a scenario. Fig. 4 (same setting as Fig. 3) shows the response time and the acceptance probability as a function of time. It is obvious that the controller suppressed the input in the second half of the simulation, leading to a more consistent (equal) response time. One thing that is very noticeable here is the rapid change of the admitting probability compared to the other methods we used. Although the probability changes very rapidly, the performance is the worst among the compared techniques. One possible explanation for this behaviour is that Eq. 1 does not describe the system when the traffic increases beyond stability, even for a short period of time.

Figure 3: Response time without a PI controller under traffic B


Figure 4: Response time and admitting probability under traffic B for a PI controller

Then we consider our technique where we assume a second-order system and estimate the system parameters online. Once the parameters are estimated, they are used to choose the parameters of a first-order controller/filter in order to track the required QoS criterion, either directly, by controlling the percentage of conforming packets, or indirectly, by controlling the average response time.

Figure 5 shows the response time and admitting probability for our system under traffic A, assuming a 2nd-order system with online parameter estimation, while Fig. 6 shows the same system under the traffic B scenario. The parameter being controlled in this case is the average response time.

Figure 5: Response time and admitting probability assuming a second order system and a first order filter under traffic A (controlling average response time).

Figure 6: Response time and admitting probability assuming a second order system and a first order filter under traffic B (controlling average response time).

The changes in the admitting probability for this system are much smaller than in the case of the PI controller. That by itself is not an advantage; however, it might give an indication that the system is stable and does not oscillate.

Figures 7 and 8 show the same results as Figures 5 and 6, but in this case we use the percentage of conforming requests as the parameter to control. Although it is difficult to see directly from the figures which target is better at maintaining the required QoS (the response time or the percentage of conforming packets), Table 1 shows that in fact controlling the percentage of conforming packets produces better results.

Figure 7: Response time and admitting probability assuming a second order system and a first order filter under traffic A (controlling QoS).

Figure 8: Response time and admitting probability assuming a second order system and a first order filter under traffic B (controlling QoS).

Figure 9 shows the system using AIMD with N_high = T_target / T_av and N_low = N_high/2, q = 0.4, and γ = 0.8, where T_target is the target delay, set to 150 msec, and T_av is the average response time under traffic A. Figure 10 shows the same setting under traffic B. The admitting probability varies very quickly compared to Figs. 5, 6, 7 and 8. However, that is expected, since the change of the admitting probability is computed every time the queue size grows beyond one threshold or drops below the other.

Figure 9: Response time and admitting probability using Additive Increase Multiplicative Decrease (AIMD) under traffic A

Figure 10: Response time and admitting probability using Additive Increase Multiplicative Decrease (AIMD) under traffic B

We summarize the results in Table 1. The first column shows the four different techniques we used. The second column shows, for every technique, the results under traffic A and traffic B. The actual results are shown in columns 3, 4, and 5. Column 3 shows the percentage of admitted requests. Column 4 shows the percentage of requests conforming to the required QoS: the first number is the percentage of conforming requests with respect to all arriving requests, while the number in parentheses is the percentage of conforming requests with respect to admitted requests only. Finally, column 5 shows the average response time over all admitted packets.

From Table 1 we can also see that AIMD has the highest admitting probability under traffic A (97%), while PI has the lowest (61%). It also shows that AIMD has the highest conforming percentage under traffic A. Under traffic B, the 2nd-order system with online parameter estimation has a slightly higher admitting probability than AIMD (94% vs. 93%); however, the percentage of conforming requests is much higher for AIMD (whether compared to all arriving requests or to admitted requests only). Basically, that shows that AIMD rejects a very small percentage of incoming requests, but it rejects the right ones.

Technique      Test Set   Accepted   Conforming    Average Delay
PI             A          61%        53% (86%)     74 ms
PI             B          57%        48% (84%)     80 ms
Est. (QoS)     A          92%        73% (79%)     90 ms
Est. (QoS)     B          88%        63% (72%)     119 ms
Est. (Tresp)   A          95%        75% (79%)     95 ms
Est. (Tresp)   B          94%        66% (70%)     125 ms
AIMD           A          97%        83% (86%)     78 ms
AIMD           B          93%        77% (83%)     85 ms

Table 1: Comparison between the four proposed methods

CH.4: PROPOSED SYSTEM ARCHITECTURE

In this chapter we present our proposed system in detail, from both the software and the hardware points of view. We also compare its power consumption to that of state-of-the-art powerful Internet servers. Various admission control scenarios are presented and discussed as well.

4.1 System Reference Architecture

As discussed, power consumption is the largest operating expense in a server farm. At the same time, low-power embedded systems are not as easy to work with as general-purpose computers. Currently, various virtualization techniques are used to further utilize the available hardware. If there were a way to easily manage a series of low-power embedded computers, it would be possible to partially replace some of the general power-hungry servers with a new cluster of cheap embedded devices. These embedded computers can at least accelerate, or even reduce the number of, power-hungry servers in server farms and data centers.

Load-balancing policies alone, which distribute requests among servers, can provide 10% energy savings [25]. This is purely because of the better distribution of load among the servers, which avoids exhausting a particular node. Server exhaustion usually happens due to overloaded context switching and extensive usage of push and pop machine instructions. In this research, we also exploit the same concept to distribute the extra load onto the tiny embedded server accelerators, avoiding possible server exhaustion and the need to add more servers to the grid during peak traffic hours.

In this research, we propose a series of green, cheap and reusable embedded hardware server accelerators, clustered together with free and open-source software, to reduce the power consumption of Internet server farms. Furthermore, we use results from classical adaptive control theory for admission control and load-balancing to satisfy the required QoS. This approach not only helps us consume less energy, but also helps us reuse the tiny accelerators after hardware upgrades in server farms.

4.2 Utilized Software And Protocols

To avoid royalty charges, the freely available and widely used open-source operating system Linux [43] is used as the main driver of each embedded computer. In order to provide fail-over capabilities and route web requests to the embedded accelerators, a series of load-balancing machines is needed. Luckily, the Linux operating system has built-in traffic routing capabilities through its Netfilter [44] modules.

Traffic in the network can be controlled via Linux clustering and Netfilter kernel modules directly from the load-balancers. However, to have adaptive control at the web-server and application layers, we have to modify the web-server itself to guarantee that each embedded accelerator avoids overload. The Cherokee project is claimed to be the fastest open-source web-server available in the market [45], easily surpassing well-known web-servers like Apache [46], Microsoft IIS [47] and Nginx [48]. The Cherokee web-server is modified in order to allow the use of adaptive control modeling at the application level.

Modern high-volume web applications demand a multi-tiered design to achieve scalability and reliability. In a three-tiered design, the web-server layer consults an application-server layer, on top of a database-server layer, to solve complex business logic and reduce web-server layer overheads: the web-server layer is simply responsible for static web content, and it consults the application-server layer to respond to complex dynamic business requests. Luckily, there exists free, high-quality open-source software for each layer, such as the Apache HTTP web-server [46], the Apache Tomcat application-server [49] and the MySQL database server [50]. Figure 11 shows a typical 3-tier web server architecture. The following are the details of the software and hardware components used in this research.

Figure 11: A typical 3-tier web server architecture (clients, web server(s), application server(s), database server(s))

4.2.1 Linux Kernel And Modules

Linux kernel preemption is the ability of the Linux kernel to interrupt itself while it is doing something else in order to work on a higher-priority job, such as updating a video interface. Under heavy loads, a Linux kernel with low-latency features enabled can provide better results and reduced scheduler latency. However, servers have a very different workload profile from their desktop counterparts: in the server world, one should not force any preemption, as the creation of extra scheduling overhead should be strictly avoided. As such, the Linux kernels in this research are compiled with the "No Forced Preemption" flag enabled.

The Netfilter framework [44] is a set of Linux kernel modules used for filtering and manipulating all network packets that pass through the machine. It is commonly used to enable a firewall on the machine, protecting it from different systems on the Internet, or to use the machine as a proxy for other machines on the network. In this research, the connection tracking modules needed by Linux Virtual Server [51] configured in "Director" mode are enabled.

4.2.2 File System, Hard Disk Drives And RAID Topology

RAID-1 [52] is used to mirror data on the solid state drives of the network file servers used in this research. Data is written identically to multiple mirrored disks; the RAID array thus provides fault tolerance against disk errors while assuring data integrity by mirroring.

BTRFS [53] (a B-tree file system developed by Oracle Corporation and released under the GPL) provides disk snapshots, checksums and integral multi-device spanning in a Linux file system, which are crucial features for enterprise storage systems. BTRFS has a special operation mode for SSDs that provides wear leveling and better protection, and hence it is used in this research. BTRFS also delivers increased read performance thanks to its RAID-1 compatibility and its special SSD support.

4.2.3 TFTP, PXE, NFS

Trivial File Transfer Protocol (TFTP) [54] is a very simple file transfer protocol. It is generally used for automated transfer of configuration or boot files between machines in a local environment. Compared to FTP, TFTP is extremely limited, providing no authentication, and it is rarely used interactively by a user. Due to its simple design, TFTP can be implemented in a very small amount of memory. It is therefore useful for booting computers such as routers, which may not have any data storage devices. It is an element of the Preboot Execution Environment (PXE) [55] network boot protocol, where it is implemented in the firmware BIOS of the host's network card. TFTP typically uses UDP as its transport protocol, and data transfer is initiated on port 69.
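To make the protocol's simplicity concrete, here is a sketch of a TFTP read request per RFC 1350 (Python; our own illustration, fetching only the first data block of a file such as the NBP that a PXE ROM would request):

    import socket
    import struct

    def tftp_fetch_first_block(server, filename):
        """Send a TFTP RRQ and return the first data block (RFC 1350).

        RRQ layout: opcode 1 | filename | 0x00 | mode | 0x00.
        DATA replies carry opcode 3 and a 16-bit block number; each
        block is acknowledged with opcode 4.
        """
        rrq = struct.pack("!H", 1) + filename + b"\x00" + b"octet\x00"
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(2.0)
        sock.sendto(rrq, (server, 69))        # transfers start on port 69
        data, addr = sock.recvfrom(4 + 512)   # header + up to 512 bytes
        block = struct.unpack("!H", data[2:4])[0]
        sock.sendto(struct.pack("!HH", 4, block), addr)  # ACK the block
        return data[4:]

    # e.g. tftp_fetch_first_block("192.168.0.1", b"pxelinux.0")
    # (the server address and file name here are hypothetical)

The entire client fits in a dozen lines with no authentication or connection state, which is exactly why it is small enough to live in a network card's boot ROM.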

The Preboot Execution Environment (PXE) [55] is an environment for booting computers using a network interface, on the fly and independently of local data storage devices. PXE uses several network protocols, namely the Internet Protocol (IP), the User Datagram Protocol (UDP), the Dynamic Host Configuration Protocol (DHCP) and the Trivial File Transfer Protocol (TFTP). Most x86 devices can take advantage of PXE through the firmware saved on the client network card. The client tries to locate a PXE redirection service on the network (Proxy DHCP) in order to receive information about the available PXE boot servers. After parsing the answer, the firmware asks an appropriate boot server for the file path of a network bootstrap program (NBP), downloads it into the computer's random-access memory (RAM) using TFTP and finally executes it. The PXE protocol is approximately a combination of DHCP and TFTP: DHCP is used to locate the appropriate boot servers, and TFTP is used to download the initial bootstrap kernel programs and additional files.

To initiate a PXE bootstrap session, the PXE firmware broadcasts a DHCPDISCOVER packet extended with PXE-specific options (an extended DHCPDISCOVER) to port 67/UDP (the DHCP server port). The PXE options identify the firmware as PXE-capable, but standard DHCP servers will ignore them. If the firmware receives DHCPOFFERs from suitable servers, it may configure itself by requesting one of the offered configurations, and finally boot the client machine via the downloaded kernel and additional files.

Network File System (NFS) [56] is a network file system protocol that allows a user on a client computer to access files over a network in a manner similar to how local storage is accessed. NFS builds on the Open Network Computing Remote Procedure Call (ONC RPC) system. The Network File System is an open standard defined in RFCs, allowing anyone to implement the protocol. Current NFS servers provide data-access parallelism, allowing higher speeds. The NFSv4.1 protocol defines a method of separating the file system meta-data from the location of the file data, which allows multi-node NFS servers. NFSv4.1 provides enterprise features such as Sessions, Directory Delegation and Notifications, Multi-server Namespace, ACLs and Retention Attributes.

4.2.4 U-BOOT, INITRD, BusyBox

Das U-Boot [57] is a boot-loader with support for many embedded devices. It provides Dynamic Host Configuration Protocol (DHCP) support to obtain the embedded system's IP address automatically, as well as support for loading the Linux kernel from a TFTP server.

A Linux initrd (initial ramdisk) is a temporary file system used in the boot process of the Linux kernel. Our initrd is customized to support loading the Linux file system from the network file servers, in order to be able to easily manage the server accelerators [58].

BusyBox [59] provides several stripped-down Unix tools in a single executable. It runs on top of the Linux kernel before the init program in order to help initialize some of the system-wide configuration and modules. BusyBox is compiled with extensive support for network protocols such as DHCP to enable early IP initialization of the cluster of embedded server accelerators.

37 4.2.5 LVS And Routing

Linux Virtual Server (LVS) [60] is a free, advanced, open-source load-balancing solution for Linux systems. LVS is a high-performance and highly available server add-on for Linux systems using clustering technology, which provides good scalability, reliability and serviceability. IPVS (IP Virtual Server) [61] is advanced IP load-balancing software implemented inside the Linux kernel's LVS subsystem; the IP Virtual Server code is already included in the standard Linux kernel 2.6 sources. IPVS implements transport-layer load-balancing inside the Linux kernel (also known as Layer-4 LAN switching). IPVS can direct requests for TCP/UDP-based services to the real servers, and it makes the services of the real servers appear as a virtual service on a single IP address.

LVS also provides a module called KTCPVS (Kernel TCP Virtual Server) [62], which implements application-level load-balancing, also known as Layer-7 switching, inside the Linux kernel. Since the overhead of layer-7 switching in user-space is very high, it is a good idea to implement it inside the kernel in order to avoid the cost of context switching and memory copying between user-space and kernel-space. Although the scalability of KTCPVS is lower than that of IPVS, it is more flexible, because the content of a request is known before the request is redirected to another server.

Users can use LVS solutions to build highly scalable and highly available network services, such as web, email, media and VoIP services, and integrate scalable network services into large-scale, reliable e-commerce or e-government applications. LVS implements several balancing schedulers (Round-Robin, Weighted Round-Robin, Least-Connection, Weighted Least-Connection, Locality-Based Least-Connection, Locality-Based Least-Connection with Replication, Destination Hashing, Source Hashing, Shortest Expected Delay and Never Queue) [63]:

The Weighted Round-Robin scheduling is designed to better handle servers with different processing capacities. Each server can be assigned a weight, an integer value indicating its processing capacity. Servers with higher weights receive new connections before those with lower weights, servers with higher weights get more connections than those with lower weights, and servers with equal weights get equal numbers of connections. For example, if the real servers A, B and C have the weights 4, 3 and 2 respectively, a good scheduling sequence is AABABCABC in each scheduling period. In the implementation of the weighted round-robin scheduling, a scheduling sequence is generated according to the server weights after the Virtual Server rules are modified, and network connections are then directed to the real servers based on this sequence in a round-robin manner [63].

The weighted round-robin scheduling is better than plain round-robin scheduling when the processing capacities of the real servers differ. However, it may lead to dynamic load imbalance among the real servers if the load of the requests varies highly. In short, there is the possibility that a majority of requests requiring large responses will be directed to the same real server. The round-robin scheduling is a special instance of the weighted round-robin scheduling in which all the weights are equal [63].

The least-connection scheduling algorithm directs network connections to the server with the least number of established connections. This is one of the dynamic scheduling algorithms as it needs to count live connections for each server dynamically.

For a Virtual Server that is managing a collection of servers with similar performance, least-connection scheduling is good at smoothing the distribution when the load of requests varies a lot. The Virtual Server directs requests to the real server with the fewest active connections [63].

At first glance it might seem that least-connection scheduling could also perform well when servers have various processing capacities, because the faster server would get more network connections. In fact, it cannot perform very well because of TCP's TIME_WAIT state. TCP's TIME_WAIT is usually two minutes, during which a busy web site often receives thousands of connections. For example, if server A is twice as powerful as server B, server A may be processing thousands of requests and keeping them in the TIME_WAIT state while server B is crawling to get its thousands of connections finished. So the least-connection scheduling cannot balance load well among servers with various processing capacities [63].

The weighted least-connection scheduling is a superset of the least-connection scheduling, in which a performance weight is assigned to each server. Servers with a higher weight receive a larger percentage of live connections at any given time. The default weight is one if none is specified. The weighted least-connection scheduling works as follows [63]:

Suppose there are n real servers, where server i has weight $W_i$ ($i=1,..,n$) and $C_i$ live connections ($i=1,..,n$), and $\mathrm{ALL\_CONNECTIONS}$ is the sum of $C_i$ ($i=1,..,n$). The next network connection will be directed to the server $j$ for which

$\frac{C_j / \mathrm{ALL\_CONNECTIONS}}{W_j} = \min_{i} \left\{ \frac{C_i / \mathrm{ALL\_CONNECTIONS}}{W_i} \right\}$   (5)

Since $\mathrm{ALL\_CONNECTIONS}$ is a constant, there is no need to divide $C_i$ by $\mathrm{ALL\_CONNECTIONS}$, and the criterion can be optimized to

$\frac{C_j}{W_j} = \min_{i} \left\{ \frac{C_i}{W_i} \right\}$   (6)

The weighted least-connection scheduling algorithm requires an additional division compared to the least-connection scheduling. In the hope of minimizing the overhead of scheduling when servers have the same processing capacity, both the least-connection scheduling and the weighted least-connection scheduling algorithms are implemented [63].
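The selection rule in (6) is simple to express in code. Below is a minimal, illustrative Python sketch of the weighted least-connection choice; the names conns and weights are hypothetical, and the real implementation lives inside the IPVS kernel code.

    def pick_weighted_least_connection(conns, weights):
        """conns, weights: dicts of server name -> active connections / weight.
        Returns the server minimizing C_i / W_i, as in equation (6)."""
        return min(weights, key=lambda s: conns[s] / weights[s])

    # Example: A (weight 4, 8 connections) vs. B (weight 2, 3 connections):
    # A -> 8/4 = 2.0, B -> 3/2 = 1.5, so B receives the next connection.
    print(pick_weighted_least_connection({"A": 8, "B": 3}, {"A": 4, "B": 2}))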

The user-space tool for administering LVS-IPVS is called "ipvsadm". LVS supports integration with many operating systems, such as Linux, the BSDs, Solaris and Windows. LVS/NAT can balance servers of any operating system with TCP/IP support, LVS/TUN requires servers that support the IP tunneling protocol, and LVS/DR requires servers that provide a non-ARP network device. Almost all modern operating systems support non-ARP devices, and hence almost any server can be integrated behind an LVS Director node [64].

LVS can easily handle many simultaneous connections. An LVS Director node with 2 GB of memory can easily handle 8 million or more simultaneous connections (one connection costs 128 bytes on the LVS node). LVS solutions have already been deployed in many real applications throughout the world, including Wikipedia [65].

One of the configuration modes of LVS is called Direct Routing (LVS_IPVS_DR) [66], an IP load-balancing technique implemented in LVS. DR mode routes packets directly to the backend servers by rewriting the MAC address of the data frame with the MAC address of the selected backend server. It has the best scalability among the methods because the overhead of rewriting the MAC address is quite low, but it requires that the load-balancer and the backend servers (real servers) be on the same physical network.

The real servers share the virtual IP address with the DR load-balancer. The load-balancer has an interface configured with the virtual IP address, which is used to accept request packets, and it directly routes the packets to the chosen servers. All the real servers have their non-ARP alias interface configured with the virtual IP address so that they can process the packets locally. The load-balancer simply changes the MAC address of the data frame to that of the chosen server and retransmits it on the LAN. This is the reason that the load-balancer and each server must be directly connected to one another by a single uninterrupted segment of a LAN (e.g. physically linked by a hub or switch) [66]. The architecture of a virtual server via direct routing is illustrated in Fig. 12:

Figure 12: Virtual Server in Direct Routing Mode

When an end-user accesses a virtual service provided by the server cluster, a packet destined for the virtual IP address (the IP address of the virtual server) arrives. The load-balancer (Director) examines the packet's destination address and port. If they match a virtual service, a real server is chosen from the cluster by a scheduling algorithm, and the connection is added to the hash table that records connections. Then the load-balancer forwards the packet directly to the chosen server. When an incoming packet belongs to this connection and the chosen server can be found in the hash table, the packet is again directly routed to that server. When the server receives the forwarded packet, it finds that the packet is for the address on its alias interface or for a local socket, so it processes the request and finally returns the result directly to the user. After connection termination or timeout, the connection record is removed from the hash table.
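As an illustration of how such a DR-mode virtual service is administered in practice, the following Python sketch wraps the usual ipvsadm invocations (run as root on the Director). The virtual IP, real-server addresses and weights are hypothetical examples, not our actual deployment values.

    import subprocess

    VIP = "192.168.1.100"                # hypothetical virtual IP address
    REAL_SERVERS = {"192.168.1.11": 4,   # hypothetical real server -> weight
                    "192.168.1.12": 2}

    # Add the virtual HTTP service with weighted round-robin scheduling.
    subprocess.check_call(["ipvsadm", "-A", "-t", VIP + ":80", "-s", "wrr"])

    # Add each real server in direct-routing ("gatewaying", -g) mode.
    for rip, weight in REAL_SERVERS.items():
        subprocess.check_call(["ipvsadm", "-a", "-t", VIP + ":80",
                               "-r", rip + ":80", "-g", "-w", str(weight)])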

4.2.6 FastCGI, SNMP, Cherokee And PHP

Fast Common Gateway Interface (FastCGI) is a protocol for interfacing interactive programs and scripts with web servers. FastCGI reduces the overhead of interfacing web servers with CGI programs, such as creating a new process for each web request. Instead of creating a new process per request, FastCGI uses persistent processes to handle a series of requests. These processes are owned by the FastCGI server, not the web server. To service an incoming request, the web server sends the environment information and the page request itself to a FastCGI process over a local (Unix domain) socket, in the case of local FastCGI processes on the web server, or over a TCP connection, in the case of remote FastCGI processes in a server farm. Responses are returned from the process to the web server over the same connection, and the web server subsequently delivers the response to the end-user. The connection may be closed at the end of a response, but both the web server and the FastCGI service processes persist. A FastCGI process can handle many requests over its lifetime, thereby avoiding the overhead of per-request process creation and termination. Multiple connections to FastCGI servers can be configured to increase stability and scalability.
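For illustration, a minimal FastCGI responder can be written in Python using the third-party flup library (an assumption for this sketch; the prototype in this thesis uses PHP behind Cherokee instead). The sketch exposes a trivial WSGI application over FastCGI on a TCP port that a web server's FastCGI connector could dispatch to.

    # Minimal FastCGI responder sketch (assumes the third-party flup package).
    from flup.server.fcgi import WSGIServer

    def app(environ, start_response):
        # A trivial dynamic page; a real deployment would run PHP or similar.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello from a persistent FastCGI process\n"]

    if __name__ == "__main__":
        # Listen on TCP so a remote web server (e.g. Cherokee) can connect.
        WSGIServer(app, bindAddress=("0.0.0.0", 9000)).run()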

Web site administrators and programmers find that the separation of web applications from the web server in FastCGI has many advantages over embedded interpreters (mod_perl, mod_php, etc.). This separation allows server and application processes to be restarted independently, which is an important consideration for busy web sites. It also enables the implementation of per-application hosting security policies, which is an important requirement for ISPs and web hosting companies. Different types of incoming requests can be distributed to specific FastCGI servers that have been equipped to handle those particular types of requests efficiently, helping to guarantee the SLA and quality of service in enterprise environments.

Simple Network Management Protocol (SNMP) is an Internet-standard protocol for managing devices on IP networks. Devices that typically support SNMP include routers, switches, servers, workstations, printers, modem racks, and more. SNMP is used mostly in network management systems to monitor network-attached devices for conditions that warrant administrative attention [56]. In this research we use SNMP to monitor the web servers' load and condition and to alert the Linux Virtual Server load-balancer to change the load routing.
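As a sketch of this monitoring path, the one-minute load average of a node running a standard net-snmp agent can be polled with the snmpget tool (UCD-SNMP-MIB laLoad); the host name and community string below are hypothetical, and the alerting logic is only indicated in a comment.

    import subprocess

    # UCD-SNMP-MIB::laLoad.1 -- one-minute load average on net-snmp agents.
    LOAD_OID = ".1.3.6.1.4.1.2021.10.1.3.1"

    def poll_load(host, community="public"):
        """Return the 1-minute load average of `host` via SNMP, as a float."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, LOAD_OID])
        return float(out.decode().strip().strip('"'))

    # Example (hypothetical node): alert the LVS director above a load of 8.
    # if poll_load("accel-01") > 8.0: reroute traffic away from accel-01.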

Cherokee is an open-source, cross-platform web server. It is very fast, flexible and easy to configure. It supports a wide range of technologies including FastCGI, SCGI, PHP, ASP.NET via Mono, CGI, uWSGI, SSI, TLS- and SSL-encrypted connections, virtual hosts, authentication, on-the-fly encoding, load-balancing, Apache-compatible log files, database balancing, reverse HTTP proxying, traffic shaping, video streaming and much more. Cherokee-Admin, a user-friendly interface, provides no-hassle configuration of the server. Completely integrated with the Cherokee-Market, it can automatically deploy web applications for optimal performance. Benchmarks show it is faster than many web servers, including the Apache web server [45]. In this research, Cherokee has been modified to report the response time of each request via a shared-memory mechanism. The average response times are periodically reported to the Linux Virtual Server load-balancer to efficiently throttle load across the cluster.

PHP [67] is a widely used general-purpose scripting language that is especially suited for web development and can be embedded into HTML. In this research, a FastCGI pool of PHP runtimes is connected to the Cherokee web server to handle dynamic requests. Cherokee can also be interfaced with Perl, Mono/C#/ASP/VB, Java, C/C++, Python, Ruby and TCL through FastCGI in case other languages or scripting languages are preferred.

4.3 Admission Control And Load-Balancing

One of the major problems facing web service designers and providers is how to provide the QoS required by their customers within the constraints of the available budget. Web traffic is highly unpredictable, volatile and dynamic [68]. The servers must be provisioned to provide the required QoS, but no more than that, in order to minimize the system cost. Off-line provisioning is not effective with highly dynamic and volatile traffic. A very precise, on-line configuration is required in order to minimize the hardware as well as the energy cost.

Classical results from control theory have recently been applied to computing systems, and especially to web-server provisioning [69]. One of the major problems of using control theory in web-service provisioning is that the web server is highly nonlinear; coming up with a mathematical model for a web service is extremely difficult. Representing the web server as a queue and using results from queuing theory is very popular in performance evaluation of web servers [70, 71]. Adaptive control was used in [24] for admission control in order to maintain a specific QoS, while a queuing-model-based control was proposed in [72].

A major problem with queuing-model-based control is that the system is usually modeled as an M/G/1 queuing system. This is not an accurate description of a web server, especially the assumption of Poisson arrivals (the M in the M/G/1 system). Another problem is that almost all queuing theory results are based on the assumption of a stable system where the arrival rate is less than the service rate (on average and in the long run). If the system is pushed by a surge of traffic into the region where the arrival rate is greater than the service rate, the queuing theory results are no longer valid, and neither is the model used to control the system.

In this research we investigate other approaches to admission control. In one approach, we assume a totally unknown system and try different models with on-line estimation/prediction of the model parameters. In the other approach, we consider variations of additive-increase multiplicative-decrease (AIMD), which has been used successfully in TCP congestion control. The results are also used for load-balancing and power control by controlling the number of active nodes in the cluster in order to achieve the required QoS.

4.4 Detailed View Of The Proposed Accelerator Cluster

As mentioned in CH. 4.2, the Cherokee web-server has been modified to report the response time of each web request via a shared-memory mechanism. This means that at any given moment we have a clear understanding of how busy and responsive the cluster nodes are. The reporting interval is adjustable, but currently every 500 ms we notify the load-balancers about the average response time and average load of the nodes in the cluster. The load-balancers run a daemon program, called RouteEngine, which goes over the average response times and decides whether a particular node is exhausted and needs help.

The RouteEngine daemon activates or deactivates accelerators, joining them to or detaching them from the cluster accordingly. LVS is set up in DR mode to provide direct IP routing with minimal overhead. The following are the main indicators used by RouteEngine to change the LVS routing schema (a short illustrative sketch of these calculations follows the list):

1 Processing Power factor (PP), which depends on the processor speed of the node and the number of CPUs together with their associated cores.

° Here is an example of how the PP factor is calculated: imagine a server that has a total of 4 CPUs with 12 cores each, running at 3.7 GHz. It produces a PP speed constant of 178:

$PP = \left\lceil \frac{4\ \mathrm{CPUs} \times 12\ \mathrm{cores} \times 3700\ \mathrm{MHz}}{1000} \right\rceil = 178$

° Obviously, if CPU throttling is used, the PP factor should be recalculated whenever throttling happens.

2 Overall Load factor (OL), which estimates the overall load of the machine. OL is calculated using the collected average response time of the web-server running on the node and the actual overall load of the machine. The reason we have to use OL is to get a real sense of how busy the overall system is, as the average web-server response time alone is not enough (imagine a web page that consumes a web-service call: the page will produce a lengthy response time while there is not much load on the local machine, since the page is just waiting on a remote service).

° Here is an example of how the OL factor is calculated: /proc/loadavg reports an average load of 10.50 for the node, the PP factor is 178, and the Cherokee web-server reports an average response time of 1000 ms over the last 500 ms interval. The overall load factor will be OL = 30:

$OL = \left\lceil \frac{10.50\ \mathrm{AvgLoad} \times 1000\ \mathrm{ms\ RspTime} \times 500\ \mathrm{ms\ Interval}}{1000 \times 178} \right\rceil = 30$

3 Each node, depending on its processing power, can tolerate a certain amount of load. MAX_OL is the maximum overall load that a machine can tolerate before becoming unresponsive or completely stalling and hanging. MAX_OL is obtained by regression testing, gradually increasing the traffic on the node while constantly monitoring the PP speed factor and the OL load factor.
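The following is a minimal Python sketch of how RouteEngine's indicators and its basic scheduling decision could be computed. The function names and the MAX_OL value are illustrative restatements of the description above, not the production daemon.

    import math

    def pp_factor(num_cpus, cores_per_cpu, clock_mhz):
        """Processing Power factor: PP = ceil(CPUs x cores x MHz / 1000)."""
        return math.ceil(num_cpus * cores_per_cpu * clock_mhz / 1000)

    def ol_factor(loadavg, avg_resp_ms, interval_ms, pp):
        """Overall Load factor from /proc/loadavg and web-server response time."""
        return math.ceil(loadavg * avg_resp_ms * interval_ms / (1000 * pp))

    pp = pp_factor(4, 12, 3700)           # -> 178, as in the example above
    ol = ol_factor(10.50, 1000, 500, pp)  # -> 30, as in the example above

    # Stay on Weighted Round-Robin until the node approaches 70% of its
    # regression-tested MAX_OL; then switch to Weighted Least-Connection
    # and activate extra nodes.
    MAX_OL = 100                          # hypothetical regression-test result
    scheduler = "wrr" if ol < 0.7 * MAX_OL else "wlc"
    print(pp, ol, scheduler)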

RouteEngine uses the PP factor to configure LVS with a basic Weighted Round-Robin traffic schedule. However, to protect against sudden spikes of traffic, RouteEngine always monitors the OL factor to make sure it does not reach 70% of MAX_OL; otherwise it switches to Weighted Least-Connection scheduling and activates extra nodes. The following steps are taken afterward:

1. If the traffic spike remains steady for more than 5 seconds while at least 70% of the nodes are at 70% of their MAX_OL, node activation is no longer linear but exponential, to the point that all accelerators are involved and even extra server activation is started if the traffic remains steady. The reason we use a golden ratio of 70% is that almost 30% of the machine's power is spent on context switching, and at 70% CPU usage users start feeling noticeable delays and lags.

2. A well-designed system should differentiate its critical services from ordinary ones. If quality of service for premium accounts is needed, separate virtual web-server domains should be used for premium services so that they can be easily distinguished from the ordinary ones. If RouteEngine notices that all nodes are at 85% of their MAX_OL, it instructs the web-servers to keep only the premium/essential services and to start dropping requests on ordinary domains until the system becomes stable again.

4.5 Power Consumption

Let us make a quick comparison between a power-hungry server and a tiny embedded computer. A typical powerful server can easily consume 450 Watts of power, assuming the room temperature does not increase. Table 2 shows a breakdown of the typical power consumption of a web server:

Component                                                   Estimated Module Wattage
Dual quad-core Intel Xeon E6510 clocked at 1.7 GHz,
  claimed to deliver 11,000 DMIPS [73, 74]                  220 W
Intel Server Motherboard S5520UR [75]                       80 W
Four SATA hard drives [76]                                  70 W
Two Intel Ethernet Server Adapter X520-T2 [77]              40 W
12 cooling fans                                             20 W
Two 95%-efficient power supplies to support fail-over       -

Table 2: Power Consumption of an Average Powerful Server

On the other hand, an average ARM-based embedded computer, used in this research to build the cluster of accelerators, uses only 5 Watts of power with all of the following modules enabled (the board uses much less energy, 2 Watts, while its CPU is in the idle state) [78]:

1 Marvell® 88F5182, a high-performance storage networking system engine based on the Marvell Feroceon® ARM9 CPU core and fully compliant with ARMv5TE. The CPU is clocked at 500 MHz and claimed to deliver 1,100 DMIPS [79, 80].

2 A cryptographic engine with hardware encryption and authentication engines to boost packet-processing speed. It has a dedicated DMA to feed the hardware engines with data from internal SRAM memory. It implements the AES, DES and 3DES encryption algorithms and the SHA-1 and MD5 hashing algorithms.

3 USB 2.0 ports that support both peripheral and host modes, with dedicated DMA for data movement between memory and ports.

4 An integrated single GbE (10/100/1000) MAC with dedicated DMA for data movement between memory and port, and automatic hardware checksumming on receive and transmit.

5 Two SATA II interfaces with two enhanced DMAs per SATA port and automatic command execution without host intervention. There are also two XOR DMAs, which are useful for RAID applications and automatic CRC-32 calculation. This allows the use of solid-state hard drives for fast caching.

Also note that a cluster of embedded computers can provide fine-grained power throttling, as each embedded computer in the cluster can be shut down or clocked down on demand in less busy hours, allowing less power usage compared to an always-on server. Using a cluster of embedded computers as helper accelerators in conjunction with powerful servers benefits both business and the environment. Business benefits from reduced energy costs, and the environment benefits from the reduction in energy consumption and from reusing the embedded computers after an upgrade of the server farm.

Fan-less embedded computers need less maintenance due to fewer moving parts. Also, unlike general servers, each embedded computer can be recycled and re-used in smaller devices, such as automotive or household appliances, to further delay environmental impacts.

The TS-7800 [81] is a single-board computer (SBC) that fully utilizes the Marvell® 88F5182 ARM architecture, with a maximum power consumption of 6 W at 5 V. The TS-7800 is a small fan-less embedded device providing 128 MB of DDR RAM and a 12k-LUT programmable FPGA for further cryptographic engine implementation. The system boots Linux in 700 ms from internal flash, allowing fast switching out of sleep mode, which uses only 200 mA. TS-7800 boards are used for modeling and as a proof of concept of the accelerators.

Figure 13 shows a logical view of the accelerator cluster in a real-life data grid. There is a local SSD/HDD assigned to each group of six accelerators to provide local caching. The entire cluster of accelerators also has access to SAN services to provide more effective storage access instead of simply using NFS-based file servers. There is a fail-over procedure between LVS load-balancers A and B via the heartbeat daemon. The web traffic can be distributed between the real servers and the accelerators on demand.

Figure 13: Logical view of the enterprise accelerator cluster in action (showing load-balancers A and B)

Figure 14 shows a picture of the proposed prototype, which uses an NFS-based file server (SAN-based storage services are very expensive and could not be used here). Chapter 5 gives more details on how the system was tested, explaining the testing procedures and the obtained results.

Figure 14: Real view of the accelerator cluster

CH.5: TESTING AND RESULTS

In order to have an accurate evaluation of the proposed system, the system has to be tested with real web traffic. To generate real web traffic, six laptops have been used. The laptops were booted via PXE using NFS-based Debian Linux, which allowed different configurations to be pushed and tested.

Figure 15 shows a setup for a simple testbed of a baseline powerful web-server. The NFS file server and PXE are used only to uniformly boot the test laptops, which generate the test traffic; the load-balancer ensures that the web-server does not get exhausted (i.e. it monitors the CPU load and response time of the web-server to perform admission control of the web traffic).

Figure 15: Simple test harness for a power-hungry web-server (components: gigabit switch, network file server, test traffic generator laptops, web server)

Figures 16 and 17 show a partial overview of the tester system. Different testing platforms have been used to achieve more accurate benchmarking of the proposed architecture.

Figure 17: Test laptops used for benchmarking

Figure 16: Test laptops

Figure 18 demonstrates a simple overview of a testbed for the accelerators. The NFS file server and PXE are used only to uniformly boot the test laptops, which generate the test traffic; the load-balancer ensures that the traffic is spread among the accelerators, and it also monitors their CPU load and response time to perform admission control of the web traffic. The accelerators are booted via TFTP and get their images from the NFS file server. For simplicity, the web pages are cached in the memory of the accelerators. We do this because our accelerators are currently not connected to a high-performance Storage Area Network (for financial reasons) and instead use a low-performance NFS file server. Frequently accessing files from an NFS device produces latency and unwanted overhead. To overcome this issue, at boot time we copy the entire top root directory of the web-server into /var/run/www, a tmpfs storage residing in RAM, to avoid fetching web pages from the NFS file server each time a web request arrives.
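A minimal sketch of this boot-time caching step follows; the tmpfs size and the source path /srv/www are hypothetical placeholders (run as root).

    import subprocess

    # Mount a RAM-backed tmpfs and copy the document root into it at boot.
    subprocess.check_call(
        ["mount", "-t", "tmpfs", "-o", "size=64m", "tmpfs", "/var/run/www"])
    subprocess.check_call(["cp", "-a", "/srv/www/.", "/var/run/www/"])
    # The web server is then pointed at /var/run/www, so pages are served
    # from RAM instead of the slow NFS back-end.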

Figure 18: Simple test harness for a cluster of web-server accelerators (components: load-balancer, gigabit switching, network file server, cluster of ARM web-server accelerators, test traffic generator laptops)

In this research we have used SpecWeb2009 and the Apache AB tool, in addition to our in-house custom-made benchmarker, to test and benchmark the cluster of accelerators and the web-server.

5.1 SpecWeb2009

The Standard Performance Evaluation Corporation (SPEC) is a non-profit organization that aims to produce, establish, maintain and endorse a standardized set of performance benchmarks for computers. SPECWeb is a benchmark for evaluating web server performance. Various workloads and web traffic are automatically generated using slave Clients coordinated by a master PrimeClient. SpecWeb has preset benchmarking web pages for Banking and E-Commerce that simulate real-world scenarios [73].

The PrimeClient initializes and controls the behavior of the traffic-generator Clients, runs initialization routines against the web server and BeSim (Back-End Simulator), and collects and stores the results of the benchmark tests. As a logical component, it may be located on a separate physical system from the Clients, or it may be on the same physical system as one of the Clients. The benchmark traffic-generator Clients run the application program that sends HTTP requests to the server and receives HTTP responses from the server. Clients record the response times and forward the results to the PrimeClient for further analysis. For portability reasons, the Client and PrimeClient programs are written in Java. One or more load-generating Clients may exist on a single physical system.

The Back-End Simulator (BeSim) is intended to emulate a back-end application server that the web server must communicate with in order to retrieve specific information needed to complete an HTTP response (customer data, for example). BeSim exists in order to emulate this type of communication between a web server and a back-end server. BeSim is interfaced with the web servers using the FastCGI interface, and it can be located on the same physical machine as the web server.

Unfortunately, we have found a design issue in SpecWeb2009 that does not allow SpecWeb to properly test systems behind load-balancers. SpecWeb2009 has to be aware of each BeSim module directly; this means the BeSim modules cannot go behind a load-balancer to appear as one unified service. As a result, real-world load-balancing techniques that utilize a direct IP routing schema cannot be applied or tested using SpecWeb2009. However, we have run SpecWeb2009 on a single accelerator and on one real web-server alone to provide a feeling of how well a single server accelerator works in comparison to a single powerful real web-server. A powerful six-core AMD Phenom clocked at 3.7 GHz with 16 GB of RAM clocked at 2.0 GHz provides an average response time of 167 ms on the SpecWeb2009 Banking benchmark, while the proposed single server accelerator produces a 1,083 ms average response time for the same benchmark.

5.2 Apache AB Benchmarker

The Apache Foundation has a program called AB, which is used to benchmark web servers. The AB program fetches web pages in high volumes to get a feeling of how well a web server can perform. Multiple requests can be sent concurrently to better simulate real-world web traffic. The AB program can also be instructed to close the TCP session for each request (i.e. no keep-alive feature) to further regression-test the web server and its underlying hardware. Manual proxy settings, cookie features, and POST and GET parameters are supported to simulate a full real-life web browser [82].

The AB program is used to determine the shortest response time of a powerful server in contrast to the proposed embedded cluster while the systems are not carrying heavy loads. The shortest response time of a powerful six-core AMD Phenom machine clocked at 3.7 GHz with 16 GB of RAM clocked at 2.0 GHz, while serving a simple dynamic login page, is 1.01 ms. The proposed embedded ARM machine (CPU clocked at 500 MHz, with 128 MB of SDRAM and Gigabit Ethernet) produces a 7.6 ms response time. The same test has been done using a simple static page, in which case the powerful AMD machine produced a best response time of 1.1 ms while the embedded ARM machine produced 4.3 ms. Please note that both machines had idle CPU usage; the goal of the test was to see how fast the machines could respond to a simple dynamic login page and to static pages.

Figure 19: Apache AB Benchmark - Static Page Test (AMD 6-core 3.7 GHz vs. ARM 500 MHz)

In order to compare the cluster of embedded ARM accelerators and our powerful AMD machine under heavy loads, we have to benchmark against the exact same traffic. Hence, the AB program is loaded onto our tester cluster to generate 100 million requests of the exact same type with a concurrency level of 300 (i.e. a cluster of six laptops booted via PXE using Debian Linux hosted on NFS, each sending 50 concurrent requests). In the case of benchmarking a static page (a graphical image with a file size of 2.1 kB), the AMD machine produced an average response time of 14.5 ms while the cluster of four ARM accelerators produced an average response time of 97.0 ms. In the case of benchmarking a dynamic page (a simple dynamic login page), the AMD machine produced an average response time of 91.4 ms while the cluster of four ARM accelerators produced an average response time of 706.2 ms. If the concurrency level changes to 400, the cluster of four ARM accelerators becomes overloaded and very slow. If the concurrency level changes to 600, the AMD machine gradually becomes slow and eventually stalls.

Figure 20: Apache AB Benchmark - Dynamic Page Test (AMD 6-core 3.7 GHz vs. ARM 500 MHz)

5.3 Custom Made Benchmarker

The Apache AB program is a simple tool that is not able to simulate gradual traffic shapes to test different routing schemes. SpecWeb2009 is designed to fully support traffic ramp-up and ramp-down to test advanced scenarios that need request dropping and traffic shaping; however, SpecWeb2009 is not able to test our cluster, as its BeSim module cannot go behind load-balancers. Hence we had to build our own fully custom web-server benchmarker.

Our benchmarker is designed to leverage a cluster of tester Slaves, which generate the traffic load. Each Slave has a preset schedule, which it obtains from the Prime Slave. The schedule is an XML file that contains the following options (a minimal Slave worker sketch follows the list):

1. The number of threads that the Slave can have. Each Slave can freely define the number of threads it wants to use to generate traffic load.

2. A list of the web pages to fetch, along with their POST/GET parameters if needed. Each thread has a separate list for itself.

3. A Keep-alive option that can be set to false for each web page, to force the Slave to close the connection after the request is sent.

4. Each thread can be repeated as many times as needed; hence a NumberOfTimeTorepeat parameter can be set for each thread if needed.

5. Each web page can have a parameter called WaitBeforeFetch, which suspends the thread for the given amount of time before it starts fetching the page. This helps shape the traffic in a granular fashion. The value is in milliseconds.

6. Each Slave can have a value called StartTime, which works as a timer that allows different start times for Slaves and their threads.

7. Each web page can have a parameter called TollerableThershold, which indicates the maximum desirable response time for the page. If the web-server responds within that limit, the response is marked acceptable. The value is in milliseconds.

8. Each web page can have a parameter called Timeout, which indicates the maximum time the Slave thread waits for the page to be fetched.
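As a rough illustration of a Slave worker, the following Python sketch fetches one scheduled page, honors WaitBeforeFetch, and classifies the response against TollerableThershold. The XML parsing and threading of the real tool are omitted, and the example URL is hypothetical.

    import time
    import urllib.request

    def fetch_scheduled(url, wait_before_fetch_ms, tollerable_thershold_ms,
                        timeout_ms):
        """Fetch one scheduled page; return (elapsed_ms, acceptable)."""
        time.sleep(wait_before_fetch_ms / 1000.0)       # WaitBeforeFetch
        start = time.time()
        with urllib.request.urlopen(url, timeout=timeout_ms / 1000.0) as resp:
            resp.read()
        elapsed_ms = (time.time() - start) * 1000.0
        return elapsed_ms, elapsed_ms <= tollerable_thershold_ms

    # Example (hypothetical page): 200 ms threshold, 5 s timeout.
    print(fetch_scheduled("http://192.168.1.100/login.php", 0, 200, 5000))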

The Prime Slave is the master of all the Slave programs. It is responsible for distributing the XML-based schedules to each Slave traffic generator and for collecting the results from the Slaves. The Prime Slave offers the following statistics (a short computation sketch follows the list):

1. The collection of all response times of fetched web pages.

2. The maximum and minimum response time of fetched pages within each worker thread and associated Slave.

3. The median, mean, standard deviation and variance of the response times of fetched pages within each worker thread and associated Slave.

4. A HitTollerableThershold variable, which is the total number of cases in which the response time exceeded the given desirable threshold within the worker thread and associated Slave.
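These per-thread statistics are straightforward to compute; the sketch below uses Python's standard statistics module over a list of collected response times (the sample values are made up).

    import statistics

    response_times_ms = [18.2, 21.5, 19.9, 250.3, 20.7]  # made-up samples
    threshold_ms = 200

    summary = {
        "min": min(response_times_ms),
        "max": max(response_times_ms),
        "mean": statistics.mean(response_times_ms),
        "median": statistics.median(response_times_ms),
        "stdev": statistics.pstdev(response_times_ms),
        "variance": statistics.pvariance(response_times_ms),
        # HitTollerableThershold: count of responses over the threshold.
        "hits_over_threshold": sum(t > threshold_ms for t in response_times_ms),
    }
    print(summary)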

5.4 Effectiveness Of Admission Control

To test how well admission control can improve server response time, we conducted a simple test case involving one accelerator. The regression test used 5,000 concurrent requests per second to query a static file with the keep-alive feature turned off. The accelerator had an average response time of 20 ms with an average CPU utilization of 45%, while only 1% of the requests exceeded the 200 ms response-time threshold (the maximum response time we experienced was 317 ms).

If the concurrency level is changed to 10,000 connections per second, the accelerator's average response time suddenly jumps to 34 ms, CPU utilization climbs to 86%, and around 5% of the connections exceed the 200 ms response-time threshold (the maximum response time we experienced was 3,542 ms). The situation gets worse as connections queue up. However, with admission control enabled, as soon as we redirect half of the requests to another accelerator, we experience almost the same response time as before. Please note that in our experiment it took at least 500 ms for the accelerator to get back to its normal state (i.e. the same state as with 5,000 concurrent requests), as it has to service the remaining queued-up requests.

This experiment shows the importance of admission control for the embedded cluster, as its nodes can get exhausted much faster than a powerful server. As long as load-balancing and admission control are involved in the accelerator cluster, we can expect to meet the SLA requirements easily. As shown in Table 3, by adding more accelerators to the cluster, we can handle more requests while avoiding exhaustion.

Concurrency Level                                     CPU           Avg. Response Time         Max. Experienced   Connections Exceeding
(connections per second)                              Utilization   (500 ms capture interval)  Response Time      200 ms Threshold
5,000 requests/sec                                    ~45%          ~20 ms                     317 ms             ~1%
10,000 requests/sec                                   86% and more  ~34 ms                     3,542 ms           ~5%
10,000 requests/sec, round-robin load-balancing
  and admission control between 2 accelerators        ~45%          ~20 ms                     317 ms             ~1%
20,000 requests/sec, round-robin load-balancing
  and admission control between 4 accelerators        ~44%          ~21 ms                     320 ms             ~1%

Table 3: Effectiveness of Admission Control

CH.6: CONCLUSION

Power consumption is the largest operating expense in any server farm and has to be reduced as much as possible. Expensive and power-demanding servers are not reusable and have to go through costly recycling procedures or even end up in landfills. Embedded hardware is less general, less expensive and less power-demanding, and it can be re-used in smaller devices. However, embedded hardware is harder to work with, as it is not as general as popular servers.

For simple tasks like serving web pages or less computationally intensive jobs (in general, Internet application serving), embedded hardware can be used. Our research shows that a general 500 MHz ARM embedded board is at least 4 times slower than a powerful server running six cores clocked at 3.7 GHz. However, for most jobs we do not care about instantaneous computing power; all that is needed is the throughput and torque of the system. Whether a page loads in 20 ms or in 100 ms will not make much difference to most Internet users, but it makes a large difference to the environment and to the cost of operation.

Our proposed cluster of embedded ARM boards consumes at least 25 times less energy than a powerful 500-Watt server while delivering responses that are 7 times slower for static pages or dynamic content. If the slowness factor is acceptable (i.e. loading a page in 630 ms instead of 90 ms), we can fully replace the entire server to reduce the power consumption. In the case of time-sensitive operations, the proposed cluster can work as server accelerators to bring more throughput and torque to the system at a lower cost.

CH.7: FUTURE WORK

Dynamic Voltage Scaling (DVS) is not efficient in modern systems where the overall power consumption includes a large portion of static power consumption. In modern servers, static power dissipation, even when the CPUs are idle, can be as much as 60% of the peak-time power usage. The Dynamic Power Management (DPM) scheme in [83] utilizes a sleep mode with efficient transitions in and out of sleep. DPM is not quite suitable for server farms that have many interconnected services and virtual machines. However, DPM has minimal downside for our proposed embedded cluster, as there are no VMs involved and each accelerator is doing almost the same job as the others. We would like to test DPM to further minimize the mean power consumption of our cluster.

CH.8: REFERENCES

[1] U.S. Census. "Estimate of U.S. Retail e-Commerce Sales," Last Checked on June 2011; http://www.census.gov.
[2] X. B. Fan, W. D. Weber, and L. A. Barroso, "Power Provisioning for a Warehouse-sized Computer," ISCA '07: 34th Annual International Symposium on Computer Architecture, Conference Proceedings, pp. 13-23, 2007.
[3] TelTub Inc. "TelTub Weblog at blog.teltub.com," Last Checked on June 2011; http://www.teltub.com.
[4] Tippit Inc. "The Google Datacenter in Oregon," Last Checked on June 2011; http://www.itmanagement.com/features/googles-oregon-datacenter110107.
[5] A. Gaspar, S. Langevin, W. Armitage et al., "The Role of Virtualization in Computing Education," SIGCSE '08: Proceedings of the 39th ACM Technical Symposium on Computer Science Education, pp. 131-132, 2008.
[6] H. Kim, E. J. Kim, and R. N. Mahapatra. "Power Management in RAID Server Disk System Using Multiple Idle States," Last Checked on June 2011; http://faculty.cs.tamu.edu/ejkim/HPC_WEB/docs/ucas-1.pdf.
[7] Q. B. Zhu, F. M. David, C. F. Devaraj et al., "Reducing energy consumption of disk storage using power-aware cache management," 10th International Symposium on High Performance Computer Architecture, Proceedings, pp. 118-129, 2004.
[8] A. S. Szalay, G. C. Bell, H. H. Huang et al., "Low-power Amdahl-balanced blades for data intensive computing," SIGOPS Oper. Syst. Rev., vol. 44, no. 1, pp. 71-75, 2010.
[9] X. R. Wang, and M. Chen, "Adaptive power control for server clusters," 2008 IEEE International Symposium on Parallel & Distributed Processing, Vols 1-8, pp. 2671-2675, 2008.
[10] D. Wang, "Meeting Green Computing Challenges," pp. 121-126.
[11] Advanced Micro Devices Inc. "AMD Opteron™ 6000 Series Platform," Last Checked on June 2011; http://www.amd.com/us/products/server/processors/6000-series-platform/pages/6000-series-platform.aspx.
[12] R. G. Dreslinski, D. Fick, D. Blaauw et al., "Reconfigurable Multicore Server Processors for Low Power Operation," Embedded Computer Systems: Architectures, Modeling, and Simulation, Proceedings, vol. 5657, pp. 247-254, 2009.
[13] D. Niyato, S. Chaisiri, and L. B. Sung, "Optimal Power Management for Server Farm to Support Green Computing," CCGrid 2009: 9th IEEE International Symposium on Cluster Computing and the Grid, pp. 84-91, 2009.
[14] W. Xiaorui, and W. Yefu, "Coordinating Power Control and Performance Management for Virtualized Server Clusters," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 245-259, 2011.
[15] Y. F. Wang, X. R. Wang, M. Chen et al., "PARTIC: Power-Aware Response Time Control for Virtualized Web Servers," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 323-336, Feb. 2011.
[16] J. Carter, and K. Rajamani, "Designing Energy-Efficient Servers and Data Centers," Computer, vol. 43, no. 7, pp. 76-78, Jul. 2010.
[17] J. Almeida, M. Dabu, and P. Cao, Providing differentiated levels of service in web content hosting, pp. 91-102, 1997.
[18] T. Voigt, and P. Gunningberg, "Adaptive resource-based web server admission control," ISCC 2002: Seventh International Symposium on Computers and Communications, Proceedings, pp. 219-224, 2002.
[19] M. Banatre, V. Issarny, F. Leleu et al., "Providing quality of service over the Web: a newspaper-based approach," in Selected papers from the sixth international conference on World Wide Web, Santa Clara, California, United States, 1997, pp. 1457-1465.
[20] Oracle Corporation. "Oracle TimesTen In-Memory Database," Last Checked on June 2011; http://www.oracle.com/technetwork/database/timesten.
[21] D. G. Andersen, J. Franklin, M. Kaminsky et al., "FAWN: A Fast Array of Wimpy Nodes," SOSP '09: Proceedings of the Twenty-Second ACM SIGOPS Symposium on Operating Systems Principles, pp. 1-14, 2009.
[22] E. N. Elnozahy, M. Kistler, and R. Rajamony, "Energy-efficient server clusters," Power-Aware Computer Systems, vol. 2325, pp. 179-196, 2003.
[23] A. Gandhi, M. Harchol-Balter, R. Das et al., "Optimal Power Allocation in Server Farms," SIGMETRICS/Performance '09: Proceedings of the 2009 Joint International Conference on Measurement and Modeling of Computer Systems, vol. 37, no. 1, pp. 157-168, 2009.
[24] L. Xue, H. Jin, S. Lui et al., "Adaptive Control of Multi-Tiered Web Applications Using Queueing Predictor," pp. 106-114.
[25] S. S. Martinez, J. S. Pareta, B. Otero et al., "Self-organized server farms for energy savings," in Proceedings of the 6th international conference industry session on Autonomic computing and communications industry session, Barcelona, Spain, 2009, pp. 39-40.
[26] H. Jamjoom, J. Reumann, and K. G. Shin, QGuard: Protecting Internet servers from overload, University of Michigan, 2000.
[27] K. Li, and S. Jamin, "A measurement-based admission-controlled Web server," pp. 651-659, vol. 2.
[28] A. Kamra, V. Misra, and E. M. Nahum, "Yaksha: a self-tuning controller for managing the performance of 3-tiered Web sites," pp. 47-56.
[29] K. J. Astrom, and B. Wittenmark, Adaptive Control, 2nd ed., Mineola, N.Y.: Dover Publications, 2008.
[30] T. Voigt, and P. Gunningberg, "Adaptive resource-based Web server admission control," pp. 219-224.
[31] L. Malrait, S. Bouchenak, and N. Marchand, "Experience with CONSER: A System for Server Control through Fluid Modeling," IEEE Transactions on Computers, vol. 60, no. 7, pp. 951-963, 2011.
[32] S. Elnikety, E. Nahum, J. Tracey et al., "A method for transparent admission control and request scheduling in e-commerce web sites," in Proceedings of the 13th international conference on World Wide Web, New York, NY, USA, 2004, pp. 276-286.
[33] J. Guitart, D. Carrera, V. Beltran et al., "Designing an overload control strategy for secure e-commerce applications," Computer Networks, vol. 51, no. 15, pp. 4492-4510, 2007.
[34] J. M. Blanquer, A. Batchelli, K. Schauser et al., "Quorum: flexible quality of service for Internet services," in Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2, 2005, pp. 159-174.
[35] J. Guitart, J. Torres, and E. Ayguade, "A survey on performance management for Internet applications," Concurrency and Computation: Practice and Experience, vol. 22, no. 1, pp. 68-106, 2010.
[36] L. Cai. "Building Energy-Efficient Web Servers Using Low Power Devices: Opportunities and Challenges," Last Checked on June 2011; http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1159&context=ecetr.
[37] M. Yeh. "Low-power Linux Server," Last Checked on June 2011; http://www.personal.psu.edu/kqyl/blogs/simplelife/2008/07/lowpower-linux-server-miniitx.html.
[38] D. Meisner, and T. F. Wenisch, "Does low-power design imply energy efficiency for data centers?," in Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design, Fukuoka, Japan, 2011, pp. 109-114.
[39] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, New York: Wiley, 1991.
[40] P. N. Paraskevopoulos, Digital Control Systems, London; New York: Prentice Hall, 1996.
[41] M. Allman, V. Paxson, and W. R. Stevens. "Request for Comments - TCP Congestion Control - RFC 2581," Last Checked on June 2011; http://tools.ietf.org/html/rfc2581.
[42] D. Comer, Internetworking with TCP/IP, 5th ed., Upper Saddle River, N.J.: Pearson Prentice Hall, 2006.
[43] L. Torvalds. "The Linux Kernel Project," Last Checked on June 2011; http://kernel.org/.
[44] Netfilter Organization. "Linux Packet Filtering Framework," Last Checked on June 2011; http://www.netfilter.org.
[45] A. L. Ortega. "Cherokee Webserver Benchmarks," Last Checked on June 2011; http://www.cherokee-project.com/benchmarks.html.
[46] Apache Software Foundation. "Apache HTTP Webserver," Last Checked on June 2011; http://httpd.apache.org.
[47] Microsoft Corporation. "Microsoft IIS Web Server," Last Checked on June 2011; http://www.iis.net.
[48] I. Sysoev. "NGINX HTTP Server," Last Checked on June 2011; http://nginx.org.
[49] Apache Software Foundation. "Apache Tomcat Application Server," Last Checked on June 2011; http://tomcat.apache.org.
[50] Oracle Corporation. "MySQL Database Server," Last Checked on June 2011; http://www.mysql.com.
[51] The Linux Virtual Server Org. "The Linux Virtual Server Project," Last Checked on June 2011; http://www.linuxvirtualserver.org.
[52] Wikimedia Foundation Inc. "Standard RAID Levels," Last Checked on June 2011; http://en.wikipedia.org/wiki/Standard_RAID_levels.
[53] Wikimedia Foundation Inc. "B-tree file system," Last Checked on June 2011; http://en.wikipedia.org/wiki/Btrfs.
[54] Wikimedia Foundation Inc. "Trivial File Transfer Protocol," Last Checked on June 2011; http://en.wikipedia.org/wiki/TFTP.
[55] Wikimedia Foundation Inc. "Preboot Execution Environment," Last Checked on June 2011; http://en.wikipedia.org/wiki/Preboot_Execution_Environment.
[56] Wikimedia Foundation Inc. "Simple Network Management Protocol," Last Checked on June 2011; http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol.
[57] DENX Software Engineering. "Das U-Boot, the Universal Boot Loader," Last Checked on June 2011; http://www.denx.de/wiki/U-Boot.
[58] Wikimedia Foundation Inc. "Linux Initial Ramdisk," Last Checked on June 2011; http://en.wikipedia.org/wiki/Initrd.
[59] D. Vlasenko. "BusyBox," Last Checked on June 2011; http://www.busybox.net.
[60] Linux Virtual Server Project. "Linux Virtual Server," Last Checked on June 2011; http://www.linuxvirtualserver.org.
[61] Linux Virtual Server Project. "IP Virtual Server," Last Checked on June 2011; http://www.linuxvirtualserver.org/software/ipvs.html.
[62] Linux Virtual Server Project. "Kernel TCP Virtual Server," Last Checked on June 2011; http://www.linuxvirtualserver.org/software/ktcpvs/ktcpvs.html.
[63] Linux Virtual Server Project. "Linux Virtual Server Scheduling," Last Checked on June 2011; http://www.linuxvirtualserver.org/docs/scheduling.html.
[64] Linux Virtual Server Project. "Linux Virtual Server How-to," Last Checked on June 2011; http://kb.linuxvirtualserver.org/wiki/Mini_Mini_Howto.
[65] Wikimedia Foundation Inc. "Linux Virtual Server for Wikipedia," Last Checked on June 2011; http://en.wikipedia.org/wiki/Linux_Virtual_Server.
[66] Linux Virtual Server Project. "Linux Virtual Server Direct Routing," Last Checked on June 2011; http://www.linuxvirtualserver.org/VS-DRouting.html.
[67] The PHP Group. "Hypertext Preprocessor," Last Checked on June 2011; http://www.php.net.
[68] P. Barford, and M. Crovella, "Generating representative Web workloads for network and server performance evaluation," SIGMETRICS Perform. Eval. Rev., vol. 26, no. 1, pp. 151-160, 1998.
[69] J. Hellerstein, Feedback Control of Computing Systems, New York: IEEE Press; Wiley, 2004.
[70] E. D. Lazowska, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Englewood Cliffs, N.J.: Prentice-Hall, 1984.
[71] V. A. F. Almeida, and D. A. Menasce, "Capacity planning an essential tool for managing Web services," IT Professional, vol. 4, no. 4, pp. 33-38, 2002.
[72] S. Lui, L. Xue, L. Ying et al., "Queueing model based network server performance control," pp. 81-90.
[73] Standard Performance Evaluation Corporation. "SpecWeb2009 - Benchmark for Evaluating Web Server Performance," Last Checked on June 2011; http://www.spec.org/web2009.
[74] Intel Corporation. "Intel® Xeon® Processor 6000 Sequence Specifications," Last Checked on June 2011; http://www.intel.com/p/en_US/products/server/processor/xeon6000/specifications.
[75] Intel Corporation. "Intel® Server MotherBoard S5520UR," Last Checked on June 2011; http://download.intel.com/support/motherboards/server/s5520ur/sb/e44031011_s5520ur_s5520urt_tps_r1_8.pdf.
[76] A. Karabuto. "HDD Power Consumption and Heat Dissipation," Last Checked on June 2011; http://ixbtlabs.com/articles2/storage/hddpower.html.
[77] Intel Corporation. "Intel® Ethernet Server Adapter X520-T2: 10 Gigabit BASE-T Ethernet Server Adapter Designed for Multi-Core Processors and Optimized for Virtualization," Last Checked on June 2011; http://www.intel.com/Assets/PDF/prodbrief/318349-004.pdf.
[78] Marvell Technology Group Ltd. "Marvell® Feroceon 88F5182," Last Checked on June 2011; http://www.marvell.com.
[79] R. Longbottom. "Dhrystone Benchmark Results On PCs," Last Checked on June 2011; http://www.roylongbottom.org.uk/dhrystone results.htm.
[80] J. Gainsbrugh. "Dhrystone Linux Benchmarks," Last Checked on June 2011; http://www.anime.net/~goemon/benchmarks.html.
[81] Technologic Systems. "EmbeddedArm TS-7800 SBC," Last Checked on June 2011; http://www.embeddedarm.com/products/board-detail.php?product=TS-7800.
[82] Apache Software Foundation. "Apache HTTP server benchmarking tool," Last Checked on June 2011; http://httpd.apache.org/docs/2.0/programs/ab.html.
[83] W. Shengquan, "Power Saving Design for Servers under Response Time Constraint," pp. 123-132.