Resource Allocation Solutions for Reducing Delay in Distributed Computing Systems (Thesis Proposal)

Takayuki Osogami Department of Computer Science Carnegie Mellon University [email protected]

April, 2004

Committee members: Mor Harchol-Balter (Chair), Hui Zhang, Bruce Maggs, Alan Scheller-Wolf (Tepper School of Business), Mark Squillante (IBM Research)

1 Introduction

Waiting time (delay) is a source of frustration for users who receive service via computer or communication systems. This frustration can result in lost revenue, e.g., when a customer leaves a commercial web site to shop at a competitor’s site. One obvious way to decrease delay is simply to buy (more expensive) faster machines. However, we can also decrease delay for free, with the given resources, by making more efficient use of those resources and by better scheduling of jobs (i.e., by changing the order in which jobs are processed). For single server systems, it is well understood how to minimize mean delay, namely by the shortest remaining processing time first (SRPT) scheduling policy. SRPT can provide a mean delay an order of magnitude smaller than a naive first-come-first-serve (FCFS) scheduling policy. Also, the mean delay under various scheduling policies, including SRPT and FCFS, can be easily analyzed for a relatively broad class of single server systems (M/GI/1 queues).

However, utilizing the full potential computing power of multiserver systems and analyzing their performance are much harder problems than for the case of a single server system. Despite the ubiquity of multiserver systems, it is not known how we should assign jobs to servers and how we should schedule jobs within each server to minimize the mean delay in multiserver systems. Also, it is not well understood how we can evaluate various assignment and scheduling policies for multiserver systems. In this thesis, we provide partial answers to these questions.

1.1 Multiserver architectures

In this thesis, we seek to minimize delay in distributed computing systems (multiserver architectures). Figure 1 shows four common models of multiserver architectures that we consider in this thesis.

(a) Network of workstations (NOW): There are n workstations and each workstation owns a queue of jobs. Each workstation usually processes its own jobs, but we also allow some workstations to help others (i.e., some workstations can process jobs from other workstations' queues). Examples of NOWs include local area networks in universities and companies.

(b) Server farm with distributed queues: There are n servers, and jobs arriving from outside the server farm are immediately dispatched to one of the servers. Examples of a server farm with distributed queues include high volume web servers.

[Figure 1: Four models of distributed computing systems that we consider in this thesis: (a) NOW; (b) server farm with distributed queues; (c) server farm with a central queue; (d) servers with affinities.]

(c) Server farm with a central queue: There are n servers and one central queue. Here, jobs arriving from outside the server farm wait in the central queue, and when one of the servers becomes available, a job is dispatched from the central queue to the available server. Examples of a server farm with a central queue include supercomputing centers.

(d) Servers with affinities: There are n servers and m classes of jobs. Here, jobs typically have different affinities with different servers, i.e., a job may be processed more quickly on one server than on another. Examples of servers with affinities include multiprocessor systems, where cache affinity can significantly affect processing speed, and call centers, where people with different abilities serve different types of requests.

For simplicity, we set n = 2 and m = 2 in the figure. These models are not exhaustive, but they cover a wide range of distributed computing systems.

1.2 Where does delay come from

We start by asking where delay comes from. Long waiting times are experienced when the system load is high, i.e., when the average arrival rate (jobs per second), λ, is high relative to the average service rate (jobs per second), µ (see Figure 2).
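To make the effect of load concrete, the following small sketch (our own illustration, not part of the proposal's analysis; the parameter values are arbitrary) evaluates the classical M/M/1/FCFS mean waiting time E[W] = ρ/(µ(1 − ρ)), the quantity plotted in Figure 2(b), and shows how it blows up as ρ approaches 1.

```python
def mm1_mean_wait(lam: float, mu: float) -> float:
    """Mean waiting time (excluding service) in an M/M/1/FCFS queue.

    E[W] = rho / (mu * (1 - rho)), valid only for rho = lam/mu < 1.
    """
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("queue is unstable: rho must be < 1")
    return rho / (mu * (1.0 - rho))


if __name__ == "__main__":
    mu = 1.0  # average service rate (jobs per second), as in Figure 2
    for rho in (0.2, 0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"rho = {rho:4.2f}  ->  E[W] = {mm1_mean_wait(rho * mu, mu):8.2f}")
```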

Long waiting times are also experienced when utilization of system resources is poor. When system resources cannot be fully utilized, the effective average service rate (jobs per second), µ0, becomes smaller than the potential average service rate, µ. (For example, the potential service capacity is eaten up by context switching time when switching from one type of job to another and by migration time when transferring jobs from one server to another.) As a result, λ can be high relative to µ0, which has the same effect as high load (λ high relative to µ), causing long waiting times. In fact, maximizing utilization does not necessarily minimize delay in distributed computing systems, and this makes the design of resource allocation mechanisms in distributed computing systems difficult. We will see later that there are situations where we want to keep some servers idle even in the presence of queued jobs, so that more important (e.g., short) future arrivals can receive service immediately upon arrival.

High load and poor utilization are not the only causes of long waiting times; we can experience long waiting times even when the average system load is low (see Figure 3). Long waiting times at low load are primarily due to variability in service demands and/or interarrival times, but other factors such as higher moments and correlation of service demands and interarrival times can also increase waiting times. Even if the long-run average load is not too high, fluctuation in the load can cause substantial delay, i.e., high instantaneous load can be problematic. In particular, variability and autocorrelation in interarrival times often cause fluctuations in load, and peak load and average load can differ by an order of magnitude.
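The role of service demand variability at a single FCFS server is quantified by the Pollaczek-Khinchine formula, E[W] = ρ E[S] (1 + C_S^2) / (2(1 − ρ)). The sketch below (our own illustration with arbitrary parameter values) shows how the mean wait grows with C_S^2 at fixed load, the effect plotted in Figure 3(a).

```python
def mg1_fcfs_mean_wait(rho: float, mean_size: float, scv: float) -> float:
    """Pollaczek-Khinchine mean waiting time for an M/G/1/FCFS queue.

    rho       -- system load (must be < 1)
    mean_size -- E[S], mean service demand
    scv       -- squared coefficient of variation of the service demand, C_S^2
    """
    if rho >= 1.0:
        raise ValueError("queue is unstable: rho must be < 1")
    return rho * mean_size * (1.0 + scv) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    rho, mean_size = 0.8, 1.0
    for cs2 in (1.0, 8.0, 64.0):  # C_S^2 values as labeled in Figure 3(a)
        print(f"C_S^2 = {cs2:5.1f}  ->  E[W] = {mg1_fcfs_mean_wait(rho, mean_size, cs2):8.1f}")
```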

1.3 Brief summary of prior work on minimizing delay

We briefly summarize prior work on minimizing delay by reducing the impact of high load and service demand variability, which are the primary sources of delay. A more detailed literature review will be provided in later sections as needed.

1.3.1 Minimizing delay by combating high load

When a system is overloaded, that is, when the arrival rate is higher than the service rate, we need to either decrease the arrival rate, degrade the service quality, or increase the service rate in order to keep the mean waiting time low. Below we classify common approaches to combating high load into these three types: decreasing the arrival rate, degrading the service quality, and increasing the service rate.

[Figure 2: (a) An M/M/1 queue; (b) the mean waiting time in an M/M/1/FCFS queue as a function of system load, ρ = λ/µ, where λ is the average arrival rate (jobs per second) and µ = 1.0 is the average service rate (jobs per second).]

One way to decrease the arrival rate is to reject some arrivals into the system; this approach is known as admission control and has been applied to various computer systems such as web servers [29, 30, 31, 103, 155, 158] and packet networks [81, 18, 19]. Admission control may be combined with a scheduling policy so that, rather than dropping arrivals at random, the scheduling policy determines which arrivals to drop [28]. Degrading the service quality during overload periods (e.g., by omitting pictures from web pages) is also popular at web servers and has been studied as content adaptation [1, 25]. An advantage of admission control and content adaptation is that they can be exercised within a single system.

When multiple systems are available, we can increase the service rate of an overloaded system by utilizing the resources of other systems. Load balancing and cycle stealing are two popular approaches that make use of multiple systems to mitigate the impact of overload. Load balancing mitigates the impact of overload in a system by sharing the load among many systems. Load balancing has been popular in networks of workstations (NOW) [21, 60] and is implemented in systems such as MOSIX [11] and Utopia [173].

[Figure 3: The mean waiting time in an M/G/1/FCFS queue (a) and in a G/M/1/FCFS queue (b) as a function of system load, ρ = λ/µ, where λ is the average arrival rate (jobs per second) and µ = 1.0 is the average service rate (jobs per second). The variability of the service demand is represented by the coefficient of variation, CS = σ(S)/E[S], where E[S] denotes the mean service demand and σ(S) denotes the standard deviation of the service demand; curves are labeled by C_S^2 (1, 8, and 64). The variability of the interarrival time is represented by CA, defined analogously. (In (b), the arrival process is assumed to be a batch Poisson process with geometric batch size.)]

A disadvantage of load balancing is that load from an overloaded system can slow down other, lightly loaded systems. This is a problem when users have ownership of their systems and expect a certain guaranteed quality of service when they use their own systems. This is a motivation for cycle stealing, which allows overloaded systems to steal only the idle cycles of other systems. Cycle stealing is implemented in systems such as Butler [122], Condor [104, 105], Sprite [42], Benevolent Bandit [49], Stealth Distributed Scheduler [98], and Linger-Longer [135]. Cycle stealing is also motivated by the observation of Mutka and Livny [118], who report that workstations are idle more than 75% of the time in the University of Wisconsin Computer Science Department.

1.3.2 Minimizing delay by combating service demand variability

Higher service demand variability can also cause longer waiting times, both under a single server and under distributed servers. This is because large jobs in service can block all the other jobs in queue for a long time. Below we classify common approaches to combating service demand variability according to whether preemption is required, i.e., whether the policy interrupts and subsequently resumes jobs (preemptive) or allows all jobs to run to completion once in service (nonpreemptive).

When jobs are allowed to be preempted, service demand variability is not an issue, since preemption allows small jobs to jump ahead of long jobs. For example, the mean response time in an M/G/1 queue under processor sharing (PS) is independent of the service demand variability. In fact, we can make use of the service demand variability to further improve the mean response time, for example via SRPT (shortest remaining processing time first) [95]. However, there are many real world settings where a job cannot be interrupted and subsequently resumed, for example in servers at supercomputing centers, at high-volume web sites [80, 150], scalable systems for computers within an organization [134], and telecommunication systems with heterogeneous servers [15].

When preemption is not allowed, there is always a chance that large jobs block all the other jobs in queue, no matter how we schedule the jobs. When multiple servers are available, however, we have flexibility in choosing which jobs should be processed on which server, which is known as task assignment. This flexibility may give us a way to combat service demand variability even when preemption is not allowed. Examples of task assignment policies that reduce the impact of service demand variability include the Dedicated policy [59, 139], where some hosts are designated as the “short hosts” and others as the “long hosts,” and short jobs are always sent to the short hosts and long jobs to the long hosts, preventing short jobs from waiting behind long jobs. Even when the job size is not known, a policy very similar to Dedicated, known as the TAGS policy (Task Assignment by Guessing Size), works almost as well when service demand has high variability [58].

1.3.3 Minimizing delay by combating interarrival time variability and correlation

Although it has been observed that irregularity (variability and correlation) in the arrival process can impact delay, little is known about how much the impact of arrival irregularity can be reduced, and how. For example, the above-mentioned approaches to combating overload can reduce the impact of arrival irregularity, since arrival irregularity often causes temporary overload; however, the quantitative effectiveness of these approaches is not well understood. Below, we briefly review papers that show the existence or impact of arrival irregularity.

Irregularity of arrival processes in computer systems has been observed in various forms, including long-range dependence, self-similarity, time-of-day effects, and flash crowds (also known as the Slashdot effect). Recent studies show long-range dependence or self-similarity in the arrival process or traffic of various computer and communication systems, including the Internet [38, 102, 128, 165, 166], supercomputing centers [149], web servers [78, 79], and MPEG streams [99]. Squillante et al. show that correlation in the arrival process can have a big impact on system performance [149]. Time-of-day effects [22] have been observed in web servers [5, 79, 78] and supercomputing centers [48, 69, 167]. Flash crowds [5] and the Slashdot effect [156] refer to the phenomenon that it is not uncommon to experience a more than 100-fold increase in demand when a site suddenly becomes popular; these effects are often observed at web servers.
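As a small illustration of how burstiness shows up in an arrival process, the sketch below (our own helper, not part of the proposal; function and variable names are invented) simulates a batch Poisson process with geometric batch sizes, the arrival process assumed in Figure 3(b), and estimates the coefficient of variation of the interarrival times.

```python
import random


def batch_poisson_interarrivals(rate: float, mean_batch: float, n_batches: int, seed: int = 1):
    """Interarrival times of a batch Poisson process.

    Batches arrive as a Poisson process with the given rate; each batch contains
    a geometrically distributed number of jobs (mean mean_batch), and the jobs
    within a batch arrive simultaneously (interarrival time 0).
    """
    rng = random.Random(seed)
    p = 1.0 / mean_batch                     # geometric success probability on {1, 2, ...}
    gaps = []
    for _ in range(n_batches):
        gaps.append(rng.expovariate(rate))   # time since the previous batch
        batch = 1
        while rng.random() > p:              # draw the geometric batch size
            batch += 1
        gaps.extend([0.0] * (batch - 1))     # the rest of the batch arrives together
    return gaps


if __name__ == "__main__":
    gaps = batch_poisson_interarrivals(rate=1.0, mean_batch=4.0, n_batches=100_000)
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    print(f"estimated C_A = {var ** 0.5 / mean:.2f}")  # well above 1, i.e., burstier than Poisson
```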

1.4 Resource allocation mechanisms for minimizing delay

In this thesis, we consider the following resource allocation mechanisms, which determine (loosely speaking) a mapping between jobs and the servers on which they will be run.

• Cycle stealing: A cycle stealing mechanism is a particular resource allocation mechanism where a server processes jobs other than its own when the server's own queue is empty (the “beneficiary” server is said to steal the idle CPU cycles of the “donor” server). We consider three cycle stealing mechanisms with different objectives and assumptions.

– Cycle stealing for balancing load in NOW: This cycle stealing mechanism allows a lightly loaded workstation (donor) to help a heavily loaded workstation (beneficiary) when the donor has no jobs of its own. This cycle stealing mechanism balances load, decreasing the load of the beneficiary and increasing the load of the donor. The delay at the beneficiary server is reduced due to the decreased load, while the delay at the donor server ideally stays the same, since the beneficiary only steals idle CPU cycles of the donor.

– Cycle stealing for increasing utilization under restrictive task assignment policies in server farms with distributed queues: When service demand has high variability and jobs cannot be preempted, allocating one server (the “small” server) to small jobs and another server (the “large” server) to large jobs (the Dedicated policy) in a server farm can significantly decrease the mean delay, since it prevents small jobs from waiting behind large jobs. A disadvantage of the Dedicated policy is that it may lead to low utilization of resources (CPU time). We consider a cycle stealing mechanism that increases utilization by allowing the “large” server to process small jobs when there are no large jobs. This increases utilization while still preventing small jobs from waiting behind large jobs.

– Cycle stealing for increasing utilization under restrictive task assignment policies in server farms with a central queue: We consider the above cycle stealing mechanism for increasing utilization also in server farms with a central queue.

• Threshold-based policies: Threshold-based policies are a generalization of cycle stealing mechanisms. While cycle stealing mechanisms allow a donor to help a beneficiary if and only if the donor queue is empty, threshold-based policies allow a donor not to help a beneficiary even if the donor queue is empty, and they allow a donor to help a beneficiary even if the donor has its own jobs. We consider three or more threshold-based policies with different objectives and assumptions.

– Threshold-based policy for reducing switching cost in NOW: In networks of workstations, the donor workstation might not want to switch between its own jobs and the beneficiary jobs too frequently, since switching may require additional time such as context switching time and checkpointing time. We consider a threshold-based policy that allows a donor server to help a beneficiary less frequently by setting a threshold TB on the beneficiary queue and a threshold TD on the donor queue. Now, the donor switches to the beneficiary queue only when there are at least TB beneficiary jobs and there are no donor jobs, and the donor switches back to the donor queue only when there are at least TD donor jobs.

– Threshold-based policy for prioritizing small jobs in servers with affinities: The cycle stealing mechanisms for increasing utilization in server farms that we introduced above allow a “large” server to process small jobs only when there are no large jobs. Allowing the “large” server to process small jobs even in the presence of large jobs can further decrease the mean delay, since many small jobs then avoid waiting behind a single large job at only a small cost to that large job. However, biasing too much towards small jobs can starve large jobs, resulting in infinite mean delay for large jobs. We consider a threshold-based policy that allows a “large” server to process small jobs if and only if there are at least TS small jobs, in a more general model than a server farm, namely servers with affinities.

– Threshold-based policies adaptive to fluctuating load in servers with affinities: An advantage of a threshold-based policy is that it allows us to optimize the overall performance by setting the appropriate threshold value for given environmental parameters such as loads and service demand distributions. A disadvantage is that the optimal threshold value depends on those environmental parameters: a threshold-based policy whose threshold is optimized for incorrect (mispredicted) settings can provide performance far from optimal. We hope to propose new threshold-based policies that are adaptive to environmental changes, such as changes in load, while providing near optimal performance at the estimated settings. We will be particularly interested in the case where jobs have affinity towards particular servers (i.e., they can be processed faster on some servers than on others).

– Other threshold-based policies: We consider other threshold policies as needed.

We design various resource allocation mechanisms with various objectives and assumptions (e.g., whether or not jobs can be preempted; whether or not there are switching costs). The effectiveness of each resource allocation mechanism is demonstrated under only one or two of the four multiserver architectures in Figure 1. We hope, however, that intuitions obtained through the study under one multiserver architecture are also useful in understanding the effectiveness of the mechanism under the other multiserver architectures. We briefly describe how each resource allocation mechanism demonstrated on one multiserver architecture might also be applied in other multiserver architectures.

1.5 Evaluating the effectiveness of resource allocation mechanisms: New analysis methods

A primary contribution of this thesis is to quantitatively analyze the effectiveness of the resource allocation mechanisms introduced in Section 1.4. The difficulty in the performance analysis of distributed computing systems often comes from the large (infinite) state space required to capture the state of the system. When there are n types of jobs, the state space often grows infinitely in n dimensions, and known approaches either are computationally expensive or have poor accuracy. We propose two (or more) new analysis techniques that are widely applicable in the analysis of computer system performance, and we apply these techniques to analyze system performance under various resource allocation mechanisms. The first technique is recursive dimensionality reduction (RDR) [163]. RDR reduces an n-dimensionally infinite state space to a one-dimensionally infinite state space, which we can analyze efficiently via a known algorithmic method. In the case where RDR is used to reduce a two-dimensionally infinite state space to a 1D-infinite state space, we simply refer to it as dimensionality reduction (DR). In applying RDR, it is crucial to model general distributions (such as service demands, interarrival times, and durations of busy periods) by phase-type (PH) distributions, which are mixtures and/or convolutions of exponential distributions. Our second technique allows us to find a minimal PH distribution whose first three moments match those of a given distribution [124, 123].
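To fix ideas, a PH distribution is specified by an initial probability vector α over the transient phases and a subgenerator matrix T, and its k-th moment is E[X^k] = k! α(−T)^(−k) 1. The following minimal sketch (our own illustration; the parameter values are made up) evaluates these moments for the 2-phase Coxian distribution of Figure 8.

```python
import math

import numpy as np


def ph_moment(alpha: np.ndarray, T: np.ndarray, k: int) -> float:
    """k-th moment of a PH distribution with representation (alpha, T):
    E[X^k] = k! * alpha (-T)^{-k} 1."""
    inv_neg_T = np.linalg.inv(-T)
    ones = np.ones(T.shape[0])
    return float(math.factorial(k) * alpha @ np.linalg.matrix_power(inv_neg_T, k) @ ones)


if __name__ == "__main__":
    # 2-phase Coxian of Figure 8: rate mu1 in phase 1, then with probability p
    # move to phase 2 (rate mu2), otherwise absorb.  Parameter values are arbitrary.
    mu1, mu2, p = 2.0, 0.5, 0.4
    alpha = np.array([1.0, 0.0])
    T = np.array([[-mu1, p * mu1],
                  [0.0,  -mu2]])
    for k in (1, 2, 3):
        print(f"E[X^{k}] = {ph_moment(alpha, T, k):.4f}")
```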

Organization

Figure 4 shows a proposed organization of the thesis. In each box, we show the resource allocation mechanisms or other topics that we consider (top), the multiserver architectures that we consider (figures), the primary analysis techniques (bottom), and the fraction of already completed work (top-right corner) for the corresponding section(s). After the introduction, we first study cycle stealing mechanisms in Sections 2-4. In Section 2, we introduce dimensionality reduction (DR), which we use to analyze various resource allocation mechanisms throughout the thesis. We illustrate DR by applying it to analyze the cycle stealing mechanism for balancing load. In Section 3, we evaluate the cycle stealing mechanism for balancing load when switching time is required for the donor server to switch between working on the donor jobs and the beneficiary jobs. Sections 2-3 are based on the paper presented at ACM SIGMETRICS 2003 [125]. In Section 4, we propose cycle stealing mechanisms for increasing utilization under restricted task assignment policies in server farms with distributed queues and with a central queue, and we analyze their effectiveness using DR. This section is based on two papers presented at IEEE ICDCS 2003 [61] and ACM SPAA 2003 [62].

In Sections 5-6, we study threshold-based policies, which are a generalization of the cycle stealing mechanisms that we introduce in Sections 2-4. In Section 5, we propose a threshold-based policy for reducing switching cost in a NOW and analyze its effectiveness via DR. We also characterize good threshold values for minimizing the overall mean delay. This section is based on the paper submitted to the Performance Evaluation journal [126]. In Section 6, we will analyze the performance under a threshold-based policy for prioritizing small jobs in servers with affinities. I propose to work on this problem as a part of my thesis work.

So far, we have assumed Poisson arrival processes. However, in real computer systems, it is observed that arrival processes are not stationary and often exhibit burstiness. Since the quantitative characteristics of the impact of variability and correlation in arrival processes are not fully understood, we will study the impact of variable arrival processes on system performance in Sections 7-8. In Section 7, we will study the impact of variability and correlation in arrival processes under a single server with the first-come-first-serve (FCFS) service discipline and with priority scheduling. For priority scheduling, we will limit our focus to the impact of variability in arrival processes. I propose to work on this problem as a part of my thesis work. In Section 8, we will study the impact of variability and correlation in arrival processes under multiserver systems (in particular, a server farm with a central queue) with FCFS. Since the impact of variability in arrival processes under FCFS is well studied in the literature, I propose to study the impact of correlation under FCFS.

In Section 9, we hope to propose new threshold-based policies that are adaptive to fluctuating load, caused by variability and correlation in arrival processes, in servers with affinities. I propose to work on this problem as a part of my thesis work. In Section 10, we summarize possible extensions to the analysis in the previous sections. Here, we introduce recursive dimensionality reduction (RDR). While DR allows us to analyze systems of two servers and two classes of jobs, RDR allows us to analyze systems with more servers and more classes of jobs. This section is based on two working papers [127, 163].

In Sections 11-12, we introduce a technique for approximating general distributions by phase-type (PH) distributions. Approximating general distributions by PH distributions is a basis of the analysis techniques, including DR and RDR, that we use throughout the thesis. These sections are based on two papers presented at Performance TOOLS 2003 [124, 123], but they also describe the idea of extending the results in those two papers. I propose to work on improving the results in the two papers as a part of my thesis work.

[Figure 4: Organization of this thesis. Cycle stealing mechanisms (Sections 2-3: for balancing load; Section 4: for increasing utilization; analysis technique: dimensionality reduction; 100% completed). Threshold-based policies (Section 5: for reducing switching cost; Section 6: for prioritizing small jobs; analysis technique: dimensionality reduction; 50% completed). Impact of variable arrival processes (Section 7: impact under a single server; Section 8: impact under multiserver FCFS; analysis techniques: matrix analytic method, transform methods, diffusion approximation; 0% completed). Threshold-based policies adaptive to fluctuating load and extensions to more servers and more classes (Section 9.1: for adapting to fluctuating load; Section 9.2: other policies; Section 9.3: comparison; Section 10: extensions to more servers and more classes; analysis techniques: recursive dimensionality reduction, dimensionality reduction + Neuts). Approximating general distributions by PH distributions (Section 11: characterizing PH distributions; Section 12: moment matching algorithm; 80% completed).]

2 Cycle stealing for balancing load: Analysis via dimensionality reduction (Completed work)

In this section, we analyze the performance of a network of workstations under a cycle stealing mechanism for balancing load. A primary goal of this section is to introduce the dimensionality reduction (DR) technique that we use throughout the thesis to evaluate various resource allocation mechanisms. Specifically, we analyze the following model of cycle stealing. We assume there are two queues, the beneficiary queue and the donor queue, with independent arrival processes and service time distributions, operating as M/M/1/FCFS queues (see Figure 5). Jobs arrive at average rate λB

(respectively, λD) at the beneficiary (respectively, donor) queue and have service requirements exponentially distributed with rate µB (respectively, µD). The load made up of beneficiary (respectively, donor) jobs is denoted by ρB (respectively, ρD), where ρB = λB/µB and ρD = λD/µD. Our cycle stealing mechanism is characterized by the following set of rules, all of which are enforced preemptively (preemptive-resume):

• The beneficiary server processes only beneficiary jobs.

• The donor server processes beneficiary jobs if

– the donor queue is empty, and

– there are at least two beneficiary jobs.

Otherwise, the donor server processes donor jobs.

The difficulty of the analysis of computer systems often comes from the large state space required to capture the state of the system. In this section, we introduce the dimensionality reduction (DR) technique, which reduces a Markov chain with a two-dimensionally infinite state space into a Markov chain with a one-dimensionally infinite state space (a quasi-birth-and-death (QBD) process with a finite number of phases; see Footnote 1), which we can analyze efficiently via the matrix analytic method [100, 120].

Footnote 1: A QBD process is a continuous time Markov chain on a state space with a finite or infinite number of phases and an infinite number of levels, where there are transitions within each level and between two consecutive levels, and the structure of the transitions is the same for all levels (see Figure 6).


[Figure 5: A cycle stealing mechanism for balancing load in a NOW.]

The key idea in our approach is to find a way to transform a 2D-infinite Markov chain into a 1D-infinite Markov chain that can be analyzed efficiently. Observe that the Markov chain that captures the behavior of our cycle stealing mechanism has a state space (number of beneficiary jobs, number of donor jobs) that grows infinitely in two dimensions (see Figure 7(a)).

2.1 Related work

Below, we review analytical techniques for evaluating systems involving multi-dimensionally infinite state spaces. In particular, we review analyses of the coupled processor model and of priority scheduling in multiserver systems. In both models, popular approaches are the transform method, the reduction to a boundary value problem, and the matrix analytic method. The transform method and the reduction to a boundary value problem lead to mathematically elegant expressions, but evaluating these expressions often suffers from numerical instability. Also, the reduction to a boundary value problem can only be applied to 2D-infinite Markov chains. The matrix analytic method is an algorithmic approach for evaluating a broad class of stochastic processes such as QBD processes. Although the theory of the matrix analytic method has been developed for more general classes of stochastic processes, including nonhomogeneous (level-dependent) QBD processes, M/G/1 and G/M/1 type processes, and tree processes, the matrix analytic method is most efficient (in evaluating the solution) and simplest (in implementing the algorithm) when it is applied to QBD processes with a (small) finite number of phases.

[Figure 6: A quasi-birth-and-death (QBD) process: an infinite number of levels, a finite or infinite number of phases, and the same transitions within levels and between consecutive levels for all levels.]

Therefore, most papers that evaluate multiserver systems (with a 2D-infinite state space) via the matrix analytic method first truncate the state space so that the resulting process becomes a QBD process with a (small) finite number of phases.

The “coupled processor model” is related to our cycle stealing model. Here, two processors each serve their own class of jobs, and if either is idle it may help the other, increasing the rate of the other processor. Early analytical work on the coupled processor model includes papers by Konheim, Meilijson, and Melkman [96] and by Fayolle and Iasnogorodski [45]. Both papers assume exponential service times. Konheim, Meilijson, and Melkman [96] apply a uniformization technique to determine the generating function, while Fayolle and Iasnogorodski [45] reduce the problem to a Riemann-Hilbert boundary value problem. Cohen and Boxma [36] extend this work to the case of general service times; they consider the stationary workload, which they formulate as a Wiener-Hopf boundary problem. More recently, Borst, Boxma, and van Uitert [17] apply a transform method to the expressions in [36], yielding asymptotic relations between the workloads and the service requirement distributions. Borst, Boxma, and Jelenkovic [16] consider a similar question under generalized processor sharing. Both of these papers are concerned with the asymptotic behavior of the workload, whereas our work isolates mean response time; our work is thus complementary to these results. Rao and Posner [129] analyze the coupled processor model by the matrix analytic method, placing an upper bound on the number of jobs in a queue so that the resulting process is a QBD process with a finite number of phases.

Analysis of priority scheduling under multiple servers also involves dealing with a multi-dimensionally infinite state space. Early analytical work on multiserver priority scheduling includes papers by Davis [40] and Buzen and Bondi [20] for nonpreemptive priority, and by Segal [140] and Cobham [35] for preemptive priority. All of these papers assume equal mean exponential service times. Assuming finite buffers also leads to analytical tractability [91]. In the case of infinite buffers and different service times, transform methods, reduction to boundary value problems, and the matrix analytic method are popular approaches. Gail et al. [55] and Mitrani and King [115] apply a transform method to preemptive priority systems, and Gail et al. [54] and Kao and Wilson [90] apply a transform method to nonpreemptive priority systems. Fayolle, King, and Mitrani [46] analyze a 2-class M/M/c system, where class 1 has preemptive priority over class 2 on k servers and class 2 has preemptive priority over class 1 on the remaining c − k servers, by reducing the problem to a Riemann-Hilbert boundary value problem. Nain [119] analyzes the workload in a 2-class preemptive single server with a threshold, where class 1 has preemptive priority over class 2 if the number of class 1 jobs is greater than the threshold and class 2 has priority otherwise; Nain also reduces the problem to a Riemann-Hilbert boundary value problem. Most approaches that make use of the matrix analytic method truncate the state space by explicitly placing a limit on the number of jobs, so that the resulting process becomes a QBD process with a finite number of phases [89, 88, 101, 113, 121]. An exception is recent work by Sleptchenko et al. [144], who combine the transform method and the matrix analytic method.

2.2 Summary of results

DR makes use of the matrix analytic method but differs from all the above-mentioned work in the following three points.

1. DR does not require truncating the state space.

2. DR reduces a 2D-infinite Markov chain (QBD process with an infinite number of phases) to a 1D-infinite Markov chain (QBD process with a finite number of phases).

[Figure 7: Markov chains for the cycle stealing model for balancing load. (a) The 2D-infinite Markov chain tracking both the number of beneficiary jobs and the number of donor jobs. (b), (c) The 1D-infinite Markov chain tracking the exact number of beneficiary jobs and binary information (zero or at least one) on the number of donor jobs; in (b) the donor busy period BD is drawn as a single (exponential) transition, while in (c) BD is represented by a 2-phase PH distribution with Coxian representation (see Figure 8).]

3. DR extends to nD-infinite Markov chains (recursive dimensionality reduction, RDR (see Section 10)). RDR reduces an nD-infinite Markov chain to a QBD process with a finite number of phases.

We briefly describe how DR can be applied to reduce the 2D-infinite Markov chain in Figure 7(a) to a 1D-infinite Markov chain. Figure 7(b) shows a 1D-infinite Markov chain that we obtain from the 2D-infinite Markov chain in Figure 7(a) via DR. Observe that this 1D-infinite chain exactly tracks the number of beneficiary jobs, while only providing binary information on the number of donor jobs (specifically, whether there are zero or at least one donor job). The 1D-infinite chain relies on a new type of transition, which is marked in bold and labeled BD.

This BD transition denotes the time from when the donor server has at least one job until the donor server is free, namely a donor busy period. In Figure 7(b) it is assumed that the busy period durations are exponentially distributed. However, the busy period duration, BD, is not exponentially distributed in general. We therefore first compute the first three moments of BD and then find a phase-type (PH) distribution that matches these moments (see Section 12).
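Since the donor queue in this model is an M/M/1 queue, the first three moments of BD have simple closed forms (the standard M/M/1 busy-period moments). The following sketch, which is our own helper rather than part of the thesis analysis, evaluates them; note that the busy period has squared coefficient of variation greater than 1, which is why a PH fit on three moments is preferred over a single exponential.

```python
def mm1_busy_period_moments(lam: float, mu: float):
    """First three moments of an M/M/1 busy period with arrival rate lam
    and service rate mu (rho = lam/mu < 1):

        E[B]   = 1 / (mu (1 - rho))
        E[B^2] = 2 / (mu^2 (1 - rho)^3)
        E[B^3] = 6 (1 + rho) / (mu^3 (1 - rho)^5)
    """
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("busy period has infinite mean: rho must be < 1")
    m1 = 1.0 / (mu * (1.0 - rho))
    m2 = 2.0 / (mu ** 2 * (1.0 - rho) ** 3)
    m3 = 6.0 * (1.0 + rho) / (mu ** 3 * (1.0 - rho) ** 5)
    return m1, m2, m3


if __name__ == "__main__":
    m1, m2, m3 = mm1_busy_period_moments(lam=0.5, mu=1.0)  # e.g., rho_D = 0.5
    scv = m2 / m1 ** 2 - 1.0
    print(f"E[BD] = {m1:.3f}, E[BD^2] = {m2:.3f}, E[BD^3] = {m3:.3f}, C^2 = {scv:.3f}")
```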

[Figure 8: A PH distribution is the distribution of the absorption time in a continuous time Markov chain. The figure illustrates a 2-phase PH distribution with Coxian representation, where the ith state has an exponentially distributed sojourn time with rate µi for i = 1, 2; from state 1 the process moves to state 2 with probability p and is absorbed with probability 1 − p. The absorption time is the sum of the times spent in each of the states visited, starting at state 1.]

Figure 7(c) shows the 1D-infinite Markov chain where the busy period transitions have been replaced by a 2-phase PH distribution with Coxian representation (see Figure 8). The limiting probabilities of the Markov chain in Figure 7(c) can be used to calculate the mean number of jobs of each type, E[NB] and E[ND], which in turn gives the mean response time via Little's law [94]. The limiting probabilities of the 1D-infinite Markov chain can be obtained efficiently via the matrix analytic method [100]. Deriving E[NB] from the limiting probabilities is straightforward, since we track the exact number of beneficiary jobs. We derive E[ND] by conditioning on the state of the chain: in the top row (states (iB, 0D) for i = 0, 1, 2, ...), the number of donor jobs is zero; in the remaining states, the mean number of donor jobs is the same as the mean number of jobs in an M/M/1 system with arrival rate λD and service rate µD, given that the system is busy. Throughout we have assumed exponential job sizes, but this can easily be extended to general distributions: extending the job size distributions to general distributions can be done in a similar way as we extend the busy period distributions in Figure 7(b)-(c). An interesting open problem is to derive a bound on the error in the mean response time obtained by DR when k moments of the busy periods are matched, in particular for k = 3; ideally, the error would be expressed as a function of k. (This problem will not be addressed in the thesis.)
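For completeness, the core matrix analytic step can be sketched as follows. With the convention that A0, A1, and A2 are the generator blocks for transitions one level up, within a level, and one level down, the matrix R is the minimal nonnegative solution of A0 + R A1 + R^2 A2 = 0, and the stationary probabilities of the repeating levels satisfy π_{n+1} = π_n R. The code below is a simplified illustration of our own (the boundary levels, which require separate balance equations, are omitted), using the classical fixed-point iteration for R.

```python
import numpy as np


def qbd_rate_matrix(A0: np.ndarray, A1: np.ndarray, A2: np.ndarray,
                    tol: float = 1e-12, max_iter: int = 100_000) -> np.ndarray:
    """Minimal nonnegative solution R of A0 + R A1 + R^2 A2 = 0 for a QBD process.

    A0, A1, A2 are the generator blocks for transitions one level up, within a
    level, and one level down, respectively (all square, same size).
    Uses the simple iteration R <- -(A0 + R^2 A2) A1^{-1}, starting from R = 0.
    """
    A1_inv = np.linalg.inv(A1)
    R = np.zeros_like(A0)
    for _ in range(max_iter):
        R_next = -(A0 + R @ R @ A2) @ A1_inv
        if np.max(np.abs(R_next - R)) < tol:
            return R_next
        R = R_next
    raise RuntimeError("R iteration did not converge")


if __name__ == "__main__":
    # Toy single-phase example: an M/M/1 queue with lam = 0.5, mu = 1.0 viewed
    # as a QBD; here R reduces to the scalar rho = lam/mu.
    lam, mu = 0.5, 1.0
    A0 = np.array([[lam]])            # level up (an arrival)
    A1 = np.array([[-(lam + mu)]])    # within level
    A2 = np.array([[mu]])             # level down (a departure)
    print(qbd_rate_matrix(A0, A1, A2))  # approximately [[0.5]]
```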

3 Cycle stealing for balancing load with switching cost (Completed work)

In this section, we seek to understand the benefit of cycle stealing for the beneficiary and the penalty to the donor. Our analysis relies on the dimensionality reduction (DR) technique introduced in Section 2. Although cycle stealing provides obvious benefits to the beneficiary, these benefits come at some cost to the donor. For example, the beneficiary's job may have to be checkpointed and the donor's working set memory reloaded before the donor can resume, delaying the resumption of processing of donor jobs. In our model we refer to these additional costs associated with cycle stealing as switching costs. We consider the same cycle stealing mechanism as in Section 2, except that switching time is required for the donor server to switch between the donor jobs and the beneficiary jobs (see Figure 9). When the donor server becomes idle, the donor transitions into a switching state for a random amount of time, Ksw. After Ksw time, the donor server is available to work on the beneficiary queue, and the beneficiary queue becomes an M/M/2 queue. When there is a new arrival at the donor queue (either during Ksw or during the time the donor is helping the beneficiary), the donor transitions into a switching back state for a random amount of time, Kba. After the completion of the switch back, the donor server resumes working on its own jobs until the donor queue is empty. We first review related work on cycle stealing mechanisms in Section 3.1, and then briefly summarize our results on the effectiveness of the cycle stealing mechanism in Section 3.2.

3.1 Related work

Three types of cycle stealing mechanisms for networks of workstations (NOWs) have been considered in the literature, depending on the behavior when an idle workstation is reclaimed by its owner. Below, we review the three types of cycle stealing for NOWs as well as other cycle stealing mechanisms that are not for NOWs.

[Figure 9: A cycle stealing mechanism for balancing load in a NOW with switching costs.]

In the first type of cycle stealing mechanism, the jobs that have been running on an idle server are killed when the idle server is reclaimed by its owner. Butler [122] is a representative implementation of this type of cycle stealing. This cycle stealing model is also studied by Bhatt et al. [14] and Rosenberg [131, 132, 133], who consider the theoretical problem of an optimal cycle stealing strategy. This type of cycle stealing is the most conservative in that the cost (time) required for the owner to reclaim his idle machine is smallest. However, in this model, all the work that has been done on the idle workstation is lost on reclaiming. In the second type of cycle stealing, the jobs that have been running on an idle workstation are instead migrated, keeping their state, to other idle workstations. Condor [104, 105], Sprite [42], and Benevolent Bandit [49] are representative implementations of this type of cycle stealing (see Footnote 2). A disadvantage of the first two types of cycle stealing mechanisms is that they cannot efficiently make use of short idle periods, and this is a motivation for the third mechanism. In the third mechanism, the jobs that have been running on an idle workstation are preempted but stay on the workstation when the idle workstation is reclaimed by its owner. This mechanism for networks of workstations is implemented in Stealth Distributed Scheduler [98] and Linger-Longer [135]. Awerbuch et al. [8] consider the problem of optimal policies in this mechanism. The cycle stealing for balancing load in NOWs that we consider in this section and in Section 2 belongs to the second type. The cycle stealing for increasing utilization under restricted task assignment policies in server farms that we consider in Section 4 is not cycle stealing in the strict

Footnote 2: Other systems that support process migration include Charlotte [7], Accent [171], Mach [114], Emerald [87], Tue [145], and Locus [154].

sense, since it does not allow preemption. Nonpreemptive models are motivated by applications to supercomputing centers, where jobs are run to completion. Cycle stealing is also popular in Internet computing, which allows one to steal idle cycles of personal computers at home. Typically, cycle stealing in Internet computing is of the third type classified above. Examples of Internet computing include SETI@home [72] for searching for extraterrestrial intelligence, Folding@home [70] for protein folding research, Compute Against Cancer [73] for cancer research, FightAIDS@home [75] for AIDS research, and GIMPS [76] for finding large prime numbers. Systems that support Internet computing include Javelin [33]. Other concepts related to load balancing or cycle stealing include dynamic resource allocation, which dynamically assigns resources among multiple applications at data centers [4, 27, 26], and content distribution networks [97] such as Akamai. Adaptive parallelism is also related to cycle stealing; it dynamically reallocates processors to parallel applications depending on the availability of the processors. Representative implementations of adaptive parallelism include Piranha [24] and ATLAS [10]. More generally, the concept of using multiple systems as a single resource is studied in the context of grid computing [9, 32, 52]. System implementations towards facilitating grid computing include AppLeS [13], Legion [57], and MOL [56].

3.2 Summary of results

Our analysis yields many interesting results concerning cycle stealing. Here, we state some of our findings under our assumptions. While cycle stealing obviously benefits the beneficiaries and hurts the donors, we find that when ρB > 1, cycle stealing is profitable overall even under significant switching costs, as it may ensure stability of the beneficiary queue (see Figure 10). For ρB < 1, we define load regions under which cycle stealing pays (see [125]). We find that in general the switching cost is prohibitive only when it is large compared with the mean size of a donor job; under zero switching cost, cycle stealing always pays. A counterintuitive result is that when ρB < 1, the mean response time of the beneficiaries is surprisingly insensitive to the variability of the donor job size distribution (see [125]): even when the variability of the donor job sizes is very high, and donor help thus is very bursty, the beneficiaries still enjoy significant benefits.

[Figure 10: The mean response time for beneficiaries (top) and donors (bottom) as a function of ρB, comparing cycle stealing (CS) with dedicated servers (i.e., no cycle stealing), for E[K] = 0 (left) and E[K] = 1 (right). In all plots, job sizes are exponential with rates µB = 1 and µD = 1; switching costs, K, are exponential with mean 0 or 1 as labeled; donor load is fixed at 0.5.]

4 Cycle stealing for increasing utilization under restricted task assignment policies (Completed work)

A primary goal of this section is to propose a task assignment policy that improves upon known task assignment policies when job sizes have high variability. Our task assignment policy involves a cycle stealing mechanism, in that when a server (host) is idle, jobs that would not be sent to the server if the server were busy are sent to the server, utilizing otherwise idle CPU cycles. This section is organized as follows. We first review prior work on task assignment policies in Section 4.1. Next, we describe our task assignment policies with cycle stealing in Section 4.2 and the results on the effectiveness of these task assignment policies in Section 4.3. Our analysis relies on DR, introduced in Section 2.

4.1 Prior work

Below, we provide an overview of the literature on task assignment policies, limiting our discussion to non-preemptive systems. The most commonly used task assignment policy is Round-Robin. The Round-Robin policy is simple, but it neither maximizes utilization of the hosts nor minimizes mean response time. When the job sizes have an exponential distribution or increasing failure rate, the M/G/k policy has been proven to minimize mean response time and maximize utilization [169]. The M/G/k policy holds all jobs at the dispatcher unit in a single FCFS queue, and only when a host is free does it receive the next job. The M/G/k policy is provably identical to the Least-Work-Remaining policy, which sends each job to the host with the least total remaining work [58]. A related policy is the Shortest-Queue policy, where incoming jobs are immediately dispatched to the host with the fewest jobs [44, 168]. While policies like M/G/k and Shortest-Queue perform well under exponential job size distributions, they perform poorly when the job size distribution has higher variability. In such cases, it has been shown analytically and empirically that the Dedicated policy far outperforms these other policies with respect to minimizing mean response time [59, 139]. In the Dedicated policy, some hosts are designated as the “short hosts” and others as the “long hosts”; short jobs are always sent to the short hosts and long jobs to the long hosts. Even when the job size is not known, a policy very similar to Dedicated, known as the TAGS policy (Task Assignment by Guessing Size), works almost as well when job sizes have high variability [58]. While Dedicated assignment may be preferable to the M/G/k and Shortest-Queue policies for highly variable job sizes, it is clearly not optimal. One problem is that Dedicated leads to situations where the servers are not fully utilized. This observation leads us to propose task assignment policies that have both the variance-reducing benefit of the Dedicated policy and the high-utilization property of M/G/k and Shortest-Queue: namely, task assignment with cycle stealing.

4.2 Task assignment with cycle stealing

We propose two cycle stealing algorithms:

[Figure 11: Cycle stealing mechanisms for increasing utilization under restricted task assignment policies in server farms: (a) the CS-DQ algorithm; (b) the CS-CQ algorithm.]

Cycle stealing with Distributed Queues (CS-DQ): In this algorithm (shown in Figure 11(a)), all jobs are immediately dispatched to a host upon arrival. There is a designated short job host and a designated long job host. An arriving long job is always dispatched to the long job host. An arriving short job first checks to see if the long job host is idle. If the long job host is idle, the short job is dispatched to the long job host. If the long job host is not idle (either working on a long job or a short job), then the arriving short job is dispatched to the short job host. Jobs at a host are serviced in FCFS order.

Cycle stealing with Central Queue (CS-CQ): In this algorithm (shown in Figure 11(b)), all jobs are held in a central queue. Whenever the short job host becomes idle, it picks the first short job in the queue to run. Whenever the long job host becomes idle, it picks the first long job in the queue. However, if there is no long job, the long job host picks the first short job in the queue.
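The two dispatch rules can be summarized compactly in code. The sketch below is our own pseudocode-style illustration (the data structures and names are invented for this example), not an implementation from the thesis.

```python
def csdq_dispatch(job_is_long: bool, long_host_idle: bool) -> str:
    """CS-DQ: choose a host at the moment a job arrives (distributed queues)."""
    if job_is_long:
        return "long_host"                                   # long jobs always go to the long host
    return "long_host" if long_host_idle else "short_host"   # short jobs steal idle cycles


def cscq_next_job(host: str, central_queue: list[tuple[str, int]]):
    """CS-CQ: when `host` becomes idle, pick its next job from the central queue.

    central_queue holds (job_class, job_id) pairs in FCFS order,
    with job_class in {"short", "long"}.
    """
    if host == "short_host":
        wanted = ("short",)           # the short host serves only short jobs
    else:
        wanted = ("long", "short")    # the long host prefers long jobs, else steals a short one
    for cls in wanted:
        for i, (job_class, job_id) in enumerate(central_queue):
            if job_class == cls:
                return central_queue.pop(i)
    return None                        # nothing eligible; the host stays idle
```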

4.3 Summary of results

Our analysis shows that both cycle stealing algorithms (CS-DQ and CS-CQ) can help short jobs tremendously, while penalizing long jobs by only a small percentage. We also find that CS-CQ is a superior strategy to CS-DQ from the perspective of both the beneficiaries and the donors (see Figure 12).

[Figure 12: Results of the analysis in the case where short jobs (mean size 1) and long jobs (mean size 10) are drawn from exponential distributions, with the load of long jobs fixed at ρL = 0.5: (a) gain of short jobs; (b) pain of long jobs.]

5 A threshold-based policy for reducing switching costs (Completed work)

In Section 3, we saw that cycle stealing provides benefit to the beneficiary, with some pain to the donor, when switching time is required for the donor server to switch between the donor jobs and the beneficiary jobs. In this section, we consider a threshold-based policy for reducing the switching time (cost). Our threshold-based policy allows the donor server to switch to the beneficiary queue only when the beneficiary queue has a sufficient number of jobs, and allows the donor server to switch back to the donor queue only when the donor queue has a sufficient number of jobs. Our threshold-based policy thus reduces the time that the donor server spends switching. More formally, we consider the same model as the cycle stealing mechanism in Section 3, except that we now place two thresholds, NB^th and ND^th (see Figure 13). If the donor server is idle and the number of jobs at the beneficiary queue is at least NB^th, the donor transitions into the switching state for a random amount of time, Ksw. After Ksw time, the donor server is available to work on the beneficiary queue, and the beneficiary queue becomes an M/M/2 queue. When the number of donor jobs in queue reaches ND^th (either during Ksw or during the time the donor is helping the beneficiary), the donor transitions into a switching back state for a random amount of time, Kba. After the completion of the switch back, the donor server resumes working on its own jobs until the donor queue is empty.

[Figure 13: A threshold-based policy for reducing switching costs: the donor comes to help if NB ≥ NB^th and ND = 0, and comes back if ND ≥ ND^th.]
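The switching rule itself is simple to state in code. The following minimal sketch (our own illustration; the switching durations Ksw and Kba and the queueing dynamics are not modeled here, only the conditions that trigger each transition) encodes the donor's control logic under the thresholds NB^th and ND^th.

```python
def donor_next_state(state: str, n_beneficiary: int, n_donor: int,
                     nb_th: int, nd_th: int) -> str:
    """One step of the donor's control logic in the threshold-based policy.

    States: "own" (working on, or idling at, its own queue),
            "switching" (paying Ksw), "helping" (serving beneficiary jobs),
            "switching_back" (paying Kba).
    """
    if state == "own" and n_donor == 0 and n_beneficiary >= nb_th:
        return "switching"          # go help: own queue empty, enough beneficiary jobs
    if state in ("switching", "helping") and n_donor >= nd_th:
        return "switching_back"     # enough of its own jobs have accumulated
    if state == "switching_back":
        return "own"                # after Kba, serve own jobs until the queue empties
    return state
```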

A primary goal of this section is to understand the effect of the threshold values, NB^th and ND^th, on performance. In particular, we seek to understand the optimal threshold values that minimize the mean response time.

5.1 Summary of results

Figure 14 shows the mean response time for beneficiary jobs (top row) and the mean response time for donor jobs (bottom row) as a function of ρB for different threshold values, NB^th and ND^th. In the left column we study the effect of changing NB^th from 1 to 10 while holding ND^th fixed at 1. In the right column we study the effect of changing ND^th from 1 to 10 while holding NB^th fixed at 1.

Throughout, XB and XD are exponential with mean 1 and we fix ρD = 0.5. As NB^th is increased from 1 to 10, Figure 14 shows only slightly higher response times for the beneficiary jobs. Increasing NB^th does not change the beneficiary stability region, although the beneficiary queue is helped less frequently. We also see that increasing NB^th creates less penalty for the donor, as the donor does not have to visit the beneficiary queue as frequently. Observe that the donor mean response time is always bounded above by the mean response time of a corresponding M/GI/1 queue with setup cost Kba, and this bound is tight for all NB^th values as ρB reaches its maximum, since in that case the beneficiary queue always exceeds NB^th. We conclude that NB^th has a somewhat small impact; however, higher values of NB^th are more desirable for the system as a whole under higher switching costs.

[Figure 14: The mean response time for beneficiaries (top row) and donors (bottom row) as a function of ρB, for (i) NB^th = ND^th = 1, (ii) NB^th = 10 and ND^th = 1, and (iii) NB^th = 1 and ND^th = 10. In all plots XB, XD, and the switching costs are exponential with mean 1, and ρD = 0.5.]

By contrast, increasing ND^th from 1 to 10 has dramatic effects. In general, increasing ND^th drastically improves the beneficiary response time. This result is not obvious, since increasing ND^th means both that the donor spends more time at the beneficiary queue before leaving and that, when the donor leaves the beneficiary queue, the donor will be absent for a longer time (since more time is needed to empty the donor queue). Another positive effect of increasing ND^th is less switching overall. In the end, it is the enlargement of the stability region due to higher ND^th that substantially improves the beneficiary response time when the switching costs are large and the beneficiary load is high. When switching costs are very low, increasing ND^th very slightly worsens the mean response time for beneficiary jobs, since beneficiaries experience longer intervals between help. In all cases evaluated, increasing ND^th results in much higher mean response times for donor jobs, since, for ND^th > 1, a donor job arriving at an empty queue must wait for another ND^th − 1 jobs to arrive before being served. We conclude that increasing ND^th has a large impact, positive for the beneficiaries but negative for the donors. Thus setting ND^th is much trickier than setting NB^th.

6 A threshold-based policy for prioritizing small jobs (Proposed work)

As a motivating example, we once again consider cycle stealing with a central queue (CS-CQ), which we introduced in Section 4. It turns out that we can further improve the overall mean response time by allowing the long job host to process short jobs even in the presence of long jobs in the queue, giving more priority to short jobs. However, giving too much preference to short jobs may lead to instability of long jobs. A recent approach to optimizing the overall mean response time is therefore to place a threshold, TS, on the queue of short jobs [12, 63, 147, 164] such that

• The short job host processes only short jobs.

• The long job host processes short jobs if either

  – the number of short jobs is at least TS, or

  – there are no long jobs and the number of short jobs is at least two (i.e., idle cycle stealing).

  Otherwise, the long job host processes long jobs.

We refer to this threshold-based policy as the T1 policy.
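To make the T1 rule concrete, the sketch below (ours, not part of the proposal; the function and variable names are illustrative) returns the class that each host serves at a decision point, given the current queue lengths and the threshold TS.

    def t1_assign(host, n_short, n_long, T_S):
        # Sketch of the T1 dispatch rule; names are illustrative, not from the proposal.
        if host == 'short':
            # The short job host only ever works on short jobs.
            return 'short' if n_short > 0 else 'idle'
        # Decision for the long job host:
        if n_short >= T_S:
            return 'short'                       # short queue has reached the threshold
        if n_long == 0 and n_short >= 2:
            return 'short'                       # idle cycle stealing
        return 'long' if n_long > 0 else 'idle'  # otherwise serve long jobs (or stay idle)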

6.1 Proposed work

In this section, I propose to analyze the performance under the T1 policy in the more general context of servers with affinities (heterogeneous distributed computing systems), where (i) the jobs originating at different servers may have different mean sizes (processing requirements), (ii) the servers may have different speeds, (iii) jobs may have different affinities with different servers, i.e., a job may be processed more efficiently (have shorter duration) if run on one server than on another, and (iv) different jobs may have different importance (weights). We will analyze the weighted average mean response time (weighted response time), Σi ci pi E[Ri], where ci is the weight (importance) of jobs originating at server i, λi is the average rate of jobs originating at server i (type i jobs), pi = λi / Σj λj is the fraction of type i jobs, and E[Ri] is the mean response time of type i jobs. Our analysis will make use of DR, which we introduce in Section 2.

Figure 15: A two server model.

Figure 15 shows the model that we will consider for the case of two servers. Jobs arrive at queue 1 and queue 2 with average rates λ1 and λ2, respectively. Server 1 processes jobs in queue 1 with average rate µ1 (jobs/sec), while server 2 can process jobs in queue 1 with average rate µ12 (jobs/sec) and can process jobs in queue 2 with average rate µ2 (jobs/sec). We define ρ1 = λ1/µ1, ρ2 = λ2/µ2, and ρˆ1 = λ1/(µ1 + µ12(1 − ρ2)). Here, ρˆ1 is the load of type 1 jobs assuming server 2 helps server 1 as much as possible while processing all the type 2 jobs. Note that ρ2 < 1 and ρˆ1 < 1 are necessary for the queues to be stable under any allocation policy.
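As a small numerical illustration of these definitions (the parameter values below are arbitrary choices of ours, not values from the proposal):

    # Illustrative parameters only.
    lambda1, lambda2 = 0.8, 0.4      # arrival rates at queues 1 and 2
    mu1, mu12, mu2 = 1.0, 0.5, 1.0   # server 1 on queue 1; server 2 on queues 1 and 2

    rho1 = lambda1 / mu1
    rho2 = lambda2 / mu2
    rho1_hat = lambda1 / (mu1 + mu12 * (1.0 - rho2))   # load of type 1 jobs with maximal help

    print(rho1, rho2, rho1_hat)        # 0.8 0.4 0.615...
    print(rho2 < 1 and rho1_hat < 1)   # necessary condition for stability under any policy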

6.2 Prior work

Even in the simple model of just two servers shown in Figure 15, the optimal allocation policy is not known in general, despite the fact that this problem has been investigated in numerous papers [12, 63, 64, 164, 147, 143]. Below, we review prior work on the T1 policy and other allocation policies for servers with affinities. One common allocation policy is based on the cµ rule [37], where a server processes jobs from the nonempty queue with the highest cµ value, biasing in favor of jobs with high c (high importance) and high µ (small expected size3). Under the cµ rule, server 2 in Figure 15 serves jobs from queue 1 (rather than queue 2) if c1µ12 > c2µ2, or queue 2 is empty. The cµ rule is

3Note that the average size of a job is 1/µ, where µ is the average service rate.

provably optimal in the fluid limit [148]. However, Squillante et al. [147] as well as Harrison [63] have shown that the cµ rule may lead to instability even if ρˆ1 < 1 and ρ2 < 1. For example, the cµ rule may force server 2 to process jobs from queue 1 even when many jobs are built up at queue 2, leading to instability in queue 2 and under-utilization of server 1. More recently, the generalized cµ rule, which is based on greedily minimizing the delay functions at any moment, has been applied to related models [112, 110]. However, in our model, the generalized cµ rule reduces to the cµ rule and hence has the same stability issues. Squillante et al. [147] and Williams [164] propose the T1 policy that, under the right choice of threshold value, improves upon the cµ rule and guarantees stability whenever ρˆ1 < 1 and

ρ2 < 1. The motivation behind placing the threshold on queue 1 is that it “reserves” a certain amount of work for server 1, preventing server 1 from being under-utilized and server 2 from being overloaded. Bell and Williams prove the optimality of the T1 policy, for a model closely related to ours, in the heavy traffic limit, where ρˆ1 and ρ2 are close to 1 from below [12]. Williams conjectures that the T1 policy is optimal for more general models in the heavy traffic limit [164]. In this section, we will provide the first accurate and efficient analysis of the weighted mean response time under the T1 policy. Squillante et al. evaluate the weighted mean response time under the T1 policy by simulation and a coarse approximation [147]. Williams [164] and Bell and Williams [12] study the T1 policy only in the heavy traffic limit.

7 Impact of variable arrival processes under single server systems (Proposed work)

In this section, I plan to study the impact of variability and correlation of interarrival times on performance under single server systems (G/G/1 queues). Here, we consider the correlation as well as the variability of interarrival times. This is because correlation is prevalent in arrival processes at computer systems such as web servers [79, 78] and supercomputing centers [149], and correlation in the arrival process can have a big impact on the performance of computer systems [149]. We consider both FCFS and other priority scheduling policies. In Sections 7.1-7.2, we study the impact of interarrival time variability and correlation under FCFS. In Section 7.3, we study the impact of interarrival time variability under priority scheduling policies. Here, we investigate

the effect of scheduling in reducing the impact of interarrival time variability. We focus only on the variability of interarrival times; the effect of scheduling in reducing the impact of the correlation of interarrival times is deferred as future work. In Section 7.1, we review the prior literature on the impact of interarrival time variability under FCFS scheduling. As a motivating example, consider a GI/GI/1/FCFS queue with a batch Poisson arrival process, where the batch size has a geometric distribution. The mean delay, E[W], in such a queue is expressed as follows:

E[W] = (ρ/(1−ρ)) · ((CS^2 + 1)/2) · E[S] + (1/(1−ρ)) · ((CA^2 − 1)/2) · E[S],        (1)

where CA and CS are the coefficients of variability of the interarrival time and job size distributions, respectively, ρ is the system load, and E[S] is the mean job size. The formula suggests the impact of the interarrival time variability; that is, the mean delay is a linear function of CA^2. Sections 7.2-7.3 constitute the proposed work. In Section 7.2, we will study the impact of interarrival time correlation under FCFS scheduling. In Section 7.3, we will study the quantitative effectiveness of scheduling in reducing the impact of interarrival time variability.
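A quick numerical illustration of equation (1) (with parameter values chosen arbitrarily by us) shows the linear growth in CA^2, with slope E[S]/(2(1−ρ)):

    def mean_delay_eq1(rho, ES, CA2, CS2):
        # Equation (1): mean delay with a batch Poisson arrival process (geometric batch sizes).
        return rho / (1 - rho) * (CS2 + 1) / 2 * ES + 1 / (1 - rho) * (CA2 - 1) / 2 * ES

    ES, rho, CS2 = 1.0, 0.8, 1.0
    for CA2 in (1, 4, 16):
        print(CA2, mean_delay_eq1(rho, ES, CA2, CS2))
    # E[W] grows by E[S]/(2(1-rho)) = 2.5 for each unit increase in CA^2.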

7.1 Impact of interarrival time variability under FCFS

A common characteristic of long-range dependence, self-similarity, time-of-day effects, and flash crowds in arrival processes is the variability in the interarrival times. Below, we elaborate on the impact of interarrival time variability on system performance under FCFS and on the existence of interarrival time variability, reviewing the literature.

7.1.1 Impact of interarrival time variability on system performance

Equation (1) shows the impact of interarrival time variability in a particular arrival process, but approximations and bounds for the mean delay in a general GI/GI/1/FCFS queue also suggest the impact of interarrival time variability. Table 1 shows the known approximations and bounds for the mean delay in a GI/GI/1/FCFS queue based on the first two moments of the job size and interarrival time distributions. In particular, the approximation formula, which is exact in the heavy traffic limit, suggests that the mean delay increases linearly with CA^2. Observe also that all the

Lower bound                                        Condition                 Citation
(ρ^2 CS^2 + ρ(ρ−2)) / (2λ(1−ρ))                                              [95] p. 43
(ρ^2 CS^2 + ρ(ρ−1)) / (2λ(1−ρ))                                              [117] p. 226
(CA^2 + ρ^2 CS^2) / (2λ(1−ρ)) − (1+ρ)/(2λ)         1/λ-MRLA/G/1              [95] p. 42
(CA^2 + ρ^2 CS^2) / (2λ(1−ρ)) − (CA^2+ρ)/(2λ)      IFR/G/1                   [95] p. 42, [117] p. 226

Approximation                                      Exact when                Citation
(CA^2 + ρ^2 CS^2) / (2λ(1−ρ))                      heavy traffic, M/G/1      [95] p. 29, [169] p. 511

Upper bound                                        Condition                 Citation
(CA^2 + ρ^2 CS^2) / (2λ(1−ρ))                      Kingman's                 [169] p. 476
((1−(1−ρ)^2) CA^2 + ρ^2 CS^2) / (2λ(1−ρ))          Daley's                   [169] p. 477

Table 1: Known bounds and approximations for the mean delay in GI/GI/1/FCFS queues based on the first two moments of the service time distribution and the interarrival time distribution. Here, γ-MRLA is defined to be the set of distributions with mean residual life bounded above by γ. Note that γ-MRLA includes distributions with increasing failure rate.

approximations and bounds except the unconditional lower bounds (first two rows)^4 have a linear term in CA^2. Finally, Iyengar et al. study the impact of interarrival time variability (without correlation) on the mean delay in web servers by simulation and show that it can have a significant impact on the mean delay [77].
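For concreteness, the heavy-traffic approximation and the two upper bounds in Table 1 can be evaluated directly from (λ, ρ, CA^2, CS^2); the sketch below is our own illustration with arbitrary parameter values.

    def kingman_ub(lam, rho, CA2, CS2):
        # Kingman's upper bound; the same expression is the heavy-traffic approximation in Table 1.
        return (CA2 + rho**2 * CS2) / (2 * lam * (1 - rho))

    def daley_ub(lam, rho, CA2, CS2):
        # Daley's refinement of Kingman's bound.
        return ((1 - (1 - rho)**2) * CA2 + rho**2 * CS2) / (2 * lam * (1 - rho))

    lam, rho, CA2, CS2 = 1.0, 0.9, 4.0, 1.0   # illustrative values only
    print(kingman_ub(lam, rho, CA2, CS2))     # 24.05
    print(daley_ub(lam, rho, CA2, CS2))       # 23.85, always <= Kingman's bound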

7.1.2 Existence of arrival variability

In this section, we review prior work on measurement of arrival processes at computer systems. In particular, we review job request patterns at supercomputers, traffic patterns at the Internet,

4 In fact, it is impossible to obtain a lower bound on the mean delay in a GI/GI/1/FCFS queue that depends on the second moment of the interarrival time [161]. Consider a sequence of random variables, Ab:

    Ab = (1/λ)(1 − CA^2/(b−1))   with probability (b−1)^2 / (CA^2 + (b−1)^2),
         b/λ                     with probability CA^2 / (CA^2 + (b−1)^2).

These random variables have mean 1/λ and coefficient of variability CA. It is known that these random variables give a (tight) lower bound on the mean delay for a GI/M/1 queue where the interarrival time has mean 1/λ and coefficient of variability CA and its support is [0, bE[T]] [161]. As b → ∞, Ab converges to 1/λ with probability 1, which gives a (tight) lower bound on the mean delay for a GI/M/1 queue where the interarrival time has mean 1/λ and coefficient of variability CA and its support is [0, ∞).

and HTTP request patterns at web servers. We see that the inter-request time at supercomputers has high variability (typically, 2 < CA^2 < 10). We also see that there are papers that suggest high variability in the interarrival time of packets/sessions in the Internet and of HTTP requests at web servers. The characteristics of the workload at supercomputing centers have been extensively studied in the last ten years. Most of the workload traces studied are available at the Parallel Workload Archive [74]^5. We analyze the interarrival time variability in these traces and summarize it in Table 2. Parts of these traces and related traces are also studied by others, and we briefly summarize these studies below. Feitelson et al. study a subset of the workload trace at the NASA Ames iPSC/860 and report that CA^2 is 3.56 during days, 2.11 during nights, and 2.83 during weekends [48]. Windisch et al. study a subset of the workload trace at the SDSC Paragon and report that CA^2 is 3.00 during days, 5.83 during nights, 3.13 during weekends, and 5.33 overall [167]. Hotovy et al. study a different workload trace at the CTC SP2 and report that CA^2 is 1.76 during days, 2.49 during nights, and 2.55 during weekends [69]. Hotovy also observes that the batch size at the CTC SP2 is well represented by a Zipf distribution with coefficient −1.85 ∼ −2.1 depending on the definition of the batch [68]. Feitelson models the arrival process as a batch Poisson arrival process based on the study of six workload traces, where the batch size is represented by a generalized Zipf distribution with harmonic order ∼2.2 [47]. The work by Hotovy and Feitelson suggests that users tend to submit sequences of jobs. A majority of the Internet traffic is carried by the hypertext transfer protocol (HTTP) for the world wide web (WWW), and characteristics of the WWW, such as embedded objects and the flash crowd effect, make the Internet traffic bursty. Before the era of the WWW, the arrival processes of packets or sessions (TCP connections) in the Internet were well approximated by Poisson processes, and hence the interarrival time distribution was well approximated by the exponential distribution (see for example [142]). Paxson and Floyd, however, report that the WWW traffic

5The workload log from the NASA Ames iPSC/860 was provided by Bill Nitzberg [48]. The workload log from the SDSC Paragon was provided by Reagan Moore and Allen Downey [167]. The workload log from the CTC SP2 was provided by the Cornell Theory Center, a high-performance computing center at Cornell University. The workload log from the KTH SP2 was provided by Lars Malinowsky. The workload log from the LANL CM-5 was graciously provided by Curt Canada. The workload log from the LLNL Cray T3D was provided by Moe Jette. The workload log from the SDSC SP2 was provided by Victor Hazelwood. The workload log from the LANL Nirvana cluster was provided by Fabrizio Petrini. The workload log from the SDSC Blue Horizon was provided by Travis Earheart and Nancy Wilkins-Diehr.

year      system                 mean (sec)   CA^2
1993      NASA Ames iPSC/860     3728.32      2.40
1995      SDSC Paragon           1206.55      5.41
1996      SDSC Paragon           1257.55      4.40
1996-97   CTC SP2                 357.35      9.12
1996-97   KTH SP2                 772.65      6.93
1994-96   LANL CM-5               487.99      5.70
1996      LLNL T3D                431.72      9.73
1998-00   SDSC SP2                923.82      7.08
1999-00   LANL Origin 2000         93.31      8.20
2000-02   SDSC Blue Horizon       467.17      6.80

Table 2: Interarrival times measured at various parallel systems. Interarrival times greater than six hours are ignored, since these probably correspond to system down times. Negative interarrival times (due to measurement error) are also ignored.

has a very different nature; in particular, they observe that the WWW traffic is bursty and self-similar [128]. Since then, a number of researchers have measured the Internet traffic, in particular the process of the number of packets per unit interval at different time scales, and observed burstiness and self-similarity (see for example [38, 102, 165]). Some papers also report the marginal distribution and/or variance of the interarrival times. In particular, Feldmann observed that the squared coefficient of variability of the interconnection time is 1 < CA^2 < 4.5 [50]. Paxson and Floyd report that the interarrival time distribution has a long tail and is well approximated by a Pareto distribution, but Downey reexamines this and reports that the interarrival time distribution has a lighter tail and is well approximated by a lognormal or a Weibull distribution [43]. Teams at Bell Laboratories also observe that the interarrival time distribution is well approximated by a Weibull distribution [23, 34]. Deng also reports that the interarrival time distribution deviates from the exponential distribution [41]. The Internet Traffic Archive [71] has a collection of traces on LANs, WANs, and web clients, and the arrival process can be further investigated. The request process, that is, the number of requests per unit time interval, at web servers has also been shown to exhibit burstiness and self-similarity (see for example [79]), but the marginal distribution and/or variance of the interarrival times are not usually reported. This is partly due to the fact that web servers such as Apache leave access logs measured in a coarse granularity

of seconds. The only exception is that Mogul measures the access pattern at the DEC 1994 California election server at millisecond resolution; unfortunately, CA^2 is not clear from this paper [116]. Teams at IBM report that 0 < CA^2 < 1 [78, 79] or that 1 < CA^2 < 16 [77]. There is also an interesting observation that the distribution of the inter-reference times for each file is well approximated by the exponential distribution [6]. Again, the Internet Traffic Archive [71] has a collection of traces on web servers, including the 1998 World Cup web site, and the arrival process can be further investigated (but only at a granularity of seconds). An interesting open problem is to measure the access pattern at (busy) web servers at a granularity of milliseconds or finer and to examine whether the self-similarity property holds at millisecond-to-second scales. This problem is motivated by the work by Xia et al. [170], where they analyze the performance of web servers under the assumption of self-similarity at scales finer than seconds. (This problem will not be addressed in the thesis.)
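For reference, the CA^2 values reported in Table 2 above can be reproduced from a trace of arrival timestamps as follows (a sketch of ours that applies the same filtering rules stated in the table caption):

    def interarrival_scv(arrival_times, max_gap=6 * 3600):
        # Squared coefficient of variability of interarrival times (in seconds), ignoring
        # gaps longer than six hours and negative gaps, as in Table 2.
        gaps = [t1 - t0 for t0, t1 in zip(arrival_times, arrival_times[1:])]
        gaps = [g for g in gaps if 0 <= g <= max_gap]
        mean = sum(gaps) / len(gaps)
        var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
        return var / mean ** 2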

7.2 The impact of interarrival time correlation under FCFS (Proposed work)

In this section, I propose to study the impact of interarrival time correlation under FCFS scheduling. Here we will derive simple (closed form) lower and upper bounds on the mean response time in a G/G/1 queue when the arrival process is a particular arrival process (such as a Markov modulated Poisson process, MMPP) that has correlation in interarrival times. We will investigate how the correlation affects the mean response time through the study of the closed form. We will also compare the bounds with exact analysis via the matrix analytic method.
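As one concrete example of an arrival process with correlated interarrival times, the sketch below (ours; the rates are arbitrary) simulates a two-state MMPP and estimates the lag-1 correlation of the interarrival times it produces.

    import random

    def mmpp2_interarrivals(n, lam=(0.5, 5.0), switch=(0.1, 0.1), seed=1):
        # Two-state MMPP: arrivals occur at Poisson rate lam[s] while the modulating
        # chain is in state s; the chain switches state at rate switch[s].
        random.seed(seed)
        s, gaps, since_last = 0, [], 0.0
        while len(gaps) < n:
            total = lam[s] + switch[s]
            since_last += random.expovariate(total)
            if random.random() < lam[s] / total:   # the event is an arrival
                gaps.append(since_last)
                since_last = 0.0
            else:                                  # the event is a state switch
                s = 1 - s
        return gaps

    def lag1_correlation(x):
        n, m = len(x), sum(x) / len(x)
        var = sum((v - m) ** 2 for v in x) / n
        cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1)) / (n - 1)
        return cov / var

    print(lag1_correlation(mmpp2_interarrivals(200000)))   # positive: consecutive gaps are correlated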

7.3 The impact of interarrival time variability under priority scheduling policies (Proposed work)

In this section, I propose to study the quantitative effectiveness of scheduling in reducing the impact of interarrival time variability. Recall equation (1), the mean delay in a GI/GI/1/FCFS queue with a batch Poisson arrival process. When the service time, S, is deterministic, FCFS is equivalent to SRPT, which is optimal with respect to minimizing the mean delay, and the mean

36 delay is given by

E[W] = (ρ/(1−ρ)) · (E[S]/2) + (1/(1−ρ)) · ((CA^2 − 1)/2) · E[S].

This suggests that no scheduling policy can remove the effect of interarrival time variability when the job size is deterministic. It is, however, not clear how scheduling can help in reducing the impact of interarrival time variability when the job size has variability. An approach that we will take is to approximate the arrival process by a simpler, more tractable process such as a batch Poisson process. We will first derive a simple closed form solution for the mean response time in a GI/GI/1 queue under two-class priority scheduling. (In the case of batch Poisson arrival processes, closed form solutions are derived in [152].) We will then derive (upper and lower bounds on) the mean response time under other scheduling policies such as SRPT. From the closed form solution (approximation) and the bounds, we will quantify how much scheduling can reduce the impact of the interarrival time variability. Other possible approaches include studying heavy traffic approximations [162].

8 Impact of variable arrival processes under multiserver systems (Proposed work)

In this section, I plan to study the impact of interarrival time variability on performance under multiserver systems (G/G/c queues). We will limit our study to FCFS. The impact of interarrival time variability in multiserver systems under other priority scheduling policies is left as future work. We study the impact of interarrival time variability in Section 8.1, and will study the impact of correlation in Section 8.2. In Section 8.1, we review the prior literature on the impact of interarrival time variability in multiserver systems. In Section 8.2, I propose to study the impact of interarrival time correlation in multiserver systems.

Lower bound                                        Condition                                    Citation
E[D(1)] − (c−1)E[X](1−CS^2)/(2c)                   E[D(1)]: mean delay in GI/GI/1               [169] p. 497

Approximation                                      Exact when                                   Citation
(CA^2 + ρ^2 CS^2/c) / (2λ(1−ρ))                    heavy traffic                                [95] p. 47

Upper bound                                        Condition                                    Citation
(CA^2 + ρ^2 CS^2/c) / (2λ(1−ρ))                    ρ ≥ 1/c                                      [169] p. 498
(CA^2 + ρ^2 CS^2/c) / (2λ(1−ρ))                                                                 [169] p. 502
((1−(1−ρ)^2) CA^2 + ρ^2 CS^2/c) / (2λ(1−ρ))                                                     [39] p. 52; [169] p. 502
ρ(ρ^2 CS^2/c + 2(1−ρ)^2 CA^2) / ((1−2ρ)λ)          c even, ρ < 1/2 (for odd c, may use c−1)     [137]

Table 3: Known bounds and approximations for the mean delay in GI/GI/c/FCFS queues based on the first two moments of the service time distribution and the interarrival time distribution.

8.1 The impact of interarrival time variability in multiserver systems

Table 3 shows known approximations and bounds for the mean delay in a GI/GI/c/FCFS queue based on the first two moments of the job size and interarrival time distributions. In particular, the approximation formula, which is exact in the heavy traffic limit, suggests that the impact of job size variability, CS^2, on the mean delay is reduced by having multiple servers^6, while the impact of interarrival time variability, CA^2, is not. Observe also that all the approximations and bounds are decreasing functions of the number of servers c, but the terms including CA^2 stay constant. They suggest that having many servers reduces the impact of job size variability but may not reduce the impact of interarrival time variability.
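As a rough illustration of this effect (our own sketch, using the heavy-traffic approximation as transcribed in Table 3 and arbitrary parameter values):

    def ht_approx_gi_gi_c(lam, rho, CA2, CS2, c):
        # Heavy-traffic approximation for the mean delay in a GI/GI/c/FCFS queue (Table 3 form).
        return (CA2 + rho**2 * CS2 / c) / (2 * lam * (1 - rho))

    lam, rho, CA2, CS2 = 1.0, 0.9, 4.0, 9.0   # illustrative values only
    for c in (1, 2, 4, 16):
        print(c, ht_approx_gi_gi_c(lam, rho, CA2, CS2, c))
    # The CS^2 term is divided by c and shrinks; the CA^2 term does not change with c.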

8.2 The impact of interarrival time correlation in multiserver systems (Proposed work)

In this section, we will study the impact of interarrival time correlation in multiserver systems. Here we will analyze the mean response time in a G/G/c queue when the arrival process is a particular arrival process (such as a Markov modulated Poisson process, MMPP) that has correlation in interarrival times. Our analysis is numerical but exact, and it makes use of the matrix analytic method.

6 In [127], we show that having multiple servers can reduce the impact of job size variability on delay. This result is not limited to the heavy traffic limit.

Figure 16: Sensitivity of the overall mean response time under the T1 policy to changes in load (shown for TS = 2 and TS = 20).

9 A threshold-based policy for adapting to fluctuating load (Proposed work)

In this section, I propose to investigate threshold-based policies that are robust against fluctuations in load caused by the variability and the correlation in arrival processes. An advantage of the T1 policy that we will investigate in Section 6 is that it provides good performance, as is suggested by the heavy traffic limit. A disadvantage of the T1 policy is that the optimal threshold value depends on the loads, ρS and ρL; hence a threshold value that minimizes the overall mean response time at a certain load can yield an overall mean response time far from the optimal at a slightly different load. Figure 16 shows the overall mean response time under the T1 policy with two different threshold values. A lower TS provides good performance at a lower load (solid line), while a higher TS provides good performance at a higher load (dashed line). This suggests that when the load fluctuates over time, the T1 policy with a fixed threshold value provides good performance only when the load is near a certain estimated value; the T1 policy performs poorly the rest of the time. In this section, we will evaluate and compare various allocation policies, including the T1 policy and a new allocation policy that we propose. The objective of the new allocation policy is to provide low mean response time for the model in Figure 15 under fluctuating loads. We plan to evaluate various allocation policies via DR, which we introduce in Section 2.

A possible approach to designing a new allocation policy is the use of control theory. Control theory has been applied to web servers [109, 106, 141, 159, 157] and TCP congestion control [65, 108, 107] to deal with fluctuating loads. A natural allocation policy that control theory suggests to be robust against fluctuating load operates as follows:

• Server 1 processes only type 1 jobs.

• Server 2 processes type 1 jobs if either

  – c1µ12N1 > c2µ2N2, or

  – there are no type 2 jobs and the number of type 1 jobs is at least two (i.e., idle cycle stealing).

  Otherwise, server 2 processes type 2 jobs.

Here, Ni is the number of type i jobs in the system for i = 1, 2. We refer to this allocation policy as the robust policy.
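A sketch of the decision made by server 2 under this rule (ours; the function and variable names are illustrative only):

    def robust_assign_server2(n1, n2, c1, c2, mu12, mu2):
        # Decision for server 2 under the robust policy (sketch).
        # n1, n2 are the current numbers of type 1 and type 2 jobs in the system.
        if c1 * mu12 * n1 > c2 * mu2 * n2:
            return 'type 1'                      # weighted queue-length comparison
        if n2 == 0 and n1 >= 2:
            return 'type 1'                      # idle cycle stealing
        return 'type 2' if n2 > 0 else 'idle'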

The robust policy has the characteristic of minimizing a quadratic objective function, Σi ci pi E[Ri]^2, where ci is the weight (importance) of jobs originating at server i, λi is the average rate of jobs originating at server i (type i jobs), pi = λi / Σj λj is the fraction of type i jobs, and E[Ri] is the mean response time of type i jobs. The generalized cµ rule suggests that the robust policy minimizes this quadratic objective function in related models [112, 110]. However, it is not well understood how the robust policy performs with respect to minimizing the weighted mean response time, Σi ci pi E[Ri]. Another possible approach to designing a new allocation policy is the use of multiple threshold values, so that an appropriate threshold value is chosen depending on the system load. We refer to this policy as the adaptive dual-threshold (ADT) policy. If the ADT policy is not robust, we will look for other policies based on the problems we identify. Control theory allows us to design a “robust” allocation policy without a priori knowledge of the system load, but such a policy does not necessarily provide optimal performance. This is in contrast to threshold-based policies, which provide good performance at an estimated load. We will make use of a priori knowledge

40 on the characteristics of loads, such as mean load, and use ideas from control theory to provide robustness in our policy.

10 Extensions to more servers and more classes (Partially completed and proposed work)

Analysis via dimensionality reduction (DR) in Sections 2-5 is limited to two servers (hosts) and two classes of jobs. In this section, we propose recursive dimensionality reduction (RDR), which reduces an nD-infinite Markov chain to a 1D-infinite Markov chain, allowing us to analyze systems with more than two servers and more than two classes of jobs. Here, we will summarize possible extensions to the analysis in previous sections. We illustrate RDR by applying it to analyze the mean response time under CS-DQ, which we considered in Section 4, for the case of three hosts (see Figure 17). There is a designated short job host, a designated medium job host, and a designated long job host. An arriving long job is always dispatched to the long job host. An arriving medium job first checks to see if the long job host is idle. If so, the medium job is dispatched to the long job host. If the long job host is not idle, then the arriving medium job is dispatched to the medium job host. An arriving short job first checks to see if the long job host is idle. If so, the short job is dispatched to the long job host. If the long job host is not idle, then the arriving short job checks to see if the medium job host is idle. If so, the short job is dispatched to the medium job host. If the medium job host is not idle, then the arriving short job is dispatched to the short job host. Jobs at a host are serviced in FCFS order. Observe that the assignment of short jobs depends on whether the long job host is busy or idle and whether the medium job host is busy or idle. Hence, the Markov chain that exactly captures the behavior of short jobs needs to have a state space (number of short jobs, number of medium jobs, number of long jobs) that grows infinitely in three dimensions. However, the only information that we need about the long (respectively, medium) job host is whether the long (respectively, medium) job host is busy or idle. Thus, RDR first derives the busy period of the long job host. The busy period of the long job host is then used in the analysis of medium jobs. At the same time, the busy period of the medium job host is derived. Finally, the busy periods of the long job host and the medium job host are used in the analysis of short jobs.

Figure 17: Cycle stealing for increasing utilization under restricted task assignment policies in server farms with distributed queues (CS-DQ).

Figure 18: Analysis of CS-ID for three hosts via RDR: analysis of longs.

Figure 18 shows a Markov chain that tracks exactly the number of long jobs, medium jobs, and short jobs in the long job queue and long job host. In the top row, there are no medium jobs and no short jobs. Here, the long job host processes long jobs if there are any long jobs. In the middle row, there is one short job and no medium jobs. Here, the long job host processes the short job. In the bottom row, there is one medium job and no short jobs. Here, the long job host processes the medium job. Observe that the busy period of the long job host starts when the Markov chain leaves state (0,0S,0M) and ends when the Markov chain comes back to state (0,0S,0M). The length of a busy period is easily analyzed by conditioning on which job (long, medium, or short) starts the busy period. Also, the mean response time of long jobs is easily analyzed via a known formula for an M/M/1 queue with setup time [152].
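One way to carry out this conditioning, under the simplifying assumption that all job sizes are exponential (this is a sketch of ours, not the analysis in the proposal), uses the facts that (i) when the long job host is idle, the next arrival of any type is dispatched to it, and (ii) once the long job host is busy, only long jobs join its queue, so a busy period started by work of mean size x has mean x/(1 − ρL) with ρL = λL/µL:

    def mean_busy_period_long_host(lam_L, lam_M, lam_S, mu_L, mu_M, mu_S):
        # Mean busy period of the long job host, conditioning on which job type starts it.
        # mu_L, mu_M, mu_S: rates at which the long job host processes long/medium/short jobs.
        rho_L = lam_L / mu_L
        lam_total = lam_L + lam_M + lam_S
        first_job = {'long': (lam_L, mu_L), 'medium': (lam_M, mu_M), 'short': (lam_S, mu_S)}
        return sum((lam_J / lam_total) * (1.0 / mu_J) / (1.0 - rho_L)
                   for lam_J, mu_J in first_job.values())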

The mean response time of medium jobs can be analyzed via DR. Figure 19(a) shows a 1D-infinite Markov chain that we obtain via DR. The 1D-infinite Markov chain tracks exactly the number of medium jobs and the number of short jobs in the medium job queue and medium job host, but the only information that the chain tracks about the long jobs is whether there is at least one long job or not, i.e., whether the long job host is busy or idle. In the top two rows, there are no short jobs. Here, the medium job host processes medium jobs if there are any medium jobs. In the bottom two rows, there is one short job. Here, the medium job host processes the short job. In the first and third rows, the long job host is busy, and there are transitions, shown as thick arrows, from these rows to the second and fourth rows, where the long job host is idle. The transitions shown as thick arrows correspond to the busy period of the long job host. In practice, the thick arrows should be replaced by PH distributions to better approximate the distribution of the length of the busy period (see Section 2). For ease of explanation, however, below we assume that the busy period, BL, is approximated by an exponential distribution, i.e., the thick arrow is a single transition. Since the Markov chain tracks exactly the number of medium jobs, the mean response time of medium jobs can be analyzed via the matrix analytic method. The length of a busy period of the medium job host needs to be analyzed in order to analyze the mean response time of short jobs. Observe that the busy period of the medium job host starts when the Markov chain leaves state (0,0S,Lbusy) and ends when the Markov chain comes back to either state (0,0S,Lbusy) or state (0,0S,Lidle) (see Figure 19(b)). Thus, there are two different types of busy periods^7, BM1 and BM2, depending on the state where the busy period ends. The lengths of the two types of busy periods are analyzed via Neuts' algorithm [120]. The mean response time of short jobs can now be analyzed using the busy period of the long job host and the two types of busy periods of the medium job host. Figure 20 shows a 1D-infinite Markov chain that we use to analyze the mean response time of short jobs. The 1D-infinite Markov chain tracks exactly the number of short jobs, but the only information that the chain tracks about the medium and long jobs is whether the long job host is busy or idle and whether the medium job host is busy or idle. In the first row, the long job host and the medium job host are both busy. In the second row, the long job host is idle and the medium job host is busy. In the third row, the long job host is busy and the medium job host is idle. In the fourth row, the long job host and the medium job host are both idle.

7 When BL is represented by an n-phase PH distribution, the number of different busy periods becomes n(n + 1).

Figure 19: Analysis of CS-ID for three hosts via RDR: analysis of mediums. (a) 1D Markov chain for the medium job host. (b) Four busy periods of the medium job host.

Figure 20: Analysis of CS-ID for three hosts via RDR: analysis of shorts.

Transitions shown as thick arrows from the first and third rows to the second and fourth rows correspond to the busy period of the long job host. Transitions shown as thick arrows from the top row to the bottom two rows correspond to the busy periods of the medium job host. Observe that depending on where the busy period ends, the length of the busy period differs and is labeled BM1 or BM2. Since the Markov chain tracks exactly the number of short jobs, the mean response time of short jobs can be analyzed via the matrix analytic method.

11 Characterizing the phase type distribution (Partially completed and proposed work)

In Section 12, we propose a moment matching algorithm that finds a PH distribution, P , that well-represents a given distribution, G, in the sense that the first three moments of P and G agree. We use the approximate PH distribution, P , to represent busy period durations and job sizes in DR and RDR. For computational efficiency, it is desirable that P has as few phases as possible. In this section, we characterize the necessary number of phases in P to well-represent a given distribution, G. The results in this section are used to prove the optimality, with respect to

the number of phases required, of the moment matching algorithm that we propose in Section 12. More formally, we characterize the set of distributions which are well-represented by an n-phase PH distribution, for each n = 1, 2, 3, . . .

Definition 1 Let S(n) denote the set of distributions that are well-represented by an n-phase PH distribution, for positive integer n.

Our characterization of {S(n), n ≥ 1} will allow one to determine, for any distribution G, the minimal number of phases that are needed to well-represent G by a PH distribution. In finding simple characterizations of S(n), it will be helpful to start by defining an alternative to the standard moments, which we refer to as normalized moments.

Definition 2 Let µk^F be the k-th moment of a distribution F for k = 1, 2, 3. The normalized k-th moment mk^F of F for k = 2, 3 is defined to be

m2^F = µ2^F / (µ1^F)^2   and   m3^F = µ3^F / (µ1^F µ2^F).

Notice the correspondence to the coefficient of variability CF and the skewness γF of F: m2^F = CF^2 + 1 and m3^F = νF √(m2^F), where νF = µ3^F / (µ2^F)^(3/2). (Notice the correspondence between νF and the skewness of F, γF, where γF = µ¯3^F / (µ¯2^F)^(3/2) and µ¯k^F is the centralized k-th moment of F for k = 2, 3.)
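For concreteness, the normalized moments are obtained directly from the first three raw moments (a small sketch of ours):

    def normalized_moments(mu1, mu2, mu3):
        # Normalized second and third moments, as in Definition 2.
        return mu2 / mu1**2, mu3 / (mu1 * mu2)

    # Example: an exponential distribution with rate 1 has raw moments 1, 2, 6,
    # giving m2 = 2 (i.e., C^2 = 1) and m3 = 3.
    print(normalized_moments(1.0, 2.0, 6.0))   # (2.0, 3.0)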

11.1 Prior work

All prior work on characterizing S(n) has focused on characterizing S(2)*, where S(2)* is the set of distributions which are well-represented by a 2-phase PH distribution with Coxian representation (see Figure 8). Observe S(2)* ⊂ S(2). Altiok [3] showed a sufficient condition for a distribution G to be in S(2)*. More recently, Telek and Heindl [153] expanded Altiok's condition and proved the necessary and sufficient condition for a distribution G to be in S(2)*. While neither Altiok nor Telek and Heindl expressed these conditions in terms of normalized moments, the results can be expressed more simply with our normalized moments. In this section, we will characterize S(2), as well as characterizing S(n), for all integers n ≥ 2.

Figure 21: A simple characterization of S(n) by SV(n). Solid lines delineate S(n) (which is irregular) and dashed lines delineate SV(n) (which is regular – it has a simple specification). Observe the nested structure of S(n) and SV(n). SV(n) is close to S(n) in size and is contained in S(n). S(n) is almost contained in SV(n+1).

11.2 Summary of Results

While the goal of the section is to characterize the set S(n), this characterization turns out to be ugly. One of the key ideas is that there is a set SV(n) ⊂ S(n) which is very close to S(n) in size, such that SV(n) has a very simple specification via normalized moments.

Definition 3 For integers n ≥ 2, let SV(n) denote the set of distributions, F, with the following property on their normalized moments:

m2^F > n/(n−1)   and   m3^F ≥ ((n+2)/(n+1)) m2^F.        (2)

We derive a nested relationship between SV(n) and S(n) for all n ≥ 2. This relationship is illustrated in Figure 21. There are three points to observe: (i) S(n) is a proper subset of S(n+1) for all integers n ≥ 2, and likewise SV(n) is a proper subset of SV(n+1); (ii) SV(n) is contained in S(n) and close to S(n) in size, providing a simple characterization for S(n); (iii) S(n) is almost contained in SV(n+1) for all integers n ≥ 2. More formally, we prove the following theorem:

Theorem 1 SV(n) ⊂ S(n) ⊂ SV(n+1) ∪ E(n), where E(n) is the set of distributions that are well-represented by an Erlang-n distribution, for integers n ≥ 2.
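Condition (2) is easy to apply computationally; the sketch below (ours) finds the smallest n for which a given pair of normalized moments satisfies it, i.e., the smallest n such that the distribution lies in SV(n).

    def smallest_n_in_SV(m2, m3, n_max=1000):
        # Smallest n >= 2 with m2 > n/(n-1) and m3 >= (n+2)/(n+1) * m2 (condition (2)).
        for n in range(2, n_max + 1):
            if m2 > n / (n - 1) and m3 >= (n + 2) / (n + 1) * m2:
                return n
        return None   # not in SV(n) for any n <= n_max

    print(smallest_n_in_SV(3.0, 5.0))   # 2: a high-variability distribution is already in SV(2)
    print(smallest_n_in_SV(2.0, 3.0))   # 3: the exponential's moments lie on the boundary of SV(2)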

Figure 22: Depiction of the SV(n) sets for n = 2, 3, 4, 32 as a function of the normalized moments. Observe that all possible nonnegative distributions lie within the region delineated by the two dotted lines: m2 ≥ 1 and m3 ≥ m2 [92]. SV(n) for n = 2, 3, 4, 32 are delineated by solid lines, which include the border, and dashed lines, which do not include the border.

An Erlang-n distribution refers to the distribution of a random variable which is equal to the sum of n i.i.d. exponential random variables. Notice that the normalized moments of distributions in

E(n), m2^E(n) and m3^E(n), satisfy the following conditions:

m2^E(n) = (n+1)/n   and   m3^E(n) = (n+2)/n.        (3)
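These values follow directly from the raw moments of the Erlang-n distribution; writing out the check (a verification of ours, not text from the proposal):

    For the sum of $n$ i.i.d. $\mathrm{Exp}(\mu)$ random variables,
    \[
      \mu_1 = \frac{n}{\mu}, \qquad
      \mu_2 = \frac{n(n+1)}{\mu^2}, \qquad
      \mu_3 = \frac{n(n+1)(n+2)}{\mu^3},
    \]
    so that
    \[
      m_2 = \frac{\mu_2}{\mu_1^2} = \frac{n+1}{n}, \qquad
      m_3 = \frac{\mu_3}{\mu_1\,\mu_2} = \frac{n+2}{n},
    \]
    which is exactly equation (3).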

Theorem 1 tells us that S(n) is “sandwiched between” SV(n) and SV(n+1). From Figure 22, we see that SV(n) and SV(n+1) are quite close for high n. Thus we have a very accurate representation of S(n).

11.3 Proposed work

In [124], we prove Theorem 1 where S(n) is replaced by the set of n-phase PH distributions with Coxian representation (see Figure 8). We conjecture that the theorem holds for general PH distributions, and I propose to work on proving the theorem as a part of my thesis work.

12 Approximating general distributions by phase type distributions (Completed work)

In applying RDR, it is crucial to approximate general distributions by phase type (PH) distributions, which is thus the problem that we investigate in this section. Approximating general distributions by PH distributions has very broad applicability in the analysis of computer system performance, since the Markovian property of PH distributions often allows analytical tractability. A popular approach to approximating a general distribution, G, by a PH distribution, P, is to match the moments of P and G. Moment-matching algorithms are evaluated along four different measures:

• The number of moments matched – In general, matching more moments is more desirable.

• The computational efficiency of the algorithm – It is desirable that the algorithm have a short running time. Ideally, one would like a closed-form solution for the parameters of the matching PH distribution.

• The generality of the solution – Ideally, the algorithm should work for as broad a class of distributions as possible.

• The minimality of the number of phases – It is desirable that the matching PH distribution, P, have very few phases. Recall that the goal is to find P which can replace the input distribution G in some queueing model, allowing a Markov chain representation of the problem. Since it is desirable that the state space of this resulting Markov chain be kept small, we want to keep the number of phases in P low.

A primary goal of this section is to propose a moment-matching algorithm which performs very well along all four of these measures. Our solution matches three moments, provides a closed form representation of the parameters of the matching PH distribution, applies to almost all nonnegative distributions, and is nearly minimal in the number of phases required. We choose to limit our discussion to three-moment matching, because matching the first three moments of an input distribution has been shown to be effective in predicting mean performance for a variety of computer system models [53, 61, 62, 125, 127, 146, 163, 172].

49 12.1 Related work

Prior work has contributed a very large number of moment matching algorithms. While all of these algorithms excel with respect to some of the four measures mentioned earlier, they all are deficient in at least one of these measures, as explained below. In cases where matching only two moments suffices, it is possible to achieve solutions which perform very well along all the other three measures [111, 136]. If one is willing to match only a subset of distributions, then again it is possible to achieve solutions which perform very well along the remaining three measures. Whitt [160] and Altiok [3] focus on the set of distributions with C^2 > 1 and sufficiently high third moment. Telek and Heindl [153] focus on the set of distributions with C^2 ≥ 1/2 and various constraints on the third moment. Johnson and Taaffe [83, 84] come closest to achieving all four measures. They provide a closed-form solution for matching the first three moments of almost all nonnegative distributions, G. Unfortunately, their solution requires twice as many phases as necessary. In complementary work, Johnson and Taaffe [86, 85] again look at the problem of matching the first three moments of almost all nonnegative distributions, using a nearly minimal number of phases. Unfortunately, their algorithm requires solving a nonlinear programming problem and hence is very computationally inefficient. Above we have described the prior work focusing on moment-matching algorithms (three moments). There is also a large body of work focusing on fitting the shape of an input distribution using a PH distribution. Of particular recent interest has been work on fitting heavy-tailed distributions to PH distributions [51, 66, 67, 93, 130, 151]. There is also work which combines the goal of moment matching with the goal of fitting the shape of the distribution [82, 138]. The work above is clearly broader in its goals than simply matching three moments. Unfortunately, there is a tradeoff: obtaining a more precise fit requires more phases. Additionally, it can sometimes be computationally inefficient [82, 138].

12.2 Summary of results

The key idea in our solution is to match a general input distribution to a distribution in a subset of the PH distributions. We carefully design this subset, which we call the set of EC distributions:

Definition 4 An n-phase EC (Erlang-Coxian) distribution is a particular PH distribution whose underlying Markov chain is of the form in Figure 23.

Figure 23: The Markov chain underlying an EC distribution, where the first box above depicts the underlying continuous time Markov chain of an N-phase Erlang distribution, where N = n − 2, and the second box depicts the underlying continuous time Markov chain of a two-phase PH distribution with Coxian representation (see Figure 8). Notice that the rates in the first box are the same for all states.

We now provide some intuition behind the creation of the EC distribution. A PH distribution with Coxian representation (see Figure 8) is very good for approximating any distribution with high variability, but it requires many more phases for approximating distributions with lower second and third moments. The large number of phases needed implies that many free parameters must be determined, which implies that any algorithm that tries to well-represent an arbitrary distribution using a minimal number of phases is likely to suffer from computational inefficiency. By contrast, an n-phase Erlang distribution has only two free parameters and is also known to have the least normalized second moment (see Section 11) among all the n-phase PH distributions [2]. However, the Erlang distribution is obviously limited in the set of distributions which it can well-represent. Our approach is therefore to combine the Erlang distribution with the two-phase PH distribution with Coxian representation, allowing us to represent distributions with all ranges of variability, while using only a small number of phases. Furthermore, the fact that the EC distribution has very few free parameters allows us to obtain closed-form expressions for the parameters (n, p, λY, λX1, λX2, pX) of the EC distribution. An interesting open problem is to derive a closed form solution for the parameters of a PH distribution, P, without a probability mass at zero, such that the first three moments of P match those of a given distribution. (This problem will not be addressed in the thesis.) Another interesting open problem is to extend the moment matching algorithm to k > 3 moments. (This problem will not be addressed in the thesis.)
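Assuming the structure shown in Figure 23 — where p is the probability of entering the Erlang stage at all, so that with probability 1 − p the distribution places mass at zero (this reading of the figure is our assumption; it is consistent with the open problem above about avoiding a probability mass at zero) — the first three moments of an EC distribution can be computed from its six parameters as follows (a sketch of ours):

    from math import comb, factorial

    def _sum_moments(a, b):
        # Raw moments (orders 0..3) of the sum of two independent nonnegative random variables.
        return [sum(comb(k, j) * a[j] * b[k - j] for j in range(k + 1)) for k in range(4)]

    def ec_moments(n, p, lamY, lamX1, lamX2, pX):
        # First three raw moments of an n-phase EC distribution under the stated assumptions.
        N = n - 2                                 # number of Erlang phases (Figure 23)
        if N == 0:
            erlang = [1.0, 0.0, 0.0, 0.0]         # no Erlang stage
        else:
            erlang = [1.0] + [factorial(N + k - 1) / factorial(N - 1) / lamY**k for k in (1, 2, 3)]
        x1 = [1.0] + [factorial(k) / lamX1**k for k in (1, 2, 3)]
        x2 = [1.0] + [pX * factorial(k) / lamX2**k for k in (1, 2, 3)]   # Bernoulli(pX) * Exp(lamX2)
        t = _sum_moments(_sum_moments(erlang, x1), x2)
        return [p * t[k] for k in (1, 2, 3)]      # with probability 1-p the sample is zero

Matching a distribution G then amounts to choosing (n, p, λY, λX1, λX2, pX) so that these three moments agree with those of G, which is what the closed-form solution in this section provides.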

A Schedule

Below, we summarize the problems that I propose to work on as part of my thesis over the coming year. The thesis will be defended in May 2005.

Design and analysis of resource allocation policies under fluctuating load

This research includes the analysis of the T1 policy (Section 6) and other resource allocation policies such as the ADT policy (Section 9). Here, we will propose new resource allocation policies. We will submit papers to appropriate conferences such as the ACM PODC and the ACM SIGMETRICS.

Analysis of the impact of irregular arrival processes

This research includes the analysis of various scheduling policies such as FCFS and priority scheduling under arrival processes with variable and correlated interarrival times (Sections 7-8). We will submit papers to appropriate conferences such as the ACM SIGMETRICS.

Extensions to more servers and more classes

In Sections 2-4 and 9, we have assumed rather simple multiserver architectures with two servers and two classes of jobs. We will work on extending our results to more servers and more classes of jobs. We will also look for approximations for these more complex systems. The results will be summarized in Section 10.

Proving the optimality of the moment matching algorithm

We will prove Theorem 1 (Section 11). We will submit a paper to a journal (either Stochastic Models, QUESTA, or Performance Evaluation).

52 References [1] T. F. Abdelzaher and N. Bhatti. Web content adaptation to improve server overload behavior. Computer Networks, 31:1563–1577, 1999. [2] D. Aldous and L. Shepp. The least variable phase type distribution is Erlang. Communications in Statistics - Stotchastic Models, 3:467–473, 1987. [3] T. Altiok. On the phase-type approximations of general distributions. IIE Transactions, 17:110–116, 1985. [4] K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, and S. Krishnakumar. Oceano - SLA based management of a computing utility. In Proceedings of the IFIP/IEEE Symposium on Integrated Network Management, pages 855–868, May 2001. [5] M. Arlitt and T. Jin. A workload characterization study of the 1998 world cup web site. IEEE Network, May/June:30–37, 2000. [6] M. F. Arlitt and C. L. Williamson. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, 5:631–645, 1997. [7] Y. Artsy and R. Finkel. Designing a process migration facility: The Charlotte experience. IEEE Computer, 22(9):47–56, 1989. [8] B. Awerbuch, Y. Azar, A. Fiat, and T. Leighton. Making commitments in the face of uncertainty: How to pick a winner almost every time. In Proceedings of the ACM STOC, pages 519–530, May 1996. [9] M. Baker, R. Buyya, and D. Laforenza. Grids and grid technologies for wide-area distributed com- puting. Journal of Software: Practice and Experience, 32(15):1437–1466, 2002. [10] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An infrastructure for global com- puting. In Proceedings of the ACM SIGOPS European workshop: Systems support for worldwide applications, pages 165–172, September 1996. [11] A. Barak, S. Guday, and R. Wheeler. The MOSIX Distributed Operating System, Load Balancing for UNIX. Lecture Notes in Computer Science, Vol. 672. Springer-Verlag, 1993. [12] S. Bell and R. Williams. Dynamic scheduling of a system with two parallel servers in heavy traffic with complete resource pooling: Asymptotic optimality of a continuous review threshold policy. Annals of Probability, 11:608–649, 2001. [13] F. Berman and R. Wolski. The AppLeS project: A status report. In Proceedings of the NEC Research Sympoium, May 1997. [14] S. N. Bhatt, F. R. Chung, F. T. Leighton, and A. L. Rosenberg. On optimal strategies for cycle- stealing in networks of workstations. IEEE Transactions on Computers, 46(5):545–557, 1997. [15] F. Bonomi and A. Kumar. Adaptive optimal load balancing in a nonhomogeneous multiserver system with a central job scheduler. IEEE Transactions on Computers, 39(10):1232–1250, October 1990. [16] S. Borst, O. Boxma, and P. Jelenkovic. Reduced-load equivalence and induced burstiness in GPS queues with long-tailed traffic flows. Queueing Systems, 43:274–285, 2003. [17] S. Borst, O. Boxma, and M. van Uitert. The asymptotic workload behavior of two coupled queues. Queueing Systems, 43:81–102, 2003. [18] L. Breslau, S. Jamin, and S. Shenker. Comments on the performance of measurement-based admission control algorithms. In Proceedings of the IEEE INFOCOM, pages 1233–1242, March 2000. [19] L. Breslau, E. W. Knightly, S. Shenker, I. Stoica, and H. Zhang. Endpoint admission control: Architectural issues and performance. In Proceedings of the ACM SIGCOMM, pages 57–69, October 2000. [20] J. Buzen and A. Bondi. The response times of priority classes under preemptive resume in M/M/m queues. Operations Research, 31:456–465, 1983. [21] L. Cabrera. The incluence of workload on load balancing strategies. 
In Proceedings of the USENIX Summer Conference, pages 446–458, June 1986. [22] M. Calzarossa and G. Serazzi. A characterization of the variation in time of workload arrival patterns. IEEE Transactions on Computers, c-34(2):156–162, 1985. [23] J. Cao, W. S. Cleveland, D. Lin, and D. X. Sun. On the nonstationarity of internet traffic. In Proceedings of the ACM SIGMETRICS, pages 102–112, June 2001. [24] N. Carriero, E. Freeman, D. Gelernter, and D. Kaminsky. Adaptive parallelism and Piranha. IEEE Computer, 28(1):40–49, 1995.

53 [25] S. Chandra, C. S. Ellis, and A. Vahdat. Application-level differentiated multimedia web services using quality aware transcoding. IEEE Journal on Selected Areas in Communications, 18(12):2544–2565, 2000. [26] J. Chase, L. Grit, D. Irwin, J. Moore, and S. Sprenkle. Dynamic virtual clusters in a grid site manager. In Proceedings of the International Symposium on High Performance Distributed Computing, pages 90–103, June 2003. [27] J. S. Chase, D. C. Anderson, P. N. Thankar, and A. M. Vahdat. Managing energy and server resources in hosting centers. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 103–116, October 2001. [28] H. Chen and P. Mohapatra. Session-based overload control in QoS-aware web servers. In Proceedings of the IEEE INFOCOM, pages 516–524, June 2002. [29] X. Chen, P. Mohapatra, and H. Chen. An admission control scheme for predictable server response time for web accesses. In Proceedings of the World Wide Web, pages 545–554, May 2001. [30] L. Cherkasova and P. Phaal. Predictive admission control strategy for overloaded commercial web server. In Proceedings of the IEEE MASCOTS, pages 500–507, August 2000. [31] L. Cherkasova and P. Phaal. Session-based admission control: A mechanism for peak load manage- ment of commercial web sites. IEEE Transactions on Computers, 51:669–685, 2002. [32] M. Chetty and R. Buyya. Weaving computational grids: How analogous are they with electrical grids? IEEE Computing in Science and Engineering, 4(4):61–71, 2002. [33] B. Christiansen, P. Cappello, M. Ionescu, M. Neary, K. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. Concurrency: Practice and Experience, 9(11):1139–1160, 1997. [34] W. S. Cleveland, D. Lin, and D. X. Sun. IP packet generation: Statistical models for TCP start times based on connection-rate superposition. In Proceedings of the ACM SIGMETRICS, pages 166–177, June 2000. [35] A. Cobham. Priority assignment in waiting line problems. Operations Research, 2:70–76, 1954. [36] J. Cohen and O. Boxma. Boundary Value Problems in Queueing System Analysis. North-Holland Publ. Cy., 1983. [37] D. Cox and W. Smith. Queues. Kluwer Academic Publishers, 1971. [38] M. E. Crovella and A. Bestavros. Self-similarity in World Wide Web traffic: Evidence and possible causes. IEEE/ACM Transactions on Networking, 5(6):835–846, December 1997. [39] D. J. Daley. Some results for the mean waiting-time and workload in GI/GI/k queues. In Frontiers in Queueing: Models and Applications in Science and Engineering. CRC Press, 1997. [40] R. Davis. Waiting-time distribution of a multi-server, priority queueing system. Operations Research, 14:133–136, 1966. [41] S. Deng. Empirical model of WWW document arrivals at access link. In Proceedings of the IEEE International Conference on Communication, pages 1797–1802, June 1996. [42] F. Douglis and J. Ousterhout. Transparent process migration: Design alternatives and the Sprite implementation. Software: Practice and Experience, 21(8):757–785, 1991. [43] A. B. Downey. Evidence for long-tailed distributions in the internet. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop, pages 229–241, November 2001. [44] A. Ephremides, P. Varaiya, and J. Walrand. A simple dynamic routing problem. IEEE Transactions on Automatic Control, AC-25(4):690–693, 1980. [45] G. Fayolle and R. Iasnogorodski. Two coupled processors: The reduction to a Riemann-Hilbert problem. Zeitschrift fur Wahrscheinlichkeitstheorie und vervandte Gebiete, 47:325–351, 1979. [46] G. Fayolle, P. 
King, and I. Mitrani. The solution of certain two-dimensional markov models. Advances in Applied Probability, 14:295–308, 1982. [47] D. G. Feitelson. Packing schemes for gang scheduling. Lecture Notes in Computer Science, 1162:89– 110, 1996. [48] D. G. Feitelson and B. Nitzberg. Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860. In Proceedings of IPPS ’95 Workshop on Job Scheduling Strategies for Parallel Processing, pages 215–227, April 1995. [49] R. Felderman, E. Schooler, and L. Kleinrock. The Benevolent Bandit laboratory: A testbed for distributed algorithms. IEEE Journal on Selected Areas in Communications, 7(2):303–311, 1989. [50] A. Feldmann. Characteristics of TCP connection arrivals. In K. Park and W. Willinger, editors, Self-Similar Network Traffic and Performance Evaluation. Wiley-Interscience, January 2000.

[51] A. Feldmann and W. Whitt. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Performance Evaluation, 32:245–279, 1998.
[52] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications, 15(3), 2001.
[53] H. Franke, J. Jann, J. Moreira, P. Pattnaik, and M. Jette. An evaluation of parallel job scheduling for ASCI Blue-Pacific. In Proceedings of Supercomputing, pages 679–691, November 1999.
[54] H. Gail, S. Hantler, and B. Taylor. Analysis of a non-preemptive priority multiserver queue. Advances in Applied Probability, 20:852–879, 1988.
[55] H. Gail, S. Hantler, and B. Taylor. On a preemptive Markovian queue with multiple servers and two priority classes. Mathematics of Operations Research, 17:365–391, 1992.
[56] J. Gehring and A. Streit. Robust resource management for metacomputers. In Proceedings of the International Symposium on High-Performance Distributed Computing, pages 105–111, August 2000.
[57] A. S. Grimshaw and W. A. Wulf. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[58] M. Harchol-Balter. Task assignment with unknown duration. Journal of the ACM, 49(2):260–288, 2002.
[59] M. Harchol-Balter, M. Crovella, and C. Murta. On choosing a task assignment policy for a distributed server system. IEEE Journal of Parallel and Distributed Computing, 59:204–228, 1999.
[60] M. Harchol-Balter and A. Downey. Exploiting process lifetime distributions for dynamic load balancing. ACM Transactions on Computer Systems, 15(3):253–285, 1997.
[61] M. Harchol-Balter, C. Li, T. Osogami, A. Scheller-Wolf, and M. Squillante. Task assignment with cycle stealing under central queue. In Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS 2003), pages 628–637, May 2003.
[62] M. Harchol-Balter, C. Li, T. Osogami, A. Scheller-Wolf, and M. Squillante. Task assignment with cycle stealing under immediate dispatch. In Proceedings of the Fifteenth ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2003), pages 274–285, June 2003.
[63] J. Harrison. Heavy traffic analysis of a system with parallel servers: Asymptotic optimality of discrete review policies. Annals of Applied Probability, 8(3):822–848, 1998.
[64] J. Harrison and M. Lopez. Heavy traffic resource pooling in parallel server systems. Queueing Systems, 33(4):339–368, 1999.
[65] C. Hollot, V. Misra, D. Towsley, and W. Gong. A control theoretic analysis of RED. In Proceedings of the IEEE INFOCOM, pages 1510–1519, April 2001.
[66] A. Horváth and M. Telek. Approximating heavy tailed behavior with phase type distributions. In Advances in Matrix-Analytic Methods for Stochastic Models, pages 191–214. Notable Publications, July 2000.
[67] A. Horváth and M. Telek. PhFit: A general phase-type fitting tool. In Proceedings of Performance TOOLS 2002, pages 82–91, April 2002.
[68] S. Hotovy. Workload evolution on the Cornell Theory Center IBM SP2. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 27–40. Springer-Verlag, 1996.
[69] S. Hotovy, D. Schneider, and T. J. O’Donnell. Analysis of the early workload on the Cornell Theory Center IBM SP2. In Proceedings of the ACM SIGMETRICS, pages 272–273, 1996.
[70] http://folding.stanford.edu/.
[71] http://ita.ee.lbl.gov/.
[72] http://setiathome.ssl.berkeley.edu/.
[73] http://www.computeagainstcancer.org/.
[74] http://www.cs.huji.ac.il/labs/parallel/workload/.
[75] http://www.fightaidsathome.org/.
[76] http://www.mersenne.org/prime.htm.
[77] A. Iyengar, E. MacNair, and T. Nguyen. An analysis of web server performance. In Proceedings of the IEEE GLOBECOM, pages 1943–1947, November 1997.
[78] A. Iyengar, E. MacNair, M. Squillante, and L. Zhang. A general methodology for characterizing access patterns and analyzing web server performance. In Proceedings of the IEEE MASCOTS, pages 167–174, July 1998.
[79] A. Iyengar, M. S. Squillante, and L. Zhang. Analysis and characterization of large-scale web server access patterns and performance. World Wide Web, 2(1-2):85–100, 1999.
[80] V. S. Iyengar, L. H. Trevillyan, and P. Bose. Representative traces for processor models with infinite cache. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), pages 62–72, February 1996.
[81] S. Jamin, P. B. Danzig, S. Shenker, and L. Zhang. A measurement-based admission control algorithm for integrated services packet networks. In Proceedings of the ACM SIGCOMM, pages 2–13, August 1995.
[82] M. A. Johnson. Selecting parameters of phase distributions: Combining nonlinear programming, heuristics, and Erlang distributions. ORSA Journal on Computing, 5:69–83, 1993.
[83] M. A. Johnson and M. R. Taaffe. An investigation of phase-distribution moment-matching algorithms for use in queueing models. Queueing Systems, 8:129–147, 1991.
[84] M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Mixtures of Erlang distributions of common order. Communications in Statistics — Stochastic Models, 5:711–743, 1989.
[85] M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Density function shapes. Communications in Statistics — Stochastic Models, 6:283–306, 1990.
[86] M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Nonlinear programming approaches. Communications in Statistics — Stochastic Models, 6:259–281, 1990.
[87] E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1):109–133, 1988.
[88] E. Kao and K. Narayanan. Computing steady-state probabilities of a nonpreemptive priority multiserver queue. Journal on Computing, 2:211–218, 1990.
[89] E. Kao and K. Narayanan. Modeling a multiprocessor system with preemptive priorities. Management Science, 2:185–197, 1991.
[90] E. Kao and S. Wilson. Analysis of nonpreemptive priority queues with multiple servers and two priority classes. European Journal of Operational Research, 118:181–193, 1999.
[91] A. Kapadia, M. Kazumi, and A. Mitchell. Analysis of a finite capacity nonpreemptive priority queue. Computers and Operations Research, 11:337–343, 1984.
[92] S. Karlin and W. Studden. Tchebycheff Systems: With Applications in Analysis and Statistics. John Wiley and Sons, 1966.
[93] R. E. A. Khayari, R. Sadre, and B. Haverkort. Fitting world-wide web request traces with the EM-algorithm. Performance Evaluation, 52:175–191, 2003.
[94] L. Kleinrock. Queueing Systems, Volume I: Theory. Wiley-Interscience, 1975.
[95] L. Kleinrock. Queueing Systems, Volume II: Computer Applications. Wiley-Interscience, 1976.
[96] A. Konheim, I. Meilijson, and A. Melkman. Processor-sharing of two parallel lines. Journal of Applied Probability, 18:952–956, 1981.
[97] B. Krishnamurthy, C. Wills, and Y. Zhang. On the use and performance of content distribution networks. In Proceedings of the ACM SIGCOMM, pages 169–182, November 2001.
[98] P. Krueger and R. Chawla. The Stealth distributed scheduler. In Proceedings of the IEEE ICDCS, pages 336–343, May 1991.
[99] M. Krunz and S. Tripathi. On the characteristics of VBR MPEG streams. In Proceedings of the ACM SIGMETRICS, pages 192–202, June 1997.
[100] G. Latouche and V. Ramaswami. Introduction to Matrix Analytic Methods in Stochastic Modeling. ASA-SIAM, Philadelphia, 1999.
[101] H. Leemans. The Two-Class Two-Server Queue with Nonpreemptive Heterogeneous Priority Structures. PhD thesis, K.U.Leuven, 1998.
[102] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2:1–15, 1994.
[103] K. Li and S. Jamin. A measurement-based admission-controlled web server. In Proceedings of the IEEE INFOCOM, pages 651–659, 2000.
[104] M. Litzkow, M. Livny, and M. Mutka. Condor — A hunter of idle workstations. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), pages 104–111, June 1988.
[105] M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the UNIX kernel. In Proceedings of the USENIX Winter Conference, pages 283–290, January 1992.
[106] X. Liu, L. Sha, Y. Diao, S. Froehlich, J. Hellerstein, and S. Parekh. Online response time optimization of Apache web server. In Proceedings of the International Workshop on Quality of Service (IWQoS), pages 461–478, June 2003.
[107] S. Low, L. Peterson, and L. Wang. Understanding TCP Vegas: A duality model. In Proceedings of the ACM SIGMETRICS, pages 226–235, June 2001.
[108] S. H. Low and D. E. Lapsley. Optimization flow control — I: Basic algorithm and convergence. IEEE/ACM Transactions on Networking, 7(6):861–874, 1999.
[109] Y. Lu, T. Abdelzaher, and C. Lu. Feedback control with queueing-theoretic prediction for relative delay guarantees in web servers. In Proceedings of the IEEE Real-Time Embedded Technology and Applications Symposium, pages 208–218, May 2003.
[110] A. Mandelbaum and A. Stolyar. Scheduling flexible servers with convex delay costs: Heavy traffic optimality of the generalized cµ-rule. Operations Research, to appear.
[111] R. Marie. Calculating equilibrium probabilities for λ(n)/Ck/1/N queues. In Proceedings of Performance, pages 117–125, 1980.
[112] J. Van Mieghem. Dynamic scheduling with convex delay costs: The generalized cµ rule. Annals of Applied Probability, 5(3):809–833, 1995.
[113] D. Miller. Steady-state algorithmic analysis of M/M/c two-priority queues with heterogeneous servers. In R. L. Disney and T. J. Ott, editors, Applied Probability - Computer Science, The Interface, volume II, pages 207–222. Birkhäuser, 1992.
[114] D. Milojicic, W. Zint, A. Dangel, and P. Giese. Task migration on the top of the Mach microkernel. In Proceedings of the USENIX Mach Symposium, pages 273–290, April 1992.
[115] I. Mitrani and P. King. Multiprocessor systems with preemptive priorities. Performance Evaluation, 1:118–125, 1981.
[116] J. Mogul. Network behavior of a busy web server and its clients. Technical Report 95/5, DEC Western Research Laboratory, October 1995.
[117] A. Müller and D. Stoyan. Comparison Methods for Stochastic Models and Risks. John Wiley & Sons, 2002.
[118] M. W. Mutka and M. Livny. The available capacity of a privately owned workstation environment. Performance Evaluation, 12:269–284, 1991.
[119] P. Nain. On a generalization of the preemptive resume priority queue. Advances in Applied Probability, 18:255–273, 1986.
[120] M. F. Neuts. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. The Johns Hopkins University Press, 1981.
[121] B. Ngo and H. Lee. Analysis of a pre-emptive priority M/M/c model with two types of customers and restriction. Electronics Letters, 26:1190–1192, 1990.
[122] D. Nichols. Using idle workstations in a shared computing environment. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 5–12, November 1987.
[123] T. Osogami and M. Harchol-Balter. A closed-form solution for mapping general distributions to minimal PH distributions. In Proceedings of the Performance TOOLS, pages 200–217, September 2003.
[124] T. Osogami and M. Harchol-Balter. Necessary and sufficient conditions for representing general distributions by Coxians. In Proceedings of the Performance TOOLS, pages 182–199, September 2003.
[125] T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. Analysis of cycle stealing with switching cost. In Proceedings of the ACM SIGMETRICS, pages 184–195, June 2003.
[126] T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. Analysis of cycle stealing with switching costs and thresholds. 2003 (submitted for publication).
[127] T. Osogami, A. Wierman, M. Harchol-Balter, and A. Scheller-Wolf. How many servers are best in a dual-priority FCFS system? Technical Report CMU-CS-03-213, School of Computer Science, Carnegie Mellon University, 2004.
[128] V. Paxson and S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, pages 226–244, June 1995.
[129] B. Rao and M. Posner. Parallel exponential queues with dependent service rates. Computers and Operations Research, 13(6):681–692, 1986.
[130] A. Riska, V. Diev, and E. Smirni. Efficient fitting of long-tailed data sets into PH distributions. Performance Evaluation, 2003 (to appear).
[131] A. L. Rosenberg. Guidelines for data-parallel cycle-stealing in networks of workstations. Journal of Parallel and Distributed Computing, 59:31–53, 1999.
[132] A. L. Rosenberg. Guidelines for data-parallel cycle-stealing in networks of workstations, II: On maximizing guaranteed output. International Journal of Foundations of Computer Science, 11:183–204, 2000.
[133] A. L. Rosenberg. Optimal schedules for cycle-stealing in a network of workstations with a bag-of-tasks workload. IEEE Transactions on Parallel and Distributed Systems, 13(2):179–191, 2002.
[134] K. W. Ross and D. D. Yao. Optimal load balancing and scheduling in a distributed computer system. Journal of the ACM, 38(3):676–690, July 1991.
[135] K. D. Ryu, J. K. Hollingsworth, and P. J. Keleher. Mechanisms and policies for supporting fine-grained cycle stealing. In Proceedings of the ACM/IEEE Conference on Supercomputing (CD-ROM), May 1999.
[136] C. Sauer and K. Chandy. Approximate analysis of central server models. IBM Journal of Research and Development, 19:301–313, 1975.
[137] A. Scheller-Wolf and K. Sigman. New bounds for expected delay in FIFO GI/GI/c queues. Queueing Systems: Theory and Applications, 26(1):169–186, 1997.
[138] L. Schmickler. MEDA: Mixed Erlang distributions as phase-type representations of empirical distribution functions. Communications in Statistics — Stochastic Models, 8:131–156, 1992.
[139] B. Schroeder and M. Harchol-Balter. Evaluation of task assignment policies for supercomputing servers: The case for load unbalancing and fairness. In Proceedings of HPDC 2000, pages 211–219, 2000.
[140] M. Segal. A multiserver system with preemptive priorities. Operations Research, 18:316–323, 1970.
[141] L. Sha and X. Liu. Queueing model based network server performance control. In Proceedings of the IEEE Real-Time Systems Symposium, pages 81–90, December 2002.
[142] J. Shoch and J. Hupp. Measured performance of an Ethernet local network. Communications of the ACM, 23:711–721, 1980.
[143] R. Shumsky. Approximation and analysis of a call center with specialized and flexible servers. Working paper, 2004.
[144] A. Sleptchenko, A. van Harten, and M. van der Heijden. An exact solution for the state probabilities of the multi-class, multi-server queue with preemptive priorities. Manuscript, 2003.
[145] P. Smith and N. Hutchinson. Heterogeneous process migration: The Tui system. Software: Practice and Experience, 28(6):611–638, 1998.
[146] M. Squillante. Matrix-analytic methods in stochastic parallel-server scheduling models. In Advances in Matrix-Analytic Methods for Stochastic Models. Notable Publications, July 1998.
[147] M. Squillante, C. Xia, D. Yao, and L. Zhang. Threshold-based priority policies for parallel-server systems with affinity scheduling. In Proceedings of the IEEE American Control Conference, pages 2992–2999, June 2001.
[148] M. S. Squillante, C. H. Xia, and L. Zhang. Optimal scheduling in queuing network models of high-volume commercial web sites. Performance Evaluation, 47(4):223–242, 2002.
[149] M. S. Squillante, D. D. Yao, and L. Zhang. Analysis of job arrival patterns and parallel scheduling performance. Performance Evaluation, 36-37:137–163, 1999.
[150] M. S. Squillante, D. D. Yao, and L. Zhang. Web traffic modeling and web server performance analysis. In Proceedings of the IEEE Conference on Decision and Control, December 1999.
[151] D. Starobinski and M. Sidi. Modeling and analysis of power-tail distributions via classical teletraffic methods. Queueing Systems, 36:243–267, 2000.
[152] H. Takagi. Queueing Analysis: Vol. 1, Vacation and Priority Systems. North-Holland, 1991.
[153] M. Telek and A. Heindl. Matching moments for acyclic discrete and continuous phase-type distributions of second order. International Journal of Simulation, 3:47–57, 2003.
[154] G. Thiel. Locus operating systems, a transparent system. Computer Communications, 16(6):336–346, 1991.
[155] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel mechanisms for service differentiation in overloaded web servers. In Proceedings of the USENIX Annual Technical Conference, pages 189–202, June 2001.
[156] L. Wald and S. Schwarz. The 1999 Southern California Seismic Network bulletin. Seismological Research Letters, 71(4), 2000.
[157] M. Welsh. An Architecture for Highly Concurrent, Well-Conditioned Internet Services. PhD thesis, University of California, Berkeley, 2002.
[158] M. Welsh and D. Culler. Adaptive overload control for busy Internet servers. In Proceedings of the USENIX Conference on Internet Technologies and Systems, March 2003.
[159] M. Welsh, D. Culler, and E. Brewer. SEDA: An architecture for well-conditioned, scalable Internet services. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 230–243, October 2001.
[160] W. Whitt. Approximating a point process by a renewal process: Two basic methods. Operations Research, 30:125–147, 1982.
[161] W. Whitt. On approximations for queues, I: Extremal distributions. AT&T Bell Laboratories Technical Journal, 63:115–138, 1983.
[162] W. Whitt. Stochastic-Process Limits. Springer-Verlag, 2002.
[163] A. Wierman, T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. Analyzing the effect of prioritized background tasks in multiserver systems. Technical Report CMU-CS-03-213, School of Computer Science, Carnegie Mellon University, 2004.
[164] R. Williams. On dynamic scheduling of a parallel server system with complete resource pooling. In D. McDonald and S. Turner, editors, Analysis of Communication Networks: Call Centers, Traffic and Performance. American Mathematical Society, 2000.
[165] W. Willinger, M. Taqqu, W. Leland, and D. Wilson. Self-similarity in high-speed packet traffic: Analysis and modeling of Ethernet traffic measurements. Statistical Science, 10:67–85, 1995.
[166] W. Willinger, M. Taqqu, R. Sherman, and D. Wilson. Self-similarity through high-variability: Statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 5(1):71–86, 1997.
[167] K. Windisch, V. Lo, B. Nitzberg, D. Feitelson, and R. Moore. A comparison of workload traces from two production parallel machines. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pages 319–326, May 1996.
[168] W. Winston. Optimality of the shortest line discipline. Journal of Applied Probability, 14:181–189, 1977.
[169] R. W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice Hall, 1989.
[170] C. Xia, Z. Liu, M. Squillante, L. Zhang, and N. Malouch. Analysis of performance impact of drill-down techniques for web traffic models. In Proceedings of the 18th International Teletraffic Congress, pages 409–418, September 2003.
[171] E. Zayas. Attacking the process migration bottleneck. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 13–24, November 1987.
[172] Y. Zhang, H. Franke, J. Moreira, and A. Sivasubramaniam. An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration. IEEE Transactions on Parallel and Distributed Systems, 14:236–247, 2003.
[173] S. Zhou, X. Zheng, J. Wang, and P. Delisle. Utopia: A load sharing facility for large, heterogeneous distributed computer systems. Software: Practice and Experience, 23(12):1305–1336, 1993.