Striping Do esnt Scale How to Achieve for Continuous



Media Servers with

ChengFu Chou

Department of Computer Science

University of Maryland at College Park

Leana Golub chik

University of Maryland Institute for Advanced Computer Studies

and Department of Computer Science

University of Maryland at College Park

John CS Lui

Department of Computer Science and Engineering

The Chinese University of Hong Kong

Abstract

Multimedia applications place high demands for QoS p erformance and reliability on storage

servers and communication networks These often stringent requirements make design of cost

eective and scalable continuous media CM servers dicult In particular the choice of data

placement techniques can have a signicant eect on the scalability of the CM server and its

ability to utilize resources eciently In the recent past a great deal of work has fo cused on

wide data striping as a technique which implicitly solves load balancing problems although

it do es suer from multiple shortcomings Another approach to dealing with load imbalance

problems is replication The main fo cus of this pap er is a study of scalability characteristics

of CM servers as a function of tradeos b etween striping and replication More sp ecically

striping is a go o d approach to load balancing while replication is a go o d approach to isolating

no des from b eing dep endent on other system resources The appropriate compromise b etween

the degree of striping and the degree of replication is key to the design of a scalable CM server

This is the topic of our work

Keywords continuous media servers distributed systems p erformance evaluation scalability

endtoend design

Technical areas Multimedia and Digital Libraries Distributed Systems Architecture



The work of ChengFu Chou and Leana Golub chik and was supp orted in part by the NSF CAREER grant

CCR

Intro duction

With the rapid growth of multimedia applications there is a growing need for largescale continuous

media CM servers that can meet the user demand Multimedia applications such as video stream

delivery digital libraries distance learning systems and so on place high demands for qualityof

service QoS p erformance and reliability on storage servers and communication networks These

often stringent requirements make endtoend design of costeective and scalable continuous media

servers dicult The scalability of a CM servers architecture dep ends on its ability to

expand as user demand and data sizes grow

maintain p erformance characteristics under growth or reconguration

maintain p erformance characteristics under degradation of system resources which can b e

caused by losses in network and storage capacities

1

In particular in a continuous media server the choice of data placement techniques can have

a signicant eect on the scalability of the system and its ability to utilize resources eciently

Existing data placement techniques in conjunction with scheduling algorithms address two ma jor

ineciencies in such systems various overheads in reading data from storage devices eg due

to disk arm movement and load imbalance eg due to skews in data access patterns In this

work we fo cus on the latter issue and sp ecically on its b earing on the scalability characteristics

of a distributed CM server

Due to the enormous storage and IO bandwidth requirements of multimedia data a CM

server is exp ected to have a very large disk farm Thus we must necessarily consider designs which

contain multiple disk clusters and pro cessing no des ie we must consider distributed designs An

imp ortant consideration then is the placement of ob jects on the no des of the CM server As in

traditional systems data placement on a distributed storage subsystem directly aects

the load balancing characteristics of that system

In the recent past a great deal of research work eg as in has fo cused on wide

data striping as a data placement technique for designing continuous media servers By wide data

1

Here by data placement we mean decision of which ob ject or fraction of an ob ject to place on which disk or disk

cluster ie this do es not refer to data placement issues within a single disk

striping we mean that each ob ject is strip ed across all the disks of the system Recall that the

p otential load imbalance is largely due to the skews in data access patterns which without data

striping could result in high loads on some disks containing the more p opular ob jects while the

disks containing less p opular ob jects may b e idling Moreover the problem is exacerbated by the

fact that access patterns change over time ie the p opularity of a particular ob ject is a function

of time

Thus an advantage of wide data striping is that it implicitly achieves load balance by decou

pling an ob jects storage from its bandwidth requirements However wide data striping also suers

from several shortcomings

It is not practical to assume that a system can b e constructed from homogeneous disks ie

as the system grows or exp eriences faults and thus disk replacement we are b e forced to use

disks with dierent transfer and storage capacity characteristics having to strip e ob jects

across heterogeneous disks would lead to further complications

An appropriate choice of a striping unit the ob ject size and the communications network

infrastructure dictate an upp er b ound on the numb er of disks over which that ob ject can b e

strip ed b eyond which replication of ob jects is needed to increase the numb er of simultaneous

users as describ ed in eg to the b est of our knowledge in implementations describ ed in

2

striping is p erformed over at most a few tens of homogeneous disks only Note that

delivery of relatively short continuous media ob jects is of use to many imp ortant applications

including digital libraries newsondemand systems and so on

Due to the continuity constraints some form of synchronization in delivery of a single ob ject

from multiple no des must b e considered The need for some form of synchronization arises

from the fact that dierent fractions of an ob ject are b eing delivered from dierent no des

at dierent times during the ob jects display and hence some form of co ordination b etween

these no des and p erhaps the client is required in order to present a coherent display of

the ob ject

As the user demand and data sizes grow and hence the system requires more storage and disk

bandwidth capacities the resulting expansion of the disk subsystem results in restriping of

al l the ob jects

2

In fact we are not aware of in the current literature any largescale implementation that utilizes heterogeneous

disks

Due to the need for communication of data b etween the no des over which an ob ject is strip ed

the capacity of the communication network limits the p erformance of the distributed CM

server This limitation directly aects the scalability of the CM server and is one of the main

issues we investigate in this work

Another approach to dealing with the load imbalance problem arising from skews in data access

patterns is replication ie creating a sucient numb er of copies of a p opular ob ject so as to

meet the demand for that ob ject Sp ecically we consider a hybrid approach where instead of

striping each ob jects across all the no des of the system we constrain the striping to a single no de

and replicate p opular ob jects on several no des in order to provide sucient bandwidth capacity to

service the demand for these ob jects

Of course a disadvantage of this approach is a need for additional storage space Furthermore

techniques are needed for adjusting the numb er of replicas as the access patterns change Some of

these issues are addressed in in the context of workloads with relatively infrequent changes in

ob ject access patterns as well as in our previous work where we prop ose dynamic replication

techniques in the context of more frequent changes in data access patterns In this pap er we

improve on our previous work on dynamic replication techniques as describ ed in Section

However the main fo cus of this pap er is on the tradeos b etween striping and replication

which are as follows In a smallscale CM server where all disks are assumed to b e connected to a

single no de data striping can provide b etter p erformance characteristics than replication b ecause

of its ability to deal with load imbalance problems without the need for additional storage space

and without signicant networking constraints However in a largescale CM server data striping

results in a need for signicant communication network capacities which can lead to p o or scalabil

ity characteristics and high costs Essentially striping is a go o d approach to load balancing while

replication is a go o d approach to isolating no des from b eing dep endent on other nonlo cal

system resources That is the wider we strip e in a distributed CM system the more we are dep en

dent on the availability of network capacity Furthermore replication has the b enet of increased

reliability in terms of a longer mean time to loss of data from the disk subsystem see Section

and b dealing with lack of network resources see Section including network partitioning The

downside of replication is that it increases storage space requirements and hence cost However

as storage costs decrease fairly rapidly and the need for scalability grows replication b ecomes a

more attractive technique

In summary the appropriate compromise b etween the degree of striping and the degree of

replication is key to the design of a scalable distributed CM server This is the topic of our pap er

Related Work

Recently much research has b een done on design of continuous media and sp ecically videoon

demand storage servers eg as in to a name a few Much of this work falls

3

into several broad categories which include smallscale servers where in most cases all disks

are connected to a single no de mediumscale LANbased servers and mediumscale either

distributed or not servers which employ high sp eed interconnects such as ATMbased technology

To the b est of our knowledge most of these designs employ wide data striping techniques and the

corresp onding existing successful implementations employ only tens of disks In contrast the use

of replication for the purp ose of addressing workload demand problems has b een less explored In

the authors consider skews in data access patterns but in the context of a static environment

In the authors address various questions arising in the context of load imbalance problems due

to skews in data access patterns but in a less dynamic environment than we investigate here We

b elieve that the p olicies used in this pap er can b e complementary to the techniques develop ed in

In the authors also consider dynamic replication as an approach to load imbalance and

in our previous work we study a taxonomy of dynamic replication schemes However all of

these works except our work in either a assume some know ledge of frequencies of data access

to various ob jects in the system andor b do not provide users with ful l use of VCR functionality

andor c consider less dynamic environments than the one considered here Our motivation in

doing away with such assumptions in our work is largely due to considerations of applicability of

dynamic replication techniques in more general settings and to a wider range of applications of

continuous media servers

Lastly to the b est of our knowledge previous works do not consider alternative design character

istics that aect the scalability of CM servers in an endtoend setting ie taking into consideration

b oth the network and the storage resource constraints The quantitative study of such issues and

the costp erformance and reliability characteristics that distributed designs exhibit under growth

reconguration degradation of resources and changes in workloads are essential to assessing the

scalability of prop osed architectures and to the development of largescale CM servers in general

3

We divide these into broad categories as it would b e nearly imp ossible to list all pap ers on CM server designs

Our Contributions

The main contributions of this work are as follows

Quantitative evaluation of performance and resource demand characteristics of data striping

vs data replication techniques in largescale distributed CM servers Such evaluation is crucial

to achieving a scalable design of continuous media servers

Improved dynamic replication techniques for distributed hybrid CM servers needed to achieve

b etter p erformance by adjusting the numb er of replicas in the system based on changed in

data access patterns and user demand

Quantitative evaluation of reliabili ty characteristics of data striping vs data replicationbased

systems

Illustration of ease of designing heterogeneous hybrid CM systems without loss in p erformance

characteristics

Based on this endtoend costp erformance and reliability study we argue that hybrid designs result

in large scalable continuous media systems

Hybrid CM System Architectures

A hybrid system architecture is illustrated in Figure It consists of a set of no des connected by a

high sp eed switch which we term a global switch The global switch is a high bandwidth resource

which can for instance corresp ond to a high capacity WAN or an ATMtyp e infrastructure Each

no de i contains one or more pro cessing units PUs and one local switch which is used to connect

all lo cal PUs as well as lo cal clients Each client connects to the nearest lo cal switch dep ending

on their geographical lo cation which is also connected to some no de i Requests from this client

which are serviced by a PU from no de i are termed lo cal clients or lo cal requests When a

request from a client cannot b e serviced by its lo cal no de i the request is forwarded to a remote

no de j which contains a replica of the ob ject b eing requested We term this request a global

client or a global request as it requires some capacity of the global switch in order to receive Ci=client i C1 C2 C3 Cj

processing unit

Node i Local Switch d1 (node i has k processing units)

d2

......

PU1 PU2 PUk dn

global PU (processing unit) Switch

Node 1 Node 2 ...... Node i

global Switch

Node Node Node ...... i+1 i+2 m

Hybrid VOD System Architecture

Figure Multimedia System

service That is when a remote no de services a client the continuous media data is delivered from

the remote no de through the global switch to the lo cal no de and subsequently to the client

Each PU has one or more CPUs memory and an IO subsystem eg a clusterarray of disks

Furthermore each PU of each no de is also connected to the global switch Each no de x S where

S is the set of no des in the system has a nite storage capacity D in units of CM objects as

x

well as a nite service capacity B in units of CM access streams For instance consider a server

x

that supp orts delivery of MPEG video streams where each stream has a bandwidth requirement

of Mbitss and each corresp onding video le is mins long If each no de in such a server has

MBytess of bandwidth capacity and GB of storage space then each such no de can supp ort

B simultaneous MPEG video streams and store D MPEG video ob jects Likewise

x x

we measure the global and lo cal switch capacities in units of access streams In general dierent

no des in such a hybrid system may dier in their storage IO bandwidth and networking ie

lo cal switch service capacity This exibility of the hybrid architecture should result in a scalable

system which can grow on a no de by no de basis

Each CM ob ject resides on one or more no des of the system dep ending on its current p opularity

An ob ject is strip ed on the intrano de basis but not on the interno de basis That is an ob ject is

strip ed only across lo cal disks which b elong to the same no de Ob jects that require more than

a single no des service capacity to supp ort the corresp onding demand are replicated on multiple

no des The numb er of replicas needed to supp ort requests for a continuous ob ject is a function of

demand and therefore this numb er should change when the demand for that ob ject changes Let

R t denote the set of no des containing replicas of ob ject i at time t Thus R t varies with time

i i

as the p opularity of ob ject i changes The precise details of how R t changes over time are given

i

in Section

Up on a customers arrival at time t there is a probability p t that the corresp onding request

i

i

is for ob ject i and a probability q t that this request is generated by a client lo cal to no de j The

j

admission of this customer into the system pro ceeds as follows If at time t ob ject i resides on no de

j and there is service capacity available at no de j then the system admits and serves this new

request at no de j ie lo cally Let L t b e the load on no de x at time t If at time t ob ject i do es

x

not reside on no de j or there is no service capacity available at no de j then the system examines

the load information on each no de in R t and if there is sucient capacity on at least one of

i

these no de and in the interconnection network ie the global switch to service the newly arrived

request the system assigns this request to the leastloaded no de in R t Otherwise the customer

i

is rejected

Note that in the hybrid system we need to maintain load information on remote no des as well

as other b o okkeeping information including recomputation of replicationderepli cation thresholds

as describ ed in Section This will require some communication capacity although this required

capacity is signicantly smaller than the capacity needed for transmitting CM ob ject data from

a remote no de to a lo cal no de Furthermore communication of such b o okkeeping information is

needed in the wide striping case as well since co ordination b etween no des is needed for scheduling

of each request in the system Since the exact amount of b o okkeeping information dep ends on a

particular implementation we do not consider this any further here However we would like to

p oint out that in the case of the wide data striping architecture the b o okkeeping information must

b e exchanged b etween no des to schedule a newly arrived request On the other hand in the hybrid

architectures we can tradeo the frequency of collecting such information or the uptodateness

of the information for p erformance ie we can tradeo relying more on lo cal information rather

than remote information for some loss in p erformance

To assess the scalability characteristics of the p otential designs in an environment where data

access patterns change over time we consider the following costp erformance and reliability metrics

the systems acceptance rate which is dened as the p ercentage of all arriving customer

requests that are accepted by the system with zero waiting time

the capacity in units of access streams of the global switch required to supp ort a particular

architecture and corresp onding acceptance rate

the capacities in units of access streams as well as the numb er of lo cal switches required to

supp ort a particular architecture and corresp onding acceptance rate

the amount of disk storage in units of continuous media ob jects required to supp ort a

particular architecture and corresp onding acceptance rate

the mean time to failure MTTF of a particular architecture

Note that in the p erformance evaluation done in this work we do not consider queueing of cus

tomers that can not b e admitted immediately since that would entail consideration of scheduling

p olicies for requests in the queue The appropriateness of various queueing discipli nes and the

customers willingness to wait for service are in general largely a function of the particular ap

plication supp orted by the CM server Thus although in practice a CM server can have some

nite queueing capacity here we do not consider queueing of customer requests that can not b e

admitted at request time since the queueing disciplin es appropriate for these requests are largely a

function of sp ecic applications using the CM server and we do not limit this study to a particular

application

Table summarizes the main notation used in this pap er We will dene this notation through

out the pap er as it is needed

S set of all no des in the system

N numb er of no des in the system N jS j

N total numb er of disks in b oth wide data striping and hybrid systems

d

K numb er of distinct ob jects in the system

B maximum service capacity of no de x in streams

x

B average service capacity of no des in the system in streams

D average storage capacity of no des in the hybrid system

w

average storage space capacity of no des in the wide data striping system D

D total storage capacity on no de x in units of ob jects in the hybrid system

x

D t available storage space on no de x at time t in the hybrid system

x

w

D total storage capacity on no de x in the wide data striping system

x

w

D total storage space in the wide data striping system

L t load on no de x at time t in streams

x

P

B L t A t available service capacity for ob ject i at time t A x

x x i i

x2R (t)

i

ReT h replication threshold ie the threshold for adding another copy of ob ject i

i

D eT h dereplication threshold ie the threshold for removing a copy of ob ject i

i

H dierence b etween the replication and the dereplication thresholds ie H ReT h D eT h

i i i i

i

T length of ob ject i in units of normal playback time

leng th

i

T early acceptance time for ob ject i in units of normal playback time

ea

average arrival rate to the system

at the latest access time for ob ject i in the system

i

R t set of no des containing a replica of ob ject i at time t

i

p t probability of an arriving request b eing for ob ject i at time t

i

i

q t probability of an arriving request for ob ject i b eing for no de j at time t

j

D storage space threshold for activating the dereplication pro cess

S

Table Summary of notation

Dynamic Replication

In a hybrid CM server we use a dynamic approach to reacting to changes in user data access

patterns Since the numb er of copies of ob ject i partly determines the amount of resources available

for servicing requests for that ob ject we adjust the numb er of replicas maintained by the system

dynamical ly Of course the systems p erformance dep ends on its ability to make such adjustments

rapid ly which can require a nonnegligible amount of resources and accurately Thus when

adjusting the numb er of replicas in the system we essentially have conicting goals of a using as

few resources as p ossible to p erformance the replication in order not to interfere with normal

system op eration while b trying to complete the replication pro cess as so on as p ossible

In an attempt to reach a compromise b etween these conicting goals we use early acceptance

of customers as prop osed in our previous work where admitted customers are allowed to use

incomplete replicas while the replication pro cess continues That is once the system completes

i 4

replication of the rst T time units of a new replica of a CM ob ject i it will treat it as a

ea

virtually complete copy and b egin using it in serving customer requests for ob ject i

The issue that we need to consider is that a user might attempt to access a p ortion of an

incomplete copy which has not b een replicated yet eg by fastforwarding past the replication

p oint To allow customers to have full use of VCR functionality when viewing CM ob jects we need

i i i

to determine a safe value for T Clearly one safe value is T T full length of the CM

ea ea

leng th

i

ob ject However the intuition is that a smaller value of T should result in a higher at least in

ea

the short term acceptance rate of customer requests

i

In order to lower T and improve system p erformance in we employ a sto chastic mo del

ea

of user b ehavior at the cost of lowering the probability that a user will not access data b eyond the

replication p oint of course this probability still has to b e high but less than Sp ecially we mo del

the combination of the b ehavior of a user watching a display of a partially replicated ob ject

including hisher p ossible use of VCRtyp e functionality where that user is allowed to b egin the

i

display after the rst T time units of the ob ject have b een replicated and the corresp onding

ea

replication pro cess using a Discrete Time Markov Chain DTMC We then p erform transient

analysis on this Markov chain to compute the probability that the user attempts to access a

part of the ob ject that has not b een replicated yet If this probability results in a reasonable quality

i

of service QoS then the corresp onding value of T is acceptable Otherwise we recompute with

ea

a new value Although the general approach is applicable to a variety of multimedia applications

the particular values of parameters for the DTMC mo del as well as what constitutes acceptable

QoS dep end on the applications using the CM server

i

Results given in indicate that relatively small values of T result in go o d QoS For instance

ea

i

in we show that for a system with minute continuous media ob jects and T minutes

ea

i the corresp onding probability as computed from the DTMC mo del that a user will attempt

to access an unnished p ortion of a replica is b elow Simulation results of the same system

indicate that this probability is nearly This is due to the conservative nature of our mo del

which is a go o d characteristic when QoS is of imp ortance

In we also show that this mo del is not very sensitive to the accuracy of its parameters

eg such as the exact probability of a user requesting VCRtyp e functionality and thus is of

4

For ease of presentation we measure the amount of replication completed in time units of normal playback time

of that ob ject from the b eginning of the ob ject rather than eg in bytes

reasonably practical use Based on these results in this work we use the following p olicies as

describ ed b elow for a replication ie the pro cess of adding another copy of an ob ject to our

system b dereplication ie the pro cess of removing a copy of an ob ject from our system and

c triggering of addition or removal of a copy of an ob ject ie determining when a copy should

b e added or removed

Replication p olicy

We use the SREA sequential replication early acceptance p olicy where the replication is p er

formed sequentially That is a single stream is injected into b oth the source no de from which

the replication is p erformed and the target no de to which the replication is p erformed Then

i

given the value of T which is determined through the use of the mathematical mo del of user

ea

b ehavior as describ ed ab ove newly arrived users are admitted to the new incomplete replica as

i

so on as T time units of that ob ject have b een replicated on the target no de

ea

Dereplication p olicy

We use the DM delay migration p olicy which removes a replica of ob ject i only when there is no

customer currently viewing ob ject i This is motivated by the p ossible implementation complexity

of migrating customers from one no de where the removal o ccurs to another no de which contains

a copy of the ob ject b eing removed

Replicationderepli cat ion triggering

We use a thresholdbased approach to triggering continuous media ob ject replication and derepli

cation in order to react to an environment where data access patterns change over time That

is when a request for ob ject i arrives the system checks if the available service capacity for that

ob ject is ab ove the replication threshold If not the system triggers a replication pro cess for ob ject

i using the SREA p olicy describ ed ab ove On the other hand if at the time of nishing service

of a request for ob ject i the total available storage space capacity has fallen b elow the space

threshold and the excess remaining capacity for serving requests for that ob ject go es ab ove

the dereplication threshold the system triggers the dereplication pro cess using the DM p olicy

describ ed ab ove The computation of all threshold values is describ ed b elow

Thresholdbased techniques for reacting to changes in workload are employed often for improving

the costp erformance characteristics of systems eg in communication proto cols Here as in

other systems the main motivation for using a thresholdbased scheme is that there is a non

negligible cost for creating or removing a replica That is it takes a nonnegligibl e amount of time

and resources to replicate an ob ject or remove a copy and thus it should b e done sparingly

Now we state more formally how the hybrid CM server triggers replication and dereplication

pro cesses When a customer request for ob ject i arrives to the system at time t replication of

ob ject i is initiated if and only if all of the following criteria are satised

A t ReT h where ReT h is the replication threshold and A t is the available service

i i i i

P

B L t capacity for ob ject i at time t ie A x

x x i

x2R (t)

i

Ob ject i in not currently under replication

There is sucient available service capacity on the source no de

There is sucient available storage space capacity and sucient available service capacity

on the target no de

There is sucient available service capacity in the global switch ie interconnection net

work

Once the replication is triggered we must select a source and a target no de for the replication

pro cess The choice of a source no de for replication of ob ject i is simple we select the leastloaded

no de in the set R t For the target no de we cho ose the no de which has the highest estimated

i

residual capacity and has available storage capacity More formally we cho ose the no de x such that

B L t

y y

x R t L t min

i x

t

y 2(S R (t))

i

y

and the remaining storage capacity on x is sucient for the new replica where t corresp onds

y

to the numb er of replication pro cesses already in progress on no de y at time t Intuitively such

a choice should avoid replication of multiple relatively p opular ob jects on the same target no de

which may later comp ete for that no des capacity

As already stated dereplication is invoked at the customer departure instances when an ob ject

is determined to have excess service capacity and when the total available storage space capacity in

the entire system falls b elow D which is measured in units of CM ob jects Since the replication

S

and in general the dereplication pro cesses are not instantaneous proactive dereplication is

needed to reduce the probability of the system going into a bad state ie a state where we have

to reject requests over a long p erio d of time More formally a replica of ob ject i at no de x will b e

removed at time t if and only if the following conditions are satised

A t max fA t ReT h g The motivation for this condition is that the numb er of

i j 2S j i

replicas for ob ject i at time t is more than its current workload demand and at this time it

has the greatest excess of replicas among all relatively cold ob jects

i has crossed the dereplication threshold ie

A t B L t C t D eT h

i x x ix i

where C t denotes the numb er of customers viewing ob ject i at no de x at time t With

ix

the deletion of ob ject i at no de x A t would b e decreased by B L t In general ie

i x x

for a general dereplication p olicy since a customer viewing ob ject i at no de x will have to

b e migrated to other replica no des in R t A t would b e further decreased by C t In

i i ix

the case of the dereplication p olicy we use in this pap er ie the Delayed Migration DM

dereplication p olicy there is an additional constraint namely that C t must b e equal to

ix

P

D t D

x S

x2S

Lastly to prevent the system from oscillating b etween replication and dereplication a dierence

of H is intro duced b etween ReT h and D eT h ie D eT h ReT h H That is we intro duce

i i i i i i

hysteresis into the system

One of the more imp ortant issues in designing thresholdbased resource activation schemes is

the choice of replication and dereplication threshold values In the following section we give the

details of how we use dynamic threshold value adjustment to improve the systems p erformance

Note that we do this without collecting and maintaining statistics of user data access patterns

Dynamic Threshold Adjustment

Intuitively we would like the amount of service capacity available to each ob ject i to b e prop ortional

to its demand which is changing with time Thus we could attempt to maintain a numb er of copies

of each ob ject prop ortional to p t However in practice p t is unknown and varies over time

i i

Although we could try to collect statistics on access demands for the various ob jects many questions

would remain op en over which p erio d to collect the statistics when to make the decision that the

probabilities have changed suciently to reect this change in the systems conguration likely

we do not want to do this continuously how much condence to have in the collected statistics

and thus how aggressively or cautiously to evolve the system from an old state ie with old

access probabilities to a new state ie with new access probabilities

Furthermore in such an environment having the amount of service capacity prop ortional to

the access probabilities even if we knew them would not necessarily insure acceptance of newly

arrived customers An imp ortant factor in the p erformance of the system is the mixture of requests

that arrives and is ultimately serviced by the no des of the CM server That is we may reject

requests for ob ject i on no de j due to an inux of requests for other ob jects residing on no de j

ie other than ob ject i

Thus in our work we use dynamic data replication techniques which do not assume knowledge

of access probabilities More sp ecically we use the last interarrival time for ob ject i to very

coarsely approximate p t and compute threshold values as follows

i

For each ob ject i we keep its last request access time at At the time of arrival t of a request

i

for ob ject i we compute the latest interarrival time ie t at for that ob ject We use

i

this latest interarrival time as a very coarse approximation of p opularity of ob ject i at time

t Whenever a new request for ob ject i arrives we up date the replication and dereplication

5

threshold values for all ob jects in the system accordingly as describ ed next as well as record

the last access time at at time t

i

i



T

D

ea

e where D is the average Then the replication threshold for ob ject i is ReT h d

i

w



tat

D

i

w

storage capacity of no des in a hybrid system and D is the average storage capacity of no des

in the wide data striping system That is the replication threshold ReT h represents the

i

amount of workload corresp onding to requests for ob ject i that we exp ect to receive in the

i 6

next T time units which is the amount of time needed to create a new virtual replica of

ea

ob ject i should we deem it necessary Thus the motivation for this setting of the replication

i

T

ea

threshold value is that corresp onds to the exp ected numb er of requests for ob ject i that

tat

i

i

can arrive in the next T time units Note that dep ending on the amount of excess storage

ea

space ie in excess of minimum needed to hold at least one copy of each ob ject we can

aord to b e more or less aggressive ab out triggering replication of ob jects Thus we also

i



T

D

ea

normalize by the factor of excess storage capacity at the target no de x ie by

w



tat

D

i

5

We discussed the issue of overheads related to recomputation of thresholds in Section

6

Recall that we only need to replicate a fraction of the ob ject b efore allowing users to view it as describ ed in

Section

That is the more excess storage capacity we have the more aggressive we can aord to b e

in triggering the replication pro cess

The constraint relating the replication and the dereplication threshold values is as follows

D eT h ReT h H where H represents the intro duction of hysteresis into the system

i i i i

i

T



leng th

D

The hysteresis for ob ject i is computed as follows H d e The motivation for

i

w



tat

D

i

this setting of the hysteresis value is similar to the motivation given ab ove for ReT h except

i

i

T

leng th

corresp onds to the exp ected numb er of requests for ob ject i that can arrive in the that

tat

i

i

next T time units ie during the entire duration in normal playback of the display of

leng th

i

ob ject i That is T corresp onds to an estimate of the amount of time that will elapse

leng th

b efore some of the currently allo cated resource which can b e used to service requests for

ob ject i are released

PerformanceScalabilityReliability Evaluation

In this section we present results of our study of scalable designs for continuous media servers

using the following costp erformance and reliability metrics as given in Section

the systems acceptance rate which is dened as the p ercentage of all arriving customer

requests that are accepted by the system with zero waiting time

the capacity in units of access streams of the global switch required to supp ort a particular

architecture and corresp onding acceptance rate

the capacities in units of access streams as well as the numb er of lo cal switches required to

supp ort a particular architecture and corresp onding acceptance rate

the amount of disk storage in units of continuous media ob jects required to supp ort a

particular architecture and corresp onding acceptance rate

the mean time to failure MTTF of a particular architecture

This study is p erformed via simulation with the following simulation parameters The arrival



aBN

where a pro cess of requests for ob jects is Poisson with a mean arrival rate of

i

T

leng th

is the relative arrival rate For ease of presentation in the remainder of this section we discuss

the results in terms of the relative arrival rate a ie relative to the total service capacity of the

system eg a corresp onds to the maximum service capacity of the system

Parameter Default Value

Arrival rate a

i



T

D

ea

e ReT h d

i

w



tat

D

i

i

T



D leng th

H d e

i

w



tat

D

i

D eT h ReT h H

i i i

i

T mins

leng th

K

System capacity streams

Replication p olicy SREA

Dereplication p olicy DM

Access Probability change gradual and abrupt

Skewness distribution Zipf

i

q t uniformly distributed b etween and N for each ob ject i t

j

Architecture arch arch group arch group arch group

arch group for details see Table

Table Parameters for our simulation study

In the results presented in this section we consider the design of a CM server with the following

capacity requirements also given in Table a total service capacity of N B streams

i

a total storage capacity of K distinct ob jects with each ob jects length T

leng th

minutes Several of the entries in Table require a few words of clarication which are as follows

Since the main motivation for using dynamic replication p olicies is the need to react to changes

in data access patterns we consider the p erformance of the system as a function of such changes

That is the workload will have the characteristic that every rotation time p erio d of X mins

p ts change The change in access probabilities is describ ed by Equation which is intended

i

7

to emulate a relatively gradual increasedecrease in p opularities

p t if i is o dd and i K

i+2

p t if i is o dd and i K

K

0

p t

i

p t if i is even and i K

i2

p t if i is even and i

1

0

where t and t refer to two consecutive rotation time p erio ds and for ease of presentation we

assume that K is even To test our p olicies further we also consider a more abrupt change in

7

This is to illustrate that even under a relatively gradual change dynamic p olicies are still useful Furthermore

we b elieve this is a reasonable emulation of changes in access patterns for many CM applicati ons

access probabilities as describ ed by Equation which is used to emulate an abrupt increase

in p opularity of currently unp opular ob jects as well as a gradual decrease in p opularity of the

currently more p opular ob jects

p t if i K

1

0

p t

i

p t if i K

i+1

Furthermore at any xed value of t we use the Zipf distributions to describ e the skewness

c

i K and of the access probabilities where Probrequest for ob ject i

(1 )

i

P

(1 )

K

1 1

where c In our exp eriments we set which corresp onds and M

(1 ) (1 )

j =1

K

j

M

K

to the measurements p erformed in for a moviesondemand application Lastly we use the

following interactivity settings NPFFRWPAUSE this is the ratio b etween the

mean amount of time sp ent in various user playback mo des where NP refers to normal playback

FF refers to fastforward with preview RW refers to reward with preview and PAUSE refers

to pause Based to these setting and our mathematical mo del of user b ehavior as summarized in

i i

Section we obtain T mins given that T mins

ea

leng th

The architectural settings considered in this study are summarized in Table where the p ossible

architectures dier in the numb er of no des service capacity of each no de and available storage

8

capacity of each no de Recall that we need to meet the overall capacity requirements given at

the b eginning of this section Architecture corresp onds to wide data striping Architecture

groups corresp ond to various congurations of a hybrid CM server where the groups dier

in the size of the no de capacities and subsequently the total numb er of no des For each hybrid

CM group architecture b elow we exp eriment with dierent amounts of p er no de storage space

capacity in order to illustrate the tradeo b etween storage space capacity lo cal to a no de and the

corresp onding required capacity of the global switch

Moreover in the exp eriments that follow we consider the aect on the overall system p erfor

mance of limitations of communication network resources Let nc represent the ratio of the global

switch or interconnection network capacity to storage system service capacity ie nc rep

resents equal service capacities in the storage and communication subsystems Then we vary the

service capacity of the global switch ie nc and compute the subsequent degredation in

p erformance exp erienced by the various architectures The motivation for doing these exp eriments

8

Note that for a hybrid architecture that requires more storage space than the corresp onding wide data striping

architecture we only increase the storage space p er disk of that architecture ie not the numb er of disks as that

would also increase the service capacity of the system and hence would not make for a fair comparison

Arch typ e No of no des Serv capacityno de Stor spaceno de

Lo cal switch capacity

in streams in ob jects

arch

arch group or or or or

arch group or or or or

arch group or or or or

arch group or or or or

Table Parameters for dierent architecture groups

is two fold to observe the p erformance degredation characteristics of the p ossible CM server

designs as this is an indication of their scalability characteristics and to assess whether reduc

tions in overall required global switch capacity are p ossible without signicant loss in the overall

system p erformance ie whether we can op erate the system with a smaller network which should

lead to lower a cost system

Lastly in the exp eriments that follow upp er b ound on the acceptance rate refers to the

acceptance rate that a wide data striping system can achieve without considering network capacity

constraints thus this is the only curve in the following gures that is not a function of nc

Wide Data Striping System vs Hybrid System

acceptance rate 1.00

0.80

0.60

0.40 space 24 per node upper bound space 26 per node 0.20 data stripping space 28 per node space 22 per node space 30 per node 0.00 0.20 0.40 0.60 0.80 1.00

network constraint

Figure Architecture group under dierent network constraints acceptance rate 1.00

0.80

0.60

0.40 upper bound space 52 per node data stripping 0.20 space 56 per node space 44 per node space 60 per node space 48 per node 0.00 0.20 0.40 0.60 0.80 1.00

network constraint

Figure Architecture group under dierent network constraints

Figures through and Figure b illustrate that a hybrid system has b etter overall p erformance

as well as p erformance degradation characteristics than the wide data striping system under lower

network capacities ie with nc More imp ortantly the hybrid architecture allows us to

tradeo capacities of the various system resources in order to achieve a more costeective system

overall Sp ecically we can tradeo lo cal storage space capacity and lo cal switch capacity with

global switch capacity and achieve nearly the same p erformance characteristics For instance for a

particular architectural setting ie with a xed numb er of no des and corresp onding no de service

and lo cal switch capacities the larger the lo cal storage space capacity is the smaller the global

switch capacity need b e in order to achieve the same overall system p erformance As an example

consider the arch group in Figure in the case of the architecture with a storage space

9

capacity of ob jects p er no de the corresp onding required service capacity of the global switch

is streams whereas in the case of the architecture with a storage space capacity of ob jects

p er no de the corresp onding required service capacity of the global switch is only streams

Conversely the larger the lo cal switch is the more we can reduce the storage space and global

switch capacities As an example consider the arch group in Figure in this case with a

9

The needed global switch capacity is determined from Figure by rst xing the acceptance rate that

we would like to achieve In this example we x the required acceptance rate to b e at least 

acceptance rate of upp er b ound result Given the acceptance rate we can determine using Figure the smallest

network constraint that satises that acceptance rate for the appropriate architecture curve Let that constraint b e

c then the required global switch capacity is c  streams where streams corresp onds to maximum required

system capacity in our exp eriments as describ ed b efore acceptance rate 1.00

0.80

0.60

0.40 space 92 per node upper bound space 96 per node 0.20 data stripping space 100 per node space 88 per node space 104 per node 0.00 0.20 0.40 0.60 0.80 1.00

network constraint

Figure Architecture group under dierent network constraints

10

lo cal switch capacity of streams the corresp onding required total storage space capacity is

ob jects ie ob jects p er no de and the corresp onding required service capacity of the global

11

switch is streams Consider now the arch group in Figure although in this case

the lo cal switch capacity increases to streams the corresp onding required service capacity of

the global switch drops down to streams and the corresp onding required total storage space

capacity drops down to ob jects ie ob jects p er no de These results are due to the fact

that with larger lo cal switch capacities we can service more customer requests lo cally hence the

reason for the corresp onding smaller required global switch capacity Furthermore larger lo cal

switches and corresp onding larger no de service capacities also provide more opp ortunities to take

advantage of the load balancing characteristics data striping within a no de hence the reason for

the smaller required storage space capacity p er no de

System Sizing Issues

Quantitatively evaluating the tradeos b etween lo cal switch capacity no de storage space capacity

and global switch capacity is no easy task as it is not immediately clear how to tradeo one

resource for another Ideally one would like to evaluate these tradeos based on cost ie one

would like to size the system resources so as to achieve the b est p erformance p ossible for the lowest

10

These values can b e determined from Tables and

11

The determination of the required global switch capacity is done as in the previous example acceptance rate 1.00

0.80

0.60

0.40 space 210 per node upper bound space 215 per node 0.20 data stripping space 220 per node space 205 per node space 225 per node 0.00 0.20 0.40 0.60 0.80 1.00

network constraint

Figure Architecture group under dierent network constraints

cost p ossible However cost considerations are a complex issue given that costs dep end on many

factors including the particular technology used for the various comp onents of the system Thus

next we instead evaluate the dierent hybrid designs based on the amountcapacity of each resource

they require relative to the wide data striping system Such an evaluation quantitatively illustrates

to the designer the relative merits of the dierent architectures without the need for picking a

12

sp ecic technology to use for each part of the system The purp ose of these exp eriments is to

illustrate how a CM server designer can deal with these fairly complex system sizing issues

In order to investigate system sizing issues we further rene our set of architectural settings

as describ ed in Table As b efore these settings dier in the numb er of no des p er no de ser

vice capacity p er no de storage space capacity lo cal switch capacity and global switch capac

ity We cho ose the p er no de storage space capacity and the corresp onding global switch ca

pacity of each architectural setting based on the results obtained in the previous section More

sp ecically we cho ose those architectures that can achieve an acceptance rate of at least

acceptance rate of the upp er b ound result with reasonably small p er no de storage space and

13

global switch capacities Recall that the upp er b ound is computed without considering any

12

Characterizing a resource using only its capacity may result in a simplicatio n for certain typ es of resources

however this is still a go o d abstraction for evaluating costeectiveness of designs without having to cho ose a

sp ecic technology for each system comp onent

13

If costs of sp ecic system comp onents are known then one would cho ose to investigate those architectural settings

with lowest costs Since we do not wish to make technologycostrela ted assumptions we use a heuristic ie we

cho ose architectures that have reasonably small storage and global switch capacities acceptance rate acceptance rate 1.00 1.00

0.80 0.80

0.60 0.60 data striping arch2 0.40 0.40 space 24 per node arch3 upper bound space 26 per node arch4 0.20 0.20 data stripping space 28 per node arch5 space 22 per node space 30 per node 0.00 0.00 0.20 0.40 0.60 0.80 1.00 1.20 0.20 0.40 0.60 0.80 1.00 rotatio time period x 103 network constraint

(a) arch2, arch3, arch4 and arch5 (b) arch 2 group

Figure Abrupt increase and gradual decrease in access probabilities

network capacity constraints Since the upp er b ound results are not always achievable by archi

tectures we are studying here we cho ose a p erformance goal ie acceptance rate that is reasonably

close to the upp er b ound result

Arch typ e No of no des Serv capacityno de Stor spaceno de Global switch capacity

Lo cal switch capacity

in streams in ob jects in streams

arch

arch

arch

arch

arch

arch

arch

Table Parameters for dierent testing architectures

Figure depicts the comparison in resource requirements b etween the various hybrid architec

tures and the wide data striping architecture We rep ort each resource requirement as the ratio

b etween the resource requirement of a particular hybrid architecture and the wide data striping

architecture arch in in Table ie each graph in Figure represents

resource requirement of arch i

i

resource requirement of arch

Hence the straight line at the value of in each of the graphs of Figure corresp onds to the

scaled resource requirement of the wide data striping architecture ie arch in Table

As already stated these results illustrate to the designer the relative merits of the dierent

architectures by quantifying the tradeos b etween the various resources of the CM server ie lo cal

no de storage space capacity lo cal no de service capacity and lo cal switch capacity as well as global

switch capacity Based on these results and current costs the designer of a largescale CM server

can make system sizing decisions in conjunction with decisions of choice of technologies to use for

each of the systems comp onents

global switch size ratio local switch size ratio 1.00 10.00

0.80 8.00

0.60 6.00

0.40 4.00

0.20 2.00

0.00 0.00 arch 2 arch 2.1 arch 3 arch 4 arch 5 arch 5.1 arch 2 arch 2.1 arch 3 arch 4 arch 5 arch 5.1 arch type arch type total storage space ratio number of local switches ratio

1.40 1.00

1.20 0.75 1.00

0.80 0.5 0.60

0.40 0.25 0.20

0.00 0.00 arch 2 arch 2.1 arch 3 arch 4 arch 5 arch 5.1 arch 2arch 2.1 arch 3 arch 4 arch 5 arch 5.1 arch type

arch type

Figure System sizing

Heterogeneous Systems

Next we illustrate the ease of dealing with heterogeneous systems when using hybrid CM server

designs without loss of p erformance as compared to an equivalent ie same overall capacity

homogeneous case For this purp ose we consider a hybrid CM architecture with no des and a

total service capacity of streams We use two test cases in the following exp eriments b oth

based on the homogeneous version of arch with the storage space capacity of ob jects

p er no de refer to Table Sp ecically we intro duce and dierences in storage space

and service capacities b etween the no des of the system as well as corresp onding dierences in

lo cal switch capacities eg to emulate a system that gradually grows as well as exp eriences

replacements due to failures and thus is forced to use heterogeneous resources Hence we have

one test case of a no de system with each no de having storage space capacity of

and ob jects resp ectively and service capacity of and streams

resp ectively And we have another test case of a no de system with each no de having storage

space capacity of and ob jects resp ectively and service capacity of

and streams resp ectively

The rusults depicted in Figure show that using a hybrid CM server design we can achieve

heterogeneous system p erformance that is comparable to homogeneous system p erformance We do

not show results for heterogeneous wide data striping systems for the following reasons Although

some research on adaptation of wide data striping over heterogeneous disks do es exist eg as in

to the b est of our knowledge such schemes either give up some amount of disk storage space

or some amount of disk bandwidth capacity in order to achieve this adapatation and consequently

usually at b est p erformance as well as the comparable homogeneous counterparts Thus in the

case of heterogeneous systems our evaluation of the go o dness of hybrid designs as compared to

wide data striping is conservative

Dynamic Threshold Adjustment

Next we would like to illustrate that our choice of dynamic threshold adjustment p olicy in con

junction with replication and dereplication p olicies results in a hybrid CM server that can react to

changes in data access patterns accurately and rapidly This results in the systems p erformance

that app ears to b e insensitive to the changes in data access patterns ie the system maintains

nearly the same p erformance regardless of how frequently the access patterns change Moreover acceptance rate 1.00

0.80

0.60

homogeneous 0.40 heterogeneous (10%) heterogeneous (5%) 0.20

0.00 0.20 0.40 0.60 0.80 1.00

network constraint

Figure Heterogeneous system

this p erformance is comparable to the p erformance of the wide data striping system which is

naturally insensitive to changes in data access patterns

To illustrate this we depict the systems p erformance ie acceptance rate in Figures and

a as a function of the rotation time p erio d ie the amount of time that passes b efore the

access probabilities of the various ob jects change Recall that the precise denition of changes in

access probabilities for b oth gradual changes as in Figure and the more abrupt changes as in

Figure a are given earlier in this section

Reliability

Lastly we discuss the characteristics of the wide data striping architecture as well

as our hybrid system We use the mean time to failure MTTF as our reliability metric where the

mean time to failure is dened as the mean time until some combination of disk failures results in

loss of some data from the storage subsystem ie losses that can no longer b e recovered through

the use of redundant information such as parity computation or access of replicas which in turn

is due to losses of to o many disks The details of the derivations of MTTF equations used b elow

are given in the App edix

If we employ the use of paritybased redundant information as in disk arrays and assume that acceptance rate 1.00

0.80

0.60 Data Striping arch2

0.40 arch3 arch4 arch5 0.20

0.00 0.20 0.40 0.60 0.80 1.00 1.20

rotation time period x 103

Figure Dynamic threshold adjustment

disks fail indep endently with exp onential failure rates as in then the MTTF of an N disk

d

14

wide data striping system with cluster or sizes of C disks each is approximately

2

MTTF

disk

MTTF

N C MTTR

d disk

where MTTF is the mean time to failure of a single disk and MTTR is the mean time to

disk disk

repair of a single disk Note that the MTTF dep ends on the architectural characteristics of the

disk

disk and the MTTR dep ends on the reconstruction algorithms used eg as in Also note

disk

that the value of C determines the amount of redundant information stored in the system For the

sake of fairness in computing MTTFs we congure all architectures with the same amount of

redundant information

N

d

Likewise the MTTF of a hybrid system with N no des disks p er no de and cluster sizes of

N

C disks each is approximately

2

N MTTF

disk

MTTF

N C MTTR

d disk

Note that the ab ove MTTF for hybrid systems is conservative in the sense that it computed based

on the assumption of only a single copy p er ob ject in the entire system For an ob ject i that has

k copies in a hybrid system the mean time until its data is lost is approximately

2k

N MTTF

k

disk

MTTF k copy ob ject

2k 1

N C

MTTR

d

disk

14

Here cluster size refers to the numb er of data disks plus a parity disk in a disk array ie in a cluster of C disks

there are C data disks and parity disk

Given these equations we now show a MTTF comparison b etween the seven architectures given

in Table using the conservative estimate for the hybrid system of the MTTF equation based on a

single copy p er ob ject only ie using Equations and These results are depicted in Figure

again as a ratio b etween MTTF of a particular hybrid architecture and the wide data striping

architecture ie the results in Figure represent

MTTF of arch i

i

MTTF of arch

Hence the straight line at the value of in Figure corresp onds to the scaled MTTF of the

wide data striping architecture

These results clearly show that we can achieve higher reliability in a hybrid system even for

ob jects that only have a single copy as compared to the wide data striping system The increase in

reliability even for the single copy ob jects is due to the isolation of fault aects ie the wider

we strip e an ob ject the more disk failures can aect the loss of data corresp onding to that ob ject

A quantitative expression of this intuition is given in the derivations of the App endix

Of course the reliability is even higher for ob jects with multiple copies as is natural in a

system which employs data replication An imp ortant p oint though is that in a hybrid system we

can provide signicantly higher reliabili ty for the popular ob jects as there will always b e multiple

replicas of such ob jects in a hybrid system Lastly this brings us to another interesting p oint

Another reliabilityrel ated advantage of hybrid systems is that under network failures or network

partitioning andor high workload conditions at remote no des lo cal no des given that they still

have some capacity remaining can at least deliver some ob jects eg given a moviesondemand

application even if a movie b eing requested by the user is not available due to ab ove sp ecied

conditions the user has the option to cho ose another movie that may b e available This is not

the case for wide data striping architectures as all no des and hence network capacity must b e

available in order to service a request for any ob ject

Conclusions

In this work we studied the scalability of large continuous media endtoend server designs as a

function of their costp erformance and reliability characteristics under various workload and system

constraints We fo cused on data placement issues and compared the scalability characteristics of

hybrid vs wide data striping architectures We have showed that hybrid designs in conjunction MTTF ratio 20.00

15.00

10.00

5.00

0.00 arch type

arch 2 arch 2.1 arch 3 arch 4 arch 5 arch 5.1

Figure MTTF comparison for single copy ob jects

with dynamic replication techniques are less dep endent on interconnection network constraints

provide higher reliability and can b e prop erly sized so as to result in costeective endtoend

systems Thus we b elieve that hybrid designs result in scalable large distributed continuous media

servers

References

Personal communication with miscrosoft research

S Berson S Ghandeharizadeh R R Muntz and X Ju Staggered Striping in Multimedia

Information Systems SIGMOD

MM Buddhikot GM Parulkar and JR Cox Design of a Large Scale Multimedia Stoarge

Server Journal on Computer Networks and ISDN Systems

A L Chervenak Tertiary Storage An Evaluation of New Applications PhD Thesis UC

Berkeley

C F Chou L Golub chik and J CS Lui A p erformance study of dynamic replication tech

niques in continuous media servers Technical Rep ort CSTR University of Maryland

Octob er

A Dan M Kienzle and D Sitaram A Dynamic Policy of Segment Replication for Load

Balancing in VideoonDemand Servers ACM Multimedia Systems

A Dan and D Sitaram An Online Video Placement Policy Based on Bandwidth to Space

Ratio BSR In Proceedings of ACM SIGMOD

Bolosky et al The Tiger Video Fileserver Technical Rep ort MSRTR Michrosoft

Research

R Haskin and F Schmuck The Tiger Shark File System In COMPCON

RL Haskin Tiger Shark A Scalable File System for Multimedia Technical rep ort IBM

Research

PJB King Computer and Communication Systems Performance Modeling PrenticeHall

D E Knuth The Art of Computer Programming Volume AddisonWesley

P W K Lie J CS Lui and L Golub chik ThresholdBased Dynamic Replication in

LargeScale VideoonDemand Systems RIDE February

Richard R Muntz and John CS Lui Performance Analysis of Disk Arrays Under Failure

VLDB Conference pages

David A Patterson Garth Gibson and Randy H Katz A Case for Redundant Arrays of

Inexp ensive Disks RAID ACM SIGMOD Conference pages

S Ross A First Course in Probability PrenticeHall Inc Upp er Saddle River NJ

JR Santos and R Muntz Performance Analysis of the Rio Multimedia Storage System

with Heterogeneous Disk Congurations Proc of The th ACM International Multimedia

Conference Septemb er

W J Stewart Introduction to Numerical Solution of Markov Chains Princeton University

Press

N Venkatasubramanian and S Ramanathan Load Management in Distributed Video

Servers In Proceedings of ICDCS pages Baltimore MD May

M Vernick C Venkatramani and T Chiueh Adventures in Building the Stony Bro ok

Video Server ACM Multimedia pages

J Wolf H Shachnai and P Yu DASD Dancing A Disk Load Balancing Optimization

Scheme for VideoonDemand Computer Systems In ACM SIGMETRICSPerformance

Conf

App endix Derivation of MTTF Equations

In this app endix we give the derivation of the mean time to failure equations used in Section We

will use the following notation in this derivation

MTTF mean time to failure of a disk

disk

MTTR mean time to repair of a disk

disk

MTTF mean time to failure of a cluster or array of disks

cluster

Our goal is to compute the MTTF under the following assumptions each disk has inde

cluster

1

and the total numb er of disks in the p endent and exp onential failure rate equal to

MTTF

disk

cluster is C

We rst compute the mean time to failure of some disk in a cluster of C disks Let X b e the

i

random variable corresp onding to the mean time to failure of disk i with an exp onential failure

rate where i C Let Y minX X X which corresp onds to the mean time until

1 2 C

some disk in a cluster of C disks fails or the mean time b etween failures in a cluster of C disks

Then given that the disk failures are independent we have

ProbY a ProbX a X a X a

1 2 C

ProbX a ProbX a ProbX a

1 2 C

a C C a

e e

1

Since the failure rate of each disk in our case is we have that the mean time until

MTTF

disk

MTTF

disk

some disk fails or the mean time b etween failures in a cluster of C disks is

C

Now a cluster of C disks is considered failed when disks in that cluster have failed Recall

that a disk array is able to continue delivery of data under a single failure once a second failure

in an array of C disks o ccurs some data is lost and we term this failure of the array or cluster

Thus to compute MTTF we need to determine the probability that given that one failure

cluster

has already o ccured in that cluster a second disk failure will o ccur within MTTR time units

disk

Hence we have that

MMTR

disk

MTTF

disk

Proba disk fails within MTTR e

disk

and

Proba disk do es not fail within MTTR

disk

MMTR

disk

MTTF

disk

Proba disk fails within MTTR e

disk

Then given that one of the disks in a cluster of C has already failed we have that

Probat least one of the remaining disks fails within MTTR

disk

Probnone of the remaining disks fail within MTTR

disk

h i

MTTF

MMTR (C 1)

disk

disk

MMTR

disk

C 1

MTTF

disk

e e

MTTF MTTF

disk disk

Given a well designed system we can assume that MTTR where is the

disk

C C

x

mean time to failure of some disk in a cluster of C It is also well known that e x is a

reasonable approximation when x Then

MMTR

disk

C Probat least one of the remaining disks fails within MTTR

disk

MTTF

disk

Now we are ready to compute MTTF Given that one disk in an array of C disks has

cluster

failed we can next observe an event which has one of two p ossible outcomes

a second disk fails b efore the rst is repaired and thus the entire arraycluster fails

the rst disk is repaired b efore the second failure and thus the arraycluster continues to

op erate normally

we refer to the rst outcome of this event as a failure and to the second outcome as a success

ie this is a Bernoulli trial We also know that

Probfailure

Proba second disk failure o ccurs b efore the rst is repaired

Probat least one of the remaining disks fails within MTTR

disk

Probsuccess

Probthe rst failure is repaired b efore a second failure o ccurs

Probfailure

Thus in determining the mean time to failure of a disk array we are interested in a sequence of

event outcomes or trials where all outcomes but the last one corresp ond to success and the last

1

one corresp onds to a failure It is well known that on the average it takes trials

Probfailure

b efore we obtain a failure Now we have that

MTTF

cluster

Exp ectedtime b etween failures in a cluster Exp ectednumb er of trials b efore obtain a failure

Exp ectedtime b etween failures in a cluster Probfailure

2

MTTF MTTF

disk disk

MMTR

disk

C C C MMTR

C

disk

MTTF

disk

Given the ab ove derivation of MTTF we can now compute the MTTF equations used in

cluster

Section as follows Let N b e the total numb er of disks in the wide data striping as well as the

d

hybrid architectures

Wide data striping architecture

Recall that this architecture only keeps a single copy of each ob ject i in the entire system Thus

using reasoning similar to the one used ab ove to derive MTTF we have that

cluster

2

MTTF MTTF

cluster disk

MTTF wide data striping arch

number of cl uster s in the sy stem N C MTTR

d disk

Hybrid architecture

N

d

disks and as b efore we assume that the no des are indep endent in terms of Here each no de has

N

their failures First we consider the mean time to data loss for those ob jects which only have a

single copy in the entire system which is as follows again using reasoning similar to the one used

ab ove to derive MTTF

cluster

2

MTTF N MTTF

cluster disk

MTTF hybrid archsingle copy

number of cl uster s in one node N C MTTR

d disk

Next we consider the mean time to data loss for ob ject i which has k copies in the system ie

15

in k dierent no des Once the rst disk failure o ccurs in the rst no de containing a copy of

ob ject i let Event A corresp ond to the event where a second disk fails in the cluster of the rst

16

no de in R t already op erating under failure and at least one cluster fails in each of the other

i

no des in R t Then as b efore we are lo oking at a Bernoulli trial where now Probfailure

i

15

This is a general derivation as the numb ering of the no des is logical

16

That is this is the cluster where the rst failure o ccured

ProbEvent A o ccurs within MTTR Consequently we have

disk

MTTF k copy ob ject

Exp ectedtime b etween failures in a cluster Exp ectednumb er of trials b efore obtain a failure

Exp ectedtime b etween failures in a cluster Probfailure

MTTF

disk

C

N MTTR N MTTR N

MTTR (C 1) MTTR (C 1) MTTR (C 1)

d disk d disk d

disk disk disk

CN MTTF N MTTF N

MTTF MTTF MTTF

disk disk

disk disk disk

2k

MTTF N

k

disk

2k 1

N C

MTTR

d

disk