
Differentiated latency in data center networks with erasure coded files through traffic engineering

Yu Xiang, Vaneet Aggarwal, Senior Member, IEEE, Yih-Farn R. Chen, and Tian Lan, Member, IEEE

Abstract—This paper proposes an algorithm to minimize weighted service latency for different classes of tenants (or service classes) in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Due to the limited bandwidth available at both top-of-the-rack and aggregation switches, and the differentiated service requirements of the tenants, network bandwidth must be apportioned among different intra- and inter-rack data flows for different service classes in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form upper bound on service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a joint weighted latency (over different service classes) optimization over three entangled "control knobs": the bandwidth allocation at top-of-the-rack and aggregation switches for different service classes, dynamic scheduling of file requests, and the placement of encoded file chunks (i.e., data locality). The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm which decouples and solves the joint optimization as 3 sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster using Cisco switches. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insights on designing low-latency data center networks with erasure-coded storage.

I. INTRODUCTION

Data center storage is growing at an exponential speed, and customers are increasingly demanding more flexible services that exploit the tradeoffs among reliability, performance, and storage cost. Erasure coding (forward error correction coding which transforms a content of k chunks into a longer content with n chunks such that the original content can be recovered from a subset of the n chunks) has quickly emerged as a very promising technique to reduce the cost of storage while providing similar system reliability as replicated systems. The effect of coding on the service latency to access files in erasure-coded storage is drawing increasing attention these days. Google and Amazon have published that every 500 ms of extra delay means a 1.2% user loss [1]. As latency is becoming a key performance metric in datacenter storage, taming service latency and providing elastic latency levels that fit tenants' different preferences in real storage systems is crucial for storage providers to offer differentiated services. In particular, minimizing latency in HDFS through VM placement optimization is proposed in [2], [3], and an evaluation of latency levels is provided in [4] for real applications such as Amazon EC2, the Doom 3 game server and Apple's Darwin Streaming Server, which have heterogeneous requirements. There are also works on latency-aware pricing [5], and multi-tier pricing models for charging tenants with differentiated service requirements [6], [7].

Modern data centers usually provide differentiated service tiers, ensuring that tenant service level agreement (SLA) objectives are met by tailoring storage performance to specific needs. This is because users have different latency requirements: time-sensitive jobs that need to complete in a flash, and insensitive jobs that can experience longer waiting times without significant performance degradation, may have different priorities and service classes, determined by the price tiers. Data centers often adopt a hierarchical architecture with multiple layers of network switches (Top-of-Rack (TOR), aggregation and core switches) [8]. Ensuring differentiated latency performance for each tenant in an erasure-coded storage system on such a clear architecture might seem straightforward; however, quantifying access latency in erasure-coded storage is an open problem. Some Markov-chain-based approaches have been proposed, but they have large state spaces, making it hard to obtain closed-form expressions [9], [10]. As encoded chunks are distributed across the system, accessing the files in a storage system requires - in addition to reasonable data chunk placements and dynamic request scheduling - intelligent traffic engineering solutions to handle differentiated service of 1) data transfers between racks that go through aggregation switches and 2) data transfers within a rack that go through Top-of-Rack (TOR) switches. Hence, although providing optimal resource allocation to ensure latency performance would certainly be appealing to storage providers, it can be quite difficult to analyze the exact access latency for each tenant in such multi-tier storage systems, which is essential for accurate latency optimization. To the best of our knowledge, this is the first work that jointly minimizes service latency with a quantified latency model for a multi-tenant environment with differentiated latency performance needs.

In this paper, we utilize the probabilistic scheduling policy developed in [11], [12] (where each request has a certain probability of being assigned to each storage server in the system) and analyze the service latency of erasure-coded storage with respect to data center topology and traffic engineering. Service latency, also known as downlink latency (DCL) [15], is the total time from when the request is served by the server to when the user gets the file. To the best of our knowledge, this is the first work that accounts for both erasure coding and data center networking in multi-tenant latency analysis. Our result provides a tight upper bound on the mean service latency of erasure-coded storage and applies to general data center topologies. It allows us to construct quantitative models and find solutions to a novel latency optimization leveraging both the placement of erasure-coded chunks on different racks and the bandwidth reservations at different switches.

(Footnote) Y. Xiang and Y. R. Chen are with AT&T Labs-Research, Bedminster, NJ 07921 (email: {yxiang,chen}@research.att.com). V. Aggarwal is with the School of IE, Purdue University, West Lafayette, IN 47907 (email: [email protected]). T. Lan is with the Department of ECE, George Washington University, DC 20052 (email: [email protected]). This work was supported in part by the National Science Foundation under Grant no. CNS-1618335. This paper was presented in part at the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing 2015 [13] and the 35th International Conference on Distributed Computing Systems (ICDCS) 2015 [14].
We consider a data center storage system with a hierarchical structure where files are encoded and spread across the system using erasure coding. Each rack has a TOR switch that is responsible for routing data flows between different disks and associated storage servers in the rack, while data transfers between different racks are managed by an aggregation switch that connects all TOR switches. File access requests may be submitted by tenants from any servers inside the same datacenter, which is common in today's datacenter storage. Tenants have differentiated latency preferences. Our goal is to minimize the differentiated average latency for each tenant in such storage systems. This paper quantifies the average latency for each tenant in the system, and with such quantification of latency, a differentiated latency optimization problem with limited resources is formulated. A system diagram is shown in Section III.

Customizing elastic service latency for the tenants in such a realistic environment (the most common data center structure, with fully distributed data access) comes with great technical challenges and calls for a new framework for delivering, quantifying and optimizing differentiated service latency in general erasure-coded storage: (i) Due to the limited bandwidth available at both the TOR and aggregation switches, a simple First Come First Serve (FCFS) policy to schedule all file requests fails to deliver satisfactory latency performance, not only because the policy lacks the ability to adapt to varying traffic patterns, but also due to the need to apportion networking resources with respect to heterogeneous service requirements. (ii) The latency of each file request is determined by the maximum delay in retrieving its encoded chunks. Without proper coordination in processing each batch of chunk requests that jointly reconstructs a file, service latency is dominated by staggering chunk requests with the worst access latency (e.g., chunk requests processed by heavily congested switches), significantly increasing overall latency in the data center. (iii) The joint optimization of average service latency (weighted by the rate of arrival of requests and the weight of the service class) over the placement of the contents of each file, the bandwidth reservation between any pair of racks (at TOR and aggregation switches), and the scheduling probabilities is shown to be a mixed-integer optimization, which is hard to solve, especially due to the integer constraints on content placement.

The rest of this paper is organized to tackle these challenges accordingly. Section III studies a weighted queuing policy at TOR and aggregation switches, which partitions tenants into different service classes based on their delay requirements. The file requests submitted by tenants in each class are buffered and processed in a separate FCFS queue at storage servers in each designated rack, and we apportion bandwidth at TOR and aggregation switches so that each flow has its own FCFS queue for the data transfer.
The service rates of these queues are tunable, and tuning these weights allows us to provide differentiated service rates (latency) to tenants in different classes. Section IV provides the latency bound. Section V jointly optimizes bandwidth allocation and data locality (i.e., the placement of encoded file chunks) to achieve service latency minimization; the latency minimization is decoupled into 3 sub-problems, two of which are proven to be convex. We propose an algorithm which iteratively minimizes service latency over the three engineering degrees of freedom with guaranteed convergence. Finally, Section VI provides a prototype of the proposed algorithms in an erasure-coded distributed file system. The implementation results show that our proposed approach significantly improves service latency in storage systems compared to the native storage settings of the testbed.

TABLE I
MAIN NOTATION.

  Symbol            Meaning
  N                 Number of racks, indexed by i = 1, ..., N
  m                 Number of storage servers in each rack
  R                 Number of files in the system, indexed by r = 1, ..., R
  (n, k)            Erasure code for storing files
  D                 Number of service classes
  B                 Total available bandwidth at the aggregate switch
  b                 Total available bandwidth at each top-of-the-rack switch
  b_i^eff           Effective bandwidth for all traffic within rack i
  B^eff_{i,j,d}     Effective bandwidth for service class d from rack i to rack j
  w_{i,d}           Bandwidth fraction at the TOR switch of rack i for service class d
  W_{i,j,d}         Bandwidth fraction at the aggregate switch for service class d
  λ_{i,r_d}         Arrival rate of file r of service class d from rack i
  π_{i,j,r_d}       Probability of routing a request for file r of service class d from rack i to rack j
  S_{r_d}           Set of racks for placing encoded chunks of file r of service class d
  N_{i,j}           Connection delay for service from rack i to rack j
  Q_{i,j,d}         Queuing delay for service of class d from rack i to rack j
  T̄_{i,r_d}         Expected latency of a request of file r from rack i

II. BACKGROUND

A. Erasure Code

Erasure coding has been widely studied for distributed storage [16]–[30] and used by companies like Facebook and

Google, since it provides space-optimal data redundancy to protect against data loss. For the Maximum Distance Separable (MDS) codes considered in this paper, each file is partitioned into k chunks, and an (n, k) MDS erasure code is used to generate another n − k chunks, thus obtaining a total of n coded chunks. By the properties of MDS codes, any k out of the n coded chunks are enough to recover the original file [Some reference]. In this work, we place each chunk on a different storage node to provide high reliability in the event of node or network failures.

An example of how erasure coding is applied to distributed storage systems is shown in Fig. 1 for two files A and B. Each of the two files is partitioned into k = 2 chunks of equal size. File A is encoded using a (3, 2) MDS code, such that obtaining any 2 out of the 3 chunks ensures recovery of file A. Both chunks of File B are replicated. Thus, file A is stored on three nodes while File B is stored on four nodes (each chunk is stored on a distinct node). Even though two chunks are needed to recover File B, these two chunks cannot be selected arbitrarily. We note that while both files can be recovered after a single-node failure and thus provide the same reliability, erasure coding achieves a lower storage space overhead than replication (i.e., 50% vs. 100%).

[Fig. 1. Erasure coding vs. replication for two files: File A is split into chunks A1 and A2 and encoded with a (3,2) MDS code, adding the parity chunk A3 = A1 + A2; File B is split into chunks B1 and B2, each stored with dual replication.]

B. Related Work

To the best of our knowledge, while latency in erasure-coded storage systems has been widely studied, quantifying the exact latency of an erasure-coded storage system in a data center network is an open problem. Prior works focusing on asymptotic queuing delay behaviors [31], [32] are not applicable because the redundancy factor in practical data centers typically remains small due to storage cost concerns. Due to the lack of analytic latency models, most of the literature has focused on reliable distributed storage system design, and latency is only presented as a performance metric when evaluating the proposed erasure coding scheme, e.g., [16]–[19], [30], which demonstrate latency improvements due to erasure coding in different system implementations. Related designs can also be found in data access scheduling [33]–[35], access collision avoidance [36], [37], and encoding/decoding time optimization [38], [39]; there is also work using LT erasure codes to adjust the system to meet user requirements such as availability, integrity and confidentiality [20].
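To make the (3, 2) code of Fig. 1 concrete, the sketch below implements it for equal-size byte chunks: the parity chunk is the bytewise XOR of the two data chunks (A3 = A1 + A2 over GF(2)), so any 2 of the 3 chunks recover the file. This toy code is for illustration only; production systems (including the Tahoe system used later in this paper) rely on general Reed–Solomon-style libraries such as zfec.

```python
def encode_3_2(a1: bytes, a2: bytes) -> list:
    """(3,2) MDS encode: two data chunks plus one XOR parity chunk."""
    assert len(a1) == len(a2), "chunks must be of equal size"
    a3 = bytes(x ^ y for x, y in zip(a1, a2))  # A3 = A1 + A2 (bytewise XOR)
    return [a1, a2, a3]

def decode_3_2(chunks: dict) -> bytes:
    """Recover the file from any 2 of the 3 chunks, keyed {0: A1, 1: A2, 2: A3}."""
    assert len(chunks) >= 2, "need k = 2 chunks to decode"
    if 0 in chunks and 1 in chunks:                 # both data chunks survived
        return chunks[0] + chunks[1]
    present = chunks[0] if 0 in chunks else chunks[1]
    missing = bytes(x ^ y for x, y in zip(chunks[2], present))  # XOR out parity
    return present + missing if 0 in chunks else missing + present

# Any 2-of-3 subset reconstructs the file, e.g. after losing chunk A1:
chunks = encode_3_2(b"file", b"data")
assert decode_3_2({1: chunks[1], 2: chunks[2]}) == b"filedata"
```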

Recently, there have been a number of attempts at finding latency bounds for erasure-coded storage systems [11], [12], [14], [21]–[23], [40]. Overall, two sets of analyses have been studied in the literature, and both are restricted to the special case of a single file or homogeneous files. The first set uses a block-one-scheduling policy that only allows the request at the head of the buffer to move forward [21]. An upper bound on the average latency of the storage system is provided through queuing-theoretic analysis for MDS codes with k = 2. Later, the approach was extended in [22] to general (n, k) erasure codes, yet still for homogeneous files. A family of MDS-Reservation(t) scheduling policies that block all except the first t file requests are proposed and lead to numerical upper bounds on the average latency. It is shown that as t increases, the bound becomes tighter, while the number of states concerned in the queuing-theoretic analysis grows exponentially. The second set of analyses is based on the fork-join queue [41]. The authors of [23] used an (n, k) fork-join queue to model the latency performance of erasure-coded storage; a closed-form upper bound on service latency is derived for the case of homogeneous files (i.e., files with identical size, placement, erasure code and request arrival rates) and exponentially distributed service times. However, the approach cannot be applied to a multiple-heterogeneous-file storage where each file has a separate fork-join queue and the queues of different files are highly dependent due to shared storage nodes and joint request scheduling. In another work [40], the authors applied a fork-join queue to optimize thread allocation to each request when servers cannot serve multiple requests simultaneously. In addition, the authors of [42] proposed a distributed storage system with heterogeneous jobs which is analyzed through the fork-join queue framework, and lower and upper bounds on the average latency are provided for jobs of different classes under various scheduling policies. However, the fork-join queue framework requires that requests be sent to all n servers rather than only k, leading to a waste of resources.

In this work, we use a latency analysis based on the probabilistic scheduling developed in [11], [12]. To the best of our knowledge, this is the first analysis that accounts for multiple files and arbitrary file access patterns in quantifying service latency. Further, the analysis applies to general service time distributions by modeling the delay at each storage node via the latency of an M/G/1 queue. Knowing the exact delay from each storage node, a tight upper bound on the average service latency can be found by extending the ordered-statistic analysis in [43]. Not only does this result supersede previous latency analyses [21]–[23] by incorporating multiple heterogeneous files and arbitrary service time distributions, it was also shown to be tighter for a wide range of workloads even in the homogeneous-file case. We also note that this analysis applies in a multi-tenant environment where each storage node performs first come, first serve (FCFS) scheduling. However, although we can employ the probabilistic scheduling policy from [12] for heterogeneous files and general service time distributions, [12] only considers limited service capacity at storage nodes and does not take into account the impact of shared datacenter networking and bandwidth constraints. The latency model in this work must take the impact of the multi-tenant environment and data center networking into account, as the goal is to ensure each tenant's latency requirement through optimal resource allocation and load balancing in a tiered data center network.

III. SYSTEM MODEL

We consider a data center consisting of N racks, denoted by N = {1, 2, ..., N}, each equipped with m homogeneous servers that are available for data storage and running applications. (All main notation used in this paper is listed in Table I.) There is a Top-of-Rack (TOR) switch connected to each rack to route intra-rack traffic and an aggregate switch that connects all TOR switches for routing inter-rack traffic. A set of files R = {1, 2, ..., R} are stored distributively among these m·N servers and are accessed by client applications. Our goal is to minimize the overall file access latency of all applications to improve the latency performance of the cloud system.
A system diagram is shown in Fig. 2.

[Fig. 2. Architecture hierarchy with bandwidth allocation at the switches: the aggregate switch allocates bandwidth at its ports to control the service bandwidth between each pair of racks (Rack 1, 2, 3, ..., N), and each ToR switch allocates bandwidth at its ports to control the service bandwidth between each pair of hosts within the same rack.]

This work focuses on erasure-coded file storage systems, where files are erasure-coded and stored distributively to enable space saving while providing the same data durability as replication systems. More precisely, each file r is split into k fixed-size chunks and then encoded using an (n, k) erasure code to generate n chunks of the same size. The encoded chunks are assigned to and stored on n out of the m·N distinct servers in the data center. While it is possible to place multiple chunks of the same file in the same rack¹, we assume that the n servers storing file r belong to distinct racks, so as to maximize the distribution of chunks across all racks, which in turn provides the highest reliability against independent rack failures. This is indeed a common practice adopted by file systems such as QFS [44], an erasure-coded storage file system that is designed to replace HDFS for Map/Reduce processing.

¹Using the techniques in [12], our results in this paper can be easily extended to the case where multiple chunks of the same file are placed on the same rack.

A. Probabilistic Scheduling

For each file r, we use S_r to denote the set of racks that host encoded chunks of file r, satisfying S_r ⊆ N and |S_r| = n. To access file r, an (n, k) Maximum-Distance-Separable (MDS) erasure code allows the file to be reconstructed from any subset of k out of n chunks. Therefore, a file request generated by a client application must be routed to a subset of k racks in S_r for successful processing. To minimize latency, the selection of these k racks needs to be optimized for each file request on the fly for load-balancing among different racks, in order to avoid creating "hot" switches that suffer from heavy congestion and result in high access latency. We refer to this routing and rack selection problem as the request scheduling problem.

In this paper, we assume that the tenants' files are divided into D service classes, denoted by class d = 1, 2, ..., D. Files in different service classes have different sensitivity and latency requirements. A series of requests are generated by client applications to access the R files of different classes. We model the arrival of requests for each file r in class d as a Poisson process. To emphasize each file's service class, we denote a file r in class d by file r_d. Let λ_{i,r_d} be the rate of file r_d requests that are generated by a client application running in rack i. This means that a file r_d request can be generated from a client application in any of the N racks, with probability λ_{i,r_d} / Σ_i λ_{i,r_d}. The overall request arrival for file r_d is a superposition of Poisson processes with aggregate rate λ_{r_d} = Σ_i λ_{i,r_d}. Since solving the optimal request scheduling for latency minimization is still an open problem for general erasure-coded storage systems [12], [21]–[23], we employ the probabilistic scheduling policy proposed in [12] to derive an outer bound on service latency, which further enables a practical solution to the request scheduling problem. Upon the arrival of each request, a probabilistic scheduler selects k out of the n racks in S_r (which host the desired file chunks) according to some known probability and routes the resulting traffic to the client application accordingly. It is shown that determining the probability distribution of each k-out-of-n combination of racks is equivalent to solving the marginal probabilities for scheduling requests λ_{i,r_d}:

  π_{i,j,r_d} = P[ j ∈ S_r is selected | k racks are selected ],   (1)

under the constraints π_{i,j,r_d} ∈ [0, 1] and Σ_j π_{i,j,r_d} = k [12]. Here π_{i,j,r_d} represents the scheduling probability of routing a class-d request for file r generated at rack i to rack j. The constraint Σ_j π_{i,j,r_d} = k ensures that exactly k distinct racks are selected to process each file request. It is easy to see that the chunk placement problem can now be rewritten with respect to π_{i,j,r_d}. For example, π_{i,j,r_d} = 0 equivalently means that no chunk of file r exists on rack j, i.e., j ∉ S_r, while a chunk is placed on rack j only if π_{i,j,r_d} > 0.
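Given marginals π_{i,j,r_d} that satisfy the constraints above, one standard way to draw exactly k racks whose inclusion probabilities match those marginals is systematic sampling over the cumulative probability line. The sketch below is our own illustration of one valid realization of such a probabilistic scheduler; it is not the specific construction used in [12].

```python
import math
import random

def sample_k_racks(pi: dict, k: int) -> list:
    """Select exactly k racks so that P[rack j chosen] = pi[j].
    Requires each pi[j] in [0, 1] and sum(pi.values()) == k.
    Systematic sampling: one uniform draw u places the k grid points
    u, u+1, ..., u+k-1 on the cumulative line of length k; each grid
    point lands in exactly one rack's probability slice."""
    assert abs(sum(pi.values()) - k) < 1e-9, "marginals must sum to k"
    u = random.random()
    chosen, cum = [], 0.0
    for rack, p in pi.items():
        lo, hi = cum - u, cum + p - u
        if math.ceil(hi) - math.ceil(lo) >= 1:  # some grid point u+t hit the slice
            chosen.append(rack)
        cum += p
    return chosen

# Example with n = 3 candidate racks and k = 2 chunks per request:
print(sample_k_racks({3: 0.7, 5: 0.9, 8: 0.4}, k=2))  # e.g. [3, 5] or [5, 8]
```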

B. Bandwidth Allocation

To accommodate the network traffic generated by applications in different service classes, the bandwidth available at the ToR and aggregation switches must be apportioned among different flows in a coordinated and optimized fashion. We propose a weighted queuing model for bandwidth allocation that enables differentiated latency QoS for requests belonging to different classes. In particular, at each rack j, we manage all incoming class-d requests generated by applications in rack i in a (virtual) local queue, denoted by q_{i,j,d}. Therefore, each rack j contains N·D independent queues, including D queues that manage intra-rack traffic traveling through only the TOR switch and (N − 1)·D queues that manage inter-rack traffic of class d originating from other racks i ≠ j. Requests belonging to each virtual queue are managed independently using a First-In-First-Out policy. In practice, the incoming requests can be buffered distributively at their intended storage servers in rack j and scheduled by a centralized (virtual) controller in the same rack.

Assume the total bi-directional bandwidth at the aggregate switch is B, which needs to be apportioned among the N(N − 1)·D queues for inter-rack traffic, and the port bandwidth capacity is B_p. Let {W_{i,j,d}, ∀i ≠ j} be a set of N(N − 1)·D non-negative weights satisfying Σ_{i,j,d: i≠j} W_{i,j,d} ≤ 1; the sum of weights over all ports at the aggregation switch can be smaller than 1 because the total bandwidth B may not be fully utilized due to individual port bandwidth constraints. We assign to each queue q_{i,j,d} a share of B that is proportional to W_{i,j,d}, i.e., queue q_{i,j,d} receives a dedicated service bandwidth B^eff_{i,j,d} on the aggregate switch:

  B^eff_{i,j,d} = B · W_{i,j,d}, ∀i ≠ j.   (2)

Because inter-rack traffic traverses two ToR switches, the same amount of bandwidth has to be reserved on the TOR switches of both racks i and j. Any remaining bandwidth on the TOR switches is then made available for intra-rack traffic routing. After accommodating all inter-rack bandwidth assignments on rack i, the bandwidth remaining for intra-rack traffic is given by the total TOR bandwidth b minus the aggregate incoming and outgoing inter-rack traffic²:

  b^eff_i = b − Σ_{d=1}^{D} ( Σ_{j: j≠i} W_{i,j,d} B + Σ_{j: j≠i} W_{j,i,d} B ), ∀i.   (3)

Next, we further apportion b^eff_i among the D different service classes (i.e., D queues) of intra-rack traffic in rack i, each receiving a bandwidth share B^eff_{i,i,d} proportional to the weights w_{i,d}. The effective bandwidth for each class of intra-rack requests is:

  B^eff_{i,j,d} = w_{i,d} · b^eff_i, ∀i = j.   (4)

By optimizing W_{i,j,d} and w_{i,d}, the weighted queuing model provides an effective allocation of data center bandwidth among different classes of data flows both within and across racks. Bandwidth under-utilization and congestion need to be addressed by optimizing these weights, so that queues with heavy workloads receive more bandwidth and those with light workloads get less. At the same time, different classes of requests should be assigned different shares of resources according to their class levels, so as to jointly minimize the overall latency in the system.

Our goal in this work is to quantify service latency under this model through a tight upper bound and to minimize the average service latency of all traffic in the data center by solving an optimization problem over three dimensions: the placement of encoded file chunks S_r, the scheduling probabilities π_{i,j,r_d} for load-balancing, and the allocation of system bandwidth through the weights W_{i,j,d} and w_{i,d} for inter-rack and intra-rack traffic of different classes. To the best of our knowledge, this is the first work jointly minimizing the service latency of an erasure-coded system over all three dimensions.

²Our model, as well as the subsequent analysis and algorithms, can be trivially modified depending on whether the TOR switch is non-blocking and/or duplex.
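The allocation in Eqs. (2)-(4) is purely arithmetic once the weights are chosen, as the sketch below shows. The nested-list data layout and function name are our own conventions; the sketch also reads Eq. (3) as subtracting both incoming and outgoing inter-rack reservations, per the surrounding prose.

```python
def effective_bandwidths(B, b, W, w, N, D):
    """Effective service bandwidth per queue q_{i,j,d} under weighted queuing.
    W[i][j][d]: inter-rack weight fractions (i != j); w[i][d]: intra-rack
    per-class weights. Returns Beff[i][j][d], with Beff[i][i][d] holding the
    intra-rack allocation."""
    Beff = [[[0.0] * D for _ in range(N)] for _ in range(N)]
    for i in range(N):
        # Eq. (3): TOR bandwidth left after reserving all incoming and
        # outgoing inter-rack flows that touch rack i.
        reserved = sum(W[i][j][d] + W[j][i][d]
                       for d in range(D) for j in range(N) if j != i)
        b_eff = b - reserved * B
        for d in range(D):
            Beff[i][i][d] = w[i][d] * b_eff           # Eq. (4): intra-rack
            for j in range(N):
                if j != i:
                    Beff[i][j][d] = B * W[i][j][d]    # Eq. (2): inter-rack
    return Beff
```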

IV. ANALYZING SERVICE LATENCY FOR DATA REQUESTS

In this section, we derive an upper bound on the service latency of a file of class d for each client application running in the data center. Under the probabilistic scheduling policy, it is easy to see that the requests for a chunk of file r_d from rack i to rack j form a (decomposed) Poisson process with rate λ_{i,r_d} π_{i,j,r_d}. Thus, the aggregate arrival of requests from rack i to rack j becomes a (superpositioned) Poisson process with rate

  Λ_{i,j,d} = Σ_r λ_{i,r_d} π_{i,j,r_d}.   (5)

It is easy to see that each queue buffering requests from rack i to rack j for service class d is an M/G/1 queue with Poisson arrivals and a general service distribution. The total number of such queues over the choices of i, j and d is N²D. The service time per chunk is determined by the allocation of bandwidth B·W_{i,j,d} to each queue q_{i,j,d} handling inter-rack traffic, or the allocation of bandwidth b_j to q_{j,j,d} for the different classes of intra-rack traffic. The service latency of each file request can then be obtained from the largest delay in retrieving the k probabilistically selected chunks of the file [12]. However, the dependency among different queues (their chunk request arrivals are coupled) poses another challenge for deriving the exact service latency of file requests.

For each chunk request, we consider two latency components: the connection delay N_{i,j}, which includes the overhead of setting up the network connection between the client application in rack i and the storage server in rack j, and the queuing delay Q_{i,j,d}, which measures the time a chunk request experiences at the bandwidth service queue q_{i,j,d}, i.e., the time to transfer a chunk with the allocated network bandwidth. Let T̄_{i,r_d} be the average service latency for accessing file r of class d by an application in rack i. Due to the use of (n, k) erasure coding for distributed storage, each file request is mapped to a batch of k parallel chunk requests. The chunk requests are scheduled to k different racks A^r_i ⊂ S_r that host the desired file chunks. A file request is completed if all k = |A^r_i| chunks are successfully retrieved, which means a mean service time of:

  T̄_{i,r_d} = E[ E_{A^r_i} [ max_{j ∈ A^r_i} ( N_{i,j} + Q_{i,j,d} ) ] ],   (6)

where the expectation is taken over the queuing dynamics in the model and over the set A^r_i of racks selected randomly with probabilities π_{i,j,r_d} for all j.

The mean latency in (6) could be easily calculated if (a) the delays N_{i,j} + Q_{i,j,d} were independent and (b) the rack selection A^r_i were fixed. However, due to the coupled arrivals of chunk requests, the queues at different switches are indeed correlated. We deal with this issue by employing a tight bound on the highest order statistic and extending it to account for randomly selected racks [12]. This approach allows us to obtain a closed-form upper bound on average latency using only the first and second moments of N_{i,j} + Q_{i,j,d}, which are easier to analyze. The result is summarized in Lemma 1.

Lemma 1: (Bound for the random order statistic [12].) For an arbitrary set of z_{i,r_d} ∈ R, the average service time is bounded by

  T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) ( E[D_{i,j,d}] − z_{i,r_d} )
            + Σ_{j∈S_r} (π_{i,j,r_d}/2) sqrt( ( E[D_{i,j,d}] − z_{i,r_d} )² + Var[D_{i,j,d}] ),   (7)

where D_{i,j,d} = N_{i,j} + Q_{i,j,d} is the combined delay. The tightest bound is obtained by optimizing over z_{i,r_d} ∈ R and is tight in the sense that there exists a distribution of D_{i,j,d} that achieves the bound with exact equality.

We assume that the connection delay N_{i,j} is independent of the queuing delay Q_{i,j,d}. Under the proposed system model, the chunk requests arriving at each queue q_{i,j,d} form a Poisson process with rate Λ_{i,j,d} = Σ_r λ_{i,r_d} π_{i,j,r_d}. Therefore, each queue q_{i,j,d} at the aggregate switch can be modeled as an M/G/1 queue that processes chunk requests in an FCFS manner. For intra-rack traffic, the D queues at the top-of-rack switch of each rack can also be modeled as M/G/1 queues, since the intra-rack chunk requests form a Poisson process with rate Λ_{j,j,d} = Σ_r λ_{j,r_d} π_{j,j,r_d}. Due to the fixed chunk size in our system, we denote by X the standard (random) service time per chunk when the full bandwidth B is available. We assume that the service time is inversely proportional to the bandwidth allocated to q_{i,j,d}, i.e., B^eff_{i,j,d}, and obtain the distribution of the actual service time X_{i,j,d} under this bandwidth allocation:

  X_{i,j,d} ∼ X · B / B^eff_{i,j,d}, ∀i, j.   (8)

With the service time distribution above, we can derive the mean and variance of the queuing delay Q_{i,j,d} using the Pollaczek–Khinchine formula. Let μ = E[X], σ² = Var[X], and Γ_t = E[X^t] be the mean, variance and t-th order moment of X, and let η_{i,j} and ξ²_{i,j} be the mean and variance of the connection delay N_{i,j}. These statistics are readily available from existing work on network delay [39], [45] and file-size distributions [46], [47].

Lemma 2: The mean and variance of the combined delay D_{i,j,d} for any i, j are given by

  E[D_{i,j,d}] = η_{i,j} + Λ_{i,j,d} Γ_2 B² / ( 2 B^eff_{i,j,d} ( B^eff_{i,j,d} − Λ_{i,j,d} μ B ) ),   (9)

  Var[D_{i,j,d}] = ξ²_{i,j} + Λ_{i,j,d} Γ_3 B³ / ( 3 (B^eff_{i,j,d})² ( B^eff_{i,j,d} − Λ_{i,j,d} μ B ) )
               + Λ²_{i,j,d} (Γ_2)² B⁴ / ( 4 (B^eff_{i,j,d})² ( B^eff_{i,j,d} − Λ_{i,j,d} μ B )² ),   (10)

where B^eff_{i,j,d} is the effective bandwidth assigned to the queue of class-d file requests from rack i to rack j.
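Eqs. (9)-(10) are direct to evaluate numerically. The sketch below computes the combined-delay moments of one queue q_{i,j,d} from the chunk service-time moments (μ, Γ₂, Γ₃), the connection-delay statistics (η, ξ²), the aggregate arrival rate Λ and the effective bandwidth; the function name and argument order are our own.

```python
def combined_delay_moments(eta, xi2, Lam, mu, Gamma2, Gamma3, B, Beff):
    """Mean and variance of D = N + Q for one M/G/1 queue, per Eqs. (9)-(10):
    connection-delay statistics plus Pollaczek-Khinchine waiting-time moments
    under the scaled service time X * B / Beff."""
    slack = Beff - Lam * mu * B          # positive iff the queue is stable (19)
    assert slack > 0, "arrival rate exceeds the queue's service capacity"
    mean = eta + Lam * Gamma2 * B**2 / (2 * Beff * slack)
    var = (xi2
           + Lam * Gamma3 * B**3 / (3 * Beff**2 * slack)
           + Lam**2 * Gamma2**2 * B**4 / (4 * Beff**2 * slack**2))
    return mean, var
```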
Combining these results, we derive an upper bound on the average service latency T̄_{i,r_d} as a function of the chunk placement S_r, the scheduling probabilities π_{i,j,r_d}, and the bandwidth allocation B^eff_{i,j,d} (which is a function of the bandwidth weights W_{i,j,d} and w_{i,d} through (2) and (4)). The main result of our latency analysis is summarized in the following theorem.

Theorem 1: For an arbitrary set of z_{i,r_d} ∈ R, the expected latency T̄_{i,r_d} of a request for file r of class d issued from rack i is upper bounded by

  T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) · f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ),   (11)

where the function f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ) is an auxiliary function depending on the aggregate rate Λ_{i,j,d} and the effective bandwidth B^eff_{i,j,d} of queue q_{i,j,d}, i.e.,

  f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ) = H_{i,j,d} + sqrt( H²_{i,j,d} + G_{i,j,d} ),   (12)

  H_{i,j,d} = η_{i,j} + Λ_{i,j,d} Γ_2 B² / ( 2 B^eff_{i,j,d} ( B^eff_{i,j,d} − Λ_{i,j,d} μ B ) ) − z_{i,r_d},   (13)

  G_{i,j,d} = ξ²_{i,j} + Λ_{i,j,d} Γ_3 B³ / ( 3 (B^eff_{i,j,d})² ( B^eff_{i,j,d} − Λ_{i,j,d} μ B ) )
            + Λ²_{i,j,d} (Γ_2)² B⁴ / ( 4 (B^eff_{i,j,d})² ( B^eff_{i,j,d} − Λ_{i,j,d} μ B )² ).   (14)
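Comparing (13)-(14) with (9)-(10) shows that H_{i,j,d} = E[D_{i,j,d}] − z and G_{i,j,d} = Var[D_{i,j,d}], so the Theorem 1 bound can reuse the Lemma 2 moments, and a one-dimensional search over z then yields the tightest bound. The sketch below assumes the combined_delay_moments helper defined above and a caller-supplied mapping from each selected rack to its queue moments.

```python
from math import sqrt

def file_latency_bound(z, pi_by_rack, moments_by_rack):
    """Upper bound (11) on the expected latency of one file request:
    z + sum_j (pi_j / 2) * f(z, ...), where f = H + sqrt(H^2 + G),
    H = E[D_j] - z and G = Var[D_j] (Eqs. (12)-(14)).
    pi_by_rack[j]: scheduling probability; moments_by_rack[j]: (E[D], Var[D])."""
    bound = z
    for j, pi_j in pi_by_rack.items():
        mean_d, var_d = moments_by_rack[j]
        H = mean_d - z                       # Eq. (13)
        G = var_d                            # Eq. (14)
        bound += (pi_j / 2.0) * (H + sqrt(H * H + G))
    return bound

def tightest_bound(pi_by_rack, moments_by_rack, z_grid):
    """The tightest bound optimizes over z (Lemma 1); a simple grid
    search suffices for illustration."""
    return min(file_latency_bound(z, pi_by_rack, moments_by_rack)
               for z in z_grid)
```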

V. JOINT LATENCY MINIMIZATION IN CLOUD

We consider a joint latency minimization problem over three design degrees of freedom: (1) the placement of encoded file chunks {S_r}, which determines data center traffic locality; (2) the allocation of bandwidth at the aggregation/TOR switches through the per-class weights {W_{i,j,d}} and {w_{i,d}}, which affects the chunk service time of different data flows; and (3) the scheduling probabilities {π_{i,j,r_d}}, which optimize load-balancing under the probabilistic scheduling policy and select the racks/servers for retrieving files of different classes. Let λ_all = Σ_d Σ_i Σ_r λ_{i,r_d} be the total file request rate in the datacenter, and let F_d be the set of files belonging to class d. The optimization objective is to minimize the mean service latency in the erasure-coded storage system, which is defined by

  Σ_{d=1}^{D} C_d T̄_d, where T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} ( λ_{i,r_d} / λ_all ) T̄_{i,r_d},   (15)

where C_d is a tuning factor indicating the importance of the average latency of tenant class d in the optimization goal; as C_d increases, tenant class d becomes more important and we need to improve the latency performance of class-d requests more. Substituting T̄_{i,r_d} from Equation (11), and after some simplification, we obtain T̄_d as:

  T̄_d = Σ_{i=1}^{N} [ Σ_{r∈F_d} ( λ_{i,r_d} z_{i,r_d} / λ_all ) + Σ_{j=1}^{N} ( Λ_{i,j,d} / (2 λ_all) ) f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ) ].   (16)

In the simplification above, we used Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d} = Λ_{i,j,d} and π_{i,j,r_d} = 0 for all j ∉ S_r.

Now we are ready to define our optimization problem, which minimizes the differentiated service latency over three dimensions: optimal chunk placement, request scheduling, and weight assignment for all tenants.

We now define the Joint Latency and Weights Optimization (JLWO) problem as follows:

  min.  Σ_{d=1}^{D} C_d T̄_d   (17)

  s.t.  T̄_d = Σ_{i=1}^{N} [ Σ_{r∈F_d} ( λ_{i,r_d} z_{i,r_d} / λ_all ) + Σ_{j=1}^{N} ( Λ_{i,j,d} / (2 λ_all) ) f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ) ]   (18)

        Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d} ≤ B^eff_{i,j,d} / (μ B), ∀i, j, d   (19)

        Σ_{j=1}^{N} π_{i,j,r_d} = k and π_{i,j,r_d} ∈ [0, 1], ∀i, j, r_d   (20)

        |S_r| = n and π_{i,j,r_d} = 0 ∀j ∉ S_r, ∀i, r_d   (21)

        Σ_{d=1}^{D} w_{i,d} = 1, ∀i   (22)

        Σ_{i=1}^{N} Σ_{j≠i} Σ_{d=1}^{D} W_{i,j,d} ≤ 1   (23)

        B^eff_{i,j,d} = W_{i,j,d} B ≤ B_p, ∀i ≠ j   (24)

        B^eff_{i,j,d} = w_{i,d} ( b − Σ_{d=1}^{D} Σ_{l: l≠i} ( W_{i,l,d} B + W_{l,i,d} B ) ) ≤ C, ∀i = j   (25)

  var.  z_{i,r_d}, {S_r}, {π_{i,j,r_d}}, {w_{i,d}}, {W_{i,j,d}}.

Here we minimize the service latency derived in Theorem 1 over z_{i,r_d} ∈ R for each file access to get the tightest upper bound. Feasibility of Problem JLWO is ensured by (19), which requires the arrival rate to be no greater than the chunk service rate received by each queue. Encoded chunks of each file are placed on a set S_r of racks in (21), and each file request is mapped to k chunk requests and processed by k racks with probabilities π_{i,j,r_d} in (20). Finally, both the weights W_{i,j,d} (for inter-rack traffic) and w_{i,d} (for intra-rack traffic) must add up to 1 so that no bandwidth is left underutilized. The bandwidth B^eff_{i,j,d} assigned to each queue, ∀i, j, is determined by our bandwidth allocation policy in (2) and (4) for inter- and intra-rack allocations. Any effective flow bandwidth resulting from the bandwidth allocation has to be less than or equal to the port capacity C allowed by the switches. For example, the Cisco switch used in our experiment has a port capacity constraint of C = 1 Gbps, which we elaborate on in Section VI.

Problem JLWO is a mixed-integer optimization and hard to compute in general. In this work, we develop an iterative optimization algorithm that alternately optimizes over the three dimensions (chunk placement, request scheduling, and inter/intra-rack weight assignment) of Problem JLWO and solves each sub-problem repeatedly to generate a sequence of monotonically decreasing objective values. To introduce the proposed algorithm, we first recognize that Problem JLWO is convex in {π_{i,j,r_d}}, {w_{i,d}} and {W_{i,j,d}} when the other variables are fixed. The convexity is shown by a sequence of lemmas as follows.

Lemma 3: (Convexity of the scheduling sub-problem [12].) When {z_{i,r_d}, W_{i,j,d}, w_{i,d}, S_r} are fixed, Problem JLWO is a convex optimization over the probabilities {π_{i,j,r_d}}.

Proof: The proof is straightforward due to the convexity of Λ_{i,j,d} f( z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d} ) in Λ_{i,j,d} (which is a linear combination of {π_{i,j,r_d}}), as shown in [12], and the fact that all constraints are linear with respect to π_{i,j,r_d}.

Lemma 4: (Convexity of the bandwidth allocation sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, S_r, w_{i,d}} are fixed, Problem JLWO is a convex optimization over the inter-rack bandwidth allocation {W_{i,j,d}}. When {z_{i,r_d}, π_{i,j,r_d}, S_r, W_{i,j,d}} are fixed, Problem JLWO is a convex optimization over the intra-rack bandwidth allocation {w_{i,d}}.

A detailed proof is provided in Appendix A.

Next, we consider the placement sub-problem that minimizes average latency over {S_r} for fixed {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,d}}. In this problem, for each file r in class d we permute the set of racks that contain file r to obtain a new placement S'_r = {β(j), ∀j ∈ S_r}, where β(j) ∈ N is a permutation. The new probability of accessing file r from rack β(j) when the client is at rack i becomes π_{i,β(j),r_d}. Our objective is to find such a permutation that minimizes the average service latency of class-d files, which can be solved via a matching problem between the set of scheduling probabilities {π_{i,j,r_d}, ∀i} and the racks, with respect to their load excluding the contribution of file r. Let Λ^{−r}_{i,j,d} = Λ_{i,j,d} − λ_{i,r_d} π_{i,j,r_d} be the total request rate of class-d files between racks i and j excluding the contribution of file r. We define a complete bipartite graph G_r = (U, V, E) with disjoint vertex sets U, V of equal size N and edge weights given by

  K_{i,j,d} = Σ_{i=1}^{N} Σ_{r in class d} [ λ_{i,r_d} z_{i,r_d} / λ_all + ( ( Λ^{−r}_{i,j,d} + λ_{i,r_d} π_{i,j,r_d} ) / (2 λ_all) ) · f( z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d} π_{i,j,r_d}, B^eff_{i,j,d} ) ].   (26)

It is easy to see that a minimum-weight matching on G_r finds β(j) ∀j to minimize

  Σ_{j=1}^{N} K_{i,β(j),d} = Σ_{j=1}^{N} Σ_{i=1}^{N} Σ_{r in class d} [ λ_{i,r_d} z_{i,r_d} / λ_all + ( ( Λ^{−r}_{i,j,d} + λ_{i,r_d} π_{i,β(j),r_d} ) / (2 λ_all) ) · f( z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d} π_{i,β(j),r_d}, B^eff_{i,j,d} ) ],   (27)

which is exactly the optimization objective of Problem JLWO if a chunk request of class d is scheduled with probability π_{i,β(j),r_d} to a rack with existing load Λ^{−r}_{i,j,d}.

Lemma 5: (Bipartite matching equivalence of the placement sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,d}} are fixed, the optimization of Problem JLWO over the placement variables S_r is equivalent to a balanced bipartite matching problem of size N.

Proof: Based on the above, the placement sub-problem with all other variables fixed is equivalent to finding a minimum-weight matching on a bipartite graph of size N whose edge weights K_{i,j,d} are defined in (26), since the minimum-weight matching minimizes the objective function of Problem JLWO.

The proposed problem in (17)–(25) has five sets of variables: z_{i,r_d}, {S_r}, {π_{i,j,r_d}}, {w_{i,d}}, {W_{i,j,d}}. In order to solve the problem, we use an alternating minimization approach where, in each iteration, one set of variables is optimized while the rest are fixed. The five sub-problems in each iteration are given as:

1) Solve the inter-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), w_{i,d}(t), S_r(t)}:

   W_{i,j,d}(t+1) = arg min_{W_{i,j,d}} (17)  s.t. (18), (19), (23), (24).

This is a convex optimization as shown in Lemma 4, and can be efficiently computed by any off-the-shelf convex solver, e.g., MOSEK [48].

2) Solve the intra-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), W_{i,j,d}(t+1), S_r(t)}:

   w_{i,d}(t+1) = arg min_{w_{i,d}} (17)  s.t. (18), (19), (22), (25).

This is a convex optimization as shown in Lemma 4.

3) Solve the scheduling for given {z_{i,r_d}(t), S_r(t), w_{i,d}(t+1), W_{i,j,d}(t+1)}:

   π_{i,j,r_d}(t+1) = arg min_{π_{i,j,r_d}} (17)  s.t. (19), (20).

This is a convex optimization as shown in Lemma 3.

4) Solve the placement for given {z_{i,r_d}(t), w_{i,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1)}: this problem can be solved using bipartite matching as described in Lemma 5.

5) Solve z_{i,r_d} for given {w_{i,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1), S_r(t+1)}:

   z_{i,r_d}(t+1) = arg min_{z_{i,r_d} ∈ R} (17).

Our proposed algorithm, which solves Problem JLWO by iteratively solving these sub-problems, is summarized in Algorithm JLWO. It generates a sequence of monotonically decreasing objective values and is therefore guaranteed to converge to a stationary point. Notice that the scheduling and bandwidth allocation sub-problems, as well as the minimization over z_{i,r_d} for each file, are convex and can be efficiently computed by any off-the-shelf convex solver, e.g., MOSEK [48]. The placement sub-problem is a balanced bipartite matching that can be solved by the Hungarian algorithm [49] in polynomial time.

Theorem 2: The proposed algorithm generates a sequence of monotonically decreasing objective values and is guaranteed to converge.

Proof: Algorithm JLWO works on the sub-problems iteratively. The scheduling and inter/intra-rack weight-assignment sub-problems are convex optimizations, which generate monotonically decreasing objective values, and the placement sub-problem is an integer optimization that is solved exactly by the Hungarian algorithm. Each iteration of Algorithm JLWO thus solves sub-problems that individually never increase the objective, and since the latency objective is bounded below, Algorithm JLWO is guaranteed to converge to a fixed point of Problem JLWO.

Algorithm JLWO:

  Initialize t = 0, ε > 0. User input C_d.
  Initialize feasible {z_{i,r_d}(0), π_{i,j,r_d}(0), S_r(0)}.
  Initialize feasible O(0), O(−1) from (17).
  while |O(t) − O(t−1)| > ε
    // Solve Sub-problem 1: inter-rack weights W_{i,j,d}(t+1).
    // Solve Sub-problem 2: intra-rack weights w_{i,d}(t+1).
    // Solve Sub-problem 3: scheduling probabilities π_{i,j,r_d}(t+1).
    // Solve Sub-problem 4: placement.
    for d = 1, ..., D
      for r in class d
        Calculate Λ^{−r}_{i,j,d} using {π_{i,j,r_d}(t+1)}.
        Calculate K_{i,j,d} from (26).
        (β(j) ∀j ∈ N) = Hungarian Algorithm({K_{i,j,d}}).
        Update π_{i,β(j),r_d}(t+1) = π_{i,j,r_d}(t) ∀i, j.
        Initialize S_r(t+1) = {}.
        for j = 1, ..., N
          if ∃i s.t. π_{i,j,r_d}(t+1) > 0
            Update S_r(t+1) = S_r(t+1) ∪ {j}.
          end if
        end for
      end for
    end for
    // Solve Sub-problem 5: update the bound variables z_{i,r_d}(t+1) given
    // {w_{i,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1), S_r(t+1)}.
    Update objective value O(t+1) = (17).
    Update t = t + 1.
  end while
  Output: {S_r(t), π_{i,j,r_d}(t), W_{i,j,d}(t), w_{i,d}(t), z_{i,r_d}(t)}
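The outer loop of Algorithm JLWO is easy to express in code once the five sub-problem solvers exist. The skeleton below shows the control flow only; the solve_* callbacks and objective are hypothetical placeholders (e.g., convex solves via MOSEK and the Hungarian matching sketched above), not functions of any library.

```python
def jlwo(state, objective, solve_W, solve_w, solve_pi, solve_placement,
         solve_z, eps=0.01):
    """Alternating minimization over the five JLWO sub-problems.
    'state' bundles (z, S, pi, w, W); each solver returns an updated state
    that does not increase objective(state), so the loop converges."""
    prev, curr = float("inf"), objective(state)
    while prev - curr > eps:
        state = solve_W(state)           # Sub-problem 1: inter-rack weights
        state = solve_w(state)           # Sub-problem 2: intra-rack weights
        state = solve_pi(state)          # Sub-problem 3: scheduling probs
        state = solve_placement(state)   # Sub-problem 4: Hungarian matching
        state = solve_z(state)           # Sub-problem 5: tighten bound over z
        prev, curr = curr, objective(state)
    return state
```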

VI. IMPLEMENTATION AND EVALUATION

To validate the weighted queuing model for D different classes of files in our single-data-center system model and to evaluate its performance, we first show that Algorithm JLWO converges to an optimal solution, and then create a testbed aligned with our system model to evaluate the performance of the proposed algorithm, in particular to validate its effectiveness in providing differentiated latency levels to different tenants.

A. Tahoe Testbed

We implemented the proposed algorithms in Tahoe [50], which is an open-source, distributed file system based on the zfec erasure coding library. It provides three special instances of a generic node: (a) the Tahoe Introducer, which keeps track of a collection of storage servers and clients and introduces them to each other; (b) the Tahoe Storage Server, which exposes attached storage to external clients and stores erasure-coded shares; and (c) the Tahoe Client, which processes upload/download requests and connects to storage servers through a Web-based REST API
defined chunk placement in different racks, and also with Our experiment is done on a Tahoe testbed that consists of our server selection algorithms for placement and request 10 separate physical hosts in an OpenStack cluster, we simu- scheduling in order to minimize joint latency for both classes lated each host as a rack in the cluster. Each machine is hosting of files, we modified the upload and download modules in 11 VM instances, out of which 10 of them as running as Tahoe the Tahoe storage server and client to allow for customized storage servers inside each rack and 1 one of them is running and explicit server selection for both chunk placement and as a client node. Host 1 has 12 VM instances as there is one retrieval, which is specified in the configuration file that is more instance running as Introducer in the Tahoe file system. read by the client when it starts. Finally, Tahoe performance We effectively simulated an OpenStack cluster with 10 racks suffers from its single-threaded design on the client side; we and each with 11 or 12 Tahoe storage servers on each rack. The had to use multiple clients (multiple threads on one client cluster uses a Cisco Catalyst 4948 switch, which has 48 ports. node) in each rack with separate ports to improve parallelism Each port supports non-blocking, 1 Gbps bandwidth in the full and bandwidth usage during our experiments. duplex mode. Our model is very general and can be applied to the case of d classes of files. To simplify the implementation B. Basic Experiment Setup process, for experiments we consider only d = 2 classes of files; each class has a different service rate, so the results will We use (7,4) erasure code in the Tahoe testbed we in- still be representative for cases of differentiated services in troduced above throughout the experiments described in the the single data center. The 2-class weighted-queuing model is implementation section. The algorithm first calculates the supposed to have 2 ∗ N(N − 1) = 180 queues for inter-rack optimal chunk placement through different racks for each class traffic at the aggregate switch; however, bandwidth reservation of files, which will be set up in the client configuration file through ports of the switch is not possible since the Cisco for each write request. File retrieval request scheduling and switch does not support the OpenFlow protocol, so we made weight assignment decisions of each class of files for inter-rack pairwise bandwidth reservations (using a bandwidth control traffic also come from Algorithm JLWO. The system calls a tool from our Cloud QoS platform) between different Tahoe bandwidth reservation tool to reserve the assigned bandwidth Clients and Tahoe storage servers. The Tahoe introducer node BWi,j,d for each class of files based on optimal weights of resides on rack 1 and each rack has a client node with multiple each inter-rack pair; this is also done for intra-rack bandwidth eff Tahoe ports to simulate multiple clients to initiate requests for reservation for each class bi,j wi,d. Total bandwidth capacity different classes of files coming from rack i. Our Tahoe testbed at the aggregate switch is 96 Gbps, 48 ports with 1 Gbps in is shown in Fig 3. each direction since we are simulating host machines as racks Tahoe is an erasure-coded distributed storage system with and VM’s as storage servers. Intra-rack bandwidth as measured some unique properties that make it suitable for storage from iPerf measurement is 792 Mbps, disk read bandwidth for system experiments. 
In Tahoe, each file is encrypted and is then broken into a set of segments, where each segment consists of k blocks. Each segment is then erasure-coded to produce n blocks (using an (n, k) encoding scheme), which are then distributed to (ideally) n storage servers regardless of their placement, whether in the same rack or not. The set of blocks on each storage server constitutes a chunk. Thus, the file equivalently consists of k chunks which are encoded into n chunks, and each chunk consists of multiple blocks³. For chunk placement, the Tahoe client randomly selects a set of available storage servers with enough storage space to store the n chunks. For server selection during file retrieval, the client first asks all known servers for the storage chunks they might have, again regardless of which racks the servers reside in. Once it knows where to find the needed k chunks (from among the fastest servers, chosen with a pseudo-random algorithm), it downloads at least the first segment from those servers. This means that it tends to download chunks from the "fastest" servers purely based on round-trip times (RTT); in contrast, we consider RTT plus expected queuing delay and transfer delay as the measure of latency.

We had to make several changes to Tahoe in order to conduct our experiments. First, we need the number of racks N ≥ n in order to meet the system requirement that each rack can have at most one chunk of the original file. In addition, Tahoe has its own rules for chunk placement and request scheduling, while our experiments require client-defined chunk placement across different racks, together with our server selection algorithms for placement and request scheduling, in order to minimize the joint latency of both classes of files; we therefore modified the upload and download modules in the Tahoe storage server and client to allow customized and explicit server selection for both chunk placement and retrieval, which is specified in the configuration file that is read by the client when it starts. Finally, Tahoe's performance suffers from its single-threaded design on the client side; we had to use multiple clients (multiple threads on one client node) in each rack with separate ports to improve parallelism and bandwidth usage during our experiments.

³If there are not enough servers, Tahoe will store multiple chunks on one server. Also, the term "chunk" we use in this paper is equivalent to the term "share" in Tahoe terminology. The number of blocks in each chunk is equivalent to the number of segments in each file.

B. Basic Experiment Setup

We use a (7,4) erasure code in the Tahoe testbed introduced above throughout the experiments described in this section. The algorithm first calculates the optimal chunk placement across the racks for each class of files, which is set up in the client configuration file for each write request. File retrieval request scheduling and the weight assignment decisions for each class of inter-rack traffic also come from Algorithm JLWO. The system calls a bandwidth reservation tool to reserve the assigned bandwidth B^eff_{i,j,d} for each class of files based on the optimal weights of each inter-rack pair; the same is done for the intra-rack bandwidth reservation of each class, b^eff_i · w_{i,d}. The total bandwidth capacity at the aggregate switch is 96 Gbps (48 ports with 1 Gbps in each direction), since we are simulating host machines as racks and VMs as storage servers. Intra-rack bandwidth as measured with iPerf is 792 Mbps; disk read bandwidth for sequential workloads is 354 Mbps, and write bandwidth is 127 Mbps. Requests are generated based on the arrival rates at each rack and submitted from the client nodes at all racks. A summary of the basic experiment setup is shown in Table II.

TABLE II
TABLE OF EXPERIMENT SETUP

  Racks (Hosts)              10
  Storage Servers per Rack   10
  Clients per Rack           1
  Bandwidth Capacity         96 Gbps
  Erasure Code               (7,4)
  Number of Files            1000 per Class
  Class 1 Weight             C1 = 1
  Class 2 Weight             C2 ∈ {0, 1}

C. Experiments and Evaluation

Convergence of the Algorithm. We implemented Algorithm JLWO using MOSEK, a commercial optimization solver. With 10 racks and 10 simulated distributed storage servers on each rack, there is a total of 100 Tahoe servers in our testbed. Figure 4 demonstrates the convergence of our algorithm, which optimizes the latency of all file requests of the two classes coming from different racks over the weighted queuing model at the aggregate switch: the chunk placement S_r, the load balancing π_{i,j,r_d}, and the bandwidth weight distribution at the aggregate switch and the top-of-rack switches for each class, W_{i,j,d} and w_{i,d}. We fix C1 = 1 and then set C2 = 1 and C2 = 0.3 to see the performance of Algorithm JLWO as the importance of the second class varies. The JLCM algorithm, which is applied as part of our JLWO algorithm, was proven to converge in Theorem 2 of [12]. First, we observe the convergence of the proposed optimized queuing algorithm in Fig. 4. By applying speedup techniques similar to those in [12], our algorithm efficiently solves the optimization problem with r = 1000 files of each class at each of the 10 racks. Second, we can also see that as class 2 becomes as important as class 1, the algorithm converges faster, since it is then effectively solving the problem for one single class of files. The normalized objective is observed to converge within 140 iterations for a tolerance ε = 0.01, where each iteration has an average run time of 1.19 s when running on an 8-core, 64-bit x86 machine; the algorithm therefore converges within 2.77 min on average. To achieve dynamic file management, our optimization algorithm can be executed repeatedly upon file arrivals and departures.

Validation of Experiment Setup. While our service delay bound applies to arbitrary distributions and works for systems hosting any number of files, we first run an experiment to understand the actual service time distribution of both intra-rack and inter-rack retrievals under weighted queuing for the 2 classes of files on our testbed. We uploaded r = 1000 files of size 100 MB for each class, using a (7, 4) erasure code, from the client at each rack, based on the chunk placement output of Algorithm JLWO, and we set C1 = 1 and C2 = 0.4. Inter-rack and intra-rack bandwidth were likewise reserved based on the per-class weights output by the algorithm. We then initiated 1000 file retrieval requests for each class, evenly distributed across the 10 racks (i.e., each rack issues 100 requests for each class), with the requests for each file forming a Poisson process with equal arrival rate 0.000125/s. Thus, the aggregate Poisson arrival rate of all 2000 requests (for 1000 files and two classes of service) is 0.25/s. From the clients distributed in the data center, retrieval requests are scheduled using the algorithm output π_{i,j,d} with the same erasure code. Based on the optimal sets of racks for retrieving the chunks of each file request provided by our algorithm, we obtain measurements of the service time for both the inter-rack and intra-rack processes. The average inter-rack bandwidth over all racks for class-1 requests is 674 Mbps, and the intra/inter-rack bandwidth ratio is 889 Mbps / 674 Mbps = 1.318. The average inter-rack bandwidth over all racks for class-2 requests is 389 Mbps, and the intra/inter-rack bandwidth ratio is 632 Mbps / 389 Mbps = 1.625.
Figure 5 depicts the cumulative distribution function (CDF) of the chunk service time for both intra-rack and inter-rack traffic. We note that intra-rack requests of class 1 have a mean chunk service time of 22 s and inter-rack chunk requests of class 1 have a mean chunk service time of 30 s, a ratio of 1.364, which is very close to the bandwidth ratio of 1.318. Similarly, intra-rack requests of class 2 have a mean chunk service time of 50 s and inter-rack chunk requests of class 2 have a mean chunk service time of 83 s, a ratio of 1.66, which is very close to the bandwidth ratio of 1.625. This means that the chunk service time is nearly proportional to the bandwidth reservation on inter/intra-rack traffic for both classes of files.

Validation of algorithms and joint optimization. In order to validate that Algorithm JLWO works for our system model (with a weighted queuing model for 2 classes of files for inter-rack traffic at the switch), we compare the average latency performance of the two file classes in our model under different access patterns. In this experiment, we have 1000 files for each class, all files have the same size of 200 MB, and the files are evenly distributed across all racks. The aggregate arrival rate for both classes from all racks is 0.25/s, i.e., there is one request for each unique file, so the ratio of requests between the two classes is 1:1, and we set C1 = 1 and C2 = 0.4. We measure the latency of the two classes under the following access patterns: 100% concentration means 100% of the requests for both classes of files concentrate on only one of the racks; similarly, 80% or 50% concentration means 80% or 50% of the total requests for each file come from one rack, with the rest of the requests spread uniformly among the other racks; uniform access means that the requests for both classes are uniformly distributed across all racks. We compare the average latency of these requests for the two classes in each case of the access patterns under our weighted queuing model.

As shown in Fig. 7, the experimental results indicate that our weighted queuing model can effectively mitigate the long latency caused by congestion at one rack for both class 1 and class 2, as there is no significant increase in average latency for either class as the access pattern becomes more concentrated on one rack. We can also see that class 1 achieves relatively lower latency than class 2, since Algorithm JLWO effectively assigns more bandwidth to requests for class-1 files: C1 has the larger value, which means class-1 files play a more important role in the optimization objective. It can also be observed that the difference in average latency between the two classes becomes smaller as the requests become more distributed; this is because Algorithm JLWO tends to assign more bandwidth to class-2 requests when the system is less congested than when it is highly congested, which matches our goal of minimizing the weighted average latency of both classes of files. We also calculated our analytic bound on the average latency of the two classes under the above access patterns. From the figure, we can also see that our analytic bound is tight enough to follow the actual average latency of both class-1 and class-2 files.

Fig. 4. Convergence of Algorithm JLWO with r = 1000 requests for each of the two file classes (objectives T1 + T2 and T1 + 0.3T2) for heterogeneous files from each rack on our 100-node testbed, with different weights on the second file class. Algorithm JLWO efficiently computes the solution within 140 iterations in both cases. (Axes: average latency in seconds vs. number of iterations.)

Fig. 5. Actual service time distribution (empirical CDF) of chunk retrieval through intra-rack and inter-rack traffic for both classes. Each class has 1000 files of size 100 MB using erasure code (7,4), with the aggregate request arrival rate set to λ_i = 0.25/s. The service time distributions indicate that the chunk service time of the two classes is nearly proportional to the bandwidth reservation based on the weight assignments for the M/G/1 queues.

Fig. 6. Average latency for requests of files of size 100 MB, with aggregate request arrival rate 0.25/s for both classes; C2 is varied to validate our algorithms. (Axes: average latency in seconds vs. C2.)

Fig. 7. Comparison of average latency under different access patterns. The experiment is set up with 1000 heterogeneous files for each class; there is one request per file, i.e., 2000 file requests in total, and the request ratio of the two classes is 1:1. The x-axis shows the percentage of these 2000 requests concentrated on the same rack. Aggregate arrival rate 0.25/s, file size 200 MB. Latency is improved significantly with weighted queuing, and the analytic bound for both classes tightly follows the actual latency.

Fig. 8. Evaluation of latency performance for both classes of files with different request arrival rates in weighted queuing. File size 200 MB, with C1 = 1 and C2 = 0.4. Compared with class 2 files, our algorithm provides class 1 files relatively lower latency across heterogeneous request arrival rates. Latency increases as requests arrive more frequently for both classes. Our analytic latency bound, which accounts for both network and queuing delay, tightly follows the actual service latency for both classes.

Fig. 9. Evaluation of different file sizes in our weighted queuing model for both class 1 and class 2 files, with C1 = 1, C2 = 0.4, and an aggregate rate of 0.25/s for all file requests. Compared with class 2 files, our algorithm provides class 1 files relatively lower latency across heterogeneous file sizes. Latency increases with file size for both classes. Our analytic latency bound, which accounts for both network and queuing delay, tightly follows the actual service latency for both classes.

We also calculated our analytic bound on the average latency of the two classes under the above access patterns. From Fig. 7 we can see that the analytic bound is tight enough to follow the actual average latency for both class 1 and class 2 files.

To further validate that the algorithms work for our weighted queuing models for the two classes, we choose r = 1000 files of size 100 MB for each class with an aggregate arrival rate of 0.25/s for all file requests; the request ratio of the two classes is 1:1. We fix C1 = 1, vary C2 from 0 to 2, and run Algorithm JLWO to generate the optimal solution for both file classes. The algorithm provides the chunk placement, the request scheduling, and the inter/intra-rack weight assignment (W_{i,j,d} and w_{i,d}). From Fig. 6 we can see that the latency of class 2 files increases as C2 decreases, i.e., as class 2 becomes less important, while the average latency of class 1 requests decreases as C2 decreases. This shows the expected behavior: as class 2 becomes more important, more weight is allocated to class 2, and whenever C2 is smaller than C1, class 1 receives more bandwidth. It also explains why, when C1 = C2 = 1, the average latencies of the two classes are almost equal, as shown in Fig. 6; this is a special case in which the two tenants have equal weights and can be regarded as a single tenant. Thus, Fig. 4 also provides a comparison between our multi-tenant approach and the single-tenant approach of previous work [11]-[13]. It can further be observed that when C2 = 0 the system minimizes only the latency of class 1 files, so class 2 files suffer a significantly larger latency on average. The difference in latency between class 1 and class 2 is smaller when C2 = 2 than when C2 = 0, since in the former case Algorithm JLWO still optimizes class 1 files as well, though at lower relative importance with C1 = 1.
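The access patterns used in these experiments are easy to reproduce. The sketch below (an illustrative helper of ours, not the testbed code) draws the originating rack of each request, with `concentration` giving the fraction of requests pinned to the hot rack and the remainder spread uniformly over the other racks, and generates Poisson arrival instants at the aggregate rate used above.

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed for reproducibility

    def request_racks(n_requests, n_racks, concentration, hot_rack=0):
        # Probability `concentration` of originating at the hot rack; the
        # remaining probability mass is spread uniformly over other racks.
        probs = np.full(n_racks, (1.0 - concentration) / (n_racks - 1))
        probs[hot_rack] = concentration
        return rng.choice(n_racks, size=n_requests, p=probs)

    def poisson_arrivals(n_requests, rate):
        # Arrival instants of a Poisson process: cumulative sums of i.i.d.
        # exponential inter-arrival times with mean 1/rate (seconds).
        return np.cumsum(rng.exponential(1.0 / rate, size=n_requests))

    # 2000 requests (1000 per class) at aggregate rate 0.25/s, with 80%
    # concentrated on rack 0 of 10; use concentration=0.1 for uniform access.
    racks = request_racks(2000, 10, concentration=0.8)
    times = poisson_arrivals(2000, rate=0.25)
    print(np.bincount(racks, minlength=10))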

Evaluation of the performance of our solution. To demonstrate the effectiveness of our algorithms, we vary the file size in the experiments from 50 MB to 250 MB. We again have 1000 files for each class, and the aggregate arrival rate for both classes from all racks is 0.25/s, so the request ratio of the two classes is again 1:1. We assume uniform random access, i.e., each file is accessed uniformly from the ten racks in the data center with a certain request arrival rate. Upload/download server selection is based on the algorithm outputs S_i and π_{i,j,d}, and inter/intra-rack bandwidth is reserved according to the outputs W_{i,j,d} and w_{i,d} of the optimization algorithm. We then submit r = 1000 requests for each class of files from the clients distributed among the racks. The results in Fig. 9 show that our algorithm provides class 1 files relatively lower latency across heterogeneous file sizes compared to class 2 files, which means Algorithm JLWO effectively assigns more bandwidth to class 1 requests; C1 has the larger value, so class 1 files play the more important role in the optimization objective. For instance, the latency of class 1 files shows a 30% improvement on average over that of class 2 files across the five sample file sizes in this experiment. Latency increases with the requested file size for both classes when the arrival rates are held equal: a larger file means a longer service time, which increases queuing delay and thus average latency. We also observe that our analytic latency bound follows the actual average service latency of both classes in this experiment. We note that the actual service latency involves delay components beyond queuing delay, and the results show that optimizing the proposed latency upper bound also improves the actual latency under the weighted queuing models.
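The effect of service time on queuing delay can be illustrated with the standard mean-value (Pollaczek-Khinchine) formula for an M/G/1 queue. The snippet below is an illustration, not the paper's latency bound: it assumes deterministic chunk service times equal to file size divided by bandwidth, a per-rack arrival rate of 0.025/s (the aggregate 0.25/s split evenly over ten racks, an assumption of ours), and the 674 Mbps inter-rack bandwidth measured earlier. The same formula also shows latency growing with the arrival rate, which is the trend evaluated next.

    def mg1_latency(arrival_rate, file_size_mb, bandwidth_mbps):
        # Mean sojourn time T = E[S] + lambda * E[S^2] / (2 * (1 - rho)) for
        # an M/G/1 queue; deterministic service implies E[S^2] = E[S]**2.
        service = 8.0 * file_size_mb / bandwidth_mbps  # seconds per request
        rho = arrival_rate * service                   # server utilization
        assert rho < 1.0, "queue must be stable (rho < 1)"
        wait = arrival_rate * service ** 2 / (2.0 * (1.0 - rho))
        return service + wait

    for size_mb in (50, 100, 150, 200, 250):      # file sizes of Fig. 9
        print(size_mb, "MB:", round(mg1_latency(0.025, size_mb, 674.0), 3), "s")
    for rate in (0.025, 0.035, 0.045):            # per-rack rates, cf. Fig. 8
        print(rate, "/s:", round(mg1_latency(rate, 200, 674.0), 3), "s")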
Similarly, we vary the aggregate arrival rate at each rack from 0.25/s to 0.45/s, this time fixing the size of all requested files at 200 MB. All other settings are the same as in the previous experiment, whose results are shown in Fig. 9. We again compare the latency of the two file classes under different weights in the optimization objective (C1 = 1 and C2 = 0.4) and assume the same uniform access as before. For the weighted queuing model we use the optimized server selection for the upload and download of each file request and the optimal inter/intra-rack bandwidth reservation from Algorithm JLWO. Clients across the data center submit r = 1000 requests for each file class with an aggregate arrival rate varying from 0.25/s to 0.45/s. From Fig. 8 we can see that our algorithm provides class 1 files relatively lower latency than class 2 files, which means Algorithm JLWO effectively assigns more bandwidth to class 1 requests. For instance, the latency of class 1 files shows a 32% improvement on average over that of class 2 files across the five sample aggregate arrival rates in this experiment. Further, as the arrival rate increases and there is more contention at the queues, this improvement becomes more significant; our algorithm thus mitigates traffic contention and reduces latency very efficiently. The average latency of both classes also increases with the request arrival rate at each rack, since waiting times grow with the workload.

D. Summary of Evaluation Results

A summary of the main results of the above experiments is shown in Table III.

TABLE III
SUMMARY OF EVALUATION RESULTS

    Rate of convergence                                    140 iterations, 2.77 min
    Intra/inter-rack bandwidth ratio for classes 1, 2      1.318, 1.625
    Intra/inter-rack service time ratio for classes 1, 2   1.314, 1.55
    Latency vs. access patterns                            Latency decreases as the workload becomes less concentrated
    Latency vs. file sizes/workload                        Latency of both classes increases as file size/workload increases

The convergence time of Algorithm JLWO is a relatively small overhead for 2000 files, and the algorithm is expected to converge at an acceptable speed for large storage systems. The intra/inter-rack bandwidth and service time ratios are very close for both classes, which means the chunk service time is nearly proportional to the bandwidth reservation on inter/intra-rack traffic for both classes of files; this validates the experiment setup, and an algorithm that works on this platform should also work on other erasure-coded storage systems. When testing latency under various access patterns, class 1 requests outperform class 2 requests for every access pattern, which validates the optimization performed by Algorithm JLWO, and the latency bound is very tight, which verifies that our latency analysis is accurate. In the experiments with varying file sizes and workloads, queuing delay and latency grow as the service time increases, and class 1 requests outperform class 2 requests at every file size and workload, validating that Algorithm JLWO assigns more resources to the more important tenants; the tight upper bound again confirms our latency analysis.

VII. CONCLUSIONS

This paper considers an optimized erasure-coded single data center storage solution with differentiated services. The weighted latency of all file requests is optimized over the placement and access of the erasure-coded contents of the files and the bandwidth reservation at the switches for the different service classes. The proposed algorithm achieves a significant reduction in average latency given knowledge of the file access patterns. This paper thus demonstrates that knowing the access probabilities of different files from different racks leads to an optimized differentiated storage system solution that performs significantly better than a storage solution which ignores the file access patterns.

APPENDIX
PROOF OF LEMMA 4

Since all constraints in Problem JLWO are linear with respect to the weights $\{W_{i,j,d}\}$, we only need to show that the optimization objective $C_d \bar{T}_d = C_d \frac{\Lambda_{i,j,d}}{2\lambda_{\mathrm{all}}} f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in $W_{i,j,d}$, i.e., that $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in $\{W_{i,j,d}\}$ with the other variables fixed. Notice that the effective bandwidth $B^{\mathrm{eff}}_{i,j,d}$ is a linear function of the bandwidth allocation weights $\{W_{i,j}\}$ for both the inter-rack traffic queues (23) and the intra-rack traffic queues (24) when $\{w_{i,d}\}$ is fixed. Therefore, $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in $\{W_{i,j,d}\}$ if it is convex in $B^{\mathrm{eff}}_{i,j,d}$. Toward this end, we consider $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d}) = H_{i,j,d} + \sqrt{H_{i,j,d}^2 + G_{i,j,d}}$ as given in (12), (13) and (14).

Finally, to prove that $f = H_{i,j,d} + \sqrt{H_{i,j,d}^2 + G_{i,j,d}}$ is convex in $B^{\mathrm{eff}}_{i,j,d}$, we have

$$\frac{\partial^2 f}{\partial (B^{\mathrm{eff}}_{i,j,d})^2} = \frac{\partial^2 H_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2} + \frac{H_{i,j,d}\frac{\partial^2 H_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2} + G_{i,j,d}\frac{\partial^2 G_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2}}{(H_{i,j,d}^2 + G_{i,j,d}^2)^{1/2}} + \frac{\left(H_{i,j,d}\frac{\mathrm{d} G_{i,j,d}}{\mathrm{d} B^{\mathrm{eff}}_{i,j,d}} + G_{i,j,d}\frac{\mathrm{d} H_{i,j,d}}{\mathrm{d} B^{\mathrm{eff}}_{i,j,d}}\right)^2}{(H_{i,j,d}^2 + G_{i,j,d}^2)^{3/2}}, \qquad (28)$$

from which we can see that for $\frac{\partial^2 f}{\partial (B^{\mathrm{eff}}_{i,j,d})^2}$ to be positive we only need $\frac{\partial^2 H_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2}$ and $\frac{\partial^2 G_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2}$ to be positive. We find the second-order derivative of $H_{i,j,d}$ with respect to $B^{\mathrm{eff}}_{i,j,d}$:

$$\frac{\partial^2 H_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2} = \frac{\Lambda_{i,j,d}\Gamma_2\left(3(B^{\mathrm{eff}}_{i,j,d})^2 - 3\Lambda_{i,j,d}\mu B^{\mathrm{eff}}_{i,j,d} - 1\right)}{(B^{\mathrm{eff}}_{i,j,d})^3\,(B^{\mathrm{eff}}_{i,j,d} - \Lambda_{i,j,d}\mu)^3}, \qquad (29)$$

which is positive as long as $1 - \Lambda_{i,j,d}\mu / B^{\mathrm{eff}}_{i,j,d} > 0$. This is indeed true because $\rho = \Lambda_{i,j,d}\mu / B^{\mathrm{eff}}_{i,j,d} < 1$ in M/G/1 queues. Thus, $H_{i,j,d}$ is convex in $B^{\mathrm{eff}}_{i,j,d}$. Next, considering $G_{i,j,d}$, we have

$$\frac{\partial^2 G_{i,j,d}}{\partial (B^{\mathrm{eff}}_{i,j,d})^2} = \frac{p\,(B^{\mathrm{eff}}_{i,j,d})^3 + q\,(B^{\mathrm{eff}}_{i,j,d})^2 + s\,B^{\mathrm{eff}}_{i,j,d} + t}{6\,(B^{\mathrm{eff}}_{i,j,d})^4\,(B^{\mathrm{eff}}_{i,j,d} - \Lambda_{i,j,d}\mu)^4}, \qquad (30)$$

where the auxiliary variables are given by

$$p = 24\Lambda_{i,j,d}\Gamma_3,$$
$$q = 2\Lambda_{i,j,d}\left(15\Gamma_2^2 - 28\Lambda_{i,j,d}\mu\Gamma_3\right),$$
$$s = 2\Lambda_{i,j,d}\mu\left(22\Lambda_{i,j,d}\mu\Gamma_3 - 15\Gamma_2^2\right),$$
$$t = 3\Lambda_{i,j,d}\mu^3\left(3\Gamma_2^2 - 4\Lambda_{i,j,d}\mu\Gamma_3\right),$$

so that $p(B^{\mathrm{eff}}_{i,j,d})^3 + q(B^{\mathrm{eff}}_{i,j,d})^2 + sB^{\mathrm{eff}}_{i,j,d} + t > 0$ whenever $B^{\mathrm{eff}}_{i,j,d} > \Lambda_{i,j,d}\mu$, which is equivalent to $1 - \Lambda_{i,j,d}\mu / B^{\mathrm{eff}}_{i,j,d} > 0$ and has been proved above. Thus $G_{i,j,d}$ is also convex in $B^{\mathrm{eff}}_{i,j,d}$.

Now that $\frac{\partial^2 f}{\partial (B^{\mathrm{eff}}_{i,j,d})^2}$ is positive, we conclude that the composition $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in $B^{\mathrm{eff}}_{i,j,d}$ and thus convex in $W_{i,j,d}$, so the objective is convex in $W_{i,j,d}$ as well.

Similarly, when $\{z_{i,r_d}, \pi_{i,j,r_d}, S_r, W_{i,j,d}\}$ are held constant, $B^{\mathrm{eff}}_{i,j,d}$ is again convex in $\{w_{i,d}\}$, and the above proof that $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in $B^{\mathrm{eff}}_{i,j,d}$ still applies, so $f(z_{i,r_d}, \Lambda_{i,j,d}, B^{\mathrm{eff}}_{i,j,d})$ is convex in the intra-rack bandwidth allocation $\{w_{i,d}\}$ as well. This completes the proof.
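As a numerical sanity check of this convexity argument, the snippet below verifies by second differences that $f(B) = H(B) + \sqrt{H(B)^2 + G(B)}$ is convex in the effective bandwidth on the stable region $B > \Lambda\mu$. The specific forms of H and G used here are assumed M/G/1-style placeholders of ours; the paper's exact expressions are those of (12)-(14) and are not reproduced in this sketch.

    import numpy as np

    lam, mu = 0.5, 1.0  # assumed arrival rate and mean chunk size

    def H(B):
        # Queuing-delay-style term; positive and finite only for B > lam*mu.
        return lam / (2.0 * B * (B - lam * mu))

    def G(B):
        # Variance-style term with the same stability region.
        return lam / (3.0 * B**2 * (B - lam * mu))

    def f(B):
        return H(B) + np.sqrt(H(B)**2 + G(B))

    B = np.linspace(0.7, 5.0, 500)  # every point satisfies B > lam*mu = 0.5
    second_diff = f(B[:-2]) - 2.0 * f(B[1:-1]) + f(B[2:])
    print("min second difference:", second_diff.min())  # nonnegative => convex
    assert second_diff.min() > -1e-9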
REFERENCES

[1] E. Schurman and J. Brutlag, "The user and business impact of server delays, additional bytes and HTTP chunking in web search," O'Reilly Velocity Web Performance and Operations Conference, June 2009.
[2] M. Alicherry and T. V. Lakshman, "Optimizing data access latencies in cloud systems by intelligent virtual machine placement," in Proc. IEEE INFOCOM, April 2013.
[3] B. Dong, Q. Zheng, F. Tian, K. M. Chao, R. Ma and R. Anane, "An optimized approach for storing and accessing small files on cloud storage," Journal of Network and Computer Applications, vol. 35, no. 6, Nov. 2012.
[4] S. K. Barker and P. Shenoy, "Empirical evaluation of latency-sensitive application performance in the cloud," in Proc. First Annual ACM SIGMM Conference on Multimedia Systems (MMSys), 2010.
[5] M. Vrable, S. Savage and G. M. Voelker, "BlueSky: A Cloud-backed File System for the Enterprise," in Proc. 10th USENIX Conference on File and Storage Technologies (FAST), 2012.
[6] A. Prahlad, M. S. Muller, P. Kottomtharayil, S. Kavuri and P. Gokhale, "Performing data storage operations with a cloud storage environment, including automatically selecting among multiple cloud storage sites," US Patent 20100332401 A1, 2010.
[7] D. Shue, M. J. Freedman and A. Shaikh, "Performance isolation and fairness for multi-tenant cloud storage," in Proc. 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Oct. 2012.
[8] A. Greenberg, J. Hamilton, D. A. Maltz and P. Patel, "The cost of a cloud: research problems in data center networks," ACM SIGCOMM Computer Communication Review, vol. 39, no. 1, pp. 68-73, 2008.
[9] M. Herlich and H. Karl, "Average and Competitive Analysis of Latency and Power Consumption of a Queuing System with a Sleep Mode," in Proc. 3rd International Conference on Future Energy Systems: Where Energy, Computing and Communication Meet, 2012.
[10] L. Huang, S. Pawar, H. Zhang and K. Ramchandran, "Codes can reduce queueing delay in data centers," in Proc. IEEE International Symposium on Information Theory (ISIT), 2012.
[11] Y. Xiang, T. Lan, V. Aggarwal and Y. R. Chen, "Joint Latency and Cost Optimization for Erasure-coded Data Center Storage," IEEE/ACM Transactions on Networking, vol. 24, no. 4, pp. 2443-2457, Aug. 2016.
[12] Y. Xiang, T. Lan, V. Aggarwal and Y. R. Chen, "Joint Latency and Cost Optimization for Erasure-coded Data Center Storage," ACM SIGMETRICS Performance Evaluation Review, vol. 42, no. 2, pp. 3-14, 2014.
[13] Y. Xiang, V. Aggarwal, Y. R. Chen and T. Lan, "Taming Latency in Data Center Networking with Erasure Coded Files," in Proc. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2015.
[14] Y. Xiang, T. Lan, V. Aggarwal and Y. F. R. Chen, "Multi-tenant Latency Optimization in Erasure-Coded Storage with Differentiated Services," in Proc. IEEE 35th International Conference on Distributed Computing Systems (ICDCS), 2015.
[15] C. Fiandrino, D. Kliazovich, P. Bouvry and A. Zomaya, "Performance and Energy Efficiency Metrics for Communication Systems of Cloud Computing Data Centers," IEEE Transactions on Cloud Computing, doi: 10.1109/TCC.2015.2424892, April 2015.
[16] A. G. Dimakis, V. Prabhakaran and K. Ramchandran, "Distributed data storage in sensor networks using decentralized erasure codes," in Proc. Asilomar Conference on Signals, Systems and Computers, 2004.
[17] M. K. Aguilera, R. Janakiraman and L. Xu, "Using Erasure Codes Efficiently for Storage in a Distributed System," in Proc. 2005 International Conference on Dependable Systems and Networks (DSN), pp. 336-345, 2005.
[18] H. Kameyama and Y. Sato, "Erasure Codes with Small Overhead Factor and Their Distributed Storage Applications," in Proc. 41st Annual Conference on Information Sciences and Systems (CISS), 2007.
[19] J. Li, "Adaptive Erasure Resilient Coding in Distributed Storage," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2006.
[20] C. Anglano, R. Gaeta and M. Grangetto, "Exploiting Rateless Codes in Cloud Storage Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 26, pp. 1313-1322, 2014.

[21] L. Huang, S. Pawar, H. Zhang and K. Ramchandran, "Codes Can Reduce Queueing Delay in Data Centers," CoRR, 2012.
[22] N. Shah, K. Lee and K. Ramchandran, "The MDS queue: analyzing latency performance of erasure codes," in Proc. IEEE International Symposium on Information Theory (ISIT), July 2014.
[23] G. Joshi, Y. Liu and E. Soljanin, "On the Delay-Storage Trade-off in Content Download from Coded Distributed Storage Systems," arXiv:1305.3945v1, May 2013.
[24] C. Tian, B. Sasidharan, V. Aggarwal, V. A. Vaishampayan and P. V. Kumar, "Layered Exact-Repair Regenerating Codes via Embedded Error Correction and Block Designs," IEEE Transactions on Information Theory, vol. 61, pp. 1933-1947, April 2015.
[25] D. Boru, D. Kliazovich, F. Granelli, P. Bouvry and A. Y. Zomaya, "Energy-efficient Data Replication in Cloud Computing Datacenters," Cluster Computing, vol. 18, pp. 385-402, March 2015.
[26] V. Aggarwal, C. Tian, V. A. Vaishampayan and Y. R. Chen, "Distributed Data Storage Systems with Opportunistic Repair," in Proc. IEEE International Conference on Computer Communications (INFOCOM), April 2014.
[27] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li and S. Yekhanin, "Erasure Coding in Windows Azure Storage," in Proc. USENIX Annual Technical Conference, 2012.
[28] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright and K. Ramchandran, "Network Coding for Distributed Storage Systems," IEEE Transactions on Information Theory, vol. 56, no. 9, September 2010.
[29] Z. Zhang, A. Deshpande, X. Ma, E. Thereska and D. Narayanan, "Does erasure coding have a role to play in my data center?," Microsoft Research Technical Report MSR-TR-2010-52, 2010.
[30] X. Wang, Z. Xiao, J. Han and C. Han, "Reliable Multicast Based on Erasure Resilient Codes over InfiniBand," in Proc. First International Conference on Communications and Networking in China, 2006.
[31] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. Larus and A. Greenberg, "Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services," Performance Evaluation, vol. 68, pp. 1056-1071, 2011.
[32] M. Bramson, Y. Lu and B. Prabhakar, "Randomized load balancing with general service time distributions," in Proc. ACM SIGMETRICS, 2010.
[33] A. Fallahi and E. Hossain, "Distributed and energy-aware MAC for differentiated services wireless packet networks: a general queuing analytical framework," IEEE Transactions on Mobile Computing, vol. 6, pp. 381-394, 2007.
[34] A. S. Alfa, "Matrix-geometric solution of discrete time MAP/PH/1 priority queue," Naval Research Logistics, vol. 45, pp. 23-50, 1998.
[35] N. E. Taylor and Z. G. Ives, "Reliable storage and querying for collaborative data sharing systems," in Proc. IEEE 26th International Conference on Data Engineering (ICDE), 2010.
[36] J. H. Kim and J. K. Lee, "Performance of carrier sense multiple access with collision avoidance in wireless LANs," Wireless Personal Communications, vol. 11, pp. 161-183, 1999.
[37] E. Ziouva and T. Antonakopoulos, "CSMA/CA performance under high traffic conditions: throughput and delay analysis," Computer Communications, vol. 25, pp. 313-321, 2002.
[38] S. Mochan and L. Xu, "Quantifying Benefit and Cost of Erasure Code based File Systems," technical report, available at http://nisl.wayne.edu/Papers/Tech/cbefs.pdf, 2010.
[39] H. Weatherspoon and J. D. Kubiatowicz, "Erasure Coding vs. Replication: A Quantitative Comparison," in Revised Papers from the First International Workshop on Peer-to-Peer Systems (IPTPS '01), pp. 328-338, 2002.
[40] S. Chen, Y. Sun, U. C. Kozat, L. Huang, P. Sinha, G. Liang, X. Liu and N. B. Shroff, "When Queuing Meets Coding: Optimal-Latency Data Retrieving Scheme in Storage Clouds," in Proc. IEEE INFOCOM, 2014.
[41] F. Baccelli, A. Makowski and A. Shwartz, "The fork-join queue and related systems with synchronization constraints: stochastic ordering and computable bounds," Advances in Applied Probability, vol. 21, pp. 629-660, 1989.
[42] A. Kumar, R. Tandon and T. C. Clancy, "On the Latency of Erasure-Coded Cloud Storage Systems," arXiv:1405.2833, 2014.
[43] D. Bertsimas and K. Natarajan, "Tight bounds on Expected Order Statistics," Probability in the Engineering and Informational Sciences, 2006.
[44] M. Ovsiannikov, S. Rus, D. Reeves, P. Sutter, S. Rao and J. Kelly, "The Quantcast File System," Proceedings of the VLDB Endowment, vol. 6, pp. 1092-1101, 2013.
[45] A. Abdelkefi and Y. Jiang, "A Structural Analysis of Network Delay," in Proc. Ninth Annual Communication Networks and Services Research Conference (CNSR), 2011.
[46] F. Paganini, A. Tang, A. Ferragut and L. L. H. Andrew, "Network Stability Under Alpha Fair Bandwidth Allocation With General File Size Distribution," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 579-591, March 2012.
[47] A. B. Downey, "The structural cause of file size distributions," in Proc. Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2011.
[48] MOSEK, "MOSEK: High performance software for large-scale LP, QP, SOCP, SDP and MIP," available online at http://www.mosek.com/.
[49] Hungarian Algorithm, available online at http://www.hungarianalgorithm.com.
[50] B. Warner, Z. Wilcox-O'Hearn and R. Kinninmont, "Tahoe-LAFS docs," available online at https://tahoe-lafs.org/trac/tahoe-lafs.

Yu Xiang received the B.A.Sc. degree from Harbin Institute of Technology and is currently a doctoral student in the Department of Electrical and Computer Engineering at George Washington University. Her current research interests are in cloud resource optimization and distributed storage systems.

Vaneet Aggarwal (S'08-M'11-SM'15) received the B.Tech. degree in 2005 from the Indian Institute of Technology, Kanpur, India, and the M.A. and Ph.D. degrees in 2007 and 2010, respectively, from Princeton University, Princeton, NJ, USA, all in Electrical Engineering. He is currently an Assistant Professor at Purdue University, West Lafayette, IN. Prior to this, he was a Senior Member of Technical Staff Research at AT&T Labs-Research, NJ, and an Adjunct Assistant Professor at Columbia University, NY. His research interests are in applications of information and coding theory to distributed storage systems, machine learning, and wireless systems. Dr. Aggarwal was the recipient of Princeton University's Porter Ogden Jacobus Honorific Fellowship in 2009. In addition, he received the AT&T Key Contributor Award in 2013, the AT&T Vice President Excellence Award in 2012, and the AT&T Senior Vice President Excellence Award in 2014.

Yih-Farn (Robin) Chen is a Director of Inventive Science, leading the Cloud Platform Software Research department at AT&T Labs-Research. His current research interests include cloud computing, software-defined storage, mobile computing, distributed systems, the World Wide Web, and IPTV. He holds a Ph.D. in Computer Science from the University of California at Berkeley, an M.S. in Computer Science from the University of Wisconsin-Madison, and a B.S. in Electrical Engineering from National Taiwan University. Robin is an ACM Distinguished Scientist and a Vice Chair of the International World Wide Web Conferences Steering Committee (IW3C2). He also serves on the editorial board of IEEE Internet Computing.

Tian Lan (S'03-M'10) received the B.A.Sc. degree from Tsinghua University, China, in 2003, the M.A.Sc. degree from the University of Toronto, Canada, in 2005, and the Ph.D. degree from Princeton University in 2010. Dr. Lan is currently an Associate Professor of Electrical and Computer Engineering at the George Washington University. His research interests include cloud resource optimization, mobile networking, storage systems, and cyber security. Dr. Lan received the 2008 IEEE Signal Processing Society Best Paper Award, the 2009 IEEE GLOBECOM Best Paper Award, and the 2012 IEEE INFOCOM Best Paper Award.