
Differentiated latency in data center networks with erasure coded files through traffic engineering

Yu Xiang, Vaneet Aggarwal, Yih-Farn R. Chen, and Tian Lan

Abstract—This paper proposes an algorithm to minimize the weighted service latency for different classes of tenants (or service classes) in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Because bandwidth is limited at both top-of-the-rack (TOR) and aggregation switches, and because tenants in different service classes need differentiated services, network bandwidth must be apportioned among different intra- and inter-rack data flows for different service classes in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form upper bound on service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a novel joint weighted-latency (over different service classes) optimization over three entangled "control knobs": the bandwidth allocation at TOR and aggregation switches for different service classes, the dynamic scheduling of file requests, and the placement of encoded file chunks (i.e., data locality). The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm that decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster using 48-port Cisco Catalyst switches. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insights on designing low-latency data center networks with erasure-coded storage.

I. INTRODUCTION

Data center storage is growing at an exponential speed, and customers increasingly demand more flexible services that exploit the tradeoffs among reliability, performance, and storage cost. Erasure coding has quickly emerged as a very promising technique to reduce the cost of storage while providing reliability similar to that of replicated systems. The effect of coding on the latency of file access in erasure-coded storage is therefore drawing more attention: Google and Amazon have published that every 500 ms of extra delay means a 1.2% user loss [1]. Modern data centers often adopt a hierarchical architecture [37]. As encoded file chunks are distributed across the system, accessing the files requires - in addition to data chunk placement and dynamic request scheduling - intelligent traffic engineering solutions to handle: 1) data transfers between racks that go through aggregation switches, and 2) data transfers within a rack that go through Top-of-Rack (TOR) switches.

Quantifying the exact latency of an erasure-coded storage system is an open problem. Recently, there have been a number of attempts at finding latency bounds for erasure-coded storage [13], [16], [14], [28], [23]. However, little effort has been made to address the impact of data center networking on service latency.

In this paper, we utilize the probabilistic scheduling policy developed in [28] and analyze the service latency of erasure-coded storage with respect to data center topologies and traffic engineering. To the best of our knowledge, this is the first work that accounts for both erasure coding and data center networking in latency analysis. Our result provides a tight upper bound on the mean service latency of erasure-coded storage and applies to general data center topologies. It allows us to construct quantitative models and find solutions to a novel latency optimization leveraging both the placement of erasure-coded chunks on different racks and the bandwidth reservations at different switches.

We consider a data center storage system with a hierarchical structure. Each rack has a TOR switch that is responsible for routing data flows between the disks and associated storage servers in the rack, while data transfers between different racks are managed by an aggregation switch that connects all TOR switches. Multiple client files are stored distributively using an (n, k) erasure coding scheme, which allows each file to be reconstructed from any k-out-of-n encoded chunks. We assume that file access requests may be generated from anywhere inside the data center, e.g., by a virtual machine spun up by a client on any of the racks, and we consider erasure-coded storage with multiple tenants and differentiated delay demands.
While customizing elastic service latency for the tenants is undoubtedly appealing to cloud storage, it also comes with great technical challenges and calls for a new framework for delivering, quantifying, and optimizing differentiated service latency in general erasure-coded storage.

Due to the limited bandwidth available at both the TOR and aggregation switches, a simple First-Come-First-Serve (FCFS) policy that schedules all file requests fails to deliver satisfactory latency performance, not only because the policy lacks the ability to adapt to varying traffic patterns, but also because of the need to apportion networking resources according to heterogeneous service requirements. Thus, we study a weighted queuing policy at the TOR and aggregation switches, which partitions tenants into different service classes based on their delay requirements and applies a differentiated management policy to the file requests generated by tenants in each service class. We assume that the file requests submitted by tenants in each class are buffered and processed in a separate FCFS queue at each switch. The service rates of these queues are tunable, with the constraints that they are non-negative and sum to at most the switch/link capacity. Tuning these weights allows us to provide differentiated service rates to tenants in different classes, thereby assigning differentiated service latency to tenants.

In particular, a performance hit in erasure-coded storage systems comes when dealing with hot network switches. The latency of each file request is determined by the maximum delay in retrieving its k-out-of-n encoded chunks.
Without proper coordination in processing each batch of chunk requests that jointly reconstructs a file, service latency is dominated by staggering chunk requests with the worst access latency (e.g., chunk requests processed by heavily congested switches), significantly increasing overall latency in the data center. To avoid this, bandwidth reservations can be made for routing traffic among racks [24], [25], [26]. Thus, we apportion bandwidth at the TOR and aggregation switches so that each flow has its own FCFS queue for the data transfer, and we jointly optimize bandwidth allocation and data locality (i.e., the placement of encoded file chunks) to minimize service latency.

For a given set of pairwise bandwidth allocations among racks and a given placement of the erasure-coded files across racks, we first find an upper bound on the mean latency of each file by applying the probabilistic scheduling policy proposed in [28]. Having constructed this quantitative latency model, we consider a joint optimization of average service latency (weighted by the request arrival rates and the weights of the service classes) over the placement of the contents of each file, the bandwidth reservation between any pair of racks (at the TOR and aggregation switches), and the scheduling probabilities. This optimization is shown to be a mixed-integer optimization, in particular due to the integer constraints for the content placement. To tackle this challenge, the latency minimization is decoupled into three sub-problems, two of which are proven to be convex, and we propose an algorithm that iteratively minimizes service latency over the three engineering degrees of freedom with guaranteed convergence.

To validate our theoretical analysis and the joint latency optimization for different tenants, we prototype the proposed algorithms in Tahoe [38], an open-source, distributed file system based on the zfec erasure coding library for fault tolerance.
A Tahoe storage system consisting of 10 racks is deployed on hosts of virtual machines in an OpenStack-based data center environment, with each rack hosting 10 Tahoe storage servers, and each rack has a client node deployed to issue storage requests. The experimental results show that the proposed algorithm converges within a reasonable number of iterations. We further find that the service time distribution is nearly proportional to the bandwidth of the server, which is an assumption used in the latency analysis, and the implementation results show that our proposed approach significantly improves service latency compared to the native storage settings of the testbed.

II. SYSTEM MODEL

The main notation used in this paper is summarized in Table I.

TABLE I: MAIN NOTATION
N: number of racks, indexed by i = 1, ..., N
m: number of storage servers in each rack
R: number of files in the system, indexed by r = 1, ..., R
(n, k): erasure code for storing files
D: number of service classes
B: total available bandwidth at the aggregate switch
b: total available bandwidth at each top-of-the-rack switch
b^eff_{i,j}: effective bandwidth for requests from node i to node j in the same rack
B^eff_{i,j,d}: effective bandwidth for servicing class-d requests from rack i to rack j
w_{i,j,d}: weight for apportioning top-of-rack switch bandwidth for service class d
W_{i,j,d}: weight for apportioning aggregate switch bandwidth for service class d
λ_{i,r_d}: arrival rate of requests for file r of service class d from rack i
π_{i,j,r_d}: probability of routing a rack-i, file-r, service-class-d request to rack j
S_{r_d}: set of racks for placing encoded chunks of file r of service class d
N_{i,j}: connection delay for service from rack i to rack j
Q_{i,j,d}: queuing delay for service of class d from rack i to rack j
T̄_{i,r_d}: expected latency of a request for file r of class d from rack i

We consider a data center consisting of N racks, denoted by N = {1, 2, ..., N}, each equipped with m homogeneous servers that are available for data storage and running applications. A Top-of-Rack (TOR) switch connected to each rack routes intra-rack traffic, and an aggregate switch connecting all TOR switches routes inter-rack traffic. A set of files R = {1, 2, ..., R} are stored distributedly among these m·N servers and are accessed by client applications. Our goal is to minimize the overall file access latency of all applications to improve the performance of the cloud system.

This work focuses on erasure-coded file storage systems, where files are erasure-coded and stored distributedly to save space while providing the same data durability as replication systems. More precisely, each file r is split into k fixed-size chunks and then encoded using an (n, k) erasure code to generate n chunks of the same size, which are assigned to and stored on n out of the m·N distinct servers in the data center. While it is possible to place multiple chunks of the same file in the same rack (using the techniques in [28], our results can easily be extended to this case), we assume that the n servers storing file r belong to distinct racks, so as to maximize the distribution of chunks across all racks, which in turn provides the highest reliability against independent rack failures. This is indeed a common practice adopted by file systems such as QFS [12], an erasure-coded storage file system designed to replace HDFS for Map/Reduce processing.

For each file r, we use S_r to denote the set of racks hosting encoded chunks of file r, satisfying S_r ⊆ N and |S_r| = n. To access file r, an (n, k) Maximum-Distance-Separable (MDS) erasure code allows the file to be reconstructed from any subset of k-out-of-n chunks. Therefore, a file request generated by a client application must be routed to a subset of k racks in S_r for successful processing. To minimize latency, the selection of these k racks needs to be optimized for each file request on the fly to balance load among the racks and avoid creating "hot" switches that suffer from heavy congestion and cause high access latency. We refer to this routing and rack selection problem as the request scheduling problem.
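To make the k-out-of-n MDS property concrete, the following toy sketch implements a small Reed-Solomon style (n, k) code over the prime field GF(257) and checks that any k of the n encoded symbols recover the data. This is only an illustration of the property, with parameters and helper names of our own choosing; Tahoe's actual coding uses the zfec library.

```python
# Toy illustration of the "any k-out-of-n" MDS property (an illustrative
# Reed-Solomon style construction over GF(257), not the zfec implementation
# Tahoe relies on). The k data symbols are the coefficients of a
# degree-(k-1) polynomial; the n encoded symbols are its evaluations at n
# distinct points, so any k evaluations recover it by interpolation.
P = 257  # prime field size, large enough to hold one byte per symbol

def encode(data, n):
    # data: k field elements; returns n encoded symbols f(1), ..., f(n).
    return [sum(c * pow(x, e, P) for e, c in enumerate(data)) % P
            for x in range(1, n + 1)]

def reconstruct(points, k):
    # points: any k pairs (x, f(x)); returns the k polynomial coefficients.
    coeffs = [0] * k
    for x_i, y_i in points:
        basis, denom = [1], 1          # Lagrange basis polynomial for x_i
        for x_j, _ in points:
            if x_j != x_i:
                # multiply the basis polynomial by (x - x_j)
                basis = [(a - x_j * b) % P
                         for a, b in zip([0] + basis, basis + [0])]
                denom = denom * (x_i - x_j) % P
        scale = y_i * pow(denom, P - 2, P) % P   # divide by denom (Fermat)
        for e in range(k):
            coeffs[e] = (coeffs[e] + scale * basis[e]) % P
    return coeffs

k, n = 4, 7
data = [11, 22, 33, 44]
chunks = encode(data, n)
subset = [(x, chunks[x - 1]) for x in (2, 3, 5, 7)]  # any 4 of the 7
assert reconstruct(subset, k) == data
```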
In this paper, we assume that the tenants' files are divided into D service classes, denoted by class d = 1, 2, ..., D. Files in different service classes have different latency sensitivities and requirements. Series of requests are generated by client applications to access the R files of the different classes; we model the arrival of requests for each file r in class d as a Poisson process, and to emphasize a file's service class we denote file r in class d by r_d. Let λ_{i,r_d} be the rate of file-r_d requests generated by client applications running in rack i, so that a file-r_d request is generated from rack i with probability λ_{i,r_d}/Σ_i λ_{i,r_d}, and the overall request arrival for file r_d is a superposition of Poisson processes with aggregate rate λ_{r_d} = Σ_i λ_{i,r_d}. Since solving the optimal request scheduling for latency minimization is still an open problem for general erasure-coded storage systems [13], [16], [14], [28], we employ the probabilistic scheduling policy proposed in [28] to derive an outer bound on service latency, which in turn enables a practical solution to the request scheduling problem.

Upon the arrival of each request, a probabilistic scheduler selects k-out-of-n racks in S_r (which host the desired file chunks) according to some known probability and routes the resulting traffic to the client application accordingly. It is shown in [28] that determining the probability distribution of each k-out-of-n combination of racks is equivalent to solving for the marginal probabilities of scheduling the requests λ_{i,r_d}:

π_{i,j,r_d} = P[ j ∈ S_r is selected | k racks are selected ],   (1)

under the constraints π_{i,j,r_d} ∈ [0, 1] and Σ_j π_{i,j,r_d} = k [28]. Here π_{i,j,r_d} represents the scheduling probability of routing a class-d request for file r from rack i to rack j, and the constraint Σ_j π_{i,j,r_d} = k ensures that exactly k distinct racks are selected to process each file request. It is easy to see that the chunk placement problem can now be rewritten with respect to π_{i,j,r_d}: π_{i,j,r_d} = 0 for all i equivalently means that no chunk of file r exists on rack j, i.e., j ∉ S_r, while a chunk is placed on rack j only if π_{i,j,r_d} > 0.

To accommodate the network traffic generated by applications in different service classes, the bandwidth available at the TOR and aggregation switches must be apportioned among the different flows in a coordinated fashion. We propose a weighted queuing model for bandwidth allocation that enables differentiated latency QoS for requests belonging to different classes. In particular, at each rack j, we buffer all incoming class-d requests generated by applications in rack i in a local queue, denoted by q_{i,j,d}. Each rack j therefore manages N·D independent queues: D queues that manage intra-rack traffic traveling through the TOR switch, and (N−1)·D queues that manage inter-rack traffic of class d originating from the other racks i ≠ j.

Assume the total bi-directional bandwidth at the aggregate switch is B, which needs to be apportioned among the N(N−1)·D queues for inter-rack traffic. Let {W_{i,j,d}, ∀i ≠ j} be a set of N(N−1)·D non-negative weights satisfying Σ_{i,j,d: i≠j} W_{i,j,d} = 1. We assign to each queue q_{i,j,d} a share of B proportional to W_{i,j,d}, i.e., queue q_{i,j,d} receives a dedicated service bandwidth

B^eff_{i,j,d} = B · W_{i,j,d}, ∀i ≠ j,   (2)

on the aggregate switch, and we denote by W_{i,j} = Σ_{d=1}^{D} W_{i,j,d} the total weight assigned to traffic from rack i to rack j. Because inter-rack traffic traverses two TOR switches, the same amount of bandwidth has to be reserved on the TOR switches of both racks i and j. Any remaining bandwidth on a TOR switch is then made available for intra-rack traffic routing: on rack i, the bandwidth assigned to intra-rack traffic is the total TOR bandwidth b minus the aggregate incoming and outgoing inter-rack reservations (our model, as well as the subsequent analysis and algorithms, can be trivially modified depending on whether the TOR switch is non-blocking and/or duplex), i.e.,

b^eff_{i,j} = b − Σ_{d=1}^{D} ( Σ_{l: l≠i} W_{i,l,d} B + Σ_{l: l≠i} W_{l,i,d} B ), ∀i = j.   (3)

For the different classes of intra-rack requests, we assume that requests of the same class receive the same service bandwidth regardless of the origin and destination racks i and j. The total available bandwidth b^eff_{i,j} at the TOR switch is then apportioned among the D intra-rack queues in proportion to the weights w_{i,j,d}, so the effective bandwidth for each class of intra-rack requests is

B^eff_{i,j,d} = w_{i,j,d} · b^eff_{i,j}, ∀i = j.   (4)

By optimizing W_{i,j,d} and w_{i,j,d}, the weighted queuing model provides an effective allocation of data center bandwidth among the different classes of data flows both within and across racks.
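As an illustration of (2)-(4), the sketch below computes the effective per-queue bandwidths from a given weight assignment. The array names and the toy capacity values are our own, not taken from the paper's implementation.

```python
# A sketch of the bandwidth apportioning in (2)-(4); array names and the
# demo values are illustrative.
import numpy as np

def effective_bandwidth(W, w, B, b):
    # W[i, j, d]: aggregate-switch weights (W[i, i, :] == 0), summing to 1
    # over all (i, j, d) per (23); w[i, j, d]: TOR weights summing to 1
    # over d per (22).
    N, _, D = W.shape
    Beff = np.zeros((N, N, D))
    for i in range(N):
        for j in range(N):
            if i != j:
                Beff[i, j] = B * W[i, j]          # (2): inter-rack share of B
            else:
                # (3): TOR bandwidth left after the rack's incoming and
                # outgoing inter-rack reservations (mirrored on both TORs).
                beff = b - B * (W[i, :, :].sum() + W[:, i, :].sum())
                Beff[i, i] = w[i, i] * beff       # (4): split over D classes
    return Beff

N, D = 3, 2
rng = np.random.default_rng(0)
W = rng.random((N, N, D))
W[np.arange(N), np.arange(N), :] = 0
W /= W.sum()                                      # normalize per (23)
w = rng.random((N, N, D))
w /= w.sum(axis=2, keepdims=True)                 # normalize per (22)
# Toy capacities, chosen so the intra-rack residual in (3) stays positive.
print(effective_bandwidth(W, w, B=10.0, b=10.0))
```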
These weights must be optimized to avoid bandwidth under-utilization and congestion: queues with heavy workloads should receive more bandwidth and those with light workloads less, and at the same time different classes of requests should be assigned different shares of resources according to their class levels, so as to jointly minimize the overall latency in the system.

Our goal in this work is to quantify service latency under this model through a tight upper bound, and to minimize the average service latency of all traffic in the data center by solving an optimization problem over three dimensions: the placement of encoded file chunks S_r, the scheduling probabilities π_{i,j,r_d} for load-balancing, and the allocation of system bandwidth through the per-class inter-rack and intra-rack weights W_{i,j,d} and w_{i,j,d}. To the best of our knowledge, this is the first work that jointly minimizes the service latency of an erasure-coded system over all three dimensions.

III. ANALYZING SERVICE LATENCY FOR DATA REQUESTS

In this section, we derive an outer bound on the service latency of a file of class d for each client application running in the data center. Under the probabilistic scheduling policy, it is easy to see that the requests for chunks of file r sent from rack i to rack j form a (decomposed) Poisson process with rate λ_{i,r_d} π_{i,j,r_d}. Thus, the aggregate arrival of class-d chunk requests from rack i to rack j is a (superpositioned) Poisson process with rate

Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d}.   (5)

This implies that, considered in isolation, the system consists of N²·D queues, each of which can be modeled as an M/G/1 queue. The service time per chunk is determined by the allocation of bandwidth B·W_{i,j,d} to each queue q_{i,j,d} handling inter-rack traffic, or by the allocation of bandwidth b_j to q_{j,j,d} for the different classes of intra-rack traffic. The service latency of a file request is then given by the largest delay among the k probabilistically selected chunks of the file [28]. However, the dependency among different queues (their chunk request arrivals are coupled) poses a further challenge for deriving the exact service latency of file requests.

For each chunk request, we consider two latency components: the connection delay N_{i,j}, which includes the overhead of setting up the network connection between a client application in rack i and a storage server in rack j, and the queuing delay Q_{i,j,d}, which measures the time a chunk request spends in the bandwidth service queue q_{i,j,d}, i.e., the time to transfer a chunk with the allocated network bandwidth. Let T̄_{i,r_d} be the average service latency for accessing file r of class d from an application in rack i. Due to the use of (n, k) erasure coding for distributed storage, each file request is mapped to a batch of k parallel chunk requests, which are scheduled to k different racks A_i^r ⊆ S_r hosting the desired file chunks. A file request is completed when all k = |A_i^r| chunks are successfully retrieved, which means a mean service time of

T̄_{i,r_d} = E[ E_{A_i^r} [ max_{j ∈ A_i^r} (N_{i,j} + Q_{i,j,d}) ] ],   (6)

where the expectation is taken over the queuing dynamics in the model and over the set A_i^r of racks selected randomly with probabilities π_{i,j,r_d}. The mean latency in (6) could be calculated easily if (a) the delays N_{i,j} + Q_{i,j,d} were independent and (b) the rack selection A_i^r were fixed. However, due to the coupled arrivals of chunk requests, the queues at different switches are in fact correlated. We deal with this issue by employing a tight bound on the highest order statistic and extending it to account for randomly selected racks [28]. This approach yields a closed-form upper bound on average latency that uses only the first and second moments of N_{i,j} + Q_{i,j,d}, which are easier to analyze. The result is summarized in Lemma 1.

Lemma 1: (Bound for the random order statistic [28].) For an arbitrary set of z_{i,r_d} ∈ R, the average service time is bounded by

T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) (E[D_{i,j,d}] − z_{i,r_d}) + Σ_{j∈S_r} (π_{i,j,r_d}/2) √[ (E[D_{i,j,d}] − z_{i,r_d})² + Var[D_{i,j,d}] ],   (7)

where D_{i,j,d} = N_{i,j} + Q_{i,j,d} is the combined delay. The tightest bound is obtained by optimizing over z_{i,r_d} ∈ R, and the bound is tight in the sense that there exists a distribution of D_{i,j,d} that achieves it with exact equality.
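For intuition, the bound (7) is cheap to evaluate once the per-rack delay moments are known. Below is a direct evaluation for a single file, with made-up probabilities and moments (in the paper the moments follow from Lemma 2 in the sequel, and z is optimized by a convex solver; the crude grid search here only stands in for that step).

```python
# A direct evaluation of the order-statistic bound (7) for one file.
from math import sqrt

def latency_bound(z, pi, ED, VarD):
    # pi[j], ED[j], VarD[j]: scheduling probability and combined-delay
    # moments for each rack j in S_r; pi sums to k.
    bound = z
    for j in pi:
        gap = ED[j] - z
        bound += (pi[j] / 2.0) * (gap + sqrt(gap * gap + VarD[j]))
    return bound

pi = {0: 1.0, 1: 0.8, 2: 0.7, 3: 0.5}   # n = 4 racks, sums to k = 3
ED = {0: 1.2, 1: 0.9, 2: 1.5, 3: 2.0}
VarD = {0: 0.3, 1: 0.2, 2: 0.6, 3: 1.0}
best = min((latency_bound(z / 100.0, pi, ED, VarD), z / 100.0)
           for z in range(0, 500))
print(best)   # (tightest bound found, minimizing z)
```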
We assume that the connection delay N_{i,j} is independent of the queuing delay Q_{i,j,d}. Under the proposed system model, the chunk requests arriving at each queue q_{i,j,d} form a Poisson process with rate Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d}. Therefore, each queue q_{i,j,d} at the aggregate switch can be modeled as an M/G/1 queue that processes chunk requests in an FCFS manner. For intra-rack traffic, the D queues at the top-of-rack switch of each rack can also be modeled as M/G/1 queues, since the intra-rack chunk requests form a Poisson process with rate Λ_{j,j,d} = Σ_{r∈F_d} λ_{j,r_d} π_{j,j,r_d}. Due to the fixed chunk size in our system, we denote by X the standard (random) service time per chunk when the full bandwidth B is available. We assume that the service time is inversely proportional to the bandwidth allocated to q_{i,j,d}, i.e., B^eff_{i,j,d}, so the actual service time X_{i,j,d} under the bandwidth allocation is distributed as

X_{i,j,d} ∼ X · B / B^eff_{i,j,d}, ∀i, j.   (8)

With the service time distributions above, we can derive the mean and variance of the queuing delay Q_{i,j,d} using the Pollaczek-Khinchine formula. Let μ = E[X], σ² = Var[X], and Γ_t = E[X^t] be the mean, variance, and t-th order moment of X, and let η_{i,j} and ξ²_{i,j} be the mean and variance of the connection delay N_{i,j}. These statistics are readily available from existing work on network delay [32], [10] and file-size distributions [34], [33].

Lemma 2: The mean and variance of the combined delay D_{i,j,d}, for any i, j, are given by

E[D_{i,j,d}] = η_{i,j} + Λ_{i,j,d} Γ₂ B² / ( 2 B^eff_{i,j,d} (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ),   (9)

Var[D_{i,j,d}] = ξ²_{i,j} + Λ_{i,j,d} Γ₃ B³ / ( 3 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) + Λ²_{i,j,d} Γ₂² B⁴ / ( 4 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B)² ),   (10)

where B^eff_{i,j,d} is the effective bandwidth assigned to the queue of class-d requests from rack i to rack j.

Combining these results, we derive an upper bound on the average service latency T̄_{i,r_d} as a function of the chunk placement S_r, the scheduling probabilities π_{i,j,r_d}, and the bandwidth allocation B^eff_{i,j,d} (which is a function of the bandwidth weights W_{i,j,d} and w_{i,j,d} in (2) and (4)). The main result of our latency analysis is summarized in the following theorem.

Theorem 1: For an arbitrary set of z_{i,r_d} ∈ R, the expected latency T̄_{i,r_d} of a request for file r of class d from rack i is upper bounded by

T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} [ (π_{i,j,r_d}/2) · f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ],   (11)

where f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is an auxiliary function depending on the aggregate rate Λ_{i,j,d} and the effective bandwidth B^eff_{i,j,d} of queue q_{i,j,d}:

f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) = H_{i,j,d} + √( H²_{i,j,d} + G_{i,j,d} ),   (12)

H_{i,j,d} = η_{i,j} + Λ_{i,j,d} Γ₂ B² / ( 2 B^eff_{i,j,d} (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) − z_{i,r_d},   (13)

G_{i,j,d} = ξ²_{i,j} + Λ_{i,j,d} Γ₃ B³ / ( 3 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) + Λ²_{i,j,d} Γ₂² B⁴ / ( 4 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B)² ).   (14)
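A compact sketch of these building blocks follows: the combined-delay moments (9)-(10) from the Pollaczek-Khinchine formula and the auxiliary function f in (12)-(14). Variable names mirror the paper's symbols; the numeric check at the end uses made-up statistics.

```python
# Building blocks of Theorem 1 (names mirror the paper's symbols; the demo
# values are made up). Stability, per (19), requires Lam * mu * B < Beff.
from math import sqrt

def delay_moments(Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2):
    slack = Beff - Lam * mu * B                  # > 0 iff the queue is stable
    ED = eta + Lam * Gamma2 * B**2 / (2 * Beff * slack)            # (9)
    VarD = (xi2
            + Lam * Gamma3 * B**3 / (3 * Beff**2 * slack)          # (10)
            + Lam**2 * Gamma2**2 * B**4 / (4 * Beff**2 * slack**2))
    return ED, VarD

def f_aux(z, Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2):
    ED, VarD = delay_moments(Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2)
    H = ED - z                                   # (13)
    G = VarD                                     # (14)
    return H + sqrt(H * H + G)                   # (12)

# Theorem 1 then bounds the file latency by z + sum_j (pi_j / 2) * f(...).
print(f_aux(z=1.0, Lam=0.2, Beff=1.0, B=1.0, mu=1.0,
            Gamma2=2.0, Gamma3=6.0, eta=0.05, xi2=0.01))
```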

IV. JOINT LATENCY MINIMIZATION IN CLOUD

We consider a joint latency minimization problem over three design degrees of freedom: (1) the placement of encoded file chunks {S_r}, which determines data center traffic locality; (2) the allocation of bandwidth at the aggregation/TOR switches through the per-class weights {W_{i,j,d}} and {w_{i,j,d}}, which affect the chunk service times of the different data flows; and (3) the scheduling probabilities {π_{i,j,r_d}}, which optimize load-balancing under the probabilistic scheduling policy and select the racks/servers for retrieving files of different classes. Let λ_all = Σ_d Σ_i Σ_{r∈F_d} λ_{i,r_d} be the total file request rate in the data center, where F_d is the set of files belonging to class d. The optimization objective is to minimize the weighted mean service latency in the erasure-coded storage system, defined by

Σ_{d=1}^{D} C_d T̄_d, where T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} (λ_{i,r_d}/λ_all) T̄_{i,r_d}.   (15)

Substituting T̄_{i,r_d} from Equation (11), we obtain T̄_d as

T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} (λ_{i,r_d}/λ_all) ( z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) )
    = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j∈S_r} (π_{i,j,r_d} λ_{i,r_d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ]
    = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} (Λ_{i,j,d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ].   (16)

The second step follows from the fact that Σ_{r in class d} λ_{i,r_d} π_{i,j,r_d} = Λ_{i,j,d}; in the third step, we use the condition that π_{i,j,r_d} = 0 for all j ∉ S_r (because a rack not hosting a desired file chunk is never scheduled) to extend the limit of the summation and exchange the order of summation.

We now define the Joint Latency and Weights Optimization (JLWO) problem as follows:

min.  Σ_{d=1}^{D} C_d T̄_d   (17)
s.t.  T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} (Λ_{i,j,d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ],   (18)
      Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d} ≤ B^eff_{i,j,d}/(μB), ∀i, j, d,   (19)
      Σ_{j=1}^{N} π_{i,j,r_d} = k and π_{i,j,r_d} ∈ [0, 1], ∀i, j, r_d,   (20)
      |S_r| = n and π_{i,j,r_d} = 0 ∀j ∉ S_r, ∀i, r_d,   (21)
      Σ_{d=1}^{D} w_{i,j,d} = 1,   (22)
      Σ_{i=1}^{N} Σ_{j: j≠i} Σ_{d=1}^{D} W_{i,j,d} = 1,   (23)
      B^eff_{i,j,d} = W_{i,j,d} B ≤ C, ∀i ≠ j,   (24)
      B^eff_{i,j,d} = w_{i,j,d} ( b − Σ_{d=1}^{D} ( Σ_{l: l≠i} W_{i,l,d} B + Σ_{l: l≠i} W_{l,i,d} B ) ) ≤ C, ∀i = j,   (25)
var.  z_{i,r_d}, {S_r}, {π_{i,j,r_d}}, {w_{i,j,d}}, {W_{i,j,d}}.

Here we minimize the service latency bound derived in Theorem 1 over z_{i,r_d} ∈ R for each file access to obtain the tightest upper bound. Feasibility of Problem JLWO is ensured by (19), which requires the arrival rate at each queue to be no greater than the chunk service rate the queue receives.
Encoded chunks of each file are placed on a set S_r of racks by (21), and each file request is mapped to k chunk requests processed by k racks in S_r with the probabilities π_{i,j,r_d} of (20). Both the weights W_{i,j,d} (for inter-rack traffic) and the weights w_{i,j,d} (for intra-rack traffic) must add up to 1 so that no bandwidth is left underutilized. The bandwidth B^eff_{i,j,d} assigned to each queue is determined by our bandwidth allocation policy in (2) and (4) for inter- and intra-rack traffic, and any effective flow bandwidth resulting from the allocation must not exceed the port capacity C allowed by the switches; for example, the Cisco switch used in our experiments has a port capacity constraint of C = 1 Gbps, which we elaborate on in Section V.

Problem JLWO is a mixed-integer optimization and hard to compute in general. In this work, we develop an iterative optimization algorithm that alternately optimizes over the three dimensions of Problem JLWO (chunk placement, request scheduling, and inter/intra-rack weight assignment), solving each sub-problem repeatedly to generate a sequence of monotonically decreasing objective values. To introduce the proposed algorithm, we first recognize that Problem JLWO is convex in {π_{i,j,r_d}}, {w_{i,j,d}}, and {W_{i,j,d}} when the other variables are fixed. The convexity is shown by a sequence of lemmas as follows.

Lemma 3: (Convexity of the scheduling sub-problem [28].) When {z_{i,r_d}, W_{i,j,d}, w_{i,j,d}, S_r} are fixed, Problem JLWO is a convex optimization over the probabilities {π_{i,j,r_d}}.

Proof: The proof is straightforward due to the convexity of Λ_{i,j,d} f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) in Λ_{i,j,d} (which is a linear combination of {π_{i,j,r_d}}), as shown in [28], and the fact that all constraints are linear with respect to π_{i,j,r_d}. □

Lemma 4: (Convexity of the bandwidth allocation sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, S_r, w_{i,j,d}} are fixed, Problem JLWO is a convex optimization over the inter-rack bandwidth allocation {W_{i,j,d}}. When {z_{i,r_d}, π_{i,j,r_d}, S_r, W_{i,j,d}} are fixed, Problem JLWO is a convex optimization over the intra-rack bandwidth allocation {w_{i,j,d}}.

Proof: Since all constraints in Problem JLWO are linear with respect to the weights {W_{i,j,d}}, we only need to show that the optimization objective C_d T̄_d is convex in W_{i,j,d}, i.e., that f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in {W_{i,j,d}} with the other variables fixed. Notice that the effective bandwidth B^eff_{i,j,d} is a linear function of the bandwidth allocation weights {W_{i,j,d}}, both for the inter-rack traffic queues in (2) and for the intra-rack traffic queues in (4) when {w_{i,j,d}} is fixed. Therefore, f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in {W_{i,j,d}} if it is convex in B^eff_{i,j,d}.

Toward this end, we consider f = H_{i,j,d} + √(H²_{i,j,d} + G_{i,j,d}) as given in (12), (13), and (14). To prove that f is convex in B^eff_{i,j,d}, we compute

∂²f/∂(B^eff_{i,j,d})² = ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² + [ H_{i,j,d} ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² + G_{i,j,d} ∂²G_{i,j,d}/∂(B^eff_{i,j,d})² ] / (H²_{i,j,d} + G²_{i,j,d})^{1/2} + [ H_{i,j,d} dG_{i,j,d}/dB^eff_{i,j,d} + G_{i,j,d} dH_{i,j,d}/dB^eff_{i,j,d} ]² / (H²_{i,j,d} + G²_{i,j,d})^{3/2},   (26)

from which we see that for ∂²f/∂(B^eff_{i,j,d})² to be positive we only need ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² and ∂²G_{i,j,d}/∂(B^eff_{i,j,d})² to be positive. The second-order derivative of H_{i,j,d} with respect to B^eff_{i,j,d} is

∂²H_{i,j,d}/∂(B^eff_{i,j,d})² = Λ_{i,j,d} Γ₂ ( 3(B^eff_{i,j,d})² − 3Λ_{i,j,d}μB^eff_{i,j,d} − 1 ) / ( (B^eff_{i,j,d})³ (B^eff_{i,j,d} − Λ_{i,j,d}μ)³ ),   (27)

which is positive as long as 1 − Λ_{i,j,d}μ/B^eff_{i,j,d} > 0. This indeed holds because ρ = Λ_{i,j,d}μ/B^eff_{i,j,d} < 1 in an M/G/1 queue. Thus, H_{i,j,d} is convex in B^eff_{i,j,d}. Next, considering G_{i,j,d}, we have

∂²G_{i,j,d}/∂(B^eff_{i,j,d})² = ( p(B^eff_{i,j,d})³ + q(B^eff_{i,j,d})² + sB^eff_{i,j,d} + t ) / ( 6(B^eff_{i,j,d})⁴ (B^eff_{i,j,d} − Λ_{i,j,d}μ)⁴ ),   (28)

where the auxiliary variables are given by

p = 24Λ_{i,j,d}Γ₃,
q = 2Λ_{i,j,d}(15Γ₂² − 28Λ_{i,j,d}μΓ₃),
s = 2Λ²_{i,j,d}μ(22Λ_{i,j,d}μΓ₃ − 15Γ₂²),
t = 3Λ³_{i,j,d}μ²(3Γ₂² − 4Λ_{i,j,d}μΓ₃),

which yield p(B^eff_{i,j,d})³ + q(B^eff_{i,j,d})² + sB^eff_{i,j,d} + t > 0 for B^eff_{i,j,d} > Λ_{i,j,d}μ, i.e., for 1 − Λ_{i,j,d}μ/B^eff_{i,j,d} > 0, which was established above. Thus, G_{i,j,d} is also convex in B^eff_{i,j,d}.

Since ∂²f/∂(B^eff_{i,j,d})² is positive, we conclude that f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in B^eff_{i,j,d} and thus in W_{i,j,d}, so the objective is convex in W_{i,j,d} as well. Similarly, when {z_{i,r_d}, π_{i,j,r_d}, S_r, W_{i,j,d}} are held constant, B^eff_{i,j,d} is again linear in {w_{i,j,d}}, and the above proof that f is convex in B^eff_{i,j,d} still applies; hence f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is also convex in the intra-rack bandwidth allocation {w_{i,j,d}}. This completes the proof. □

Next, we consider the placement sub-problem, which minimizes average latency over {S_r} for fixed {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,j,d}}. In this problem, for each file r in class d we permute the set of racks that contain chunks of file r to obtain a new placement S'_r = {β(j), ∀j ∈ S_r}, where β(j) ∈ N is a permutation. The new probability of accessing file r from rack β(j) when the client is at rack i becomes π_{i,β(j),r_d}. Our objective is to find a permutation that minimizes the average service latency of class-d files, which can be solved via a matching problem between the set of scheduling probabilities {π_{i,j,r_d}, ∀i} and the racks, with respect to their load excluding the contribution of file r. Let Λ^{−r}_{i,j,d} = Λ_{i,j,d} − λ_{i,r_d} π_{i,j,r_d} be the total request rate of class-d files between racks i and j, excluding the contribution of file r. We define a complete bipartite graph G_r = (U, V, E) with disjoint vertex sets U, V of equal size N and edge weights given by

K_{i,j,d} = Σ_{i=1}^{N} Σ_{r in class d} [ λ_{i,r_d} z_{i,r_d}/λ_all + ((Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,j,r_d})/λ_all) f(z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,j,r_d}, W(w)_{i,j,d}) ],   (29)

where W(w)_{i,j,d} denotes the corresponding inter-rack (or intra-rack) bandwidth weight. It is easy to see that a minimum-weight matching on G_r finds β(j) ∀j to minimize

Σ_{j=1}^{N} K_{i,β(j),d} = Σ_{j=1}^{N} Σ_{i=1}^{N} Σ_{r in class d} λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} Σ_{i=1}^{N} ((Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,β(j),r_d})/λ_all) f(z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,β(j),r_d}, W(w)_{i,j,d}),   (30)

which is exactly the optimization objective of Problem JLWO when a chunk request of class d is scheduled with probability π_{i,β(j),r_d} to a rack with existing load Λ^{−r}_{i,j,d}.
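Since each edge weight K_{i,j,d} in (29) is just a number once the other variables are fixed, the placement step reduces to a standard assignment problem. A minimal sketch with an off-the-shelf Hungarian-method solver follows; the random cost matrix stands in for (29).

```python
# A sketch of the placement sub-problem as a balanced N x N assignment,
# solved by the Hungarian method via scipy. The random cost matrix stands
# in for the edge weights K_{i,j,d} of (29).
import numpy as np
from scipy.optimize import linear_sum_assignment

N = 10                                   # racks, as in our testbed
rng = np.random.default_rng(1)
K = rng.random((N, N))                   # K[j, jp]: cost of mapping j -> jp

rows, cols = linear_sum_assignment(K)    # minimum-weight perfect matching
beta = dict(zip(rows.tolist(), cols.tolist()))   # permutation beta(j)
print(beta, K[rows, cols].sum())         # matching and its cost, cf. (30)
```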

Lemma 5: (Bipartite matching equivalence of the placement sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,j,d}} are fixed, the optimization of Problem JLWO over the placement variables S_r is equivalent to a balanced bipartite matching problem of size N.

Proof: Based on the above statements, the placement sub-problem with all other variables fixed is equivalent to finding a minimum-weight matching on a bipartite graph of size N whose edge weights K_{i,j,d} are defined in (29), since the minimum-weight matching minimizes the objective function of Problem JLWO. □

Our proposed algorithm, which solves Problem JLWO by iteratively solving the three sub-problems, is summarized in Algorithm JLWO. It generates a sequence of monotonically decreasing objective values and is therefore guaranteed to converge to a stationary point. Notice that the scheduling and bandwidth allocation sub-problems, as well as the minimization over z_{i,r_d} for each file, are convex and can be computed efficiently by any off-the-shelf convex solver, e.g., MOSEK [29]. The placement sub-problem is a balanced bipartite matching that can be solved by the Hungarian algorithm [30] in polynomial time.

Theorem 2: The proposed algorithm generates a sequence of monotonically decreasing objective values and is guaranteed to converge.

Proof: Algorithm JLWO works on the three sub-problems iteratively. Two of them (scheduling and inter/intra-rack weight assignment) are convex optimizations, which generate monotonically decreasing objective values, and the placement sub-problem is an integer optimization solved exactly by the Hungarian algorithm. Each iteration of Algorithm JLWO therefore solves three sub-problems that individually never increase the objective, and since latency is bounded from below, Algorithm JLWO is guaranteed to converge to a fixed point of Problem JLWO. □

Algorithm JLWO:
  Initialize t = 0 and ε > 0.
  Initialize feasible {z_{i,r_d}(0), π_{i,j,r_d}(0), S_r(0)}.
  Initialize O(0), O(−1) from (17); user input C_d.
  while |O(t) − O(t−1)| > ε:
    // Solve inter-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), w_{i,j,d}(t), S_r(t)}:
    W_{i,j,d}(t+1) = argmin_{W_{i,j,d}} (17) s.t. (18), (19), (23), (24).
    // Solve intra-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), W_{i,j,d}(t+1), S_r(t)}:
    w_{i,j,d}(t+1) = argmin_{w_{i,j,d}} (17) s.t. (18), (19), (22), (25).
    // Solve scheduling for given {z_{i,r_d}(t), S_r(t), w_{i,j,d}(t+1), W_{i,j,d}(t+1)}:
    π_{i,j,r_d}(t+1) = argmin_{π_{i,j,r_d}} (17) s.t. (19), (20).
    // Solve placement for given {z_{i,r_d}(t), w_{i,j,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1)}:
    for d = 1, ..., D:
      for each file r in class d:
        Calculate Λ^{−r}_{i,j,d} using {π_{i,j,r_d}(t+1)}.
        Calculate K_{i,j,d} from (29).
        (β(j) ∀j ∈ N) = Hungarian Algorithm({K_{i,j,d}}).
        Update π_{i,β(j),r_d}(t+1) = π_{i,j,r_d}(t) ∀i, j.
        Initialize S_r(t+1) = {}.
        for j = 1, ..., N:
          if ∃i s.t. π_{i,j,r_d}(t+1) > 0, update S_r(t+1) = S_r(t+1) ∪ {j}.
    // Solve z_{i,r_d} for given {w_{i,j,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1), S_r(t+1)}:
    z_{i,r_d}(t+1) = argmin_{z_{i,r_d} ∈ R} (17).
    // Update the bound T̄_{i,r_d} and the objective value:
    O(t+1) = (17).
    t = t + 1.
  end while
  Output: {S_r(t), π_{i,j,r_d}(t), W_{i,j,d}(t), w_{i,j,d}(t), z_{i,r_d}(t)}
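Structurally, Algorithm JLWO is a block-coordinate descent; a minimal skeleton of its outer loop is sketched below. The solver callables are placeholders passed in by the caller: in our prototype the three convex steps (and the z update) would be handled by a convex solver such as MOSEK, and the placement step by the Hungarian matching sketched above.

```python
# A structural sketch of Algorithm JLWO's outer loop (block-coordinate
# descent). Each solver callable stands in for one sub-problem and must
# return a state whose objective is no worse than its input's, which is
# what yields the monotone convergence of Theorem 2.
def jlwo(state, objective, solvers, eps=0.01, max_iter=200):
    prev = objective(state)
    for t in range(1, max_iter + 1):
        for solve in solvers:       # order: W, w, pi, placement, then z
            state = solve(state)
        obj = objective(state)
        if prev - obj <= eps:       # monotone and bounded below
            return state, obj, t
        prev = obj
    return state, obj, max_iter

# e.g., solvers = [solve_W, solve_w, solve_pi, solve_placement, solve_z]
```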
V. IMPLEMENTATION AND EVALUATION

A. Tahoe Testbed

To validate the weighted queuing model for D different classes of files in our single-data-center system model and to evaluate its performance, we implemented the algorithms in Tahoe [38], an open-source, distributed file system based on the zfec erasure coding library. It provides three special instances of a generic node: (a) the Tahoe Introducer, which keeps track of a collection of storage servers and clients and introduces them to each other; (b) the Tahoe Storage Server, which exposes attached storage to external clients and stores erasure-coded shares; and (c) the Tahoe Client, which processes upload/download requests and connects to storage servers through a Web-based REST API and the Tahoe-LAFS (Least-Authority File System) storage protocol over SSL.

Fig. 1. Our Tahoe testbed with ten racks and distributed clients; each rack has 10 Tahoe storage servers.

Our experiments are conducted on a Tahoe testbed spanning 16 separate physical hosts in an OpenStack cluster, 10 of which are used in our experiments. We simulate each host as a rack in the cluster. Each host runs 11 VM instances, 10 of which run as Tahoe storage servers inside the rack and one as a client node; host 1 has 12 VM instances, the extra instance running as the Introducer of the Tahoe file system. We thereby effectively simulate an OpenStack cluster with 10 racks, each with 11 or 12 VM instances (see Fig. 1). The cluster uses a Cisco Catalyst 4948 switch with 48 ports, each supporting non-blocking 1 Gbps bandwidth in full duplex mode. Our model is general and applies to D classes of files; to simplify the implementation, our experiments consider only D = 2 classes, each with a different service rate, so the results remain representative of differentiated services in a single data center. The 2-class weighted-queuing model requires 2·N(N−1) = 180 queues for inter-rack traffic at the aggregate switch; however, bandwidth reservation through the ports of the switch is not possible, since this Cisco switch does not support the OpenFlow protocol, so we made pairwise bandwidth reservations (using a bandwidth control tool from our Cloud QoS platform) between the different Tahoe clients and Tahoe storage servers. The Tahoe Introducer node resides on rack 1, and each rack has a client node with multiple Tahoe ports that simulates multiple clients initiating requests for the different classes of files from rack i.

Tahoe is an erasure-coded distributed storage system with some unique properties that make it suitable for storage system experiments. In Tahoe, each file is encrypted and then broken into a set of segments, where each segment consists of k blocks. Each segment is then erasure-coded to produce n blocks (using an (n, k) encoding scheme) that are distributed to (ideally) n distinct storage servers, regardless of their placement, whether in the same rack or not. The set of storage blocks on each storage server constitutes a chunk; thus, the file equivalently consists of k chunks that are encoded into n chunks, each consisting of multiple blocks. (If there are not enough servers, Tahoe stores multiple chunks on one server. Also, the term "chunk" as used in this paper is equivalent to the term "share" in Tahoe terminology, and the number of blocks in each chunk equals the number of segments in the file.) For chunk placement, the Tahoe client randomly selects a set of available storage servers with enough storage space to store the n chunks. For server selection during file retrieval, the client first asks all known servers for the storage chunks they might have, again regardless of which racks the servers reside in. Once it knows where to find the needed k chunks (from among the fastest servers, chosen with a pseudo-random algorithm), it downloads at least the first segment from those servers. It thus tends to download chunks from the "fastest" servers purely based on round-trip times (RTT), whereas we consider RTT plus expected queuing delay and transfer delay as the measure of latency.

We had to make several changes to Tahoe in order to conduct our experiments. First, we need the number of racks N ≥ n to meet the system requirement that each rack hold at most one chunk of a given file. In addition, Tahoe has its own rules for chunk placement and request scheduling, while our experiments require client-defined chunk placement across racks and our own server selection algorithms for placement and request scheduling so as to minimize the joint latency of both file classes; we therefore modified the upload and download modules of the Tahoe storage server and client to allow customized and explicit server selection for both chunk placement and retrieval, specified in a configuration file that the client reads at startup. Finally, Tahoe's performance suffers from its single-threaded design on the client side, so we had to use multiple clients (multiple threads on one client node) in each rack, with separate ports, to improve parallelism and bandwidth usage during our experiments.

B. Basic Experiment Setup

We use a (7,4) erasure code throughout the experiments described in this section. The algorithm first calculates the optimal chunk placement across racks for each class of files, which is set up in the client configuration file for each write request. File retrieval request scheduling and the per-class weight assignments for inter-rack traffic also come from Algorithm JLWO. The system calls a bandwidth reservation tool to reserve the assigned bandwidth B·W_{i,j,d} for each class of files based on the optimal weights of each inter-rack pair; the same is done for the intra-rack bandwidth reservations b^eff_{i,j} w_{i,j,d} of each class. The total bandwidth capacity at the aggregate switch is 96 Gbps (48 ports with 1 Gbps in each direction), since we simulate host machines as racks and VMs as storage servers. The intra-rack bandwidth measured with iPerf is 792 Mbps, the disk read bandwidth for sequential workloads is 354 Mbps, and the write bandwidth is 127 Mbps. Requests are generated based on the arrival rates at each rack and are submitted from the client nodes at all racks.
The average inter-rack bandwidth over all racks been applied as part of our JLWO algorithm, was proven to for request of class 1 files is 674 Mbps and the intra/inter- converge in Theorem 2 of [28]. In this paper, fisrt, we see the rack bandwidth ratio is 889Mbps/674Mbps = 1.318. The convergence of the proposed optimized queuing algorithms in average inter-rack bandwidth over all racks for request of Fig. 2. By performing similar speedup techniques as in [28], class 2 files is 684 Mbps and the intra/inter-rack bandwidth our algorithms efficiently solve the optimization problem with ratio is 632Mbps/389Mbps = 1.625. Figure 3 depicts the r = 1000 files of each class at each of the 10 racks. Second, Cumulative Distribution Function (CDF) of the chunk service we can also see that as class 2 becomes as important as the time for both intra-rack and inter-rack traffic. We note that class 1, the algorithm converges faster as it’s actually solving intra-rack requests for class 1 have a mean chunk service time the problem for one single class of files. It is observed that of 22 sec and inter-rack chunk requests for class 1 have a the normalized objective converges within 140 iterations for a mean chunk service time of 30 sec, which is a ratio of 1.364 tolerance  = 0.01, where each iteration has an average run which is very close to the bandwidth ratio of 1.318. Also intra- time of 1.19 sec, when running on an 8-core, 64-X86 machine, rack requests for class 2 have a mean chunk service time of therefore the algorithm converges within 2.77 min on average 50 sec and inter-rack chunk requests for class 2 have a mean from observation. To achieve dynamic file management, our chunk service time of 83 sec, which is a ratio of 1.66 which optimization algorithm can be executed repeatedly upon file is very close to the bandwidth ratio of 1.625.This means the arrivals and departures. chunk service time is nearly proportional to the bandwidth reservation on inter/intra-rack traffic for both classes of files. Class 1 Class 2 Validate algorithms and joint optimization. In order to 400 validate that Algorithm JLWO works for our system model

350 (with weighted queuing model for 2 classes of files for inter- rack traffic at the switch), we compare the average latency 300 performance for the two file classes in our model in the cases 250 of different access patterns. In this experiment, we have 1000

200 files for each class, all files have the same file size 200MB, and aggregate arrival rate for both classes from all racks is 150 0.25/sec. i.e., there is one request for each unique file, so Average Latency (Sec) 100 the ratio of requests for the two classes are 1:1, and we set C = 1 and C = 0.4. We are measuring the latency 50 1 2 for the two classes of the following access patterns: 100% 0 concentration means 100% of the requests for both classes of 0 0.5 1 1.5 2 C2 files concentrate on only one of the racks. Similarly, 80% or 50% concentration means 80% or 50% of the total requests Fig. 4. file requests for different files of size 100MB, aggregate request arrival of each file come from one rack, the rest of the requests rate for both classes is 0.25/sec; varying C2 to validate our algorithms. spread uniformly among other racks. Uniform access means Validate Experiment Setup. While our service delay bound that the requests for both classes are uniformly distributed applies to arbitrary distribution and works for systems hosting across all racks. We compare average latency for these requests any number of files, we first run an experiment to understand for the two classes in each case of the access patterns with Fig. 3. Actual service time distribution of chunk retrieval Fig. 2. Convergence of Algorithm JLWO with r=1000 requests for through intra-rack and inter-rack traffic for both classes. Each each of the two file classes for heterogeneous files from each rack of them has 1000 files of size 100MB using erasure code on our 100-node testbed, with different weights on the second file (7,4) with the aggregate request arrival rate set to λi = 0.25 class. Algorithm JLWO efficiently compute the solution in 140 /sec. Service time distributions indicated that chunk service time iterations in both cases. of the two classes are nearly proportional to the bandwidth reservation based on weight assignments for M/G/1 queues.

Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound 800 350

700 300 600 250 500 200 400

300 150 Average Latency (Sec) Average Latency (Sec) 200 100

100 50

0 100% 80% 50% Uniform 0 Request Access Concentraon (Percentage On One Track) 50 100 150 200 250 File Size (MB)

Fig. 5. Comparison of average latency with different access Fig. 6. Evaluation of different file sizes in our weighted queuing patterns. Experiment is set up for 1000 heterogeneous files for model for both class 1 and class 2 files with C1 = 1 and C2 = 0.4. each class, there is one request for each file and 2000 file Aggregate rate for all file requests at 0.25/sec. Compared with requests in total. Ratio for each class is 1:1. The figure shows class 2 files, our algorithm provides class 1 files relatively lower the percentage that these 2000 requests are concentrated on the latency with heterogeneous file sizes. Latency increases as file same rack. Aggregate arrival rate 0.25/sec, file size 200M. Latency size increases in both classes. Our analytic latency bound taking improved significantly with weighted queuing. Analytic bound for both network and queuing delay into account tightly follows actual both classes tightly follows actual latency as well. service latency for both classes. our weighted queuing model. classes becomes less as the requests becomes more distributed, which is because our algorithm JLWO tentatively assigns As shown in Fig 5, experimental results indicate that our more bandwidth to class 2 requests when the system is less weighted queuing model can effectively mitigate the long congested than when it’s highly congested, which matches our latency due to congestion at one rack for both class 1 and class goal to minimize weighted average latency for both classes of 2, as there is not a significant increase in average latency for files. We also calculated our analytic bound of average latency both classes as the access pattern becomes more concentrated for the two classes when the above access patterns are applied. on one rack. Also we can see that class 1 has relatively lower From the figure we can also see that our analytic bound is tight latency than class 2 files since algorithm JLWO effectively enough to follow the actual average latency for both class 1 assigned more bandwidth to requests for class 1 files since and class 2 files. C1 has a larger value which means class 1 files play a more important role in the optimization objective. It can also be In order to further validate that the algorithms work for observed that the difference in average latency for the two our weighted queuing models for the two classes, we choose Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound of files from the clients distributed among the racks. Results in 600 Fig 6 show that our algorithm provides class 1 files relatively lower latency with heterogeneous file sizes compared to that 500 of class 2 files, which means algorithm JLWO effectively assigned more bandwidth to requests for class 1 files since 400 C1 has a larger value which means class 1 files plays a more important role in the optimization objective. For instance, 300 latency of class 1 files has a 30% improvement on average over that of class 2 files for the 5 sample file sizes in this 200 Average Latency (Sec) experiment. Latency increases as requested file size increases for both classes when arrival rates are set to be the same. 100 Since larger file size means longer service time, it increases queuing delay and thus average latency. We also observe 0 0.25 0.3 0.35 0.4 0.45 that our analytic latency bound follows actual average service Aggregate Request Arrival Rate (/Sec) latency for both classes of files in this experiment. We note Fig. 7. 
Evaluation of latency performance for both classes of files that the actual service latency involves other aspects of delay with different request arrival rate in weighted queuing. File size beyond queuing delay, and the results show that optimizing 200M, with C1 = 1 and C2 = 0.4. Compared with class 2 files, the metric of the proposed latency upper bound improves the our algorithm provides class 1 files relatively lower latency with heterogeneous request arrival rates. Latency increases as requests actual latency with the weighted queuing models. arrive more frequently for both classes. Our analytic latency bound Similarly, we vary the aggregate arrival rate at each rack taking both network and queuing delay into account tightly follows from 0.25/sec to 0.45/sec. This time we fix all file requests actual service latency for both classes. for file size 200MB. All other settings of this experiment are the same as the previous one whose result is shown in Fig 6. r = 1000 files of size 100MB for each class, and aggregate In this experiment we also compare the latency of requests for arrival rate 0.25/sec for all file requests. the ratio of requests the two file classes when we have different weights of their for the two classes are 1:1. We fix C1 = 1 and vary C2 latency in optimization objective. (i.e., C1 = 1 and C2 = 0.4). from 0 to 2 and run Algorithm JLWO to generate the optimal We assume the same uniform access as before. For weighted solution for both file classes. The algorithms provide chunk queuing model, we use optimized server selection for upload placement, request scheduling and inter/intra-rack weight as- and download for each file request, and optimal inter/intra- signment (Wi,j,d and wi,j,d). From Fig. 4 we can see latency rack bandwidth reservation from Algorithm JLWO. Clients of class 2 files increases as C2 decreases; i.e., when class 2 across the data center submit r = 1000 requests for each file becomes less important. Also, the average latency of class 1 class with an aggregate arrival rate varying from 0.25/sec to requests decreases as C2 decreases. This shows the expected 0.45/sec. From Fig 7 we can see that our algorithm provides result that when class 2 becomes more important, more weight class 1 files relatively lower latency with heterogeneous file is allocated to class 2, and since C2 is always smaller than C1, sizes compared to that of class 2 files, which means algorithm class 1 gets more bandwidth. This also explains why when JLWO effectively assigned more bandwidth to requests for C1 = C2 = 1 the average latency of the two classes are class 1 files. For instance, latency of class 1 files has a 32% almost equal as shown in Fig. 4. It can also be observed that improvement on average over that of class 2 files for the when C2 = 0 the system is only minimizing the latency of 5 sample aggregate arrival rate in this experiment. Further, class 1 files, and thus class 2 files have a significantly large as the arrival rate increases and there is more contention latency on average. The difference in latency between class at the queues, this improvement becomes more significant. 1 and class 2 become less when C2 = 2 than when C2 = 0 Thus our algorithm can mitigate traffic contention and reduce since in the former case algorithm JLWO is optimizing class latency very efficiently. Also the average latency increases as 1 files as well, though at less importance with C1 = 1. 
VI. CONCLUSIONS

This paper considers an optimized erasure-coded storage solution for a single data center with differentiated services. The weighted latency of all file requests is optimized over the placement and access of the erasure-coded contents of each file and over the bandwidth reservation at the switches for the different service classes. The proposed algorithm achieves a significant reduction in average latency by exploiting knowledge of the file-access patterns. This paper thus demonstrates that knowing the access probabilities of different files from different racks enables an optimized differentiated storage solution that performs significantly better than one which ignores the file access patterns.