
Differentiated latency in data center networks with erasure coded files through traffic engineering

Yu Xiang, Vaneet Aggarwal, Yih-Farn R. Chen, and Tian Lan

Abstract—This paper proposes an algorithm to minimize the weighted service latency for different classes of tenants (or service classes) in a data center network where erasure-coded files are stored on distributed disks/racks and access requests are scattered across the network. Because bandwidth is limited at both top-of-the-rack (TOR) and aggregation switches, and because tenants in different service classes need differentiated services, network bandwidth must be apportioned among different intra- and inter-rack data flows for different service classes in line with their traffic statistics. We formulate this problem as weighted queuing and employ a class of probabilistic request scheduling policies to derive a closed-form upper bound on service latency for erasure-coded storage with arbitrary file access patterns and service time distributions. The result enables us to propose a novel joint weighted-latency (over different service classes) optimization over three entangled "control knobs": the bandwidth allocation at TOR and aggregation switches for different service classes, the dynamic scheduling of file requests, and the placement of encoded file chunks (i.e., data locality). The joint optimization is shown to be a mixed-integer problem. We develop an iterative algorithm that decouples and solves the joint optimization as three sub-problems, which are either convex or solvable via bipartite matching in polynomial time. The proposed algorithm is prototyped in an open-source, distributed file system, Tahoe, and evaluated on a cloud testbed with 16 separate physical hosts in an OpenStack cluster using 48-port Cisco Catalyst switches. Experiments validate our theoretical latency analysis and show significant latency reduction for diverse file access patterns. The results provide valuable insights on designing low-latency data center networks with erasure-coded storage.

I. INTRODUCTION

Data center storage is growing at an exponential speed, and customers increasingly demand more flexible services that exploit the tradeoffs among reliability, performance, and storage cost. Erasure coding has quickly emerged as a very promising technique to reduce the cost of storage while providing reliability similar to that of replicated systems. The effect of coding on the latency of file access in erasure-coded storage is therefore drawing more attention: Google and Amazon have published that every 500 ms of extra delay means a 1.2% user loss [1]. Modern data centers often adopt a hierarchical architecture [37]. As encoded file chunks are distributed across the system, accessing the files requires - in addition to data chunk placement and dynamic request scheduling - intelligent traffic engineering solutions to handle: 1) data transfers between racks that go through aggregation switches, and 2) data transfers within a rack that go through Top-of-Rack (TOR) switches.

Quantifying the exact latency of an erasure-coded storage system is an open problem. Recently, there have been a number of attempts at finding latency bounds for erasure-coded storage [13], [16], [14], [28], [23]. However, little effort has been made to address the impact of data center networking on service latency.

In this paper, we utilize the probabilistic scheduling policy developed in [28] and analyze the service latency of erasure-coded storage with respect to data center topologies and traffic engineering. To the best of our knowledge, this is the first work that accounts for both erasure coding and data center networking in latency analysis. Our result provides a tight upper bound on the mean service latency of erasure-coded storage and applies to general data center topologies. It allows us to construct quantitative models and find solutions to a novel latency optimization leveraging both the placement of erasure-coded chunks on different racks and the bandwidth reservations at different switches.

We consider a data center storage system with a hierarchical structure. Each rack has a TOR switch that is responsible for routing data flows between the disks and associated storage servers in the rack, while data transfers between different racks are managed by an aggregation switch that connects all TOR switches. Multiple client files are stored distributively using an (n, k) erasure coding scheme, which allows each file to be reconstructed from any k-out-of-n encoded chunks. We assume that file access requests may be generated from anywhere inside the data center, e.g., by a virtual machine spun up by a client on any of the racks, and we consider erasure-coded storage with multiple tenants and differentiated delay demands.
While customizing elastic service latency for the tenants is undoubtedly appealing to cloud storage, it also comes with great technical challenges and calls for a new framework for delivering, quantifying, and optimizing differentiated service latency in general erasure-coded storage.

Due to the limited bandwidth available at both the TOR and aggregation switches, a simple First-Come-First-Serve (FCFS) policy that schedules all file requests fails to deliver satisfactory latency performance, not only because the policy lacks the ability to adapt to varying traffic patterns, but also because of the need to apportion networking resources according to heterogeneous service requirements. Thus, we study a weighted queuing policy at the TOR and aggregation switches, which partitions tenants into different service classes based on their delay requirements and applies a differentiated management policy to the file requests generated by tenants in each service class. We assume that the file requests submitted by tenants in each class are buffered and processed in a separate FCFS queue at each switch. The service rates of these queues are tunable, with the constraints that they are non-negative and sum to at most the switch/link capacity. Tuning these weights allows us to provide differentiated service rates to tenants in different classes, thereby assigning differentiated service latency to tenants.

In particular, a performance hit in erasure-coded storage systems comes when dealing with hot network switches. The latency of each file request is determined by the maximum delay in retrieving its k-out-of-n encoded chunks.
Without proper coordination in processing each batch of chunk requests that jointly reconstructs a file, service latency is dominated by staggering chunk requests with the worst access latency (e.g., chunk requests processed by heavily congested switches), significantly increasing overall latency in the data center. To avoid this, bandwidth reservations can be made for routing traffic among racks [24], [25], [26]. Thus, we apportion bandwidth at the TOR and aggregation switches so that each flow has its own FCFS queue for the data transfer, and we jointly optimize bandwidth allocation and data locality (i.e., the placement of encoded file chunks) to minimize service latency.

For a given set of pairwise bandwidth allocations among racks and a given placement of the erasure-coded files across racks, we first find an upper bound on the mean latency of each file by applying the probabilistic scheduling policy proposed in [28]. Having constructed this quantitative latency model, we consider a joint optimization of average service latency (weighted by the request arrival rates and the weights of the service classes) over the placement of the contents of each file, the bandwidth reservation between any pair of racks (at the TOR and aggregation switches), and the scheduling probabilities. This optimization is shown to be a mixed-integer optimization, in particular due to the integer constraints for the content placement. To tackle this challenge, the latency minimization is decoupled into three sub-problems, two of which are proven to be convex, and we propose an algorithm that iteratively minimizes service latency over the three engineering degrees of freedom with guaranteed convergence.

To validate our theoretical analysis and the joint latency optimization for different tenants, we prototype the proposed algorithms in Tahoe [38], an open-source, distributed file system based on the zfec erasure coding library for fault tolerance.
A Tahoe storage system consisting of 10 racks is deployed on hosts of virtual machines in an OpenStack-based data center environment, with each rack hosting 10 Tahoe storage servers, and each rack has a client node deployed to issue storage requests. The experimental results show that the proposed algorithm converges within a reasonable number of iterations. We further find that the service time distribution is nearly proportional to the bandwidth of the server, which is an assumption used in the latency analysis, and the implementation results show that our proposed approach significantly improves service latency compared to the native storage settings of the testbed.

II. SYSTEM MODEL

The main notation used in this paper is summarized in Table I.

TABLE I: MAIN NOTATION
N: number of racks, indexed by i = 1, ..., N
m: number of storage servers in each rack
R: number of files in the system, indexed by r = 1, ..., R
(n, k): erasure code for storing files
D: number of service classes
B: total available bandwidth at the aggregate switch
b: total available bandwidth at each top-of-the-rack switch
b^eff_{i,j}: effective bandwidth for requests from node i to node j in the same rack
B^eff_{i,j,d}: effective bandwidth for servicing class-d requests from rack i to rack j
w_{i,j,d}: weight for apportioning top-of-rack switch bandwidth for service class d
W_{i,j,d}: weight for apportioning aggregate switch bandwidth for service class d
λ_{i,r_d}: arrival rate of requests for file r of service class d from rack i
π_{i,j,r_d}: probability of routing a rack-i, file-r, service-class-d request to rack j
S_{r_d}: set of racks for placing encoded chunks of file r of service class d
N_{i,j}: connection delay for service from rack i to rack j
Q_{i,j,d}: queuing delay for service of class d from rack i to rack j
T̄_{i,r_d}: expected latency of a request for file r of class d from rack i

We consider a data center consisting of N racks, denoted by N = {1, 2, ..., N}, each equipped with m homogeneous servers that are available for data storage and running applications. A Top-of-Rack (TOR) switch connected to each rack routes intra-rack traffic, and an aggregate switch connecting all TOR switches routes inter-rack traffic. A set of files R = {1, 2, ..., R} are stored distributedly among these m·N servers and are accessed by client applications. Our goal is to minimize the overall file access latency of all applications to improve the performance of the cloud system.

This work focuses on erasure-coded file storage systems, where files are erasure-coded and stored distributedly to save space while providing the same data durability as replication systems. More precisely, each file r is split into k fixed-size chunks and then encoded using an (n, k) erasure code to generate n chunks of the same size, which are assigned to and stored on n out of the m·N distinct servers in the data center. While it is possible to place multiple chunks of the same file in the same rack (using the techniques in [28], our results can easily be extended to this case), we assume that the n servers storing file r belong to distinct racks, so as to maximize the distribution of chunks across all racks, which in turn provides the highest reliability against independent rack failures. This is indeed a common practice adopted by file systems such as QFS [12], an erasure-coded storage file system designed to replace HDFS for Map/Reduce processing.

For each file r, we use S_r to denote the set of racks hosting encoded chunks of file r, satisfying S_r ⊆ N and |S_r| = n. To access file r, an (n, k) Maximum-Distance-Separable (MDS) erasure code allows the file to be reconstructed from any subset of k-out-of-n chunks. Therefore, a file request generated by a client application must be routed to a subset of k racks in S_r for successful processing. To minimize latency, the selection of these k racks needs to be optimized for each file request on the fly to balance load among the racks and avoid creating "hot" switches that suffer from heavy congestion and cause high access latency. We refer to this routing and rack selection problem as the request scheduling problem.
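To make the k-out-of-n MDS property concrete, the following toy sketch implements a small Reed-Solomon style (n, k) code over the prime field GF(257) and checks that any k of the n encoded symbols recover the data. This is only an illustration of the property, with parameters and helper names of our own choosing; Tahoe's actual coding uses the zfec library.

```python
# Toy illustration of the "any k-out-of-n" MDS property (an illustrative
# Reed-Solomon style construction over GF(257), not the zfec implementation
# Tahoe relies on). The k data symbols are the coefficients of a
# degree-(k-1) polynomial; the n encoded symbols are its evaluations at n
# distinct points, so any k evaluations recover it by interpolation.
P = 257  # prime field size, large enough to hold one byte per symbol

def encode(data, n):
    # data: k field elements; returns n encoded symbols f(1), ..., f(n).
    return [sum(c * pow(x, e, P) for e, c in enumerate(data)) % P
            for x in range(1, n + 1)]

def reconstruct(points, k):
    # points: any k pairs (x, f(x)); returns the k polynomial coefficients.
    coeffs = [0] * k
    for x_i, y_i in points:
        basis, denom = [1], 1          # Lagrange basis polynomial for x_i
        for x_j, _ in points:
            if x_j != x_i:
                # multiply the basis polynomial by (x - x_j)
                basis = [(a - x_j * b) % P
                         for a, b in zip([0] + basis, basis + [0])]
                denom = denom * (x_i - x_j) % P
        scale = y_i * pow(denom, P - 2, P) % P   # divide by denom (Fermat)
        for e in range(k):
            coeffs[e] = (coeffs[e] + scale * basis[e]) % P
    return coeffs

k, n = 4, 7
data = [11, 22, 33, 44]
chunks = encode(data, n)
subset = [(x, chunks[x - 1]) for x in (2, 3, 5, 7)]  # any 4 of the 7
assert reconstruct(subset, k) == data
```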
In this paper, we assume that the tenants' files are divided into D service classes, denoted by class d = 1, 2, ..., D. Files in different service classes have different latency sensitivities and requirements. Series of requests are generated by client applications to access the R files of the different classes; we model the arrival of requests for each file r in class d as a Poisson process, and to emphasize a file's service class we denote file r in class d by r_d. Let λ_{i,r_d} be the rate of file-r_d requests generated by client applications running in rack i, so that a file-r_d request is generated from rack i with probability λ_{i,r_d}/Σ_i λ_{i,r_d}, and the overall request arrival for file r_d is a superposition of Poisson processes with aggregate rate λ_{r_d} = Σ_i λ_{i,r_d}. Since solving the optimal request scheduling for latency minimization is still an open problem for general erasure-coded storage systems [13], [16], [14], [28], we employ the probabilistic scheduling policy proposed in [28] to derive an outer bound on service latency, which in turn enables a practical solution to the request scheduling problem.

Upon the arrival of each request, a probabilistic scheduler selects k-out-of-n racks in S_r (which host the desired file chunks) according to some known probability and routes the resulting traffic to the client application accordingly. It is shown in [28] that determining the probability distribution of each k-out-of-n combination of racks is equivalent to solving for the marginal probabilities of scheduling the requests λ_{i,r_d}:

π_{i,j,r_d} = P[ j ∈ S_r is selected | k racks are selected ],   (1)

under the constraints π_{i,j,r_d} ∈ [0, 1] and Σ_j π_{i,j,r_d} = k [28]. Here π_{i,j,r_d} represents the scheduling probability of routing a class-d request for file r from rack i to rack j, and the constraint Σ_j π_{i,j,r_d} = k ensures that exactly k distinct racks are selected to process each file request. It is easy to see that the chunk placement problem can now be rewritten with respect to π_{i,j,r_d}: π_{i,j,r_d} = 0 for all i equivalently means that no chunk of file r exists on rack j, i.e., j ∉ S_r, while a chunk is placed on rack j only if π_{i,j,r_d} > 0.

To accommodate the network traffic generated by applications in different service classes, the bandwidth available at the TOR and aggregation switches must be apportioned among the different flows in a coordinated fashion. We propose a weighted queuing model for bandwidth allocation that enables differentiated latency QoS for requests belonging to different classes. In particular, at each rack j, we buffer all incoming class-d requests generated by applications in rack i in a local queue, denoted by q_{i,j,d}. Each rack j therefore manages N·D independent queues: D queues that manage intra-rack traffic traveling through the TOR switch, and (N−1)·D queues that manage inter-rack traffic of class d originating from the other racks i ≠ j.

Assume the total bi-directional bandwidth at the aggregate switch is B, which needs to be apportioned among the N(N−1)·D queues for inter-rack traffic. Let {W_{i,j,d}, ∀i ≠ j} be a set of N(N−1)·D non-negative weights satisfying Σ_{i,j,d: i≠j} W_{i,j,d} = 1. We assign to each queue q_{i,j,d} a share of B proportional to W_{i,j,d}, i.e., queue q_{i,j,d} receives a dedicated service bandwidth

B^eff_{i,j,d} = B · W_{i,j,d}, ∀i ≠ j,   (2)

on the aggregate switch, and we denote by W_{i,j} = Σ_{d=1}^{D} W_{i,j,d} the total weight assigned to traffic from rack i to rack j. Because inter-rack traffic traverses two TOR switches, the same amount of bandwidth has to be reserved on the TOR switches of both racks i and j. Any remaining bandwidth on a TOR switch is then made available for intra-rack traffic routing: on rack i, the bandwidth assigned to intra-rack traffic is the total TOR bandwidth b minus the aggregate incoming and outgoing inter-rack reservations (our model, as well as the subsequent analysis and algorithms, can be trivially modified depending on whether the TOR switch is non-blocking and/or duplex), i.e.,

b^eff_{i,j} = b − Σ_{d=1}^{D} ( Σ_{l: l≠i} W_{i,l,d} B + Σ_{l: l≠i} W_{l,i,d} B ), ∀i = j.   (3)

For the different classes of intra-rack requests, we assume that requests of the same class receive the same service bandwidth regardless of the origin and destination racks i and j. The total available bandwidth b^eff_{i,j} at the TOR switch is then apportioned among the D intra-rack queues in proportion to the weights w_{i,j,d}, so the effective bandwidth for each class of intra-rack requests is

B^eff_{i,j,d} = w_{i,j,d} · b^eff_{i,j}, ∀i = j.   (4)

By optimizing W_{i,j,d} and w_{i,j,d}, the weighted queuing model provides an effective allocation of data center bandwidth among the different classes of data flows both within and across racks.
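As an illustration of (2)-(4), the sketch below computes the effective per-queue bandwidths from a given weight assignment. The array names and the toy capacity values are our own, not taken from the paper's implementation.

```python
# A sketch of the bandwidth apportioning in (2)-(4); array names and the
# demo values are illustrative.
import numpy as np

def effective_bandwidth(W, w, B, b):
    # W[i, j, d]: aggregate-switch weights (W[i, i, :] == 0), summing to 1
    # over all (i, j, d) per (23); w[i, j, d]: TOR weights summing to 1
    # over d per (22).
    N, _, D = W.shape
    Beff = np.zeros((N, N, D))
    for i in range(N):
        for j in range(N):
            if i != j:
                Beff[i, j] = B * W[i, j]          # (2): inter-rack share of B
            else:
                # (3): TOR bandwidth left after the rack's incoming and
                # outgoing inter-rack reservations (mirrored on both TORs).
                beff = b - B * (W[i, :, :].sum() + W[:, i, :].sum())
                Beff[i, i] = w[i, i] * beff       # (4): split over D classes
    return Beff

N, D = 3, 2
rng = np.random.default_rng(0)
W = rng.random((N, N, D))
W[np.arange(N), np.arange(N), :] = 0
W /= W.sum()                                      # normalize per (23)
w = rng.random((N, N, D))
w /= w.sum(axis=2, keepdims=True)                 # normalize per (22)
# Toy capacities, chosen so the intra-rack residual in (3) stays positive.
print(effective_bandwidth(W, w, B=10.0, b=10.0))
```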
These weights must be optimized to avoid bandwidth under-utilization and congestion: queues with heavy workloads should receive more bandwidth and those with light workloads less, and at the same time different classes of requests should be assigned different shares of resources according to their class levels, so as to jointly minimize the overall latency in the system.

Our goal in this work is to quantify service latency under this model through a tight upper bound, and to minimize the average service latency of all traffic in the data center by solving an optimization problem over three dimensions: the placement of encoded file chunks S_r, the scheduling probabilities π_{i,j,r_d} for load-balancing, and the allocation of system bandwidth through the per-class inter-rack and intra-rack weights W_{i,j,d} and w_{i,j,d}. To the best of our knowledge, this is the first work that jointly minimizes the service latency of an erasure-coded system over all three dimensions.

III. ANALYZING SERVICE LATENCY FOR DATA REQUESTS

In this section, we derive an outer bound on the service latency of a file of class d for each client application running in the data center. Under the probabilistic scheduling policy, it is easy to see that the requests for chunks of file r sent from rack i to rack j form a (decomposed) Poisson process with rate λ_{i,r_d} π_{i,j,r_d}. Thus, the aggregate arrival of class-d chunk requests from rack i to rack j is a (superpositioned) Poisson process with rate

Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d}.   (5)

This implies that, considered in isolation, the system consists of N²·D queues, each of which can be modeled as an M/G/1 queue. The service time per chunk is determined by the allocation of bandwidth B·W_{i,j,d} to each queue q_{i,j,d} handling inter-rack traffic, or by the allocation of bandwidth b_j to q_{j,j,d} for the different classes of intra-rack traffic. The service latency of a file request is then given by the largest delay among the k probabilistically selected chunks of the file [28]. However, the dependency among different queues (their chunk request arrivals are coupled) poses a further challenge for deriving the exact service latency of file requests.

For each chunk request, we consider two latency components: the connection delay N_{i,j}, which includes the overhead of setting up the network connection between a client application in rack i and a storage server in rack j, and the queuing delay Q_{i,j,d}, which measures the time a chunk request spends in the bandwidth service queue q_{i,j,d}, i.e., the time to transfer a chunk with the allocated network bandwidth. Let T̄_{i,r_d} be the average service latency for accessing file r of class d from an application in rack i. Due to the use of (n, k) erasure coding for distributed storage, each file request is mapped to a batch of k parallel chunk requests, which are scheduled to k different racks A_i^r ⊆ S_r hosting the desired file chunks. A file request is completed when all k = |A_i^r| chunks are successfully retrieved, which means a mean service time of

T̄_{i,r_d} = E[ E_{A_i^r} [ max_{j ∈ A_i^r} (N_{i,j} + Q_{i,j,d}) ] ],   (6)

where the expectation is taken over the queuing dynamics in the model and over the set A_i^r of racks selected randomly with probabilities π_{i,j,r_d}. The mean latency in (6) could be calculated easily if (a) the delays N_{i,j} + Q_{i,j,d} were independent and (b) the rack selection A_i^r were fixed. However, due to the coupled arrivals of chunk requests, the queues at different switches are in fact correlated. We deal with this issue by employing a tight bound on the highest order statistic and extending it to account for randomly selected racks [28]. This approach yields a closed-form upper bound on average latency that uses only the first and second moments of N_{i,j} + Q_{i,j,d}, which are easier to analyze. The result is summarized in Lemma 1.

Lemma 1: (Bound for the random order statistic [28].) For an arbitrary set of z_{i,r_d} ∈ R, the average service time is bounded by

T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) (E[D_{i,j,d}] − z_{i,r_d}) + Σ_{j∈S_r} (π_{i,j,r_d}/2) √[ (E[D_{i,j,d}] − z_{i,r_d})² + Var[D_{i,j,d}] ],   (7)

where D_{i,j,d} = N_{i,j} + Q_{i,j,d} is the combined delay. The tightest bound is obtained by optimizing over z_{i,r_d} ∈ R, and the bound is tight in the sense that there exists a distribution of D_{i,j,d} that achieves it with exact equality.
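For intuition, the bound (7) is cheap to evaluate once the per-rack delay moments are known. Below is a direct evaluation for a single file, with made-up probabilities and moments (in the paper the moments follow from Lemma 2 in the sequel, and z is optimized by a convex solver; the crude grid search here only stands in for that step).

```python
# A direct evaluation of the order-statistic bound (7) for one file.
from math import sqrt

def latency_bound(z, pi, ED, VarD):
    # pi[j], ED[j], VarD[j]: scheduling probability and combined-delay
    # moments for each rack j in S_r; pi sums to k.
    bound = z
    for j in pi:
        gap = ED[j] - z
        bound += (pi[j] / 2.0) * (gap + sqrt(gap * gap + VarD[j]))
    return bound

pi = {0: 1.0, 1: 0.8, 2: 0.7, 3: 0.5}   # n = 4 racks, sums to k = 3
ED = {0: 1.2, 1: 0.9, 2: 1.5, 3: 2.0}
VarD = {0: 0.3, 1: 0.2, 2: 0.6, 3: 1.0}
best = min((latency_bound(z / 100.0, pi, ED, VarD), z / 100.0)
           for z in range(0, 500))
print(best)   # (tightest bound found, minimizing z)
```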
We assume that the connection delay N_{i,j} is independent of the queuing delay Q_{i,j,d}. Under the proposed system model, the chunk requests arriving at each queue q_{i,j,d} form a Poisson process with rate Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d}. Therefore, each queue q_{i,j,d} at the aggregate switch can be modeled as an M/G/1 queue that processes chunk requests in an FCFS manner. For intra-rack traffic, the D queues at the top-of-rack switch of each rack can also be modeled as M/G/1 queues, since the intra-rack chunk requests form a Poisson process with rate Λ_{j,j,d} = Σ_{r∈F_d} λ_{j,r_d} π_{j,j,r_d}. Due to the fixed chunk size in our system, we denote by X the standard (random) service time per chunk when the full bandwidth B is available. We assume that the service time is inversely proportional to the bandwidth allocated to q_{i,j,d}, i.e., B^eff_{i,j,d}, so the actual service time X_{i,j,d} under the bandwidth allocation is distributed as

X_{i,j,d} ∼ X · B / B^eff_{i,j,d}, ∀i, j.   (8)

With the service time distributions above, we can derive the mean and variance of the queuing delay Q_{i,j,d} using the Pollaczek-Khinchine formula. Let μ = E[X], σ² = Var[X], and Γ_t = E[X^t] be the mean, variance, and t-th order moment of X, and let η_{i,j} and ξ²_{i,j} be the mean and variance of the connection delay N_{i,j}. These statistics are readily available from existing work on network delay [32], [10] and file-size distributions [34], [33].

Lemma 2: The mean and variance of the combined delay D_{i,j,d}, for any i, j, are given by

E[D_{i,j,d}] = η_{i,j} + Λ_{i,j,d} Γ₂ B² / ( 2 B^eff_{i,j,d} (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ),   (9)

Var[D_{i,j,d}] = ξ²_{i,j} + Λ_{i,j,d} Γ₃ B³ / ( 3 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) + Λ²_{i,j,d} Γ₂² B⁴ / ( 4 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B)² ),   (10)

where B^eff_{i,j,d} is the effective bandwidth assigned to the queue of class-d requests from rack i to rack j.

Combining these results, we derive an upper bound on the average service latency T̄_{i,r_d} as a function of the chunk placement S_r, the scheduling probabilities π_{i,j,r_d}, and the bandwidth allocation B^eff_{i,j,d} (which is a function of the bandwidth weights W_{i,j,d} and w_{i,j,d} in (2) and (4)). The main result of our latency analysis is summarized in the following theorem.

Theorem 1: For an arbitrary set of z_{i,r_d} ∈ R, the expected latency T̄_{i,r_d} of a request for file r of class d from rack i is upper bounded by

T̄_{i,r_d} ≤ z_{i,r_d} + Σ_{j∈S_r} [ (π_{i,j,r_d}/2) · f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ],   (11)

where f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is an auxiliary function depending on the aggregate rate Λ_{i,j,d} and the effective bandwidth B^eff_{i,j,d} of queue q_{i,j,d}:

f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) = H_{i,j,d} + √( H²_{i,j,d} + G_{i,j,d} ),   (12)

H_{i,j,d} = η_{i,j} + Λ_{i,j,d} Γ₂ B² / ( 2 B^eff_{i,j,d} (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) − z_{i,r_d},   (13)

G_{i,j,d} = ξ²_{i,j} + Λ_{i,j,d} Γ₃ B³ / ( 3 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B) ) + Λ²_{i,j,d} Γ₂² B⁴ / ( 4 (B^eff_{i,j,d})² (B^eff_{i,j,d} − Λ_{i,j,d} μ B)² ).   (14)
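A compact sketch of these building blocks follows: the combined-delay moments (9)-(10) from the Pollaczek-Khinchine formula and the auxiliary function f in (12)-(14). Variable names mirror the paper's symbols; the numeric check at the end uses made-up statistics.

```python
# Building blocks of Theorem 1 (names mirror the paper's symbols; the demo
# values are made up). Stability, per (19), requires Lam * mu * B < Beff.
from math import sqrt

def delay_moments(Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2):
    slack = Beff - Lam * mu * B                  # > 0 iff the queue is stable
    ED = eta + Lam * Gamma2 * B**2 / (2 * Beff * slack)            # (9)
    VarD = (xi2
            + Lam * Gamma3 * B**3 / (3 * Beff**2 * slack)          # (10)
            + Lam**2 * Gamma2**2 * B**4 / (4 * Beff**2 * slack**2))
    return ED, VarD

def f_aux(z, Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2):
    ED, VarD = delay_moments(Lam, Beff, B, mu, Gamma2, Gamma3, eta, xi2)
    H = ED - z                                   # (13)
    G = VarD                                     # (14)
    return H + sqrt(H * H + G)                   # (12)

# Theorem 1 then bounds the file latency by z + sum_j (pi_j / 2) * f(...).
print(f_aux(z=1.0, Lam=0.2, Beff=1.0, B=1.0, mu=1.0,
            Gamma2=2.0, Gamma3=6.0, eta=0.05, xi2=0.01))
```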

IV. JOINT LATENCY MINIMIZATION IN CLOUD

We consider a joint latency minimization problem over three design degrees of freedom: (1) the placement of encoded file chunks {S_r}, which determines data center traffic locality; (2) the allocation of bandwidth at the aggregation/TOR switches through the per-class weights {W_{i,j,d}} and {w_{i,j,d}}, which affect the chunk service times of the different data flows; and (3) the scheduling probabilities {π_{i,j,r_d}}, which optimize load-balancing under the probabilistic scheduling policy and select the racks/servers for retrieving files of different classes. Let λ_all = Σ_d Σ_i Σ_{r∈F_d} λ_{i,r_d} be the total file request rate in the data center, where F_d is the set of files belonging to class d. The optimization objective is to minimize the weighted mean service latency in the erasure-coded storage system, defined by

Σ_{d=1}^{D} C_d T̄_d, where T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} (λ_{i,r_d}/λ_all) T̄_{i,r_d}.   (15)

Substituting T̄_{i,r_d} from Equation (11), we obtain T̄_d as

T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} (λ_{i,r_d}/λ_all) ( z_{i,r_d} + Σ_{j∈S_r} (π_{i,j,r_d}/2) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) )
    = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j∈S_r} (π_{i,j,r_d} λ_{i,r_d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ]
    = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} (Λ_{i,j,d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ].   (16)

The second step follows from the fact that Σ_{r in class d} λ_{i,r_d} π_{i,j,r_d} = Λ_{i,j,d}; in the third step, we use the condition that π_{i,j,r_d} = 0 for all j ∉ S_r (because a rack not hosting a desired file chunk is never scheduled) to extend the limit of the summation and exchange the order of summation.

We now define the Joint Latency and Weights Optimization (JLWO) problem as follows:

min.  Σ_{d=1}^{D} C_d T̄_d   (17)
s.t.  T̄_d = Σ_{i=1}^{N} Σ_{r∈F_d} [ λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} (Λ_{i,j,d}/(2λ_all)) f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) ],   (18)
      Λ_{i,j,d} = Σ_{r∈F_d} λ_{i,r_d} π_{i,j,r_d} ≤ B^eff_{i,j,d}/(μB), ∀i, j, d,   (19)
      Σ_{j=1}^{N} π_{i,j,r_d} = k and π_{i,j,r_d} ∈ [0, 1], ∀i, j, r_d,   (20)
      |S_r| = n and π_{i,j,r_d} = 0 ∀j ∉ S_r, ∀i, r_d,   (21)
      Σ_{d=1}^{D} w_{i,j,d} = 1,   (22)
      Σ_{i=1}^{N} Σ_{j: j≠i} Σ_{d=1}^{D} W_{i,j,d} = 1,   (23)
      B^eff_{i,j,d} = W_{i,j,d} B ≤ C, ∀i ≠ j,   (24)
      B^eff_{i,j,d} = w_{i,j,d} ( b − Σ_{d=1}^{D} ( Σ_{l: l≠i} W_{i,l,d} B + Σ_{l: l≠i} W_{l,i,d} B ) ) ≤ C, ∀i = j,   (25)
var.  z_{i,r_d}, {S_r}, {π_{i,j,r_d}}, {w_{i,j,d}}, {W_{i,j,d}}.

Here we minimize the service latency bound derived in Theorem 1 over z_{i,r_d} ∈ R for each file access to obtain the tightest upper bound. Feasibility of Problem JLWO is ensured by (19), which requires the arrival rate at each queue to be no greater than the chunk service rate the queue receives.
Encoded chunks of each file are placed on a set S_r of racks by (21), and each file request is mapped to k chunk requests processed by k racks in S_r with the probabilities π_{i,j,r_d} of (20). Both the weights W_{i,j,d} (for inter-rack traffic) and the weights w_{i,j,d} (for intra-rack traffic) must add up to 1 so that no bandwidth is left underutilized. The bandwidth B^eff_{i,j,d} assigned to each queue is determined by our bandwidth allocation policy in (2) and (4) for inter- and intra-rack traffic, and any effective flow bandwidth resulting from the allocation must not exceed the port capacity C allowed by the switches; for example, the Cisco switch used in our experiments has a port capacity constraint of C = 1 Gbps, which we elaborate on in Section V.

Problem JLWO is a mixed-integer optimization and hard to compute in general. In this work, we develop an iterative optimization algorithm that alternately optimizes over the three dimensions of Problem JLWO (chunk placement, request scheduling, and inter/intra-rack weight assignment), solving each sub-problem repeatedly to generate a sequence of monotonically decreasing objective values. To introduce the proposed algorithm, we first recognize that Problem JLWO is convex in {π_{i,j,r_d}}, {w_{i,j,d}}, and {W_{i,j,d}} when the other variables are fixed. The convexity is shown by a sequence of lemmas as follows.

Lemma 3: (Convexity of the scheduling sub-problem [28].) When {z_{i,r_d}, W_{i,j,d}, w_{i,j,d}, S_r} are fixed, Problem JLWO is a convex optimization over the probabilities {π_{i,j,r_d}}.

Proof: The proof is straightforward due to the convexity of Λ_{i,j,d} f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) in Λ_{i,j,d} (which is a linear combination of {π_{i,j,r_d}}), as shown in [28], and the fact that all constraints are linear with respect to π_{i,j,r_d}. □

Lemma 4: (Convexity of the bandwidth allocation sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, S_r, w_{i,j,d}} are fixed, Problem JLWO is a convex optimization over the inter-rack bandwidth allocation {W_{i,j,d}}. When {z_{i,r_d}, π_{i,j,r_d}, S_r, W_{i,j,d}} are fixed, Problem JLWO is a convex optimization over the intra-rack bandwidth allocation {w_{i,j,d}}.

Proof: Since all constraints in Problem JLWO are linear with respect to the weights {W_{i,j,d}}, we only need to show that the optimization objective C_d T̄_d is convex in W_{i,j,d}, i.e., that f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in {W_{i,j,d}} with the other variables fixed. Notice that the effective bandwidth B^eff_{i,j,d} is a linear function of the bandwidth allocation weights {W_{i,j,d}}, both for the inter-rack traffic queues in (2) and for the intra-rack traffic queues in (4) when {w_{i,j,d}} is fixed. Therefore, f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in {W_{i,j,d}} if it is convex in B^eff_{i,j,d}.

Toward this end, we consider f = H_{i,j,d} + √(H²_{i,j,d} + G_{i,j,d}) as given in (12), (13), and (14). To prove that f is convex in B^eff_{i,j,d}, we compute

∂²f/∂(B^eff_{i,j,d})² = ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² + [ H_{i,j,d} ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² + G_{i,j,d} ∂²G_{i,j,d}/∂(B^eff_{i,j,d})² ] / (H²_{i,j,d} + G²_{i,j,d})^{1/2} + [ H_{i,j,d} dG_{i,j,d}/dB^eff_{i,j,d} + G_{i,j,d} dH_{i,j,d}/dB^eff_{i,j,d} ]² / (H²_{i,j,d} + G²_{i,j,d})^{3/2},   (26)

from which we see that for ∂²f/∂(B^eff_{i,j,d})² to be positive we only need ∂²H_{i,j,d}/∂(B^eff_{i,j,d})² and ∂²G_{i,j,d}/∂(B^eff_{i,j,d})² to be positive. The second-order derivative of H_{i,j,d} with respect to B^eff_{i,j,d} is

∂²H_{i,j,d}/∂(B^eff_{i,j,d})² = Λ_{i,j,d} Γ₂ ( 3(B^eff_{i,j,d})² − 3Λ_{i,j,d}μB^eff_{i,j,d} − 1 ) / ( (B^eff_{i,j,d})³ (B^eff_{i,j,d} − Λ_{i,j,d}μ)³ ),   (27)

which is positive as long as 1 − Λ_{i,j,d}μ/B^eff_{i,j,d} > 0. This indeed holds because ρ = Λ_{i,j,d}μ/B^eff_{i,j,d} < 1 in an M/G/1 queue. Thus, H_{i,j,d} is convex in B^eff_{i,j,d}. Next, considering G_{i,j,d}, we have

∂²G_{i,j,d}/∂(B^eff_{i,j,d})² = ( p(B^eff_{i,j,d})³ + q(B^eff_{i,j,d})² + sB^eff_{i,j,d} + t ) / ( 6(B^eff_{i,j,d})⁴ (B^eff_{i,j,d} − Λ_{i,j,d}μ)⁴ ),   (28)

where the auxiliary variables are given by

p = 24Λ_{i,j,d}Γ₃,
q = 2Λ_{i,j,d}(15Γ₂² − 28Λ_{i,j,d}μΓ₃),
s = 2Λ²_{i,j,d}μ(22Λ_{i,j,d}μΓ₃ − 15Γ₂²),
t = 3Λ³_{i,j,d}μ²(3Γ₂² − 4Λ_{i,j,d}μΓ₃),

which yield p(B^eff_{i,j,d})³ + q(B^eff_{i,j,d})² + sB^eff_{i,j,d} + t > 0 for B^eff_{i,j,d} > Λ_{i,j,d}μ, i.e., for 1 − Λ_{i,j,d}μ/B^eff_{i,j,d} > 0, which was established above. Thus, G_{i,j,d} is also convex in B^eff_{i,j,d}.

Since ∂²f/∂(B^eff_{i,j,d})² is positive, we conclude that f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is convex in B^eff_{i,j,d} and thus in W_{i,j,d}, so the objective is convex in W_{i,j,d} as well. Similarly, when {z_{i,r_d}, π_{i,j,r_d}, S_r, W_{i,j,d}} are held constant, B^eff_{i,j,d} is again linear in {w_{i,j,d}}, and the above proof that f is convex in B^eff_{i,j,d} still applies; hence f(z_{i,r_d}, Λ_{i,j,d}, B^eff_{i,j,d}) is also convex in the intra-rack bandwidth allocation {w_{i,j,d}}. This completes the proof. □

Next, we consider the placement sub-problem, which minimizes average latency over {S_r} for fixed {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,j,d}}. In this problem, for each file r in class d we permute the set of racks that contain chunks of file r to obtain a new placement S'_r = {β(j), ∀j ∈ S_r}, where β(j) ∈ N is a permutation. The new probability of accessing file r from rack β(j) when the client is at rack i becomes π_{i,β(j),r_d}. Our objective is to find a permutation that minimizes the average service latency of class-d files, which can be solved via a matching problem between the set of scheduling probabilities {π_{i,j,r_d}, ∀i} and the racks, with respect to their load excluding the contribution of file r. Let Λ^{−r}_{i,j,d} = Λ_{i,j,d} − λ_{i,r_d} π_{i,j,r_d} be the total request rate of class-d files between racks i and j, excluding the contribution of file r. We define a complete bipartite graph G_r = (U, V, E) with disjoint vertex sets U, V of equal size N and edge weights given by

K_{i,j,d} = Σ_{i=1}^{N} Σ_{r in class d} [ λ_{i,r_d} z_{i,r_d}/λ_all + ((Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,j,r_d})/λ_all) f(z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,j,r_d}, W(w)_{i,j,d}) ],   (29)

where W(w)_{i,j,d} denotes the corresponding inter-rack (or intra-rack) bandwidth weight. It is easy to see that a minimum-weight matching on G_r finds β(j) ∀j to minimize

Σ_{j=1}^{N} K_{i,β(j),d} = Σ_{j=1}^{N} Σ_{i=1}^{N} Σ_{r in class d} λ_{i,r_d} z_{i,r_d}/λ_all + Σ_{j=1}^{N} Σ_{i=1}^{N} ((Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,β(j),r_d})/λ_all) f(z_{i,r_d}, Λ^{−r}_{i,j,d} + λ_{i,r_d}π_{i,β(j),r_d}, W(w)_{i,j,d}),   (30)

which is exactly the optimization objective of Problem JLWO when a chunk request of class d is scheduled with probability π_{i,β(j),r_d} to a rack with existing load Λ^{−r}_{i,j,d}.
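Since each edge weight K_{i,j,d} in (29) is just a number once the other variables are fixed, the placement step reduces to a standard assignment problem. A minimal sketch with an off-the-shelf Hungarian-method solver follows; the random cost matrix stands in for (29).

```python
# A sketch of the placement sub-problem as a balanced N x N assignment,
# solved by the Hungarian method via scipy. The random cost matrix stands
# in for the edge weights K_{i,j,d} of (29).
import numpy as np
from scipy.optimize import linear_sum_assignment

N = 10                                   # racks, as in our testbed
rng = np.random.default_rng(1)
K = rng.random((N, N))                   # K[j, jp]: cost of mapping j -> jp

rows, cols = linear_sum_assignment(K)    # minimum-weight perfect matching
beta = dict(zip(rows.tolist(), cols.tolist()))   # permutation beta(j)
print(beta, K[rows, cols].sum())         # matching and its cost, cf. (30)
```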

Lemma 5: (Bipartite matching equivalence of the placement sub-problem.) When {z_{i,r_d}, π_{i,j,r_d}, W_{i,j,d}, w_{i,j,d}} are fixed, the optimization of Problem JLWO over the placement variables S_r is equivalent to a balanced bipartite matching problem of size N.

Proof: Based on the above statements, the placement sub-problem with all other variables fixed is equivalent to finding a minimum-weight matching on a bipartite graph of size N whose edge weights K_{i,j,d} are defined in (29), since the minimum-weight matching minimizes the objective function of Problem JLWO. □

Our proposed algorithm, which solves Problem JLWO by iteratively solving the three sub-problems, is summarized in Algorithm JLWO. It generates a sequence of monotonically decreasing objective values and is therefore guaranteed to converge to a stationary point. Notice that the scheduling and bandwidth allocation sub-problems, as well as the minimization over z_{i,r_d} for each file, are convex and can be computed efficiently by any off-the-shelf convex solver, e.g., MOSEK [29]. The placement sub-problem is a balanced bipartite matching that can be solved by the Hungarian algorithm [30] in polynomial time.

Theorem 2: The proposed algorithm generates a sequence of monotonically decreasing objective values and is guaranteed to converge.

Proof: Algorithm JLWO works on the three sub-problems iteratively. Two of them (scheduling and inter/intra-rack weight assignment) are convex optimizations, which generate monotonically decreasing objective values, and the placement sub-problem is an integer optimization solved exactly by the Hungarian algorithm. Each iteration of Algorithm JLWO therefore solves three sub-problems that individually never increase the objective, and since latency is bounded from below, Algorithm JLWO is guaranteed to converge to a fixed point of Problem JLWO. □

Algorithm JLWO:
  Initialize t = 0 and ε > 0.
  Initialize feasible {z_{i,r_d}(0), π_{i,j,r_d}(0), S_r(0)}.
  Initialize O(0), O(−1) from (17); user input C_d.
  while |O(t) − O(t−1)| > ε:
    // Solve inter-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), w_{i,j,d}(t), S_r(t)}:
    W_{i,j,d}(t+1) = argmin_{W_{i,j,d}} (17) s.t. (18), (19), (23), (24).
    // Solve intra-rack bandwidth allocation for given {z_{i,r_d}(t), π_{i,j,r_d}(t), W_{i,j,d}(t+1), S_r(t)}:
    w_{i,j,d}(t+1) = argmin_{w_{i,j,d}} (17) s.t. (18), (19), (22), (25).
    // Solve scheduling for given {z_{i,r_d}(t), S_r(t), w_{i,j,d}(t+1), W_{i,j,d}(t+1)}:
    π_{i,j,r_d}(t+1) = argmin_{π_{i,j,r_d}} (17) s.t. (19), (20).
    // Solve placement for given {z_{i,r_d}(t), w_{i,j,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1)}:
    for d = 1, ..., D:
      for each file r in class d:
        Calculate Λ^{−r}_{i,j,d} using {π_{i,j,r_d}(t+1)}.
        Calculate K_{i,j,d} from (29).
        (β(j) ∀j ∈ N) = Hungarian Algorithm({K_{i,j,d}}).
        Update π_{i,β(j),r_d}(t+1) = π_{i,j,r_d}(t) ∀i, j.
        Initialize S_r(t+1) = {}.
        for j = 1, ..., N:
          if ∃i s.t. π_{i,j,r_d}(t+1) > 0, update S_r(t+1) = S_r(t+1) ∪ {j}.
    // Solve z_{i,r_d} for given {w_{i,j,d}(t+1), W_{i,j,d}(t+1), π_{i,j,r_d}(t+1), S_r(t+1)}:
    z_{i,r_d}(t+1) = argmin_{z_{i,r_d} ∈ R} (17).
    // Update the bound T̄_{i,r_d} and the objective value:
    O(t+1) = (17).
    t = t + 1.
  end while
  Output: {S_r(t), π_{i,j,r_d}(t), W_{i,j,d}(t), w_{i,j,d}(t), z_{i,r_d}(t)}
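Structurally, Algorithm JLWO is a block-coordinate descent; a minimal skeleton of its outer loop is sketched below. The solver callables are placeholders passed in by the caller: in our prototype the three convex steps (and the z update) would be handled by a convex solver such as MOSEK, and the placement step by the Hungarian matching sketched above.

```python
# A structural sketch of Algorithm JLWO's outer loop (block-coordinate
# descent). Each solver callable stands in for one sub-problem and must
# return a state whose objective is no worse than its input's, which is
# what yields the monotone convergence of Theorem 2.
def jlwo(state, objective, solvers, eps=0.01, max_iter=200):
    prev = objective(state)
    for t in range(1, max_iter + 1):
        for solve in solvers:       # order: W, w, pi, placement, then z
            state = solve(state)
        obj = objective(state)
        if prev - obj <= eps:       # monotone and bounded below
            return state, obj, t
        prev = obj
    return state, obj, max_iter

# e.g., solvers = [solve_W, solve_w, solve_pi, solve_placement, solve_z]
```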
V. IMPLEMENTATION AND EVALUATION

A. Tahoe Testbed

To validate the weighted queuing model for D different classes of files in our single-data-center system model and to evaluate its performance, we implemented the algorithms in Tahoe [38], an open-source, distributed file system based on the zfec erasure coding library. It provides three special instances of a generic node: (a) the Tahoe Introducer, which keeps track of a collection of storage servers and clients and introduces them to each other; (b) the Tahoe Storage Server, which exposes attached storage to external clients and stores erasure-coded shares; and (c) the Tahoe Client, which processes upload/download requests and connects to storage servers through a Web-based REST API and the Tahoe-LAFS (Least-Authority File System) storage protocol over SSL.

Fig. 1. Our Tahoe testbed with ten racks and distributed clients; each rack has 10 Tahoe storage servers.

Our experiments are conducted on a Tahoe testbed spanning 16 separate physical hosts in an OpenStack cluster, 10 of which are used in our experiments. We simulate each host as a rack in the cluster. Each host runs 11 VM instances, 10 of which run as Tahoe storage servers inside the rack and one as a client node; host 1 has 12 VM instances, the extra instance running as the Introducer of the Tahoe file system. We thereby effectively simulate an OpenStack cluster with 10 racks, each with 11 or 12 VM instances (see Fig. 1). The cluster uses a Cisco Catalyst 4948 switch with 48 ports, each supporting non-blocking 1 Gbps bandwidth in full duplex mode. Our model is general and applies to D classes of files; to simplify the implementation, our experiments consider only D = 2 classes, each with a different service rate, so the results remain representative of differentiated services in a single data center. The 2-class weighted-queuing model requires 2·N(N−1) = 180 queues for inter-rack traffic at the aggregate switch; however, bandwidth reservation through the ports of the switch is not possible, since this Cisco switch does not support the OpenFlow protocol, so we made pairwise bandwidth reservations (using a bandwidth control tool from our Cloud QoS platform) between the different Tahoe clients and Tahoe storage servers. The Tahoe Introducer node resides on rack 1, and each rack has a client node with multiple Tahoe ports that simulates multiple clients initiating requests for the different classes of files from rack i.

Tahoe is an erasure-coded distributed storage system with some unique properties that make it suitable for storage system experiments. In Tahoe, each file is encrypted and then broken into a set of segments, where each segment consists of k blocks. Each segment is then erasure-coded to produce n blocks (using an (n, k) encoding scheme) that are distributed to (ideally) n distinct storage servers, regardless of their placement, whether in the same rack or not. The set of storage blocks on each storage server constitutes a chunk; thus, the file equivalently consists of k chunks that are encoded into n chunks, each consisting of multiple blocks. (If there are not enough servers, Tahoe stores multiple chunks on one server. Also, the term "chunk" as used in this paper is equivalent to the term "share" in Tahoe terminology, and the number of blocks in each chunk equals the number of segments in the file.) For chunk placement, the Tahoe client randomly selects a set of available storage servers with enough storage space to store the n chunks. For server selection during file retrieval, the client first asks all known servers for the storage chunks they might have, again regardless of which racks the servers reside in. Once it knows where to find the needed k chunks (from among the fastest servers, chosen with a pseudo-random algorithm), it downloads at least the first segment from those servers. It thus tends to download chunks from the "fastest" servers purely based on round-trip times (RTT), whereas we consider RTT plus expected queuing delay and transfer delay as the measure of latency.

We had to make several changes to Tahoe in order to conduct our experiments. First, we need the number of racks N ≥ n to meet the system requirement that each rack hold at most one chunk of a given file. In addition, Tahoe has its own rules for chunk placement and request scheduling, while our experiments require client-defined chunk placement across racks and our own server selection algorithms for placement and request scheduling so as to minimize the joint latency of both file classes; we therefore modified the upload and download modules of the Tahoe storage server and client to allow customized and explicit server selection for both chunk placement and retrieval, specified in a configuration file that the client reads at startup. Finally, Tahoe's performance suffers from its single-threaded design on the client side, so we had to use multiple clients (multiple threads on one client node) in each rack, with separate ports, to improve parallelism and bandwidth usage during our experiments.

B. Basic Experiment Setup

We use a (7,4) erasure code throughout the experiments described in this section. The algorithm first calculates the optimal chunk placement across racks for each class of files, which is set up in the client configuration file for each write request. File retrieval request scheduling and the per-class weight assignments for inter-rack traffic also come from Algorithm JLWO. The system calls a bandwidth reservation tool to reserve the assigned bandwidth B·W_{i,j,d} for each class of files based on the optimal weights of each inter-rack pair; the same is done for the intra-rack bandwidth reservations b^eff_{i,j} w_{i,j,d} of each class. The total bandwidth capacity at the aggregate switch is 96 Gbps (48 ports with 1 Gbps in each direction), since we simulate host machines as racks and VMs as storage servers. The intra-rack bandwidth measured with iPerf is 792 Mbps, the disk read bandwidth for sequential workloads is 354 Mbps, and the write bandwidth is 127 Mbps. Requests are generated based on the arrival rates at each rack and are submitted from the client nodes at all racks.
The average inter-rack bandwidth over all racks been applied as part of our JLWO algorithm, was proven to for request of class 1 files is 674 Mbps and the intra/inter- converge in Theorem 2 of [28]. In this paper, fisrt, we see the rack bandwidth ratio is 889Mbps/674Mbps = 1.318. The convergence of the proposed optimized queuing algorithms in average inter-rack bandwidth over all racks for request of Fig. 2. By performing similar speedup techniques as in [28], class 2 files is 684 Mbps and the intra/inter-rack bandwidth our algorithms efficiently solve the optimization problem with ratio is 632Mbps/389Mbps = 1.625. Figure 3 depicts the r = 1000 files of each class at each of the 10 racks. Second, Cumulative Distribution Function (CDF) of the chunk service we can also see that as class 2 becomes as important as the time for both intra-rack and inter-rack traffic. We note that class 1, the algorithm converges faster as it’s actually solving intra-rack requests for class 1 have a mean chunk service time the problem for one single class of files. It is observed that of 22 sec and inter-rack chunk requests for class 1 have a the normalized objective converges within 140 iterations for a mean chunk service time of 30 sec, which is a ratio of 1.364 tolerance  = 0.01, where each iteration has an average run which is very close to the bandwidth ratio of 1.318. Also intra- time of 1.19 sec, when running on an 8-core, 64-X86 machine, rack requests for class 2 have a mean chunk service time of therefore the algorithm converges within 2.77 min on average 50 sec and inter-rack chunk requests for class 2 have a mean from observation. To achieve dynamic file management, our chunk service time of 83 sec, which is a ratio of 1.66 which optimization algorithm can be executed repeatedly upon file is very close to the bandwidth ratio of 1.625.This means the arrivals and departures. chunk service time is nearly proportional to the bandwidth reservation on inter/intra-rack traffic for both classes of files. Class 1 Class 2 Validate algorithms and joint optimization. In order to 400 validate that Algorithm JLWO works for our system model

350 (with weighted queuing model for 2 classes of files for inter- rack traffic at the switch), we compare the average latency 300 performance for the two file classes in our model in the cases 250 of different access patterns. In this experiment, we have 1000

200 files for each class, all files have the same file size 200MB, and aggregate arrival rate for both classes from all racks is 150 0.25/sec. i.e., there is one request for each unique file, so Average Latency (Sec) 100 the ratio of requests for the two classes are 1:1, and we set C = 1 and C = 0.4. We are measuring the latency 50 1 2 for the two classes of the following access patterns: 100% 0 concentration means 100% of the requests for both classes of 0 0.5 1 1.5 2 C2 files concentrate on only one of the racks. Similarly, 80% or 50% concentration means 80% or 50% of the total requests Fig. 4. file requests for different files of size 100MB, aggregate request arrival of each file come from one rack, the rest of the requests rate for both classes is 0.25/sec; varying C2 to validate our algorithms. spread uniformly among other racks. Uniform access means Validate Experiment Setup. While our service delay bound that the requests for both classes are uniformly distributed applies to arbitrary distribution and works for systems hosting across all racks. We compare average latency for these requests any number of files, we first run an experiment to understand for the two classes in each case of the access patterns with Fig. 3. Actual service time distribution of chunk retrieval Fig. 2. Convergence of Algorithm JLWO with r=1000 requests for through intra-rack and inter-rack traffic for both classes. Each each of the two file classes for heterogeneous files from each rack of them has 1000 files of size 100MB using erasure code on our 100-node testbed, with different weights on the second file (7,4) with the aggregate request arrival rate set to λi = 0.25 class. Algorithm JLWO efficiently compute the solution in 140 /sec. Service time distributions indicated that chunk service time iterations in both cases. of the two classes are nearly proportional to the bandwidth reservation based on weight assignments for M/G/1 queues.

Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound 800 350

700 300 600 250 500 200 400

300 150 Average Latency (Sec) Average Latency (Sec) 200 100

100 50

0 100% 80% 50% Uniform 0 Request Access Concentraon (Percentage On One Track) 50 100 150 200 250 File Size (MB)

Fig. 5. Comparison of average latency with different access Fig. 6. Evaluation of different file sizes in our weighted queuing patterns. Experiment is set up for 1000 heterogeneous files for model for both class 1 and class 2 files with C1 = 1 and C2 = 0.4. each class, there is one request for each file and 2000 file Aggregate rate for all file requests at 0.25/sec. Compared with requests in total. Ratio for each class is 1:1. The figure shows class 2 files, our algorithm provides class 1 files relatively lower the percentage that these 2000 requests are concentrated on the latency with heterogeneous file sizes. Latency increases as file same rack. Aggregate arrival rate 0.25/sec, file size 200M. Latency size increases in both classes. Our analytic latency bound taking improved significantly with weighted queuing. Analytic bound for both network and queuing delay into account tightly follows actual both classes tightly follows actual latency as well. service latency for both classes. our weighted queuing model. classes becomes less as the requests becomes more distributed, which is because our algorithm JLWO tentatively assigns As shown in Fig 5, experimental results indicate that our more bandwidth to class 2 requests when the system is less weighted queuing model can effectively mitigate the long congested than when it’s highly congested, which matches our latency due to congestion at one rack for both class 1 and class goal to minimize weighted average latency for both classes of 2, as there is not a significant increase in average latency for files. We also calculated our analytic bound of average latency both classes as the access pattern becomes more concentrated for the two classes when the above access patterns are applied. on one rack. Also we can see that class 1 has relatively lower From the figure we can also see that our analytic bound is tight latency than class 2 files since algorithm JLWO effectively enough to follow the actual average latency for both class 1 assigned more bandwidth to requests for class 1 files since and class 2 files. C1 has a larger value which means class 1 files play a more important role in the optimization objective. It can also be In order to further validate that the algorithms work for observed that the difference in average latency for the two our weighted queuing models for the two classes, we choose Class 1, Experiment Class 2, Experiment Class 1, Bound Class 2, Bound of files from the clients distributed among the racks. Results in 600 Fig 6 show that our algorithm provides class 1 files relatively lower latency with heterogeneous file sizes compared to that 500 of class 2 files, which means algorithm JLWO effectively assigned more bandwidth to requests for class 1 files since 400 C1 has a larger value which means class 1 files plays a more important role in the optimization objective. For instance, 300 latency of class 1 files has a 30% improvement on average over that of class 2 files for the 5 sample file sizes in this 200 Average Latency (Sec) experiment. Latency increases as requested file size increases for both classes when arrival rates are set to be the same. 100 Since larger file size means longer service time, it increases queuing delay and thus average latency. We also observe 0 0.25 0.3 0.35 0.4 0.45 that our analytic latency bound follows actual average service Aggregate Request Arrival Rate (/Sec) latency for both classes of files in this experiment. We note Fig. 7. 
Evaluation of latency performance for both classes of files that the actual service latency involves other aspects of delay with different request arrival rate in weighted queuing. File size beyond queuing delay, and the results show that optimizing 200M, with C1 = 1 and C2 = 0.4. Compared with class 2 files, the metric of the proposed latency upper bound improves the our algorithm provides class 1 files relatively lower latency with heterogeneous request arrival rates. Latency increases as requests actual latency with the weighted queuing models. arrive more frequently for both classes. Our analytic latency bound Similarly, we vary the aggregate arrival rate at each rack taking both network and queuing delay into account tightly follows from 0.25/sec to 0.45/sec. This time we fix all file requests actual service latency for both classes. for file size 200MB. All other settings of this experiment are the same as the previous one whose result is shown in Fig 6. r = 1000 files of size 100MB for each class, and aggregate In this experiment we also compare the latency of requests for arrival rate 0.25/sec for all file requests. the ratio of requests the two file classes when we have different weights of their for the two classes are 1:1. We fix C1 = 1 and vary C2 latency in optimization objective. (i.e., C1 = 1 and C2 = 0.4). from 0 to 2 and run Algorithm JLWO to generate the optimal We assume the same uniform access as before. For weighted solution for both file classes. The algorithms provide chunk queuing model, we use optimized server selection for upload placement, request scheduling and inter/intra-rack weight as- and download for each file request, and optimal inter/intra- signment (Wi,j,d and wi,j,d). From Fig. 4 we can see latency rack bandwidth reservation from Algorithm JLWO. Clients of class 2 files increases as C2 decreases; i.e., when class 2 across the data center submit r = 1000 requests for each file becomes less important. Also, the average latency of class 1 class with an aggregate arrival rate varying from 0.25/sec to requests decreases as C2 decreases. This shows the expected 0.45/sec. From Fig 7 we can see that our algorithm provides result that when class 2 becomes more important, more weight class 1 files relatively lower latency with heterogeneous file is allocated to class 2, and since C2 is always smaller than C1, sizes compared to that of class 2 files, which means algorithm class 1 gets more bandwidth. This also explains why when JLWO effectively assigned more bandwidth to requests for C1 = C2 = 1 the average latency of the two classes are class 1 files. For instance, latency of class 1 files has a 32% almost equal as shown in Fig. 4. It can also be observed that improvement on average over that of class 2 files for the when C2 = 0 the system is only minimizing the latency of 5 sample aggregate arrival rate in this experiment. Further, class 1 files, and thus class 2 files have a significantly large as the arrival rate increases and there is more contention latency on average. The difference in latency between class at the queues, this improvement becomes more significant. 1 and class 2 become less when C2 = 2 than when C2 = 0 Thus our algorithm can mitigate traffic contention and reduce since in the former case algorithm JLWO is optimizing class latency very efficiently. Also the average latency increases as 1 files as well, though at less importance with C1 = 1. 
VI. CONCLUSIONS

This paper considers an optimized erasure-coded storage solution for a single data center with differentiated services. The weighted latency of all file requests is optimized over the placement and access of the erasure-coded contents of each file and over the bandwidth reservation at the switches for the different service classes. The proposed algorithm achieves a significant reduction in average latency by exploiting knowledge of the file-access patterns. This paper thus demonstrates that knowing the access probabilities of different files from different racks enables an optimized differentiated storage solution that performs significantly better than one which ignores the file access patterns.