Low Latency Geo-Distributed Data Analytics
Qifan Pu (1,2), Ganesh Ananthanarayanan (1), Peter Bodik (1), Srikanth Kandula (1), Aditya Akella (3), Paramvir Bahl (1), Ion Stoica (2)
(1) Microsoft Research  (2) University of California at Berkeley  (3) University of Wisconsin at Madison

ABSTRACT

Low latency analytics on geographically distributed datasets (across datacenters, edge clusters) is an upcoming and increasingly important challenge. The dominant approach of aggregating all the data to a single datacenter significantly inflates the timeliness of analytics. At the same time, running queries over geo-distributed inputs using the current intra-DC analytics frameworks also leads to high query response times, because these frameworks cannot cope with the relatively low and variable capacity of WAN links.

We present Iridium, a system for low latency geo-distributed analytics. Iridium achieves low query response times by optimizing placement of both the data and the tasks of queries. The joint data and task placement optimization, however, is intractable. Therefore, Iridium uses an online heuristic to redistribute datasets among the sites prior to queries' arrivals, and places the tasks to reduce network bottlenecks during the query's execution. Finally, it also contains a knob to budget WAN usage. Evaluation across eight worldwide EC2 regions using production queries shows that Iridium speeds up queries by 3x-19x and lowers WAN usage by 15%-64% compared to existing baselines.

CCS Concepts

• Computer systems organization → Distributed Architectures; • Networks → Cloud Computing

Keywords

geo-distributed; low latency; data analytics; network aware; WAN analytics

1. INTRODUCTION

Large scale cloud organizations are deploying datacenters and "edge" clusters globally to provide their users low latency access to their services. For instance, Microsoft and Google have tens of datacenters (DCs) [6, 11], with the latter also operating 1500 edges worldwide [24]. The services deployed on these geo-distributed sites continuously produce large volumes of data such as user activity and session logs, server monitoring logs, and performance counters [34, 46, 53, 56].

Analyzing the geo-distributed data gathered across these sites is an important workload. Examples of such analyses include querying user logs to make advertisement decisions, querying network logs to detect DoS attacks, and querying system logs to maintain (streaming) dashboards of overall cluster health, perform root-cause diagnosis, and build fault prediction models. Because the results of these analytics queries are used by data analysts, operators, and real-time decision algorithms, minimizing their response times is crucial.

Minimizing query response times in a geo-distributed setting, however, is far from trivial. The widely-used approach is to aggregate all the datasets to a central site (a large DC) before executing the queries. However, waiting for such centralized aggregation significantly delays the timeliness of the analytics (by as much as 19x in our experiments).[1] Therefore, the natural alternative is to execute the queries geo-distributedly over the data stored at the sites.

[1] An enhancement could "sample" data locally and send only a small fraction [46]. Designing generic samplers, unfortunately, is hard. Sampling also limits future analyses.

Additionally, regulatory and privacy concerns might also forbid central aggregation [42]. Finally, verbose or less valuable data (e.g., detailed system logs stored only for a few days) are not shipped at all, as this is deemed too expensive. Low response time for queries on these datasets, nonetheless, remains a highly desirable goal.

Our work focuses on minimizing response times of geo-distributed analytics queries. A potential approach would be to leave data in place and use an unmodified intra-DC analytics framework (such as Hadoop or Spark) across the collection of sites. However, WAN bandwidths can be highly heterogeneous and relatively moderate [43, 47, 48], in sharp contrast to intra-DC networks. Because these frameworks are not optimized for such heterogeneity, query execution could be dramatically inefficient. Consider, for example, a simple map-reduce query executing across sites. If we place no (or very few) reduce tasks on a site that has a large amount of intermediate data but low uplink bandwidth, all (or a large fraction) of the data on this site would have to be uploaded to other sites over its narrow uplink, significantly affecting query response time.

We build Iridium, a system targeted at geo-distributed data analytics. Iridium views a single logical analytics framework as being deployed across all the sites. To achieve low query response times, it explicitly considers the heterogeneous WAN bandwidths to optimize data and task placement. These two placement aspects are central to our system, since the source and destination of a network transfer depend on the locations of the data and the tasks, respectively. Intuitively, in the example above, Iridium will either move data out of the site with low uplink bandwidth before the query arrives, or place many of the query's reduce tasks in it.

Because the duration of an intermediate communication (e.g., a shuffle) is determined by the site with the slowest data transfer, the key intuition in Iridium is to balance the transfer times among the WAN links, thereby avoiding outliers. To that end, we formulate the task placement problem as a linear program (LP) by modeling the site bandwidths and query characteristics. The best task placement, however, is still limited by input data locations. Therefore, moving (or replicating) the datasets to different sites can reduce the anticipated congestion during query execution.
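The exact LP appears later in the paper; the following is a rough, self-contained sketch of the idea under assumptions of our own. For one map-reduce query, let S[i] be the intermediate (map output) data at site i, B_up[i] and B_down[i] the site's uplink and downlink bandwidths, and r[i] the fraction of reduce tasks placed at site i. Site i then uploads S[i]*(1 - r[i]) and downloads r[i]*(sum_j S[j] - S[i]) during the shuffle; minimizing the slowest of these transfers is a min-max LP, encoded below in epigraph form. All names are illustrative, not taken from the paper's code.

    import numpy as np
    from scipy.optimize import linprog

    def place_reduce_tasks(S, B_up, B_down):
        # Variables: [r_0, ..., r_{n-1}, z]; minimize z, the slowest
        # per-site shuffle transfer time.
        n, total = len(S), sum(S)
        c = np.zeros(n + 1)
        c[-1] = 1.0
        A_ub, b_ub = [], []
        for i in range(n):
            # Upload time at site i: S[i]*(1 - r[i]) / B_up[i] <= z
            up = np.zeros(n + 1)
            up[i], up[-1] = -S[i] / B_up[i], -1.0
            A_ub.append(up)
            b_ub.append(-S[i] / B_up[i])
            # Download time at site i: r[i]*(total - S[i]) / B_down[i] <= z
            down = np.zeros(n + 1)
            down[i], down[-1] = (total - S[i]) / B_down[i], -1.0
            A_ub.append(down)
            b_ub.append(0.0)
        A_eq = np.ones((1, n + 1))
        A_eq[0, -1] = 0.0  # reduce fractions sum to 1; z excluded
        res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n + [(0, None)])
        return res.x[:n], res.x[-1]  # placement fractions r, bottleneck time z

Consistent with the earlier example, a site holding most of the intermediate data behind a thin uplink receives a large reduce fraction, so little of its data has to cross that narrow link.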
The joint data and task placement, even for a single map-reduce query, results in a non-convex optimization with no efficient solution. Hence, we devise an efficient, greedy heuristic that iteratively moves small chunks of datasets to "better" sites. To determine which datasets to move, we prefer those with high value-per-byte; i.e., we greedily maximize the expected reduction in query response time normalized by the amount of data that needs to be moved to achieve this reduction. This heuristic, for example, prefers moving datasets that are accessed by many queries and/or whose queries produce large amounts of intermediate data.
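The heuristic is detailed later in the paper; below is a minimal sketch of the greedy loop under simplified assumptions of our own. Each dataset's layout is a dict {site: bytes}, and est_response_time is a caller-supplied estimator of the total response time of the anticipated queries on a given layout (e.g., obtained by solving the task-placement LP per query). Moves are trialed in fixed-size chunks; all names are illustrative.

    def greedy_placement(datasets, sites, est_response_time, chunk=1 << 30):
        # Value-per-byte greedy loop: repeatedly apply whichever chunk move
        # yields the largest drop in estimated response time per byte moved.
        while True:
            base = est_response_time(datasets)
            best, best_score = None, 0.0
            for name, layout in datasets.items():
                for src in sites:
                    if layout.get(src, 0) <= 0:
                        continue
                    moved = min(chunk, layout[src])
                    for dst in sites:
                        if dst == src:
                            continue
                        layout[src] -= moved                # trial move
                        layout[dst] = layout.get(dst, 0) + moved
                        gain = base - est_response_time(datasets)
                        layout[dst] -= moved                # undo trial
                        layout[src] += moved
                        if gain / moved > best_score:
                            best = (name, src, dst, moved)
                            best_score = gain / moved
            if best is None:
                return datasets                             # no helpful move left
            name, src, dst, moved = best
            datasets[name][src] -= moved                    # commit best move
            datasets[name][dst] = datasets[name].get(dst, 0) + moved

Because each iteration keeps only the single best-scoring move, a dataset read by many queries (or one producing much intermediate data) naturally scores higher: the same bytes moved reduce the response time of more work.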
Our solution is also mindful of the bytes transferred on the WAN across sites, since WAN usage has important cost implications ($/byte) [53]. Purely minimizing query response time could result in increased WAN usage. Even worse, purely optimizing WAN usage can arbitrarily increase query latency. This is because of a fundamental difference between the two metrics: bandwidth cost savings are obtained by reducing WAN usage on any of the links, whereas query speedups are obtained by reducing WAN usage only on the bottleneck link. Thus, to ensure fast query responses and reasonable bandwidth costs, we incorporate a simple "knob" that trades off WAN usage and latency by limiting the amount of WAN bandwidth used by data moves and task execution. In our experiments, with a budget equal to that of a WAN-usage-optimal scheme (proposed in [53, 54]), Iridium obtains 2x faster query responses.

Our implementation of Iridium automatically estimates site bandwidths and future query arrivals along with their characteristics (intermediate data), and prioritizes data movement for the earlier-arriving queries. It also supports Apache Spark queries, both streaming [60] as well as interactive/batch [59].[2]

[2] https://github.com/Microsoft-MNR/GDA

Evaluation across eight worldwide EC2 regions and trace-driven simulations using production queries from Bing Edge, Conviva, Facebook, TPC-DS, and the Big-data benchmark show that Iridium speeds up queries by 3x-19x compared to existing baselines that (a) centrally aggregate the data, or (b) leave the data in place and use unmodified Spark.

2. BACKGROUND AND MOTIVATION

We explain the setup of geo-distributed analytics (§2.1), illustrate the importance of careful scheduling and storage (§2.2), and provide an overview of our solution (§2.3).

2.1 Geo-distributed Analytics

Architecture: We consider the geo-distributed analytics framework to logically span all the sites. We assume that the sites are connected by a network with a congestion-free core: the bottlenecks are only between the sites and the core, which itself has infinite bandwidth. This assumption is valid as per recent measurements [13]. Additionally, there could be significant heterogeneity in the uplink and downlink bandwidths due to widely different link capacities and to other applications (non-Iridium traffic) sharing the links. Finally, we assume the sites have relatively abundant compute and storage capacity; a sketch of the resulting transfer-time model appears at the end of this section.

[Figure 1 (diagram omitted): Geo-distributed map-reduce query. The user submits the query in San Francisco, and the query runs across Boston, Bangalore and Beijing. We also show the notations used in the paper at Bangalore; see Table 1.]

Data can be generated on any site and, as such, a dataset (such as "user activity log for application X") could be distributed across many sites.
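To make the network model above concrete, here is a minimal sketch (our simplification, not code from the paper): with a congestion-free core, a single inter-site copy is constrained only by the source uplink and the destination downlink, so access-link heterogeneity directly drives transfer times. The sketch ignores other flows sharing the two links.

    def transfer_time(size_bits, src_uplink_bps, dst_downlink_bps):
        # Congestion-free core: only the two access links can bottleneck
        # a site-to-site transfer.
        return size_bits / min(src_uplink_bps, dst_downlink_bps)

    # Example: shipping 50 GB (4e11 bits) from an edge with a 1 Gbps uplink
    # to a DC with a 10 Gbps downlink is uplink-bound: 400 seconds.
    assert transfer_time(4e11, 1e9, 10e9) == 400.0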