arXiv:1802.00245v3 [cs.DC] 7 Feb 2018

Towards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System

Xiaoda Zhang, Zhuzhong Qian, Sheng Zhang, Yize Li, Xiangbo Li, Xiaoliang Wang, Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University

Abstract

Geo-distributed data analytics is increasingly common for deriving useful information in large organisations. Naive extension of existing cluster-scale data analytics systems to the scale of geo-distributed data centers faces unique challenges including WAN bandwidth limits, regulatory constraints, a changeable/unreliable runtime environment, and high monetary costs. Our goal in this work is to develop a practical geo-distributed data analytics system that (1) employs an intelligent mechanism for jobs to efficiently utilize (and adjust to) the resources (changeable environment) across data centers; (2) guarantees the reliability of jobs despite possible failures; and (3) is generic and flexible enough to run a wide range of data analytics jobs without requiring any changes.

To this end, we present a new, general geo-distributed data analytics system, HOUTU, that is composed of multiple autonomous systems, each operating in a sovereign data center. HOUTU maintains a job manager (JM) for a geo-distributed job in each data center, so that these replicated JMs can individually and cooperatively manage resources and assign tasks. Our experiments on the prototype of HOUTU running across four Alibaba Cloud regions show that HOUTU provides efficient job performance as in the existing centralized architecture, and guarantees reliable job executions when facing failures.

1 Introduction

Nowadays, organizations are deploying their applications in multiple data centers around the world to meet latency-sensitive requirements [25, 39, 45, 48]. As a result, raw data – including user interaction logging, job traces and infrastructure monitoring data – is generated at geographically distributed data centers. Analytics jobs on these geo-distributed data are emerging as a daily requirement [13, 28, 21, 38, 36, 44, 12, 46, 27, 19].

Because these analytics jobs usually support real-time decisions and online predictions, minimizing response time and maximizing throughput are important. However, they face the unique challenges of wide area network (WAN) bandwidth limits, legislative and regulatory constraints, an unreliable environment, and even monetary costs.

Existing approaches optimize task and/or data placement across data centers so as to improve data locality [34, 45, 28, 26, 38, 32, 44, 46]. However, all previous works employ a centralized architecture where a monolithic master controls the resources of the worker machines from all data centers, as shown in Fig. 1(a). We argue that regulatory constraints prevent us from doing so. More and more regions are establishing laws to restrict the data movement [6, 10, 41, 18] and to restrict IT resources from being controlled by other untrusted parties in the shared environment (§2.1). Hence, this restriction does not allow a job to acquire resources from remote data centers. An alternative way is to deploy an autonomous data analytics system per data center (Fig. 1(b)), and extend the original system functionalities to coordinate geo-distributed job executions. We explore the potentialities of this decentralized architecture and make it possible.

In addition, most existing works assume that the WAN bandwidth is stable. This may not accurately conform to the reality [26, 32, 38, 44], and our experiments verify that the data transmission rate across data centers varies even in a short period (§2.2). Hence, we cannot explicitly formulate WAN bandwidth as a constant.

[Figure 1 diagram: (a) a single master controlling workers in DC 1–3 over the WAN; (b) each DC running its own master and workers, cooperating over the WAN.]
Figure 1: Centralized vs. decentralized data analytics. (a) Centralized architecture; (b) Decentralized architecture.

On the other hand, for most organizations who have the geo-distributed data analytics requirement, the most convenient way is to purchase public cloud instances. Decisions must be made between choosing reliable (Reserved and On-demand) instances and unreliable (Spot) instances, due to the different monetary costs and job reliability demands. Spot market prices are often significantly lower – by up to an order of magnitude – than fixed prices for the same instances with a reliability Service Level Agreement (SLA) (§2.3). However, is it possible for cloud users to obtain reliability from unreliable instances at a reduced cost? There are positive answers based on designing user bidding mechanisms [47, 53], while we answer this question in a systematic way, by providing job-level fault tolerance.

Our goal in this new decentralized and changeable/unreliable environment is to design new resource management, task scheduling and fault tolerance strategies to achieve reliable and efficient job executions.

To achieve this goal, such a system needs to address three key challenges. First, we need to find an efficient scheduling strategy that can dynamically adapt scheduling decisions to the changeable environment. This is difficult because we do not assume job characteristics as a priori knowledge [33], nor use offline analysis [43], for its significant overhead. Second, we need to implement a fault tolerance mechanism for jobs running atop unreliable Spot instances. Though existing frameworks [20, 30, 50] tolerate task-level failures, job-level fault tolerance is absent, while in the unreliable setting the two types of failures have the same chance to occur. Third, we need to design a general system that efficiently handles geo-distributed job executions without requiring any job description changes. This is challenging because data can disperse among sovereign domains (data centers) with regulatory constraints.

In this work, we present HOUTU¹, a new general geo-distributed data analytics system that is designed to efficiently operate over a collection of data centers. The key idea of HOUTU is to maintain a job manager (JM) for the geo-distributed job in each data center; each JM can individually assign tasks within its own data center, and also cooperatively assign tasks between data centers. This differentiation allows HOUTU to run conventional task assignment algorithms within a data center [49, 33, 52]. At the same time, across different data centers, HOUTU employs a new work stealing method, converting task steals to node update events that respect the data locality constraints.

For resource management, we classify three cases where each job manager independently either requests more resources, maintains current resources, or proactively releases some resources. The key insight here is using the near-past resource utilization as feedback, irrespective of any prediction of future job characteristics. Even without the future job characteristics, when cooperating with our new task assignment method, we theoretically prove (under some conditions) the efficiency of job executions by extending the very recent result [52] (§4.4).

Each replicated JM keeps track of the current progress of the job execution. We carefully design what needs to be included in the intermediate information, which can be used to successfully recover the failure of even the primary JM.

We build HOUTU on Spark [50] and YARN [42], and leverage Zookeeper [29] to keep the intermediate information consistent among job managers in different data centers. We deploy HOUTU across four regions on Alibaba Cloud (AliCloud). Our evaluation with typical workloads including TPC-H and machine learning algorithms shows that HOUTU: (1) achieves efficient job performance as in the centralized architecture; (2) guarantees reliable job executions when facing job failures; and (3) is very effective in reducing monetary costs.

We make three major contributions:

• We present a general decentralized data analytics system to respect the possible regulatory constraints and changeable/unreliable runtime environment. The key idea is to provide a job manager for a geo-distributed job in each data center. The system is general and flexible enough to deploy a wide range of data analytics jobs while requiring no change to the jobs themselves (§3.1).

• We propose the resource management strategy Af for each JM, which exploits resource utilization as feedback. We design the task assignment method Parades, which combines the assignment within and between data centers. We prove that Af + Parades guarantees efficiency for geo-distributed jobs with respect to makespan (§4). We carefully design the mechanism of coordinating JMs, and the intermediate information to recover a failure (§3.2).

• We build a prototype of our proposed system using Spark, YARN and Zookeeper as building blocks, and demonstrate its efficiency over four geo-distributed regions with typical diverse workloads (§5 and §6).

We show that HOUTU provides efficient and reliable job executions, and significantly reduces the costs for running these jobs.

¹HOUTU is the deity of deep earth in ancient China who controls lands from all regions.

2 Background and Motivation

This section motivates and provides background for HOUTU. §2.1 describes the existing and upcoming regulatory constraints which prevent us from employing a centralized architecture. We measure the scarce and changeable WAN bandwidth between AliCloud regions in §2.2. We investigate a way to reduce monetary cost using Spot instances in §2.3, which introduces unreliability.

        NC-3       NC-5       EC-1       SC-1
NC-3    (821, 95)  (79, 22)   (78, 24)   (79, 24)
NC-5    --         (820, 115) (103, 28)  (71, 28)
EC-1    --         --         (848, 99)  (103, 30)
SC-1    --         --         --         (821, 107)

Figure 2: Measured network bandwidth between four different regions in AliCloud. Each entry is of the form (Average, Standard deviation) in Mbps.
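To make the variability concrete, the coefficient of variation (standard deviation over mean) implied by the entries of Fig. 2 can be computed directly. This is a small sketch; the pairs are transcribed from the figure:

```python
# Coefficient of variation (stddev / mean) implied by Fig. 2's entries,
# which are (average, standard deviation) pairs in Mbps.
bandwidth = {
    ("NC-3", "NC-3"): (821, 95),  ("NC-3", "NC-5"): (79, 22),
    ("NC-3", "EC-1"): (78, 24),   ("NC-3", "SC-1"): (79, 24),
    ("NC-5", "NC-5"): (820, 115), ("NC-5", "EC-1"): (103, 28),
    ("NC-5", "SC-1"): (71, 28),   ("EC-1", "EC-1"): (848, 99),
    ("EC-1", "SC-1"): (103, 30),  ("SC-1", "SC-1"): (821, 107),
}

# Cross-region (WAN) pairs only: the deviation is a large share of the mean.
wan_cv = {p: std / avg for p, (avg, std) in bandwidth.items() if p[0] != p[1]}
worst_pair = max(wan_cv, key=wan_cv.get)
print(worst_pair, round(wan_cv[worst_pair], 2))  # ('NC-5', 'SC-1') 0.39
```

Every cross-region pair shows a deviation above 20% of its mean, which is why the text below treats transmission time as unpredictable.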

2.1 Regulatory constraints

Though it is efficient to employ data analytics systems in clouds, many organisations still decline to widely adopt cloud services due to severe confidentiality and privacy concerns [15], and explicit regulations in certain sectors (healthcare and finance) [14]. Local governments have started to impose constraints on raw data storage and movement [6, 10, 41]. These constraints exclude the solutions that move arbitrary raw data between data centers [38, 44].

Public clouds allow users to instantiate virtual machines (instances) on demand. In turn, the use of virtualization allows third-party cloud providers to maximize the utilization of their sunk capital costs by multiplexing many customer VMs across a shared physical infrastructure. However, this approach introduces new vulnerabilities. It is possible to map the internal cloud infrastructure, identify where a particular target VM is likely to reside, and then instantiate new VMs until one is placed co-resident with the target, which can then be used to mount cross-VM side-channel attacks to extract information from the target VM [40, 18]. The attack amplifier turns this initial compromise of a host into a platform for launching a broad, cloud-wide attack [17].

Hence, cloud providers and existing works are proposing solutions in which a group of instances have their external connectivity restricted according to a declared policy, as a defense against information leakage [51, 3, 2]. As a result, these upcoming regulatory constraints lead to deploying an autonomous system in each data center, which contains a complete stack of data analytics software.

By following exactly this guideline, we propose a decentralized architecture (Fig. 1(b)) and design how resource management and task scheduling should be performed to support geo-distributed job executions. We speculate that derived information, such as aggregates and reports (which are critical for business intelligence but have less dramatic privacy implications), may still be allowed to cross geographical boundaries.

2.2 Changeable environment

It is well known that WAN bandwidth is a very scarce resource relative to LAN bandwidth. To quantify the WAN bandwidth between data centers, we measure the network bandwidth between all pairs of four AliCloud regions: NorthChina-3 (NC-3), NorthChina-5 (NC-5), EastChina (EC-1), and SouthChina-1 (SC-1). We measure the network bandwidth of each pair of different regions for three rounds, each for 5 minutes. As shown in Fig. 2, the bandwidth within a data center is around 820 Mbps, while it is around 100 Mbps between data centers.

What we emphasize is that the WAN bandwidth varies between different regions even within a small period. The standard deviation can be as much as 30% of the available WAN bandwidth itself. The fluctuating bandwidth makes data transmission time unpredictable [26, 32].

Furthermore, it may not always be the same resource – WAN bandwidth – that causes runtime performance bottlenecks in wide-area data analytics queries. It is confirmed that memory may also become the bottleneck at runtime [46]; thus these uncertainties do not allow us to assume the capacities of resources (e.g. network, compute) as constants in mathematical programming [38, 44, 45]. We design intelligent mechanisms that can make online scheduling decisions in the changeable environment.

2.3 Spot instance: towards reducing cost

Cloud computing providers may offer different SLAs at different prices so that users can control the value transaction at a fine level of granularity. Besides offering reliable (Reserved and On-demand) instances, cloud providers such as Google Cloud Platform (GCP) [7], Amazon EC2 [5], Microsoft Azure [9] and Alibaba Cloud [1] also offer "Spot instances"², where resources come at a cheaper price but without a reliability SLA. When a user makes a request

²We use this term from EC2, while it is called "preemptible VM" in GCP, "bidding instance" in AliCloud, and "low-priority VM" in Azure.

for a Spot instance having a specific set of characteristics (e.g. ⟨4 vCPU, 16 GB memory⟩), he/she includes a maximum bid price indicating the maximum that the user is willing to be charged for the instance. Cloud providers create a market for each instance type and satisfy the requests of the highest bidders. Periodically, cloud providers recalculate the market price and terminate those instances whose maximum bid is below the new market price. Because the Spot instance market mechanism does not provide a way to guarantee how long an instance will run before it is terminated as part of an SLA, Spot market prices are often significantly lower than fixed prices for the same instances with a reliability SLA (up to 10x lower than the On-demand price, and 3x lower than the Reserved price, as shown in Fig. 3).

              Reserved     On-demand    Spot
              (per year)   (per hour)   (per hour)
GCP [7]       1164         0.19         0.04
EC2 [5]       1013         0.2          0.035
AliCloud [1]  866          0.312        0.036
Azure [10]    1312         0.26         > 0.06

Figure 3: Three pricing ways to pay for an instance with ⟨4 vCPU, 16 GB memory⟩ in GCP, EC2, AliCloud and Azure (in USD).

Is it possible to deploy a data analytics system using Spot instances and guarantee reliable job executions at the reduced cost? To answer this question, it is required to tolerate job-level and task-level failures due to the terminations of unreliable instances, where the former relates to the failure of job managers. Because both job managers and tasks run in unified containers, the two types of failures have the same opportunity to occur. Unfortunately, while task-level fault tolerance is implemented in current systems [20, 50, 30], these systems do not tolerate job manager failures except by restarting them.

We propose HOUTU, which extends the current system functionalities to implement job-level fault tolerance, and applies our dynamic scheduling schemes in resource management and task assignment in the decentralized architecture. We experimentally verify the effectiveness and efficiency of HOUTU.

3 System Overview

We first provide an overview of the HOUTU architecture and a job's lifecycle in HOUTU. Next we elaborate how a job acts in normal operation and in failure recovery.

3.1 HOUTU architecture

As shown in Fig. 4(a), HOUTU is of the decentralized architecture, composed of several autonomous systems deployed in geographically distributed data centers. Each system has the ability to run conventional single-cluster jobs, and also to cooperate with the others to support geo-distributed job executions; we focus on the latter in this work.

As stated in §1, HOUTU is a general system that efficiently handles geo-distributed job executions without requiring any job description changes. We assume that users have the knowledge of how data is distributed across several data centers. The users specify the data locations "as if" in a centralized architecture, except with different "masters". In the SQL example of Fig. 5, three tables are in different data centers, and the job derives statistical information from all these tables. HOUTU will automatically support the execution of a job described in this way.

Next, we present a job's lifecycle, following the steps of Fig. 4(a).

The job submission and job manager generations: Suppose a user submits the DAG job to a chosen master (step 0). We use DAG to refer to a directed acyclic graph, where each vertex represents a task and edges encode input-output dependencies. The master resolves the job description and generates corresponding job managers for it (step 1). It directly generates a primary job manager (pJM) within its own cluster (step 2). For the remote resources, the master forwards the job description to the remote masters (step 2a) and tells them to generate a semi-active job manager (sJM) for it (step 2b) (§3.2).

Resource requests and task executions: Further, to obtain compute resources (task executors³), the job managers independently send requests to their local masters (step 3). The masters (job schedulers) schedule resources to the JMs according to their scheduling invariants, and signal this by returning to the JMs containers that grant access to such resources (step 4). After that, the JMs send the tasks to run in the containers (step 5). As the DAG job is dynamically unfolded and the resource requests of the job usually are not satisfied in a single wave, JMs often repeat steps 3–5 multiple times.

We leave the design of how job managers request resources without further characteristics of the unfolding DAG, and how to schedule tasks within a data center and between data centers, to §4.

³Unless otherwise specified, we use the terms "container" and "executor" interchangeably.
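Looking back at §2.3, the Spot market rule (bid-capped instances are terminated whenever the recalculated market price exceeds their bid) can be sketched in a few lines. All names, fields and prices here are illustrative, not any provider's actual API:

```python
# Toy model of the Spot market rule from Sec. 2.3: every instance carries the
# user's maximum bid, and when the provider recalculates the market price,
# instances whose bid falls below it are terminated.
def reprice(instances, market_price):
    """Split running instances into (kept, terminated) at a new market price."""
    kept = [i for i in instances if i["max_bid"] >= market_price]
    terminated = [i for i in instances if i["max_bid"] < market_price]
    return kept, terminated

running = [{"id": "i-1", "max_bid": 0.040},
           {"id": "i-2", "max_bid": 0.050},
           {"id": "i-3", "max_bid": 0.036}]
kept, lost = reprice(running, market_price=0.045)
print([i["id"] for i in lost])  # ['i-1', 'i-3']
```

Under this rule a whole container, including a job manager running inside it, can disappear at any repricing, which is exactly the failure mode HOUTU's job-level fault tolerance targets.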

[Figure 4 diagrams: (a) three data centers, each with a master (job scheduler), workers, and a pJM or sJM running Af + Parades; steps 0–5 (submission, JM generation, resource requests, container grants, task launches) flow among them over the WAN. (b) a job's intermediate info: jobId, stageId, partitionList, executorList, taskMap, with utilization feedback, task assignment and task generation links among the JMs.]

(a) HOUTU’s architecture and a job’s lifecycle. (b) A job’s logical topology in HOUTU.

Figure 4: HOUTU’s architecture, a job’s lifecycle and a job’s logical topology in it.
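The submission steps (0–2b) of the lifecycle in Fig. 4(a) can be sketched as follows. The classes and method names are illustrative only, not HOUTU's actual API: a chosen master resolves the submitted job, creates the primary JM in its own cluster, and forwards the description so that each remote master creates a semi-active JM.

```python
# Minimal sketch of steps 0-2b in Fig. 4(a): one pJM in the submitting
# data center, one sJM in every remote data center.
class Master:
    def __init__(self, region):
        self.region = region
        self.job_managers = []                 # JMs hosted in this data center

    def submit(self, job_desc, remote_masters):
        # steps 0-1: receive and resolve the job description
        self.job_managers.append(("pJM", job_desc["jobId"]))   # step 2
        for master in remote_masters:                          # step 2a
            master.generate_sjm(job_desc)                      # step 2b

    def generate_sjm(self, job_desc):
        self.job_managers.append(("sJM", job_desc["jobId"]))

masters = [Master("NC-3"), Master("NC-5"), Master("EC-1")]
masters[0].submit({"jobId": "job-42"}, masters[1:])
print([m.job_managers[0][0] for m in masters])  # ['pJM', 'sJM', 'sJM']
```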

lineitem = textFile("hdfs://master1/tpch/lineitem.tbl")
orders = textFile("hdfs://master2/tpch/orders.tbl")
customer = textFile("hdfs://master3/tpch/customer.tbl")
Shipping_Priority = sql("SELECT L_ORDERKEY,
    SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
    O_ORDERDATE, O_SHIPPRIORITY FROM customer C JOIN orders O
    ON C.C_CUSTKEY = O.O_CUSTKEY JOIN lineitem L ON
    L.L_ORDERKEY = O.O_ORDERKEY WHERE O_ORDERDATE < '...'
    AND L_SHIPDATE > '...' GROUP BY L_ORDERKEY, O_ORDERDATE,
    O_SHIPPRIORITY ORDER BY REVENUE DESC, O_ORDERDATE LIMIT ...")

Figure 5: Pseudo-code of a job's description in HOUTU.

3.2 A job in HOUTU

We show how the primary job manager (pJM) and semi-active job managers (sJMs) coordinate to execute a job.

3.2.1 Normal operation

In normal operation, there is exactly one pJM and all of the other JMs are sJMs in a job. When the master (to which a user submits the job) forwards the request (step 2a in Fig. 4(a)), it includes the job description. Thus, all the generated job managers hold the DAG structure of the job.

When the job managers are in position, the pJM first decides the initial task assignment among the job managers, and then the job managers cooperatively schedule and generate tasks to execute (dotted line in Fig. 4(b)) (§4.3). We call each sJM semi-active because it is not totally under the control of the primary job manager: it has the freedom to determine the task assignment in its own cluster (dashed line), to coordinate with other sJMs about task assignment, and to manage its compute resources according to resource utilization feedback (purple solid line) (§4.2).

After a task completes its computation on a partition of data, it reports to its job manager (pJM or sJM) about the output partition location. The job manager collects the partition location information in its cluster, modifies the partitionList, and then notifies the other job managers to keep the partitionList consistent. Besides the partitionList, HOUTU includes jobId, stageId, executorList (the available executors from all data centers, including the JMs and their associated roles), and taskMap (which task should be assigned by which JM) in a job's intermediate information (Fig. 4(b)). HOUTU maintains a replica of the intermediate information in each data center.

Since the job managers operate synchronously, when the job completes, all of them will proactively release their resources, as well as themselves, to their data centers.

3.2.2 Failure recovery

As stated in §2.3, we focus in this work on the recovery of job-level failures, i.e., the failures of job managers.

When a semi-active job manager fails because of the unpredictable termination of its host, the primary job manager will notice it and then send a request through its local master to generate a new sJM in the remote data center (like steps 2a and 2b in Fig. 4(a)). This sJM starts with the original job description and the intermediate information in its cluster, and recognises its role (as semi-active). It inherits the containers belonging to the previous sJM, and continues to operate as normal.

If the primary fails, the semi-active job managers will elect a new primary using the consensus protocol (in Zookeeper). The new pJM updates and propagates the intermediate information about its role change. Next, the new primary continues the process of the job, operates as normal, and generates a new semi-active job manager to

[Figure 6 diagram: a DAG job J_i with Stage 0 (available; tasks 1–3), Stage 1 (unavailable; tasks 4–5) and Stage 2 (unavailable; task 6); running and waiting tasks are distinguished, and dotted circles mark the sub-jobs.]
Figure 6: Example of a running DAG job J_i.

Algorithm 1 Af (applied by each job manager)
 1: procedure AF(d(q−1), a(q−1), u(q−1))
 2:   if q = 1 then
 3:     d(q) ← 1
 4:   else if u(q−1) < δ and no waiting tasks then
 5:     d(q) ← d(q−1) / ρ          // inefficient
 6:   else if d(q−1) > a(q−1) then
 7:     d(q) ← d(q−1)              // efficient and deprived
 8:   else
 9:     d(q) ← d(q−1) · ρ          // efficient and satisfied
10:   return d(q)
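For concreteness, here is a direct Python transcription of Algorithm 1. δ and ρ are configuration parameters; the values used below (δ = 0.5, ρ = 2) are illustrative, not prescribed by the paper:

```python
# A direct transcription of Algorithm 1 (Af). delta is the utilization
# threshold and rho > 1 the adjustment factor.
def af(q, d_prev, a_prev, u_prev, has_waiting_tasks, delta, rho):
    """Return the container desire d(q) for period q."""
    if q == 1:
        return 1
    if u_prev < delta and not has_waiting_tasks:
        return d_prev / rho      # inefficient: shrink the desire
    if d_prev > a_prev:
        return d_prev            # efficient and deprived: hold steady
    return d_prev * rho          # efficient and satisfied: grow

# One step of each branch, with delta = 0.5 and rho = 2:
print(af(2, 8, 8, 0.2, False, 0.5, 2))  # 4.0 (inefficient)
print(af(2, 8, 4, 0.9, True, 0.5, 2))   # 8   (efficient, deprived)
print(af(2, 8, 8, 0.9, True, 0.5, 2))   # 16  (efficient, satisfied)
```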

replace the failed pJM as above.

We assume that the job managers do not all fail simultaneously. Actually, it is of particular interest to study the problem of guaranteeing deterministic reliability of a job execution in the mixed environment (with reliable and unreliable instances) while minimizing the total monetary cost; however, this is out of the scope of this work.

4 Design

In this section, we first provide the problem statement of optimizing the efficiency of jobs (§4.1). Next, we show how the JMs use resource utilization feedback to manage resources (step 3 in Fig. 4(a)) (§4.2). Then, we describe how the JMs schedule tasks within and between data centers (step 5 in Fig. 4(a)) (§4.3). Finally, we theoretically analyze the performance of the algorithms (§4.4).

4.1 Problem statement

Resources in HOUTU are scheduled in terms of containers (corresponding to some fixed amount of memory and cores). Instead of assuming a priori knowledge of the complete characteristics of jobs [22, 23, 24], which restricts the types of workloads and incurs offline overheads, we rely on only partial a priori knowledge of a job (the knowledge from available stages). In the example of Fig. 6, only the task information (including the input data locations, fine-grained resource requirements, and processing times) in Stage 0 is currently known, while the task information in Stage 1 and Stage 2 is currently unknown because these stages have not been released yet. We consider that tasks in the same stage have identical characteristics, which conforms to the fact in practical systems, as they perform the same computations on different partitions of the input.

In the scenario where multiple DAG jobs arrive and leave online, we are interested in minimizing the makespan and the average job response time.⁴ Please refer to Appendix A for the problem formulation. HOUTU applies Af (Adaptive feedback algorithm) for each JM to manage resources, and Parades in each JM to schedule tasks, which we demonstrate in the next two subsections, respectively.

4.2 Resource management using Af

Resources in a data center are scheduled by the job scheduler to sub-jobs between periods, each of equal time length L. We use sub-job to denote the collection of tasks of a job that are executed in the same data center (and handled by the same job manager). Fig. 6 shows an example of a sub-job partition with dotted-line circles.

For each sub-job J_i^j of job J_i, its job manager (pJM or sJM) enforces Af (Algorithm 1) to determine the desired number of containers for the next period, d(J_i^j, q), based on its last period desire d(J_i^j, q−1), the last period allocation a(J_i^j, q−1), the last period resource utilization u(J_i^j, q−1), and its waiting tasks.⁵ u(J_i^j, q−1) corresponds to the average resource utilization in period q−1, and can be measured by the monitoring mechanism.

Consistent with [11], we classify the period q−1 as satisfied versus deprived. Af compares the job's allocation a(J_i^j, q−1) with its desire d(J_i^j, q−1). The period is satisfied if a(J_i^j, q−1) = d(J_i^j, q−1), as the sub-job J_i^j acquires as many containers as it requests from the job scheduler. Otherwise, a(J_i^j, q−1) < d(J_i^j, q−1), and the period is deprived. The classification of a period as efficient versus inefficient is more involved than that in [11]. Af uses a parameter δ as well as the presence of waiting task information. The period is inefficient if the utilization u(J_i^j, q−1) < δ and there is no waiting task in period q−1. Otherwise the period is efficient.

⁴The response time of a job is the duration from its release to its completion.
⁵We omit J_i^j in Algorithm 1 for brevity.
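The two independent classifications above can be written as a tiny helper; this is a sketch, and δ = 0.5 below is an illustrative threshold:

```python
# Per-period classification used by Af: allocation vs. desire gives
# satisfied/deprived; utilization plus waiting-task presence gives
# efficient/inefficient.
def classify_period(a, d, u, has_waiting_tasks, delta=0.5):
    supply = "satisfied" if a == d else "deprived"   # a(q-1) <= d(q-1) always
    usage = ("inefficient" if (u < delta and not has_waiting_tasks)
             else "efficient")
    return supply, usage

print(classify_period(a=8, d=8, u=0.9, has_waiting_tasks=False))
# ('satisfied', 'efficient')
print(classify_period(a=4, d=8, u=0.2, has_waiting_tasks=True))
# ('deprived', 'efficient')  -- waiting tasks keep the period efficient
print(classify_period(a=8, d=8, u=0.2, has_waiting_tasks=False))
# ('satisfied', 'inefficient')
```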

If the period is inefficient, Af decreases the desire by a factor ρ. If the period is efficient but deprived, it means that the sub-job efficiently used the resources it was allocated, but Af had requested more containers than the sub-job actually received from the job scheduler; it maintains the same desire in period q. If the period is efficient and satisfied, the sub-job efficiently used the resources that Af requested; Af assumes that the sub-job can use more containers and increases its desire by a factor ρ. In all three cases, Af allows Parades to assign multiple tasks to execute in a container.

Algorithm 2 Parades (applied by each job manager)
 1: procedure ONUPDATE(n, δ, τ)
 2:   for each t_ij, increase t_ij.wait by the time since the last event UPDATE; cont ← true
 3:   if no waiting task then
 4:     t = STEAL(n)
 5:     tlist.add(t); n.free −= t.r; cont ← false
 6:   while n.free > 0 and cont do
 7:     cont ← false
 8:     if there is a node-local task t_ij on n and n.free ≥ t_ij.r then
 9:       t = t_ij
10:     else if there is a rack-local task t_ik on n and n.free ≥ t_ik.r and t_ik.wait ≥ τ · t_ik.p then
11:       t = t_ik
12:     else if there is a task t_il with t_il.wait ≥ 2τ · t_il.p and n.free ≥ 1 − δ then
13:       t = t_il
14:     tlist.add(t); n.free −= t.r; cont ← true
      return tlist
15: procedure ONRECEIVESTEAL(n)
16:   return ONUPDATE(n, δ, τ)
17: procedure STEAL(n)
18:   for each job manager of the same job do
19:     tlist.add(SENDSTEAL(n))
20:   return tlist

Table 1: Explanations of notations.

Notation        Explanation
d(J_i^j, q)     J_i^j's desire for period q
a(J_i^j, q)     J_i^j's allocation for period q
u(J_i^j, q)     J_i^j's resource utilization in period q
δ               the utilization threshold parameter
ρ               the resource adjustment parameter
τ               the task waiting time parameter

4.3 Task assignment using Parades

Please refer to Table 1 for the notations involved in our algorithms and their explanations.

Initial task assignment (applied by the primary job manager): When a new stage of a DAG job becomes available, the primary job manager initially decides the fraction of tasks to place on each data center to be proportional to the amount of data on the data center.

Parades (Parameterized delay scheduling with work stealing) is applied by each job manager after the initial assignment. Parades is based on the framework of the original delay scheduling algorithm [49], but extends it from two perspectives. When a container updates its status, the algorithm adds the waiting time for each waiting task of the sub-job since the last UPDATE event happened (line 2), followed by the task assignment procedure. Delay scheduling sets the waiting time thresholds for tasks as an invariant, while we modify the threshold for each task to be linearly dependent on its processing time p (which is known), under the intuition that "long" tasks can tolerate a longer waiting time to acquire their preferred resources. On the other hand, if there is no waiting task, the job manager becomes a "thief" and tries to steal tasks from other "victim" job managers in the same job (line 4). Each victim job manager will handle this steal as an UPDATE event (line 16).

Parades operates as follows in the task assignment procedure. It first checks whether there is a node-local task waiting, meaning the container n is on the same server as the task prefers. Assigning the task to its preferred server, which contains its input data, helps in reducing data transmission over the network. We use n.free to denote the free resources on container n. Secondly, the algorithm checks whether there is a rack-local task for n, i.e., the container shares the same rack as the task's preferred server. If the task has waited for more than the threshold time (τ · t_ik.p), and the container has enough free resources, we assign the task to the container. Finally, when a task has waited for a long enough time (2τ · t_il.p) and n.free ≥ 1 − δ, we always allow the task to be assigned if possible. When n.free ≥ 1 − δ, the utilized resource of the container n is less than δ. We assume t_il.r + δ ≤ 1 for each i, l, as the upper bound for task resource requirements.

4.4 Analysis of Af + Parades

To prove that the proposed algorithms guarantee efficient performance for online jobs, we fix the job scheduler employed in each data center to be the fair scheduler [4, 8], perhaps the most widely used job scheduler in both industry and academia. Once there is a free resource, the fair scheduler always allocates it to the job which currently occupies the smallest fraction of the cluster resources, unless the job's requests have been satisfied.

We prove the following theorem about the competitive ratio of the makespan. Specifically, we extend the very recent result [52] about the efficiency of jobs scheduled by the Af algorithm and the parameterized delay (Pdelay) scheduling algorithm in a single data center.⁶ Please see Appendix B for the proof sketch. We are still working on the provable efficiency of the average job response time.

Theorem 1 When multiple geo-distributed DAG jobs arrive online and each data center applies the fair job scheduler, the makespan of these jobs under Af + Parades is O(1)-competitive.

Workloads       Input datasets
                small     medium    large
WordCount       200 MB    1 GB      5 GB
TPC-H           --        1 GB      10 GB
Iterative ML    170 MB    1 GB      ~3 GB
PageRank        150 MB    1 GB      ~6 GB

Figure 7: Input sizes for four workloads.

5 Implementation

We implement HOUTU using Apache Spark [50], Hadoop YARN [42] and Apache Zookeeper [29] as building blocks. We make the following major changes:

Monitor mechanism: We estimate the dynamic resource availability on each container by adding a resource monitor process (in the nodeManager component of YARN). The monitor process reads resource usages (e.g., CPU, memory) from OS counters and reports them to its job manager. Each job manager and its per-container monitors interact in an asynchronous manner to avoid overheads.

Parameterized delay scheduling: Based on the fact that tasks in a stage have similar resource requirements, we estimate the requirements using the measured statistics from the first few executions of tasks in a stage. We continue to refine these estimations as more tasks have been measured. We estimate task processing time as the average processing time of all finished tasks in the same stage. We modify the original implementation of delay scheduling in Spark to take τ as a parameter read from the configuration file.

How do the job managers coordinate with each other? As stated in §3.2.1, we use Zookeeper to synchronize the JMs in the same job. Specifically, when the pJM determines the initial task assignment, it writes this information to taskMap (Fig. 4(b)). After a task completes, it reports to its job manager about the output location, which will then propagate the location information in partitionList among the other job managers.

How does a new job manager inherit the containers belonging to the failed one? We modify the YARN master to allow granting tokens to the newly generated job manager with the same jobId as the failed one. Then, the new job manager can use these tokens to access the corresponding containers.

Af: We continuously (per second) measure the container utilizations in a sub-job J_i^j in a period q of length L, and calculate the average at the end of the period. We acquire the desired number of containers for the next period, d(q+1), by Af (§4.2).
If d(q + 1) ≥ d(q), we directly up- monitor process (in nodeManager component of YARN). date the desire and push this new desire to the job sched- The monitor process reads resource usages (e.g., CPU, uler. When d(q+1) < d(q), the problem is involved, since memory) from OS counters and reports them to its job we should decide Which containers should be killed, and manager. Each job manager and its per-container mon- when the kill should be performed? We aggressively kill itors interact in an asynchronous manner to avoid over- the several containers which firstly become free. We add heads. the control information through the job manager in Spark Parameterized delay scheduling: Based on the fact to negotiate resources with YARN master. that tasks in a stage have similar resource requirements, we estimate the requirements using the measured statis- tics from the first few executions of tasks in a stage. We 6 Experimental Evaluation continue to refine these estimations as more tasks have been measured. We estimate task processing time as the In this section, we first present the methodology in con- average processing time of all finished tasks in the same ducting our experiments (§6.1). Then, we show the ef- stage. We modify the original implementation of delay ficient job performance HOUTU guarantees in both nor- scheduling in Spark to take τ as a parameter read from mal operation and changeable environment (§6.2), and the configuration file. analyze the monetary costs of HOUTU and other deploy- How the job managers coordinate with each other? ments when running the same workloads (§6.3). Finally, As stated in §3.2.1, we use Zookeeper to synchronize JMs we verify the ability of recovering of job manager fail- in the same job. Specifically, when the pJM determines ures in HOUTU (§6.4) and measure the overheads that it the initial task assignment, it writes this information to introduces in detail (§6.5). taskMap (Fig. 4(b)). 
sJMs will notice this modification and begin their task assignment procedures using Parades 6.1 Methodology (§4.3). If a job manager successfully steals a task from another, it also needs to modify the corresponding item in Testbed: We deploy HOUTU to 20 machines spread across four AliCloud regions as we show in §2.2. In each 6We extend the Pdelay algorithm in [52] with work stealing, which can only accelerate task assignment and at most delay tasks as much as region, we start five machines of type n4.xlarge or in Pdelay algorithm. n1.large, depending on their availability. Both types

Both types of instances have 4 CPU cores, 8 GB RAM and run 64-bit Ubuntu 16.04. In each region, we choose one On-demand instance as the master and four Spot instances as workers.

Workload: We use workloads including WordCount, the TPC-H benchmark, Iterative machine learning (ML) and PageRank for our evaluation. For each workload, the variation in input sizes is based on real workloads from Yahoo! and Facebook, in scale with our deployment (Fig. 7). For the job distribution, we set 46%, 40% and 14% of jobs to have small, medium and large input sizes respectively, which also conforms to realistic job distributions [42]. For the TPC-H benchmark, we place two tables in each data center, while for the other three workloads, we evenly partition the input across the four data centers.

Baselines: We evaluate the effectiveness of HOUTU by comparing four main types of systems/deployments: (1) the centralized Spark on YARN system with built-in static resource scheduling (cent-stat); (2) the centralized Spark on YARN system with state-of-the-art dynamic resource scheduling (cent-dyna) [52]; (3) HOUTU, the decentralized architecture with Af + Parades; and (4) the decentralized architecture with static resource scheduling (decent-stat).

Metrics: We use the average job response time and the makespan to evaluate the effectiveness of jobs which arrive in an online manner. We also care about the monetary cost of running these jobs, compared with the deployment using only reliable (On-demand) instances. Finally, we are interested in job response times when facing failures.

6.2 Job performance

We use the workloads stated before, and set the job submission times following an exponential distribution with a mean interval of 60 seconds.

Fig. 8 shows the job performance in our four different deployments. First, we find that HOUTU has performance approximating that of the centralized architecture with the state-of-the-art dynamic scheduling mechanism. This approximation is due to the fact that we allow the job managers in a job to share resources across data centers by work stealing (Parades). Second, when compared with the decentralized architecture with the static scheduling algorithm, HOUTU achieves a 29% improvement in terms of average job response time, and a 31% improvement in terms of makespan. This gain comes from the use of the adaptive scheduling mechanism based on utilization feedback (Af).

Figure 8: Job performance in four deployments. (a) CDF of job response time. (b) Average job response time and makespan:

                Average JRT   Makespan
  Houtu          290 s         387 s
  cent-dyna      295 s         417 s
  decent-stat    377 s         561 s
  cent-stat      488 s        1109 s

To further demonstrate that HOUTU guarantees efficient job performance in a changeable environment, we intentionally inject workloads to consume the spare resources in data centers and observe how a job reacts to this variation. Fig. 9 shows the cumulative running tasks of a job execution in different scenarios and mechanisms. In Fig. 9(a), a job executes normally and completes at time 115. In Fig. 9(b) and Fig. 9(c), we inject workloads into the three data centers NC-3, EC-1 and SC-1 to use up almost all spare resources in these data centers at time 100 after the job submission. Fig. 9(b) demonstrates that the work stealing mechanism ensures that the job manager in NC-5 gradually steals tasks from the other, resource-tense data centers as new stages of the DAG job become available. Without work stealing, however, the pJM assigns tasks only according to the data distribution (initial assignment), which leads the sJMs in the resource-tense data centers to queue the tasks to be executed. As shown in Fig. 9(c), the queueing delays the job. The job response times in the last two scenarios are 183 and 333 seconds, respectively.

6.3 Cost analysis

In this subsection, we configure the centralized architecture with On-demand instances, while we keep the decentralized architecture configuration with Spot instances (except the masters). We use the same workloads as in Fig. 8, and calculate the monetary costs in the different deployments. Costs are divided into machine cost and data transfer cost across different data centers.7

Footnote 7: In AliCloud [1], the price of data transfer across data centers is 0.13 $/GB, while it is free to transfer data within a data center.

Fig. 10 shows the two types of costs in the different deployments, normalized by the cost in cent-stat. First, we observe that HOUTU is very effective in reducing the machine cost of running geo-distributed jobs, which is 90% cheaper than the cost in cent-stat. Not surprisingly, the major cost saving comes from the use of Spot instances. Second, HOUTU incurs less data transfer than the centralized architectures. This is because centralized architectures do not distinguish machines in different data centers, while HOUTU differentiates task assignment within a data center and between data centers.
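To make the pricing model above concrete, the following back-of-the-envelope sketch computes the billed transfer cost under the AliCloud prices cited in footnote 7 (0.13 $/GB between data centers, free within one). The traffic volumes are hypothetical illustrations we made up, not measurements from our experiments.

```python
# Cross-data-center transfer cost under the cited AliCloud pricing:
# 0.13 $/GB between data centers, free within a data center. The traffic
# volumes below are hypothetical illustrations, not measured values.

CROSS_DC_PRICE_PER_GB = 0.13

def transfer_cost(cross_dc_gb, intra_dc_gb):
    """Only cross-data-center bytes are billed; intra-DC traffic is free."""
    return cross_dc_gb * CROSS_DC_PRICE_PER_GB + intra_dc_gb * 0.0

# A locality-aware assignment converts billed cross-DC traffic into free
# intra-DC traffic, so the same total volume costs less:
naive = transfer_cost(cross_dc_gb=100, intra_dc_gb=20)          # about 13.0 $
locality_aware = transfer_cost(cross_dc_gb=80, intra_dc_gb=40)  # about 10.4 $
```

This is why differentiating intra- and inter-data-center task assignment translates directly into a lower communication bill.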

Figure 9: The cumulative running tasks of a job execution in different scenarios and mechanisms. (a) The normal job execution in HOUTU. (b) The normal job execution in HOUTU when we inject workloads. (c) The job execution without the work stealing mechanism when we inject workloads.

A task steal happens only after the thief job manager finishes its own tasks, and HOUTU saves about 20% communication cost compared with cent-stat.

Figure 10: Normalized cost of different deployments.

                machine cost   communication cost
  Houtu           0.09           0.84
  cent-dyna       0.37           0.77
  decent-stat     0.15           0.79

6.4 Failure recovery

One of the major design considerations of HOUTU is to ensure that a job can recover from a failure due to the unreliable environment and continue to execute. To understand the effectiveness of our proposed mechanism, we respectively run a job in HOUTU and in cent-dyna, and manually terminate the host (VM) where the job manager resides at 70 seconds after the job submission.

We count the number of containers belonging to the job. Fig. 11 shows the process of a job execution experiencing a job manager failure. In Fig. 11(a), we kill the VM which hosts the pJM, and after 10 seconds we see a new sJM replace the failed pJM.8 The sJM then inherits the old containers and continues its work. In Fig. 11(b), we kill an sJM and see a similar process. The interval time is always lower than 20 seconds in our extensive experiments. The job response times in the two scenarios are 147 seconds and 154 seconds, respectively. However, in the centralized architecture, the failure of a job manager leads to the resubmission of the job, which wastes the previous computations. The job response time is 299 seconds in this last case, which is significantly longer than the times of the two executions in HOUTU.

Footnote 8: A new pJM is first elected, and then the new pJM tells the master where the former pJM resided to generate a new sJM (§3.2.2).

6.5 Overhead

We measure the overheads of HOUTU from two perspectives. First, we collect the intermediate information of jobs from the four workloads on large input datasets, and measure its size during the executions. Fig. 12(a) plots the 25th percentile, median and 75th percentile sizes for each workload in the corresponding box. We find that the average sizes for the four workloads are 43.1 KB, 43.4 KB, 37.8 KB and 30.8 KB, respectively, which are small enough to use Zookeeper to keep them consistent.

Second, we measure the time costs of the mechanisms that HOUTU introduces. The Af overhead consists only of the desire update operation and is negligible. Compared to the default implementation in YARN, we add the monitoring mechanism in each container process, which has moderate overhead. As a job manager incurs transmission delay in work stealing, we measure the average delay of the steal message transmissions to be 163.5 ms across different system loads, which is also acceptable.

7 Related Work

Wide-area data analytics: Prior work establishes the emerging problem of analyzing globally-generated data in data analytics systems [45, 28, 38, 44, 46, 27, 19]. These works show promising WAN bandwidth reduction and job performance improvement. SWAG [28] adjusts the order of jobs across data centers to reduce job completion times. Iridium [38] optimizes data and task placement to reduce query response times and WAN usage.

Figure 11: The job failure recovery in HOUTU and the centralized architecture. (a) Recovery of a pJM failure in HOUTU. (b) Recovery of an sJM failure in HOUTU. (c) A job manager failure in cent-dyna.
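The recovery flow that Fig. 11 depicts (elect a surviving sJM as the new pJM, then start a replacement JM that inherits the dead JM's containers through its jobId) can be mimicked in a toy, single-process form. This is only an illustration of the protocol's bookkeeping: all class and field names here are our own invention, while the real system performs the election with Zookeeper and the container inheritance with YARN tokens.

```python
# Toy, single-process mimic of the JM failure recovery shown in Fig. 11;
# every name here is illustrative, not HOUTU's actual code.

class Job:
    def __init__(self, job_id, regions):
        self.job_id = job_id
        # one job manager (JM) per data center; the first acts as primary (pJM)
        self.jms = {r: {"role": "sJM", "containers": set()} for r in regions}
        self.jms[regions[0]]["role"] = "pJM"

    def fail(self, region):
        """Kill the JM in `region` and run the recovery protocol."""
        dead = self.jms.pop(region)
        if dead["role"] == "pJM":
            # a surviving sJM is elected as the new pJM ...
            new_primary = next(iter(self.jms))
            self.jms[new_primary]["role"] = "pJM"
        # ... and a replacement JM starts in the failed region; presenting
        # the same jobId lets it inherit the dead JM's containers.
        self.jms[region] = {"role": "sJM", "containers": dead["containers"]}

job = Job("job-42", ["NC-3", "NC-5", "EC-1", "SC-1"])
job.jms["NC-3"]["containers"] = {"c1", "c2"}
job.fail("NC-3")  # kill the host of the pJM, as in Fig. 11(a)
```

After `fail("NC-3")`, exactly one JM holds the pJM role and the replacement JM in NC-3 still owns containers c1 and c2, mirroring the "inherit containers" step of Fig. 11(a).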

Figure 12: Intermediate information size and time cost. (a) Intermediate info. size of the four workloads on large input. (b) Time cost of different mechanisms in HOUTU.

Clarinet [44] pushes wide-area network awareness into the query planner, and selects a query execution plan before the query begins. These proposed solutions work in the centralized architecture and assume the WAN bandwidth to be constant; however, this may not conform to the practical scenario, per our argument in §2. In contrast, we focus on the design of a decentralized geo-distributed data analytics architecture, and require no modification to the current job descriptions.

Scheduling in a single data analytics system: Data-locality is a primary goal when scheduling tasks within a job. Delay scheduling [49], Quincy [31] and Corral [33] try to improve the locality of individual tasks by scheduling them close to their input data. The fairness-quality tradeoff between multiple jobs is another goal. Carbyne [23] and Graphene [24] improve cluster utilization and performance while allowing a little unfairness among jobs. Most of these systems rely on a priori knowledge of DAG job characteristics. Instead, we use utilization feedback to dynamically adjust scheduling decisions with only partial a priori knowledge. Further, we extend this mechanism to the context of geo-distributed data centers and allow job managers to cooperate in scheduling tasks.

Fault-tolerance for jobs in data analytics: In current systems like MapReduce [20], Dryad [30] and Spark [50], each job manager tracks the execution time of every task, and reschedules a copy of a task when its execution time exceeds a threshold (a straggler). At the level of jobs, the cluster (job scheduler) will resubmit a job when its reports are absent for a while; the resubmitted job starts its execution from scratch, wasting the previous computations. In the related field of grid computing, fault-tolerance of jobs is achieved by checkpointing, i.e., collecting process context states [37, 35]. The process context states are stored periodically on stable storage, which is not applicable in data analytics systems due to the overhead of each job manager collecting the real-time task states and then persisting them. We instead include the output location of each task (partitionList), rather than its context state, in the intermediate information, which is effective and incurs acceptable overheads, as evidenced in our experiments.

8 Conclusion

We introduce HOUTU, a new data analytics system that is designed to support analytics jobs on globally-generated data with respect to the practical constraints, without any need to change the jobs. HOUTU provides a job manager for a job in each data center, ensuring the reliability of its execution. We present the strategy for each JM to independently manage resources without complete a priori knowledge of jobs, and the mechanism for each JM to assign tasks, which can adjust its decisions according to the changeable environment. We experimentally verify HOUTU's functionalities to guarantee reliable and efficient job executions. We conclude that HOUTU is a practical and effective system to enable constrained globally-distributed analytics jobs.
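Before the formal treatment in the appendices, the two mechanisms at HOUTU's core, Af's utilization-feedback desire update (§4.2) and Parades' three-level locality check with waiting thresholds τ·p and 2τ·p (§4.3), can be sketched in a few lines of Python. This is a minimal illustration under our reading of the algorithms: the record types, function names and parameter values are our own, not HOUTU's implementation.

```python
from dataclasses import dataclass

RHO = 2.0    # multiplicative desire factor (rho); illustrative value
DELTA = 0.1  # free-resource slack (delta); illustrative value
TAU = 1.5    # delay-scheduling threshold parameter (tau); illustrative value

@dataclass
class Task:
    r: float            # peak resource requirement, normalized to [0, 1]
    p: float            # estimated processing time
    pref_server: str    # server holding the task's input data
    pref_rack: str
    wait_since: float   # time at which the task entered the waiting state

@dataclass
class Container:
    server: str
    rack: str
    free: float         # free resources, normalized to [0, 1]

def update_desire(desire, efficient, deprived):
    """Af: set next period's container desire from last period's feedback."""
    if not efficient:             # inefficient period: shrink the desire
        return max(1, int(desire / RHO))
    if deprived:                  # efficient but received less than requested
        return desire
    return int(desire * RHO)      # efficient and satisfied: grow the desire

def assign(container, waiting_tasks, now):
    """Parades: pick a waiting task for a free container, preferring locality."""
    for t in waiting_tasks:       # 1) node-local task
        if t.pref_server == container.server and container.free >= t.r:
            return t
    for t in waiting_tasks:       # 2) rack-local task that waited > tau * p
        if (t.pref_rack == container.rack
                and now - t.wait_since > TAU * t.p
                and container.free >= t.r):
            return t
    for t in waiting_tasks:       # 3) any task that waited > 2*tau*p,
        if (now - t.wait_since > 2 * TAU * t.p   # on a nearly idle container
                and container.free >= 1 - DELTA):
            return t
    return None                   # nothing assignable; a steal may follow
```

For example, with ρ = 2 an efficient and satisfied period doubles the desire, while an inefficient one halves it; a remote task is only placed on a container once it has waited past 2τ times its estimated processing time.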

A Problem Formulation

Suppose there is a set of jobs J = {J_1, J_2, ..., J_|J|} to be scheduled on a set of containers P = {P_1, P_2, ..., P_|P|} from all data centers. These containers are different, since they reside in different servers (and different data centers) containing different input data for the jobs. Time is discretized into scheduling periods of equal length L, where each period q covers the interval [L·q, L·(q+1) − 1]. L is a configurable system parameter.

We model a job J_i as a DAG. Each vertex of the DAG represents a task and each edge represents a dependency between two tasks. Each task in a job prefers a unique subset of P, as the containers in the subset store the input data for the task. For each task t_ij ∈ J_i, we denote by t_ij.r its peak resource requirement. We assume 0 ≤ t_ij.r ≤ 1, normalized by the container capacity. We also assume t_ij.r ≥ θ, where θ > 0, i.e., a task must consume some amount of resources. We associate t_ij.p with the processing time of task t_ij. Furthermore, the work of a job J_i is defined as T_1(J_i) = Σ_{t_ij ∈ J_i} t_ij.r · t_ij.p. The release time r(J_i) is the time at which the job J_i is submitted. A task is said to be in the waiting state when its predecessor tasks have all completed and it has itself not been scheduled yet.

The sub-job J_i^j of J_i corresponds to the collection of its tasks executing in data center j. Each job manager handles the task executions of a sub-job in the job manager's data center. The job managers of a job are oblivious to the further characteristics of the unfolding DAG.

Definition 1 The makespan of a job set J is the time taken to complete all the jobs in J, that is, T(J) = max_{J_i ∈ J} T(J_i), where T(J_i) is the completion time of job J_i.

Definition 2 The average response time of a job set J is given by (1/|J|) · Σ_{J_i ∈ J} (T(J_i) − r(J_i)).

The job scheduler of a data center and a job manager interact as follows. The job scheduler reallocates resources between scheduling periods. At the end of period q−1, the job manager of sub-job J_i^j determines its desire d(J_i^j, q), which is the number of containers J_i^j wants for period q. Collecting the desires from all running sub-jobs, the job scheduler decides an allocation a(J_i^j, q) for each sub-job J_i^j (with a(J_i^j, q) ≤ d(J_i^j, q)). Once a job is allocated containers, the job manager further schedules its tasks, and the allocation does not change during the period.

Given a job set J and the container set P from all data centers, we seek a combination of a job scheduler (how to allocate resources to sub-jobs) and job managers within each job (how to request resources and how to assign tasks to the given resources) which minimizes the makespan and the average response time of J, while satisfying the task locality preferences.

B Efficiency of the Makespan

We first state a theorem from [52] and then use it to prove the efficiency of the makespan in the context of geo-distributed DAG jobs running in multiple data centers.

Theorem 2 [52] In a single data center with container set P which applies the fair job scheduler, when the DAG jobs J running in it each apply the Adaptive feedback algorithm to request resources and parameterized delay scheduling to assign tasks, the makespan of these jobs satisfies

  T(J) ≤ (2/(1−δ) + (1+ρ)/δ + 2τ/θ) · T_1(J)/|P| + L·log_ρ|P| + 2L.

Assume there are k data centers, the sub-job set executing in data center i is J^i, and there are |P_i| containers in data center i. In the J_i example of Fig. 6, J_i^1 ∈ J^1, J_i^2 ∈ J^2, and J_i^3 ∈ J^3. Denote c_i = (2/(1−δ) + (1+ρ)/δ + 2τ/θ) / |P_i| and d_i = L·log_ρ|P_i| + 2L. By directly applying Theorem 2, we have, for each i,

  T(J^i) ≤ c_i · T_1(J^i) + d_i.

Summing these up, we have

  Σ_{i=1}^k T(J^i) ≤ c_max · Σ_{i=1}^k T_1(J^i) + Σ_{i=1}^k d_i
                   = c_max · T_1(J) + Σ_{i=1}^k d_i
                   = c_max·|P| · (T_1(J)/|P|) + Σ_{i=1}^k d_i,

in which c_max is the maximum of the c_i and the first equality is due to the definition of work. According to the fact that Σ_{i=1}^k T(J^i) ≥ T(J), we have

  T(J) ≤ c_max·|P| · (T_1(J)/|P|) + Σ_{i=1}^k d_i.

Since T_1(J)/|P| is a lower bound on the optimal makespan T*(J) due to [16], and the number of available containers in all data centers |P| is constant once the system is well configured, we complete the proof of Theorem 1.

References

[1] Alibaba Cloud – Pricing. https://ecs-buy.aliyun.com/price.
[2] Amazon Web Services. Amazon Virtual Private Cloud. https://aws.amazon.com/vpc/.
[3] Amazon Web Services. AWS Identity and Access Management (IAM). https://aws.amazon.com/iam/.
[4] Apache YARN – Fair Scheduler. http://tinyurl.com/j9vzsl9.
[5] Cloud Services Pricing – Amazon Web Services (AWS). https://aws.amazon.com/pricing/.
[6] European Commission press release. Commission to pursue role as honest broker in future global negotiations on internet governance. https://tinyurl.com/k8xcvy4.
[7] Google Cloud Platform – Price List. https://tinyurl.com/y9nyq68e.
[8] Max-min fairness. https://tinyurl.com/krkdmho.
[9] Microsoft Azure – Pricing Overview. https://tinyurl.com/zk5kvla.
[10] Personal Data (Privacy) Ordinance. https://tinyurl.com/86l7dqg, 2009.
[11] K. Agrawal, Y. He, W. J. Hsu, and C. E. Leiserson. Adaptive scheduling with parallelism feedback. In PPoPP, 2006.
[12] Alibaba. Alibaba Cloud available regions. https://tinyurl.com/y84lfshq.
[13] Amazon. AWS global infrastructure. https://tinyurl.com/px6dzut.
[14] G. J. Annas et al. HIPAA regulations – a new era of medical-record privacy? New England Journal of Medicine, 348(15), 2003.
[15] A. Armando, R. Carbone, L. Compagna, J. Cuellar, and L. Tobarra. Formal analysis of SAML 2.0 web browser single sign-on: Breaking the SAML-based single sign-on for Google Apps. In FMSE, 2008.
[16] T. Brecht, X. Deng, and N. Gu. Competitive dynamic multiprocessor allocation for parallel applications. Parallel Processing Letters, 07(01), 1997.
[17] A. Burtsev, D. Johnson, J. Kunz, E. Eide, and J. Van der Merwe. CapNet: Security and least authority in a capability-enabled cloud. In SoCC, 2017.
[18] R. Buyya, J. Broberg, and A. M. Goscinski. Cloud Computing: Principles and Paradigms, chapter 24: Legal Issues in Cloud Computing. 2010.
[19] L. Chen, S. Liu, B. Li, and B. Li. Scheduling jobs across geo-distributed datacenters with max-min fairness. In INFOCOM, 2017.
[20] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[21] Google. Google data center locations. https://tinyurl.com/n7nthda.
[22] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM, 2014.
[23] R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-resource clusters. In OSDI, 2016.
[24] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In OSDI, 2016.
[25] A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, S. Wu, S. G. Dhoot, A. R. Kumar, A. Agiwal, S. Bhansali, M. Hong, J. Cameron, M. Siddiqi, D. Jones, J. Shute, A. Gubarev, S. Venkataraman, and D. Agrawal. Mesa: Geo-replicated, near real-time, scalable data warehousing. In PVLDB, 2014.
[26] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer. Achieving high utilization with software-driven WAN. In SIGCOMM, 2013.
[27] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu. Gaia: Geo-distributed machine learning approaching LAN speeds. In NSDI, 2017.
[28] C.-C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In SoCC, 2015.
[29] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, 2010.
[30] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[31] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP, 2009.
[32] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat. B4: Experience with a globally-deployed software defined WAN. In SIGCOMM, 2013.
[33] V. Jalaparti, P. Bodik, I. Menache, S. Rao, K. Makarychev, and M. Caesar. Network-aware scheduling for data-parallel jobs: Plan when you can. In SIGCOMM, 2015.
[34] K. Kloudas, R. Rodrigues, N. M. Preguiça, and M. Mamede. PIXIDA: Optimizing data parallel jobs in wide-area data analytics. In PVLDB, 2015.
[35] H. Lee, K. Chung, S. Chin, J. Lee, D. Lee, S. Park, and H. Yu. A resource management and fault tolerance services in grid computing. J. Parallel Distrib. Comput., 65(11), 2005.
[36] Microsoft. Azure regions. https://tinyurl.com/y98skbet.
[37] A. Nguyen-Tuong and A. S. Grimshaw. Integrating fault-tolerance techniques in grid applications. University of Virginia, 2000.
[38] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica. Low latency geo-distributed data analytics. In SIGCOMM, 2015.
[39] A. Rabkin, M. Arye, S. Sen, V. S. Pai, and M. J. Freedman. Aggregation and degradation in JetStream: Streaming analytics in the wide area. In NSDI, 2014.
[40] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds. In CCS, 2009.
[41] M. Rost and K. Bock. Privacy by design and the new protection goals. In DuD, 2017.
[42] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In SoCC, 2013.
[43] S. Venkataraman, Z. Yang, M. Franklin, B. Recht, and I. Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. In NSDI, 2016.
[44] R. Viswanathan, G. Ananthanarayanan, and A. Akella. CLARINET: WAN-aware optimization for analytics queries. In OSDI, 2016.
[45] A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese. Global analytics in the face of bandwidth and regulatory constraints. In NSDI, 2015.
[46] H. Wang and B. Li. Lube: Mitigating bottlenecks in wide area data analytics. In HotCloud, 2017.
[47] R. Wolski, J. Brevik, R. Chard, and K. Chard. Probabilistic guarantees of execution duration for Amazon spot instances. In SC, 2017.
[48] Z. Wu, M. Butkiewicz, D. Perkins, E. Katz-Bassett, and H. V. Madhyastha. SPANStore: Cost-effective geo-replicated storage spanning multiple cloud services. In SOSP, 2013.
[49] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
[50] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
[51] Y. Zhai, L. Yin, J. Chase, T. Ristenpart, and M. Swift. CQSTR: Securing cross-tenant applications with cloud containers. In SoCC, 2016.
[52] X. Zhang, Z. Qian, S. Zhang, X. Li, X. Wang, and S. Lu. COBRA: Toward provably efficient semi-clairvoyant scheduling in data analytics systems. In INFOCOM, 2018.
[53] L. Zheng, C. Joe-Wong, C. W. Tan, M. Chiang, and X. Wang. How to bid the cloud. In SIGCOMM, 2015.