
Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing

Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou (Microsoft)
Zhengping Qian, Ming Wu, Lidong Zhou (Microsoft Research)

In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), October 6-8, 2014, Broomfield, CO.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin

Abstract

Efficiently scheduling data-parallel computation jobs over cloud-scale computing clusters is critical for job performance, system throughput, and resource utilization. It is becoming even more challenging with growing cluster sizes and more complex workloads with diverse characteristics. This paper presents Apollo, a highly scalable and coordinated scheduling framework, which has been deployed on production clusters at Microsoft to schedule thousands of computations with millions of tasks efficiently and effectively on tens of thousands of machines daily. The framework performs scheduling decisions in a distributed manner, utilizing global cluster information via a loosely coordinated mechanism. Each scheduling decision considers future resource availability and optimizes various performance and system factors together in a single unified model. Apollo is robust, with means to cope with unexpected system dynamics, and can take advantage of idle system resources gracefully while supplying guaranteed resources when needed.

1 Introduction

MapReduce-like systems [7, 15] make data-parallel computations easy to program and allow running jobs that process terabytes of data on large clusters of commodity hardware. Each data-processing job consists of a number of tasks with inter-task dependencies that describe execution order. A task is a basic unit of computation that is scheduled to execute on a server.

Efficient scheduling, which tracks task dependencies and assigns tasks to servers for execution when ready, is critical to the overall system performance and service quality. The growing popularity and diversity of data-parallel computation makes scheduling increasingly challenging. For example, the production clusters that we use for data-parallel computations are growing in size, each with over 20,000 servers. A growing community of thousands of users from many different organizations submit jobs to the clusters every day, resulting in a peak rate of tens of thousands of scheduling requests per second. The submitted jobs are diverse in nature, with a variety of characteristics in terms of data volume to process, complexity of computation logic, degree of parallelism, and resource requirements. A scheduler must (i) scale to make tens of thousands of scheduling decisions per second on a cluster with tens of thousands of servers; (ii) maintain fair sharing of resources among different users and groups; and (iii) make high-quality scheduling decisions that take into account factors such as data locality, job characteristics, and server load, to minimize job latencies while utilizing the resources in a cluster fully.

This paper presents the Apollo scheduling framework, which has been fully deployed to schedule jobs in cloud-scale production clusters at Microsoft, serving a variety of on-line services. Scheduling billions of tasks daily efficiently and effectively, Apollo addresses the scheduling challenges in large-scale clusters with the following technical contributions.

• To balance scalability and scheduling quality, Apollo adopts a distributed and (loosely) coordinated scheduling framework, in which independent scheduling decisions are made in an optimistic and coordinated manner by incorporating synchronized cluster utilization information. Such a design strikes the right balance: it avoids the suboptimal (and often conflicting) decisions by independent schedulers of a completely decentralized architecture, while removing the scalability bottleneck and single point of failure of a centralized design.

• To achieve high-quality scheduling decisions, Apollo schedules each task on a server that minimizes the task completion time. The estimation model incorporates a variety of factors and allows a scheduler to perform a weighted decision, rather than solely considering data locality or server load. The data-parallel nature of computation allows Apollo to refine the estimates of task execution time continuously based on observed runtime statistics from similar tasks during job execution.

• To supply individual schedulers with cluster information, Apollo introduces a lightweight hardware-independent mechanism to advertise load on servers. When combined with a local task queue on each server, the mechanism provides a near-future view of resource availability on all the servers, which is used by the schedulers in decision making.

• To cope with unexpected cluster dynamics, suboptimal estimations, and other abnormal runtime behaviors, which are facts of life in large-scale clusters, Apollo is made robust through a series of correction mechanisms that dynamically adjust and rectify suboptimal decisions at runtime. We present a unique deferred correction mechanism that allows resolving conflicts between independent schedulers only if they have a significant impact, and show that such an approach works well in practice.

• To drive high cluster utilization while maintaining low job latencies, Apollo introduces opportunistic scheduling, which effectively creates two classes of tasks: regular tasks and opportunistic tasks. Apollo ensures low latency for regular tasks, while using the opportunistic tasks for high utilization to fill in the slack left by regular tasks. Apollo further uses a token-based mechanism to manage capacity and to avoid overloading the system by limiting the total number of regular tasks.

• To ensure no service disruption or performance regression when we roll out Apollo to replace a previous scheduler deployed in production, we designed Apollo to support staged rollout to production clusters and validation at scale. Those constraints have received little attention in research, but are nevertheless crucial in practice and we share our experiences in achieving those demanding goals.

We observe that Apollo schedules over 20,000 tasks per second in a production cluster with over 20,000 machines. It also delivers high scheduling quality, with 95% of regular tasks experiencing a queuing delay of under 1 second, while achieving consistently high (over 80%) and balanced CPU utilization across the cluster.

The rest of the paper is organized as follows. Section 2 presents a high-level overview of our infrastructure and the query workload that Apollo supports. Section 3 presents an architectural overview, explains the coordinated scheduling in detail, and describes the correction mechanisms. We describe our engineering experiences in developing and deploying Apollo to our cloud infrastructure in Section 4. A thorough evaluation is presented in Section 5. We review related work in Section 6 and conclude in Section 7.

2 Scheduling at Production Scale

Apollo serves as the underlying scheduling framework for Microsoft's distributed computation platform, which supports large-scale data analysis for a variety of business needs. A typical cluster contains tens of thousands of commodity servers, interconnected by an oversubscribed network. A distributed file system stores data in partitions that are distributed and replicated, similar to GFS [12] and HDFS [3]. All computation jobs are written using SCOPE [32], a SQL-like high-level scripting language, augmented with user-defined processing logic. The optimizer transforms a job into a physical execution plan represented as a directed acyclic graph (DAG), with tasks, each representing a basic computation unit, as vertices and the data flows between tasks as edges. Tasks that perform the same computation on different partitions of the same inputs are logically grouped together in stages. The number of tasks per stage indicates the degree of parallelism (DOP).

Figure 1: A sample SCOPE execution graph.

Figure 1 shows a sample execution graph in SCOPE, greatly simplified from an important production job that collects user click information and derives insights for advertisement effectiveness. Conceptually, the job performs a join between an unstructured user log and a structured input that is pre-partitioned by the join key. The plan first partitions the unstructured input using the partitioning scheme from the other input: stages S1 and S2 respectively partition the data and aggregate each partition. A partitioned join is then performed in stage S4. The DOP is set to 312 for S1 based on the input data volume, set to 10 for S5, and set to 150 for S2, S3, and S4.
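The stage/DAG structure described above can be made concrete with a small sketch. The class names, fields, and the exact edges below are our own illustration (kept consistent with the S1-S5 description), not Apollo's or SCOPE's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Stage:
    name: str                                         # e.g., "S1"
    dop: int                                          # degree of parallelism = tasks in the stage
    inputs: List[str] = field(default_factory=list)   # upstream stage names

@dataclass
class Job:
    stages: Dict[str, Stage] = field(default_factory=dict)

    def add(self, stage: Stage) -> None:
        self.stages[stage.name] = stage

    def ready_stages(self, completed: Set[str]) -> List[str]:
        # A stage becomes schedulable once all upstream stages have completed.
        return [s.name for s in self.stages.values()
                if s.name not in completed
                and all(up in completed for up in s.inputs)]

# The simplified job of Figure 1: S1 partitions the unstructured input, S2 aggregates
# each partition, S3 reads the pre-partitioned structured input, S4 joins, S5 consumes.
job = Job()
for stage in (Stage("S1", dop=312),
              Stage("S2", dop=150, inputs=["S1"]),
              Stage("S3", dop=150),
              Stage("S4", dop=150, inputs=["S2", "S3"]),
              Stage("S5", dop=10, inputs=["S4"])):
    job.add(stage)
print(job.ready_stages(completed=set()))              # ['S1', 'S3'] -- the leaf stages
```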

Figure 2: Heterogeneous workload. (a) Distribution of stage DOP. (b) Distribution of bytes processed per job. (c) Distribution of task runtime. (d) Rate of scheduling requests.

2.1 Capacity Management and Tokens

In order to ensure fairness and predictability of performance, the system uses a token-based mechanism to allocate capacity to jobs. Each token is defined as the right to execute a regular task, consuming up to a predefined amount of CPU and memory, on a machine in the cluster. For example, if a job has an allocation of 100 tokens, this means it can run 100 tasks, each of which consumes up to a predefined maximum amount of CPU and memory.

A virtual cluster is created for each user group for security and resource sharing reasons. Each virtual cluster is assigned a certain amount of capacity in terms of number of tokens, and maintains a queue of all submitted jobs. A job submission contains the target virtual cluster, the necessary credentials, and the required number of tokens for execution. Virtual cluster management utilizes various admission control policies and decides how and when to assign its allocated tokens to submitted jobs. Jobs that do not get their required tokens will be queued in the virtual cluster. The system also supports a wide range of capabilities, such as job priorities, suspension, upgrades, and cancellations.

Once a job starts to execute with required tokens, it is a scheduler's responsibility to execute its optimized execution plan by assigning tasks to servers while respecting token allocation, enforcing task dependencies, and providing fault tolerance.

2.2 The Essence of Job Scheduling

Scheduling a job involves the following responsibilities: (i) ready list: maintain a list of tasks that are ready to be scheduled; initially, the list includes those leaf tasks that operate on the original inputs (e.g., tasks in stages S1 and S3 in Figure 1); (ii) task priority: sort the ready list appropriately; (iii) capacity management: manage the capacity assigned to the job and decide when to schedule a task based on the capacity management policy; (iv) task scheduling: decide where to schedule a task and dispatch it to the selected server; (v) failure recovery: monitor scheduled tasks, initiate recovery actions when tasks fail, and mark the job failed if recovery is not possible; (vi) task completion: when a task completes, check its dependent tasks in the execution graph and move them to the ready list if all the tasks that they depend on have completed; (vii) job completion: repeat the whole process until all tasks in the job are completed.
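The responsibilities (i)-(vii) above translate naturally into a per-job scheduling loop. The following is a minimal sketch under assumed helper callbacks (pick_server, dispatch, wait_for_updates, recover) and a simplified token model; it is illustrative only and not Apollo's production job manager.

```python
from typing import Callable

def run_job(job,                          # exposes .tasks; each task has .priority,
                                          # .dependencies and .dependents
            capacity_tokens: int,         # max concurrently running regular tasks
            pick_server: Callable,        # (task) -> server
            dispatch: Callable,           # (task, server) -> None
            wait_for_updates: Callable,   # (running set) -> iterable of (task, status)
            recover: Callable) -> bool:   # (task) -> bool, True if the task can be retried
    """Minimal single-job scheduling loop covering responsibilities (i)-(vii)."""
    ready = [t for t in job.tasks if not t.dependencies]          # (i) ready list
    running, completed = set(), set()

    while len(completed) < len(job.tasks):                        # (vii) repeat until done
        ready.sort(key=lambda t: t.priority, reverse=True)        # (ii) task priority
        while ready and len(running) < capacity_tokens:           # (iii) capacity management
            task = ready.pop(0)
            dispatch(task, pick_server(task))                     # (iv) task scheduling
            running.add(task)

        for task, status in wait_for_updates(running):
            running.discard(task)
            if status == "failed":                                # (v) failure recovery
                if not recover(task):
                    return False                                  # job failed
                ready.append(task)                                # retry after recovery
            else:                                                 # (vi) task completion
                completed.add(task)
                ready.extend(d for d in task.dependents
                             if all(p in completed for p in d.dependencies)
                             and d not in completed and d not in running
                             and d not in ready)
    return True
```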
2.3 Production Workload Characteristics

The characteristics of our target production workloads greatly influence the Apollo design. Our computation clusters run more than 100,000 jobs on a daily basis. At any point in time, there are hundreds of jobs running concurrently. Those jobs vary drastically in almost every dimension, to meet a wide range of business scenarios and requirements. For example, large jobs process terabytes to petabytes of data, contain sophisticated business logic with a few dozen complex joins, aggregations, and user-defined functions, have hundreds of stages, contain over a million tasks in the execution plan, and may take hours to finish. On the other hand, small jobs process gigabytes of data and can finish in seconds. In SCOPE, different jobs are also assigned different amounts of resources. The workload evolves constantly as the supporting business changes over time. This workload diversity poses tremendous challenges for the underlying scheduling framework to deal with efficiently and effectively. We describe several job characteristics in our production environment to illustrate the diverse and dynamic nature of the computation workload.

In SCOPE, the DOP for a stage in a job is chosen based on the amount of data to process and the complexity of each computation. Even within a single job, the DOP changes for different stages as the data volume changes over the job's lifetime.

Figure 2(a) shows the distribution of stage DOP in our production environment. It varies from a single digit to a few tens of thousands. Almost 40% of stages have a DOP of less than 100, accounting for less than 2% of the total workload. More than 98% of tasks are part of stages with DOP of more than 100. These large stage sizes allow a scheduler to draw statistics from some tasks to infer behavior of other tasks in the same stage, which Apollo leverages to make informed and better decisions. Job sizes vary widely from a single vertex to millions of vertices per job graph. As illustrated in Figure 2(b), the amount of data processed per job ranges from gigabytes to tens of petabytes. Task execution times range from less than 100ms to a few hours, as shown in Figure 2(c). 50% of tasks run for less than 10 seconds and are sensitive to scheduling latency. Some tasks require external files such as executables, configurations, and lookup tables for their execution, thus incurring initialization costs. In some cases, such external files required for execution are bigger than the actual input to be processed, which means locality should be based on where those files are cached rather than input location. Collectively, such a large number of jobs create a high scheduling-request rate, with peaks above 100,000 requests per second, as shown in Figure 2(d).

The very dynamic and diverse characteristics of our computing workloads and cluster environments impose several challenges for the scheduling framework, including scalability, efficiency, robustness, and resource usage balance. The Apollo scheduling framework has been designed and shown to address these challenges over large production clusters at Microsoft.

3 The Apollo Framework

To support the scale and scheduling rate required for the production workload, Apollo adopts a distributed and coordinated architecture, where the scheduling of each job is performed independently and incorporates aggregated global cluster load information.

3.1 Architectural Overview

Figure 3: Apollo architectural overview.

Figure 3 provides an overview of Apollo's architecture. A Job Manager (JM), also called a scheduler, is assigned to manage the life cycle of each job. The global cluster load information used by each JM is provided through the cooperation of two additional entities in the Apollo framework: a Resource Monitor (RM) for each cluster and a Process Node (PN) on each server. A PN process running on each server is responsible for managing the local resources on that server and performing local scheduling, while the RM aggregates load information from PNs across the cluster continuously, providing a global view of the cluster status for each JM to make informed scheduling decisions.

While treated as a single logical entity, the RM can be implemented physically in different configurations with different mechanisms, as it essentially addresses the well-studied problem of monitoring dynamically changing state of a collection of distributed resources at a large scale. For example, it can use a tree hierarchy [20] or a directory service with an eventually consistent gossip protocol [8, 26]. Apollo's architecture can accommodate any of such configurations. We implemented the RM in a master-slave configuration using Paxos [18]. The RM is never on the performance critical path: Apollo can continue to make scheduling decisions (at a degraded quality) even when the RM is temporarily unavailable, for example, during a transient master-slave switch due to a machine failure. In addition, once a task is scheduled to a PN, the JM obtains up-to-date load information directly from the PN via frequent status updates.

To better predict resource utilization in the near future and to optimize scheduling quality, each PN maintains a local queue of tasks assigned to the server and advertises its future resource availability in the form of a wait-time matrix inferred from the queue (Section 3.2). Apollo thereby adopts an estimation-based approach to making task scheduling decisions. Specifically, Apollo considers the wait-time matrices, aggregated by the RM, together with the individual characteristics of tasks to be scheduled, such as the location of inputs (Section 3.3). However, cluster dynamics pose many challenges in practice; for example, the wait-time matrices might be stale, estimates might be suboptimal, and the cluster environment might sometimes be unpredictable. Apollo therefore incorporates correction mechanisms for robustness and dynamically adjusts scheduling decisions at runtime (Section 3.4). Finally, there is an inherent tension between providing guaranteed resources to jobs (e.g., to ensure SLAs) and achieving high cluster utilization, because both the load on a cluster and the resource needs of a job fluctuate constantly. Apollo resolves this tension through opportunistic scheduling, which creates second-class tasks to use idle resources (Section 3.5).
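To make the division of labor concrete, the sketch below models the messages implied by this architecture: each PN periodically reports its wait-time matrix with a timestamp, and the RM keeps the freshest heartbeat per server for JMs to consult. All type and field names are our own illustration, not Apollo's wire format.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Expected queuing delay (seconds) on one server, keyed by (cpu_cores, memory_gb).
WaitTimeMatrix = Dict[Tuple[int, int], float]

@dataclass
class PnHeartbeat:
    server: str
    timestamp: float            # when the PN computed the matrix
    matrix: WaitTimeMatrix

@dataclass
class ResourceMonitorView:
    """Aggregated (and possibly stale) cluster view that the RM exposes to each JM."""
    latest: Dict[str, PnHeartbeat] = field(default_factory=dict)

    def update(self, hb: PnHeartbeat) -> None:
        cur = self.latest.get(hb.server)
        if cur is None or hb.timestamp > cur.timestamp:   # keep only the freshest heartbeat
            self.latest[hb.server] = hb

    def age(self, server: str) -> float:
        """How stale a server's matrix is; a JM can lower its confidence accordingly."""
        return time.time() - self.latest[server].timestamp
```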

3.2 PN Queue and Wait-Time Matrix

The PN on each server manages a queue of tasks assigned to the server in order to provide projections on future resource availability. When a JM schedules a task on a server, it sends a task-creation request with (i) fine-grained resource requirement (CPU cores and memory), (ii) estimated runtime, and (iii) a list of files required to run the task (e.g., executables and configuration files). Once a task creation request is received, the PN copies the required files to a local directory using a peer-to-peer data transfer framework combined with a local cache. The PN monitors CPU and memory usage, considers the resource requirements of tasks in the queue, and executes them when the capacity is available. It maximizes resource utilization by executing as many tasks as possible, subject to the CPU and memory requirements of individual tasks. The PN queue is mostly FIFO, but can be reordered. For example, a later task requiring a smaller amount of resources can fill a gap without affecting the expected start time of others.

The use of task queues enables schedulers to dispatch tasks to the PNs proactively based on future resource availability, instead of based on instantaneous availability. As illustrated later in Section 3.3, Apollo considers task wait time (for sufficient resources to be available) and other task characteristics holistically to optimize task scheduling. The use of task queues also masks task initialization cost by copying the files before execution capacity is available, thereby avoiding idle gaps between tasks. Such a direct-dispatch mechanism provides the efficiency needed particularly by small tasks, for which any protocol to negotiate incurs significant overhead.

The PN also provides feedback to the JM to help improve accuracy of task runtime estimation. Initially, the JM uses conservative estimates provided by the query optimizer [32] based on the operators in a task and the amount of data to be processed. Tasks in the same stage perform the same computation over different datasets. Their runtime characteristics are similar and the statistics from the executions of the earlier tasks can help improve runtime estimates for the later ones. Once a task starts running, the PN monitors its overall resource usage and responds to the corresponding JM's status update requests with information such as memory usage, CPU time, execution time (wall clock time), and I/O throughput. The JM then uses this information along with other factors such as operator characteristics and input size to refine resource usage and predict expected runtime for tasks from the same stage.

The PN further exposes the load on the current server to be aggregated by its RM. Its representation of the load information should ideally convey a projection of the future resource availability, mask the heterogeneity of servers in our data centers (e.g., servers with 64GB of memory and 128GB of memory have different capacities), and be concise enough to allow frequent updates. Apollo's solution is a wait-time matrix, with each cell corresponding to the expected wait time for a task that requires a certain amount of CPU and memory. Figure 3 contains a matrix example: the value 10 in cell (12GB, 4 cores) denotes that a task that needs 4 CPU cores and 12GB of memory has to wait 10 seconds in this PN before it can get its resource quota to execute. The PN maintains a matrix of expected wait times for any hypothetical future task with various resource quotas, based on the currently running and queued tasks. The algorithm simulates local task execution and evaluates how long a future task with a given CPU/memory requirement would wait on this PN to be executed. The PN updates this matrix frequently by considering the actual resource situation and the latest task runtime and resource estimates. Finally, the PN sends this matrix, along with a timestamp, to every JM that has running or queued tasks in this PN. It also sends the matrix to the RM using a heartbeat mechanism.
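The matrix computation described above amounts to replaying the local queue against the server's capacity. The sketch below is a simplified version of that simulation, assuming a two-dimensional (CPU, memory) resource model, strict FIFO order (ignoring the PN's gap-filling reordering), and that every task fits within the server's total capacity; it is not Apollo's actual algorithm.

```python
import heapq
from typing import Dict, List, Tuple

def wait_time_matrix(
    capacity: Tuple[int, int],                # server capacity: (cpu_cores, memory_gb)
    queued: List[Tuple[int, int, float]],     # running + queued tasks: (cpu, mem, est_runtime_s)
    buckets: List[Tuple[int, int]],           # hypothetical (cpu, mem) requirements to report
) -> Dict[Tuple[int, int], float]:
    """Estimate how long a hypothetical future task would wait on this server.

    Replays the current queue in FIFO order against the server's capacity; the
    wait for a bucket is the earliest time that much CPU and memory would be
    simultaneously free after all queued tasks have started.
    """
    def earliest_start(need_cpu: int, need_mem: int) -> float:
        free_cpu, free_mem = capacity
        finishers: List[Tuple[float, int, int]] = []   # min-heap of (finish_time, cpu, mem)
        now = 0.0
        for cpu, mem, runtime in queued:               # start existing tasks in FIFO order
            while free_cpu < cpu or free_mem < mem:    # block until capacity frees up
                now, done_cpu, done_mem = heapq.heappop(finishers)
                free_cpu += done_cpu
                free_mem += done_mem
            free_cpu -= cpu
            free_mem -= mem
            heapq.heappush(finishers, (now + runtime, cpu, mem))
        while free_cpu < need_cpu or free_mem < need_mem:   # then place the hypothetical task
            now, done_cpu, done_mem = heapq.heappop(finishers)
            free_cpu += done_cpu
            free_mem += done_mem
        return now

    return {bucket: earliest_start(*bucket) for bucket in buckets}

# Example: a 12-core, 48GB server with two running tasks and one queued task.
matrix = wait_time_matrix(capacity=(12, 48),
                          queued=[(6, 16, 30.0), (6, 24, 10.0), (4, 12, 20.0)],
                          buckets=[(1, 4), (4, 12), (8, 32)])
```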
3.3 Estimation-Based Scheduling

A JM has to decide which server to schedule a particular task to using the wait-time matrices in the aggregated view provided by the RM and the individual characteristics of the task to be scheduled. Apollo has to consider a variety of (often conflicting) factors that affect the quality of scheduling decisions and does so in a single unified model using an estimation-based approach.

Figure 4: A task scheduling example. (a) Server map. (b) Scheduling alternatives:

    Server   Wait   I/O      Wait+I/O
    A        0s     63.13s   63.13s
    B        0s     63.5s    63.5s
    C        40s    32.50s   72.50s
    D        5s     51.25s   56.25s

We use an example to illustrate the importance of considering various factors all together, as well as the benefit of having a local queue on each server. Figure 4(a) shows a simplified server map with two racks, each with four servers, connected via a hierarchically structured network. Assume data can be read from local disks at 160MB/s, from within the same rack at 100MB/s, and from a different rack at 80MB/s. Consider scheduling a task with two inputs (one 100MB stored on server A and the other 5GB stored on server C) whose runtime is dominated by I/O. Figure 4(b) shows the four scheduling choices, where servers A and B are immediately available, while server C has the best data locality. Yet, D is the optimal choice among those four choices. This can be recognized only when we consider data locality and wait time together. This example also illustrates the value of local queues: without a local queue on each server, any scheduling mechanism that checks for immediate resource availability would settle on the non-optimal choice of server A or B.
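The Wait+I/O column in Figure 4(b) follows directly from the stated bandwidths and wait times. The short calculation below reproduces it, assuming A and B share one rack, C and D share the other, and 5GB is treated as 5,000MB.

```python
LOCAL, SAME_RACK, CROSS_RACK = 160, 100, 80        # read bandwidth in MB/s
rack = {"A": 1, "B": 1, "C": 2, "D": 2}            # assumed rack layout
inputs = {"A": 100, "C": 5000}                     # input location -> size in MB
wait = {"A": 0, "B": 0, "C": 40, "D": 5}           # expected queuing delay per server (s)

def bandwidth(reader: str, source: str) -> int:
    if reader == source:
        return LOCAL
    return SAME_RACK if rack[reader] == rack[source] else CROSS_RACK

for server in "ABCD":
    io = sum(size / bandwidth(server, src) for src, size in inputs.items())
    print(server, wait[server] + io)
# A: 0 + 63.125 ~ 63.13s   B: 0 + 63.5 = 63.5s
# C: 40 + 32.5  = 72.5s    D: 5 + 51.25 = 56.25s  -> D wins, as in Figure 4(b).
```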

Apollo therefore considers various factors holistically and performs scheduling by estimating task completion time. First, we estimate the task completion time if there is no failure, denoted by E_succ, using the formula

    E_succ = I + W + R    (1)

I denotes the initialization time for fetching the needed files for the task, which could be 0 if those files are cached locally. The expected wait time, denoted as W, comes from a lookup in the wait-time matrix of the target server with the task resource requirement. The task runtime, denoted as R, consists of both I/O time and CPU time. The I/O time is computed as the input size divided by the expected I/O throughput. The I/O could be from local memory, disks, or network at various bandwidths. Overall, estimation of R initially incorporates information from the optimizer and is refined with runtime statistics from the tasks in the same stage.

Second, we consider the probability of task failure to calculate the final completion time estimate, denoted by C. Hardware failures, maintenance, repairs, and software deployments are inevitable in a real large-scale environment. To mitigate their impact, the RM also gathers information on upcoming and past maintenance scheduled on every server. Together, a success probability P_succ is derived and considered to calculate C, as shown below. A penalty constant K_fail, determined empirically, is used to model the cost of server failure on the completion time.

    C = P_succ E_succ + K_fail (1 − P_succ) E_succ    (2)

Task Priorities. Besides completion time estimation, the task-execution order also matters for overall job latency. For example, for the job graph in Figure 1, the tasks in S1 run for 1 minute on average, the tasks in S2 run for an average of 2 minutes, with potential partition-skew induced stragglers running up to 10 minutes, and the tasks in S3 run for 30 seconds on average. As a result, efficiently executing S1 and S2 surely appears more critical to achieve the fastest runtime. Therefore, the scheduler should prioritize resources to S1 and S2 before considering S3. Within S2, the scheduler should start the vertex with the largest input as early as possible, because it is the most likely to be on the critical path of the job.

A static task priority is annotated per stage by the optimizer through analyzing the job DAG and calculating the potential critical path of the job execution. Tasks within a stage are prioritized based on the input size. Apollo schedules tasks and allocates their resources in descending order of their priorities. Since a job contains a finite number of tasks, the starvation of a task with low static priority is impossible, because eventually it will be the only task left to execute, and will be executed.

Stable Matching. For efficiency, Apollo schedules tasks with similar priorities in batches and turns the problem of task scheduling into that of matching between tasks and servers. For each task, we could search all the servers in a cluster for the best match. The approach becomes prohibitively expensive on a large cluster. Instead, Apollo limits the search space for a task to a candidate set of servers, including (i) a set of servers on which inputs of significant sizes are located; (ii) a set of servers in the same rack as those from the first group; and (iii) two servers randomly picked from a set of lightly loaded servers; the list is curated in the background.

Figure 5: A matching example.

A greedy algorithm can be applied for each task sequentially, choosing the server with the earliest estimated completion time at each step. However, the outcome of the greedy algorithm is sensitive to the order in which tasks are matched and often leads to suboptimal decisions. Figure 5 shows an example with a batch of three tasks being scheduled. Assume both Task1 and Task2 read data from server A while Task3 reads from server B, as shown with dotted lines. Each server has capacity to start one task. The greedy matcher first matches Task1 to server A, then matches Task2 to server B because Task1 is already scheduled on A, and finally Task3 to server C, as shown with solid lines. A better match would have assigned Task3 to server B for better locality.

Apollo therefore adopts a variant of the stable matching algorithm [10] to match tasks with servers. For each task in a batch, Apollo finds the server with the earliest estimated completion time as a proposal for that task. A server accepts a proposal from a task if that is the only proposal assigned. A conflict arises when more than one task proposes to the same server. In this case, the server picks the task whose completion time saving is the greatest if it is assigned to the server. The tasks not picked withdraw their proposals and enter the next iteration that tries to match the remaining tasks and servers. The algorithm iterates until all tasks have been assigned, or until it reaches the maximum number of iterations. As shown in Figure 5, the stable matcher matches Task2 to C and Task3 to B, which effectively leverages locality and results in better job performance.
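The proposal-and-accept loop described above can be sketched as follows. The estimator est(task, server), the iteration cap, and the interpretation of "completion time saving" as the gap to a task's next-best alternative are our assumptions; a real implementation would also refresh wait-time projections between iterations.

```python
from typing import Callable, Dict, List

def match_batch(
    tasks: List[str],
    servers: List[str],
    est: Callable[[str, str], float],   # estimated completion time C for (task, server)
    max_iters: int = 10,
) -> Dict[str, str]:
    """Deferred-acceptance matching: each unmatched task proposes to its best free
    server; a contended server keeps the task that would lose the most elsewhere."""
    assignment: Dict[str, str] = {}
    unmatched = list(tasks)
    free_servers = set(servers)

    for _ in range(max_iters):
        if not unmatched or not free_servers:
            break
        proposals: Dict[str, List[str]] = {}
        for t in unmatched:
            best = min(free_servers, key=lambda s: est(t, s))   # propose to best server
            proposals.setdefault(best, []).append(t)

        for server, candidates in proposals.items():
            def saving(t: str) -> float:
                others = [est(t, s) for s in free_servers if s != server]
                alt = min(others) if others else float("inf")
                return alt - est(t, server)          # how much this task loses elsewhere
            winner = max(candidates, key=saving)     # server accepts the best proposal
            assignment[winner] = server

        free_servers -= set(assignment.values())     # accepted servers leave the pool
        unmatched = [t for t in tasks if t not in assignment]

    return assignment                                # leftovers go to a later batch
```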

290 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association The scheduler then sorts all the matched pairs based tion from a PN during task creation, task upgrade, or on their quality to decide the dispatch order. A match is while monitoring its queued tasks, it compares the infor- considered with a higher quality if its task has a lower mation from this PN (and the elapsed wait time so far) to server wait time. The scheduler iterates over the sorted the information that was used to make the scheduling de- matches and dispatches in order until it is out of the allo- cision. The scheduler re-evaluates the decision if (i) the cated capacity. If opportunistic scheduling (Section 3.5) updated expected wait time is significantly higher than is enabled, the scheduler continues to dispatch the tasks the original; (ii) the expected wait time is greater than until the opportunistic scheduling limit. the average among the tasks in the same stage; (iii) the To simplify the matching algorithm for a tradeoff be- elapsed wait time is already greater than the average. The tween efficiency and quality, Apollo assigns only one first condition indicates an underestimated task comple- task to each server in a single batch, because otherwise tion time on the server, while the second/third conditions Apollo has to update the wait-time matrix for a server indicate a low matching quality. Any change in the deci- to take into account the newly assigned task, which in- sion triggers scheduling a duplicate task to a new desired creases the complexity of the algorithm. This simplifica- server. Duplicates are discarded when one task starts. tion might lead to a suboptimal match for a task in a case Randomization. Multiple JMs might schedule tasks to where servers taking on a task in the same batch already the same lightly loaded PN, not aware of each other, remains a better choice. Apollo mitigates the effect in thereby leading to scheduling conflicts. Apollo adds a two ways: if the suboptimal match is of a low quality, small random number to each completion time estima- sorting the matches by quality will cause the dispatch- tion. This random factor helps reduce the chances of ing of this task to be postponed, and later re-evaluated. conflicts by having different JMs choose different, al- Even if the suboptimal match is dispatched, the correc- most equally desirable, servers. The number is typically tion mechanisms described in Section 3.4 are designed proportional to the communication interval between the to catch this case and reschedule the task if needed. JM and the PN, introducing no noticeable impact on the quality of the scheduling decisions. 3.4 Correction Mechanisms Confidence. The aggregated cluster information ob- In Apollo, each JM can schedule tasks independently at tained from the RM contains wait-time matrices of dif- a high frequency, with no delay in the process. This is ferent ages, some of which can be stale. The scheduler critical for scheduling a large number of small tasks in attributes a lower confidence to older wait-time matrices the workload. However, due to the distributed nature of because it is likely that the wait time changed since the the scheduling, several JMs might make competing de- time the matrix was calculated. When the confidence in cisions at the same time. 
In addition, the information the wait-time matrix is low, the scheduler will produce a used, such as wait-time matrices, for scheduling deci- pessimistic estimate by looking up the wait time of a task sions might be stale; the task wait time and runtime might consuming more CPU and memory. be under or overestimated. Apollo has built-in mecha- Straggler Detection. Stragglers are tasks making nisms to address those challenges and dynamically adjust progress at a slower rate than other tasks, and have a crip- scheduling decisions with new information. pling impact on job performances [4]. Apollo’s straggler Unlike previous proposals (e.g., as in Omega [23]) in detection mechanism monitors the rate at which data is which conflicts are immediately handled at the schedul- processed and the rate at which CPU is consumed to pre- ing time, Apollo optimistically defers any correction un- dict the amount of time remaining for each task. Other til after tasks are dispatched to PN queues. This de- tasks in the same stage are used as a baseline for com- sign choice is based on our observation that conflicts are parison. When the time it would take to rerun a task is not always harmful. Two tasks scheduled to the same significantly less than the time it would take to let it com- server simultaneously by different job managers might plete, a duplicate copy is started. They will execute in be able to run concurrently if there are sufficient re- parallel until the first one finishes, or until the duplicate sources for both; tasks that are previously scheduled on copy caught up with the original task. The scheduler also the server might complete soon, releasing the resources monitors the rate of I/O and detects stragglers caused by early enough to make any conflict resolution unneces- slow intermediate inputs. When a task is slow because sary. In those cases, a deferred correction mechanism, of abnormal I/O latencies, it can rerun a copy of the up- made possible with local queues, avoids the unneces- stream task to provide an alternate I/O path. sary overhead associated with eager detection and reso- lution. Correction mechanisms continuously re-evaluate 3.5 Opportunistic Scheduling the scheduling decisions with up-to-date information and Besides achieving high quality scheduling at scale, make appropriate adjustments whenever necessary. Apollo is also designed to operate efficiently and drive Duplicate Scheduling. When a JM gets fresh informa- high cluster utilization. Cluster utilization fluctuates over

USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) 291 time for several reasons. First, not all users submit jobs at of allowing jobs that start later to get a share of the capac- the same time to consume their allocated capacities fully. ity quickly. If a FIFO queue were used for opportunistic A typical example is that the cluster load on weekdays tasks, it could take an arbitrary amount of time for a later is always higher than on weekends. Second, jobs differ task to make its way through the queue, offering unfair in their resource requirements. Even daily jobs with the advantages to tasks that start earlier. same computation logic consume different amount of re- As the degree of parallelism for a job varies in its life- sources as their input data sizes vary. Finally, a complete time, the number of tasks that are ready to be scheduled job typically goes through multiple stages, with different also varies. As a result, a job may not always be able to levels of parallelism and varied resource requirements. dispatch enough opportunistic tasks to use its opportunis- Such load fluctuation on the system provides schedulers tic allowance fully. We further enhance the system by with an opportunity to improve job performance by in- allowing each scheduler to increase the weight of an op- creasing utilization, at the cost of predictability. How portunistic task during random selection, to compensate to judiciously utilize occasionally idle computation re- for the reduction in the number of tasks. For example, a sources without affecting SLAs remains challenging. weight of 2 means a task has twice the probability to be We introduce opportunistic scheduling in Apollo to picked. The total weight of all opportunistic tasks issued gracefully take advantage of idle resources whenever by the job must not exceed its opportunistic allowance. they are available. Tasks can execute either in the reg- ular mode, with sufficient tokens to cover its resource Under an ideal workload, in which tasks run for the consumption, or in the opportunistic mode, without allo- same amount of time and consume the same amount of cated resources. Each scheduler first applies optimistic resources, and in a perfectly balanced cluster, this strat- scheduling to dispatch regular tasks with its allocated to- egy averages to sharing the opportunistic resources pro- kens. If all the tokens are utilized and there are still pend- portionally to the job allocation. However, in reality, ing tasks to be scheduled, opportunistic scheduling may tasks have large variations in runtime and resource re- be applied to dispatch opportunistic tasks. Performance quirements. The number of tasks dispatched per jobs degradation of regular task is prevented by running op- change constantly as tasks complete and new tasks be- portunistic tasks at a lower priority at each server, and come ready. Further, jobs may not have enough paral- any opportunistic task can be preempted or terminated if lelism at all times to use their opportunistic allowance the server is under resource pressure. fully. Designing a fully decentralized mechanism that maintains a strong fairness guarantee in a dynamic envi- One immediate challenge is to prevent one job from ronment remains a challenging topic for future work. consuming all the idle resources unfairly. Apollo uses randomized allocation to achieve probabilistic resource Task Upgrade. Opportunistic tasks are subject to starva- fairness for opportunistic tasks. 
In addition, Apollo up- tion if the host server experiences resource pressure. Fur- grades opportunistic tasks to regular ones when tokens ther, the opportunistic tasks can wait for an unbounded become available and assigned. amount of time in the queue. In order to avoid job starva- Randomized Allocation Mechanism. Ideally, the op- tion, tasks scheduled opportunistically can be upgraded portunistic resources should be shared fairly among jobs, to regular tasks after being assigned a token. Because a proportionally to jobs’ token allocation. This is particu- job requires at least one token to run and there is a finite larly challenging as both the overall cluster load and in- amount of tasks in a job, the scheduler is able to transi- dividual server load fluctuate over time, which makes it tion a starving opportunistic task to a regular task at one difficult, if not impossible, to guarantee absolute instan- point, thus preventing job starvation. taneous fairness. Instead, we focus on avoiding the worst After an opportunistic task is dispatched, the scheduler case of a few jobs consuming all the available capacity of tracks the task in its ready list until it completes. When the cluster and target average fairness. scheduling a regular task, the scheduler considers both Apollo achieves this by setting a maximum oppor- unscheduled tasks and previously scheduled opportunis- tunistic allowance for a given job proportionally to its to- tic tasks that still wait for execution. Each scheduler al- ken allocation. For example, a job with n tokens can have locates its tokens to tasks and performs task matches in a up to cn opportunistic tasks dispatched for some constant descending order of their priorities. It is not required that c. When a PN has spare capacity and the regular queue an opportunistic task be upgraded on the same machine, is empty, the PN picks a random task to execute from but it might be preferable as there is no initialization the opportunistic-task queue, regardless of when it was time. By calculating all costs holistically, the scheduler dispatched. If the chosen task requires more resources favors upgrading opportunistic tasks on machines with than what is available, the randomized selection process fewer regular tasks, while waiting for temporarily heav- continues until there is no more task that can execute. ily loaded machines to drain. This strategy results in a Compared to a FIFO queue, the algorithm has the benefit better utilization of the tokens and better load balancing.
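As a concrete illustration of the randomized selection described here, the sketch below shows how a PN with spare capacity might draw the next opportunistic task: the draw is weighted, a chosen task that does not fit is set aside, and the draw repeats until a task fits or none remains. The names and structure are our own illustration, not Apollo's implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OpportunisticTask:
    job_id: str
    cpu: int
    mem: int
    weight: float = 1.0     # a job that cannot fill its opportunistic allowance can raise this

def pick_opportunistic(queue: List[OpportunisticTask],
                       free_cpu: int, free_mem: int) -> Optional[OpportunisticTask]:
    """Weighted random draw over the opportunistic queue, regardless of arrival order.

    Unlike a FIFO queue, every queued task has a chance proportional to its weight,
    so tasks dispatched later are not starved behind earlier ones.
    """
    candidates = list(queue)
    while candidates:
        chosen = random.choices(candidates, weights=[t.weight for t in candidates])[0]
        if chosen.cpu <= free_cpu and chosen.mem <= free_mem:
            return chosen
        candidates.remove(chosen)   # does not fit the spare capacity; retry among the rest
    return None
```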

292 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association 4 Engineering Experiences attempt is significant and time-consuming at this large Before the development of Apollo, we already had a pro- scale. duction system running at full scale with our previous Apollo’s fully decentralized design allows each sched- generation DAG scheduler, also referred to as the base- uler to run side by side with other schedulers, or other line scheduler. The baseline scheduler schedules jobs versions of Apollo itself. Such engineering agility is crit- without any global cluster load information. When tasks ical and allows us to compare performance across dif- wait in a PN queue, the baseline scheduler systematically ferent schedulers at scale in the same production envi- triggers duplicates to explore other idle PNs and balance ronments. We sampled production jobs and reran them the load. Locality is modeled as a per-task scheduling twice, one with the baseline scheduler and the other with constraint that is relaxed over time, instead of using com- the Apollo scheduler, to compare the job performance. pletion time estimation. In order to minimize other random factors, we initially In our first attempt to improve the baseline scheduler, ran them side by side. However, the approach resulted in we took an approach of a centralized global scheduler, artificial as both jobs read the same which is responsible for scheduling all jobs in a cluster. inputs. Instead, we chose to run the two jobs one after While the global scheduler theoretically oversees all ac- another. Our experiences show that the cluster load is un- tivities in the cluster and could make optimal scheduling likely to change drastically between the two consecutive decisions, we found that it became a performance bottle- runs. We also modified the baseline scheduler to produce neck and its scheduling quality degraded as the numbers accurate estimates for task runtime and resource require- of machines in the cluster and concurrent jobs continued ments using Apollo’s logic so that Apollo could perform to grow. In addition, as described in Section 2, our het- adequately well in this mixed mode environment. This erogeneous workload consists of jobs with diverse char- allowed us to get performance data from early exposure acteristics. Any scheduling delay could have a direct and to large scale environment in the design and experimen- severe impact on small tasks. Such lessons ultimately led tation phase. us to choose Apollo’s distributed and loosely coordinated Migration at Scale. We designed Apollo to replace the scheduling paradigm. previous scheduler in place. On the one hand, this means During the development of Apollo, we had to ensure that protocols had to be carefully designed to be com- the availability of our production system throughout the patible; on the other hand, it also means that we had process, from early prototyping and experimentation to to make sure both schedulers could coexist in the same evaluation and eventual full deployment. This has posed environment, each scheduling a part of the workload on many interesting challenges in validation, migration, de- the same set of machines, without creating interferences. ployment, and operation at full production scale. For example, Apollo judiciously performs opportunis- Validation at Scale. We started by evaluating Apollo tic scheduling with significantly less resource. In iso- in an isolated test cluster at a smaller scale. 
It helps us lation, Apollo issues 98% less duplicates than the base- verify scheduling decisions at every step and track down line scheduler, without losing performance. However, in performance issues quickly. However, the approach has the mixed mode where both schedulers runs, the baseline limitations as the scale of the test cluster is rather lim- scheduler gains an unfair advantage by issuing more du- ited and cannot emulate the dynamic cluster environ- plicates to get work done. We therefore tuned the system ment. Many interesting challenges arise as the scale and during the transition to increase the probability to start complexity of the workload grows. For example, Apollo opportunistic tasks for Apollo-scheduled jobs in order to and the baseline scheduler make different assumptions correct the bias caused by the reduction in the number of around the capacity of the machines. In a test cluster tasks scheduled. with a single job running at a time, the baseline sched- Deployment at Scale. Without any service downtime uler schedules a single task to a server, resulting in much or unavailability, we rolled out Apollo to our users in lower machine utilization compared to Apollo. However, stages and increased user coverage over time until even- this improvement does not translate into gain in produc- tually fully deployed on all clusters. At each stage, we tion environments because the utilization in production closely watched various system metrics, verified job per- clusters is already high. Therefore, it is important for formance, and studied impact on other system compo- us to evaluate Apollo in real production clusters, side by nents before proceeding to the next stage. side with busy production workloads. One interesting observation was that users who had Another lesson we learned from our first failed attempt not yet migrated to Apollo also experienced some per- was to validate design and performance continuously, in- formance improvement during the deployment process. stead of delaying full validation until completion and ex- This was because Apollo avoided scheduling tasks to posing scalability and performance issues when it is too hotspots inside the system, which helped improve job late. This is particularly important as each engineering performance across the cluster, including the ones sched-

USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) 293 uled by the baseline scheduler. When we finally enabled 20000 those jobs with Apollo, the absolute percentage of im- Peak Rate within Hour 15000 provement was less than what we observed initially. Operation at Scale. Even with thoughtful design and 10000 implementation, the dynamic behavior of such a large 5000 scale system continues to pose new challenges, which Task Scheduled/sec 0 motivate us to refine and improve the system contin- Sep Oct Nov Dec Jan Feb uously while operating Apollo at scale. For example, Figure 6: Scheduling rates. Apollo leverages an opportunistic scheduling mechanism complexity of real workload and production environment to increase system utilization by operating tasks either in faithfully in a simulator. To our knowledge, this is the normal-priority regular mode or in lower-priority oppor- first detailed analysis of production schedulers at such a tunistic mode. Initially, this priority is enforced by local large scale with such a complex and diverse workload. but not by the underlying distributed We intend to answer the following questions: (a) How file system. During the early validation, this did not well does Apollo scale and utilize resources in a large- pose a problem. However, as we deployed Apollo, both scale cluster? (b) What is the scheduling quality with CPU and I/O utilization in the cluster increased. The Apollo? (c) How accurate are the estimates on task ex- overall increase in I/O pressure caused a latency impact ecution and queuing time used in Apollo and how much and interfered with tasks even running in regular mode. do the estimates help? (d) How does Apollo cope with This highlights the importance of interpreting task pri- dynamic cluster environment? (e) What is the complex- ority throughout the entire stack to maintain predictable ity of Apollo’s core scheduling algorithm? performance at high utilization. Another example is that we developed a server failure 5.1 Apollo at Scale model by mining historical data and various factors to We first measured the aggregated scheduling rate in a predict the likelihood of possible repeated server failures cluster over time to understand Apollo’s scalability. We in the near future, based on recent observed events. As define scheduling rate as the number of materialized described in Section 3.3, such failure model has a direct scheduling decisions (those resulting a task execution at impact on task completion time estimation and thus in- a PN) made by all the individual schedulers in the clus- fluences scheduling decisions. The model works great ter per second. Figure 6 shows the peak scheduling rates in most cases and helps Apollo avoid scheduling tasks for each hour over the past 6 months, highlighting that to repeated offenders, thereby improving system relia- Apollo can constantly provide a scheduling rate of above bility. However, a rare power transformer outage caused 10,000, reaching up to 20,000 per second in a single clus- failures on a large number of servers all at once. Af- ter. This confirms the need for a distributed scheduling ter the power was restored, the model predicted that they infrastructure, as it would be challenging for any single were likely to fail again and prevented Apollo from using scheduler to make high quality decisions at this rate. It those machines. 
As a result, the recovery of the cluster was unnecessarily slowed down. This indicates the importance of further distinguishing failure types and dealing with them, presumably with different models.

5 Evaluation

Since late 2013, Apollo has been deployed in production clusters at Microsoft, each containing over 20,000 commodity servers. To evaluate Apollo thoroughly, we use a combination of the following methods: (i) we analyze and report various system metrics on large-scale production clusters where Apollo has been deployed and, further, compare the behavior before and after enabling Apollo; (ii) we perform in-depth studies on representative production jobs to highlight our observations at the per-job level; and (iii) we use trace-driven simulations on certain specific points of the Apollo design to compare with alternatives. Whenever possible, we prefer reporting production Apollo behavior, instead of using simulated results, because it is hard, if not infeasible, to model the dynamics of a production system faithfully in simulation.

It is important to note that the scheduling rate is also governed by the capacity of the cluster and the number of concurrent jobs. With its distributed architecture, we expect the scheduling rate to increase with the number of jobs and the size of the cluster.

We then drill down into a period of two weeks and report various aspects of Apollo, without diluting the data over a long period of time. Figure 7(a) shows the number of concurrently running jobs and their tasks in the cluster, while Figure 7(b) shows server CPU utilization in the same period, both sampled every 10 seconds. Apollo is able to run 750 concurrent complex jobs (140,000 concurrent regular tasks) and achieve over 90% CPU utilization when demand is high during the weekdays, closely approaching the capacity of the cluster.

[Figure 7: Apollo in production. (a) Concurrent jobs and tasks; (b) CPU utilization (20th percentile, median, 80th percentile); (c) CPU time breakdown: regular and opportunistic tasks. All panels span Monday through the following Monday.]

To illustrate the distribution of system utilization among all the servers in the cluster, Figure 7(b) shows the median, as well as the 20th and 80th percentiles, of CPU utilization. When demand surges, Apollo makes use of all the available resources and leaves only a 3% gap between the 20th and the 80th percentiles. When demand is low, such as on Sundays, the load is less balanced and machine utilization is mostly correlated with the popularity of the data stored on each machine. The figures show a clear weekly pattern in the workloads: task concurrency decreases during the weekends and recovers to a large volume during the weekdays. However, the dip in system utilization is significantly smaller than the dip in the number of jobs submitted. With opportunistic scheduling, Apollo allows jobs to gracefully exploit idle system resources to achieve better job performance, and it continues to drive system utilization high even with fewer jobs. This is further validated in Figure 7(c), which shows the percentage of CPU hours attributed to regular and opportunistic tasks. During the weekdays, 70% of Apollo's workload comes from regular tasks. The balance shifts during the weekends: more opportunistic tasks get executed on the available resources when there are fewer regular tasks.

We also measured the average task queuing time for all regular tasks to verify that the queuing time remains low despite the high concurrency and system utilization. At the 95th percentile, tasks show less than 1 second of queuing time across the entire cluster. Apollo achieves this by considering data locality, wait time, and other factors holistically when distributing tasks. Table 1 categorizes tasks into three groups and reports the percentage of I/Os they account for: 72% of the tasks are dispatched to servers that require reading inputs remotely, either within or across racks, to avoid wait time. If only data locality were considered, tasks would likely concentrate on a small group of servers that contain hot data.

Table 1: Breakdown of tasks and their I/Os.

  Task location                               % Tasks   % I/Os
  The same server that contains the input       28%       46%
  Within the same rack as the input             56%       47%
  Across rack                                   16%        7%

Summary. Combined, these results show that Apollo is highly scalable, capable of scheduling over 20,000 requests per second, and drives high and balanced system utilization while incurring minimal task queuing time.

5.2 Scheduling Quality

We evaluate the scheduling quality of Apollo in two ways: (i) comparing against the previously implemented baseline scheduler using production workloads and (ii) studying business-critical production jobs and using trace-based simulations to compare quality.

Performing a fair comparison between the baseline scheduler and Apollo in a true production environment with real workloads is challenging. Fortunately, we replaced the baseline scheduler in place with Apollo, allowing us to observe both schedulers in the same cluster with similar workloads. Further, about 40% of the production jobs in the cluster have a recurring pattern, and such recurring jobs account for more than 75% of system resource utilization [5]. We therefore choose two time frames, before and after the Apollo deployment, to compare the performance and speedup of each recurring job, running the baseline scheduler and Apollo respectively. The recurring nature of the workload produced a strong correlation in CPU time between the workloads in the two time frames, as shown in Figure 8(a). Figure 8(b) shows the CDF of the speedups for all recurring jobs; it indicates that about 80% of recurring jobs receive various degrees of performance improvement (up to 3x in speedup) with Apollo.

To understand the reason for the differences, we measured the average task queuing time on each server for every window of 10 minutes. Figure 8(c) shows the standard deviation of the average task queue time across servers, comparing Apollo with the baseline scheduler; it indicates clearly that Apollo achieves much more balanced task queues across servers.

For the second experiment, we present a study of one business-critical production job, which runs every hour. The job consumes logs from a search and advertisement engine and analyzes user click information. Its execution graph consists of around ten thousand tasks, processing a few terabytes of data. The execution graph shown in Figure 1 is a much simplified version of this job.
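For concreteness, the per-job speedups behind Figures 8(a) and 8(b) reduce to pairing each recurring job's runs from the two time frames. The sketch below illustrates one way to compute them; the job-keying scheme and the use of the median latency per job are illustrative assumptions, not details taken from our production pipeline.

    # Minimal sketch of the Figure 8(b) computation (assumed record layout):
    # baseline_runs / apollo_runs map a recurring-job key to the list of its
    # latencies (seconds) observed before / after the Apollo deployment.
    from statistics import median

    def recurring_speedups(baseline_runs, apollo_runs):
        speedups = {}
        for job, baseline_latencies in baseline_runs.items():
            if apollo_runs.get(job):
                speedups[job] = median(baseline_latencies) / median(apollo_runs[job])
        return speedups

    def fraction_improved(speedups, threshold=1.0):
        # Figure 8(b): about 80% of recurring jobs land above 1x with Apollo.
        if not speedups:
            return 0.0
        return sum(1 for s in speedups.values() if s > threshold) / len(speedups)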

[Figure 8: Comparison between Apollo and the baseline scheduler. (a) Comparable workloads: total CPU time over time for the two time frames; (b) Recurring job speedup: CDF of speedup, 0x to 4x; (c) Scheduling balance: standard deviation of wait time (sec) across machines over time, Apollo vs. the baseline.]
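The balance metric plotted in Figure 8(c) can be reproduced from per-task queuing records as sketched below; the record layout and the exact bucketing are illustrative assumptions.

    # Minimal sketch of the Figure 8(c) metric: within each 10-minute window,
    # average the task queuing time per server, then take the standard deviation
    # of those per-server averages across the cluster (lower = more balanced).
    from collections import defaultdict
    from statistics import mean, pstdev

    WINDOW_SEC = 600

    def balance_series(records):
        # records: iterable of (timestamp_sec, server_id, queuing_time_sec)
        buckets = defaultdict(lambda: defaultdict(list))
        for ts, server, qtime in records:
            buckets[int(ts // WINDOW_SEC)][server].append(qtime)
        series = {}
        for window, per_server in sorted(buckets.items()):
            per_server_avg = [mean(qs) for qs in per_server.values()]
            series[window * WINDOW_SEC] = pstdev(per_server_avg)
        return series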

The performance of the job varies by weekday because of periodic fluctuations in the input volume of user clicks. To compare performance, we use one job per day, at the same hour, over a week, and evaluate the scheduling performance of Apollo, the baseline scheduler, and a simulated oracle scheduler, which has zero task wait time, zero scheduling latency, zero task failures, and knows the exact runtime statistics of each task. Further, we use two variants of the oracle scheduler: (i) the oracle with a capacity constraint, which is limited to the same capacity that was allocated to the job when it ran in the production environment, and (ii) the oracle without a capacity constraint, which has access to unlimited capacity, roughly representing the best-case scenario.

Figure 9 shows job performance using Apollo and the baseline scheduler, respectively, and compares them with the oracle scheduler using runtime traces. On average, job latency improved by around 22% with Apollo over the baseline scheduler, and Apollo performed within 4.5% of the oracle scheduler. On some days, Apollo is even better than the oracle scheduler with the capacity constraint, because the job is able to get some extra speedup from opportunistic scheduling, allowing it to use more capacity than the capacity constraint imposed on the oracle scheduler.

[Figure 9: Job latencies with different schedulers. Job runtime (minutes) by day of week (Sunday through Saturday) for Oracle (Capacity Constraint), Oracle (Infinite Capacity), Baseline, and Apollo.]

Summary. Apollo delivers excellent job performance compared with the baseline scheduler, and its scheduling quality is close to the optimal case.

5.3 Evaluating Estimates

Estimating task completion time, as described in Section 3.3, plays an important role in Apollo's scheduling algorithm and thus in job performance. Both task initialization time and I/O time can be calculated when the inputs and server locations are known at runtime.

We first measure the accuracy of the estimated task wait time, obtained by applying task resource estimation to the wait-time matrix. Over 95% of tasks have a wait-time estimation error of less than 1 second. We then measure the CDF of the estimation error for task CPU time, as shown in Figure 10. For 75% of tasks, the CPU time predicted when the task is scheduled is within 25% of the actual CPU consumption. Apollo continues to refine runtime estimates at runtime, based on statistics from the finished tasks within the same stage. Nevertheless, a number of factors make runtime estimation challenging. A common case is tasks with early-out behavior that do not read all of their input; such a task may, for example, consist of a filter operator followed by a TOP N operator. Different tasks may consume different amounts of input data before collecting N rows that satisfy the filter condition, which makes inference based on past task runtimes difficult. Complex user code whose resource consumption and execution time vary with input data characteristics also makes prediction difficult, if not infeasible. We evaluate how Apollo dynamically adjusts scheduling decisions at runtime in Section 5.4.

[Figure 10: CPU time estimation. CDF of the relative error (0% to 50%) of predicted task CPU time.]

In order to evaluate the overall impact of estimation, we compare Apollo's performance with and without estimation. As we rolled out Apollo to one production cluster, we went through a phase in which we used a default estimate for all tasks uniformly, before we enabled the internal estimation mechanisms; we refer to this phase as Apollo without estimation.

[Figure 11: Scheduling balance (estimation effect). Standard deviation of wait time across machines (sec) over time, with and without estimation.]
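For concreteness, the flavor of per-server estimate that these measurements evaluate can be sketched as follows; the simple additive form and the field names are illustrative only, and the precise model is the one given in Section 3.3.

    # Illustrative sketch (not Apollo's exact model): a per-server task
    # completion-time estimate combining queuing, initialization, I/O, and CPU.
    from dataclasses import dataclass

    @dataclass
    class ServerView:
        expected_wait: float       # expected queuing time from the wait-time matrix
        init_time: float           # task initialization, e.g., fetching binaries
        remote_input_bytes: float  # input bytes that must be read off-server
        bandwidth: float           # attainable I/O bandwidth, bytes/sec

    def estimated_completion(cpu_estimate: float, s: ServerView) -> float:
        io_time = s.remote_input_bytes / s.bandwidth
        # Queue locally, initialize, read remote input, then execute.
        return s.expected_wait + s.init_time + io_time + cpu_estimate

    def pick_server(cpu_estimate: float, candidates: dict) -> str:
        # A scheduler favors the server minimizing the estimate, trading
        # locality (io_time) against queuing delay (expected_wait).
        return min(candidates,
                   key=lambda name: estimated_completion(cpu_estimate, candidates[name]))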

Comparing the system behavior before and after allows us to understand the impact of estimation on scheduling decisions in a production cluster, because the workloads are similar, as reported in Section 5.2. Figure 11 shows the distributions of task queuing time. With estimation enabled, Apollo achieves much more balanced scheduling across servers, which in turn leads to shorter task queuing latency.

Summary. Apollo provides good estimates on task wait time and CPU time, despite all the challenges, and estimation does help improve scheduling quality. Further improvements can be achieved by leveraging statistics of recurring jobs and by better understanding task internals, which is part of our future work.

5.4 Correction Effectiveness

In case of inaccurate estimates or sudden changes in the cluster environment, Apollo applies a series of correction mechanisms to mitigate them effectively. For duplicate scheduling, we call a duplicate task successful if it starts before the initial task.

Table 2: Duplicate scheduling efficiency.

  Conditions (W: wait time)               Trigger rate   Success rate
  New expected W significantly higher        0.12%          81.3%
  Expected W greater than average            0.12%          81.3%
  Elapsed W greater than average             0.17%          83.0%

Table 2 evaluates the different heuristics of duplicate scheduling, described in Section 3.4, over the same two-week period and reports how frequently they are triggered and their success rate. Overall, Apollo's duplicate scheduling is efficient, with an 82% success rate, and accounts for less than 0.5% of task creations. Such a low correction rate confirms the viability of optimistic scheduling and deferred corrections for this workload.

Straggler detection and mitigation is also important for job performance. Apollo is able to catch more than 70% of stragglers efficiently and apply mitigation in a timely manner to expedite query execution. We omit the detailed experiments due to space constraints.

Summary. Apollo's correction mechanisms are shown to be effective with small overhead.

5.5 Stable Matching Efficiency

In the general case, the complexity of the stable matching algorithm, when using a red-black tree to maintain a sorted set of tasks to schedule, is O(n^2), while the greedy algorithm is O(n log n). In our case, however, the complexity of the stable matching algorithm is limited to O(n log n): the algorithm usually converges in fewer than 3 iterations, and our implementation limits the number of iterations of the matcher, which makes the worst-case complexity O(n log n), the same as the greedy algorithm. In practice, a scheduling batch contains fewer than 1,000 tasks, and the computation overhead is negligible, with no observed performance differences.

We verify the effectiveness of the stable matching algorithm using a simulator. We measure the amount of time it takes for a workload of N tasks to complete on 100 servers, using the greedy, stable matching, and optimal matching algorithms, respectively. The optimal matching algorithm uses an exhaustive search to compute the best possible scheduling sequence. Each server has a single execution slot and an expected wait time that is exponentially distributed with an average of 1. The expected runtime of each task is also exponentially distributed with an average of 1. Each task is randomly assigned a server preference and runs faster on its preferred server. Figure 12 shows that the stable matching algorithm performs within 5% of the optimal matching under the simulated conditions, while the greedy approach, which schedules tasks one at a time on the server with the minimum expected completion time, was 57% slower than the optimal matching.

[Figure 12: Matching quality. Job completion time (sec) versus number of tasks (10^1 to 10^3) for the stable, greedy, and optimal matchers.]

Summary. Apollo's matching algorithm has the same asymptotic complexity as a naive greedy algorithm, with negligible overhead. It performs significantly better than the greedy algorithm and is within 5% of the optimal scheduling in our simulation.
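For concreteness, a minimal version of such a simulator can be sketched as follows; the 2x speedup on the preferred server and the propose/accept rule of the batched matcher are illustrative assumptions, and Apollo's actual matcher is the one described in Section 3.

    # Sketch of the Section 5.5 simulation: N tasks, 100 single-slot servers,
    # exponential waits and runtimes (mean 1), and a random per-task server
    # preference. The batched matcher is a simplified stand-in for Apollo's
    # stable matcher; its propose/accept rule is assumed, not taken from the paper.
    import random

    SERVERS = 100
    PREFERRED_SPEEDUP = 0.5  # assumed: a task runs 2x faster on its preferred server

    def make_workload(n, seed=0):
        rng = random.Random(seed)
        waits = [rng.expovariate(1.0) for _ in range(SERVERS)]
        tasks = [(rng.expovariate(1.0), rng.randrange(SERVERS)) for _ in range(n)]
        return waits, tasks  # per-server initial wait; (runtime, preferred server) per task

    def runtime_on(task, server):
        base, preferred = task
        return base * PREFERRED_SPEEDUP if server == preferred else base

    def greedy(waits, tasks):
        # Schedule one task at a time on the server with the minimum expected completion time.
        ready = list(waits)
        for t in tasks:
            s = min(range(SERVERS), key=lambda i: ready[i] + runtime_on(t, i))
            ready[s] += runtime_on(t, s)
        return max(ready)  # time when all server queues drain (batch completion)

    def batched_matching(waits, tasks):
        # Each round, every pending task proposes to its best server and each
        # server accepts at most one proposal, mimicking a stable-matching iteration.
        ready = list(waits)
        pending = list(tasks)
        while pending:
            proposals = {}
            for t in pending:
                s = min(range(SERVERS), key=lambda i: ready[i] + runtime_on(t, i))
                proposals.setdefault(s, []).append(t)
            for s, ts in proposals.items():
                best = min(ts, key=lambda t: runtime_on(t, s))
                ready[s] += runtime_on(best, s)
                pending.remove(best)
        return max(ready)

    if __name__ == "__main__":
        waits, tasks = make_workload(1000)
        print("greedy:", round(greedy(waits, tasks), 2),
              "batched:", round(batched_matching(waits, tasks), 2))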

6 Related Work

Job scheduling was extensively studied [24, 25] in high-performance computing for scheduling batch CPU-intensive jobs and has become a hot topic again with emerging data-parallel systems [7, 15, 29]. Monolithic schedulers, such as the Hadoop Fair Scheduler [28] and Quincy [16], implement a scheduling policy for an entire cluster using a centralized component. This class of schedulers suffers from scalability challenges when serving large-scale clusters.

Mesos [14] and YARN [27] aim to ease the support for multiple workloads by decoupling resource management from application logic in a two-layer design. Facebook Corona [1] uses a similar design but focuses on reducing job latency and improving scalability for Hadoop by leveraging a push-based message flow and an optimistic locking pattern. Mesos, YARN, and Corona remain fundamentally centralized. Apollo, in contrast, makes decentralized scheduling decisions with no central scheduler, which facilitates small-task scheduling.

Distributed schedulers such as Omega [23] and Sparrow [22] address the scalability challenges by not relying on a centralized entity for scheduling. Two different approaches have been explored to resolve conflicts. Omega resolves any conflict using optimistic concurrency control [17], where only one of the conflicting schedulers succeeds and the others have to roll back and retry later. Sparrow instead relies on (random) sampling and speculative scheduling to balance the load.

Similar to Omega, schedulers in Apollo make optimistic scheduling decisions based on their views of the cluster (with the help of the RM). Unlike Omega, which detects and resolves conflicts at scheduling time, Apollo is optimistic in conflict detection and resolution, deferring any corrections until after tasks are dispatched. This is made possible by Apollo's design of having local queues on servers. The use of local task queues and task runtime estimates provides critical insight about future resource availability and allows Apollo schedulers to optimize task scheduling by estimating task completion time, instead of relying on instantaneous resource availability at scheduling time. This also gives Apollo the extra benefit of masking resource-prefetching latency effectively, which is important for our target workload.

Although both adopt a distributed scheduling framework, Sparrow and Apollo differ in how they make scheduling decisions. Sparrow's sampling mechanism schedules tasks on the machines with the shortest queues, with no consideration for other factors affecting the completion time, such as locality. In contrast, Apollo schedulers optimize task scheduling by estimating task completion time, which takes into account multiple factors such as load and locality. Sparrow uses reservations on multiple servers with late binding to alleviate its reliance on queue length, rather than task completion time. Such a reservation mechanism would introduce excessive initialization costs on multiple servers in our workload. Apollo introduces duplicate scheduling only as a correction mechanism; it is rarely triggered in our system.

While Omega and Sparrow have been evaluated using simulation and a 110-machine cluster, respectively, our work distinguishes itself by showing Apollo's effectiveness in a real production environment at a truly large scale, with diverse workloads and complex resource and job requirements.

Capacity management often goes hand-in-hand with scheduling. The Capacity Scheduler [2] in Hadoop/YARN uses global capacity queues to specify the share of resources for each job, which is similar to the token-based resource guarantees implemented in Apollo. Apollo uses fine-grained allocations and opportunistic scheduling to take advantage of idle resources gracefully. Resource management on local servers is also critical. Existing work leverages containers [13], usage monitoring [27], and/or contention detection [31] to provide performance isolation. Apollo can accommodate any of those mechanisms.

Operator runtime estimation based on data statistics and operator semantics has been extensively studied in the database community for effective query optimization [19, 11, 6]. For distributed computation over arbitrary inputs and programs, most efforts (e.g., ParaTimer [21]) have focused on estimating the progress of running jobs based on runtime statistics. Apollo combines both static and runtime information and leverages program patterns (e.g., stages) to estimate task runtime. Apollo also includes a set of mechanisms to compensate for inaccuracy whenever needed. For example, many existing works on outlier detection and straggler mitigation (e.g., LATE [30], Mantri [4], and Jockey [9]) are complementary to our work and can be integrated with the Apollo framework for reducing job latency.

7 Conclusion

In this paper, we present Apollo, a scalable and coordinated scheduling framework for cloud-scale computing. Apollo adopts a distributed and loosely coordinated scheduling architecture that scales well without sacrificing scheduling quality. Each Apollo scheduler considers various factors holistically and performs estimation-based scheduling to minimize task completion time. By maintaining a local task queue on each server, Apollo enables each scheduler to reason about future resource availability and to implement a deferred-correction mechanism that effectively adjusts suboptimal decisions dynamically. To leverage idle system resources gracefully, opportunistic scheduling is used to maximize overall system utilization. Apollo has been deployed on production clusters at Microsoft: it has been shown to achieve high utilization and low latency, while coping well with the dynamics of diverse workloads and large clusters.

Acknowledgements

We are grateful to our shepherd Andrew Warfield for his guidance in the revision process and to the anonymous reviewers for their insightful comments. David Chu and Jacob Lorch provided feedback that helped improve the paper. We would also like to thank the members of the Microsoft SCOPE team for their contributions to the SCOPE distributed computation system, of which Apollo serves as the scheduling component. The following people contributed to our scheduling framework: Haochuan Fan, Jay Finger, Sapna Jain, Bikas Saha, Sergey Shelukhin, Sen Yang, Pavel Yatsuk, and Hongbo Zeng. Finally, we would like to thank the Microsoft Big Data team members for their support and collaboration.

References

[1] Under the hood: Scheduling MapReduce jobs more efficiently with Corona. http://on.fb.me/TxUsYN, 2012. [Online; accessed 16-April-2014].
[2] Hadoop Capacity Scheduler. http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, 2014. [Online; accessed 16-April-2014].
[3] Hadoop Distributed File System (HDFS). http://hadoop.apache.org/, 2014. [Online; accessed 16-April-2014].
[4] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in MapReduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1-16, Berkeley, CA, USA, 2010. USENIX Association.
[5] N. Bruno, S. Agarwal, S. Kandula, B. Shi, M.-C. Wu, and J. Zhou. Recurring job optimization in SCOPE. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 805-806, New York, NY, USA, 2012. ACM.
[6] S. Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 34-43. ACM, 1998.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 1-12. ACM, 1987.
[9] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 99-112, New York, NY, USA, 2012. ACM.
[10] D. Gale and L. S. Shapley. College admissions and the stability of marriage. American Mathematical Monthly, 69(1):9-15, Jan. 1962.
[11] P. Gassner, G. M. Lohman, K. B. Schiefer, and Y. Wang. Query optimization in the IBM DB2 family. IEEE Data Eng. Bull., 16(4):4-18, 1993.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, 2003. ACM.
[13] M. Helsley. LXC: Linux container tools. IBM developerWorks Technical Library, 2009.
[14] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, pages 22-22, Berkeley, CA, USA, 2011. USENIX Association.
[15] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.
[16] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 261-276, New York, NY, USA, 2009. ACM.
[17] H.-T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems (TODS), 6(2):213-226, 1981.
[18] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133-169, May 1998.
[19] L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for local queries, volume 15. ACM, 1986.
[20] M. L. Massie, B. N. Chun, and D. E. Culler. The ganglia distributed monitoring system: Design, implementation, and experience. Parallel Computing, 30(7):817-840, 2004.
[21] K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A progress indicator for MapReduce DAGs. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 507-518. ACM, 2010.

[22] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 69-84, New York, NY, USA, 2013. ACM.
[23] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 351-364, New York, NY, USA, 2013. ACM.
[24] G. Staples. TORQUE resource manager. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 8. ACM, 2006.
[25] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4):323-356, 2005.
[26] R. Van Renesse, K. P. Birman, and W. Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Transactions on Computer Systems (TOCS), 21(2):164-206, 2003.
[27] V. K. Vavilapalli. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proc. SOCC, 2013.
[28] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pages 265-278, New York, NY, USA, 2010. ACM.
[29] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2-2. USENIX Association, 2012.
[30] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, page 7, 2008.
[31] X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 379-391, New York, NY, USA, 2013. ACM.
[32] J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet MapReduce. VLDB J., 21(5):611-636, 2012.
