
Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale

Zhuo Zhang∗, Chao Li∗, Yangyu Tao∗, Renyu Yang†∗, Hong Tang∗, Jie Xu†§
Alibaba Cloud Computing Inc.∗   Beihang University†   University of Leeds§
{zhuo.zhang, li.chao, yangyu.taoyy, hongtang}@alibaba-inc.com   [email protected]   [email protected]

ABSTRACT

Scalability and fault-tolerance are two fundamental challenges for all distributed computing at Internet scale. Despite many recent advances from both academia and industry, these two problems are still far from settled. In this paper, we present Fuxi, a resource management and job scheduling system that is capable of handling the kind of workload at Alibaba where hundreds of terabytes of data are generated and analyzed every day to help optimize the company's business operations and user experiences. We employ several novel techniques to enable Fuxi to perform efficient scheduling of hundreds of thousands of concurrent tasks over large clusters with thousands of nodes: 1) an incremental resource management protocol that supports multi-dimensional resource allocation and data locality; 2) user-transparent failure recovery, where failures of any Fuxi component do not impact the execution of user jobs; and 3) an effective faulty-node detection mechanism and a multi-level blacklisting scheme that prevents faulty nodes from affecting job execution. Our evaluation results demonstrate that 95% memory utilization and 91% CPU utilization can be fulfilled under synthetic workloads, and that Fuxi achieves 2.36TB/minute throughput in GraySort. Additionally, the same Fuxi job experiences only approximately 16% slowdown under a 5% fault-injection rate, and the slowdown grows only to 20% when we double the fault-injection rate to 10%. Fuxi has been deployed in our production environment since 2009, and it now manages hundreds of thousands of server nodes.

1. INTRODUCTION

We are now officially living in the Big Data era. According to a study by Harvard Business Review in 2012, 2.5 exabytes of data are generated every day and the speed of data generation doubles every 40 months [13]. To keep up with the pace of data generation, data processing has been progressively migrating from traditional database-based approaches to distributed systems that scale out much more easily. In recent years, many systems have been proposed from both academia and industry to support distributed data processing with commodity server clusters, such as Mesos [11], Yarn [18] and Omega [16]. However, two major challenges remain unsettled in these systems when managing resources at Internet scale:

1) Scalability: Resource scheduling can be viewed as the process of matching demand (requests to allocate resources to run processes) with supply (available resources of cluster nodes), so the complexity of resource management is directly affected by the number of concurrent tasks and the number of server nodes in a cluster. Other factors further add to the complexity, including resource allocation over multiple dimensions (such as CPU, memory, and local storage), fairness and quota constraints across competing applications, and scheduling tasks close to their data. A naive approach that delegates every decision to a single master node (as in Hadoop 1.0) would be severely limited by the capability of the master. On the other hand, a fully decentralized solution is hard to design when scheduling constraints depend on fast-changing global state (such as quota management) without incurring high synchronization cost. Both Mesos and Yarn attempt to deal with these issues in slightly different ways. Mesos adopts a multi-level resource offering framework; however, because the Mesos master offers free resources in turn among frameworks, the waiting time for each framework to acquire its desired resources depends heavily on the resource offering order and on the scheduling efficiency of the other frameworks. Yarn's architecture decouples resource management from the programming paradigm. However, the decoupling only separates the code logic of resource management from MapReduce job execution so that other programming paradigms can be accommodated; it does not reduce the complexity of resource management, and it still inherits the linear resource model of Hadoop 1.0.

In both cases, states are exchanged through periodic messages, and configuring the message interval is another intricate challenge. A long period reduces communication overhead but hurts utilization while applications wait for resource assignment. Frequent updates, on the other hand, improve the response to demand/supply changes (and thus resource utilization), but aggravate message flooding. In fact, Yahoo! reported that they did not run Hadoop clusters bigger than 4,000 nodes [18], and Yarn had yet to demonstrate its scale limit.

2) Fault-tolerance: With the increasing scale of a cluster, the probability of hardware failures also rises. Additionally, rare software bugs or hardware defects that never show up in a test environment can suddenly surface in production. Essentially, failures become the norm rather than the exception at large scale [15]. The variety of failures includes halt failures due to OS crashes, network disconnection, disk hangs, etc. [9]. Traditional mechanisms like health monitoring tools or heartbeats can help, but they cannot completely shield applications from failures. Another challenge is how to cope with master failures. In Yarn, when the resource manager fails and then recovers, it has no recollection of the cluster state, so all running applications (including application masters) must start over. Similarly, the failure of a node manager results in the re-execution of all running tasks on that node, even if it soon recovers.

At Alibaba, hundreds of millions of customers visit our Web sites every day, looking for things to buy from over one billion items offered by our merchants. Hundreds of terabytes of user behavior, transaction, and payment data are logged and must go through elaborate processing every day to support the optimization of our core business operations, such as online marketing, product search and fraud detection, and to improve user experiences such as personalization and product recommendation. In this paper, we present Fuxi (pronounced "fu-shi"; Fuxi is the first of the Three Sovereigns of ancient China and the inventor of the square and compass, the trigrams, Taiji principles and the calendar [4], and is thereby metaphorized as a powerful dominator in our system), the resource management and job scheduling system that supports our proprietary data platform (called Open Data Processing Service, or ODPS [6]) and handles the Internet-scale workload at Alibaba.

Fuxi's overall architecture bears some resemblance to Yarn's. In this paper, we mainly focus on the following three aspects, which are key to scaling Fuxi to thousands of nodes and hundreds of thousands of concurrent processes, to maintaining high resource utilization, and to shielding applications from low-level failures. We believe these techniques are a step in the right direction towards solving the scalability and fault-tolerance problems faced by similar systems that handle Internet-scale traffic.

Incremental resource management protocol. Fuxi's resource management protocol allows an application to specify its resource demand once and then make incremental updates only when necessary. The protocol saves an application from repetitively asserting its full resource demand, and thus significantly reduces communication and message-processing overhead. Resource grants are likewise offered to applications iteratively. The central scheduler, called FuxiMaster, maintains the state of each application's unfulfilled demand and employs a locality-tree based method to achieve microsecond-level response.

User-transparent failure recovery. Once a user job is submitted, Fuxi ensures that its execution will be carried through to completion and makes the failover of individual component failures as transparent to the user as possible, with the help of the following techniques: a) For FuxiMaster, we adopt the typical hot-standby approach and lightweight state preservation for its failover. In order to reduce the overhead of state bookkeeping and to accelerate state restoration, we separate the states into soft states and hard states; soft states can be collected at runtime from other Fuxi components, while only hard states such as job descriptions need to be recorded. b) For an application master, FuxiMaster leverages heartbeats to determine whether to start a new master. The application master can also recover its finished and running workers on failover by means of a lightweight checkpoint scheme. c) For the daemon process on each machine, existing running tasks are adopted rather than killed.

Faulty node detection and multi-level blacklist. We adopt a multi-level machine blacklist method to effectively identify machines that behave abnormally yet are not dead. In Fuxi, both FuxiMaster and the application masters employ machine blacklist mechanisms. FuxiMaster uses three schemes to find bad machines, while application masters use a bottom-up approach to distinguish temporary abnormality from persistently bad machines. Additionally, FuxiMaster and the application masters share the blacklist to make collaborative judgments about faulty nodes.

Our evaluation results demonstrate the major advantages of Fuxi. Under a stressful synthetic workload, the average scheduling time for each request is merely 0.88ms, while 95% of memory and 91% of CPU can be planned out by FuxiMaster. Additionally, we achieve a 2.36TB/minute sort throughput in GraySort, a 66.5% improvement over the previous record by Yahoo! [8]. With a 5% fault-injection rate, the same Fuxi job experiences only a 15% slowdown; even after we double the fault-injection rate to 10%, the slowdown grows only to 20%. Fuxi has been deployed in our production environment since 2009. It now manages hundreds of thousands of server nodes at Alibaba.

The remaining sections are structured as follows. Section 2 presents the system overview and core design philosophy. Section 3 focuses on resource management, while Section 4 describes job scheduling, including job execution and fault tolerance handling. Section 5 presents the experimental results, followed by related work in Section 6. We conclude the paper and discuss future work in Section 7.

2. SYSTEM OVERVIEW

In this section, we provide an introduction to the Apsara cloud platform and then present an overview of the Fuxi system.

2.1 Apsara Cloud Platform

Apsara is a large-scale general-purpose distributed computing system developed by Alibaba Cloud Computing Inc. [1] (aka Aliyun). Apsara is responsible for managing the physical resources of Linux clusters within a data center and for controlling the parallel execution of distributed applications. It also hides low-level management chores such as failure recovery and data replication, thereby providing a highly reliable and scalable platform that supports distributed processing with many common computing models. Apsara is the common foundation for a suite of cloud services offered by Aliyun, such as elastic computing, data storage and large-scale data-driven computation.

The overall architecture of Apsara is shown in Figure 1, where the boxes in sky blue are Apsara modules and the white boxes are cloud services running on top of Apsara. Apsara consists of four major parts: 1) common low-level utilities for distributed computing, such as distributed coordination (locking and naming), remote procedure calls, security management, and resource management; 2) a distributed file system; 3) parallel job scheduling; and 4) cluster deployment and monitoring.

Figure 1: Apsara system overview.

Figure 2: An application workflow in the Fuxi system.

The resource management and job scheduling modules are collectively called Fuxi and are the topic of this paper.

2.2 Fuxi Overview

As shown in Figure 2, Fuxi follows a common master-slave architecture and has three components: the central resource manager (called FuxiMaster), the distributed node managers (called FuxiAgents), and the application masters.

FuxiMaster: In order to schedule cluster resources among different applications, decisions are made from both the producers' and the consumers' points of view: how many resources are available for scheduling, and what each application's specific requirements are. FuxiMaster acts as the match-maker in between. It not only passively collects the total free resources and virtual resources from each machine, but also gathers the resource requests from all application masters. Because the information from both sides is highly dynamic, FuxiMaster must collect it in time and perform the scheduling quickly and fairly. When assigned resources are no longer needed by an application, FuxiMaster should reschedule them to another application as soon as possible to keep cluster utilization high. After scheduling, the results are delivered to the related application masters and FuxiAgents. Thereafter, the master can use these resources to start the requisite computation processes, with the guaranteed resource amount and the isolation of each application process enforced by FuxiAgent based on the scheduling instructions.

FuxiAgent: A single FuxiAgent runs on each machine and serves two roles. The first is to collect local information and status periodically and report them to FuxiMaster for scheduling decisions. The second is to ensure that application processes execute normally, with the aid of process monitoring, environment protection and process isolation.

To achieve process isolation, the most important responsibility of FuxiAgent, we adopt three schemes. Firstly, FuxiAgent starts processes for an application only if it has obtained sufficient resources on the machine from FuxiMaster; we call this procedure resource capacity assurance. Specifically, when the resource capacity decreases and the application master does not choose a process to stop, FuxiAgent compulsorily kills one process of this application to ensure the resource capacity. Secondly, each process is configured with Cgroup [3] soft and hard limits. When a machine encounters resource overload, one or more processes are killed to keep the machine load within an acceptable threshold; one simple rule is to select the process whose real resource usage exceeds its requested usage the most. Thirdly, a sandbox is leveraged to isolate processes from invalid operations such as illegal file access; in fact, different root folders are created for each process, preventing interference and resource access from others.

ApplicationMaster: Different computation paradigms (e.g., MapReduce, Spark, or Storm) are represented by different application masters that are responsible for application-specific execution logic.

Common Workflow: Figure 2 illustrates how these components cooperate with each other. A user first submits a request to launch an application (e.g., a MapReduce job) to FuxiMaster together with a description, which contains the necessary information such as the application type, the master package location and application-specific information. FuxiMaster then finds a FuxiAgent with sufficient resources for the application master and requests that FuxiAgent to start the corresponding application master. Once started, the application master retrieves the application description and determines the resource demand for the different stages of the job execution, including the appropriate parallelism, the granularity of each allocation, and the preferred locality. Afterwards, it sends resource allocation requests to FuxiMaster in an incremental manner and waits for resource grants and revocations (if any) from FuxiMaster. Upon receipt of resource grants, the application master sends concrete work plans to the corresponding FuxiAgents. A work plan contains the necessary information to launch a specific process, such as its package location, resource usage limits and start-up parameters. Upon receiving a new work plan, the FuxiAgent starts the application worker and uses Linux Cgroups [3] to enforce the resource constraints. FuxiAgent watches the worker's status and restarts it if it crashes. In the meantime, the application worker also registers itself with the application master. When a worker is no longer needed, the application master makes an incremental request to return the granted resource back to FuxiMaster.
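To make the work-plan handling concrete, the following Python sketch illustrates how a FuxiAgent-like daemon could apply per-process limits from a work plan through the Linux cgroup filesystem before launching the worker. The WorkPlan fields, cgroup paths and thresholds are illustrative assumptions for this paper, not Fuxi's actual interfaces.

    import os
    import subprocess
    from dataclasses import dataclass

    # Hypothetical work plan mirroring the fields described in Section 2.2:
    # package location, resource usage limits and start-up parameters.
    @dataclass
    class WorkPlan:
        app_id: str
        package_path: str      # directory holding the worker binary package
        start_cmd: list        # argv used to launch the worker
        cpu_shares: int        # soft CPU limit (cgroup v1 cpu.shares)
        memory_limit: int      # hard memory limit in bytes

    def launch_worker(plan: WorkPlan, cgroup_root="/sys/fs/cgroup"):
        """Create per-application cgroups, write the limits, then start the worker in them."""
        cpu_dir = os.path.join(cgroup_root, "cpu", plan.app_id)
        mem_dir = os.path.join(cgroup_root, "memory", plan.app_id)
        for d in (cpu_dir, mem_dir):
            os.makedirs(d, exist_ok=True)
        with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
            f.write(str(plan.cpu_shares))
        with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
            f.write(str(plan.memory_limit))
        proc = subprocess.Popen(plan.start_cmd, cwd=plan.package_path)
        # Attach the worker to both cgroups so the kernel enforces the limits.
        for d in (cpu_dir, mem_dir):
            with open(os.path.join(d, "tasks"), "w") as f:
                f.write(str(proc.pid))
        return proc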
3. INCREMENTAL RESOURCE MANAGEMENT PROTOCOL

In this section, we introduce an efficient resource management protocol that allows Fuxi to support the scale at Alibaba. We first introduce the basic idea behind the protocol design, and then describe the key data structures used in the protocol. Afterwards, we describe the scheduling algorithm, which is centered around a data structure called a locality tree. We conclude the section with a few other considerations for multi-tenancy and scalability.

3.1 Basic Idea

Incremental Scheduling: Consider a cluster with hundreds of thousands of concurrent tasks running for tens of seconds on average: the resource demand/supply situation changes tens of thousands of times per second. Making prompt scheduling decisions at such a fast rate means that FuxiMaster cannot recalculate the complete mapping of CPU, memory and other resources on all machines to all applications' tasks for every decision. With the locality-tree based incremental scheduling, only the changed part is recalculated. For example, when {2 CPU, 10GB} of resources free up on machine A, we only need to decide which application in machine A's waiting queue should receive these resources; there is no need to consider other machines or other applications. Microsecond-level scheduling can be achieved with this intuitive but effective locality-based approach.

Incremental Communication: Similarly, in an environment consisting of thousands of applications and machines, the massive message traffic among the different components (FuxiMaster, application masters and FuxiAgents) is a non-negligible factor that greatly affects the performance of FuxiMaster. An application may require hundreds of thousands of processes to run; a simple iterative protocol that keeps asking for unfulfilled resources would consume too much bandwidth and would get worse when the cluster is busy.

For this reason, we reduce message flooding by having application masters and FuxiAgents send messages to FuxiMaster only when changes occur, and by transferring only the delta portion. Applications publish their resource demands incrementally: in the simplest form, an application specifies its resource demand only once, waits for resources to be assigned iteratively, and returns the resources when it exits. On the other hand, such a protocol requires state to be maintained in both FuxiMaster and the application masters, and many problems arise in keeping this state consistent across the different entities. For example, we must ensure that the changed portions are delivered and processed at the receiver in the same order as they were generated at the sender. Additionally, we must ensure that the handling of duplicated delta messages, which can occur as a result of temporary communication failures, is idempotent. In a word, we must make sure the full versions of the information on the two communicating peers are exactly the same. As a safety measure, application masters periodically exchange the full resource state with FuxiMaster to fix any possible inconsistency.

Figure 3 gives a simplified example that demonstrates the principle of incremental scheduling and communication. We define a ScheduleUnit to be a unit-size description of resources, such as {1 CPU core, 1GB memory}. The numbers 1 to 8 represent steps in temporal order, explained in sequence below:

1) AppMaster1 applies for 10 ScheduleUnits (say SU_A) of resources in the cluster. The detailed requirements are: a) one SU_A contains one CPU core and 2GB of memory; b) if free resources on M1 are available, at least 2 SU_As on M1 are preferred; c) there is no locality preference on other machines; d) at most 10 SU_As are needed.
2) On receiving the request from AppMaster1, FuxiMaster checks the available resources and AppMaster1's group quota availability before scheduling. In this case, the 2 requested units on M1 can be met (actually 3 units are assigned on this machine). After 3 units on M2 and 2 units on M3 have been assigned, the application master still needs another 2 units of free resources on any machine in the cluster. Note that AppMaster1 does not need to update its resource request for SU_A here, thanks to the proposed incremental strategy.
3) AppMaster2 happens to return 1 unit of resource on M3.
4) On receiving the return message, FuxiMaster checks its waiting queue and identifies AppMaster1 as the application with the highest priority. Because its unit size is much smaller than AppMaster2's, 2 units of its request can be fulfilled. The scheduling decision is therefore to assign 2 units of resource to AppMaster1 after revoking one unit of resource from AppMaster2 on machine M3. At step 4, the incremental scheduling results are transferred to the related application master and FuxiAgent separately, again as incremental messages.
5) 8 units of resource are returned by AppMaster1 on different machines, with the messages sent up to FuxiMaster in incremental mode.
6) On receipt of this message, FuxiMaster conducts the scheduling and assigns the returned resources to other waiting applications in the queue.
7) More resources remaining with AppMaster1 are given back, again incrementally.
8) The same procedure repeats, with FuxiMaster scheduling and assigning the returned resources to other waiting applications by means of the proposed incremental communication and scheduling.

Figure 3: Incremental scheduling and communication.
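The following Python sketch illustrates, under simplified assumptions, the incremental communication rule just described: an application master sends only the delta of its demand, tags every delta with a sequence number so that the receiver can process messages in order and drop duplicates, and a periodic full-state exchange fixes any residual divergence. All class and field names here are our own simplifications, not Fuxi's actual message formats.

    class AppMasterDemand:
        """FuxiMaster-side view of one application's demand, updated by deltas."""
        def __init__(self):
            self.pending = {}        # location (machine, rack or "cluster") -> unit count
            self.last_seq = 0        # highest delta sequence number applied so far

        def apply_delta(self, seq, delta):
            # Idempotent, in-order handling: a duplicated or stale delta is ignored.
            if seq <= self.last_seq:
                return False
            for location, change in delta.items():     # change may be positive or negative
                new_count = self.pending.get(location, 0) + change
                if new_count > 0:
                    self.pending[location] = new_count
                else:
                    self.pending.pop(location, None)
            self.last_seq = seq
            return True

        def reconcile(self, seq, full_state):
            # Periodic full-state exchange as a safety net against inconsistency.
            self.pending = dict(full_state)
            self.last_seq = seq

    # Example: demand for 10 units, 2 of them preferred on machine M1.
    demand = AppMasterDemand()
    demand.apply_delta(1, {"M1": 2, "cluster": 8})
    demand.apply_delta(2, {"M1": -2})    # two units on M1 were granted, so demand shrinks
    demand.apply_delta(2, {"M1": -2})    # a duplicated delivery is a no-op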

3.2 Key Data Structures

3.2.1 Resource Description: Both Physical and Virtual

On large clusters running different types of computations, there is a real need for a process to have non-proportional resource demands over different dimensions, e.g., CPU and memory. Fuxi unifies all these diverse demands into a uniform multi-dimensional resource description, which at present covers the physical resources CPU and memory but can easily be extended to more dimensions in the future. It is used to describe the total and available resources of a single node as well as the quantities of a resource request. All resource allocations are based on this resource description abstraction, and all dimensions of the description must be satisfied at the same time.

Apart from physical resources, we also introduce the concept of a 'virtual resource', which lets applications easily limit the concurrency of a certain task on a node. For example, to run a distributed sort application called ASort in a cluster, if we only allow 5 concurrent computing processes on the same node, we can configure each node to contain only 5 units of a virtual resource. We can give the virtual resource a name (e.g., 'ASortResource'), and the ASort application is configured to request one 'ASortResource' for each computing process. The total virtual resource on each node can be changed at any time.

3.2.2 Resource Request

A resource request consists of a ScheduleUnit definition, the quantities for each ScheduleUnit, and other attributes (location preference, a list of machines to avoid, priority, etc.). It is sent from the application master to FuxiMaster to apply for resources.

A ScheduleUnit is a unit-size description of resources, such as {1 CPU core, 1GB memory}. Resources fall into a three-level tree hierarchy: machine, rack and cluster. A machine can have dozens of CPU cores and gigabytes of memory, a rack consists of tens or hundreds of machines, and tens of racks with thousands of machines constitute a cluster. Each application can request resources on one specific machine, on any machine in a certain rack, or on any machine in the cluster; this is mainly determined by the data locality preferred by the application's computation processes (computation ideally happens where the data resides, or at least within the same network switch). Any subsequent resource request is based on this ScheduleUnit and only needs to specify the number of units on each machine, rack or the cluster (see Figure 4). Each application can have multiple ScheduleUnits, each with its own unit size and the same or a different priority; scheduling in FuxiMaster is based on these scheduling units.

    slot_id: 1
    max_slot_count: 10
    slot_def {
      priority: 1000
      resource {
        resource_type: "CPU"
        amount: 100
      }
      resource {
        resource_type: "Memory"
        amount: 1024
      }
    }
    Locality_hints {
      value: "r42g04021.aliyun.com"
      count: 2
      type: LT_MACHINE
    }
    Locality_hints {
      value: "r42g.aliyun.com"
      count: 10
      type: LT_RACK
    }

Figure 4: Resource request example.

Furthermore, the application master can change its resource request dynamically at any time after the first submission. The quantities can be either positive or negative, meaning an increase or a decrease of the resource request respectively. This typically happens when the location preference is adjusted because bad nodes have been detected, when unused resources are returned, or when more resources are required, for example for backup runs of some instances.

Take a map-reduce job with 10 mappers and one reducer as an example: we first request 10 units for the mappers. If only 6 units are granted by FuxiMaster, it is unnecessary to send new requests for the additional 4 units; FuxiMaster automatically keeps the request in the scheduler's waiting queue, and when free resources become available, the additional 4 units are granted to the application master. When some mappers finish, the application master returns the resources via the same protocol, and only the unit count needs to be sent.

3.2.3 Resource Response

A resource grant offers the application master the privilege to run worker processes on a specific node. No matter at which level (machine, rack or cluster) the application master requests resources, resources on specific machines are granted to each application after scheduling. FuxiMaster notifies the application master of resource grants in a response message of the form (M1,3), (M2,4), ..., (Mn,1), with a resource amount attached to each machine. The quantities can be either positive or negative, indicating a grant or a revocation of resources respectively. A resource revocation means that a previously offered resource grant is no longer valid, possibly due to reasons such as the node going down or preemption. The application master may react to the message by terminating the corresponding worker at an opportune time, or the worker will be involuntarily terminated by FuxiAgent.

Fuxi separates the notion of a task (the application process that performs the actual work) from that of a container (the unit of resource grant). Once an application master receives a grant, it explicitly controls the grant's life-cycle and may reuse the container to run multiple tasks. This is one of the major differences between Fuxi and Yarn: in Yarn's implementation there is no separation between a task and the resources for the task, so whenever a task completes, the node manager always reclaims the resources, even though the application master has more tasks ready to execute. The resource manager then has to conduct additional rounds of rescheduling, creating substantial overhead and unnecessary request messages.
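As an illustration of the data structures in Sections 3.2.1 and 3.2.3, the sketch below shows a minimal multi-dimensional resource description in which every dimension must fit, and the application of a grant/revocation response of (machine, ±count) pairs. The names and the two-dimension restriction are assumptions made for brevity.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ResourceDesc:
        """Multi-dimensional resource description; every dimension must be satisfied."""
        cpu: int         # e.g. 100 = one core, following the granularity used in Figure 4
        memory: int      # in MB; further dimensions would simply be extra fields

        def fits_into(self, free):
            return self.cpu <= free.cpu and self.memory <= free.memory

        def subtract(self, other):
            return ResourceDesc(self.cpu - other.cpu, self.memory - other.memory)

    def apply_response(grants, containers):
        """Apply a response of (machine, +/-count) pairs to the granted container counts."""
        for machine, count in grants:
            if count >= 0:
                containers[machine] = containers.get(machine, 0) + count
            else:
                # A revocation: the application master should stop |count| workers on this
                # machine at an opportune time, or FuxiAgent will terminate them itself.
                containers[machine] = max(0, containers.get(machine, 0) + count)
        return containers

    unit = ResourceDesc(cpu=100, memory=1024)
    node_free = ResourceDesc(cpu=600, memory=4096)
    assert unit.fits_into(node_free)
    held = apply_response([("M1", 3), ("M2", 4), ("M3", -1)], {})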

3.3 Locality Tree Based Scheduling

The FuxiMaster scheduler maintains two data structures: the available resource pool and the locality tree. Upon receipt of resource requests from an application master, FuxiMaster checks the free resource pool and tries to find free resources that meet the application's locality requirements; meanwhile, load balance is also considered. If the free resources are insufficient, the resource requests are queued in FuxiMaster's scheduler. Each machine, each rack and the cluster have their own waiting queues, and applications that request resources on the same machine, rack or cluster are put into the same queue. These queues on machines, racks and the cluster constitute a locality tree.

This locality-tree based scheduling is another main difference between Fuxi and Yarn. Fuxi queues the unsatisfied resource requests in the FuxiMaster scheduler and automatically grants resources to application masters as resources free up. In this way, we speed up resource free-up and re-scheduling while increasing cluster utilization.

When the resources of a machine are returned by an application master, a waiting application is selected to receive the released resources. Priority is the principal consideration: the application with the higher priority gets the resources first, and when priorities are equal, the waiting time is taken into account. Additionally, applications waiting on the machine queue take precedence over those waiting on the queues of the rack or cluster that the machine belongs to; the objective is to preserve the overall locality of the cluster's computation when priorities are identical. An application can choose to wait on a specific machine, on a certain rack, or on any machine in the cluster, and all applications waiting on the same tree node are sorted by priority and submission time. Figure 5 illustrates an example of the scheduling tree: App1 waits for 4 units of resource on each of M1 and M2, and for 9 units, 4 units and 14 units in total on Rack1, Rack2 and the whole cluster respectively. When any of these waiting requests can be satisfied, the resources are assigned to App1 and the relevant waiting requests are decreased by the amount of assigned units.

Figure 5: Scheduling tree example.
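The Python sketch below gives one possible reading of the selection rule of Section 3.3: when resources free up on a machine, the waiting queues of that machine, its rack and the cluster are examined, and the winner is chosen by priority, then locality level, then waiting time. The queue layout and the convention that a smaller number means a higher priority are assumptions for illustration only.

    import time

    class WaitQueue:
        """One waiting queue of the locality tree (a machine, a rack, or the cluster)."""
        def __init__(self):
            self.entries = []        # tuples of (priority, enqueue_time, app_id, units)

        def add(self, priority, app_id, units):
            self.entries.append((priority, time.time(), app_id, units))

    def pick_application(machine, rack, machine_queues, rack_queues, cluster_queue):
        """Choose the waiting request to satisfy when resources free up on `machine`.

        Priority is the principal consideration (smaller number = higher priority);
        ties are broken by locality level (machine before rack before cluster)
        and then by waiting time (earlier enqueue first).
        """
        candidates = []
        levels = [(0, machine_queues.get(machine)),
                  (1, rack_queues.get(rack)),
                  (2, cluster_queue)]
        for level, queue in levels:
            if queue is None:
                continue
            for priority, enqueued_at, app_id, units in queue.entries:
                candidates.append((priority, level, enqueued_at, app_id, units))
        if not candidates:
            return None
        candidates.sort()
        return candidates[0][3]      # the selected application id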
3.4 Other Design Considerations

Multi-Tenancy Support: For fairness, a resource quota concept is introduced. A cluster can have multiple quota groups, and each application must belong to exactly one group. When the applications of one quota group are idle and cannot use up its resources, applications from other quota groups can exploit them instead. When all quota groups are busy, a minimal quota for each group is ensured. Dynamic quota adjustment is outside the scope of this paper and is not covered in detail.

To enact quota redistribution, FuxiMaster must revoke some resource grants via preemption. When the resource requests of applications from one quota group increase and its minimal resource quota is not satisfied, the quota groups that over-use resources are preempted to make space for this quota group.

Besides quota preemption, we provide priority preemption, which targets the scenario where an application with higher priority submits its resource request late, but the cluster resources happen to be fully scheduled out already. The applications with the lowest priority in its quota group are preempted to make space for the higher-priority ones; the second level is quota preemption.

Prioritized Request Handling: As the only central master in the cluster, FuxiMaster is prone to becoming a bottleneck, as numerous messages and requests have to be transmitted, handled and responded to in time. For example, when thousands of jobs are started in a batch in a production cluster at midnight, FuxiMaster may encounter a spike of requests asking for resources. To make things worse, FuxiMaster also carries the main responsibilities for resource scheduling and for message dissemination among a great many peers, which occupies a large amount of its handling time. Additionally, failures are such a routine phenomenon that cluster availability guarantees and fault tolerance also have to be handled in FuxiMaster, which further aggravates its burden.

In order to disperse spike requests and alleviate this burden, a multi-level, prioritized response mechanism is proposed, in which we classify the work done by FuxiMaster into different levels of importance and latency requirements. The request handling order is then determined according to these urgency priorities. Specifically, urgent requests like resource revocation and re-assignment are triggered by events, in order to offer timely responses while achieving higher cluster utilization. Furthermore, similar requests (e.g., frequently changing resource requests from one application) are merged and handled in batch mode, mitigating communication overhead. In contrast, heavy but non-urgent work, such as automatic quota adjustment or bad-node detection, is performed at a fixed time interval (e.g., once a minute) in a roll-up manner. Owing to these strategies, highly imperative items obtain more processing time and instant responses, which improves the effectiveness of resource management.

4. FAULT TOLERANT JOB SCHEDULING

Different computation paradigms can be implemented on top of Fuxi's resource management framework. In this paper, we focus on one specific batch dataflow programming model and how we achieve fault tolerance in job scheduling. In the following sections, we first give an overview of the Fuxi job model, and then discuss two techniques: user-transparent failure recovery and multi-level blacklisting. Finally, we share our experience with practical optimizations.

4.1 Overview

A Fuxi job uses a DAG (Directed Acyclic Graph) to represent the general user workflow, similar to other systems like Tez [2] and Dryad [12]. Configuring a Fuxi DAG job is simple: the framework accepts a JSON file as the job description. We take Figure 6 as an example to explain the job description. The JSON file has a field "Tasks", which describes the properties of each task, including the executable binary path and other user-customized parameters. The field "Pipes" describes all the data shuffles, each having a "Source" and a "Destination" access point associated with tasks. The circled numbers clarify the correspondence between the JSON file and the DAG on the right side of the figure. Users embed their task logic by using the Fuxi Job SDK and programming API. For data shuffle, we encapsulate common data operators such as sort, merge-sort and reduce into a library named Streamline, released along with the SDK; the library makes it easy to customize the user logic of a job with Fuxi.

    "Description": {
      "Tasks": {"T1": {...}, "T2": {...}, "T3": {...}, "T4": {...}},
      "Pipes": [
        { "Source": {"FilePattern": "://..."},            ①
          "Destination": {"AccessPoint": "T1:input"} },
        { "Source": {"AccessPoint": "T1:toT2"},           ②
          "Destination": {"AccessPoint": "T2:fromT1"} },
        { "Source": {"AccessPoint": "T1:toT3"},           ③
          "Destination": {"AccessPoint": "T3:fromT1"} },
        { "Source": {"AccessPoint": "T2:toT4"},           ④
          "Destination": {"AccessPoint": "T4:fromT2"} },
        { "Source": {"AccessPoint": "T3:toT4"},           ⑤
          "Destination": {"AccessPoint": "T4:fromT3"} },
        { "Source": {"AccessPoint": "T4:output"},         ⑥
          "Destination": {"FilePattern": "pangu://..."} }
      ]
    }

Figure 6: Job description sample. The pipes ①–⑥ correspond to the edges of the DAG shown on the right of the figure: InputFile → T1 → {T2, T3} → T4 → OutputFile.

The major consideration in the design of the Fuxi job model is fault tolerance. We design a user-transparent failover scheme to deal with FuxiMaster or FuxiAgent crashes, and we design the JobMaster to recover all instance states, including those of running instances, when the JobMaster restarts. Moreover, a multi-level blacklist strategy is adopted to handle general machine failures such as disk hangs, network disconnection, etc.

4.2 Job Execution

We provide plenty of command-line tools for users to manipulate a job. The whole life-cycle of a job execution consists of the following stages, all of which are fully automatic except for job starting and stopping by the user.

Job Submission: The user first writes the task logic code based on the interface functions defined in the Fuxi SDK. The user code is then compiled and built into an executable binary packed in a gzip file, and the package is uploaded to the Fuxi system. After this preparation, the user can submit the job with the JSON description file to FuxiMaster; the job life-cycle begins at this point.

Job Master Launch: When receiving a job submission, FuxiMaster schedules an available FuxiAgent to launch the JobMaster process and sends the command to it. The selected FuxiAgent quickly starts the JobMaster process, which reports the job's running status to FuxiMaster.

Instance Scheduling: After the JobMaster parses the job description, it requests the proper resources from FuxiMaster via the resource protocol. The request contains both machine-level preferences and requests for arbitrary machines in the cluster. Based on the resources offered by FuxiMaster, the JobMaster determines which task to execute, and the corresponding TaskMaster begins to schedule its instances. Data locality and load balance are taken into account during the scheduling. When a failure is found in instances or machines, the TaskMaster re-schedules the instances to tolerate disk errors or network congestion.

Job Monitoring: All TaskWorkers periodically report their status, including execution progress, to the TaskMasters. The user can also query the whole job status from the JobMaster with a command-line tool. Moreover, instance failure details are encapsulated in the reported status for the sake of easy fault diagnosis.

4.3 Fault Tolerance

From the viewpoint of job execution, faults can be categorized into three types: a) FuxiMaster, FuxiAgent or JobMaster process failures, which might result in failure of the whole job; b) machine or network faults, which lead to failures of the instances running on them; and c) instance long tails due to hardware and/or software performance degradation.

4.3.1 User Transparent Failure Recovery

All the roles in the Fuxi system, including FuxiMaster, FuxiAgent and JobMaster, support user-transparent failover. Here we describe the detailed design and implementation.

1) FuxiMaster Failover: FuxiMaster, as the central managing point, has a significant impact on overall system availability. We adopt the typical hot-standby approach and start two FuxiMaster processes in a cluster for high availability. The two masters are mutually excluded by a distributed lock on the Apsara lock service: the primary master that has grabbed the lock takes charge of resource scheduling while the other is standby. When the primary FuxiMaster crashes, the standby immediately grabs the lock and becomes the new primary master.

To reconstruct the complete resource scheduling results, FuxiMaster needs the checkpointed states from before its crash. On the other hand, a full checkpoint of all states would heavily affect FuxiMaster's scheduling performance. In order to reduce the overhead of state bookkeeping and accelerate state restoration, we separate the states into hard states and soft states. Only hard states, such as job descriptions and the cluster-level machine blacklist, are recorded by a lightweight checkpoint, which is taken only when a job is submitted or stopped. The soft states are collected from all FuxiAgents and application masters at runtime during FuxiMaster failover.

Figure 7 depicts the information required when FuxiMaster goes through failover. Only the application configurations are loaded from the checkpoint, while all other scheduled resource information is collected from the FuxiAgents and application masters: each application master re-sends its ScheduleUnit configuration, resource requests and location preferences to FuxiMaster, and each FuxiAgent re-sends the per-application resource allocation on its machine. FuxiMaster uses this information to reconstruct the scheduling states, keeping all resource allocations and existing processes stable. Additionally, the application masters keep their already assigned resources during the whole failover.

Figure 7: FuxiMaster's failover procedure.

2) FuxiAgent Failover: FuxiAgent also supports transparent failover. During its failover, FuxiAgent first collects the running processes it started previously, and then requests the full worker lists from each corresponding application master. With the full granted resource amounts from FuxiMaster for each application, FuxiAgent finally rebuilds the complete state from before the failover.

3) JobMaster Failover: A JobMaster process crash could result in the failure of the whole job. Therefore, it is necessary for the JobMaster to support user-transparent failover in real clusters with thousands of commodity machines. For failover, the JobMaster exports a snapshot of all instances' statuses.
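The following sketch illustrates the hard/soft-state split of Section 4.3.1: hard state is checkpointed only on job submission or termination, while soft state is rebuilt after failover from the reports re-sent by FuxiAgents and application masters. The state layout and method names are our own simplifications under those assumptions.

    import json

    class FuxiMasterState:
        """Illustrative split between checkpointed hard state and rebuilt soft state."""
        def __init__(self):
            self.hard = {"applications": {}, "cluster_blacklist": []}   # checkpointed
            self.soft = {"assignments": {}, "requests": {}}             # rebuilt at failover

        def checkpoint_hard(self, path):
            # Called only when a job is submitted or stopped, keeping checkpoints cheap.
            with open(path, "w") as f:
                json.dump(self.hard, f)

        def recover(self, path, agent_reports, app_master_reports):
            # 1) Reload hard state from the lightweight checkpoint.
            with open(path) as f:
                self.hard = json.load(f)
            # 2) Rebuild soft state: each FuxiAgent re-sends its current per-application
            #    allocation, and each application master re-sends its outstanding requests.
            for machine, allocation in agent_reports.items():
                self.soft["assignments"][machine] = allocation
            for app, request in app_master_reports.items():
                self.soft["requests"][app] = request
            # Existing allocations and running processes are left untouched.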

The snapshot export is triggered by any instance status change, so it adds very little overhead to JobMaster instance scheduling. The job snapshot is also lightweight, since only statuses such as "Running" are recorded. When the JobMaster process restarts, it first loads the snapshot of instance statuses, then collects the current statuses from the TaskWorkers, and finally recovers the internal instance scheduling results from before the crash. During the absence of the JobMaster process, all workers keep running their instances without interruption. This transparent master failover is extremely beneficial for long-running instances that are approaching completion.

4.3.2 Faulty Node Detection and Multi-level Blacklist Mechanism

Partial failures or performance degradation are notoriously hard for applications to handle, and can lead to long-tail execution at best or cascading failures that bring down the whole cluster at worst [10]. We design a multi-level machine blacklist to eliminate such machines from resource scheduling.

At the cluster level, we keep a heartbeat between each FuxiAgent and FuxiMaster to indicate the health of each cluster node. Once FuxiMaster finds a heartbeat timeout, the FuxiAgent is removed from the scheduling resource list and a resource revocation is sent to the JobMaster, so that the JobMaster can migrate running instances away from the timed-out FuxiAgent. We also introduce a plugin scheme that collects hardware information from the operating system to aid the judgment of machine health: disk statistics, machine load and network I/O are collected to compute a score, and once the score stays too low for a long time, FuxiMaster also marks the machine as unavailable. With this plugin scheme, administrators can add more check items and customize error detection.

At the job level, the JobMaster estimates machine health based on worker statuses as well as the failure information collected by FuxiAgent. This estimation is also performed in a multi-level way. In particular, if an instance is reported failed on a machine, the machine is added to that instance's blacklist; if a machine is marked bad by a certain number of instances, it is added to the task's blacklist and no longer used by that task. Across different jobs, FuxiMaster turns a machine into disabled mode if the same machine is marked bad by different JobMasters. To avoid abuse of this bad-machine detection and blacklisting, an upper bound can be configured.

To deal with long-tail instances, we also adopt a backup instance scheme that launches another instance with the same input data when the original instance is found to be running much slower than the others. There are three criteria for the backup instance scheme. Firstly, the majority of instances (e.g., 90%) must have finished, so that identifying long-tail instances and estimating the average instance running time is meaningful. Secondly, the long-tail instance must have already run several times longer than the average instance running time estimated from the finished instances. Finally, an instance may legitimately run for a long time because of input data skew; to distinguish such instances from true long tails, users should also specify a normal running time for the instances when configuring the backup instance scheme in the job description.
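A minimal sketch of the bottom-up escalation just described is given below: failures are recorded per instance, promoted to a task-level blacklist, and escalated to a cluster-wide decision only when several jobs agree, subject to a configurable upper bound. The concrete thresholds are illustrative assumptions; the paper specifies only the escalation order and the upper bound.

    from collections import defaultdict

    class Blacklist:
        """Bottom-up blacklist escalation: instance -> task -> job -> cluster."""
        def __init__(self, task_threshold=3, job_threshold=2, max_disabled=10):
            self.instance_marks = defaultdict(set)   # machine -> {(task, instance)} that failed there
            self.task_blacklist = defaultdict(set)   # task -> machines it refuses to use
            self.job_marks = defaultdict(set)        # machine -> jobs that marked it bad
            self.disabled = set()                    # cluster-wide disabled machines
            self.task_threshold = task_threshold
            self.job_threshold = job_threshold
            self.max_disabled = max_disabled         # upper bound to avoid abuse

        def report_instance_failure(self, job, task, instance, machine):
            self.instance_marks[machine].add((task, instance))
            failed_instances_of_task = {i for t, i in self.instance_marks[machine] if t == task}
            if len(failed_instances_of_task) >= self.task_threshold:
                self.task_blacklist[task].add(machine)
                self.job_marks[machine].add(job)
            # Escalate to a cluster-wide decision only if several jobs agree
            # and the configured upper bound has not been reached.
            if (len(self.job_marks[machine]) >= self.job_threshold
                    and len(self.disabled) < self.max_disabled):
                self.disabled.add(machine)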
4.4 Large Scale Handling

There are two sources of challenges for large-scale jobs: the performance of scheduling a huge number of instances while taking optimal data locality and load balance into account, and the enormous number of small shuffle files between massive numbers of instances. In this section, we present several techniques used to scale out the job framework in our production environment.

To solve the scale problem, we design a two-level hierarchical scheduling model in the Fuxi Job framework. Technically, there is only one JobMaster object for each job, and it takes charge of the high-level task scheduling. The JobMaster first parses the job description and analyzes the shuffle pipes to figure out the topological order of the tasks. Each time, only the tasks whose input data are ready can be scheduled and executed. The JobMaster handles the communication with FuxiMaster for resource negotiation and with the FuxiAgents to start and stop workers. This hierarchical model is illustrated in Figure 8.

Figure 8: Hierarchical Job Master.

When the JobMaster intends to execute a task, an individual TaskMaster object is created. The TaskMaster conducts the fine-grained instance scheduling that determines on which worker each instance executes. For the instance scheduler, we design an efficient algorithm that takes the following factors into account: a) instances are scheduled to the worker with the most local input data; b) instances are spread uniformly over the available workers, balancing the computing and network load; c) the scheduling is performed incrementally, scanning only the still-unassigned instances each time. We have observed that less than 3 seconds is needed to schedule 100 thousand instances, which demonstrates the effectiveness of the proposed scheduling algorithm. The scheduled instances are then sent to the selected task workers, on which they are actually executed. In summary, we design a hierarchical job scheduling model which decouples DAG task scheduling from task instance scheduling, efficiently reinforcing the parallelism of task execution. The experimental results show significant benefits from this model for large-scale jobs.
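The sketch below captures the three instance-scheduling factors of Section 4.4 in a few lines of Python; the data layout (a locality map keyed by instance/worker pairs and a simple load counter) is an assumption made for illustration, not the TaskMaster's actual implementation.

    def schedule_instances(unassigned, workers, locality):
        """Assign task instances to workers following the factors in Section 4.4.

        unassigned: list of instance ids not yet placed (only these are scanned)
        workers:    dict worker_id -> number of instances already assigned (load)
        locality:   dict (instance, worker) -> bytes of local input data (0 if absent)
        Returns a dict instance -> worker; `workers` is updated in place.
        """
        assignment = {}
        for inst in list(unassigned):
            # a) prefer the worker holding the most local input data for this instance,
            # b) break ties by current load so instances spread uniformly over workers.
            best = max(workers, key=lambda w: (locality.get((inst, w), 0), -workers[w]))
            assignment[inst] = best
            workers[best] += 1
            unassigned.remove(inst)   # c) the next round re-scans only what is still unassigned
        return assignment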
5. EVALUATION

In this section, we discuss the actual performance of the Fuxi system to verify its feasibility and efficiency. We first present our practice in the real production environment. We then evaluate the scheduling performance when 1,000 synthetic workloads are submitted simultaneously in our cluster. Next, sort benchmarks are used to illustrate the performance and capability of task execution. Finally, several fault-injection experiments are conducted to demonstrate the robustness of the system and its fault tolerance mechanisms.

Our testbed is formed by 5,000 servers, each consisting of a 2.20GHz 6-core Xeon E5-2430, 96GB memory and 12x2TB disks. Machines are connected via two gigabit Ethernet ports. All machines run a version of Linux.

5.1 Fuxi in Production

Fuxi plays a significant role in Alibaba's production system and has been deployed there since 2009. Resources in our daily production scenarios contain 7 dimensions, i.e., CPU, memory and 5 other types of virtual resources. Table 1 shows the statistics of our workloads based on the trace log collected from one production cluster over a short period of time. The trace contains 91,990 jobs and over 185,000 tasks, comprising 42 million parallel instances in total. The workload was scheduled onto approximately 16.3 million workers. The most complex jobs have 150 tasks, and the largest task has almost 10,000 instances, requiring over 4,600 workers to execute.

Table 1: Statistics on a production cluster.
                      avg          max            total
  Instance Number     228/task     99,937/task    42,266,899
  Worker Number       87.92/task   4,636/task     16,295,167
  Task Number         2.0/job      150/job        185,444

5.2 Synthetic Workloads Experiment

5.2.1 Experiment Setup

To evaluate the capability of the Fuxi system to handle scalability and performance issues, we keep 1,000 jobs running concurrently by starting a new job whenever one finishes. To simplify the experiment, we use WordCount and Terasort jobs with the following specifications, evenly distributed: the numbers of map and reduce instances are (10,10), (100,10), (100,100), (1k,100), (1k,1k) and (10k,5k) respectively. The average execution time ranges from 10 seconds to 10 minutes, and each instance's resource request is configured as 0.5 CPU cores with 2GB of memory.

5.2.2 Results Evaluation

We take the following metrics into consideration: scheduling time, resource utilization and scheduling overheads.

FuxiMaster Scheduling Time: We monitor the time FuxiMaster spends scheduling each request. Figure 9 reveals that the request scheduling time begins to rise as the experiment starts, and the average value is merely 0.88ms in spite of a slight fluctuation. Transient workload surges therefore do not cause overall performance degradation. Even the peak time consumed by scheduling is no more than 3ms, indicating that the system is quick enough to reclaim allocated resources and respond to incoming requests in a timely manner. To summarize, millisecond-level performance is quite reasonable for the overall scheduling process in a production-scale environment.

Figure 9: FuxiMaster scheduling time with 1,000 concurrent jobs.

Resource Utilization: The memory consumption metrics of the whole computing cluster are illustrated in Figure 10(a). FM_total represents the total available resource: 442TB of memory in sum among all servers can be scheduled by FuxiMaster. The total amount of resources assigned to all application masters is shown as FM_planned, which is roughly 429.26TB, indicating that 97.1% of the resources are initially utilized by the scheduler.

Additionally, the memory acquired by all application masters is shown as AM_obtained. FuxiAgent receives process plans from the application masters, and FA_planned shows the total resources consumed by all these processes. The average results for these two metrics are 424.56TB and 421.52TB, achieving 95.9% and 95.2% utilization of the overall available memory respectively.

The gaps among these curves can be regarded as the overhead of the master's ability to process requests; they are formed by cumulative scheduling effects and communication delays across diverse instances. Clearly, higher throughput and more effective request handling in the master lead to a smaller resource usage gap. We can also conclude that most of the job requirements can be satisfied, because the marginal gap is small enough to be nearly negligible.

Moreover, the results reveal a very similar behavior for the cluster CPU utilization, shown in Figure 10(b), reaching 92.3% and 91.3% CPU resource usage. The real memory usage on average is approximately 40%; the main reason for this is users' resource over-estimation. The real CPU usage is less than 10%, because our synthetic workloads are memory-intensive with only slight CPU stress. Further improvements are left to future work.

Figure 10: The planned memory (a) and CPU (b) usage by FuxiMaster when 1,000 jobs are simultaneously launched.

Scheduling Overhead: Lower overhead indicates rapid placement and effective turnover, allowing more task instances to be processed. Here we consider the following overheads, which are relevant to scheduling effects:
1) JobMaster Start Overhead: the duration from when the job execution RPC call is invoked to when the corresponding application master starts.
2) Worker Start Overhead: the latency between when the application master plans to start a worker and when it receives the first status report from that worker.
3) Instance Running Overhead: the difference between the instance running time observed on the application master and that observed on the worker.

Table 2: Scheduling overhead when 1,000 simultaneous jobs are launched.
  Types                        Avg values (s)
  Job Running Time             359.89
  JobMaster Start Overhead     1.91
  Worker Start Overhead        11.84
  Instance Running Overhead    0.33

As shown in Table 2, the total overhead is only 3.9%. The latency is mainly caused by the communication delay among distributed nodes, as well as by scheduling throughput when a large number of concurrent requests arrive at FuxiMaster and cause congestion. The worker start overhead is relatively high due to downloading the worker binaries (400MB on average).

5.3 Sort Benchmark

GraySort is a common benchmark used to evaluate the efficiency of task execution systems. We sort a 100TB dataset and compare with other published results [7]. Table 4 shows the comparison: our end-to-end execution time is 2,538s, which is equivalent to a sort throughput of 2.364TB/minute, a 66.5% improvement over the most advanced Hadoop implementation by Yahoo!. This significant performance improvement benefits from the effective job scheduling mechanisms presented in this paper. We also evaluate a PetaSort benchmark on a 2,800-node cluster with 33,600 disks; the uncompressed data size is 1 petabyte with a 1x sort spill compression factor. The elapsed time is 6 hours, comparable with Google's result in 2008 [5], which shares a similar architecture with Fuxi.
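As a quick sanity check on the reported numbers: 100 TB sorted in 2,538 s corresponds to 100 / 2538 × 60 ≈ 2.364 TB/min, and 2.364 / 1.42 ≈ 1.66, consistent with the quoted 66.5% improvement over the 1.42 TB/min Hadoop result listed in Table 4.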
5.4 Performance of Fault Handling

To evaluate the fault-handling abilities of the Fuxi system, we design the following scenarios and simulate them by injecting faults into our 300-node cluster (an illustrative injection driver is sketched at the end of this subsection):
NodeDown: a machine halts unexpectedly. We simulate this by randomly shutting down servers while the workload is running.
PartialWorkerFailure: disk I/O hangs, unstable network connections and similar problems inevitably lead to unresponsive workers. We simulate this by corrupting disks so that worker processes cannot be launched.
SlowMachine: we deliberately add sleep intervals to the worker program to mimic system slowdown.
FuxiMasterFailure: we shut down the server on which FuxiMaster runs.

As shown in Table 3, we generate 5% and 10% failure scenarios by combining the above fault types across different numbers of machines. The results show that the execution time extends to 1,662s and 1,762s respectively, producing 15.7% and 19.6% overhead compared with the normal execution time (1,437s). We attribute this modest overhead to our multilevel faulty-node detection and backup-instance mechanisms. Considering the high incidence of failures in practice, this extra time is well worth spending to guarantee that the entire workload completes. Finally, on top of the 5% failure scenario we additionally terminate FuxiMaster once; this adds only an extra 13s, which is almost negligible.
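The fault scenarios above can be driven by a small injection script. The sketch below is purely illustrative: the ssh-based commands, host list, fault mix and the flag-file slowdown mechanism are assumptions made for exposition and are not the tooling actually used in our experiments.

import random
import subprocess

# Hypothetical host list and fault mix; the mix would be chosen so that
# roughly 5% or 10% of the 300 nodes are affected during the run.
HOSTS = [f"node{i:03d}" for i in range(300)]
FAULTS = {
    "node_down":  "sudo shutdown -h now",                               # machine halts unexpectedly
    "disk_fault": "sudo dd if=/dev/urandom of=/dev/sdb bs=1M count=1",  # corrupt a data disk
    "slow":       "touch /tmp/fuxi_worker_sleep",                       # assumed flag: worker sleeps while it exists
}

def inject(fraction: float, seed: int = 0) -> None:
    """Apply one randomly chosen fault to `fraction` of the cluster nodes."""
    rng = random.Random(seed)
    victims = rng.sample(HOSTS, int(len(HOSTS) * fraction))
    for host in victims:
        fault, cmd = rng.choice(list(FAULTS.items()))
        print(f"injecting {fault} on {host}")
        subprocess.run(["ssh", host, cmd], check=False)

if __name__ == "__main__":
    inject(0.05)   # 5% failure scenario; FuxiMaster itself is terminated separately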

Table 4: GraySort Indy result comparison

Provenance         | Configuration                                                                                      | GraySort Indy result
Fuxi (2013)        | 5,000 nodes (2 x 2.20GHz 6-core Xeon E5-2430, 96 GB memory, 12x2TB disks)                          | 100 TB in 2,538 seconds (2.364 TB/min)
Yahoo! Inc. (2012) | 2,100 nodes (2 x 2.3GHz hex-core Xeon E5-2630, 64 GB memory, 12x3TB disks)                         | 102.5 TB in 4,328 seconds (1.42 TB/min)
UCSD (2011)        | 52 nodes (2 quad-core processors, 24 GB memory, 16x500GB disks), Cisco Nexus 5096 switch           | 100 TB in 6,395 seconds (0.938 TB/min)
UCSD & VUT (2010)  | 47 nodes (2 quad-core processors, 24 GB memory, 16x500GB disks), Cisco Nexus 5020 switch           | 100 TB in 10,318 seconds (0.582 TB/min)
KIT (2009)         | 195 nodes (2 quad-core processors, 16 GB memory, 4x250GB disks), 288-port InfiniBand 4xDDR switch  | 100 TB in 10,628 seconds (0.564 TB/min)

6. RELATED WORKS

In this section, we compare the Fuxi system against existing alternatives such as Mesos [11], Yarn [18], Omega [16] and Sparrow [14] with respect to resource sharing and management.

Both Mesos [11] and Yarn [18] share the same two-level scheduling architecture. Mesos provides an offer-based mechanism in which the primary master decides the resource assignment for each individual computing framework. In contrast, Yarn and our Fuxi system adopt a request-based strategy, which facilitates location-aware allocation and further local optimization: the application master can estimate its own needs and make decisions during the resource allocation process. However, an application master in Yarn has to give up assigned resources and cannot reuse them for waiting tasks, due to the resource reclaim mechanism in the resource manager.

As for fault tolerance, the Mesos master shares some similarities with FuxiMaster's failover, but Mesos leaves the handling of faulty nodes and of individual framework scheduler failures entirely to the frameworks themselves, and Mesos slave failover is not described in detail. Moreover, the failover mechanisms for the Resource Manager and the application master in Yarn are not efficient enough to support larger-scale resource management, because mandatorily terminating and re-running applications and processes results in substantial waste and overhead.

Omega [16] is a shared-state scheduler that emphasizes distribution and scalability. Its core idea is lock-free optimistic concurrency control over shared state that is collected in a timely fashion from the cluster nodes. In particular, Omega adopts a multi-version concurrency control mechanism, which substantially improves the concurrency of the system; each framework can compete for the same piece of resource, with a central coordinating component for arbitration. However, the evaluation of this model is based only on simulations and still needs to be demonstrated in a real production environment.

Sparrow [14] targets a decentralized scheduler design to significantly improve the performance of parallel jobs consisting of a large number of short tasks. With scalability and high availability in mind, multiple schedulers may run concurrently, and all tasks of a job can be scheduled together rather than one by one. Sparrow aims to reduce latency and improve throughput when scheduling sub-second tasks. In a large-scale computing environment, however, resource utilization is a crucial metric of system efficiency, and these issues are not addressed in Sparrow. Beyond short tasks, Fuxi also supports general-purpose task models, including DAG tasks and long-running services.

Condor [17] is representative of a large category of HPC scheduler solutions. In such systems a job does not need to schedule its tasks close to their data, whereas data locality is a key concern in Fuxi's job scheduling mechanism. Furthermore, the resource request of an HPC application is announced when the job is launched and never changes; in contrast, Fuxi dynamically allocates and revokes resources to improve utilization.

7. CONCLUSION AND FUTURE WORK

We have presented Fuxi, a distributed resource management and job scheduling system at Alibaba. We proposed three novel techniques that allow Fuxi to tackle scalability and fault-tolerance issues at Internet scale: an incremental resource management protocol, a user-transparent failure recovery mechanism, and a faulty-node detection and multi-dimensional blacklisting mechanism. Additionally, we presented various practical considerations and optimization techniques based on our experience of running Fuxi over several hundred thousand server nodes since 2009. We employed a combination of synthetic workloads and the GraySort benchmark to evaluate the effectiveness of the proposed techniques. We found that Fuxi can deliver 95% memory and 91% CPU scheduled resource utilization under stress load while maintaining sub-millisecond response times, and we beat the previous GraySort record by a margin of 67%. Under 5% and 10% fault-injection rates, the GraySort benchmark slows down by only 16% and 20% respectively.

Much future work remains. Our current scheduling seeks to maximize scheduled resources as requested by applications. However, an application typically overestimates its resource demands, which leaves a big gap between real system utilization and scheduled utilization. We also plan to work on other features such as enforcing I/O and networking resource constraints, fine-tuning the scheduling algorithms to improve backup-task scheduling and to guard against starvation in corner cases, and better support for short or even interactive jobs.

8. ACKNOWLEDGMENTS

We would like to thank our colleague Jiamang Wang for his elaborate assistance with the experiments, and our colleague Jin Ouyang and Peter Garraghan from the University of Leeds for their kind suggestions. We would also like to extend our sincere thanks to the entire Fuxi and Apsara teams. Finally, we thank the anonymous VLDB reviewers for their helpful feedback. Renyu Yang is supported by the National Basic Research Program of China (No. 2011CB302602), the China 863 Program (No. 2013AA01A213, 2011AA01A202) and the National Natural Science Foundation of China (No. 91118008, 61170294).

9. REFERENCES

[1] Alibaba Cloud Computing. http://www.aliyun.com/.
[2] Apache Tez. http://hortonworks.com/hadoop/tez/.
[3] Cgroup. http://en.wikipedia.org/wiki/Cgroups.
[4] Fuxi. http://en.wikipedia.org/wiki/Fu_Xi.
[5] Google Petabyte Result. http://www.datacenterknowledge.com/archives/2008/11/24/google-sorts-1-petabyte-of-data-in-6-hours/.
[6] ODPS: Open Data Processing Service. http://www.aliyun.com/product/odps/.
[7] Sort Benchmark. http://sortbenchmark.org/.
[8] Hadoop at Yahoo! Sets New Gray Sort Record. https://developer.yahoo.com/blogs/hadoop/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-180650399.html, 2013.
[9] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing (TDSC), 1(1), 2004.
[10] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2), 2013.
[11] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proc. NSDI. USENIX, 2011.
[12] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. EuroSys. ACM, 2007.
[13] A. McAfee and E. Brynjolfsson. Big data: The management revolution. Harvard Business Review, October 2012.
[14] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proc. SOSP. ACM, 2013.
[15] R. K. Sahoo, M. S. Squillante, A. Sivasubramaniam, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proc. DSN. IEEE, 2004.
[16] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proc. EuroSys. ACM, 2013.
[17] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4), 2005.
[18] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In Proc. SoCC. ACM, 2013.
