Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems Andrey Goder, Alexey Spiridonov, and Yin Wang, Facebook, Inc

Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems Andrey Goder, Alexey Spiridonov, and Yin Wang, Facebook, Inc. https://www.usenix.org/conference/atc15/technical-session/presentation/goder This paper is included in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIC ATC ’15). July 8–10, 2015 • Santa Clara, CA, USA ISBN 978-1-931971-225 Open access to the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC ’15) is sponsored by USENIX. Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems Andrey Goder Alexey Spiridonov Yin Wang {agoder, lesha, yinwang}@fb.com Facebook, Inc Abstract fline data is ever used, and it is used only once. Further- more, some batch jobs cannot be run on copies of data; Data-intensive batch jobs increasingly compete for re- e.g., bulk database updates must mutate the online data. sources with customer-facing online workloads in mod- Running batch jobs directly on live customer-facing pro- ern data centers. Today, the two classes of workloads run duction systems has the potential to dramatically improve on separate infrastructures using different resource man- efficiency in modern environments such as Facebook. agers that pursue different objectives. Batch processing Unfortunately, existing batch job schedulers are mostly systems strive for coarse-grained throughput whereas on- designed for offline operation [13, 15, 25, 28, 29, 32, 33, line systems must keep the latency of fine-grained end- 36,40,42] and are ill-suited to online systems. First, they user requests low. Better resource management would do not support hard constraints on the burdens that jobs allow both batch and online workloads to share infras- place upon data hosts. The former may overload the lat- tructure, reducing hardware and eliminating the inefficient ter, which is unacceptable for online data hosts serving and error-prone chore of creating and maintaining copies end users. Second, batch schedulers assume immutable of data. This paper describes Facebook’s Bistro, a sched- data, whereas online data changes frequently. Finally, a uler that runs data-intensive batch jobs on live, customer- batch scheduler typically assumes a specific offline data facing production systems without degrading the end-user ecosystem, whereas online systems access a wide range experience. Bistro employs a novel hierarchical model of data sources and storage systems. of data and computational resources. The model enables Facebook formerly ran many data-intensive jobs on Bistro to schedule workloads efficiently and adapt rapidly Hadoop, which illustrates the shortcomings of conven- to changing configurations. At Facebook, Bistro is replac- tional batch schedulers. Hadoop tasks can draw data from ing Hadoop and custom-built batch schedulers, allowing data hosts outside the Hadoop cluster, which can easily batch jobs to share infrastructure with online workloads overload data hosts serving online workloads. Our engi- without harming the performance of either. neers formerly resorted to cumbersome distributed lock- ing schemes that manually encoded resource constraints into Hadoop tasks themselves. Dynamically changing 1 Introduction data poses still further challenges to Hadoop, because there is no easy way to update a queued Hadoop job; even Facebook stores a considerable amount of data in many pausing and resuming a job is difficult. Finally, Hadoop different formats, and frequently runs batch jobs that pro- is tightly integrated with HDFS, which hosts only a small cess, transform, or transfer data. Possible examples in- portion of our data; moving/copying the data is very inef- clude re-encoding billions of videos, updating trillions of ficient. Over the years, our engineers responded to these rows in databases to accommodate application changes, limitations by developing many ad hoc schedulers tai- and migrating petabytes of data among various BLOB lored to specific jobs; developing and maintaining per-job storage systems [11,12,31]. schedulers is very painful. On the positive side, the ex- Typical large-scale data processing systems such as perience of developing many specialized schedulers has Hadoop [41] run against offline copies of data. Creating given us important insights into the fundamental require- copies of data is awkward, slow, error-prone, and some- ments of typical batch jobs. The most important obser- times impossible due to the size of the data; maintain- vation is that many batch jobs are “map-only” or “map- ing offline copies is even more inefficient. These troubles heavy,” in the sense that they are (mostly) embarrassingly overshadow the benefits if only a small portion of the of- parallel. This in turn has allowed us to develop a sched- USENIX Association 2015 USENIX Annual Technical Conference 459 loads. We compare Bistro with a brute-force scheduling resources jobs solution, and describe several production applications of name capacity video re-encoding volume compact Bistro at Facebook. Finally, Bistro is available as open- volume 1 lock 1 lock 1 lock source software [2]. host 200 IOPS 100 IOPS 200 IOPS rack 40 Gbps 1 Gbps 0 Gbps Section 2 discusses the scheduling problem, resource model, and the scheduling algorithm of Bistro. Section 3 (a) Resource capacity and job consumption explains the implementation details of Bistro. Section 4 includes performance experiments and production applications, followed by related work and conclusion in Sec- tions 5 and 6, respectively. 2 Scheduling (b) Forest resource model Bistro schedules data-parallel jobs against online clus- Figure 1: The scheduling problem: maximum resource ters. This section describes the scheduling problem, utilization subject to hierarchical resource constraints. Bistro’s solution, and extensions. uler that trades some of the generality of existing batch 2.1 The Scheduling Problem schedulers for benefits heretofore unavailable. First we define a few terms. A job is the overall work to This paper describes Bistro, a scheduler that allows of- be completed on a set of data shards, e.g., re-encoding all fline batch jobs to share clusters with online customer- videos in Haystack [12], one of Facebook’s proprietary facing workloads without harming the performance of ei- BLOB storage systems. A task is the work to be com- ther. Bistro treats data hosts and their resources as first- pleted on one data shard, e.g., re-encoding all videos on class objects subject to hard constraints and models re- one Haystack volume. So, a job is a set of tasks, one per sources as a hierarchical forest of resource trees. Admin- data shard. A worker is the process that performs tasks. istrators specify total resource capacity and per job con- A scheduler is the process that dispatches tasks to work- sumption at each level. Bistro schedules as many tasks ers. A worker host, data host, or scheduler host is the onto leaves as possible while satisfying their resource re- computer that runs worker processes, stores data shards, quirements along the paths to roots. or runs scheduler processes, respectively. The forest model conveniently captures hierarchical re- Consider the scheduling problem depicted in Figure 1a. source constraints, and allows the scheduler to flexibly We want to perform two jobs on data stored in Haystack: accommodate varying data granularity, resource fluctu- video re-encoding and volume compaction. The for- ations, and job changes. Partitioning the forest model mer employs advanced compression algorithms for bet- corresponds to partitioning the underlying resource pools, ter video quality and space efficiency. The latter runs which makes it easy to scale both jobs and clusters relative periodically to recycle the space of deleted videos for to one another, and allows concurrent scheduling for bet- Haystack’s append-only storage. Tasks of both jobs op- ter throughput. Bistro reduces infrastructure hardware by erate at the granularity of data volumes. To avoid disrupt- eliminating the need for separate online and batch clus- ing production traffic, we constrain the resource capacity ters while improving efficiency by eliminating the need and job consumption in Figure 1a, which effectively al- to copy data between the two. Since its inception at lows at most two video re-encoding tasks or one volume Facebook two years ago, Bistro has replaced Hadoop and compaction task per host, and twenty video re-encoding custom-built schedulers in many production systems, and tasks per rack. has processed trillions of rows and petabytes of data. Volumes, hosts, and racks naturally form a forest by Our main contributions are the following: We define their physical relationships, illustrated in Figure 1b. Job a class of data-parallel jobs with hierarchical resource tasks correspond to the leaf nodes of trees in this forest; constraints at online data resources. We describe Bistro, each job must complete exactly one task per leaf. This a novel tree-based scheduler that safely runs such batch implies a one-to-one relationship between tasks and data jobs “in the background” on live customer-facing pro- units (shards, volumes, databases, etc.), which is the com- duction systems without harming the “foreground” work- mon case for data-parallel processing at Facebook. A task 460 2015 USENIX Annual Technical Conference USENIX Association Algorithm 1 Brute-force scheduling algorithm (baseline) 1: procedure SCHEDULEONE(M) 2: for job Ji ∈J do 3: for node l ∈ all leaf nodes of M do l 4: if task ti has not finished and there are enough Figure 2: Head-of-line blocking using FIFO queue for resources along l to root then k l our scheduling problem. Here t2 can block the rest of 5: update resource(M,ti) k l the queue while waiting for the lock on vol k held by t1 , 6: run(ti) and possibly other resources at higher-level nodes. 7: end if 8: end for requires resources at each node along the path from leaf 9: end for to root. In this figure, the two jobs have a total of four 10: end procedure running tasks.

Load more