Ray: A Distributed Framework for Emerging AI Applications
https://www.usenix.org/conference/osdi18/presentation/nishihara

Philipp Moritz,∗ Robert Nishihara,∗ Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
University of California, Berkeley
∗Equal contribution

Abstract

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray—a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.

1 Introduction

Over the past two decades, many organizations have been collecting—and aiming to exploit—ever-growing quantities of data. This has led to the development of a plethora of frameworks for distributed data analysis, including batch [20, 64, 28], streaming [15, 39, 31], and graph [34, 35, 24] processing systems. The success of these frameworks has made it possible for organizations to analyze large data sets as a core part of their business or scientific strategy, and has ushered in the age of "Big Data."

More recently, the scope of data-focused applications has expanded to encompass more complex artificial intelligence (AI) or machine learning (ML) techniques [30]. The paradigm case is that of supervised learning, where data points are accompanied by labels, and where the workhorse technology for mapping data points to labels is provided by deep neural networks. The complexity of these deep networks has led to another flurry of frameworks that focus on the training of deep neural networks and their use in prediction. These frameworks often leverage specialized hardware (e.g., GPUs and TPUs), with the goal of reducing training time in a batch setting. Examples include TensorFlow [7], MXNet [18], and PyTorch [46].

The promise of AI is, however, far broader than classical supervised learning. Emerging AI applications must increasingly operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish long-term goals [8, 43]. They must aim not only to exploit the data gathered, but also to explore the space of possible actions. These broader requirements are naturally framed within the paradigm of reinforcement learning (RL). RL deals with learning to operate continuously within an uncertain environment based on delayed and limited feedback [56]. RL-based systems have already yielded remarkable results, such as Google's AlphaGo beating a human world champion [54], and are beginning to find their way into dialogue systems, UAVs [42], and robotic manipulation [25, 60].

The central goal of an RL application is to learn a policy—a mapping from the state of the environment to a choice of action—that yields effective performance over time, e.g., winning a game or piloting a drone. Finding effective policies in large-scale applications requires three main capabilities. First, RL methods often rely on simulation to evaluate policies. Simulations make it possible to explore many different choices of action sequences and to learn about the long-term consequences of those choices. Second, like their supervised learning counterparts, RL algorithms need to perform distributed training to improve the policy based on data generated through simulations or interactions with the physical environment. Third, policies are intended to provide solutions to control problems, and thus it is necessary to serve the policy in interactive closed-loop and open-loop control scenarios.

These characteristics drive new systems requirements: a system for RL must support fine-grained computations (e.g., rendering actions in milliseconds when interacting with the real world, and performing vast numbers of simulations), must support heterogeneity both in time (e.g., a simulation may take milliseconds or hours) and in resource usage (e.g., GPUs for training and CPUs for simulations), and must support dynamic execution, as results of simulations or interactions with the environment can change future computations. Thus, we need a dynamic computation framework that handles millions of heterogeneous tasks per second at millisecond-level latencies.

Figure 1: Example of an RL system. (The figure shows an agent interacting with an environment: serving applies the current policy to map states/observations to actions, simulation evaluates the policy to produce a trajectory s0, (s1, r1), …, (sn, rn), and training (e.g., SGD) improves the policy from those trajectories.)

Existing frameworks that have been developed for Big Data workloads or for supervised learning workloads fall short of satisfying these new requirements for RL. Bulk-synchronous parallel systems such as MapReduce [20], Spark [64], and Dryad [28] do not support fine-grained simulation or policy serving. Task-parallel systems such as CIEL [40] and Dask [48] provide little support for distributed training and serving. The same is true for streaming systems such as Naiad [39] and Storm [31]. Distributed deep-learning frameworks such as TensorFlow [7] and MXNet [18] do not naturally support simulation and serving. Finally, model-serving systems such as TensorFlow Serving [6] and Clipper [19] support neither training nor simulation.

While in principle one could develop an end-to-end solution by stitching together several existing systems (e.g., Horovod [53] for distributed training, Clipper [19] for serving, and CIEL [40] for simulation), in practice this approach is untenable due to the tight coupling of these components within applications. As a result, researchers and practitioners today build one-off systems for specialized RL applications [58, 41, 54, 44, 49, 5]. This approach imposes a massive systems engineering burden on the development of distributed applications by essentially pushing standard systems challenges, such as scheduling, fault tolerance, and data movement, onto each application.

In this paper, we propose Ray, a general-purpose cluster-computing framework that enables simulation, training, and serving for RL applications. The requirements of these workloads range from lightweight and stateless computations, such as for simulation, to long-running and stateful computations, such as for training. To satisfy these requirements, Ray implements a unified interface that can express both task-parallel and actor-based computations. Tasks enable Ray to efficiently and dynamically load balance simulations, process large inputs and state spaces (e.g., images, video), and recover from failures. In contrast, actors enable Ray to efficiently support stateful computations, such as model training, and expose shared mutable state to clients (e.g., a parameter server). Ray implements the actor and the task abstractions on top of a single dynamic execution engine that is highly scalable and fault tolerant.

To meet the performance requirements, Ray distributes two components that are typically centralized in existing frameworks [64, 28, 40]: (1) the task scheduler and (2) a metadata store which maintains the computation lineage and a directory for data objects. This allows Ray to schedule millions of tasks per second with millisecond-level latencies. Furthermore, Ray provides lineage-based fault tolerance for tasks and actors, and replication-based fault tolerance for the metadata store.

While Ray supports serving, training, and simulation in the context of RL applications, this does not mean that it should be viewed as a replacement for systems that provide solutions for these workloads in other contexts. In particular, Ray does not aim to substitute for serving systems like Clipper [19] and TensorFlow Serving [6], as these systems address a broader set of challenges in deploying models, including model management, testing, and model composition. Similarly, despite its flexibility, Ray is not a substitute for generic data-parallel frameworks, such as Spark [64], as it currently lacks the rich functionality and APIs (e.g., straggler mitigation, query optimization) that these frameworks provide.

We make the following contributions:

• We design and build the first distributed framework that unifies training, simulation, and serving—necessary components of emerging RL applications.

• To support these workloads, we unify the actor and task-parallel abstractions on top of a dynamic task execution engine.

• To achieve scalability and fault tolerance, we propose a system design principle in which control state is stored in a sharded metadata store and all other system components are stateless.

• To achieve scalability, we propose a bottom-up distributed scheduling strategy.

2 Motivation and Requirements

We begin by considering the basic components of an RL system and fleshing out the key requirements for Ray. As shown in Figure 1, in an RL setting, an agent interacts repeatedly with the environment. The goal of the agent is to learn a policy that maximizes a reward. A policy is a mapping from the state of the environment to a choice of action. The precise definitions of environment, agent, state, action, and reward are application-specific.

To learn a policy, an agent typically employs a two-step process: (1) policy evaluation and (2) policy improvement. To evaluate the policy, the agent interacts with the environment (e.g., with a simulation of the environment) to generate trajectories, where a trajectory consists of a sequence of (state, reward) tuples produced by the current policy. Then, the agent uses these trajectories to improve the policy, i.e., to update the policy in the direction of the gradient that maximizes the reward. Figure 2 shows an example of the pseudocode used by an agent to learn a policy. This pseudocode evaluates the policy by invoking rollout(policy, environment) to generate trajectories. train_policy() then uses these trajectories to improve the current policy via policy.update(trajectories). This process repeats until the policy converges.

// evaluate policy by interacting with env. (e.g., simulator)
rollout(policy, environment):
    trajectory = []
    state = environment.initial_state()
    while (not environment.has_terminated()):
        action = policy.compute(state)             // Serving
        state, reward = environment.step(action)   // Simulation
        trajectory.append(state, reward)
    return trajectory

// improve policy iteratively until it converges
train_policy(environment):
    policy = initial_policy()
    while (policy has not converged):
        trajectories = []
        for i from 1 to k:
            // evaluate policy by generating k rollouts
            trajectories.append(rollout(policy, environment))
        policy = policy.update(trajectories)       // Training
    return policy

Figure 2: Typical RL pseudocode for learning a policy.

Thus, a framework for RL applications must provide efficient support for training, serving, and simulation (Figure 1). Next, we briefly describe these workloads.

Training typically involves running stochastic gradient descent (SGD), often in a distributed setting, to update the policy. Distributed SGD typically relies on an allreduce aggregation step or a parameter server [32].

Serving uses the trained policy to render an action based on the current state of the environment. A serving system aims to minimize latency and maximize the number of decisions per second. To scale, load is typically balanced across multiple nodes serving the policy.

Finally, most existing RL applications use simulations to evaluate the policy—current RL algorithms are not sample-efficient enough to rely solely on data obtained from interactions with the physical world. These simulations vary widely in complexity. They might take a few milliseconds (e.g., simulating a move in a chess game) to minutes (e.g., simulating a realistic environment for a self-driving car).

In contrast with supervised learning, in which training and serving can be handled separately by different systems, in RL all three of these workloads are tightly coupled in a single application, with stringent latency requirements between them. Currently, no framework supports this coupling of workloads. In theory, multiple specialized frameworks could be stitched together to provide the overall capabilities, but in practice, the resulting data movement and latency between systems is prohibitive in the context of RL. As a result, researchers and practitioners have been building their own one-off systems.

This state of affairs calls for the development of new distributed frameworks for RL that can efficiently support training, serving, and simulation. In particular, such a framework should satisfy the following requirements:

Fine-grained, heterogeneous computations. The duration of a computation can range from milliseconds (e.g., taking an action) to hours (e.g., training a complex policy). Additionally, training often requires heterogeneous hardware (e.g., CPUs, GPUs, or TPUs).

Flexible computation model. RL applications require both stateless and stateful computations. Stateless computations can be executed on any node in the system, which makes it easy to achieve load balancing and movement of computation to data, if needed. Thus stateless computations are a good fit for fine-grained simulation and data processing, such as extracting features from images or videos. In contrast, stateful computations are a good fit for implementing parameter servers, performing repeated computation on GPU-backed data, or running third-party simulators that do not expose their state.

Dynamic execution. Several components of RL applications require dynamic execution, as the order in which computations finish is not always known in advance (e.g., the order in which simulations finish), and the results of a computation can determine future computations (e.g., the results of a simulation will determine whether we need to perform more simulations).

We make two final comments. First, to achieve high utilization in large clusters, such a framework must handle millions of tasks per second.∗ Second, such a framework is not intended for implementing deep neural networks or complex simulators from scratch. Instead, it should enable seamless integration with existing simulators [13, 11, 59] and deep learning frameworks [7, 18, 46, 29].

∗Assume 5ms single-core tasks and a cluster of 200 32-core nodes. This cluster can run (1s/5ms) × 32 × 200 = 1.28M tasks/sec.
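For concreteness, the pseudocode in Figure 2 can be fleshed out into the following self-contained Python sketch. The toy Environment and Policy classes, the learning rule, and all constants are illustrative placeholders rather than part of any real RL algorithm or of Ray.

import math
import random

class Environment:
    """Toy 1-D environment: the agent should learn to pick action +1."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0
    def initial_state(self):
        self.t = 0
        return 0.0
    def has_terminated(self):
        return self.t >= self.horizon
    def step(self, action):
        self.t += 1
        reward = 1.0 if action > 0 else -1.0
        return float(self.t), reward

class Policy:
    """Toy stochastic policy parameterized by a single scalar theta."""
    def __init__(self, theta=0.0):
        self.theta = theta
    def compute(self, state):
        p = 1.0 / (1.0 + math.exp(-self.theta))
        return 1 if random.random() < p else -1
    def update(self, trajectories):
        # Nudge theta toward the sign of the total observed reward.
        total = sum(r for traj in trajectories for (_, r) in traj)
        return Policy(self.theta + 0.1 * (1 if total > 0 else -1))

def rollout(policy, environment):
    trajectory = []
    state = environment.initial_state()
    while not environment.has_terminated():
        action = policy.compute(state)             # serving
        state, reward = environment.step(action)   # simulation
        trajectory.append((state, reward))
    return trajectory

def train_policy(environment, num_iters=20, k=4):
    policy = Policy()
    for _ in range(num_iters):                     # stand-in for "until converged"
        trajectories = [rollout(policy, environment) for _ in range(k)]
        policy = policy.update(trajectories)       # training
    return policy

if __name__ == "__main__":
    print("trained theta:", train_policy(Environment()).theta)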

USENIX Association 13th USENIX Symposium on Operating Systems Design and Implementation 563 Name Description futures = f.remote(args) Execute function f remotely. f.remote() can take objects or futures as inputs and returns one or more futures. This is non-blocking. objects = ray.get(futures) Return the values associated with one or more futures. This is blocking. ready futures = ray.wait(futures,k,timeout) Return the futures whose corresponding tasks have completed as soon as either k have completed or the timeout expires. actor = Class.remote(args) Instantiate class Class as a remote actor, and return a handle to it. Call a method futures = actor.method.remote(args) on the remote actor and return one or more futures. Both are non-blocking.

Table 1: Ray API
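The following sketch exercises each API call from Table 1 on a single machine. It assumes a working Ray installation (pip install ray); the function and class bodies are illustrative, and in current open-source releases ray.wait takes keyword arguments (num_returns, timeout) and also returns the list of not-yet-ready futures.

import ray

ray.init()  # start or connect to a Ray cluster

@ray.remote
def simulate(seed):
    # Stand-in for an expensive, stateless simulation task.
    return seed * 2

@ray.remote
def simulate_batch(seeds):
    # Nested remote functions: a task may itself invoke other tasks.
    return ray.get([simulate.remote(s) for s in seeds])

@ray.remote(num_cpus=1)  # resource requirements; e.g., num_gpus=1 for a GPU actor
class Counter:
    def __init__(self):
        self.count = 0
    def add(self, values):
        self.count += len(values)
        return self.count

# Invocations return futures immediately (non-blocking).
futures = [simulate.remote(i) for i in range(8)]

# Wait for the first two results instead of all of them.
ready, remaining = ray.wait(futures, num_returns=2)

# Futures can be passed to other tasks or actors without calling ray.get().
batch = simulate_batch.remote(list(range(4)))
counter = Counter.remote()
print(ray.get(counter.add.remote(batch)))  # blocks until the value is ready
print(ray.get(ready))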

3 Programming and Computation Model

Ray implements a dynamic task graph computation model, i.e., it models an application as a graph of dependent tasks that evolves during execution. On top of this model, Ray provides both an actor and a task-parallel programming abstraction. This unification differentiates Ray from related systems like CIEL, which only provides a task-parallel abstraction, and from Orleans [14] or Akka [1], which primarily provide an actor abstraction.

3.1 Programming Model

Tasks. A task represents the execution of a remote function on a stateless worker. When a remote function is invoked, a future representing the result of the task is returned immediately. Futures can be retrieved using ray.get() and passed as arguments into other remote functions without waiting for their result. This allows the user to express parallelism while capturing data dependencies. Table 1 shows Ray's API.

Remote functions operate on immutable objects and are expected to be stateless and side-effect free: their outputs are determined solely by their inputs. This implies idempotence, which simplifies fault tolerance through function re-execution on failure.

Actors. An actor represents a stateful computation. Each actor exposes methods that can be invoked remotely and are executed serially. A method execution is similar to a task, in that it executes remotely and returns a future, but differs in that it executes on a stateful worker. A handle to an actor can be passed to other actors or tasks, making it possible for them to invoke methods on that actor.

Table 2 summarizes the properties of tasks and actors. Tasks enable fine-grained load balancing through leveraging load-aware scheduling at task granularity, input data locality, as each task can be scheduled on the node storing its inputs, and low recovery overhead, as there is no need to checkpoint and recover intermediate state. In contrast, actors provide much more efficient fine-grained updates, as these updates are performed on internal rather than external state, which typically requires serialization and deserialization. For example, actors can be used to implement parameter servers [32] and GPU-based iterative computations (e.g., training). In addition, actors can be used to wrap third-party simulators and other opaque handles that are hard to serialize.

Table 2: Tasks vs. actors tradeoffs.
Tasks (stateless): fine-grained load balancing; support for object locality; high overhead for small updates; efficient failure handling.
Actors (stateful): coarse-grained load balancing; poor locality support; low overhead for small updates; recovery overhead from checkpointing.

To satisfy the requirements for heterogeneity and flexibility (Section 2), we augment the API in three ways. First, to handle concurrent tasks with heterogeneous durations, we introduce ray.wait(), which waits for the first k available results, instead of waiting for all results like ray.get(). Second, to handle resource-heterogeneous tasks, we enable developers to specify resource requirements so that the Ray scheduler can efficiently manage resources. Third, to improve flexibility, we enable nested remote functions, meaning that remote functions can invoke other remote functions. This is also critical for achieving high scalability (Section 4), as it enables multiple processes to invoke remote functions in a distributed fashion.

3.2 Computation Model

Ray employs a dynamic task graph computation model [21], in which the execution of both remote functions and actor methods is automatically triggered by the system when their inputs become available. In this section, we describe how the computation graph (Figure 4) is constructed from a user program (Figure 3). This program uses the API in Table 1 to implement the pseudocode from Figure 2.

Ignoring actors first, there are two types of nodes in a computation graph: data objects and remote function invocations, or tasks. There are also two types of edges: data edges and control edges. Data edges capture the dependencies between data objects and tasks. More precisely, if data object D is an output of task T, we add a data edge from T to D. Similarly, if D is an input to T, we add a data edge from D to T. Control edges capture the computation dependencies that result from nested remote functions (Section 3.1): if task T1 invokes task T2, then we add a control edge from T1 to T2.

Actor method invocations are also represented as nodes in the computation graph. They are identical to tasks with one key difference. To capture the state dependency across subsequent method invocations on the same actor, we add a third type of edge: a stateful edge. If method Mj is called right after method Mi on the same actor, then we add a stateful edge from Mi to Mj. Thus, all methods invoked on the same actor object form a chain that is connected by stateful edges (Figure 4). This chain captures the order in which these methods were invoked.

Stateful edges help us embed actors in an otherwise stateless task graph, as they capture the implicit data dependency between successive method invocations sharing the internal state of an actor. Stateful edges also enable us to maintain lineage. As in other dataflow systems [64], we track data lineage to enable reconstruction. By explicitly including stateful edges in the lineage graph, we can easily reconstruct lost data, whether produced by remote functions or actor methods (Section 4.2.3).

@ray.remote
def create_policy():
    # Initialize the policy randomly.
    return policy

@ray.remote(num_gpus=1)
class Simulator(object):
    def __init__(self):
        # Initialize the environment.
        self.env = Environment()
    def rollout(self, policy, num_steps):
        observations = []
        observation = self.env.current_state()
        for _ in range(num_steps):
            action = policy(observation)
            observation = self.env.step(action)
            observations.append(observation)
        return observations

@ray.remote(num_gpus=2)
def update_policy(policy, *rollouts):
    # Update the policy.
    return policy

@ray.remote
def train_policy():
    # Create a policy.
    policy_id = create_policy.remote()
    # Create 10 actors.
    simulators = [Simulator.remote() for _ in range(10)]
    # Do 100 steps of training.
    for _ in range(100):
        # Perform one rollout on each actor.
        rollout_ids = [s.rollout.remote(policy_id) for s in simulators]
        # Update the policy with the rollouts.
        policy_id = update_policy.remote(policy_id, *rollout_ids)
    return ray.get(policy_id)

Figure 3: Python code implementing the example in Figure 2 in Ray. Note that @ray.remote indicates remote functions and actors. Invocations of remote functions and actor methods return futures, which can be passed to subsequent remote functions or actor methods to encode task dependencies. Each actor has an environment object self.env shared between all of its methods.

Figure 4: The task graph corresponding to an invocation of train_policy.remote() in Figure 3. Remote function calls and actor method calls correspond to tasks in the task graph (the figure shows tasks T0–T3 for train_policy, create_policy, and update_policy, two Simulator actors, and their rollout invocations, connected by data edges, control edges, and stateful edges). The figure shows two actors. The method invocations for each actor (the tasks labeled A1i and A2i) have stateful edges between them, indicating that they share the mutable actor state. There are control edges from train_policy to the tasks that it invokes. To train multiple policies in parallel, we could call train_policy.remote() multiple times.

4 Architecture

Ray's architecture comprises (1) an application layer implementing the API, and (2) a system layer providing high scalability and fault tolerance.

4.1 Application Layer

The application layer consists of three types of processes:

• Driver: A process executing the user program.

• Worker: A stateless process that executes tasks (remote functions) invoked by a driver or another worker. Workers are started automatically and assigned tasks by the system layer. When a remote function is declared, the function is automatically published to all workers. A worker executes tasks serially, with no local state maintained across tasks.

• Actor: A stateful process that executes, when invoked, only the methods it exposes. Unlike a worker, an actor is explicitly instantiated by a worker or a driver. Like workers, actors execute methods serially, except that each method depends on the state resulting from the previous method execution.

Figure 5: Ray's architecture consists of two parts: an application layer and a system layer. The application layer implements the API and the computation model described in Section 3; the system layer implements task scheduling and data management to satisfy the performance and fault-tolerance requirements. (The figure shows, on each node, driver, worker, and actor processes in the application layer above an object store and a local scheduler, together with a shared system layer containing the Global Control Store (GCS) with its object, task, and function tables, one or more global schedulers, and debugging, profiling, web UI, event log, and error diagnosis tools.)

4.2 System Layer

The system layer consists of three major components: a global control store, a distributed scheduler, and a distributed object store. All components are horizontally scalable and fault-tolerant.

4.2.1 Global Control Store (GCS)

The global control store (GCS) maintains the entire control state of the system, and it is a unique feature of our design. At its core, the GCS is a key-value store with pub-sub functionality. We use sharding to achieve scale, and per-shard chain replication [61] to provide fault tolerance. The primary reason for the GCS and its design is to maintain fault tolerance and low latency for a system that can dynamically spawn millions of tasks per second.

Fault tolerance in case of node failure requires a solution to maintain lineage information. Existing lineage-based solutions [64, 63, 40, 28] focus on coarse-grained parallelism and can therefore use a single node (e.g., master, driver) to store the lineage without impacting performance. However, this design is not scalable for a fine-grained and dynamic workload like simulation. Therefore, we decouple the durable lineage storage from the other system components, allowing each to scale independently.

Maintaining low latency requires minimizing overheads in task scheduling, which involves choosing where to execute, and subsequently task dispatch, which involves retrieving remote inputs from other nodes. Many existing dataflow systems [64, 40, 48] couple these by storing object locations and sizes in a centralized scheduler, a natural design when the scheduler is not a bottleneck. However, the scale and granularity that Ray targets requires keeping the centralized scheduler off the critical path. Involving the scheduler in each object transfer is prohibitively expensive for primitives important to distributed training like allreduce, which is both communication-intensive and latency-sensitive. Therefore, we store the object metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling.

In summary, the GCS significantly simplifies Ray's overall design, as it enables every component in the system to be stateless. This not only simplifies support for fault tolerance (i.e., on failure, components simply restart and read the lineage from the GCS), but also makes it easy to scale the distributed object store and scheduler independently, as all components share the needed state via the GCS. An added benefit is the easy development of debugging, profiling, and visualization tools.

4.2.2 Bottom-Up Distributed Scheduler

As discussed in Section 2, Ray needs to dynamically schedule millions of tasks per second, tasks which may take as little as a few milliseconds. None of the cluster schedulers we are aware of meet these requirements. Most cluster computing frameworks, such as Spark [64], CIEL [40], and Dryad [28], implement a centralized scheduler, which can provide locality but at latencies in the tens of milliseconds. Distributed schedulers such as work stealing [12], Sparrow [45], and Canary [47] can achieve high scale, but they either don't consider data locality [12], or assume tasks belong to independent jobs [45], or assume the computation graph is known [47].

To satisfy the above requirements, we design a two-level hierarchical scheduler consisting of a global scheduler and per-node local schedulers. To avoid overloading the global scheduler, the tasks created at a node are submitted first to the node's local scheduler.
A local scheduler schedules tasks locally unless the node is overloaded (i.e., its local task queue exceeds a predefined threshold), or it cannot satisfy a task's requirements (e.g., lacks a GPU). If a local scheduler decides not to schedule a task locally, it forwards it to the global scheduler. Since this scheduler attempts to schedule tasks locally first (i.e., at the leaves of the scheduling hierarchy), we call it a bottom-up scheduler.
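Schematically, the local scheduler's forwarding decision can be summarized as follows. This is an illustrative sketch, not Ray's actual scheduler code; the queue threshold and resource bookkeeping are simplified.

def schedule(task, local_node, global_scheduler, queue_threshold):
    """Schematic bottom-up scheduling decision (not Ray's actual code)."""
    overloaded = len(local_node.task_queue) > queue_threshold
    satisfiable = all(
        local_node.resources.get(resource, 0) >= amount
        for resource, amount in task.resource_requirements.items()
    )
    if overloaded or not satisfiable:
        # Escalate to the global scheduler, which picks the node with the
        # lowest estimated waiting time (queueing delay + input transfer time).
        global_scheduler.submit(task)
    else:
        local_node.task_queue.append(task)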

Figure 6: Bottom-up distributed scheduler. Tasks are submitted bottom-up, from drivers and workers to a local scheduler and forwarded to the global scheduler only if needed (Section 4.2.2). The thickness of each arrow is proportional to its request rate.

The global scheduler considers each node's load and task's constraints to make scheduling decisions. More precisely, the global scheduler identifies the set of nodes that have enough resources of the type requested by the task, and of these nodes selects the node which provides the lowest estimated waiting time. At a given node, this time is the sum of (i) the estimated time the task will be queued at that node (i.e., task queue size times average task execution time), and (ii) the estimated transfer time of the task's remote inputs (i.e., total size of remote inputs divided by average bandwidth). The global scheduler gets the queue size at each node and the node resource availability via heartbeats, and the location of the task's inputs and their sizes from the GCS. Furthermore, the global scheduler computes the average task execution time and the average transfer bandwidth using simple exponential averaging. If the global scheduler becomes a bottleneck, we can instantiate more replicas, all sharing the same information via the GCS. This makes our scheduler architecture highly scalable.

4.2.3 In-Memory Distributed Object Store

To minimize task latency, we implement an in-memory distributed storage system to store the inputs and outputs of every task, or stateless computation. On each node, we implement the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node. As a data format, we use Apache Arrow [2].

If a task's inputs are not local, the inputs are replicated to the local object store before execution. Also, a task writes its outputs to the local object store. Replication eliminates the potential bottleneck due to hot data objects and minimizes task execution time as a task only reads/writes data from/to the local memory. This increases throughput for computation-bound workloads, a profile shared by many AI applications. For low latency, we keep objects entirely in memory and evict them as needed to disk using an LRU policy.

As with existing cluster computing frameworks, such as Spark [64] and Dryad [28], the object store is limited to immutable data. This obviates the need for complex consistency protocols (as objects are not updated), and simplifies support for fault tolerance. In the case of node failure, Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution; we use the former to reconstruct objects in the store.

For simplicity, our object store does not support distributed objects, i.e., each object fits on a single node. Distributed objects like large matrices or trees can be implemented at the application level as collections of futures.

4.2.4 Implementation

Ray is an active open source project† developed at the University of California, Berkeley. Ray fully integrates with the Python environment and is easy to install by simply running pip install ray. The implementation comprises ≈ 40K lines of code (LoC), 72% in C++ for the system layer and 28% in Python for the application layer. The GCS uses one Redis [50] key-value store per shard, with entirely single-key operations. GCS tables are sharded by object and task IDs to scale, and every shard is chain-replicated [61] for fault tolerance. We implement both the local and global schedulers as event-driven, single-threaded processes. Internally, local schedulers maintain cached state for local object metadata, tasks waiting for inputs, and tasks ready for dispatch to a worker. To transfer large objects between different object stores, we stripe the object across multiple TCP connections.

† https://github.com/ray-project/ray

4.3 Putting Everything Together

Figure 7 illustrates how Ray works end-to-end with a simple example that adds two objects a and b, which could be scalars or matrices, and returns result c. The remote function add() is automatically registered with the GCS upon initialization and distributed to every worker in the system (step 0 in Figure 7a).

Figure 7a shows the step-by-step operations triggered by a driver invoking add.remote(a, b), where a and b are stored on nodes N1 and N2, respectively. The driver submits add(a, b) to the local scheduler (step 1), which forwards it to a global scheduler (step 2).‡ Next, the global scheduler looks up the locations of add(a, b)'s arguments in the GCS (step 3) and decides to schedule the task on node N2, which stores argument b (step 4).

‡ Note that N1 could also decide to schedule the task locally.

The local scheduler at node N2 checks whether the local object store contains add(a, b)'s arguments (step 5). Since the local store doesn't have object a, it looks up a's location in the GCS (step 6). Learning that a is stored at N1, N2's object store replicates it locally (step 7). As all arguments of add() are now stored locally, the local scheduler invokes add() at a local worker (step 8), which accesses the arguments via shared memory (step 9).

Figure 7b shows the step-by-step operations triggered by the execution of ray.get() at N1, and of add() at N2, respectively. Upon ray.get(idc)'s invocation, the driver checks the local object store for the value c, using the future idc returned by add() (step 1). Since the local object store doesn't store c, it looks up its location in the GCS. At this time, there is no entry for c, as c has not been created yet. As a result, N1's object store registers a callback with the Object Table to be triggered when c's entry has been created (step 2). Meanwhile, at N2, add() completes its execution, stores the result c in the local object store (step 3), which in turn adds c's entry to the GCS (step 4). As a result, the GCS triggers a callback to N1's object store with c's entry (step 5). Next, N1 replicates c from N2 (step 6), and returns c to ray.get() (step 7), which finally completes the task.

While this example involves a large number of RPCs, in many cases this number is much smaller, as most tasks are scheduled locally, and the GCS replies are cached by the global and local schedulers.

Figure 7: An end-to-end example that adds a and b and returns c. Solid lines are data plane operations and dotted lines are control plane operations. (a) The function add() is registered with the GCS by node 1 (N1), invoked on N1, and executed on N2. (b) N1 gets add()'s result using ray.get(). The Object Table entry for c is created in step 4 and updated in step 6 after c is copied to N1.

5 Evaluation

In our evaluation, we study the following questions:

1. How well does Ray meet the latency, scalability, and fault tolerance requirements listed in Section 2? (Section 5.1)
2. What overheads are imposed on distributed primitives (e.g., allreduce) written using Ray's API? (Section 5.1)
3. In the context of RL workloads, how does Ray compare against specialized systems for training, serving, and simulation? (Section 5.2)
4. What advantages does Ray provide for RL applications, compared to custom systems? (Section 5.3)

All experiments were run on Amazon Web Services. Unless otherwise stated, we use m4.16xlarge CPU instances and p3.16xlarge GPU instances.

5.1 Microbenchmarks

Locality-aware task placement. Fine-grained load balancing and locality-aware placement are primary benefits of tasks in Ray. Actors, once placed, are unable to move their computation to large remote objects, while tasks can. In Figure 8a, tasks placed without data locality awareness (as is the case for actor methods) suffer a 1-2 order-of-magnitude latency increase at 10-100MB input data sizes. Ray unifies tasks and actors through the shared object store, allowing developers to use tasks for, e.g., expensive postprocessing on output produced by simulation actors.

Figure 8: (a) Tasks leverage locality-aware placement. 1000 tasks with a random object dependency are scheduled onto one of two nodes. With the locality-aware policy, task latency remains independent of the size of task inputs instead of growing by 1-2 orders of magnitude. (b) Near-linear scalability leveraging the GCS and bottom-up distributed scheduler. Ray reaches 1 million tasks per second throughput with 60 nodes. x ∈ {70, 80, 90} omitted due to cost.
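As a small illustration of this locality benefit, a stateless task that consumes a large object produced by an actor can be placed by the scheduler on the node that already stores that object. The actor and task bodies below are placeholders.

import ray
import numpy as np

ray.init()

@ray.remote
class Simulator:
    def run_episode(self):
        # Produce a large rollout; it is stored in the local object store.
        return np.random.rand(10_000_000)

@ray.remote
def postprocess(rollout):
    # A stateless task: the scheduler can run it on the node that
    # already stores `rollout`, avoiding a large data transfer.
    return float(rollout.mean())

sim = Simulator.remote()
rollout_ref = sim.run_episode.remote()
print(ray.get(postprocess.remote(rollout_ref)))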

End-to-end scalability. One of the key benefits of the Global Control Store (GCS) and the bottom-up distributed scheduler is the ability to horizontally scale the system to support a high throughput of fine-grained tasks, while maintaining fault tolerance and low-latency task scheduling. In Figure 8b, we evaluate this ability on an embarrassingly parallel workload of empty tasks, increasing the cluster size on the x-axis. We observe near-perfect linearity in progressively increasing task throughput. Ray exceeds 1 million tasks per second throughput at 60 nodes and continues to scale linearly beyond 1.8 million tasks per second at 100 nodes. The rightmost datapoint shows that Ray can process 100 million tasks in less than a minute (54s), with minimal variability. As expected, increasing task duration reduces throughput proportionally to mean task duration, but the overall scalability remains linear. While many realistic workloads may exhibit more limited scalability due to object dependencies and inherent limits to application parallelism, this demonstrates the scalability of our overall architecture under high load.

Object store performance. To evaluate the performance of the object store (Section 4.2.3), we track two metrics: IOPS (for small objects) and write throughput (for large objects). In Figure 9, the write throughput from a single client exceeds 15GB/s as object size increases. For larger objects, memcpy dominates object creation time. For smaller objects, the main overheads are in serialization and IPC between the client and object store.

Figure 9: Object store write throughput and IOPS. From a single client, throughput exceeds 15GB/s (red) for large objects and 18K IOPS (cyan) for small objects on a 16-core instance (m4.4xlarge). It uses 8 threads to copy objects larger than 0.5MB and 1 thread for small objects. Bar plots report throughput with 1, 2, 4, 8, and 16 threads. Results are averaged over 5 runs.

GCS fault tolerance. To maintain low latency while providing strong consistency and fault tolerance, we build a lightweight chain replication [61] layer on top of Redis. Figure 10a simulates recording Ray tasks to and reading tasks from the GCS, where keys are 25 bytes and values are 512 bytes. The client sends requests as fast as it can, having at most one in-flight request at a time. Failures are reported to the chain master either from the client (having received explicit errors, or timeouts despite retries) or from any server in the chain (having received explicit errors). Overall, reconfigurations caused a maximum client-observed delay of under 30ms (this includes both failure detection and recovery delays).

GCS flushing. Ray is equipped to periodically flush the contents of the GCS to disk. In Figure 10b we submit 50 million empty tasks sequentially and monitor GCS memory consumption. As expected, it grows linearly with the number of tasks tracked and eventually reaches the memory capacity of the system. At that point, the system becomes stalled and the workload fails to finish within a reasonable amount of time. With periodic GCS flushing, we achieve two goals. First, the memory footprint is capped at a user-configurable level (in the microbenchmark we employ an aggressive strategy where consumed memory is kept as low as possible). Second, the flushing mechanism provides a natural way to snapshot lineage to disk for long-running Ray applications.

Figure 10: Ray GCS fault tolerance and flushing. (a) A timeline for GCS read and write latencies as viewed from a client submitting tasks. The chain starts with 2 replicas. We manually trigger reconfiguration as follows. At t ≈ 4.2s, a chain member is killed; immediately after, a new chain member joins, initiates state transfer, and restores the chain to 2-way replication. The maximum client-observed latency is under 30ms despite reconfigurations. (b) The Ray GCS maintains a constant memory footprint with GCS flushing. Without GCS flushing, the memory footprint reaches a maximum capacity and the workload fails to complete within a predetermined duration (indicated by the red cross).

Recovering from task failures. In Figure 11a, we demonstrate Ray's ability to transparently recover from worker node failures and elastically scale, using the durable GCS lineage storage. The workload, run on m4.xlarge instances, consists of linear chains of 100ms tasks submitted by the driver. As nodes are removed (at 25s, 50s, 100s), the local schedulers reconstruct previous results in the chain in order to continue execution. Overall per-node throughput remains stable throughout.

Recovering from actor failures. By encoding actor method calls as stateful edges directly in the dependency graph, we can reuse the same object reconstruction mechanism as in Figure 11a to provide transparent fault tolerance for stateful computation. Ray additionally leverages user-defined checkpoint functions to bound the reconstruction time for actors (Figure 11b). With minimal overhead, checkpointing enables only 500 methods to be re-executed, versus 10k re-executions without checkpointing. In the future, we hope to further reduce actor reconstruction time, e.g., by allowing users to annotate methods that do not mutate state.

Figure 11: Ray fault tolerance. (a) Ray reconstructs lost task dependencies as nodes are removed (dotted line), and recovers to original throughput when nodes are added back. Each task is 100ms and depends on an object generated by a previously submitted task. (b) Actors are reconstructed from their last checkpoint. At t = 200s, we kill 2 of the 10 nodes, causing 400 of the 2000 actors in the cluster to be recovered on the remaining nodes (t = 200–270s).

Allreduce. Allreduce is a distributed communication primitive important to many machine learning workloads. Here, we evaluate whether Ray can natively support a ring allreduce [57] implementation with low enough overhead to match existing implementations [53]. We find that Ray completes allreduce across 16 nodes on 100MB in ∼200ms and 1GB in ∼1200ms, surprisingly outperforming OpenMPI (v1.10), a popular MPI implementation, by 1.5× and 2× respectively (Figure 12a). We attribute Ray's performance to its use of multiple threads for network transfers, taking full advantage of the 25Gbps connection between nodes on AWS, whereas OpenMPI sequentially sends and receives data on a single thread [22]. For smaller objects, OpenMPI outperforms Ray by switching to a lower-overhead algorithm, an optimization we plan to implement in the future.

Ray's scheduler performance is critical to implementing primitives such as allreduce. In Figure 12b, we inject artificial task execution delays and show that performance drops nearly 2× with just a few ms of extra latency. Systems with centralized schedulers like Spark and CIEL typically have scheduler overheads in the tens of milliseconds [62, 38], making such workloads impractical. Scheduler throughput also becomes a bottleneck since the number of tasks required by ring reduce scales quadratically with the number of participants.

Figure 12: (a) Mean execution time of allreduce on 16 m4.16xl nodes. Each worker runs on a distinct node. Ray* restricts Ray to 1 thread for sending and 1 thread for receiving. (b) Ray's low-latency scheduling is critical for allreduce.

5.2 Building blocks

End-to-end applications (e.g., AlphaGo [54]) require a tight coupling of training, serving, and simulation. In this section, we isolate each of these workloads to a setting that illustrates a typical RL application's requirements. Due to a flexible programming model targeted to RL, and a system designed to support this programming model, Ray matches and sometimes exceeds the performance of dedicated systems for these individual workloads.

Table 3: Throughput comparisons for Clipper [19], a dedicated serving system, and Ray for two embedded serving workloads. We use a residual network and a small fully connected network, taking 10ms and 5ms to evaluate, respectively. The server is queried by clients that each send states of size 4KB and 100KB respectively in batches of 64.

System  | Small Input            | Larger Input
Clipper | 4400 ± 15 states/sec   | 290 ± 1.3 states/sec
Ray     | 6200 ± 21 states/sec   | 6900 ± 150 states/sec

Figure 13: Images per second reached when distributing the training of a ResNet-101 TensorFlow model (from the official TF benchmark), for Horovod + TF, Distributed TF, and Ray + TF, as the number of V100 GPUs grows from 4 to 64. All experiments were run on p3.16xl instances connected by 25Gbps Ethernet, and workers allocated 4 GPUs per node as done in Horovod [53]. We note some measurement deviations from previously reported, likely due to hardware differences and recent TensorFlow performance improvements. We used OpenMPI 3.0, TF 1.8, and NCCL2 for all runs.

5.2.1 Distributed Training

We implement data-parallel synchronous SGD leveraging the Ray actor abstraction to represent model replicas. Model weights are synchronized via allreduce (Section 5.1) or a parameter server, both implemented on top of the Ray API.

In Figure 13, we evaluate the performance of the Ray (synchronous) parameter-server SGD implementation against state-of-the-art implementations [53], using the same TensorFlow model and synthetic data generator for each experiment. We compare only against TensorFlow-based systems to accurately measure the overhead imposed by Ray, rather than differences between the deep learning frameworks themselves. In each iteration, model replica actors compute gradients in parallel, send the gradients to a sharded parameter server, then read the summed gradients from the parameter server for the next iteration.

Figure 13 shows that Ray matches the performance of Horovod and is within 10% of distributed TensorFlow (in distributed replicated mode). This is due to the ability to express the same application-level optimizations found in these specialized systems in Ray's general-purpose API. A key optimization is the pipelining of gradient computation, transfer, and summation within a single iteration. To overlap GPU computation with network transfer, we use a custom TensorFlow operator to write tensors directly to Ray's object store.

5.2.2 Serving

Model serving is an important component of end-to-end applications. Ray focuses primarily on the embedded serving of models to simulators running within the same dynamic task graph (e.g., within an RL application on Ray). In contrast, systems like Clipper [19] focus on serving predictions to external clients.

In this setting, low latency is critical for achieving high utilization. To show this, in Table 3 we compare the server throughput achieved using a Ray actor to serve a policy versus using the open source Clipper system over REST. Here, both client and server processes are co-located on the same machine (a p3.8xlarge instance). This is often the case for RL applications but not for the general web serving workloads addressed by systems like Clipper. Due to its low-overhead serialization and shared memory abstractions, Ray achieves an order of magnitude higher throughput for a small fully connected policy model that takes in a large input, and is also faster on a more expensive residual network policy model, similar to one used in AlphaGo Zero, that takes a smaller input.

5.2.3 Simulation

Simulators used in RL produce results with variable lengths ("timesteps") that, due to the tight loop with training, must be used as soon as they are available. The task heterogeneity and timeliness requirements make simulations hard to support efficiently in BSP-style systems. To demonstrate, we compare (1) an MPI implementation that submits 3n parallel simulation runs on n cores in 3 rounds, with a global barrier between rounds§, to (2) a Ray program that issues the same 3n tasks while concurrently gathering simulation results back to the driver. Table 4 shows that both systems scale well, yet Ray achieves up to 1.8× higher throughput. This motivates a programming model that can dynamically spawn and collect the results of fine-grained simulation tasks.

Table 4: Timesteps per second for the Pendulum-v0 simulator in OpenAI Gym [13]. Ray allows for better utilization when running heterogeneous simulations at scale.

System, programming model | 1 CPU  | 16 CPUs | 256 CPUs
MPI, bulk synchronous      | 22.6K  | 208K    | 2.16M
Ray, asynchronous tasks    | 22.3K  | 290K    | 4.03M

§ Note that experts can use MPI's asynchronous primitives to get around barriers—at the expense of increased program complexity—we nonetheless chose such an implementation to simulate BSP.
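A minimal version of the parameter-server pattern from Section 5.2.1 can be expressed directly with Ray actors. The gradient computation, the single (unsharded) parameter server, and the constants below are illustrative stand-ins; the actual implementation shards the server, uses TensorFlow models, and pipelines gradient transfer.

import ray
import numpy as np

ray.init()

@ray.remote
class ParameterServer:
    """Holds the model parameters (a single shard, for simplicity)."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
    def get_params(self):
        return self.params
    def apply_gradients(self, *grads):
        # Sum the replicas' gradients and take an SGD step.
        self.params -= 0.01 * np.sum(grads, axis=0)
        return self.params

@ray.remote
class ModelReplica:
    """One data-parallel worker holding a copy of the model."""
    def __init__(self, dim):
        self.dim = dim
    def compute_gradient(self, params):
        # Stand-in for a forward/backward pass on a local data batch.
        return params - np.random.rand(self.dim)

dim, num_replicas = 10, 4
ps = ParameterServer.remote(dim)
replicas = [ModelReplica.remote(dim) for _ in range(num_replicas)]

for step in range(10):                     # synchronous SGD iterations
    params = ps.get_params.remote()
    grads = [r.compute_gradient.remote(params) for r in replicas]
    ray.get(ps.apply_gradients.remote(*grads))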

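The asynchronous gathering credited in Table 4 corresponds to a driver loop like the following, which consumes each simulation result as soon as it is ready rather than waiting at a per-round barrier. The simulation body is a placeholder for a real simulator.

import ray
import random
import time

ray.init()

@ray.remote
def simulate(seed):
    # Simulations have highly variable durations ("timesteps").
    steps = random.randint(10, 1000)
    time.sleep(steps / 10000.0)
    return steps

pending = [simulate.remote(i) for i in range(3 * 16)]  # issue all 3n tasks at once
total_steps = 0
while pending:
    # Consume each result as soon as it is ready, instead of waiting
    # for a global barrier as a BSP-style system would.
    [done], pending = ray.wait(pending, num_returns=1)
    total_steps += ray.get(done)
print(total_steps)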
USENIX Association 13th USENIX Symposium on Operating Systems Design and Implementation 571 90 5.3 RL Applications 80 Reference ES 500 MPI PPO 70 Ray ES Ray PPO 400 Without a system that can tightly couple the training, sim- 60 ulation, and serving steps, reinforcement learning algo- 50 300 40 rithms today are implemented as one-off solutions that 30 200 20 make it difficult to incorporate optimizations that, for ex- 100 ample, require a different computation structure or that 10 0 x x x 0 Mean time to solve (minutes) utilize different architectures. Consequently, with imple- 256 1024 8192 Mean time to solve (minutes) 8x1 64x8 512x64 mentations of two representative reinforcement learning Number of CPUs CPUs x GPUs applications in Ray, we are able to match and even out- (a) Evolution Strategies (b) PPO perform custom systems built specifically for these algo- rithms. The primary reason is the flexibility of Ray’s pro- Figure 14: Time to reach a score of 6000 in the Humanoid- gramming model, which can express application-level op- v1 task [13]. (a) The Ray ES implementation scales well to timizations that would require substantial engineering ef- 8192 cores and achieves a median time of 3.7 minutes, over twice as fast as the best published result. The special-purpose fort to port to custom-built systems, but are transparently system failed to run beyond 1024 cores. ES is faster than PPO supported by Ray’s dynamic task graph execution engine. on this benchmark, but shows greater runtime variance. (b) The Ray PPO implementation outperforms a specialized MPI 5.3.1 Evolution Strategies implementation [5] with fewer GPUs, at a fraction of the cost. The MPI implementation required 1 GPU for every 8 CPUs, To evaluate Ray on large-scale RL workloads, we imple- whereas the Ray version required at most 8 GPUs (and never ment the evolution strategies (ES) algorithm and com- more than 1 GPU per 8 CPUs). pare to the reference implementation [49]—a system spe- return rollouts to the driver. Tasks are submitted un- cially built for this algorithm that relies on Redis for mes- til 320000 simulation steps are collected (each task pro- saging and low-level libraries for data- duces between 10 and 1000 steps). The policy update per- sharing. The algorithm periodically broadcasts a new pol- forms 20 steps of SGD with a batch size of 32768. The icy to a pool of workers and aggregates the results of model parameters in this example are roughly 350KB. roughly 10000 tasks (each performing 10 to 1000 simula- These experiments were run using p2.16xlarge (GPU) and tion steps). m4.16xlarge (high CPU) instances. As shown in Figure 14a, an implementation on Ray As shown in Figure 14b, the Ray implementation out- scales to 8192 cores. Doubling the cores available yields performs the optimized MPI implementation in all exper- an average completion time of 1.6×. Conversely, iments, while using a fraction of the GPUs. The reason the special-purpose system fails to complete at 2048 cores, is that Ray is heterogeneity-aware and allows the user to where the work in the system exceeds the processing utilize asymmetric architectures by expressing resource capacity of the application driver. To avoid this issue, the requirements at the granularity of a task or actor. 
5.3.2 Proximal Policy Optimization

We implement Proximal Policy Optimization (PPO) [51] in Ray and compare to a highly-optimized reference implementation [5] that uses OpenMPI communication primitives. The algorithm is an asynchronous scatter-gather, where new tasks are assigned to simulation actors as they return rollouts to the driver. Tasks are submitted until 320000 simulation steps are collected (each task produces between 10 and 1000 steps). The policy update performs 20 steps of SGD with a batch size of 32768. The model parameters in this example are roughly 350KB. These experiments were run using p2.16xlarge (GPU) and m4.16xlarge (high CPU) instances.

As shown in Figure 14b, the Ray implementation outperforms the optimized MPI implementation in all experiments, while using a fraction of the GPUs. The reason is that Ray is heterogeneity-aware and allows the user to utilize asymmetric architectures by expressing resource requirements at the granularity of a task or actor. The Ray implementation can then leverage TensorFlow's single-process multi-GPU support and can pin objects in GPU memory when possible. This optimization cannot be easily ported to MPI due to the need to asynchronously gather rollouts to a single GPU process. Indeed, [5] includes two custom implementations of PPO, one using MPI for large clusters and one that is optimized for GPUs but that is restricted to a single node. Ray allows for an implementation suitable for both scenarios.

Ray's ability to handle resource heterogeneity also decreased PPO's cost by a factor of 4.5 [4], since CPU-only tasks can be scheduled on cheaper high-CPU instances. In contrast, MPI applications often exhibit symmetric architectures, in which all processes run the same code and require identical resources, in this case preventing the use of CPU-only machines for scale-out. Furthermore, the MPI implementation requires on-demand instances since it does not transparently handle failure. Assuming 4× cheaper spot instances, Ray's fault tolerance and resource-aware scheduling together cut costs by 18×.
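To make the per-task and per-actor resource annotations concrete, here is a minimal sketch of an asynchronous scatter-gather loop in Ray's Python API. It is not the PPO implementation evaluated above: Simulator, Optimizer, their methods, and the step counts are illustrative stand-ins, and the example declares a logical GPU via ray.init so it can run on a machine without one.

    import ray

    ray.init(num_gpus=1)    # declare one logical GPU so the example runs anywhere

    @ray.remote             # no GPU requirement: can run on cheap CPU-only nodes
    class Simulator:
        def __init__(self, seed):
            self.seed = seed
        def rollout(self, params):
            # Hypothetical stand-in for an episode: returns (num_steps, samples).
            steps = 100 + self.seed
            return steps, [0.0] * steps

    @ray.remote(num_gpus=1)  # scheduled only where a GPU resource is available
    class Optimizer:
        def __init__(self):
            self.params = [0.0]
        def get_params(self):
            return self.params
        def update(self, batches):
            # Hypothetical stand-in for several SGD steps over gathered samples.
            return self.params

    optimizer = Optimizer.remote()
    simulators = [Simulator.remote(i) for i in range(16)]
    params = optimizer.get_params.remote()

    pending = {sim.rollout.remote(params): sim for sim in simulators}
    collected, batches = 0, []
    while collected < 10_000:   # the evaluated PPO setup gathers 320000 steps
        # Asynchronous scatter-gather: take whichever rollout finishes first
        # and immediately hand that simulator a new task.
        [done], _ = ray.wait(list(pending))
        sim = pending.pop(done)
        steps, samples = ray.get(done)
        collected += steps
        batches.append(samples)
        pending[sim.rollout.remote(params)] = sim
    params = optimizer.update.remote(batches)

Because the rollout actors declare no GPU requirement, a resource-aware scheduler can place them on cheap CPU-only or spot instances, while the num_gpus annotation pins the optimizer to a GPU machine.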

6 Related Work

Dynamic task graphs. Ray is closely related to CIEL [40] and Dask [48]. All three support dynamic task graphs with nested tasks and implement the futures abstraction. CIEL also provides lineage-based fault tolerance, while Dask, like Ray, fully integrates with Python. However, Ray differs in two aspects that have important performance consequences. First, Ray extends the task model with an actor abstraction. This is necessary for efficient stateful computation in distributed training and serving, to keep the model data collocated with the computation. Second, Ray employs a fully distributed and decoupled control plane and scheduler, instead of relying on a single master storing all metadata. This is critical for efficiently supporting primitives like allreduce without system modification. At peak performance for 100MB on 16 nodes, allreduce on Ray (Section 5.1) submits 32 rounds of 16 tasks in 200ms. Meanwhile, Dask reports a maximum scheduler throughput of 3k tasks/s on 512 cores [3]. With a centralized scheduler, each round of allreduce would then incur a minimum of ∼5ms of scheduling delay, translating to up to 2× worse completion time (Figure 12b). Even with a decentralized scheduler, coupling the control plane information with the scheduler leaves the latter on the critical path for data transfer, adding an extra roundtrip to every round of allreduce.

Dataflow systems. Popular dataflow systems, such as MapReduce [20], Spark [65], and Dryad [28], have widespread adoption for analytics and ML workloads, but their computation model is too restrictive for a fine-grained and dynamic simulation workload.
Spark and MapReduce implement the BSP execution model, which assumes that tasks within the same stage perform the same computation and take roughly the same amount of time. Dryad relaxes this restriction but lacks support for dynamic task graphs. Furthermore, none of these systems provide an actor abstraction, nor implement a distributed scalable control plane and scheduler. Finally, Naiad [39] is a dataflow system that provides improved scalability for some workloads, but only supports static task graphs.

Machine learning frameworks. TensorFlow [7] and MXNet [18] target deep learning workloads and efficiently leverage both CPUs and GPUs. While they achieve great performance for training workloads consisting of static DAGs of linear algebra operations, they have limited support for the more general computation required to tightly couple training with simulation and embedded serving. TensorFlow Fold [33] provides some support for dynamic task graphs, as does MXNet through its internal C++ APIs, but neither fully supports the ability to modify the DAG during execution in response to task progress, task completion times, or faults. TensorFlow and MXNet in principle achieve generality by allowing the programmer to simulate low-level message-passing and synchronization primitives, but the pitfalls and user experience in this case are similar to those of MPI. OpenMPI [22] can achieve high performance, but it is relatively hard to program as it requires explicit coordination to handle heterogeneous and dynamic task graphs. Furthermore, it forces the programmer to explicitly handle fault tolerance.

Actor systems. Orleans [14] and Akka [1] are two actor frameworks well suited to developing highly available and concurrent distributed systems. However, compared to Ray, they provide less support for recovery from data loss. To recover stateful actors, the Orleans developer must explicitly checkpoint actor state and intermediate responses. Stateless actors in Orleans can be replicated for scale-out, and could therefore act as tasks, but unlike in Ray, they have no lineage. Similarly, while Akka explicitly supports persisting actor state across failures, it does not provide efficient fault tolerance for stateless computation (i.e., tasks). For message delivery, Orleans provides at-least-once and Akka provides at-most-once semantics. In contrast, Ray provides transparent fault tolerance and exactly-once semantics, as each method call is logged in the GCS and both arguments and results are immutable. We find that in practice these limitations do not affect the performance of our applications. Erlang [10] and the C++ Actor Framework [17], two other actor-based systems, have similarly limited support for fault tolerance.

Global control store and scheduling. The concept of logically centralizing the control plane has been previously proposed in software defined networks (SDNs) [16], distributed file systems (e.g., GFS [23]), resource management (e.g., Omega [52]), and distributed frameworks (e.g., MapReduce [20], BOOM [9]), to name a few. Ray draws inspiration from these pioneering efforts, but provides significant improvements. In contrast with SDNs, BOOM, and GFS, Ray decouples the storage of the control plane information (e.g., GCS) from the logic implementation (e.g., schedulers). This allows both storage and computation layers to scale independently, which is key to achieving our scalability targets. Omega uses a distributed architecture in which schedulers coordinate via globally shared state. To this architecture, Ray adds global schedulers to balance load across local schedulers, and targets ms-level, not second-level, task scheduling. Ray implements a unique distributed bottom-up scheduler that is horizontally scalable, and can handle dynamically constructed task graphs. Unlike Ray, most existing cluster computing systems [20, 64, 40] use a centralized scheduler architecture. While Sparrow [45] is decentralized, its schedulers make independent decisions, limiting the possible scheduling policies, and all tasks of a job are handled by the same global scheduler. Mesos [26] implements a two-level hierarchical scheduler, but its top-level scheduler manages frameworks, not individual tasks.

Canary [47] achieves impressive performance by having each scheduler instance handle a portion of the task graph, but does not handle dynamic computation graphs. Cilk [12] is a parallel programming language whose work-stealing scheduler achieves provably efficient load-balancing for dynamic task graphs. However, with no central coordinator like Ray's global scheduler, this fully parallel design is also difficult to extend to support data locality and resource heterogeneity in a distributed setting.

7 Discussion and Experiences

Building Ray has been a long journey. It started two years ago with a Spark library to perform distributed training and simulations. However, the relative inflexibility of the BSP model, the high per-task overhead, and the lack of an actor abstraction led us to develop a new system. Since we released Ray roughly one year ago, several hundred people have used it and several companies are running it in production. Here we discuss our experience developing and using Ray, and some early user feedback.

API. In designing the API, we have emphasized minimalism. Initially we started with a basic task abstraction. Later, we added the wait() primitive to accommodate rollouts with heterogeneous durations and the actor abstraction to accommodate third-party simulators and amortize the overhead of expensive initializations. While the resulting API is relatively low-level, it has proven both powerful and simple to use. We have already used this API to implement many state-of-the-art RL algorithms on top of Ray, including A3C [36], PPO [51], DQN [37], ES [49], DDPG [55], and Ape-X [27]. In most cases it took us just a few tens of lines of code to port these algorithms to Ray. Based on early user feedback, we are considering enhancing the API to include higher level primitives and libraries, which could also inform scheduling decisions.
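As an illustration of these two primitives, the sketch below wraps a hypothetical expensive-to-construct simulator in an actor and uses wait() to process whichever rollouts finish first; ExpensiveSimulator and Env are invented for the example and are not part of Ray or of the algorithms listed above.

    import time
    import ray

    ray.init()

    class ExpensiveSimulator:
        # Hypothetical stand-in for a third-party simulator with costly start-up.
        def __init__(self):
            time.sleep(1.0)      # expensive initialization, paid once per actor
        def step(self, action):
            return action * 2

    @ray.remote
    class Env:
        # The actor amortizes the simulator's initialization across many calls.
        def __init__(self):
            self.sim = ExpensiveSimulator()
        def step(self, action):
            return self.sim.step(action)

    envs = [Env.remote() for _ in range(4)]
    pending = [env.step.remote(i) for i, env in enumerate(envs)]
    # wait() returns as soon as two results are ready, so the driver makes
    # progress even when rollout durations are heterogeneous.
    ready, pending = ray.wait(pending, num_returns=2)
    print(ray.get(ready))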
Limitations. Given the workload generality, specialized optimizations are hard. For example, we must make scheduling decisions without full knowledge of the computation graph. Scheduling optimizations in Ray might require more complex runtime profiling. In addition, storing lineage for each task requires the implementation of garbage collection policies to bound storage costs in the GCS, a feature we are actively developing.

Fault tolerance. We are often asked if fault tolerance is really needed for AI applications. After all, due to the statistical nature of many AI algorithms, one could simply ignore failed rollouts. Based on our experience, our answer is "yes". First, the ability to ignore failures makes applications much easier to write and reason about. Second, our particular implementation of fault tolerance via deterministic replay dramatically simplifies debugging as it allows us to easily reproduce most errors. This is particularly important since, due to their stochasticity, AI algorithms are notoriously hard to debug. Third, fault tolerance helps save money since it allows us to run on cheap resources like spot instances on AWS. Of course, this comes at the price of some overhead. However, we found this overhead to be minimal for our target workloads.

GCS and Horizontal Scalability. The GCS dramatically simplified Ray development and debugging. It enabled us to query the entire system state while debugging Ray itself, instead of having to manually expose internal component state. In addition, the GCS is also the backend for our timeline visualization tool, used for application-level debugging.

The GCS was also instrumental to Ray's horizontal scalability. In Section 5, we were able to scale by adding more shards whenever the GCS became a bottleneck. The GCS also enabled the global scheduler to scale by simply adding more replicas. Due to these advantages, we believe that centralizing control state will be a key design component of future distributed systems.

8 Conclusion

No general-purpose system today can efficiently support the tight loop of training, serving, and simulation. To express these core building blocks and meet the demands of emerging AI applications, Ray unifies task-parallel and actor programming models in a single dynamic task graph and employs a scalable architecture enabled by the global control store and a bottom-up distributed scheduler. The programming flexibility, high throughput, and low latencies simultaneously achieved by this architecture are particularly important for emerging artificial intelligence workloads, which produce tasks diverse in their resource requirements, duration, and functionality. Our evaluation demonstrates linear scalability up to 1.8 million tasks per second, transparent fault tolerance, and substantial performance improvements on several contemporary RL workloads. Thus, Ray provides a powerful combination of flexibility, performance, and ease of use for the development of future AI applications.

9 Acknowledgments

This research is supported in part by NSF CISE Expeditions Award CCF-1730628 and gifts from Alibaba, Amazon Web Services, Ant Financial, Arm, CapitalOne, Ericsson, Facebook, Google, Huawei, Intel, Scotiabank, Splunk, and VMware, as well as by NSF grant DGE-1106400. We are grateful to our anonymous reviewers and our shepherd, Miguel Castro, for thoughtful feedback, which helped improve the quality of this paper.

References

[1] Akka. https://akka.io/.

[2] Apache Arrow. https://arrow.apache.org/.

[3] Dask Benchmarks. http://matthewrocklin.com/blog/work/2017/07/03/scaling.

[4] EC2 Instance Pricing. https://aws.amazon.com/ec2/pricing/on-demand/.

[5] OpenAI Baselines: High-quality implementations of reinforcement learning algorithms. https://github.com/openai/baselines.

[6] TensorFlow Serving. https://www.tensorflow.org/serving/.

[7] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Savannah, GA, USA, 2016).

[8] Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford, J., Lee, S., Li, J., Melamed, D., Oshri, G., Ribas, O., Sen, S., and Slivkins, A. A multiworld testing decision service. arXiv preprint arXiv:1606.03966 (2016).

[9] Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J. M., and Sears, R. BOOM Analytics: Exploring data-centric, declarative programming for the cloud. In Proceedings of the 5th European Conference on Computer Systems (2010), ACM, pp. 223–236.

[10] Armstrong, J., Virding, R., Wikström, C., and Williams, M. Concurrent Programming in ERLANG.

[11] Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., et al. DeepMind Lab. arXiv preprint arXiv:1612.03801 (2016).

[12] Blumofe, R. D., and Leiserson, C. E. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (Sept. 1999), 720–748.

[13] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).

[14] Bykov, S., Geller, A., Kliot, G., Larus, J. R., Pandya, R., and Thelin, J. Orleans: Cloud computing for everyone. In Proceedings of the 2nd ACM Symposium on Cloud Computing (2011), ACM, p. 16.

[15] Carbone, P., Ewen, S., Fóra, G., Haridi, S., Richter, S., and Tzoumas, K. State management in Apache Flink: Consistent stateful distributed stream processing. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1718–1729.

[16] Casado, M., Freedman, M. J., Pettit, J., Luo, J., McKeown, N., and Shenker, S. Ethane: Taking control of the enterprise. SIGCOMM Comput. Commun. Rev. 37, 4 (Aug. 2007), 1–12.

[17] Charousset, D., Schmidt, T. C., Hiesgen, R., and Wählisch, M. Native actors: A scalable software platform for distributed, heterogeneous environments. In Proceedings of the 2013 Workshop on Programming Based on Actors, Agents, and Decentralized Control (2013), ACM, pp. 87–96.

[18] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys'16) (2016).

[19] Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, 2017), USENIX Association, pp. 613–627.

[20] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113.

[21] Dennis, J. B., and Misunas, D. P. A preliminary architecture for a basic data-flow processor. In Proceedings of the 2nd Annual Symposium on Computer Architecture (New York, NY, USA, 1975), ISCA '75, ACM, pp. 126–132.

[22] Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R. H., Daniel, D. J., Graham, R. L., and Woodall, T. S. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting (Budapest, Hungary, September 2004), pp. 97–104.

[23] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. pp. 29–43.

[24] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2014), OSDI'14, USENIX Association, pp. 599–613.

[25] Gu*, S., Holly*, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA 2017) (2017).

[26] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI'11, USENIX Association, pp. 295–308.

[27] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. International Conference on Learning Representations (2018).

[28] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 59–72.

[29] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014).

[30] Jordan, M. I., and Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255–260.

[31] Leibiusky, J., Eisbruch, G., and Simonassi, D. Getting Started with Storm. O'Reilly Media, Inc., 2012.

[32] Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2014), OSDI'14, pp. 583–598.

[33] Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181 (2017).

[34] Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. GraphLab: A new framework for parallel machine learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (Arlington, Virginia, United States, 2010), UAI'10, pp. 340–349.

[35] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2010), SIGMOD '10, ACM, pp. 135–146.

[36] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (2016).

[37] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.

[38] Murray, D. A Distributed Execution Engine Supporting Data-dependent Control Flow. University of Cambridge, 2012.

[39] Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 439–455.

[40] Murray, D. G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., and Hand, S. CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI'11, USENIX Association, pp. 113–126.

[41] Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning, 2015.

[42] Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX (2006), 363–372.

[43] Nishihara, R., Moritz, P., Wang, S., Tumanov, A., Paul, W., Schleier-Smith, J., Liaw, R., Niknami, M., Jordan, M. I., and Stoica, I. Real-time machine learning: The missing pieces. In Workshop on Hot Topics in Operating Systems (2017).

[44] OpenAI. OpenAI Dota 2 1v1 bot. https://openai.com/the-international/, 2017.

[45] Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 69–84.

[46] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch.

[47] Qu, H., Mashayekhi, O., Terei, D., and Levis, P. Canary: A scheduling architecture for high performance cloud computing. arXiv preprint arXiv:1602.01412 (2016).

[48] Rocklin, M. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference (2015), K. Huff and J. Bergstra, Eds., pp. 130–136.

[49] Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).

[50] Sanfilippo, S. Redis: An open source, in-memory store. https://redis.io/, 2009.

[51] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

[52] Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys '13, ACM, pp. 351–364.

[53] Sergeev, A., and Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).

[54] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[55] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML (2014).

[56] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[57] Thakur, R., Rabenseifner, R., and Gropp, W. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.

[58] Tian, Y., Gong, Q., Shang, W., Wu, Y., and Zitnick, C. L. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. Advances in Neural Information Processing Systems (NIPS) (2017).

[59] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on (2012), IEEE, pp. 5026–5033.

[60] van den Berg, J., Miller, S., Duckworth, D., Hu, H., Wan, A., Fu, X.-Y., Goldberg, K., and Abbeel, P. Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (2010), IEEE, pp. 2074–2081.

[61] van Renesse, R., and Schneider, F. B. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (Berkeley, CA, USA, 2004), OSDI'04, USENIX Association.

[62] Venkataraman, S., Panda, A., Ousterhout, K., Ghodsi, A., Armbrust, M., Recht, B., Franklin, M., and Stoica, I. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the Twenty-Sixth ACM Symposium on Operating Systems Principles (2017), SOSP '17, ACM.

[63] White, T. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[64] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), USENIX Association, pp. 2–2.

[65] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., and Stoica, I. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (Oct. 2016), 56–65.
