DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson1, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon Valley
1joint affiliation, Reykjavík University, Iceland

Abstract

DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.

A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan.

We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained—a general-purpose sort of 10^12 Bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster—as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.

1 Introduction

The DryadLINQ system is designed to make it easy for a wide variety of developers to compute effectively on large amounts of data. DryadLINQ programs are written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects. The main contribution of this paper is a set of language extensions and a corresponding system that can automatically and transparently compile imperative programs in a general-purpose language into distributed computations that execute efficiently on large computing clusters.

Our goal is to give the programmer the illusion of writing for a single computer and to have the system deal with the complexities that arise from scheduling, distribution, and fault-tolerance. Achieving this goal requires a wide variety of components to interact, including cluster-management software, distributed-execution middleware, language constructs, and development tools. Traditional parallel databases (which we survey in Section 6.1) as well as more recent data-processing systems such as MapReduce [15] and Dryad [26] demonstrate that it is possible to implement high-performance large-scale execution engines at modest financial cost, and clusters running such platforms are proliferating. Even so, their programming interfaces all leave room for improvement. We therefore believe that the language issues addressed in this paper are currently among the most pressing research areas for data-intensive computing, and our work on the DryadLINQ system stems from this belief.

DryadLINQ exploits LINQ (Language INtegrated Query [2], a set of .NET constructs for programming with datasets) to provide a powerful hybrid of declarative and imperative programming. The system is designed to provide flexible and efficient distributed computation in any LINQ-enabled programming language including C#, VB, and F#. Objects in DryadLINQ datasets can be of any .NET type, making it easy to compute with data such as image patches, vectors, and matrices. DryadLINQ programs can use traditional structuring constructs such as functions, modules, and libraries, and express iteration using standard loops. Crucially, the distributed execution layer employs a fully functional, declarative description of the data-parallel component of the computation, which enables sophisticated rewritings and optimizations like those traditionally employed by parallel databases.

In contrast, parallel databases implement only declarative variants of SQL queries. There is by now a widespread belief that SQL is too limited for many applications [15, 26, 31, 34, 35]. One problem is that, in order to support database requirements such as in-place updates and efficient transactions, SQL adopts a very restrictive type system. In addition, the declarative "query-oriented" nature of SQL makes it difficult to express common programming patterns such as iteration [14]. Together, these make SQL unsuitable for tasks such as machine learning, content parsing, and web-graph analysis that increasingly must be run on very large datasets.

The MapReduce system [15] adopted a radically simplified programming abstraction, however even common operations like database Join are tricky to implement in this model. Moreover, it is necessary to embed MapReduce computations in a scripting language in order to execute programs that require more than one reduction or sorting stage. Each MapReduce instantiation is self-contained and no automatic optimizations take place across their boundaries. In addition, the lack of any type-system support or integration between the MapReduce stages requires programmers to explicitly keep track of objects passed between these stages, and may complicate long-term maintenance and re-use of software components.

Several domain-specific languages have appeared on top of the MapReduce abstraction to hide some of this complexity from the programmer, including Sawzall [32], Pig [31], and other unpublished systems such as Facebook's HIVE. These offer a limited hybridization of declarative and imperative programs and generalize SQL's stored-procedure model. Some whole-query optimizations are automatically applied by these systems across MapReduce computation boundaries. However, these approaches inherit many of SQL's disadvantages, adopting simple custom type systems and providing limited support for iterative computations. Their support for optimizations is less advanced than that in DryadLINQ, partly because the underlying MapReduce execution platform is much less flexible than Dryad.

DryadLINQ and systems such as MapReduce are also distinguished from traditional databases [25] by having virtualized expression plans. The planner allocates resources independent of the actual cluster used for execution. This means both that DryadLINQ can run plans requiring many more steps than the instantaneously-available computation resources would permit, and that the computational resources can change dynamically, e.g. due to faults—in essence, we have an extra degree of freedom in buffer management compared with traditional schemes [21, 24, 27, 28, 29]. A downside of virtualization is that it requires intermediate results to be stored to persistent media, potentially increasing computation latency.

This paper makes the following contributions to the literature:

• We have demonstrated a new hybrid of declarative and imperative programming, suitable for large-scale data-parallel computing using a rich object-oriented programming language.

• We have implemented the DryadLINQ system and validated the hypothesis that DryadLINQ programs can be automatically optimized and efficiently executed on large clusters.

• We have designed a small set of operators that improve LINQ's support for coarse-grain parallelization while preserving its programming model.

Section 2 provides a high-level overview of the steps involved when a DryadLINQ program is run. Section 3 discusses LINQ and the extensions to its programming model that comprise DryadLINQ along with simple illustrative examples. Section 4 describes the DryadLINQ implementation and its interaction with the low-level Dryad primitives. In Section 5 we evaluate our system using several example applications at a variety of scales. Section 6 compares DryadLINQ to related work and Section 7 discusses limitations of the system and lessons learned from its development.

2 System Architecture

DryadLINQ compiles LINQ programs into distributed computations running on the Dryad cluster-computing infrastructure [26]. A Dryad job is a directed acyclic graph where each vertex is a program and edges represent data channels. At run time, vertices are processes communicating with each other through the channels, and each channel is used to transport a finite sequence of data records. The data model and serialization are provided by higher-level software layers, in this case DryadLINQ.

Figure 1 illustrates the Dryad system architecture. The execution of a Dryad job is orchestrated by a centralized "job manager." The job manager is responsible for: (1) instantiating a job's dataflow graph; (2) scheduling processes on cluster computers; (3) providing fault-tolerance by re-executing failed or slow processes; (4) monitoring the job and collecting statistics; and (5) transforming the job graph dynamically according to user-supplied policies.

A cluster is typically controlled by a task scheduler, separate from Dryad, which manages a batch queue of jobs and executes a few at a time subject to cluster policy.
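To make this structure concrete, the following fragment models a job graph in C#. This is an illustrative sketch only: the type names are ours and are not part of the Dryad API.

    // Conceptual model of a Dryad job: a DAG whose vertices are
    // programs and whose edges are channels carrying a finite
    // sequence of data records. (Hypothetical types, not Dryad's.)
    class Vertex {
        public string Program;                 // program run by this vertex
        public List<Channel> Outputs = new List<Channel>();
    }
    class Channel {
        public Vertex Source, Destination;     // one edge of the DAG
        // Record serialization on a channel is supplied by the
        // layer above Dryad -- here, DryadLINQ.
    }
    class Job {
        public List<Vertex> Vertices = new List<Vertex>();  // must stay acyclic
    }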

Figure 1: Dryad system architecture. NS is the name server which maintains the cluster membership. The job manager is responsible for spawning vertices (V) on available computers with the help of a remote-execution and monitoring daemon (PD). Vertices exchange data through files, TCP pipes, or shared-memory channels. The grey shape indicates the vertices in the job that are currently running and the correspondence with the job execution graph.

2.1 DryadLINQ Execution Overview

Figure 2 shows the flow of execution when a program is executed by DryadLINQ.

Figure 2: LINQ-expression execution in DryadLINQ.

Step 1. A .NET user application runs. It creates a DryadLINQ expression object. Because of LINQ's deferred evaluation (described in Section 3), the actual execution of the expression has not occurred.

Step 2. The application calls ToDryadTable triggering a data-parallel execution. The expression object is handed to DryadLINQ.

Step 3. DryadLINQ compiles the LINQ expression into a distributed Dryad execution plan. It performs: (a) the decomposition of the expression into subexpressions, each to be run in a separate Dryad vertex; (b) the generation of code and static data for the remote Dryad vertices; and (c) the generation of serialization code for the required data types. Section 4 describes these steps in detail.

Step 4. DryadLINQ invokes a custom, DryadLINQ-specific, Dryad job manager. The job manager may be executed behind a cluster firewall.

Step 5. The job manager creates the job graph using the plan created in Step 3. It schedules and spawns the vertices as resources become available. See Figure 1.

Step 6. Each Dryad vertex executes a vertex-specific program (created in Step 3b).

Step 7. When the Dryad job completes successfully it writes the data to the output table(s).

Step 8. The job manager terminates, and it returns control back to DryadLINQ. DryadLINQ creates the local DryadTable objects encapsulating the outputs of the execution. These objects may be used as inputs to subsequent expressions in the user program. Data objects within a DryadTable output are fetched to the local context only if explicitly dereferenced.

Step 9. Control returns to the user application. The iterator interface over a DryadTable allows the user to read its contents as .NET objects.

Step 10. The application may generate subsequent DryadLINQ expressions, to be executed by a repetition of Steps 2–9.
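In code, this cycle has the following shape. This is a minimal sketch: the LineRecord field and the query itself are illustrative, while GetTable and ToDryadTable are the operators described in Section 3.2.

    // Step 1: compose a LINQ expression; deferred evaluation means
    // nothing has executed yet.
    var lines = GetTable<LineRecord>("file://in.tbl");
    var query = lines.Where(l => l.Line.Length > 0)
                     .GroupBy(l => l.Line)
                     .Select(g => g.Key);

    // Step 2: ToDryadTable hands the expression to DryadLINQ and
    // triggers Steps 3-8 (compilation and distributed execution).
    var output = ToDryadTable(query, "file://out.tbl");

    // Step 9: iterating over the result fetches records into the
    // local context as .NET objects.
    foreach (var key in output) { /* consume results */ }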

3 Programming with DryadLINQ

In this section we highlight some particularly useful and distinctive aspects of DryadLINQ. More details on the programming model may be found in the LINQ language reference [2] and materials on the DryadLINQ project website [1] including a language tutorial. A companion technical report [38] contains listings of some of the sample programs described below.

3.1 LINQ

The term LINQ [2] refers to a set of .NET constructs for manipulating sets and sequences of data items. We describe it here as it applies to C# but DryadLINQ programs have been written in other .NET languages including F#. The power and extensibility of LINQ derive from a set of design choices that allow the programmer to express complex computations over datasets while giving the runtime great leeway to decide how these computations should be implemented.

The base type for a LINQ collection is IEnumerable<T>. From a programmer's perspective, this is an abstract dataset of objects of type T that is accessed using an iterator interface. LINQ also defines the IQueryable<T> interface which is a subtype of IEnumerable<T> and represents an (unevaluated) expression constructed by combining LINQ datasets using LINQ operators. We need make only two observations about these types: (a) in general the programmer neither knows nor cares what concrete type implements any given dataset's IEnumerable interface; and (b) DryadLINQ composes all LINQ expressions into IQueryable objects and defers evaluation until the result is needed, at which point the expression graph within the IQueryable is optimized and executed in its entirety on the cluster. Any IQueryable object can be used as an argument to multiple operators, allowing efficient re-use of common subexpressions.

LINQ expressions are statically strongly typed through use of nested generics, although the compiler hides much of the type-complexity from the user by providing a range of "syntactic sugar." Figure 3 illustrates LINQ's syntax with a fragment of a simple example program that computes the top-ranked results for each query in a stored corpus. Two versions of the same LINQ expression are shown, one using a declarative SQL-like syntax, and the second using the object-oriented style we adopt for more complex programs.

    // SQL-style syntax to join two input sets:
    // scoreTriples and staticRank
    var adjustedScoreTriples =
        from d in scoreTriples
        join r in staticRank on d.docID equals r.key
        select new QueryScoreDocIDTriple(d, r);
    var rankedQueries =
        from s in adjustedScoreTriples
        group s by s.query into g
        select TakeTopQueryResults(g);

    // Object-oriented syntax for the above join
    var adjustedScoreTriples =
        scoreTriples.Join(staticRank,
            d => d.docID, r => r.key,
            (d, r) => new QueryScoreDocIDTriple(d, r));
    var groupedQueries =
        adjustedScoreTriples.GroupBy(s => s.query);
    var rankedQueries =
        groupedQueries.Select(
            g => TakeTopQueryResults(g));

Figure 3: A program fragment illustrating two ways of expressing the same operation. The first uses LINQ's declarative syntax, and the second uses object-oriented interfaces. Statements such as r => r.key use C#'s syntax for anonymous lambda expressions.

The program first performs a Join to "look up" the static rank of each document contained in a scoreTriples tuple and then computes a new rank for that tuple, combining the query-dependent score with the static score inside the constructor for QueryScoreDocIDTriple. The program next groups the resulting tuples by query, and outputs the top-ranked results for each query. The full example program is included in [38].

The second, object-oriented, version of the example illustrates LINQ's use of C#'s lambda expressions. The Join method, for example, takes as arguments a dataset to perform the Join against (in this case staticRank) and three functions. The first two functions describe how to determine the keys that should be used in the Join. The third function describes the Join function itself. Note that the compiler performs static type inference to determine the concrete types of var objects and anonymous lambda expressions so the programmer need not remember (or even know) the type signatures of many subexpressions or helper functions.

3.2 DryadLINQ Constructs

DryadLINQ preserves the LINQ programming model and extends it to data-parallel programming by defining a small set of new operators and datatypes.

The DryadLINQ data model is a distributed implementation of LINQ collections. Datasets may still contain arbitrary .NET types, but each DryadLINQ dataset is in general distributed across the computers of a cluster, partitioned into disjoint pieces as shown in Figure 4. The partitioning strategies used—hash-partitioning, range-partitioning, and round-robin—are familiar from parallel databases [18]. This dataset partitioning is managed transparently by the system unless the programmer explicitly overrides the optimizer's choices.

Figure 4: The DryadLINQ data model: strongly-typed collections of .NET objects partitioned on a set of computers.

The inputs and outputs of a DryadLINQ computation are represented by objects of type DryadTable<T>, which is a subtype of IQueryable<T>. Subtypes of DryadTable<T> support underlying storage providers that include distributed filesystems, collections of NTFS files, and sets of SQL tables. DryadTable objects may include metadata read from the file system describing table properties such as schemas for the data items contained in the table, and partitioning schemes which the DryadLINQ optimizer can use to generate more efficient executions. These optimizations, along with issues such as data serialization and compression, are discussed in Section 4.

The primary restriction imposed by the DryadLINQ system to allow distributed execution is that all the functions called in DryadLINQ expressions must be side-effect free. Shared objects can be referenced and read freely and will be automatically serialized and distributed where necessary. However, if any shared object is modified, the result of the computation is undefined. DryadLINQ does not currently check or enforce the absence of side-effects.

The inputs and outputs of a DryadLINQ computation are specified using the GetTable<T> and ToDryadTable<T> operators, e.g.:

    var input = GetTable<LineRecord>("file://in.tbl");
    var result = MainProgram(input, ...);
    var output = ToDryadTable(result, "file://out.tbl");

Tables are referenced by URI strings that indicate the storage system to use as well as the name of the partitioned dataset. Variants of ToDryadTable can simultaneously invoke multiple expressions and generate multiple output DryadTables in a single distributed Dryad job. This feature (also encountered in parallel databases such as Teradata) can be used to avoid recomputing or materializing common subexpressions.

DryadLINQ offers two data re-partitioning operators: HashPartition<T,K> and RangePartition<T,K>. These operators are needed to enforce a partitioning on an output dataset and they may also be used to override the optimizer's choice of execution plan. From a LINQ perspective, however, they are no-ops since they just reorganize a collection without changing its contents. The system allows the implementation of additional re-partitioning operators, but we have found these two to be sufficient for a wide class of applications.

The remaining new operators are Apply and Fork, which can be thought of as an "escape-hatch" that a programmer can use when a computation is needed that cannot be expressed using any of LINQ's built-in operators. Apply takes a function f and passes to it an iterator over the entire input collection, allowing arbitrary streaming computations. As a simple example, Apply can be used to perform "windowed" computations on a sequence, where the ith entry of the output sequence is a function on the range of input values [i, i + d] for a fixed window of length d. The applications of Apply are much more general than this and we discuss them further in Section 7. The Fork operator is very similar to Apply except that it takes a single input and generates multiple output datasets. This is useful as a performance optimization to eliminate common subcomputations, e.g. to implement a document parser that outputs both plain text and a bibliographic entry to separate tables.

If the DryadLINQ system has no further information about f, Apply (or Fork) will cause all of the computation to be serialized onto a single computer. More often, however, the user supplies annotations on f that indicate conditions under which Apply can be parallelized. The details are too complex to be described in the space available, but quite general "conditional homomorphism" is supported—this means that the application can specify conditions on the partitioning of a dataset under which Apply can be run independently on each partition. DryadLINQ will automatically re-partition the data to match the conditions if necessary.

DryadLINQ allows programmers to specify annotations of various kinds. These provide manual hints to guide optimizations that the system is unable to perform automatically, while preserving the semantics of the program. As mentioned above, the Apply operator makes use of annotations, supplied as simple .NET attributes, to indicate opportunities for parallelization. There are also Resource annotations to discriminate functions that require constant storage from those whose storage grows along with the input collection size—these are used by the optimizer to determine buffering strategies and decide when to pipeline operators in the same process. The programmer may also declare that a dataset has a particular partitioning scheme if the file system does not store sufficient metadata to determine this automatically.

The DryadLINQ optimizer produces good automatic execution plans for most programs composed of standard LINQ operators, and annotations are seldom needed unless an application uses complex Apply statements.
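For instance, the windowed computation mentioned above might be written as in the following sketch, in which we simplify Apply's signature, omit the parallelization annotations, and MovingSum is a user-defined helper with a window of length d = 3.

    // Streaming function: receives an iterator over the whole input
    // and yields the sum of each window of 3 consecutive values.
    public static IEnumerable<double> MovingSum(IEnumerable<double> xs)
    {
        var window = new Queue<double>();
        foreach (var x in xs) {
            window.Enqueue(x);
            if (window.Count > 3) window.Dequeue();
            if (window.Count == 3) yield return window.Sum();
        }
    }

    // Apply streams the entire collection through MovingSum.
    var sums = data.Apply(xs => MovingSum(xs));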
3.3 Building on DryadLINQ

Many programs can be directly written using the DryadLINQ primitives. Nevertheless, we have begun to build libraries of common subroutines for various application domains. The ease of defining and maintaining such libraries using C#'s functions and interfaces highlights the advantages of embedding data-parallel constructs within a high-level language.

The MapReduce programming model from [15] can be compactly stated as follows (eliding the precise type signatures for clarity):

    public static MapReduce( // returns set of Rs
        source,       // set of Ts
        mapper,       // function from T → Ms
        keySelector,  // function from M → K
        reducer       // function from (K,Ms) → Rs
    ) {
        var mapped = source.SelectMany(mapper);
        var groups = mapped.GroupBy(keySelector);
        return groups.SelectMany(reducer);
    }
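As a usage sketch (the input dataset lines and the choice of output pairs are illustrative, not from the paper's examples), a word count can then be phrased as a single call:

    // mapper splits a line into words; each word is its own key; the
    // reducer turns a group of occurrences into one (word, count) pair.
    var counts = MapReduce(
        lines,
        line => line.Split(' '),
        word => word,
        g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });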

Section 4 discusses the execution plan that is automatically generated for such a computation by the DryadLINQ optimizer.

We built a general-purpose library for manipulating numerical data to use as a platform for implementing machine-learning algorithms, some of which are described in Section 5. The applications are written as traditional programs calling into library functions, and make no explicit reference to the distributed nature of the computation. Several of these algorithms need to iterate over a data transformation until convergence. In a traditional database this would require support for recursive expressions, which are tricky to implement [14]; in DryadLINQ it is trivial to use a C# loop to express the iteration. The companion technical report [38] contains annotated source for some of these algorithms.

4 System Implementation

This section describes the DryadLINQ parallelizing compiler. We focus on the generation, optimization, and execution of the distributed execution plan, corresponding to step 3 in Figure 2. The DryadLINQ optimizer is similar in many respects to classical database optimizers [25]. It has a static component, which generates an execution plan, and a dynamic component, which uses Dryad policy plug-ins to optimize the graph at run time.

4.1 Execution Plan Graph

When it receives control, DryadLINQ starts by converting the raw LINQ expression into an execution plan graph (EPG), where each node is an operator and edges represent its inputs and outputs. The EPG is closely related to a traditional database query plan, but we use the more general terminology of execution plan to encompass computations that are not easily formulated as "queries." The EPG is a directed acyclic graph—the existence of common subexpressions and operators like Fork means that EPGs cannot always be described by trees. DryadLINQ then applies term-rewriting optimizations on the EPG. The EPG is a "skeleton" of the Dryad data-flow graph that will be executed, and each EPG node is replicated at run time to generate a Dryad "stage" (a collection of vertices running the same computation on different partitions of a dataset). The optimizer annotates the EPG with metadata properties. For edges, these include the .NET type of the data and the compression scheme, if any, used after serialization. For nodes, they include details of the partitioning scheme used, and ordering information within each partition. The output of a node, for example, might be a dataset that is hash-partitioned by a particular key, and sorted according to that key within each partition; this information can be used by subsequent OrderBy nodes to choose an appropriate distributed sort algorithm as described below in Section 4.2.3. The properties are seeded from the LINQ expression tree and the input and output tables' metadata, and propagated and updated during EPG rewriting.

Propagating these properties is substantially harder in the context of DryadLINQ than for a traditional database. The difficulties stem from the much richer data model and expression language. Consider one of the simplest operations: input.Select(x => f(x)). If f is a simple expression, e.g. x.name, then it is straightforward for DryadLINQ to determine which properties can be propagated. However, for arbitrary f it is in general impossible to determine whether this transformation preserves the partitioning properties of the input.

Fortunately, DryadLINQ can usually infer properties in the programs typical users write. Partition and sort key properties are stored as expressions, and it is often feasible to compare these for equality using a combination of static typing, static analysis, and reflection. The system also provides a simple mechanism that allows users to assert properties of an expression when they cannot be determined automatically.

4.2 DryadLINQ Optimizations

DryadLINQ performs both static and dynamic optimizations. The static optimizations are currently greedy heuristics, although in the future we may implement cost-based optimizations as used in traditional databases. The dynamic optimizations are applied during Dryad job execution, and consist in rewriting the job graph depending on run-time data statistics. Our optimizations are sound in that a failure to compute properties simply results in an inefficient, though correct, execution plan.

4.2.1 Static Optimizations

DryadLINQ's static optimizations are conditional graph rewriting rules triggered by a predicate on EPG node properties. Most of the static optimizations are focused on minimizing disk and network I/O. The most important are:

Pipelining: Multiple operators may be executed in a single process. The pipelined processes are themselves LINQ expressions and can be executed by an existing single-computer LINQ implementation.

Removing redundancy: DryadLINQ removes unnecessary hash- or range-partitioning steps.

Eager Aggregation: Since re-partitioning datasets is expensive, down-stream aggregations are moved in front of partitioning operators where possible.

I/O reduction: Where possible, DryadLINQ uses Dryad's TCP-pipe and in-memory FIFO channels instead of persisting temporary data to files. The system by default compresses data before performing a partitioning, to reduce network traffic. Users can manually override compression settings to balance CPU usage with network load if the optimizer makes a poor decision.
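As an illustration of how the Pipelining and Removing redundancy rules can fire on a query (a hypothetical fragment: the record fields, and the exact HashPartition call syntax, are ours):

    // If "records" is already hash-partitioned by Key, the explicit
    // HashPartition step is removed as redundant, and the Where and
    // Select are pipelined into a single vertex on each partition.
    var cleaned = records
        .HashPartition(r => r.Key)
        .Where(r => r.Count > 0)
        .Select(r => new { r.Key, r.Count });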

4.2.2 Dynamic Optimizations

Figure 5: Distributed sort optimization described in Section 4.2.3. Transformation (1) is static, while (2) and (3) are dynamic.

DryadLINQ makes use of hooks in the Dryad API to dynamically mutate the execution graph as information from the running job becomes available. Aggregation gives a major opportunity for I/O reduction since it can be optimized into a tree according to locality, aggregating data first at the computer level, next at the rack level, and finally at the cluster level. The topology of such an aggregation tree can only be computed at run time, since it is dependent on the dynamic scheduling decisions which allocate vertices to computers. DryadLINQ automatically uses the dynamic-aggregation logic present in Dryad [26].

Dynamic data partitioning sets the number of vertices in each stage (i.e., the number of partitions of each dataset) at run time based on the size of its input data. Traditional databases usually estimate dataset sizes statically, but these estimates can be very inaccurate, for example in the presence of correlated queries. DryadLINQ supports dynamic hash and range partitions—for range partitions both the number of partitions and the partitioning key ranges are determined at run time by sampling the input dataset.

Figure 6: Execution plan for MapReduce, described in Section 4.2.4. Step (1) is static, (2) and (3) are dynamic based on the volume and location of the data in the inputs.

4.2.3 Optimizations for OrderBy

DryadLINQ's logic for sorting a dataset d illustrates many of the static and dynamic optimizations available to the system. Different strategies are adopted depending on d's initial partitioning and ordering. Figure 5 shows the evolution of an OrderBy node O in the most complex case, where d is not already range-partitioned by the correct sort key, nor are its partitions individually ordered by the key. First, the dataset must be re-partitioned. The DS stage performs deterministic sampling of the input dataset. The samples are aggregated by a histogram vertex H, which determines the partition keys as a function of data distribution (load-balancing the computation in the next stage). The D vertices perform the actual re-partitioning, based on the key ranges computed by H. Next, a merge node M interleaves the inputs, and an S node sorts them. M and S are pipelined in a single process, and communicate using iterators. The number of partitions in the DS+H+D stage is chosen at run time based on the number of partitions in the preceding computation, and the number of partitions in the M+S stage is chosen based on the volume of data to be sorted (transitions (2) and (3) in Figure 5).
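The same strategy can be sketched on a single machine for integer keys. This is illustrative only (and assumes the input is large enough for the sample to be non-empty); in DryadLINQ each phase below runs as a stage of distributed vertices.

    int k = 4;  // number of range partitions (chosen dynamically in DryadLINQ)

    // DS: deterministic sampling. H: aggregate the samples and pick
    // k-1 boundaries that balance the key ranges.
    var sample = data.Where((x, i) => i % 100 == 0).OrderBy(x => x).ToArray();
    var bounds = Enumerable.Range(1, k - 1)
                           .Select(i => sample[i * sample.Length / k])
                           .ToArray();

    // D: route each record to the partition covering its key range.
    // M+S: sort each partition independently; concatenating the sorted
    // partitions in range order then yields a total order.
    var sorted = data
        .GroupBy(x => bounds.Count(b => x >= b))  // partition index, 0..k-1
        .OrderBy(g => g.Key)
        .SelectMany(g => g.OrderBy(x => x));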

4.2.4 Execution Plan for MapReduce

This section analyzes the execution plan generated by DryadLINQ for the MapReduce computation from Section 3.3. Here, we examine only the case when the input to GroupBy is not ordered and the reduce function is determined to be associative and commutative. The automatically generated execution plan is shown in Figure 6. The plan is statically transformed (1) into a Map and a Reduce stage. The Map stage applies the SelectMany operator (SM) and then sorts each partition (S), performs a local GroupBy (G) and finally a local reduction (R). The D nodes perform a hash-partition. All these operations are pipelined together in a single process. The Reduce stage first merge-sorts all the incoming data streams (MS). This is followed by another GroupBy (G) and the final reduction (R). All these Reduce stage operators are pipelined in a single process along with the subsequent operation in the computation (X). As with the sort plan in Section 4.2.3, at run time (2) the number of Map instances is automatically determined using the number of input partitions, and the number of Reduce instances is chosen based on the total volume of data to be Reduced. If necessary, DryadLINQ will insert a dynamic aggregation tree (3) to reduce the amount of data that crosses the network. In the figure, for example, the two rightmost input partitions were processed on the same computer, and their outputs have been pre-aggregated locally before being transferred across the network and combined with the output of the leftmost partition.

The resulting execution plan is very similar to the manually constructed plans reported for Google's MapReduce [15] and the Dryad histogram computation in [26]. The crucial point to note is that in DryadLINQ MapReduce is not a primitive, hard-wired operation, and all user-specified computations gain the benefits of these powerful automatic optimization strategies.

4.3 Code Generation

The EPG is used to derive the Dryad execution plan after the static optimization phase. While the EPG encodes all the required information, it is not a runnable program. DryadLINQ uses dynamic code generation to automatically synthesize LINQ code to be run at the Dryad vertices. The generated code is compiled into a .NET assembly that is shipped to cluster computers at execution time. For each execution-plan stage, the assembly con-

4.4 Leveraging Other LINQ Providers

One of the greatest benefits of using the LINQ framework is that DryadLINQ can leverage other systems that use the same constructs. DryadLINQ currently gains most from the use of PLINQ [19], which allows us to run the code within each vertex in parallel on a multi-core server. PLINQ, like DryadLINQ, attempts to make the process of parallelizing a LINQ program as transparent as possible, though the systems' implementation strategies are quite different. Space does not permit a detailed explanation, but PLINQ employs the iterator model [25] since it is better suited to fine-grain concurrency in a shared-memory multi-processor system. Because both PLINQ and DryadLINQ use expressions composed from the same LINQ constructs, it is straightforward to combine their functionality. DryadLINQ's vertices execute LINQ expressions, and in general the addition by the DryadLINQ code generator of a single line to the vertex's program triggers the use of PLINQ, allowing the vertex to exploit all the cores in a cluster computer. We note that this remarkable fact stems directly from the careful design choices that underpin LINQ.

We have also added interoperation with the LINQ-to-SQL system which lets DryadLINQ vertices directly access data stored in SQL databases. Running a database on each cluster computer and storing tables partitioned across these databases may be much more efficient than using flat disk files.
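Returning to the PLINQ integration above, the idea can be illustrated as follows. This is a sketch rather than the actual generated vertex code, and Test and Transform stand for arbitrary user functions.

    // Sequential evaluation of a vertex's expression over its input partition:
    var result = partition.Where(r => Test(r)).Select(r => Transform(r));

    // Adding the single AsParallel() call hands the same expression to
    // PLINQ, which spreads the work across the computer's cores:
    var result2 = partition.AsParallel()
                           .Where(r => Test(r))
                           .Select(r => Transform(r));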