DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson1, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon Valley
1joint affiliation, Reykjavík University, Iceland

Abstract

DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.

A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan.

We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained—a general-purpose sort of 10^12 Bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster—as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.

1 Introduction

The DryadLINQ system is designed to make it easy for a wide variety of developers to compute effectively on large amounts of data. DryadLINQ programs are written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects. The main contribution of this paper is a set of language extensions and a corresponding system that can automatically and transparently compile imperative programs in a general-purpose language into distributed computations that execute efficiently on large computing clusters.

Our goal is to give the programmer the illusion of writing for a single computer and to have the system deal with the complexities that arise from scheduling, distribution, and fault-tolerance. Achieving this goal requires a wide variety of components to interact, including cluster-management software, distributed-execution middleware, language constructs, and development tools. Traditional parallel databases (which we survey in Section 6.1) as well as more recent data-processing systems such as MapReduce [15] and Dryad [26] demonstrate that it is possible to implement high-performance large-scale execution engines at modest financial cost, and clusters running such platforms are proliferating. Even so, their programming interfaces all leave room for improvement. We therefore believe that the language issues addressed in this paper are currently among the most pressing research areas for data-intensive computing, and our work on the DryadLINQ system stems from this belief.

DryadLINQ exploits LINQ (Language INtegrated Query [2], a set of .NET constructs for programming with datasets) to provide a powerful hybrid of declarative and imperative programming. The system is designed to provide flexible and efficient distributed computation in any LINQ-enabled programming language including C#, VB, and F#. Objects in DryadLINQ datasets can be of any .NET type, making it easy to compute with data such as image patches, vectors, and matrices. DryadLINQ programs can use traditional structuring constructs such as functions, modules, and libraries, and express iteration using standard loops. Crucially, the distributed execution layer employs a fully functional, declarative description of the data-parallel component of the computation, which enables sophisticated rewritings and optimizations like those traditionally employed by parallel databases.

In contrast, parallel databases implement only declarative variants of SQL queries. There is by now a widespread belief that SQL is too limited for many applications [15, 26, 31, 34, 35]. One problem is that, in order to support database requirements such as in-place updates and efficient transactions, SQL adopts a very restrictive type system. In addition, the declarative "query-oriented" nature of SQL makes it difficult to express common programming patterns such as iteration [14]. Together, these make SQL unsuitable for tasks such as machine learning, content parsing, and web-graph analysis that increasingly must be run on very large datasets.

The MapReduce system [15] adopted a radically simplified programming abstraction, however even common operations like database Join are tricky to implement in this model. Moreover, it is necessary to embed MapReduce computations in a scripting language in order to execute programs that require more than one reduction or sorting stage. Each MapReduce instantiation is self-contained and no automatic optimizations take place across their boundaries. In addition, the lack of any type-system support or integration between the MapReduce stages requires programmers to explicitly keep track of objects passed between these stages, and may complicate long-term maintenance and re-use of software components.

Several domain-specific languages have appeared on top of the MapReduce abstraction to hide some of this complexity from the programmer, including Sawzall [32], Pig [31], and other unpublished systems such as Facebook's HIVE. These offer a limited hybridization of declarative and imperative programs and generalize SQL's stored-procedure model. Some whole-query optimizations are automatically applied by these systems across MapReduce computation boundaries. However, these approaches inherit many of SQL's disadvantages, adopting simple custom type systems and providing limited support for iterative computations. Their support for optimizations is less advanced than that in DryadLINQ, partly because the underlying MapReduce execution platform is much less flexible than Dryad.

DryadLINQ and systems such as MapReduce are also distinguished from traditional databases [25] by having virtualized expression plans. The planner allocates resources independent of the actual cluster used for execution. This means both that DryadLINQ can run plans requiring many more steps than the instantaneously-available computation resources would permit, and that the computational resources can change dynamically, e.g. due to faults—in essence, we have an extra degree of freedom in buffer management compared with traditional schemes [21, 24, 27, 28, 29]. A downside of virtualization is that it requires intermediate results to be stored to persistent media, potentially increasing computation latency.

This paper makes the following contributions to the literature:

• We have demonstrated a new hybrid of declarative and imperative programming, suitable for large-scale data-parallel computing using a rich object-oriented programming language.

• We have implemented the DryadLINQ system and validated the hypothesis that DryadLINQ programs can be automatically optimized and efficiently executed on large clusters.

• We have designed a small set of operators that improve LINQ's support for coarse-grain parallelization while preserving its programming model.

Section 2 provides a high-level overview of the steps involved when a DryadLINQ program is run. Section 3 discusses LINQ and the extensions to its programming model that comprise DryadLINQ along with simple illustrative examples. Section 4 describes the DryadLINQ implementation and its interaction with the low-level Dryad primitives. In Section 5 we evaluate our system using several example applications at a variety of scales. Section 6 compares DryadLINQ to related work and Section 7 discusses limitations of the system and lessons learned from its development.

2 System Architecture

DryadLINQ compiles LINQ programs into distributed computations running on the Dryad cluster-computing infrastructure [26]. A Dryad job is a directed acyclic graph where each vertex is a program and edges represent data channels. At run time, vertices are processes communicating with each other through the channels, and each channel is used to transport a finite sequence of data records. The data model and serialization are provided by higher-level software layers, in this case DryadLINQ.

Figure 1 illustrates the Dryad system architecture. The execution of a Dryad job is orchestrated by a centralized "job manager." The job manager is responsible for: (1) instantiating a job's dataflow graph; (2) scheduling processes on cluster computers; (3) providing fault-tolerance by re-executing failed or slow processes; (4) monitoring the job and collecting statistics; and (5) transforming the job graph dynamically according to user-supplied policies.

A cluster is typically controlled by a task scheduler, separate from Dryad, which manages a batch queue of jobs and executes a few at a time subject to cluster policy.
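To make this structure concrete, the following fragment models a job graph in C#. This is an illustrative sketch only: the type names are ours and are not part of the Dryad API.

    // Conceptual model of a Dryad job: a DAG whose vertices are
    // programs and whose edges are channels carrying a finite
    // sequence of data records. (Hypothetical types, not Dryad's.)
    class Vertex {
        public string Program;                 // program run by this vertex
        public List<Channel> Outputs = new List<Channel>();
    }
    class Channel {
        public Vertex Source, Destination;     // one edge of the DAG
        // Record serialization on a channel is supplied by the
        // layer above Dryad -- here, DryadLINQ.
    }
    class Job {
        public List<Vertex> Vertices = new List<Vertex>();  // must stay acyclic
    }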

Figure 1: Dryad system architecture. NS is the name server which maintains the cluster membership. The job manager is responsible for spawning vertices (V) on available computers with the help of a remote-execution and monitoring daemon (PD). Vertices exchange data through files, TCP pipes, or shared-memory channels. The grey shape indicates the vertices in the job that are currently running and the correspondence with the job execution graph.

2.1 DryadLINQ Execution Overview

Figure 2 shows the flow of execution when a program is executed by DryadLINQ.

Figure 2: LINQ-expression execution in DryadLINQ.

Step 1. A .NET user application runs. It creates a DryadLINQ expression object. Because of LINQ's deferred evaluation (described in Section 3), the actual execution of the expression has not occurred.

Step 2. The application calls ToDryadTable triggering a data-parallel execution. The expression object is handed to DryadLINQ.

Step 3. DryadLINQ compiles the LINQ expression into a distributed Dryad execution plan. It performs: (a) the decomposition of the expression into subexpressions, each to be run in a separate Dryad vertex; (b) the generation of code and static data for the remote Dryad vertices; and (c) the generation of serialization code for the required data types. Section 4 describes these steps in detail.

Step 4. DryadLINQ invokes a custom, DryadLINQ-specific, Dryad job manager. The job manager may be executed behind a cluster firewall.

Step 5. The job manager creates the job graph using the plan created in Step 3. It schedules and spawns the vertices as resources become available. See Figure 1.

Step 6. Each Dryad vertex executes a vertex-specific program (created in Step 3b).

Step 7. When the Dryad job completes successfully it writes the data to the output table(s).

Step 8. The job manager terminates, and it returns control back to DryadLINQ. DryadLINQ creates the local DryadTable objects encapsulating the outputs of the execution. These objects may be used as inputs to subsequent expressions in the user program. Data objects within a DryadTable output are fetched to the local context only if explicitly dereferenced.

Step 9. Control returns to the user application. The iterator interface over a DryadTable allows the user to read its contents as .NET objects.

Step 10. The application may generate subsequent DryadLINQ expressions, to be executed by a repetition of Steps 2–9.
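In code, this cycle has the following shape. This is a minimal sketch: the LineRecord field and the query itself are illustrative, while GetTable and ToDryadTable are the operators described in Section 3.2.

    // Step 1: compose a LINQ expression; deferred evaluation means
    // nothing has executed yet.
    var lines = GetTable<LineRecord>("file://in.tbl");
    var query = lines.Where(l => l.Line.Length > 0)
                     .GroupBy(l => l.Line)
                     .Select(g => g.Key);

    // Step 2: ToDryadTable hands the expression to DryadLINQ and
    // triggers Steps 3-8 (compilation and distributed execution).
    var output = ToDryadTable(query, "file://out.tbl");

    // Step 9: iterating over the result fetches records into the
    // local context as .NET objects.
    foreach (var key in output) { /* consume results */ }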

3 Programming with DryadLINQ

In this section we highlight some particularly useful and distinctive aspects of DryadLINQ. More details on the programming model may be found in the LINQ language reference [2] and materials on the DryadLINQ project website [1] including a language tutorial. A companion technical report [38] contains listings of some of the sample programs described below.

3.1 LINQ

The term LINQ [2] refers to a set of .NET constructs for manipulating sets and sequences of data items. We describe it here as it applies to C# but DryadLINQ programs have been written in other .NET languages including F#. The power and extensibility of LINQ derive from a set of design choices that allow the programmer to express complex computations over datasets while giving the runtime great leeway to decide how these computations should be implemented.

The base type for a LINQ collection is IEnumerable<T>. From a programmer's perspective, this is an abstract dataset of objects of type T that is accessed using an iterator interface. LINQ also defines the IQueryable<T> interface which is a subtype of IEnumerable<T> and represents an (unevaluated) expression constructed by combining LINQ datasets using LINQ operators. We need make only two observations about these types: (a) in general the programmer neither knows nor cares what concrete type implements any given dataset's IEnumerable interface; and (b) DryadLINQ composes all LINQ expressions into IQueryable objects and defers evaluation until the result is needed, at which point the expression graph within the IQueryable is optimized and executed in its entirety on the cluster. Any IQueryable object can be used as an argument to multiple operators, allowing efficient re-use of common subexpressions.

LINQ expressions are statically strongly typed through use of nested generics, although the compiler hides much of the type-complexity from the user by providing a range of "syntactic sugar." Figure 3 illustrates LINQ's syntax with a fragment of a simple example program that computes the top-ranked results for each query in a stored corpus. Two versions of the same LINQ expression are shown, one using a declarative SQL-like syntax, and the second using the object-oriented style we adopt for more complex programs.

    // SQL-style syntax to join two input sets:
    // scoreTriples and staticRank
    var adjustedScoreTriples =
        from d in scoreTriples
        join r in staticRank on d.docID equals r.key
        select new QueryScoreDocIDTriple(d, r);
    var rankedQueries =
        from s in adjustedScoreTriples
        group s by s.query into g
        select TakeTopQueryResults(g);

    // Object-oriented syntax for the above join
    var adjustedScoreTriples =
        scoreTriples.Join(staticRank,
            d => d.docID, r => r.key,
            (d, r) => new QueryScoreDocIDTriple(d, r));
    var groupedQueries =
        adjustedScoreTriples.GroupBy(s => s.query);
    var rankedQueries =
        groupedQueries.Select(
            g => TakeTopQueryResults(g));

Figure 3: A program fragment illustrating two ways of expressing the same operation. The first uses LINQ's declarative syntax, and the second uses object-oriented interfaces. Statements such as r => r.key use C#'s syntax for anonymous lambda expressions.

The program first performs a Join to "look up" the static rank of each document contained in a scoreTriples tuple and then computes a new rank for that tuple, combining the query-dependent score with the static score inside the constructor for QueryScoreDocIDTriple. The program next groups the resulting tuples by query, and outputs the top-ranked results for each query. The full example program is included in [38].

The second, object-oriented, version of the example illustrates LINQ's use of C#'s lambda expressions. The Join method, for example, takes as arguments a dataset to perform the Join against (in this case staticRank) and three functions. The first two functions describe how to determine the keys that should be used in the Join. The third function describes the Join function itself. Note that the compiler performs static type inference to determine the concrete types of var objects and anonymous lambda expressions so the programmer need not remember (or even know) the type signatures of many subexpressions or helper functions.

3.2 DryadLINQ Constructs

DryadLINQ preserves the LINQ programming model and extends it to data-parallel programming by defining a small set of new operators and datatypes.

The DryadLINQ data model is a distributed implementation of LINQ collections. Datasets may still contain arbitrary .NET types, but each DryadLINQ dataset is in general distributed across the computers of a cluster, partitioned into disjoint pieces as shown in Figure 4. The partitioning strategies used—hash-partitioning, range-partitioning, and round-robin—are familiar from parallel databases [18]. This dataset partitioning is managed transparently by the system unless the programmer explicitly overrides the optimizer's choices.

Figure 4: The DryadLINQ data model: strongly-typed collections of .NET objects partitioned on a set of computers.

The inputs and outputs of a DryadLINQ computation are represented by objects of type DryadTable<T>, which is a subtype of IQueryable<T>. Subtypes of DryadTable<T> support underlying storage providers that include distributed filesystems, collections of NTFS files, and sets of SQL tables. DryadTable objects may include metadata read from the file system describing table properties such as schemas for the data items contained in the table, and partitioning schemes which the DryadLINQ optimizer can use to generate more efficient executions. These optimizations, along with issues such as data serialization and compression, are discussed in Section 4.

The primary restriction imposed by the DryadLINQ system to allow distributed execution is that all the functions called in DryadLINQ expressions must be side-effect free. Shared objects can be referenced and read freely and will be automatically serialized and distributed where necessary. However, if any shared object is modified, the result of the computation is undefined. DryadLINQ does not currently check or enforce the absence of side-effects.

The inputs and outputs of a DryadLINQ computation are specified using the GetTable<T> and ToDryadTable<T> operators, e.g.:

    var input = GetTable<LineRecord>("file://in.tbl");
    var result = MainProgram(input, ...);
    var output = ToDryadTable(result, "file://out.tbl");

Tables are referenced by URI strings that indicate the storage system to use as well as the name of the partitioned dataset. Variants of ToDryadTable can simultaneously invoke multiple expressions and generate multiple output DryadTables in a single distributed Dryad job. This feature (also encountered in parallel databases such as Teradata) can be used to avoid recomputing or materializing common subexpressions.

DryadLINQ offers two data re-partitioning operators: HashPartition<T,K> and RangePartition<T,K>. These operators are needed to enforce a partitioning on an output dataset and they may also be used to override the optimizer's choice of execution plan. From a LINQ perspective, however, they are no-ops since they just reorganize a collection without changing its contents. The system allows the implementation of additional re-partitioning operators, but we have found these two to be sufficient for a wide class of applications.

The remaining new operators are Apply and Fork, which can be thought of as an "escape-hatch" that a programmer can use when a computation is needed that cannot be expressed using any of LINQ's built-in operators. Apply takes a function f and passes to it an iterator over the entire input collection, allowing arbitrary streaming computations. As a simple example, Apply can be used to perform "windowed" computations on a sequence, where the ith entry of the output sequence is a function on the range of input values [i, i + d] for a fixed window of length d. The applications of Apply are much more general than this and we discuss them further in Section 7. The Fork operator is very similar to Apply except that it takes a single input and generates multiple output datasets. This is useful as a performance optimization to eliminate common subcomputations, e.g. to implement a document parser that outputs both plain text and a bibliographic entry to separate tables.

If the DryadLINQ system has no further information about f, Apply (or Fork) will cause all of the computation to be serialized onto a single computer. More often, however, the user supplies annotations on f that indicate conditions under which Apply can be parallelized. The details are too complex to be described in the space available, but quite general "conditional homomorphism" is supported—this means that the application can specify conditions on the partitioning of a dataset under which Apply can be run independently on each partition. DryadLINQ will automatically re-partition the data to match the conditions if necessary.

DryadLINQ allows programmers to specify annotations of various kinds. These provide manual hints to guide optimizations that the system is unable to perform automatically, while preserving the semantics of the program. As mentioned above, the Apply operator makes use of annotations, supplied as simple .NET attributes, to indicate opportunities for parallelization. There are also Resource annotations to discriminate functions that require constant storage from those whose storage grows along with the input collection size—these are used by the optimizer to determine buffering strategies and decide when to pipeline operators in the same process. The programmer may also declare that a dataset has a particular partitioning scheme if the file system does not store sufficient metadata to determine this automatically.

The DryadLINQ optimizer produces good automatic execution plans for most programs composed of standard LINQ operators, and annotations are seldom needed unless an application uses complex Apply statements.
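For instance, the windowed computation mentioned above might be written as in the following sketch, in which we simplify Apply's signature, omit the parallelization annotations, and MovingSum is a user-defined helper with a window of length d = 3.

    // Streaming function: receives an iterator over the whole input
    // and yields the sum of each window of 3 consecutive values.
    public static IEnumerable<double> MovingSum(IEnumerable<double> xs)
    {
        var window = new Queue<double>();
        foreach (var x in xs) {
            window.Enqueue(x);
            if (window.Count > 3) window.Dequeue();
            if (window.Count == 3) yield return window.Sum();
        }
    }

    // Apply streams the entire collection through MovingSum.
    var sums = data.Apply(xs => MovingSum(xs));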
3.3 Building on DryadLINQ

Many programs can be directly written using the DryadLINQ primitives. Nevertheless, we have begun to build libraries of common subroutines for various application domains. The ease of defining and maintaining such libraries using C#'s functions and interfaces highlights the advantages of embedding data-parallel constructs within a high-level language.

The MapReduce programming model from [15] can be compactly stated as follows (eliding the precise type signatures for clarity):

    public static MapReduce( // returns set of Rs
        source,       // set of Ts
        mapper,       // function from T → Ms
        keySelector,  // function from M → K
        reducer       // function from (K,Ms) → Rs
    ) {
        var mapped = source.SelectMany(mapper);
        var groups = mapped.GroupBy(keySelector);
        return groups.SelectMany(reducer);
    }
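As a usage sketch (the input dataset lines and the choice of output pairs are illustrative, not from the paper's examples), a word count can then be phrased as a single call:

    // mapper splits a line into words; each word is its own key; the
    // reducer turns a group of occurrences into one (word, count) pair.
    var counts = MapReduce(
        lines,
        line => line.Split(' '),
        word => word,
        g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });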

Section 4 discusses the execution plan that is automatically generated for such a computation by the DryadLINQ optimizer.

We built a general-purpose library for manipulating numerical data to use as a platform for implementing machine-learning algorithms, some of which are described in Section 5. The applications are written as traditional programs calling into library functions, and make no explicit reference to the distributed nature of the computation. Several of these algorithms need to iterate over a data transformation until convergence. In a traditional database this would require support for recursive expressions, which are tricky to implement [14]; in DryadLINQ it is trivial to use a C# loop to express the iteration. The companion technical report [38] contains annotated source for some of these algorithms.

4 System Implementation

This section describes the DryadLINQ parallelizing compiler. We focus on the generation, optimization, and execution of the distributed execution plan, corresponding to step 3 in Figure 2. The DryadLINQ optimizer is similar in many respects to classical database optimizers [25]. It has a static component, which generates an execution plan, and a dynamic component, which uses Dryad policy plug-ins to optimize the graph at run time.

4.1 Execution Plan Graph

When it receives control, DryadLINQ starts by converting the raw LINQ expression into an execution plan graph (EPG), where each node is an operator and edges represent its inputs and outputs. The EPG is closely related to a traditional database query plan, but we use the more general terminology of execution plan to encompass computations that are not easily formulated as "queries." The EPG is a directed acyclic graph—the existence of common subexpressions and operators like Fork means that EPGs cannot always be described by trees. DryadLINQ then applies term-rewriting optimizations on the EPG. The EPG is a "skeleton" of the Dryad data-flow graph that will be executed, and each EPG node is replicated at run time to generate a Dryad "stage" (a collection of vertices running the same computation on different partitions of a dataset). The optimizer annotates the EPG with metadata properties. For edges, these include the .NET type of the data and the compression scheme, if any, used after serialization. For nodes, they include details of the partitioning scheme used, and ordering information within each partition. The output of a node, for example, might be a dataset that is hash-partitioned by a particular key, and sorted according to that key within each partition; this information can be used by subsequent OrderBy nodes to choose an appropriate distributed sort algorithm as described below in Section 4.2.3. The properties are seeded from the LINQ expression tree and the input and output tables' metadata, and propagated and updated during EPG rewriting.

Propagating these properties is substantially harder in the context of DryadLINQ than for a traditional database. The difficulties stem from the much richer data model and expression language. Consider one of the simplest operations: input.Select(x => f(x)). If f is a simple expression, e.g. x.name, then it is straightforward for DryadLINQ to determine which properties can be propagated. However, for arbitrary f it is in general impossible to determine whether this transformation preserves the partitioning properties of the input.

Fortunately, DryadLINQ can usually infer properties in the programs typical users write. Partition and sort key properties are stored as expressions, and it is often feasible to compare these for equality using a combination of static typing, static analysis, and reflection. The system also provides a simple mechanism that allows users to assert properties of an expression when they cannot be determined automatically.

4.2 DryadLINQ Optimizations

DryadLINQ performs both static and dynamic optimizations. The static optimizations are currently greedy heuristics, although in the future we may implement cost-based optimizations as used in traditional databases. The dynamic optimizations are applied during Dryad job execution, and consist in rewriting the job graph depending on run-time data statistics. Our optimizations are sound in that a failure to compute properties simply results in an inefficient, though correct, execution plan.

4.2.1 Static Optimizations

DryadLINQ's static optimizations are conditional graph rewriting rules triggered by a predicate on EPG node properties. Most of the static optimizations are focused on minimizing disk and network I/O. The most important are:

Pipelining: Multiple operators may be executed in a single process. The pipelined processes are themselves LINQ expressions and can be executed by an existing single-computer LINQ implementation.

Removing redundancy: DryadLINQ removes unnecessary hash- or range-partitioning steps.

Eager Aggregation: Since re-partitioning datasets is expensive, down-stream aggregations are moved in front of partitioning operators where possible.

I/O reduction: Where possible, DryadLINQ uses Dryad's TCP-pipe and in-memory FIFO channels instead of persisting temporary data to files. The system by default compresses data before performing a partitioning, to reduce network traffic. Users can manually override compression settings to balance CPU usage with network load if the optimizer makes a poor decision.
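As an illustration of how the Pipelining and Removing redundancy rules can fire on a query (a hypothetical fragment: the record fields, and the exact HashPartition call syntax, are ours):

    // If "records" is already hash-partitioned by Key, the explicit
    // HashPartition step is removed as redundant, and the Where and
    // Select are pipelined into a single vertex on each partition.
    var cleaned = records
        .HashPartition(r => r.Key)
        .Where(r => r.Count > 0)
        .Select(r => new { r.Key, r.Count });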

4.2.2 Dynamic Optimizations

Figure 5: Distributed sort optimization described in Section 4.2.3. Transformation (1) is static, while (2) and (3) are dynamic.

DryadLINQ makes use of hooks in the Dryad API to dynamically mutate the execution graph as information from the running job becomes available. Aggregation gives a major opportunity for I/O reduction since it can be optimized into a tree according to locality, aggregating data first at the computer level, next at the rack level, and finally at the cluster level. The topology of such an aggregation tree can only be computed at run time, since it is dependent on the dynamic scheduling decisions which allocate vertices to computers. DryadLINQ automatically uses the dynamic-aggregation logic present in Dryad [26].

Dynamic data partitioning sets the number of vertices in each stage (i.e., the number of partitions of each dataset) at run time based on the size of its input data. Traditional databases usually estimate dataset sizes statically, but these estimates can be very inaccurate, for example in the presence of correlated queries. DryadLINQ supports dynamic hash and range partitions—for range partitions both the number of partitions and the partitioning key ranges are determined at run time by sampling the input dataset.

Figure 6: Execution plan for MapReduce, described in Section 4.2.4. Step (1) is static, (2) and (3) are dynamic based on the volume and location of the data in the inputs.

4.2.3 Optimizations for OrderBy

DryadLINQ's logic for sorting a dataset d illustrates many of the static and dynamic optimizations available to the system. Different strategies are adopted depending on d's initial partitioning and ordering. Figure 5 shows the evolution of an OrderBy node O in the most complex case, where d is not already range-partitioned by the correct sort key, nor are its partitions individually ordered by the key. First, the dataset must be re-partitioned. The DS stage performs deterministic sampling of the input dataset. The samples are aggregated by a histogram vertex H, which determines the partition keys as a function of data distribution (load-balancing the computation in the next stage). The D vertices perform the actual re-partitioning, based on the key ranges computed by H. Next, a merge node M interleaves the inputs, and an S node sorts them. M and S are pipelined in a single process, and communicate using iterators. The number of partitions in the DS+H+D stage is chosen at run time based on the number of partitions in the preceding computation, and the number of partitions in the M+S stage is chosen based on the volume of data to be sorted (transitions (2) and (3) in Figure 5).
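The same strategy can be sketched on a single machine for integer keys. This is illustrative only (and assumes the input is large enough for the sample to be non-empty); in DryadLINQ each phase below runs as a stage of distributed vertices.

    int k = 4;  // number of range partitions (chosen dynamically in DryadLINQ)

    // DS: deterministic sampling. H: aggregate the samples and pick
    // k-1 boundaries that balance the key ranges.
    var sample = data.Where((x, i) => i % 100 == 0).OrderBy(x => x).ToArray();
    var bounds = Enumerable.Range(1, k - 1)
                           .Select(i => sample[i * sample.Length / k])
                           .ToArray();

    // D: route each record to the partition covering its key range.
    // M+S: sort each partition independently; concatenating the sorted
    // partitions in range order then yields a total order.
    var sorted = data
        .GroupBy(x => bounds.Count(b => x >= b))  // partition index, 0..k-1
        .OrderBy(g => g.Key)
        .SelectMany(g => g.OrderBy(x => x));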

4.2.4 Execution Plan for MapReduce

This section analyzes the execution plan generated by DryadLINQ for the MapReduce computation from Section 3.3. Here, we examine only the case when the input to GroupBy is not ordered and the reduce function is determined to be associative and commutative. The automatically generated execution plan is shown in Figure 6. The plan is statically transformed (1) into a Map and a Reduce stage. The Map stage applies the SelectMany operator (SM) and then sorts each partition (S), performs a local GroupBy (G) and finally a local reduction (R). The D nodes perform a hash-partition. All these operations are pipelined together in a single process. The Reduce stage first merge-sorts all the incoming data streams (MS). This is followed by another GroupBy (G) and the final reduction (R). All these Reduce stage operators are pipelined in a single process along with the subsequent operation in the computation (X). As with the sort plan in Section 4.2.3, at run time (2) the number of Map instances is automatically determined using the number of input partitions, and the number of Reduce instances is chosen based on the total volume of data to be Reduced. If necessary, DryadLINQ will insert a dynamic aggregation tree (3) to reduce the amount of data that crosses the network. In the figure, for example, the two rightmost input partitions were processed on the same computer, and their outputs have been pre-aggregated locally before being transferred across the network and combined with the output of the leftmost partition.

The resulting execution plan is very similar to the manually constructed plans reported for Google's MapReduce [15] and the Dryad histogram computation in [26]. The crucial point to note is that in DryadLINQ MapReduce is not a primitive, hard-wired operation, and all user-specified computations gain the benefits of these powerful automatic optimization strategies.

4.3 Code Generation

The EPG is used to derive the Dryad execution plan after the static optimization phase. While the EPG encodes all the required information, it is not a runnable program. DryadLINQ uses dynamic code generation to automatically synthesize LINQ code to be run at the Dryad vertices. The generated code is compiled into a .NET assembly that is shipped to cluster computers at execution time. For each execution-plan stage, the assembly con-

4.4 Leveraging Other LINQ Providers

One of the greatest benefits of using the LINQ framework is that DryadLINQ can leverage other systems that use the same constructs. DryadLINQ currently gains most from the use of PLINQ [19], which allows us to run the code within each vertex in parallel on a multi-core server. PLINQ, like DryadLINQ, attempts to make the process of parallelizing a LINQ program as transparent as possible, though the systems' implementation strategies are quite different. Space does not permit a detailed explanation, but PLINQ employs the iterator model [25] since it is better suited to fine-grain concurrency in a shared-memory multi-processor system. Because both PLINQ and DryadLINQ use expressions composed from the same LINQ constructs, it is straightforward to combine their functionality. DryadLINQ's vertices execute LINQ expressions, and in general the addition by the DryadLINQ code generator of a single line to the vertex's program triggers the use of PLINQ, allowing the vertex to exploit all the cores in a cluster computer. We note that this remarkable fact stems directly from the careful design choices that underpin LINQ.

We have also added interoperation with the LINQ-to-SQL system which lets DryadLINQ vertices directly access data stored in SQL databases. Running a database on each cluster computer and storing tables partitioned across these databases may be much more efficient than using flat disk files.
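Returning to the PLINQ integration above, the idea can be illustrated as follows. This is a sketch rather than the actual generated vertex code, and Test and Transform stand for arbitrary user functions.

    // Sequential evaluation of a vertex's expression over its input partition:
    var result = partition.Where(r => Test(r)).Select(r => Transform(r));

    // Adding the single AsParallel() call hands the same expression to
    // PLINQ, which spreads the work across the computer's cores:
    var result2 = partition.AsParallel()
                           .Where(r => Test(r))
                           .Select(r => Transform(r));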