Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing

Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, Rares Vernica
Computer Science Department, University of California, Irvine
Irvine, CA 92697
[email protected]

Abstract—Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications.

I. INTRODUCTION

In recent years, the world has seen an explosion in the amount of data owing to the growth of the Internet. In the same time frame, the declining cost of hardware has made it possible for companies (even modest-sized companies) to set up sizeable clusters of independent computers to store and process this growing sea of data. Based on their experiences with web-scale data processing, Google proposed MapReduce [1], a programming model and an implementation that provides a simple interface for programmers to parallelize common data-intensive tasks. Shortly thereafter, Hadoop [2], an open-source implementation of MapReduce, was developed and began to gain followers. Similarly, Microsoft soon developed Dryad [3] as a generalized execution engine to support their coarse-grained data-parallel applications.

It has since been noted that, while MapReduce and Dryad are powerful programming models capable of expressing arbitrary data-intensive computations, translating end-user problems into jobs with Map and Reduce primitives for MapReduce or into networks of channels and vertices for Dryad requires fairly sophisticated skills. As a result, higher-level declarative languages such as Sawzall [4] from Google, Pig [5] from Yahoo!, Jaql [6] from IBM, Hive [7] from Facebook, and DryadLINQ [8] and Scope [9] from Microsoft have been developed to make data-intensive computing accessible to more programmers. The implementations of these languages translate declarative programs on partitioned data into DAGs of MapReduce jobs or into DAGs of Dryad vertices and channels.

Statistics reported by companies such as Yahoo! and Facebook regarding the sources of jobs on their clusters indicate that these declarative languages are gaining popularity as the interface of choice for large scale data processing. For example, Yahoo! recently reported that over 60 percent of their production Hadoop jobs originate from Pig programs today, while Facebook reports that a remarkable 95 percent of their production Hadoop jobs are now written in Hive rather than in the lower-level MapReduce model. In light of the rapid movement to these higher-level languages, an obvious question emerges for the data-intensive computing community: If we had set out from the start to build a parallel platform to serve as a target for compiling higher-level declarative data-processing languages, what should that platform have looked like? It is our belief that the MapReduce model adds significant accidental complexity¹ as well as inefficiencies to the translation from higher-level languages.

In this paper, we present the design and implementation of Hyracks, which is our response to the aforementioned question. Hyracks is a flexible, extensible, partitioned-parallel framework designed to support efficient data-intensive computing on clusters of commodity computers. Key contributions of the work reported here include:

1) The provision of a new platform that draws on time-tested contributions in parallel databases regarding efficient parallel query processing, such as the use of operators with multiple inputs or the employment of pipelining as a means to move data between operators. Hyracks includes a built-in collection of operators that can be used to assemble data processing jobs without needing to write processing logic akin to Map and Reduce code.
2) The provision of a rich API that enables Hyracks operator implementors to describe operators' behavioral and resource usage characteristics to the framework in order to enable better planning and runtime scheduling for jobs that utilize their operators.
3) The inclusion of a Hadoop compatibility layer that enables users to run existing Hadoop MapReduce jobs unchanged on Hyracks as an initial "get acquainted" strategy as well as a migration strategy for "legacy" data-intensive applications.
4) Performance experiments comparing Hadoop against the Hyracks Hadoop compatibility layer and against the native Hyracks model for several different types of jobs, thereby exploring the benefits of Hyracks over Hadoop owing to implementation choices and the benefits of relaxing the MapReduce model as a means of job specification.
5) An initial method for scheduling Hyracks tasks on a cluster that includes basic fault recovery (to guarantee job completion through restarts), and a brief performance study of one class of job on Hyracks and Hadoop under various failure rates to demonstrate the potential gains offered by a less pessimistic approach to fault handling.
6) The provision of Hyracks as an available open source platform that can be utilized by others in the data-intensive computing community as well.

¹Accidental complexity is complexity that arises in computer systems which is non-essential to the problem being solved [10].

The remainder of this paper is organized as follows. Section II provides a quick overview of Hyracks using a simple example query. Section III provides a more detailed look at the Hyracks programming model as seen by different classes of users, including end users, compilers for higher-level languages, and implementors of new operators for the Hyracks platform. Section IV discusses the implementation of Hyracks, including its approaches to job control, scheduling, fault handling, and efficient data handling. Section V presents a set of performance results comparing the initial implementation of Hyracks to Hadoop for several disparate types of jobs under both fault-free operation and in the presence of failures. Section VI reviews key features of Hyracks and their relationship to work in parallel databases and data-intensive computing. Finally, Section VII summarizes the paper and discusses our plans for future work.
II. HYRACKS OVERVIEW

Hyracks is a partitioned-parallel dataflow execution platform that runs on shared-nothing clusters of computers. Large collections of data items are stored as local partitions distributed across the nodes of the cluster. A Hyracks job (the unit of work in Hyracks), submitted by a client, processes one or more collections of data to produce one or more output collections (also in the form of partitions). Hyracks provides a programming model and an accompanying infrastructure to efficiently divide computations on large data collections (spanning multiple machines) into computations that work on each partition of the data separately. In this section we utilize a small example to describe the steps involved in Hyracks job execution. We also provide an introductory architectural overview of the Hyracks software platform.

A. Example

As will be explained more completely in Section III, a Hyracks job is a dataflow DAG composed of operators (nodes) and connectors (edges). Operators represent the job's partitioned-parallel computation steps, and connectors represent the (re-)distribution of data from step to step. Internally, an individual operator consists of one or several activities (internal sub-steps or phases). At runtime, each activity of an operator is realized as a set of (identical) tasks that are clones of the activity and that operate on individual partitions of the data flowing through the activity.

Let us examine a simple example based on a computation over files containing CUSTOMER and ORDERS data drawn from the TPC-H [11] dataset. In particular, let us examine a Hyracks job to compute the total number of orders placed by customers in various market segments. The formats of the two input files for this Hyracks job are of the form:

CUSTOMER (C_CUSTKEY, C_MKTSEGMENT, ...)
ORDERS (O_ORDERKEY, O_CUSTKEY, ...)

where the dots stand for remaining attributes that are not directly relevant to our computation.

To be more precise about the intended computation, the goal for the example job is to compute the equivalent of the following SQL query:

select C_MKTSEGMENT, count(O_ORDERKEY)
from CUSTOMER join ORDERS on C_CUSTKEY = O_CUSTKEY
group by C_MKTSEGMENT

Fig. 1: Example Hyracks job specification. (Two file scanners read CUSTOMER partitions {NC1: cust1.dat, NC2: cust2.dat} and ORDERS partitions {NC3: ord1.dat, NC2: ord1.dat} and {NC1: ord2.dat, NC5: ord2.dat}; connectors E1 [hash(C_CUSTKEY)] and E2 [hash(O_CUSTKEY)] feed a HashJoin on C_CUSTKEY = O_CUSTKEY; connector E3 [hash(C_MKTSEGMENT)] feeds a HashGroupby on C_MKTSEGMENT with Agg: count(O_ORDERKEY); a 1:1 connector E4 feeds a Writer.)

Fig. 2: Example Hyracks Activity Node graph. (The Scanner (CUSTOMER) and Scanner (ORDERS) operators connect via E1 and E2 to the HashJoin operator, which expands into JoinBuild and JoinProbe activities; the HashGroupby operator expands into HashAggregate and OutputGenerator activities, connected to the join by E3 and to the Writer by E4.)

A simple Hyracks job specification to perform this computation can be constructed as shown in Figure 1. For each data source (CUSTOMER and ORDERS), a file scanner operator is used to read the source data. A hash-based join operator receives the resulting streams of data (one with CUSTOMER instances and another with ORDERS instances) and produces a stream of CUSTOMER-ORDERS pairs that match on the specified condition (C_CUSTKEY = O_CUSTKEY). The result of the join is then aggregated using a hash-based group operator on the value of the C_MKTSEGMENT field. This group operator is provided with a COUNT aggregation function to compute the count of O_ORDERKEY occurrences within a group. Finally, the output of the aggregation is written out to a file using a file writer operator.

As noted earlier, data collections in Hyracks are stored as local partitions on different nodes in the cluster. To allow a file scanner operator to access the partitions, its runtime tasks must be scheduled on the machines containing the partitions.
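To make the flavor of job assembly concrete, the following schematic Java sketch models the DAG of Figure 1 using plain collections. The Operator and Connector types defined here are stand-ins invented for this example; they are not the actual Hyracks job-specification API, whose user model is discussed in Section III.

// Illustrative sketch only: models the DAG of Figure 1 with plain Java
// collections rather than the real Hyracks job-specification classes.
import java.util.ArrayList;
import java.util.List;

public class ExampleJobSketch {
    record Operator(String name, String... params) {}
    record Connector(String type, Operator from, Operator to) {}

    public static void main(String[] args) {
        Operator custScan = new Operator("FileScanner", "CUSTOMER partitions: NC1:cust1.dat, NC2:cust2.dat");
        Operator ordScan  = new Operator("FileScanner", "ORDERS partitions: {NC3|NC2}:ord1.dat, {NC1|NC5}:ord2.dat");
        Operator hashJoin = new Operator("HashJoin", "C_CUSTKEY = O_CUSTKEY");
        Operator groupBy  = new Operator("HashGroupby", "key: C_MKTSEGMENT", "agg: count(O_ORDERKEY)");
        Operator writer   = new Operator("FileWriter");

        List<Connector> edges = new ArrayList<>();
        edges.add(new Connector("M:N hash(C_CUSTKEY)",    custScan, hashJoin)); // E1
        edges.add(new Connector("M:N hash(O_CUSTKEY)",    ordScan,  hashJoin)); // E2
        edges.add(new Connector("M:N hash(C_MKTSEGMENT)", hashJoin, groupBy));  // E3
        edges.add(new Connector("1:1",                    groupBy,  writer));   // E4

        // Print the job DAG: each edge names the data-distribution logic used
        // to route records from producer partitions to consumer partitions.
        for (Connector e : edges) {
            System.out.printf("%s -[%s]-> %s%n", e.from().name(), e.type(), e.to().name());
        }
    }
}

The four edges correspond to connectors E1 through E4 of Figure 1, each annotated with the data-distribution logic it applies.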

Fig. 3: Parallel instantiation of the example. (Task legend: S11-S12 and S21-S22 are scanner tasks, JB1-JB4 join-build tasks, JP1-JP4 join-probe tasks, HA1-HA3 hash-aggregate tasks, OG1-OG3 output-generator tasks, and W1-W3 writer tasks. Stage 1 covers the CUSTOMER scan and join-build tasks, Stage 2 the ORDERS scan, join-probe, and hash-aggregate tasks, and Stage 3 the output-generator and writer tasks. The edge key distinguishes cases where A's output is hash-distributed to B, where A's output is piped to B directly, and where B blocks until A completes.)

As indicated in the figure, metadata provided to the two file scanner operators specifies the machines where the partitions of the CUSTOMER and ORDERS files are available. For the sake of this toy example, the CUSTOMER file has two partitions that reside on nodes NC1 and NC2, respectively. The ORDERS file also has two partitions, but each partition is two-way replicated; one can be accessed on either of nodes NC3 or NC2, and the other on NC1 or NC5. (This information about replication gives Hyracks multiple scheduling options for the ORDERS file scanner's tasks.)

Moving downstream, our example Hyracks job specification computes the join in a partitioned manner as well. To enable this, of course, the data arriving at the join tasks (join partitions) must satisfy the property that CUSTOMER and ORDERS instances that match will be routed to the same join task. In the job specification, every edge between two operators carries an annotation indicating the data distribution logic to be used for routing data from the source (producer) partitions to the target (consumer) partitions. The example job in Figure 1 uses hash-partitioning of the CUSTOMER and ORDERS instances on C_CUSTKEY and O_CUSTKEY, respectively, to enforce the required data property for the join operator's partitions. Hash-partitioning is then used again to repartition the output partitions from the join operator to the grouping operator on the C_MKTSEGMENT field to ensure that all CUSTOMER-ORDERS pairs that agree on that field will be routed to the same partition for aggregation. The final result of the job will be created as a set of partitions (with as many partitions as grouping operators being used) by using a 1:1 connector between the group operator and the file writer operator; the 1:1 connector will lead Hyracks to create as many consumer tasks as there are producer tasks and route the results pairwise without repartitioning.

When Hyracks begins to execute a job, it takes the job specification and internally expands each operator into its constituent activities. This results in a more detailed DAG, as shown in Figure 2. This expansion of operators reveals to Hyracks the phases of each operator along with any sequencing dependencies among them. The hash-based join operator in the example expands into two activities; its first activity builds a hashtable on one input and the second activity then probes the resulting hashtable with the other input in order to produce the result of the join. Note that the build phase of the join operation has to finish before the probe phase can begin; this sequencing constraint is captured as a dotted arrow (a blocking edge) in the figure from the join-build activity to the join-probe activity. Similarly, the hash-based grouping operator in our example expands into an aggregation activity that blocks an output generation activity to model the fact that no output can be produced by the aggregator until it has seen all of its input data. The reason for having operators express their internal phases (activities) in this manner is to provide Hyracks with enough insight into the sequencing dependencies of the various parts of a job to facilitate execution planning and coordination. Activities that are transitively connected to other activities in a job only through dataflow edges are said to together form a stage. Intuitively, a stage is a set of activities that can be co-scheduled (to run in a pipelined manner, for example). Section IV describes more details about this process.

A job's parallel execution details are planned in the order in which stages become ready to execute. (A given stage in a Hyracks job is ready to execute when all of its dependencies, if any, have successfully completed execution.) Figure 3 shows the runtime task graph that results from planning the activities in a stage. While the figure depicts the tasks for all three stages of the job, Hyracks actually expands each stage into tasks just prior to the stage's execution. At runtime, the three stages in our example will be executed in order of their readiness to run. Hyracks will start by running the scanning of CUSTOMER data along with the hash-build part of the join tasks, pipelining (and routing) the CUSTOMER data between the parallel tasks of these initial activities. Once the first stage is complete, the next stage, which probes the hashtable using the ORDERS data and performs aggregation, will be planned and executed. After completion of the second stage, the output generation tasks of the group-by operator along with the file writer tasks will be activated and executed, thus producing the final results of the job. The job's execution is then complete.

B. High-level Architecture

Figure 4 provides an overview of the basic architecture of a Hyracks installation. Every Hyracks cluster is managed by a Cluster Controller process. The Cluster Controller accepts job execution requests from clients, plans their evaluation strategies (e.g., computing stages), and then schedules the jobs' tasks (stage by stage) to run on selected machines in the cluster. In addition, it is responsible for monitoring the state of the cluster to keep track of the resource loads at the various worker machines. The Cluster Controller is also responsible for re-planning and re-executing some or all of the tasks of a job in the event of a failure. Turning to the task execution side, each worker machine that participates in a Hyracks cluster runs a Node Controller process. The Node Controller accepts task execution requests from the Cluster Controller and also reports on its health (e.g., resource usage levels) via a heartbeat mechanism. More details regarding the controller architecture and its implementation are provided in Section IV.

Fig. 4: Hyracks system architecture. (A single Cluster Controller, comprising a Client Interface/Job Acceptor, Job Manager, Node Monitor, and Resource Accountant, is connected over the network to a set of Node Controllers, each running a Heartbeat Generator, a Joblet Manager, and Active Joblet Operators.)
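As a minimal illustration of the membership mechanism just described, the sketch below shows a Node Controller-style loop that registers with a Cluster Controller stand-in and then sends heartbeats at the returned rate. The interfaces, method names, and payloads are invented for this example; the actual registration exchange is described in Section IV.

// Sketch of the worker-side membership loop: register, then heartbeat at the
// period the Cluster Controller returned. The interfaces are stand-ins only.
public class NodeControllerLoopSketch {
    interface ClusterController {
        long register(String nodeId, int numCores);      // returns heartbeat period in ms
        void heartbeat(String nodeId, double loadAverage);
    }

    public static void main(String[] args) throws InterruptedException {
        ClusterController cc = new ClusterController() {  // in-process stand-in for the CC
            public long register(String nodeId, int numCores) { return 250; }
            public void heartbeat(String nodeId, double load) {
                System.out.println("heartbeat from " + nodeId + ", load=" + load);
            }
        };
        String nodeId = "NC1";
        long period = cc.register(nodeId, Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < 3; i++) {                      // bounded loop for the sketch
            cc.heartbeat(nodeId, 0.1 * i);
            Thread.sleep(period);
        }
    }
}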

III. COMPUTATIONAL MODEL

In this section, we describe the Hyracks approach to data representation and the Hyracks programming model as they pertain to two broad classes of users: end users who use Hyracks as a job assembly layer to solve problems, and operator implementors who wish to implement new operators for use by end users. (Compilers for higher-level languages essentially fall into the first class of users as well.) We also describe the built-in operators and connectors that are available as part of the initial software distribution of Hyracks.

A. Data Treatment in Hyracks

In Hyracks, data flows between operators over connectors in the form of records that can have an arbitrary number of fields. Hyracks provides support for expressing data-type-specific operations such as comparisons and hash functions. The type of each field is described by providing an implementation of a descriptor interface that allows Hyracks to perform serialization and deserialization. For most basic types, e.g., numbers and text, the Hyracks library contains pre-existing type descriptors. The Hyracks use of a record as the carrier of data is a generalization of the (key, value) concept found in MapReduce and Hadoop. The advantage of the generalization is that operators do not have to artificially package (and repackage) multiple data objects into a single "key" or "value" object. The use of multiple fields for sorting or hashing also becomes natural in this model. For example, the Hyracks operator library includes an external sort operator descriptor that can be parameterized with the fields to use for sorting along with the comparison functions to use. Type descriptors in Hyracks are similar to the Writable and WritableComparator interfaces in Hadoop. However, a subtle but important difference is that Hyracks does not require the object instances that flow between operators to themselves implement a specific interface; this allows Hyracks to directly process data that is produced and/or consumed by systems with no knowledge of the Hyracks platform or its interfaces.
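The sketch below illustrates the idea of comparing fields directly in their serialized form, without deserializing them into objects. The IBinaryComparator interface shown here is a stand-in written for this example rather than a quotation of the actual Hyracks interfaces.

// Sketch: a comparator that works directly on serialized (binary) field data,
// avoiding object creation; the interface is a stand-in for illustration.
public class BinaryIntComparatorSketch {
    interface IBinaryComparator {
        int compare(byte[] b1, int off1, int len1, byte[] b2, int off2, int len2);
    }

    // Compares two 4-byte big-endian signed integers stored inside frames.
    static final IBinaryComparator INT_COMPARATOR = (b1, o1, l1, b2, o2, l2) -> {
        int v1 = ((b1[o1] & 0xff) << 24) | ((b1[o1 + 1] & 0xff) << 16)
               | ((b1[o1 + 2] & 0xff) << 8) | (b1[o1 + 3] & 0xff);
        int v2 = ((b2[o2] & 0xff) << 24) | ((b2[o2 + 1] & 0xff) << 16)
               | ((b2[o2 + 2] & 0xff) << 8) | (b2[o2 + 3] & 0xff);
        return Integer.compare(v1, v2);
    };

    public static void main(String[] args) {
        byte[] frame = {0, 0, 0, 5, 0, 0, 0, 9};  // two serialized int fields: 5 and 9
        System.out.println(INT_COMPARATOR.compare(frame, 0, 4, frame, 4, 4)); // negative: 5 < 9
    }
}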
B. End User Model

Hyracks has been designed with the goal of being a runtime platform where users can hand-create jobs, like in Hadoop and MapReduce, yet at the same time to be an efficient target for the compilers of higher-level programming languages such as Pig, Hive, or Jaql. In fact, the Hyracks effort was born out of the need to create an appropriate runtime system for the ASTERIX project at UCI [12], a project in which we are building a scalable information management system with support for the storage, querying, and analysis of very large collections of semi-structured nested data objects using a new declarative query language (AQL). We also have an effort currently underway to support Hive on top of the Hyracks native end user model. Figure 5 gives an overview of the current user models.

Fig. 5: End User Models of Hyracks. (AQL queries, Hive/Pig-style queries, Hadoop jobs, and hand-written Hyracks jobs are all compiled down, directly or via the Hadoop MapReduce compatibility layer, to jobs on the Hyracks platform.)

1) Hyracks Native User Model: A Hyracks job is formally expressed as a DAG of operator descriptors connected to one another by connector descriptors, as was indicated in Figure 1. In addition to its input and output connectors, an operator descriptor may take other parameters specific to its operation. For example, the ExternalSortOperatorDescriptor in the Hyracks built-in operator library needs to know which fields in its input record type are to be used for sorting, the comparators to use to perform the sorting operation, and the amount of main memory budgeted for its sorting work.

Hyracks currently allows users to choose between two ways of describing the scheduling choices for each of the operators in a job. One option is to list the exact number of partitions of an operator to be created at runtime along with a list of choices of worker machines for each partition. Another option is just to specify the number of partitions to use for an operator, leaving Hyracks to decide the location assignments for each of its partitions. We are currently working on a third, longer-term option involving automatic partitioning and placement based on the estimated resource requirements for operators and the current availability of resources at the worker machines.

2) Hyracks Map-Reduce User Model: MapReduce has become a popular paradigm for data-intensive parallel computation using shared-nothing clusters. Classic example applications for the MapReduce paradigm have included processing and indexing crawled Web documents, analyzing Web request logs, and so on. In the open source community, Hadoop is a popular implementation of the MapReduce paradigm that is quickly gaining traction even in more traditional enterprise IT settings. To facilitate easy migration for Hadoop users who wish to use Hyracks, we have built a Hadoop compatibility layer on top of Hyracks. The aim of the Hyracks Hadoop compatibility layer is to accept Hadoop jobs unmodified and run them on a Hyracks cluster.

In MapReduce and Hadoop, data is initially partitioned across the nodes of a cluster and stored in a distributed file system (DFS). Data is represented as (key, value) pairs. The desired computation is expressed using two functions:

map (k1, v1) → list(k2, v2);
reduce (k2, list(v2)) → list(k3, v3).

A MapReduce computation starts with a map phase in which the map functions are applied in parallel on different partitions of the input data. The (key, value) pairs output by each map function are then hash-partitioned on their key. For each partition, the pairs are sorted by their key and sent across the cluster in a shuffle phase. At each receiving node, all of the received partitions are merged in sorted order by their key. All of the pair values sharing a given key are then passed to a single reduce call. The output of each reduce function is finally written to a distributed file in the DFS.

To allow users to run MapReduce jobs developed for Hadoop on Hyracks, we developed two extra operator descriptors that can wrap the map and reduce functionality.² The data tuples provided as input to the hadoop_mapper operator descriptor are treated as (key, value) pairs and, one at a time, passed to the map function. The hadoop_reducer operator descriptor also treats the input as (key, value) pairs, and groups the pairs by key. The sets of (key, value) pairs that share the same key are passed, one set at a time, to the reduce function. The outputs of the map and reduce functions are directly output by the operator descriptors. To emulate a MapReduce framework we employ a sort operator descriptor and a hash-based distributing connector descriptor.

Figure 6 shows the Hyracks plan for running a MapReduce job. After data is read by the scan operator descriptor, it is fed into the hadoop_mapper operator descriptor using a 1:1 edge. Next, using an M:N hash-based distribution edge, data is partitioned based on key. (The "M" value represents the number of maps, while the "N" value represents the number of reducers.) After distribution, data is sorted using the sort operator descriptor and passed to the hadoop_reducer operator descriptor using a 1:1 edge. Finally, the hadoop_reducer operator descriptor is connected using a 1:1 edge to a file-writer operator descriptor.

²The combine functionality of MapReduce is fully supported as well, but those details are omitted here for ease of exposition.

Fig. 6: Hyracks plan for Hadoop jobs. (scan(input_splits) feeds hadoop_mapper(map()) via a 1:1 edge; an M:N hash partition on key feeds sort(key), which feeds hadoop_reducer(reduce()) and then write(output_splits), each via 1:1 edges.)
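For concreteness, the listing below shows a minimal word-count mapper and reducer of the kind the compatibility layer is designed to accept, written against the org.apache.hadoop.mapred API used by Hadoop 0.20-era jobs (word count is also one of the jobs measured in Section V). Under the plan of Figure 6, the map method would be driven by the hadoop_mapper operator descriptor and the reduce method by the hadoop_reducer operator descriptor.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// map: (k1, v1) -> list(k2, v2); emits (word, 1) for every token in a line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}

// reduce: (k2, list(v2)) -> list(k3, v3); sums the counts for each word.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}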

C. Operator Implementor Model

One of the initial design goals of Hyracks has been for it to be an extensible runtime platform. To this end, Hyracks includes a rich API for operator implementors to use when building new operators. The implementations of operators made available "out of the box" via the Hyracks operator library are also based on the use of this same API.

Operators in Hyracks have a three-part specification:

1) Operator Descriptors: Every operator is constructed as an implementation of the Operator Descriptor interface. Using an operator in a job involves creating an instance of the descriptor of that operator. The descriptor is responsible for accepting and verifying parameters required for correct operation from the user during job assembly. During the job planning phase, the descriptor is responsible for expanding itself into its constituent activities (as illustrated back in Figure 2).

2) Operator Activities: Hyracks allows an operator to describe, at a high level, the various phases involved in its evaluation. Phases in an operator usually create sequencing dependencies that describe the order in which the inputs of the operator are required to be made available and the order in which the outputs will become available. Again, Figure 2 showed the expansions for the hash-join and the hash-based grouping operators into their component activities. As a part of the expansion, activities indicate any blocking dependencies and the mapping of overall operator inputs and outputs to the activities' inputs and outputs. For example, when the hash-join operator is expanded, it indicates to Hyracks that its join-build activity consumes one input and its join-probe activity consumes the other. It also indicates that its join-probe activity produces the output. At a high level, this describes to Hyracks that the second input is not required to be available until the stage containing the join-build activity is complete. Conversely, it also indicates that output from the join will not be available until the stage containing the join-build activity is complete. Hyracks then uses this information to plan the stages of a job in order of necessity. Deferring planning to the moment just prior to execution allows Hyracks to use current cluster conditions (availability of machines, resource availability at the various machines, etc.) to help determine the schedule for a stage's tasks.

3) Operator Tasks: Each activity of an operator actually represents a set of parallel tasks to be scheduled on the machines in the cluster. Tasks are responsible for the runtime aspect of the activity; each one consumes a partition of the activity's inputs and produces a partition of its output. The tasks corresponding to activities of the same partition of the same operator may need to share state. For example, the join-probe task needs to be given a handle to the hashtable constructed by the join-build task. To facilitate such state sharing, Hyracks co-locates those tasks that belong to a given partition of an activity of the same operator. For example, in Figure 3, the join-probe tasks are co-located with their corresponding join-build tasks. Note that the join-build task will no longer be active when the join-probe task is scheduled, but Hyracks provides a shared context for the two tasks to use to exchange the required information. In the future we plan to explore alternative state transfer mechanisms (albeit at a higher cost), e.g., to use when co-location is impossible due to resource constraint violations at a machine. In terms of operator implementation, the implementor of a new operator can implement the runtime behavior of its activities by providing single-threaded implementations of an iterator interface (a la [13]), much as in Gamma [14].
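The sketch below illustrates the descriptor-to-activity expansion for the hash-join operator of Figure 2, including its blocking dependency. The ActivityGraph and Activity types are simplified stand-ins written for this example, not the actual Hyracks implementor API.

// Sketch of the three-part operator specification: a descriptor expands into
// activities, records blocking dependencies, and maps operator inputs/outputs
// to activity inputs/outputs. All types here are stand-ins for illustration.
import java.util.ArrayList;
import java.util.List;

public class HashJoinDescriptorSketch {
    record Activity(String name) {}

    static class ActivityGraph {
        final List<Activity> activities = new ArrayList<>();
        final List<String> edges = new ArrayList<>();
        void add(Activity a) { activities.add(a); }
        // A blocking edge: the target activity cannot start until the source completes.
        void addBlockingEdge(Activity from, Activity to) { edges.add(from.name() + " --blocks--> " + to.name()); }
        void mapInput(int operatorInput, Activity a)   { edges.add("operator input "  + operatorInput + " -> " + a.name()); }
        void mapOutput(int operatorOutput, Activity a) { edges.add("operator output " + operatorOutput + " <- " + a.name()); }
    }

    // During planning, the descriptor expands itself into its constituent
    // activities, mirroring the join-build / join-probe expansion of Figure 2.
    static ActivityGraph expandHashJoin() {
        ActivityGraph g = new ActivityGraph();
        Activity build = new Activity("JoinBuild");
        Activity probe = new Activity("JoinProbe");
        g.add(build);
        g.add(probe);
        g.mapInput(0, build);            // build side consumes operator input 0
        g.mapInput(1, probe);            // probe side consumes operator input 1
        g.mapOutput(0, probe);           // only the probe activity produces output
        g.addBlockingEdge(build, probe); // probe cannot start until the hashtable is built
        return g;
    }

    public static void main(String[] args) {
        expandHashJoin().edges.forEach(System.out::println);
    }
}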
D. Hyracks Library

Hyracks strongly promotes the construction of reusable operators and connectors that end users can then just use to assemble their jobs. Hyracks, as part of its software distribution, comes with a library of pre-existing operators and connectors. Here we briefly describe some of the more important operators and connectors.

1) Operators:
• File Readers/Writers: Operators to read and write files in various formats from/to local file systems and the Hadoop Distributed Filesystem.
• Mappers: Operators to evaluate a user-defined function on each item in the input.
• Sorters: Operators that sort input records using user-provided comparator functions.
• Joiners: Binary-input operators that perform equi-joins. We have implementations of the Hybrid Hash-Join and Grace Hash-Join algorithms [13].
• Aggregators: Operators to perform aggregation using a user-defined aggregation function. We have implemented a hash-based grouping aggregator and also a pre-clustered version of aggregation.

2) Connectors: Connectors distribute data produced by a set of sender operators to a set of receiver operators.
• M:N Hash-Partitioner: Hashes every tuple produced by senders (on a specified set of fields) to generate the receiver number to which the tuple is sent. Tuples produced by the same sender keep their initial order on the receiver side.
• M:N Hash-Partitioning Merger: Takes as input sorted streams of data and hashes each tuple to find the receiver. On the receiver side, it merges streams coming from different senders based on a given comparator, thus producing ordered partitions.
• M:N Range-Partitioner: Partitions data using a specified field in the input and a range-vector.
• M:N Replicator: Copies the data produced by every sender to every receiver operator.
• 1:1 Connector: Connects exactly one sender to one receiver operator.
IV. HYRACKS IMPLEMENTATION DETAILS

In this section we describe some of the details of the Hyracks implementation. We first describe the control portion of Hyracks, which maintains cluster membership and performs job scheduling, tracking, and recovery. We then discuss some of the salient implementation details of modules that help Hyracks to run its operators and connectors efficiently.

A. Control Layer

As was shown in Figure 4, a Cluster Controller (CC) is used to manage a Hyracks cluster. Worker machines are added to a Hyracks cluster by starting a Node Controller (NC) process on that machine and providing it with the network address of the Cluster Controller process whose cluster it is to join. The NC then initiates contact with the CC and requests registration. As part of its registration request, the NC sends details about the resource capabilities (number of cores, etc.) of its worker machine. On successful registration, the CC replies to the NC with the rate at which the NC should provide heartbeats back to the CC to maintain continued membership in the cluster.

On receiving a job submission from a client, the CC infers the stages of the job (as was illustrated in Figures 2 and 3). Scheduling of a stage that is ready to execute is performed using the latest information available about the cluster. (This may happen much later than when the job was submitted, e.g., in a long multi-stage job.) Once it has determined the worker nodes that will participate in the execution of the stage, the CC initiates task activation on those worker nodes in parallel by pushing a message to start the tasks. This is different from the way the Hadoop Job Tracker disseminates tasks to the Task Trackers: Hadoop uses a pull model where the Task Trackers pull tasks from the Job Tracker by means of a periodic heartbeat. Although eagerly pushing job start messages increases the number of messages on the cluster, we have adopted this approach to minimize the wait time for tasks to start execution. This becomes increasingly important as a platform tries to target short jobs. In the future, we will explore ways of minimizing the network traffic for control messages by exploiting opportunities to piggyback multiple messages targeting the same NC.

The control logic of the Hyracks Cluster Controller is implemented in Overlog [15] using the JOL [16] library. Overlog is an extension of Datalog that has support for events. Job scheduling and fault-tolerance logic is implemented in the form of event-driven Overlog statements in JOL. More details about how we used JOL can be found in the longer version of this paper [17].

B. Task Execution Layer

Once a stage of a Hyracks job becomes ready to run, the CC activates the stage on the set of NCs that have been chosen to run the stage and then waits until the stage completes or until an NC failure is detected. In the remainder of this section we explore the task activation process and then discuss some details of the implementation of the data-processing and data-movement aspects of Tasks and Connectors, respectively.

1) Task Activation: When a stage becomes ready to run, the Cluster Controller sends messages to the Node Controllers participating in that stage's execution to perform task activation in three steps.

1) Receiver Activation: The CC initiates a ReceiverActivation request to every NC that participates in a stage's execution. Each NC, on receipt of this request, creates the Task objects that have been designated to run at that NC (as determined by the scheduler). For each Task that accepts inputs from other Tasks, it creates an endpoint that has a unique network address. The mapping from a Task object's input to the endpoint address is relayed back to the CC as the response to this request.
2) Sender-Receiver Pairing: Once the CC receives the ReceiverActivation responses containing local endpoint addresses from the NCs, it merges all of them together to create a global endpoint address map. This global map is now broadcast to each NC so that the send side of each Task object at that NC knows the network addresses of its consumer Tasks.
3) Connection Initiation: Once pairing has completed on all NCs, the CC informs all of the NCs that the Tasks can be started.

2) Task Execution and Data Movement: The unit of data that is consumed and produced by Hyracks tasks is called a Frame. A frame is a fixed-size (configurable) chunk of contiguous bytes. A producer of data packs a frame with a sequence of records in a serialized format and sends it to the consumer who then interprets the records. Frames never contain partial records; in other words, a record is never split across frames. A naive way to implement tasks that process records is to deserialize them into Java objects and then process them. However, this has the potential to create many objects, resulting in more data copies (to create the object representation) and a burden on the garbage collector (later) to reclaim the space for those objects. To avoid this problem, Hyracks provides interfaces for comparing and hashing fields that can be implemented to work off of the binary data in a frame. Hyracks includes implementations of these interfaces for the basic Java types (String, integer, float, etc.). Hyracks also provides Task implementors with helper functions to perform basic tasks on the binary data such as to copy complete records, copy specific fields, and concatenate records from source frames to target frames.

A task's runtime logic is implemented as a push-based iterator that receives a frame at a time from its inputs and pushes result frames to its consumers. Note that the task-runtime only implements the logic to process streams of frames belonging to one partition without regard to any repartitioning of the inputs or outputs that might be required. Any repartitioning is achieved by using a Connector. The connector runtime has two sides, the send-side and the receive-side. There are as many send-side instances of a connector as tasks in the producing activity, while there are as many receive-side instances as tasks in the consuming activity. The send-side instances are co-located with the corresponding producer tasks, while the receive-side instances are co-located with the consuming tasks. When a send-side instance of a connector receives a frame from its producing task, it applies its redistribution logic to move records to the relevant receive-side instances of the connector. For example, the M:N hash-partitioning connector in Hyracks redistributes records based on their hash-value. The send-side instances of this connector inspect each record in the received frame, compute the hash-value, and copy it to the target frame meant for the appropriate receive-side instance. When a target frame is full, the send-side instance sends the frame to the receive-side instance. If the send-side and receive-side instances of a connector are on different worker machines (which is most often the case), sending a frame involves writing it out to the network layer. The network layer in Hyracks is responsible for making sure that the frame gets to its destination.

In order to use network buffers efficiently on the receive-side of a connector, Hyracks allows the connector implementation to specify the buffering strategy for received frames. If the receive-side logic of a connector does not differentiate between the frames sent by the various send-side instances, it can choose to use one network buffer for all senders. On the other hand, if the connector's logic requires treating frames from different senders separately, it can use one network buffer per send-side instance. For example, the M:N hash-partitioning connector does not differentiate frames based on their senders. However, an M:N sort-merge connector, which assumes that frames sent by a sender contain records that are pre-sorted, needs to decide which sender's frame to read next in order to perform an effective merge.
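The following sketch illustrates the send-side behavior described above for an M:N hash-partitioning connector: each record is inspected in its serialized form inside a frame, hashed, copied into a per-receiver target frame, and shipped when that frame fills. The fixed record width, the choice of the first four bytes as the partitioning key, and the println standing in for the network layer are simplifying assumptions made for this example.

// Sketch of a send-side hash-partitioning connector instance. It assumes
// fixed-width records whose first four bytes are the partitioning key; the
// real connector handles variable-length records and real network I/O.
import java.util.Arrays;

public class HashPartitionSendSideSketch {
    static final int FRAME_SIZE = 32 * 1024;   // fixed-size frames, as in the text
    static final int RECORD_SIZE = 16;         // simplifying assumption for the sketch

    final byte[][] targetFrames;               // one target frame per receive-side instance
    final int[] targetOffsets;

    HashPartitionSendSideSketch(int numReceivers) {
        targetFrames = new byte[numReceivers][FRAME_SIZE];
        targetOffsets = new int[numReceivers];
    }

    // Called with each frame pushed by the local producing task.
    void nextFrame(byte[] frame, int nRecords) {
        for (int r = 0; r < nRecords; r++) {
            int off = r * RECORD_SIZE;
            int key = ((frame[off] & 0xff) << 24) | ((frame[off + 1] & 0xff) << 16)
                    | ((frame[off + 2] & 0xff) << 8) | (frame[off + 3] & 0xff);
            int receiver = Math.floorMod(Integer.hashCode(key), targetFrames.length);
            if (targetOffsets[receiver] + RECORD_SIZE > FRAME_SIZE) {
                flush(receiver);               // target frame full: ship it, then reuse it
            }
            System.arraycopy(frame, off, targetFrames[receiver], targetOffsets[receiver], RECORD_SIZE);
            targetOffsets[receiver] += RECORD_SIZE;
        }
    }

    void close() {                             // end of input: ship whatever remains
        for (int i = 0; i < targetFrames.length; i++) {
            if (targetOffsets[i] > 0) flush(i);
        }
    }

    private void flush(int receiver) {
        // Stand-in for handing the frame to the network layer.
        System.out.println("sending " + targetOffsets[receiver] + " bytes to receiver " + receiver);
        Arrays.fill(targetFrames[receiver], (byte) 0);
        targetOffsets[receiver] = 0;
    }

    public static void main(String[] args) {
        HashPartitionSendSideSketch send = new HashPartitionSendSideSketch(3);
        byte[] frame = new byte[FRAME_SIZE];
        frame[3] = 42;                         // one record whose key is 42
        send.nextFrame(frame, 1);
        send.close();
    }
}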
V. EXPERIMENTAL RESULTS

In this section we report the results of performance comparisons of Hadoop and Hyracks as data-intensive computing platforms. First, we compare the performance of jobs written using the MapReduce paradigm by running them on Hadoop and running the same jobs unchanged on Hyracks using its Hadoop compatibility layer. Next, we show the difference in performance that results from not being limited to the MapReduce model in Hyracks. Finally, we explore the performance of Hadoop and Hyracks when node failures occur by using a simple model for generating failures.

For all of the experiments shown in this section, we used a cluster of ten IBM machines each with a 4-core Xeon 2.27 GHz CPU, 12GB of main memory, and four locally attached 10,000 rpm SATA drives. The same cluster was used for Hyracks as well as Hadoop. The Hadoop version used was 0.20.1.

A. MapReduce on Hyracks

In this section, we compare the running times of three very different kinds of jobs (K-Means clustering, a TPC-H style query, and Set-Similarity Joins) on Hadoop versus on Hyracks using the compatibility layer.

1) K-Means: K-Means is a well known clustering algorithm that tries to find clusters given a set of points. Mahout [18] is a popular open source project that includes a Hadoop implementation of the K-Means algorithm as MapReduce jobs. One MapReduce job is used to compute canopies (initial guesses about the number and positions of cluster-centroids). Another MapReduce job is then executed iteratively, refining the centroid positions at every step.

In Figure 7 we show separately the time taken by the canopy computation phase and by the iterations for K-Means on both systems. We further explore the efficiency tradeoffs of Hyracks vs. Hadoop as a data-platform by presenting the average running-time of the iterations both with and without combiners in the two systems. Turning off combiners for the iterative part of the K-Means algorithm dramatically increases the amount of data that is moved from the Map phase to the Reduce phase (although it does not change the final result).

Fig. 7: K-Means on Hadoop and Hyracks. (Running time in seconds vs. millions of points, from 0.5 to 3.5 million, for the canopy phase and for iterations with and without combiners, on Hadoop and on the compatibility layer.)

From Figure 7 we see that the canopy computation phase benefits the least from running on Hyracks since it is mostly CPU intensive. This phase initially (for 0.5 and 1 million points) shows faster time to completion on Hyracks. However, as the size of the data grows from 1.5 to 3 million points, the two lines essentially converge. This is because Hyracks runs the same CPU-intensive user functions that Hadoop runs, and hence there is not much room for improvement there. On the other hand, when much of a job's time is spent in re-organizing data by partitioning and sorting, running the MapReduce job on Hyracks performs better. The K-Means iteration results clearly show this behavior. This improvement owes itself to the more efficient nature of the data-movement layer that forms the foundation for the Hyracks operators and connectors.

In Hadoop, moving data from the mappers to the reducers is performed rather inefficiently. All data from the mappers is written to disk before it is moved to the reducer side. When a reducer is notified of a mapper's completion, it initiates a pull of its partition using HTTP. The resultant near-simultaneous fetch requests from all reducers to the mappers lead to disk contention on the Map side as the reducers try to read all of the requested partitions from disk; there is also a surge in network traffic created by such simultaneous requests [19].

In Hyracks, data is instead pushed from the producers to the consumers in the form of fixed sized chunks (on the order of 32KB) incrementally, as they are produced. This leads to a much smoother utilization of the network throughout the processing time of the producer. Also, since the producer decides when a block is sent out, the aforementioned disk contention issue does not arise in Hyracks.

2) Set-Similarity Joins: Set-Similarity Joins match data items from two collections that are deemed to be similar to each other according to a similarity function. Work reported in [20] performed an extensive evaluation of algorithms for computing such joins using the MapReduce framework. From that work we took the best MapReduce algorithm to compare its performance on Hadoop vs. on Hyracks using the Hadoop compatibility layer. This "fuzzy join" computation requires three stages of MapReduce jobs [20]. The first stage tokenizes the input records and produces a list of tokens ordered by frequency using two MapReduce jobs. The second stage (one MapReduce job) produces record-id pairs of similar records by reading the original data and using the tokens produced in the first stage as auxiliary data. The final stage of SSJ computes similar record pairs from the record-id pairs from Stage 2 and the original input data using two MapReduce jobs.

For comparing the performance of Set-Similarity Joins (SSJ), we used the algorithm shown in [20] to find publications with similar titles in the DBLP dataset [21] by performing a join of the DBLP data with itself. In order to test performance for different sizes of data, we increased the DBLP data sizes to 5x, 10x, and 25x over the original dataset (which contains approximately 1.2M entries) using the replication scheme of [20]. Table I shows the running times for the three stages of the SSJ algorithm on the three different sizes of data. The row labeled "Hadoop" represents the time taken to run each stage as a Hadoop job; "Compat" shows the time taken to run the same MapReduce jobs using the compatibility layer on Hyracks (please ignore the row labeled "Native" for now; it will be explained shortly).

TABLE I: Size-up for Parallel Set-Similarity Joins (10 nodes), times in seconds.

              Stage 1   Stage 2   Stage 3   Total time
 5x  Hadoop     75.52    103.23     76.59      255.34
     Compat     24.50     38.92     23.21       86.63
     Native     11.41     32.69      9.33       53.43
10x  Hadoop     90.52    191.16    107.14      388.82
     Compat     42.15    128.86     38.47      209.49
     Native     16.14     81.70     14.87      112.71
25x  Hadoop    139.58    606.39    266.26     1012.23
     Compat    105.98    538.74     60.62      705.34
     Native     34.50    329.30     27.88      391.68

In the column labeled "Stage 1" in Table I, we see that the MapReduce jobs running on Hadoop scale smoothly with increasing data sizes. The Hadoop compatibility layer on Hyracks also shows this trend. However, the job running on Hyracks is faster because it benefits from two improvements: low job startup overhead and more efficient data movement by the infrastructure. Hyracks uses push-based job activation, which introduces very little wait time between the submission of a job by a client and when it begins running on the node controllers.

In the column labeled "Stage 2" in Table I, we observe a near constant difference between the running times of Hadoop and Hyracks. This is because most of the time in this stage is spent in the Reducer code that uses a quadratic algorithm to find matching records. We do not describe the details here due to space limitations; we refer the reader to the original paper on this algorithm in [20]. The key thing to note is that most of the work in this stage is done in the Reduce phase in user-defined code, so the only improvement we can expect is due to lower overhead of the infrastructure.

The column labeled "Stage 3" in Table I shows that at 5x data size, the ratio of the Hadoop completion time to the Hyracks compatibility layer completion time is 3.3; the ratio drops to about 2.8 at the 10x data size. As the amount of data grows to 25x, we see the compatibility layer completing about 4 times faster than Hadoop. This is because Stage 3 performs two "joins" of the record-ids with the original data to produce similar-record-pairs from the similar-record-id-pairs that were produced by the previous stage. Every join step reshuffles all of the original input data from the Map to the Reduce phase, but the output of the join is fairly small. Most of the work is thus performed in moving the data (as the Map and Reduce code is fairly simple). At smaller data sizes, Hadoop times are dominated by startup overhead, while for larger data sizes the dominating factor is data redistribution.
two “joins” of the record-ids with the original data to produce As we see from Figure 9, TPC-H on Hadoop takes about 5.5 similar-record-pairs from the similar-record-id-pairs that were times the time taken by the compatibility layer on Hyracks for produced by the previous stage. Every join step reshuffles all smaller data sizes (scale 5). Job start-up overhead in Hadoop is of the original input data from the Map to the Reduce phase, responsible for this slowness. As data sizes grow, we observe but the output of the join is fairly small. Most of the work is that the running times for Hadoop and the compatibility layer thus performed in moving the data (as the Map and Reduce grow super-linearly. This is not surprising since there are two code is fairly simple). At smaller data sizes, Hadoop times are sort steps involved, one in the first MapReduce job to perform dominated by startup overhead, while for larger data sizes the the join of CUSTOMER and ORDER data, and the second to dominating factor is data redistribution. perform aggregation in the second MapReduce job. The result 3) TPC-H: The TPC-H query is the query from the Hyracks of the join is quite large, and hence a substantial amount of running example that was introduced in Section II-A. Figure 8 time is spent in partitioning and sorting. To measure the time shows the structure of the two MapReduce jobs corresponding spent in writing the join results to HDFS and reading it back to the running example from Figure 1. in, we also ran in compatibility mode with pipelining turned on. We observe from Figure 9 that I/O to and from HDFS (C_CUSTKEY, MR Job 1 (CUSTOMER) CUSTKEY, [1, CUSTOMER] Map [1, CUSTOMER]) adds significant overhead. At scale 40, the compatibility layer

[2, ORDER] Reduce Shuffle ()[] (CUSTORDER) takes about 133 seconds (about 25% longer than the pipelined (O_CUSTKEY, (ORDER) Map [2, ORDER]) execution). Note however that all of the MapReduce running times exhibit a super-linear curve owing to the sort operations (We will ignore the “Hyracks Native” curve for now). MR Job 2 Reduce (C_MKTSEGMENT, Shuffle (C_MKTSEGMENT, (C_MKTSEGMENT,

Map B. Beyond MapReduce CUSTORDER) (CUSTORDER) ) count(CUSTORDER) ) We now show the results of running two experiments to study the performance improvements obtainable in Hyracks by not being restricted to the MapReduce programming model. Fig. 8: Hadoop implementation of TPC-H query from Fig. 1. We present results for Set-Similarity Joins and the TPC-H 200 Hadoop not need to sort the input data like in Hadoop. Stage 3, in Compat the Native case, benefits from using hash join operators that Compat: Pipelined perform the joins by building a hash-table on one of their Hyracks Native 150 inputs and then probing the table using the other input. This strategy is cheaper than first sorting both inputs on the join key to bring the matching values from both sides together 100

seconds (which is what happens in Hadoop). In Stage 2, we used sort- based grouping since it performed slightly better than its hash-

50 based counterpart. Usually, hash-based aggregation works well for traditional aggregates when grouping is performed on relatively low cardinality attributes, as compression is achieved

0 by the aggregation. In the case of Stage 2, however, the 0 5 10 15 20 25 30 35 40 45 TPC-H scale grouping operation is used to separate the input into groups without applying an incremental aggregation function. The sort Fig. 9: TPC-H query from Fig. 1 with Hadoop and Hyracks. operator in Hyracks implements the “Poor man’s normalized key” optimization mentioned in [23] which can be exploited when building Hyracks Native jobs. This optimization helps join/aggregate example query. in terms of providing cheaper comparisons and better cache

Mapper locality. After sorting, the “aggregation” function performs a Scanner 1:1 M:N tokenize(RECORD) (RECORDS) hash(TOKEN) quadratic scan of the items within each group. This is seen →[TOKEN,...] in Table I as a super-linear growth in the completion time for Stage 2. HashGroupby M:1 Sorter 1:1 Writer TOKEN In summary, we see that the Native formulation of Set- COUNT (TOKENS) Agg: count Similarity Join runs about 2.6 to 4.7 times faster than their MapReduce formulation running on Hadoop and is about 1.6 Fig. 10: Hyracks-native plan for Stage 1 of Set-Similarity to 1.8 times faster than running the same MapReduce jobs on Self-Joins. Hyracks using its compatibility layer. Mapper Scanner 1:1 M:N extractPrefix(RECORD)→ 2) TPC-H (Take 2): We now compare the two MapReduce (RECORDS) hash(TOKEN) [(TOKEN,RECORD),...] TPC-H implementations with the native Hyracks implementa- tion. The “Hyracks Native” line in Figure 9 shows the results PreSortedGroupby 1:1 Writer Sorter 1:1 TOKEN of natively running the TPC-H query in Figure 1. Notice that (RIDPAIRS) TOKEN Agg: verifySimilarity for large data the TPC-H query executes on Hyracks much faster than the MapReduce counterparts on Hadoop or on Fig. 11: Hyracks-native plan for Stage 2 of Set-Similarity Hyracks via the compatibility layer. As in the native expression Self-Joins. of Set-Similarity Joins, this is due in part to the fact that the M:N TPC-H query uses hash-joins and hash-based aggregation. In Scanner hash(RID) (RECORDS) addition, Hyracks does not need to materialize intermediate Hash Join M:N RECORD.RID= hash(RID2) results between the joiner and the aggregator. In contrast, since RIDPAIR.RID1 Hash Join Scanner 1:1 Writer Hadoop accepts one MapReduce job at a time, it needs to M:N RECORD.RID= (RIDPAIRS) (RECORDPAIRS) hash(RID1) RIDPAIR.RID2 persist the output of the Join job for this example in HDFS. To Scanner M:N (RECORDS) hash(RID) minimize the “damage” caused by this materialization on the Hadoop completion time, we set the replication factor in HDFS Fig. 12: Hyracks-native plan for Stage 3 of Set-Similarity to 1 (meaning no replication). We observe that the Hyracks Self-Joins. job scales linearly with growing data size since all operations in the pipeline have linear computational complexity for the 1) Set-Similarity Joins (Take 2): Figures 10, 11, and 12 memory and data sizes considered here. show the native Hyracks job specifications to implement the To conclude, Hyracks is a more efficient alternative to three stages of the Hadoop based Set-Similarity Join examined Hadoop for MapReduce jobs, and even greater performance in the previous subsection. benefits can be achieved by not being restricted to the In Table I, the row labeled “Native” shows the time required MapReduce programming model. Here we have compared the by the Hyracks jobs to perform Set-Similarity Joins. We performance characteristics of Hadoop and Hyracks in detail can now compare these times to those that we examined for K-Means clustering, Set-Similarity joins, and an example previously. These faster Native times are due in part to the TPC-H query. We have measured even greater benefits for savings already discussed, but here we also see additional simpler jobs. For example, the standard Hadoop Word-Count algorithmic benefits. 
Stage 1 benefits the most from the Native MapReduce job on 12 GB of data ran approximately twice as formulation due to the use of hash based grouping, which does fast on Hyracks using the compatibility layer than on Hadoop, and a Native Word-Count implementation on Hyracks ran In Hadoop, in order to provide as much isolation from approximately 16 times as fast as the Hadoop MapReduce failures as possible, every map task writes its output to the version. local disk before its data is made available to the reducer side. On the reducer side, every partition fetched from every map C. Fault Tolerance Study task is written to local disk before a merge is performed. The A system designed to run on a large number of commodity goal of the extra I/O in Hadoop is to make sure that at most computers must be capable of detecting and reacting to pos- one task needs to be restarted when a failure occurs. In the sible faults that might occur during its regular operation. In current implementation of Hyracks, however, data is pipelined this part of the paper, we perform a very simple experiment from producers to consumers, so a single failure of any one to illustrate the fact that, while being able to run jobs through node requires a restart of all of the operators that are currently to the end despite failures is valuable, doing so by being pes- part of the same pipelined stage. simistic and ever-prepared for incremental forward-recovery We observe from Figure 13 that Hyracks runs the TPC- is not a general “right” answer. Hadoop uses a naive strategy H job to completion faster even when failures are injected based on storing all intermediate results to durable storage into the cluster very frequently. Note also that, since a given before making progress. While such a naive strategy is a safe Hyracks job runs to completion much faster than the “fault- first approach, we have started exploring the path of applying tolerant” Hadoop job, a given Hyracks job is less likely to be more selective fault-tolerance. Hyracks is in a better position to hit by a failure during its execution in the first place. exploit operator and connector properties to achieve the same As mentioned earlier, a strategy that requires a complete degree of fault tolerance while doing less work along the way. restart cannot be the only fault recovery strategy in a system. As our first step in this direction, the Hyracks fault-tolerance If the MTBF is less than the mean time to complete a job, strategy that we use for the experiment in this section is to then such a job may never finish. Work is currently underway simply restart all jobs impacted by failures. Although such a in Hyracks to limit the part of a job that needs to be restarted strategy cannot be the only one in a system like Hyracks, we while not losing the performance benefits of pipelining. believe that this is actually an effective strategy for smaller to medium-sized jobs. VI.RELATED WORK We used the TPC-H example query shown in Figure 1 on Early parallel database systems such as Gamma [14], data scale 40 for this experiment. In our experimental set- Teradata [24], and GRACE [25] applied partitioned-parallel up, we had the client submit the same job request over and processing to data management, particularly query processing, over again in an infinite loop without any think time and over two decades ago. In [26], DeWitt and Gray explained we measured the observed completion time for every request. 
In Hadoop, in order to provide as much isolation from failures as possible, every map task writes its output to local disk before its data is made available to the reducer side. On the reducer side, every partition fetched from every map task is written to local disk before a merge is performed. The goal of this extra I/O in Hadoop is to ensure that at most one task needs to be restarted when a failure occurs. In the current implementation of Hyracks, by contrast, data is pipelined from producers to consumers, so a single failure of any one node requires a restart of all of the operators that are part of the same pipelined stage.

We observe from Figure 13 that Hyracks runs the TPC-H job to completion faster even when failures are injected into the cluster very frequently. Note also that, since a given Hyracks job runs to completion much faster than the "fault-tolerant" Hadoop job, a given Hyracks job is less likely to be hit by a failure during its execution in the first place.

Fig. 13: TPC-H query from Fig. 1 in the presence of failures: job completion time (seconds) versus mean time between failures (seconds), for Hadoop and Hyracks.

As mentioned earlier, a strategy that requires a complete restart cannot be the only fault-recovery strategy in a system. If the MTBF is less than the mean time needed to complete a job, such a job may never finish. Work is currently underway in Hyracks to limit the part of a job that needs to be restarted while not losing the performance benefits of pipelining.
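The "may never finish" point can be made precise with a simple model (an illustrative calculation of ours, not part of the experiment). Suppose a job needs t seconds of failure-free execution, the mean time between failures is F seconds, and every failure forces a complete restart. With failures at exact intervals of F, the job completes only if t <= F. With exponentially distributed failure inter-arrival times of mean F, the job always completes eventually, but its expected completion time is

E[T] = F\left(e^{t/F} - 1\right),

which is close to t when t is much smaller than F and grows exponentially once t approaches and exceeds F. This is the regime in which a restart-only strategy stops being viable and the more selective fault tolerance mentioned above becomes necessary; it also matches the trend produced by the simulation sketch earlier in this section.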

VI. RELATED WORK

Early parallel database systems such as Gamma [14], Teradata [24], and GRACE [25] applied partitioned-parallel processing to data management, particularly query processing, over two decades ago. In [26], DeWitt and Gray explained how shared-nothing database systems can achieve very high scalability through the use of partitioned and pipelined parallelism, and many successful commercial products today are based on these principles. Hyracks essentially ports the same principles to the world of data-intensive computing. Hyracks' operator/connector division of labor was motivated in particular by the Gamma project's architecture: Gamma's operators implemented single-threaded data-processing logic, while its connections (split tables) dealt with repartitioning data. The modeling of operator trees by Garofalakis and Ioannidis [27] motivated the expansion of Hyracks operators into activities in order to extensibly capture blocking dependencies.

The introduction of Google's MapReduce system [1] led to the recent flurry of work in data-intensive computing. In contrast to MapReduce, which offers a simple but rigid framework for executing jobs expressed in specific ways (i.e., using mappers and reducers), Microsoft proposed Dryad [3], a generalized computing platform based on arbitrary computational vertices and communication edges. Dryad is a much lower-level platform than MapReduce; indeed, we could have built Hyracks on top of Dryad had it been available under an open source license. Hadoop [2], of course, is a popular open source implementation of the MapReduce platform, and Hyracks targets the same sorts of applications that Hadoop is used for. The recent Nephele/PACTs system [28] extends the MapReduce programming model with additional second-order operators; such an extension could be implemented as a layer on top of Hyracks, similar to our Hadoop compatibility layer. Finally, declarative languages such as Pig [5], Hive [7], and Jaql [6] all compile higher-level programs down to MapReduce jobs on Hadoop. These languages could potentially be re-targeted at a platform like Hyracks, with resulting performance benefits.

ETL (Extract-Transform-Load) systems have also used the DAG model to describe the transformation process. The DataStage ETL product from IBM [29] is similar to Hyracks in the way its graph of operators and data redistribution is expressed. DataStage does not provide automatic fault recovery, which is understandable given the smaller clusters it targets. Also, although DataStage generalizes the implementation of operators and data-redistribution logic, it provides no way for implementers to supply details about behavioral aspects that the infrastructure could use for scheduling purposes.

Recently, there has been work on making Hadoop jobs more performant through clever placement and organization of data [30] or by sharing intermediate computations across jobs [31]. Such work is orthogonal to the way that Hyracks achieves its performance improvements, so these techniques could be used in conjunction with Hyracks, leading to further gains. One other project that improves the performance of relational queries using Hadoop, essentially using it as a distributed job dispatcher, is HadoopDB [32].
VII. CONCLUSION AND FUTURE WORK

This paper introduced Hyracks, a new partitioned-parallel software platform that we have designed and implemented to run data-intensive computations on large shared-nothing clusters of computers. We described Hyracks' extensible, DAG-based programming model along with implementation details of the underlying infrastructure. We showed experimentally that Hyracks provides performance gains over MapReduce through its more flexible user model, while also being a more efficient implementation than Hadoop for MapReduce jobs, across a variety of data-intensive use cases. Lastly, we demonstrated the performance advantages of a less pessimistic approach to fault-tolerant job execution. We have released a preliminary version of Hyracks via open source (http://code.google.com/p/hyracks/) for use by other data-intensive computing researchers. Our next step will be to enhance Hyracks with resource- and load-based scheduling and more fine-grained tolerance to faults.

Acknowledgements: This project is supported by NSF IIS awards 0910989, 0910859, 0910820, and 0844574, a grant from the UC Discovery program, and a matching donation from eBay. We wish to thank Tyson Condie for making JOL available as a robust implementation of OverLog. Vinayak Borkar is supported in part by a Facebook Fellowship award.
REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI '04, December 2004, pp. 137–150. [Online]. Available: http://www.usenix.org/events/osdi04/tech/dean.html
[2] "Hadoop website," http://hadoop.apache.org.
[3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in EuroSys, 2007, pp. 59–72.
[4] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Scientific Programming, vol. 13, no. 4, pp. 277–298, 2005.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: a not-so-foreign language for data processing," in SIGMOD, 2008, pp. 1099–1110.
[6] "Jaql website," http://jaql.org/.
[7] "Hive website," http://hadoop.apache.org/hive.
[8] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," in OSDI, R. Draves and R. van Renesse, Eds. USENIX Association, 2008, pp. 1–14.
[9] R. Chaiken, B. Jenkins, P. A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, "SCOPE: easy and efficient parallel processing of massive data sets," Proc. VLDB Endow., vol. 1, no. 2, pp. 1265–1276, 2008.
[10] Wikipedia, "Accidental complexity — Wikipedia, the free encyclopedia," 2010, [Online; accessed 23-July-2010].
[11] "TPC-H," http://www.tpc.org/tpch/.
[12] "ASTERIX website," http://asterix.ics.uci.edu/.
[13] G. Graefe, "Query evaluation techniques for large databases," ACM Comput. Surv., vol. 25, no. 2, pp. 73–169, 1993.
[14] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. I. Hsiao, and R. Rasmussen, "The Gamma database machine project," IEEE TKDE, vol. 2, no. 1, pp. 44–62, 1990.
[15] P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. M. Hellerstein, and R. Sears, "BOOM Analytics: exploring data-centric, declarative programming for the cloud," in EuroSys, 2010, pp. 223–236.
[16] "Declarative languages and systems – Trac," https://trac.declarativity.net/.
[17] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, "Hyracks: A flexible and extensible foundation for data-intensive computing – long version," available from http://asterix.ics.uci.edu.
[18] "Apache Mahout: Scalable machine-learning and data-mining library," http://mahout.apache.org/.
[19] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce online," in NSDI. USENIX Association, 2010, pp. 313–328.
[20] R. Vernica, M. J. Carey, and C. Li, "Efficient parallel set-similarity joins using MapReduce," in SIGMOD, 2010, pp. 495–506.
[21] "DBLP," http://dblp.uni-trier.de/xml/dblp.xml.gz.
[22] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in SIGMOD, 2009, pp. 165–178.
[23] G. Graefe and P.-Å. Larson, "B-tree indexes and CPU caches," in ICDE, 2001, pp. 349–358.
[24] J. Shemer and P. Neches, "The genesis of a database computer," Computer, vol. 17, no. 11, pp. 42–56, Nov. 1984.
[25] S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An overview of the system software of a parallel relational database machine GRACE," in VLDB, 1986, pp. 209–219.
[26] D. DeWitt and J. Gray, "Parallel database systems: the future of high performance database systems," Commun. ACM, vol. 35, no. 6, pp. 85–98, 1992.
[27] M. N. Garofalakis and Y. E. Ioannidis, "Parallel query scheduling and optimization with time- and space-shared resources," in VLDB '97. San Francisco, CA, USA: Morgan Kaufmann, 1997, pp. 296–305.
[28] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke, "Nephele/PACTs: a programming model and execution framework for web-scale analytical processing," in SoCC. New York, NY, USA: ACM, 2010, pp. 119–130.
[29] "IBM DataStage website," http://www-01.ibm.com/software/data/infosphere/datastage/.
[30] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," PVLDB, vol. 3, no. 1, pp. 518–529, 2010.
[31] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, "MRShare: Sharing across multiple queries in MapReduce," PVLDB, vol. 3, no. 1, pp. 494–505, 2010.
[32] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," PVLDB, vol. 2, no. 1, pp. 922–933, 2009.