The Family of MapReduce and Large Scale Data Processing Systems

Sherif Sakr, NICTA and University of New South Wales, Sydney, Australia, [email protected]
Anna Liu, NICTA and University of New South Wales, Sydney, Australia, [email protected]
Ayman G. Fayoumi, King Abdulaziz University, Jeddah, Saudi Arabia, [email protected]

ABSTRACT

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

1. INTRODUCTION

We live in the era of Big Data, where we are witnessing a continuous increase in computational power that produces an overwhelming flow of data and has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. Powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology are putting massive volumes of data into the hands of scientists. For example, the Large Synoptic Survey Telescope [1] generates on the order of 30 TeraBytes of data every day. Many enterprises continuously collect large datasets that record customer interactions, product sales, results from advertising campaigns on the Web, and other types of information. For example, Facebook collects 15 TeraBytes of data each day into a PetaByte-scale data warehouse [116]. Jim Gray called the shift a "fourth paradigm" [61]. The first three paradigms were experimental, theoretical and, more recently, computational science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multi-core CPUs and mechanical hard disks is growing every year, which makes the challenges of data-intensive computing much harder to overcome [15]. Hence, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future [110]. In response, Gray argued that the new trend should instead focus on supporting cheaper clusters of computers to manage and process all this data instead of focusing on having the biggest and fastest single computer.

In general, the growing demand for large-scale data mining and data analysis applications has spurred the development of novel solutions from both the industry (e.g., web-data analysis, click-stream analysis, network-monitoring log analysis) and the sciences (e.g., analysis of data produced by massive-scale simulations, sensor deployments, high-throughput lab equipment). Although parallel database systems [40] serve some of these data analysis applications (e.g. Teradata1, SQL Server PDW2, Vertica3, Greenplum4, ParAccel5, Netezza6), they are expensive, difficult to administer and lack fault-tolerance for long-running queries [104].

1 http://teradata.com/
2 http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx
3 http://www.vertica.com/
4 http://www.greenplum.com/
5 http://www.paraccel.com/
6 http://www-01.ibm.com/software/data/netezza/

MapReduce [37] is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed such that a MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. One of the main advantages of this framework is its reliance on a simple and powerful programming model. In addition, it isolates the application developer from all the complex details of running a distributed program such as: issues on data distribution, scheduling and fault tolerance [103].

Recently, there has been a great deal of hype about cloud computing [10]. In principle, cloud computing is associated with a new paradigm for the provision of computing infrastructure. This paradigm shifts the location of this infrastructure to more centralized and larger scale datacenters in order to reduce the costs associated with the management of hardware and software resources.

In particular, cloud computing has promised a number of advantages for hosting the deployments of data-intensive applications such as:
• Reduced time-to-market by removing or simplifying the time-consuming hardware provisioning, purchasing and deployment processes.
• Reduced monetary cost by following a pay-as-you-go business model.
• Virtually unlimited throughput by adding servers if the workload increases.

In principle, the success of many enterprises often relies on their ability to analyze expansive volumes of data. In general, cost-effective processing of large datasets is a nontrivial undertaking. Fortunately, MapReduce frameworks and cloud computing have made it easier than ever for everyone to step into the world of big data. This technology combination has enabled even small companies to collect and analyze terabytes of data in order to gain a competitive edge. For example, the Amazon Elastic Compute Cloud (EC2)7 is offered as a commodity that can be purchased and utilised. In addition, Amazon has also provided the Amazon Elastic MapReduce8 as an online service to easily and cost-effectively process vast amounts of data without the need to worry about time-consuming set-up, management or tuning of computing clusters or the compute capacity upon which they sit. Hence, such services enable third-parties to perform their analytical queries on massive datasets with minimum effort and cost by abstracting the complexity entailed in building and maintaining computer clusters.

7 http://aws.amazon.com/ec2/
8 http://aws.amazon.com/elasticmapreduce/

The implementation of the basic MapReduce architecture had some limitations. Therefore, several research efforts have been triggered to tackle these limitations by introducing several advancements in the basic architecture in order to improve its performance. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data analysis that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. In particular, the remainder of this article is organized as follows. Section 2 describes the basic architecture of the MapReduce framework. Section 3 discusses several techniques that have been proposed to improve the performance and capabilities of the MapReduce framework from different perspectives. Section 4 covers several systems that support a high level SQL-like interface for the MapReduce framework. Section 5 reviews several large scale data processing systems that resemble some of the ideas of the MapReduce framework, without sharing its architecture or infrastructure, for different purposes and application scenarios. In Section 6, we conclude the article and discuss some of the future research directions for implementing the next generation of MapReduce/Hadoop-like solutions.

2. MAPREDUCE FRAMEWORK: BASIC ARCHITECTURE

The MapReduce framework is introduced as a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines [37, 38]. In particular, the implementation described in the original paper is mainly designed to achieve high performance on large clusters of commodity PCs. One of the main advantages of this approach is that it isolates the application from the details of running a distributed program, such as issues on data distribution, scheduling and fault tolerance. In this model, the computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The user of the MapReduce framework expresses the computation using two functions: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce framework groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function receives an intermediate key I with its set of values and merges them together. Typically just zero or one output value is produced per Reduce invocation. The main advantage of this model is that it allows large computations to be easily parallelized and re-executed, with re-execution serving as the primary mechanism for fault tolerance.

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

Figure 1: An Example MapReduce Program [37].

Figure 1 illustrates an example MapReduce program expressed in pseudo-code for counting the number of occurrences of each word in a collection of documents. In this example, the map function emits each word plus an associated count of occurrences while the reduce function sums together all counts emitted for a particular word.
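For concreteness, the same word-count computation can be written against Hadoop's Java MapReduce API. The following sketch mirrors the pseudo-code of Figure 1 and closely follows the word-count example that ships with Hadoop; the input and output paths are taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every word in the input value (a line of text).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);        // EmitIntermediate(w, 1)
          }
        }
      }
      // Reduce: sum all counts emitted for a particular word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          result.set(sum);
          context.write(key, result);        // Emit(word, total)
        }
      }
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }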
In principle, the design of the MapReduce framework has considered the following main principles [126]:
• Low-Cost Unreliable Commodity Hardware: Instead of using expensive, high-performance, reliable symmetric multiprocessing (SMP) or massively parallel processing (MPP) machines equipped with high-end network and storage subsystems, the MapReduce framework is designed to run on large clusters of commodity hardware. This hardware is managed and powered by open-source operating systems and utilities so that the cost is low.
• Extremely Scalable RAIN Cluster: Instead of using centralized RAID-based SAN or NAS storage systems, every MapReduce node has its own local off-the-shelf hard drives. These nodes are loosely coupled where they are placed in racks that can be connected with standard networking hardware connections. These nodes can be taken out of service with almost no impact to still-running MapReduce jobs. These clusters are called Redundant Array of Independent (and Inexpensive) Nodes (RAIN).
• Fault-Tolerant yet Easy to Administer: MapReduce jobs can run on clusters with thousands of nodes or even more. These nodes are not very reliable as, at any point in time, a certain percentage of these commodity nodes or hard drives will be out of order. Hence, the MapReduce framework applies straightforward mechanisms to replicate data and launch backup tasks so as to keep still-running processes going. To handle crashed nodes, system administrators simply take crashed hardware off-line. New nodes can be plugged in at any time without much administrative hassle. There are no complicated backup, restore and recovery configurations like the ones that can be seen in many DBMSs.
• Highly Parallel yet Abstracted: The most important contribution of the MapReduce framework is its ability to automatically support the parallelization of task executions. Hence, it allows developers to focus mainly on the problem at hand rather than worrying about the low level implementation details such as memory management, file allocation, parallel, multi-threaded or network programming. Moreover, MapReduce's shared-nothing architecture [113] makes it much more scalable and ready for parallelization.

Hadoop9 is an open source Java library [124] that supports data-intensive distributed applications by realizing the implementation of the MapReduce framework10. It has been widely used by a large number of business companies for production purposes11.

9 http://hadoop.apache.org/
10 In the rest of this article, we use the two names, MapReduce and Hadoop, interchangeably.
11 http://wiki.apache.org/hadoop/PoweredBy

On the implementation level, the Map invocations of a MapReduce job are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g. hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

Figure 2: An Overview of the Flow of Execution of a MapReduce Operation [37]. (The figure depicts the flow from the input, through the map phase and the intermediate files on local disk, to the reduce phase and the output files.)

Figure 2 illustrates an example of the overall flow of a MapReduce operation which goes through the following sequence of actions:
1. The input data of the MapReduce program is split into M pieces and many instances of the program are started on a cluster of machines.
2. One of the instances of the program is elected to be the master copy while the rest are considered as workers that are assigned their work by the master copy. In particular, there are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one or more map tasks and/or reduce tasks.
3. A worker who is assigned a map task processes the contents of the corresponding input split, generates key/value pairs from the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk and partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers, which is then sorted by the intermediate keys so that all occurrences of the same key are grouped together. The sorting operation is needed because typically many different keys map to the same reduce task.
6. The reduce worker passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master program wakes up the user program. At this point, the MapReduce invocation in the user program returns the program control back to the user code.

During the execution process, the master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks marked completed or in progress by the worker are reset back to their initial idle state and therefore become eligible for scheduling by other workers. Completed map tasks are re-executed on a task failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
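The partitioning function mentioned above is itself a pluggable component in Hadoop. The following minimal sketch shows a user-defined partitioner that reproduces the default hash(key) mod R behaviour; it is illustrative only, since Hadoop's built-in HashPartitioner already does exactly this:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate key to one of the R reduce partitions
    // using hash(key) mod R, as in the original MapReduce description.
    public class HashModPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result stays non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

    // A job selects the partitioner and the number R of reduce tasks via:
    //   job.setPartitionerClass(HashModPartitioner.class);
    //   job.setNumReduceTasks(R);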

3. EXTENSIONS AND ENHANCEMENTS OF THE MAPREDUCE FRAMEWORK

In practice, the basic implementation of the MapReduce framework is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Moreover, it provides a flexible framework for the execution of more complicated functions than those that can be directly supported in SQL. However, this basic architecture suffers from some limitations. [39] reported some possible improvements that can be incorporated into the MapReduce framework. Examples of these possible improvements include:
• MapReduce should take advantage of natural indices whenever possible.
• Most MapReduce output can be left unmerged since there is no benefit of merging it if the next consumer is just another MapReduce program.
• MapReduce users should avoid using inefficient textual formats.
In the following subsections we discuss some research efforts that have been conducted in order to deal with these challenges and the different improvements that have been made on the basic implementation of the MapReduce framework in order to achieve these goals.

3.1 Processing Join Operations

One main limitation of the MapReduce framework is that it does not support the joining of multiple datasets in one task. However, this can still be achieved with additional MapReduce steps. For example, users can map and reduce one dataset and read data from other datasets on the fly. Blanas et al. [18] have reported on a study that evaluated the performance of different distributed join algorithms using the MapReduce framework. In particular, they have evaluated the following implementation strategies of distributed join algorithms:
• Standard repartition join: The two input relations are dynamically partitioned on the join key and the corresponding pairs of partitions are joined using the standard partitioned sort-merge join approach.
• Improved repartition join: One potential problem with the standard repartition join is that all the records for a given join key from both input relations have to be buffered. Therefore, when the key cardinality is small or when the data is highly skewed, all the records for a given join key may not fit in memory. The improved repartition join strategy fixes the buffering problem by introducing the following key changes (a code sketch of the tagging scheme is given after this list):
– In the map function, the output key is changed to a composite of the join key and the table tag. The table tags are generated in a way that ensures records from one input relation will be sorted ahead of those from the other input relation on a given join key.
– The partitioning function is customized so that the hashcode is computed from just the join key part of the composite key. This way records with the same join key are still assigned to the same reduce task.
– As records from the smaller input are guaranteed to be ahead of those from the larger input L for a given join key, only the records from the smaller input are buffered and the records of the larger input are streamed to generate the join output.
• Broadcast join: Instead of moving both input relations across the network as in the repartition-based joins, the broadcast join approach moves only the smaller input relation so that it avoids the preprocessing sorting requirement of both input relations and, more importantly, avoids the network overhead for moving the larger relation.
• Semi-join: This join approach tries to avoid the problem of the broadcast join approach where it is possible to send many records of the smaller input relation across the network while they may not be actually referenced by any records in the other relation. It achieves this goal at the cost of an extra scan of the smaller input relation where it determines the set of unique join keys in the smaller relation, sends them to the other relation to specify the list of the actually referenced join keys and then sends only these records across the network for executing the real execution of the join operation.
• Per-split semi-join: This join approach tries to improve the semi-join approach with a further step to address the fact that not every record in the filtered version of the smaller relation will join with a particular split of the larger relation. Therefore, an extra process step is executed to determine the target split(s) of each filtered join key.
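As referenced in the improved repartition join above, the tagging idea can be sketched in Hadoop's Java API as follows. This is a simplified illustration rather than the implementation evaluated in [18]: the composite key is encoded as a tab-separated Text, the table tag is derived from the input file name (a naming convention assumed here for brevity), and a full implementation would additionally need a grouping comparator and a reducer that buffers only the records of the smaller relation:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Tags every record with its source table so that, at the reducer,
    // records of the smaller relation sort ahead of those of the larger one.
    public class TaggingJoinMapper extends Mapper<Object, Text, Text, Text> {
      private String tag;   // "0" for the smaller relation R, "1" for the larger relation L

      @Override
      protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag = file.startsWith("R") ? "0" : "1";   // file-naming convention assumed for the example
      }

      @Override
      protected void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String joinKey = fields[0];               // assume the join key is the first field
        // Composite key = join key + tag; the tag makes R records sort before L records.
        context.write(new Text(joinKey + "\t" + tag), value);
      }
    }

    // Partition on the join-key part only, so both tagged variants of a key
    // reach the same reduce task.
    class JoinKeyPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text compositeKey, Text value, int numReduceTasks) {
        String joinKey = compositeKey.toString().split("\t")[0];
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }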
subsystems, most search en- thousands of nodes at relatively low cost. A schedul- standard repartition join is that all the records for a given join Figure 3 illustrates a decision tree that summarizes the tradeoffs gines run on large clusters of commodity hardware. ing system places each Map and Reduce task at a near- key from both input relations have to be buffered. There- of the studied join strategies according to the results of that study. optimal node (considering the vicinity to data and load fore, when the key cardinality is small or when the data is Based onThis statistics, hardware such as is the managed relative data and size powered and the fraction by open-source of balancing), so that many Map-Reduce tasks can share highly skewed, all the records for a given join key may not fit the joinoperating key referenced, systems this decision and utilities, tree tries so to determinethat the what cost is low. in memory. The improved repartition join strategy fixes the is the right join strategy for a given circumstance. If data is not the same cluster. Extremely Scalable RAIN Cluster : Instead of us- buffering problem by introducing the following key changes: preprocessed,• the right join strategy depends on the size of the data High Performance by the Large : Even though – In the map function, the output key is changed to a com- transferreding via centralized the network. RAID-based If the network cost SAN of or broadcasting NAS storage an sys- • posite of the join key and the table tag. The table tags input relationtems,R everyto every Map-Reduce node is less expensive node has than its transferring own local off- Map-Reduce systems are generic, and not usually tuned are generated in a way that ensures records from one in- both R the-shelfand projected hardL, then drives. the broadcast These join nodes algorithm are should loosely cou- to be high performance for specific tasks, they still can put relation will be sorted ahead of those from the other be used.pled When in preprocessing rackable systems is allowed, connected semi-join, per-split with generic semi- LAN achieve high performance simply by being deployed on input relation on a given join key. join andswitches. directed join Loose with sufficient coupling partitions and shared-nothing are the best choices. architec- a large number of nodes. In [6], the authors mentioned – The partitioning function is customized so that the hash- Semi-join and per-split semi-join offer further flexibility since their a then world-record Terabyte [11] sorting benchmark code is computed from just the join key part of the com- preprocessingture make steps are Map-Reduce/GFS insensitive to how the clusters log table highly is orga- scalable. These nodes can be taken out of service with almost no by using Map-Reduce on thousands of machines. In posite key. This way records with the same join key are nized, and thus suitable for any number of reference . In ad- short, sheer parallelism can generate high performance, still assigned to the same reduce task. dition, theimpact preprocessing to still-running steps of these Map-Reduce two algorithms jobs. are cheaper These clus- and Map-Reduce programs can take advantage of it. – As records from the smaller input are guaranteed to since thereters is are no shuffling called ofRedundant the log data. Array of Independent (and be ahead of those from L for a given join key, only To tackleInexpensive) the limitation Nodes of the extra(RAIN) processing [18]. 
requirements GFS is essentially for a Shared-Disk Storage yet Shared-Nothing Com- the records from the smaller input are buffered and the performingRAIN join managementoperations in the MapReducesystem. framework, the Map- • puting : In a Map-Reduce environment, every node records of the larger input are streamed to generate the Reduce-Merge model [126] have been introduced to enable the pro- has its own local hard drives. Mappers and reduc- join output. cessingFault-Tolerant of multiple datasets. Figureyet Easy 4 illustrates to Administer the framework of: Due to ers use these local disks to store intermediate files and • Broadcast join: Instead of moving both input relations across this model• its where high the scalability, map phase Map-Reduce transforms an input jobs key/value can run pair on clus- these files are read remotely by reducers, i.e., Map- the network as in the repartition-based joins, the broadcast (k1,v1)tersinto a with list of thousands intermediate of key/value nodes pairs or even[(k2,v2 more.)]. The These join approach moves only the smaller input relation so that reduce function aggregates the list of values [v2] associated with Reduce is a shared-nothing architecture. However, nodes are not very reliable. At any point in time, a it avoids the preprocessing sorting requirement of both input k2 and produces a list of values [v3] which is also associated with Map-Reduce jobs read input from and write output to certain percentage of these commodity nodes or hard relations and more importantly avoids the network overhead k2. Note that inputs and outputs of both functions belong to the GFS, which is shared by every node. GFS replicates drives will be out of order. GFS and Map-Reduce are for moving the larger relation. same lineage (α). Another pair of map and reduce functions pro- disk chunks and uses pooled disks to support ultra • Semi-join: This join approach tries to avoid the problem of duce thedesigned intermediate not output to view(k3,[ thisv4]) from certain another rate lineage of failure (β). as an large files. Map-Reduce’s shared-nothing architecture the broadcast join approach where it is possible to send many Based onanomaly; keys k2 and insteadk3, the they merge use function straightforward combines the mechanisms two makes it much more scalable than one that shares disk records of the smaller input relation across the network while reducedto outputs replicate from different data andlineages launch into a list backup of key/value tasks out- so as to or memory. In the mean time, Map and Reduce tasks they may not be actually referenced by any records in the puts [(kkeep4,v5)]. still-running This final output processes becomes a going. new lineage To handle (γ). If α crashed β share an integrated GFS that makes thousands of disks other relation. It achieves this goal at the cost of an extra = thennodes, this merge system function administrators does a self-merge simply which take is similar crashed to hard- scan of the smaller input relation where it determines the set self-join in relational algebra. The main differences between the behave like one. of unique join keys in the smaller relation, send them to the processingware model off-line. of this frameworkNew nodes and can the originalbe plugged MapReduce in at isany time other relation to specify the list of the actual referenced join the productionwithout of amuch key/value administrative list from the reduce hassle. 
function There instead is no com- Set-Oriented Keys and Values; File Abstracted : • keys and then send only these records across the network for of just thatplicated of values. backup, This change restore is introduced and recovery because the configurations merge With GFS’s help, Map-Reduce can process thousands executing the real execution of the join operation. functionand/or requires procedures input datasets liketo be the organized ones (partitioned, that can then be seen in of file chunks in parallel. The volume can be far beyond many DBMS. the size limit set for an individual file by the underlying OS file system. Developers see data as keys and values, Simplified and Restricted yet Powerful : Map- no longer raw bits and bytes, nor file descriptors. • Reduce is a restricted programming model, it only provides straightforward map and reduce interfaces. Functional Programming Primitives : The Map- However, most search-engine (and generic) data pro- • Reduce interface is based on two functional-progra- cessing tasks can be effectively implemented in this mming primitives [6]. Their signatures are re-produced

Map-Join-Reduce [68] represents another approach that has been introduced with a filtering-join-aggregation programming model as an extension of the standard MapReduce filtering-aggregation programming model. In particular, in addition to the standard mapper and reducer operations of the standard MapReduce framework, they introduce a third operation, join (called joiner), to the framework. Hence, to join multiple datasets for aggregation, users specify a set of join() functions and the join order between them. Then, the runtime system automatically joins the multiple input datasets according to the join order and invokes the join() functions to process the joined records. They have also introduced a one-to-many shuffling strategy which shuffles each intermediate key/value pair to many joiners at one time. Using a tailored partition strategy, they can utilize the one-to-many shuffling scheme to join multiple datasets in one phase instead of a sequence of MapReduce jobs. The runtime system for executing a Map-Join-Reduce job launches two kinds of processes: MapTask and ReduceTask. Mappers run inside the MapTask process while joiners and reducers are invoked inside the ReduceTask process. Therefore, Map-Join-Reduce's process model allows for the pipelining of intermediate results between joiners and reducers since joiners and reducers are run inside the same ReduceTask process.

[5, 7] have presented another approach to improve the join phase in the MapReduce framework. The approach aims to optimize the communication cost by focusing on selecting the most appropriate attributes that are used to partition and replicate the data among the reduce processes. Therefore, it begins by identifying the map-key, the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share", which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, where the degree of replication depends on the shares for those map-key attributes that are missing from their schema. The approach considers two important special cases: chain joins (a sequence of 2-way join operations where the output of one operation in the sequence is used as an input to another operation in a pipelined fashion) and star joins (joining a large fact table with several smaller dimension tables). In each case, the proposed algorithm is able to determine the map-key and determine the shares that yield the least replication. The proposed approach is not always superior to the conventional way of using map-reduce to implement joins. However, there are some cases where the proposed approach results in clear wins such as: 1) Analytic queries in which a very large fact table is joined with smaller dimension tables. 2) Queries involving paths through graphs with a high out-degree, such as the Web or a social network.

3.2 Supporting Iterative Processing

The basic MapReduce framework does not directly support iterative data analysis applications. Instead, programmers must implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution using a driver program (a sketch of such a driver loop is given after the list below). In practice, there are two key problems with manually orchestrating an iterative program in MapReduce:
• Even though much of the data may be unchanged from iteration to iteration, the data must be re-loaded and re-processed at each iteration, wasting I/O, network bandwidth and CPU resources.
• The termination condition may involve the detection of when a fixpoint has been reached. This condition may itself require an extra MapReduce job on each iteration, again incurring overhead in terms of scheduling extra tasks, reading extra data from disk and moving data across the network.
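Such a driver program typically looks like the following sketch: a plain Hadoop driver that re-submits a job per iteration and tests for a fixpoint between iterations. The mapper, reducer and convergence test are placeholders (identity classes and a stub, respectively); the point is to make the re-loading and termination-check overheads discussed above visible:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int maxIterations = Integer.parseInt(args[1]);

        for (int i = 1; i <= maxIterations; i++) {
          Path output = new Path(args[0] + "_iter" + i);

          // Each iteration is a full MapReduce job: the (largely unchanged)
          // input is re-read from HDFS and re-shuffled every time.
          Job job = Job.getInstance(conf, "iteration-" + i);
          job.setJarByClass(IterativeDriver.class);
          job.setMapperClass(Mapper.class);     // identity map, placeholder for the loop body
          job.setReducerClass(Reducer.class);   // identity reduce, placeholder for the loop body
          FileInputFormat.addInputPath(job, input);
          FileOutputFormat.setOutputPath(job, output);
          if (!job.waitForCompletion(true)) System.exit(1);

          // Fixpoint test: compares the outputs of two consecutive iterations.
          // In plain Hadoop this is often yet another MapReduce job, which is
          // exactly the extra overhead noted in the list above.
          if (converged(conf, input, output)) break;
          input = output;   // the next iteration consumes this iteration's output
        }
      }

      // Placeholder convergence check, e.g. a distance between consecutive outputs.
      private static boolean converged(Configuration conf, Path previous, Path current) {
        return false;
      }
    }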
The HaLoop system [24] is designed to support iterative processing on the MapReduce framework by extending the basic MapReduce framework with two main functionalities:
1. Caching the invariant data in the first iteration and then reusing it in later iterations.
2. Caching the reducer outputs, which makes checking for a fixpoint more efficient, without an extra MapReduce job.
In order to accommodate the requirements of iterative data analysis applications, HaLoop has incorporated the following changes to the basic Hadoop MapReduce framework:
• It exposes a new application programming interface to users that simplifies the expression of iterative MapReduce programs.
• HaLoop's master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body until a user-specified stopping condition is met.
• It uses a new task scheduler that leverages data locality.
• It caches and indexes application data on slave nodes. In principle, the task tracker not only manages task execution but also manages caches and indices on the slave node and redirects each task's cache and index accesses to the local file system.
In principle, HaLoop relies on the same file system and has the same task queue structure as Hadoop, but the task scheduler and task tracker modules are modified, and the loop control, caching, and indexing modules are newly introduced to the architecture.

In the MapReduce framework, each map or reduce task contains its portion of the input data and the task runs by performing the map/reduce function on its input data records, where the life cycle of the task ends when the processing of all its input data records has been completed. The iMapReduce framework [133] supports the feature of iterative processing by keeping alive each map and reduce task during the whole iterative process. In particular, when all of the input data of a persistent task are parsed and processed, the task becomes dormant, waiting for the new updated input data. For a map task, it waits for the results from the reduce tasks and is activated to work on the new input records when the required data from the reduce tasks arrive. For the reduce tasks, they wait for the map tasks' output and are activated synchronously as in MapReduce. Jobs can terminate their iterative process in one of two ways:
1. Defining a fixed number of iterations: the iterative algorithm stops after it iterates n times.
2. Bounding the distance between two consecutive iterations: the iterative algorithm stops when the distance is less than a threshold.
The iMapReduce runtime system does the termination check after each iteration. To terminate the iterations by a fixed number of iterations, the persistent map/reduce task records its iteration number and terminates itself when the number exceeds a threshold. To bound the distance between the output from two consecutive iterations, the reduce tasks can save the output from two consecutive iterations and compute the distance. If the termination condition is satisfied, the master will notify all the map and reduce tasks to terminate their execution.

Other projects have been implemented for supporting iterative processing on the MapReduce framework. For example, Twister12 is a MapReduce runtime with an extended programming model that supports iterative MapReduce computations efficiently [42]. It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long running map/reduce tasks. In particular, it provides programming extensions to MapReduce with broadcast and scatter type data transfers. Microsoft has also developed a project that provides an iterative MapReduce runtime for Windows Azure called Daytona13.

12 http://www.iterativemapreduce.org/
13 http://research.microsoft.com/en-us/projects/daytona/

3.3 Data and Process Sharing

With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. Taking into account that different MapReduce jobs can perform similar work, there could be many opportunities for sharing the execution of their work. Thus, this sharing can reduce the overall amount of work, which consequently leads to the reduction of the monetary charges incurred while utilizing the resources of the processing infrastructure. The MRShare system [98] has been presented as a sharing framework which is tailored to transform a batch of queries into a new batch that will be executed more efficiently by merging jobs into groups and evaluating each group as a single query. Based on a defined cost model, they described an optimization problem that aims to derive the optimal grouping of queries in order to avoid performing redundant work and thus result in significant savings on both processing time and money. In particular, the approach considers exploiting the following sharing opportunities:
• Sharing Scans. To share scans between two mapping pipelines Mi and Mj, the input data must be the same. In addition, the key/value pairs should be of the same type. Given that, it becomes possible to merge the two pipelines into a single pipeline and scan the input data only once. However, it should be noted that such a combined mapping will produce two streams of output tuples (one for each mapping pipeline Mi and Mj). In order to distinguish the streams at the reducer stage, each tuple is tagged with a tag() part. This tagging part is used to indicate the origin mapping pipeline during the reduce phase.
• Sharing Map Output. If the map output key and value types are the same for two mapping pipelines Mi and Mj, then the map output streams for Mi and Mj can be shared. In particular, if Mapi and Mapj are applied to each input tuple, then the map output tuples coming only from Mapi are tagged with tag(i) only. If a map output tuple was produced from an input tuple by both Mapi and Mapj, it is then tagged by tag(i)+tag(j). Therefore, any overlapping parts of the map output will be shared. In principle, producing a smaller map output leads to savings on sorting and copying intermediate data over the network.
• Sharing Map Functions. Sometimes the map functions are identical and thus they can be executed once. At the end of the map stage two streams are produced, each tagged with its job tag. If the map output is shared, then clearly only one stream needs to be generated. Even if only some filters are common in both jobs, it is possible to share parts of map functions.
In practice, sharing scans and sharing map output yield I/O savings while sharing map functions (or parts of them) additionally yields CPU savings.
While the MRShare system focuses on sharing the processing between queries that are executed concurrently, the ReStore system [43, 44] has been introduced so that it can enable queries that are submitted at different times to share the intermediate results of previously executed jobs and reuse them for future jobs submitted to the system. In particular, each MapReduce job produces output that is stored in the distributed file system used by the MapReduce system (e.g. HDFS). These intermediate results are kept (for a defined period) and managed so that they can be used as input by subsequent jobs. ReStore can make use of whole-job as well as sub-job reuse opportunities. To achieve this goal, ReStore consists of two main components:
• Repository of MapReduce job outputs: It stores the outputs of previously executed MapReduce jobs and the physical plans of these jobs.
• Plan matcher and rewriter: Its aim is to find physical plans in the repository that can be used to rewrite the input jobs using the available matching intermediate results.
In principle, the approach of the ReStore system can be viewed as analogous to the steps of building and using materialized views for relational databases [55].

3.4 Support of Data Indices and Column Storage

One of the main limitations of the original implementation of the MapReduce framework is that it is designed in a way that the jobs can only scan the input data in a sequential-oriented fashion. Hence, the query processing performance of the MapReduce framework is unable to match the performance of a well-configured parallel DBMS [104]. In order to tackle this challenge, [41] have presented the Hadoop++ system which aims to boost the query performance of the Hadoop system without changing any of the system internals. They achieved this goal by injecting their changes through user-defined functions (UDFs) which only affect the Hadoop system from inside without any external effect. In particular, they introduce the following main changes:
• Trojan Index: The original Hadoop implementation does not provide index access due to the lack of a priori knowledge of the schema and the MapReduce jobs being executed. Hence, the Hadoop++ system is based on the assumption that if we know the schema and the anticipated MapReduce jobs, then we can create appropriate indices for the Hadoop tasks. In particular, the trojan index is an approach to integrate indexing capability into Hadoop in a non-invasive way. These indices are created during the data loading time and thus have no penalty at query time. Each trojan index provides an optional index access path which can be used for selective MapReduce jobs. The scan access path can still be used for other MapReduce jobs. These indices are created by injecting appropriate UDFs inside the Hadoop implementation. Specifically, the main features of trojan indices can be summarized as follows:
– No External Library or Engine: Trojan indices integrate indexing capability natively into the Hadoop framework without imposing a distributed SQL-query engine on top of it.
– Non-Invasive: They do not change the existing Hadoop framework. The index structure is implemented by providing the right UDFs.
– Optional Access Path: They provide an optional index access path which can be used for selective MapReduce jobs. However, the scan access path can still be used for other MapReduce jobs.
– Seamless Splitting: Data indexing adds an index overhead for each data split. Therefore, the logical split includes the data as well as the index, as it automatically splits the indexed data at logical split boundaries.
– Partial Index: A trojan index need not be built on the entire split. It can be built on any contiguous subset of the split as well.
– Multiple Indexes: Several trojan indexes can be built on the same split. However, only one of them can be the primary index. During query processing, an appropriate index can be chosen for data access based on the logical query plan and the cost model.
• Trojan Join: Similar to the idea of the trojan index, the Hadoop++ system assumes that if we know the schema and the expected workload, then we can co-partition the input data during the loading time. In particular, given any two input relations, they apply the same partitioning function on the join attributes of both relations at data loading time and place the co-group pairs, having the same join key from the two relations, on the same split and hence on the same node. As a result, join operations can then be processed locally within each node at query time. Implementing trojan joins does not require any changes to be made to the existing implementation of the Hadoop framework. The only changes are made on the internal management of the data splitting process. In addition, trojan indices can be freely combined with trojan joins.
The design and implementation of a column-oriented and binary backend storage format for Hadoop has been presented in [48]. In general, a straightforward way to implement a column-oriented storage format for Hadoop is to store each column of the input dataset in a separate file. However, this raises two main challenges:
• It requires generating roughly equal sized splits so that a job can be effectively parallelized over the cluster.
• It needs to ensure that the corresponding values from different columns in the dataset are co-located on the same node running the map task.
The first challenge can be tackled by horizontally partitioning the dataset and storing each partition in a separate subdirectory. The second challenge is harder to tackle because of the default 3-way block-level replication strategy of HDFS that provides fault tolerance on commodity servers but does not provide any co-location guarantees. [48] tackle this challenge by implementing a modified HDFS block placement policy which guarantees that the files corresponding to the different columns of a split are always co-located across replicas. Hence, when reading a dataset, the column input format can actually assign one or more split-directories to a single split, and the column files of a split-directory are scanned sequentially where the records are reassembled using values from corresponding positions in the files. A lazy record construction technique is used to mitigate the deserialization overhead in Hadoop, as well as to eliminate unnecessary disk I/O. The basic idea behind lazy record construction is to deserialize only those columns of a record that are actually accessed in a map function. Each column of the input dataset can be compressed using one of the following compression schemes:
1. Compressed Blocks: This scheme uses a standard compression algorithm to compress a block of contiguous column values. Multiple compressed blocks may fit into a single HDFS block. A header indicates the number of records in a compressed block and the block's size. This allows the block to be skipped if no values are accessed in it. However, when a value in the block is accessed, the entire block needs to be decompressed.
2. Dictionary Compressed Skip List: This scheme is tailored for map-typed columns. It takes advantage of the fact that the keys used in maps are often strings that are drawn from a limited universe. Such strings are well suited for dictionary compression. A dictionary of keys is built for each block of map values and the compressed keys are stored in a map using a skip list format. The main advantage of this scheme is that a value can be accessed without having to decompress an entire block of values.
One advantage of this approach is that adding a column to a dataset is not an expensive operation. This can be done by simply placing an additional file for the new column in each of the split-directories. On the other hand, a potential disadvantage of this approach is that the available parallelism may be limited for smaller datasets. Maximum parallelism is achieved for a MapReduce job when the number of splits is at least equal to the number of map tasks.

Figure 6: An Example Structure of CFile [84].

The Llama system [84] has introduced another approach of providing column storage support for the MapReduce framework. In this approach, each imported table is transformed into column groups where each group contains a set of files representing one or more columns. Llama introduced a column-wise format for Hadoop, called CFile, where each file can contain multiple data blocks and each block of the file contains a fixed number of records (Figure 6). However, the size of each logical block may vary since records can be variable-sized. Each file includes a block index, stored after all data blocks, which stores the offset of each block and is used to locate a specific block. In order to achieve storage efficiency, Llama uses block-level compression using any of the well-known compression schemes. In order to improve the query processing and the performance of join operations, Llama columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. In particular, it creates multiple vertical groups where each group is defined by a collection of columns, one of them specified as the sorting column. Initially, when a new table is imported into the system, a basic vertical group is created which contains all the columns of the table and is sorted by the table's primary key by default. In addition, based on statistics of query patterns, some auxiliary groups are dynamically created or discarded to improve the query performance. The Clydesdale system [70, 12], a system which has been implemented for targeting workloads where the data fits a star schema, uses CFile for storing its fact tables. It also relies on tailored join plans and a block iteration mechanism [135] for optimizing the execution of its target workloads.

RCFile [56] (Record Columnar File) is another data placement structure that provides column-wise storage for the Hadoop file system (HDFS). In RCFile, each table is first horizontally partitioned into multiple row groups where each row group is then vertically partitioned so that each column is stored independently. In particular, each table can have multiple HDFS blocks where each block organizes records with the basic unit of a row group. Depending on the row group size and the HDFS block size, an HDFS block can have only one or multiple row groups. In particular, a row group contains the following three sections:
1. The sync marker which is placed in the beginning of the row group and mainly used to separate two continuous row groups in an HDFS block.
2. A metadata header which stores the information items on how many records are in this row group, how many bytes are in each column and how many bytes are in each field in a column.
3. The table data section which is actually a column-store where all the fields in the same column are stored contiguously together.
RCFile utilizes column-wise data compression within each row group and provides a lazy decompression technique to avoid unnecessary column decompression during query execution. In particular, the metadata header section is compressed using the RLE (Run Length Encoding) algorithm. The table data section is not compressed as a whole unit. Instead, each column is independently compressed with the Gzip compression algorithm. When processing a row group, RCFile does not need to fully read the whole content of the row group into memory. It only reads the metadata header and the needed columns in the row group for a given query, and thus it can skip unnecessary columns and gain the I/O advantages of a column-store. The metadata header is always decompressed and held in memory until RCFile processes the next row group. However, RCFile does not decompress all the loaded columns and uses a lazy decompression technique where a column will not be decompressed in memory until RCFile has determined that the data in the column will be really useful for query execution.
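The effect of the metadata header and lazy decompression can be sketched as follows. The reader below skips, rather than decompresses, the columns a query does not need, using per-column byte counts from the header; the concrete on-disk layout (marker size, header encoding) is hypothetical and only mirrors the row-group structure described above:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Set;
    import java.util.zip.GZIPInputStream;

    public class RowGroupReader {
      // Reads one row group, decompressing only the columns listed in neededColumns.
      public void readRowGroup(DataInputStream in, int numColumns, Set<Integer> neededColumns)
          throws IOException {
        in.skipBytes(16);                           // hypothetical fixed-size sync marker
        int numRecords = in.readInt();              // metadata header, shown uncompressed here
        long[] columnBytes = new long[numColumns];  // stored size of each column chunk
        for (int c = 0; c < numColumns; c++) {
          columnBytes[c] = in.readLong();
        }
        for (int c = 0; c < numColumns; c++) {
          if (!neededColumns.contains(c)) {
            in.skipBytes((int) columnBytes[c]);     // lazy: skip the column, do not decompress
            continue;
          }
          byte[] compressed = new byte[(int) columnBytes[c]];
          in.readFully(compressed);
          // Each column is compressed independently (Gzip in RCFile's case).
          GZIPInputStream column = new GZIPInputStream(new ByteArrayInputStream(compressed));
          // ... hand 'column' and 'numRecords' to the record assembly logic ...
          column.close();
        }
      }
    }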
RCFile [56] (Record Columnar File) is another data placement structure that provides column-wise storage for the Hadoop file system (HDFS). In RCFile, each table is first horizontally partitioned into multiple row groups where each row group is then vertically partitioned so that each column is stored independently. In particular, each table can have multiple HDFS blocks where each block organizes records with the basic unit of a row group. Depending on the row group size and the HDFS block size, an HDFS block can have one or multiple row groups. In particular, a row group contains the following three sections:
1. The sync marker which is placed at the beginning of the row group and is mainly used to separate two continuous row groups in an HDFS block.
2. A metadata header which stores information on how many records are in this row group, how many bytes are in each column and how many bytes are in each field in a column.
3. The table data section which is actually a column-store where all the fields in the same column are stored contiguously together.
RCFile utilizes column-wise data compression within each row group and provides a lazy decompression technique to avoid unnecessary column decompression during query execution. In particular, the metadata header section is compressed using the RLE (Run Length Encoding) algorithm. The table data section is not compressed as a whole unit; instead, each column is independently compressed with the Gzip compression algorithm. When processing a row group, RCFile does not need to read the whole content of the row group into memory. It only reads the metadata header and the needed columns in the row group for a given query and thus it can skip unnecessary columns and gain the I/O advantages of a column-store. The metadata header is always decompressed and held in memory until RCFile processes the next row group. However, RCFile does not decompress all the loaded columns; it uses a lazy decompression technique where a column will not be decompressed in memory until RCFile has determined that the data in the column will really be useful for query execution.

The notion of Trojan Data Layout has been coined in [69], which exploits the existing data block replication in HDFS to create different Trojan Layouts on a per-replica basis. This means that rather than keeping all data block replicas in the same layout, it uses a different Trojan Layout for each replica which is optimized for a different subclass of queries. As a result, every incoming query can be scheduled to the most suitable data block replica. In particular, Trojan Layouts change the internal organization of a data block and not the organization among data blocks. They co-locate attributes together according to query workloads by applying a column grouping algorithm which uses an interestingness measure that denotes how well a set of attributes speeds up most or all queries in a workload. The column groups are then packed in order to maximize the total interestingness of data blocks. At query time, an incoming MapReduce job is transparently adapted to query the data block replica that minimizes the data access time. The map tasks of the MapReduce job are then routed to the data nodes storing such data block replicas.
3.5 Effective Data Placement
In the basic implementation of the Hadoop project, the objective of the data placement policy is to achieve good load balance by distributing the data evenly across the data servers, independently of the intended use of the data. This simple data placement policy works well with most Hadoop applications that access just a single file. However, there are other applications that process data from multiple files and can get a significant performance boost from customized strategies. In these applications, the absence of data colocation increases the data shuffling costs, increases the network overhead and reduces the effectiveness of data partitioning. CoHadoop [45] is a lightweight extension to Hadoop which is designed to enable colocating related files at the file system level while at the same time retaining the good load balancing and fault tolerance properties. It introduces a new file property to identify related data files and modifies the data placement policy of Hadoop to colocate copies of those related files on the same servers. These changes are designed in a way that retains the benefits of Hadoop, including load balancing and fault tolerance. In principle, CoHadoop provides a generic mechanism that allows applications to control data placement at the file-system level. In particular, a new file-level property called a locator is introduced and Hadoop's data placement policy is modified so that it makes use of this locator property. Each locator is represented by a unique value (ID) where each file in HDFS is assigned to at most one locator and many files can be assigned to the same locator. Files with the same locator are placed on the same set of datanodes, whereas files with no locator are placed via Hadoop's default strategy. It should be noted that this colocation process involves all data blocks, including replicas. Figure 7 shows an example of colocating two files, A and B, via a common locator. All of A's two HDFS blocks and B's three blocks are stored on the same set of datanodes. To manage the locator information and keep track of colocated files, CoHadoop introduces a new data structure, the locator table, which stores a mapping of locators to the list of files that share this locator. In practice, the CoHadoop extension enables a wide variety of applications to exploit data colocation by simply specifying related files, such as: colocating log files with reference files for joins, colocating partitions for grouping and aggregation, colocating index files with their data files and colocating columns of a table.

Figure 7: Example file colocation in CoHadoop [45].
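As a rough illustration of the locator idea, the sketch below (Java) models a locator table that maps each locator ID to the files sharing it, so that a CoHadoop-style placement policy could put their blocks on the same set of datanodes. It is a simplified in-memory model written for this survey, not CoHadoop's actual implementation or API.

// Illustrative in-memory model of CoHadoop's locator table: files that share a
// locator should be colocated on the same datanodes by the placement policy.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocatorTable {
    private final Map<Integer, List<String>> table = new HashMap<>();

    // Assign a file to a locator (each file has at most one locator).
    public void assign(int locator, String filePath) {
        table.computeIfAbsent(locator, k -> new ArrayList<>()).add(filePath);
    }

    // Files sharing a locator are the candidates for colocation.
    public List<String> filesFor(int locator) {
        return table.getOrDefault(locator, List.of());
    }

    public static void main(String[] args) {
        LocatorTable locators = new LocatorTable();
        locators.assign(42, "/logs/clicks-2013-01-01");   // log partition
        locators.assign(42, "/reference/accounts");       // reference data joined with it
        System.out.println(locators.filesFor(42));        // both files -> same set of nodes
    }
}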
3.6 Pipelining and Streaming Operations
The original implementation of the MapReduce framework has been designed in a way that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault tolerance mechanism. The MapReduce Online approach [33, 34] has been proposed as a modified architecture of the MapReduce framework in which intermediate data is pipelined between operators while preserving the programming interfaces and fault tolerance models of previous MapReduce frameworks. This pipelining approach provides important advantages to the MapReduce framework such as:
• The reducers can begin their processing of the data as soon as it is produced by mappers. Therefore, they can generate and refine an approximation of their final answer during the course of execution. In addition, they can provide initial estimates of the results several orders of magnitude faster than the final results.
• It widens the domain of problems to which MapReduce can be applied. For example, it facilitates the ability to design MapReduce jobs that run continuously, accepting new data as it arrives and analyzing it immediately (continuous queries). This allows MapReduce to be used for applications such as event monitoring and stream processing.
• Pipelining delivers data to downstream operators more promptly, which can increase opportunities for parallelism, improve utilization and reduce response time.
In this approach, each reduce task contacts every map task upon initiation of the job and opens a TCP socket which will be used to pipeline the output of the map function. As each map output record is produced, the mapper determines which partition (reduce task) the record should be sent to, and immediately sends it via the appropriate socket. A reduce task accepts the pipelined data it receives from each map task and stores it in an in-memory buffer. Once the reduce task learns that every map task has completed, it performs a final merge of all the sorted runs. In addition, the reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage in HDFS for what amounts to a temporary file. However, the computation of the reduce function from the previous job and the map function of the next job cannot be overlapped, as the final result of the reduce step cannot be produced until all map tasks have completed, which prevents effective pipelining. Therefore, the reducer treats the output of a pipelined map task as tentative until the JobTracker informs the reducer that the map task has committed successfully. The reducer can merge together spill files generated by the same uncommitted mapper, but will not combine those spill files with the output of other map tasks until it has been notified that the map task has committed. Thus, if a map task fails, each reduce task can ignore any tentative spill files produced by the failed map attempt. The JobTracker will take care of scheduling a new map task attempt, as in standard Hadoop. In principle, the main limitation of the MapReduce Online approach is that it is based on HDFS. Therefore, it is not suitable for streaming applications, in which data streams have to be processed without any disk involvement.
A similar approach has been presented in [85], which defines an incremental MapReduce job as one that processes data in large batches of tuples and runs continuously according to a specific window range and slide of increment. In particular, it produces a MapReduce result that includes all data within a window (of time or data size) every slide, and considers landmark MapReduce jobs where the trailing edge of the window is fixed and the system incorporates new data into the existing result. Map functions are trivially continuous and process data on a tuple-by-tuple basis. However, before the reduce function may process the mapped data, the data must be partitioned across the reduce operators and sorted. When the map operator first receives a new key-value pair, it calls the map function and inserts the result into the latest increment in the map results. The operator then assigns output key-value pairs to reduce tasks, grouping them according to the partition function. Continuous reduce operators participate in the sort as well, grouping values by their keys before calling the reduce function.
The Incoop system [17] has been introduced as a MapReduce implementation that has been adapted for incremental computations; it detects changes in the input datasets and enables the automatic update of the outputs of MapReduce jobs by employing a fine-grained result reuse mechanism. In particular, it allows MapReduce programs which are not designed for incremental processing to be executed transparently in an incremental manner. To achieve this goal, the design of Incoop introduces new techniques that are incorporated into the Hadoop MapReduce framework. For example, instead of relying on HDFS to store the input to MapReduce jobs, Incoop devises a file system called Inc-HDFS (Incremental HDFS) that provides mechanisms to identify similarities in the input data of consecutive job runs. In particular, Inc-HDFS splits the input into chunks whose boundaries depend on the file contents so that small changes to the input do not change all chunk boundaries (a sketch of such content-defined chunking is shown below). Therefore, this partitioning mechanism can maximize the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by offering the same interface and semantics. In addition, Incoop controls the granularity of tasks so that large tasks can be divided into smaller subtasks that can be re-used even when the large tasks cannot. Therefore, it introduces a new Contraction phase that leverages Combiner functions to reduce the network traffic by anticipating a small part of the processing done by the Reducer tasks and to control their granularity.
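The following minimal sketch (Java) illustrates the general idea behind content-defined chunking of the kind Inc-HDFS relies on: a fingerprint over a small window of recent bytes decides chunk boundaries, so inserting or deleting a few bytes only shifts the boundaries near the edit. The hash, window size and boundary mask below are assumptions made for the example, not Incoop's actual algorithm.

// Toy content-defined chunker: a boundary is declared whenever the hash of the
// last WINDOW bytes matches a mask. Small local edits therefore leave most
// chunk boundaries (and thus most reusable task results) unchanged.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContentDefinedChunker {
    private static final int WINDOW = 16;            // bytes examined per boundary test
    private static final int MASK = (1 << 12) - 1;   // ~4 KiB expected chunk size

    // Polynomial hash of the WINDOW bytes ending at position i (inclusive).
    private static int windowHash(byte[] data, int i) {
        int h = 0;
        for (int j = i - WINDOW + 1; j <= i; j++) {
            h = h * 131 + (data[j] & 0xff);
        }
        return h;
    }

    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            boolean boundary = i + 1 >= WINDOW && (windowHash(data, i) & MASK) == 0;
            if (boundary || i == data.length - 1) {
                chunks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] input = "a long log file ...".getBytes(StandardCharsets.UTF_8);
        System.out.println(chunk(input).size() + " chunks");
    }
}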


Furthermore, Incoop improves the effectiveness of memoization by implementing an affinity-based scheduler that applies a work stealing algorithm to minimize the amount of data movement across machines. This modified scheduler strikes a balance between exploiting the locality of previously computed results and executing tasks on any available machine to prevent straggling effects. At runtime, instances of incremental Map tasks take advantage of previously stored results by querying the memoization server. If they find that the result has already been computed, they fetch the result from the location of their memoized output and conclude. Similarly, the results of a Reduce task are remembered by storing them persistently and locally, where a mapping from a collision-resistant hash of the input to the location of the output is inserted into the memoization server.
The DEDUCE system [79] has been presented as a middleware that attempts to combine real-time stream processing with the capabilities of a large scale data analysis framework like MapReduce. In particular, it extends IBM's System S stream processing engine and augments its capabilities with those of the MapReduce framework. In this approach, the input data set to the MapReduce operator can be either pre-specified at compilation time or provided at runtime as a punctuated list of files or directories. Once the input data is available, the MapReduce operator spawns a MapReduce job and produces a punctuated list of files or directories which point to the output data. Therefore, a MapReduce operator can potentially spawn multiple MapReduce jobs over the application lifespan, but such jobs are spawned only when the preceding job (if any) has completed its execution. Hence, multiple jobs can be cascaded together to create a data-flow of MapReduce operators where the output from the MapReduce operators can be read to provide updates to the stream processing operators.

3.7 System Optimizations
In general, running a single program in a MapReduce framework may require tuning a number of parameters by users or system administrators. The settings of these parameters control various aspects of job behavior during execution such as memory allocation and usage, concurrency, I/O optimization, and network bandwidth usage. The submitter of a Hadoop job has the option to set these parameters either using a program-level interface or through XML configuration files. For any parameter whose value is not specified explicitly during job submission, default values, either shipped along with the system or specified by the system administrator, are used [11]. Users can run into performance problems because they do not know how to set these parameters correctly, or because they do not even know that these parameters exist.
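For concreteness, the short Java fragment below shows the program-level interface that Hadoop exposes for such parameters: values are set on a Configuration object before the job is submitted, and anything left unset falls back to the cluster-wide XML defaults. The specific property names and values shown are illustrative examples only; which parameters matter, and their exact names, vary across Hadoop versions.

// Setting a few tuning parameters programmatically before submitting a job.
// The property names/values below are illustrative; defaults from the cluster's
// XML configuration files apply to anything not set explicitly.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.6f);
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate data

        Job job = Job.getInstance(conf, "tuned-wordcount");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}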
[58] have focused on the optimization opportunities presented by the large space of configuration parameters for these programs. They introduced a Profiler component to collect detailed statistical information from unmodified MapReduce programs and a What-if Engine for fine-grained cost estimation. In particular, the Profiler component is responsible for two main aspects:
1. Capturing information at the fine granularity of phases within the map and reduce tasks of a MapReduce job execution. This information is crucial to the accuracy of decisions made by the What-if Engine and the Cost-based Optimizer components.
2. Using dynamic instrumentation to collect run-time monitoring information from unmodified MapReduce programs. The dynamic nature means that monitoring can be turned on or off on demand.
The What-if Engine's accuracy comes from how it uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution [57, 59, 60]. For a given MapReduce program, the role of the cost-based optimizer component is to enumerate and search efficiently through the high-dimensional space of configuration parameter settings, making appropriate calls to the What-if Engine, in order to find a good configuration setting. To do so, it clusters parameters into lower-dimensional subspaces such that the globally-optimal parameter setting in the high-dimensional space can be generated by composing the optimal settings found for the subspaces. Stubby [82] has been presented as a cost-based optimizer for MapReduce workflows that searches through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm.
The Manimal system [66, 25] is designed as a static analysis-style mechanism for detecting opportunities for applying relational-style optimizations in MapReduce programs. Like most programming-language optimizers, it is a best-effort system: it does not guarantee that it will find every possible optimization and it only indicates an optimization when it is entirely safe to do so. In particular, the analyzer component of the system is responsible for examining the MapReduce program and sending the resulting optimization descriptor to the optimizer component. In addition, the analyzer also emits an index generation program that can yield a B+Tree of the input file. The optimizer uses the optimization descriptor, plus a catalog of pre-computed indexes, to choose an optimized execution plan, called an execution descriptor. This descriptor, plus a potentially-modified copy of the user's original program, is then sent for execution on the Hadoop cluster. These steps are performed transparently to the user, whose submitted program does not need to be modified in any way. In particular, the main task of the analyzer is to produce a set of optimization descriptors which enable the system to carry out a phase roughly akin to logical rewriting of query plans in a relational database. The descriptors characterize a set of potential modifications that remain logically identical to the original plan. The catalog is a simple mapping from a filename to zero or more (X, O) pairs where X is an index file and O is an optimization descriptor. The optimizer examines the catalog to see if there is any entry for the input file. If not, then it simply indicates that Manimal should run the unchanged user program without any optimization. If there is at least one entry for the input file, and a catalog-associated optimization descriptor is compatible with the analyzer's output, then the optimizer can choose an execution plan that takes advantage of the associated index file.
A key feature of MapReduce is that it automatically handles failures, hiding the complexity of fault-tolerance from the programmer. In particular, if a node crashes, MapReduce automatically restarts the execution of its tasks. In addition, if a node is available but is performing poorly, MapReduce runs a speculative copy of its task (backup task) on another machine to finish the computation faster. Without this mechanism of speculative execution, a job would be as slow as the misbehaving task. This situation can arise for many reasons, including faulty hardware and system misconfiguration. On the other hand, launching too many speculative tasks may take away resources from useful tasks. Therefore, accuracy in estimating the progress and remaining time of long-running jobs is an important challenge for a runtime environment like the MapReduce framework. In particular, this information can play an important role in improving resource allocation, enhancing task scheduling, enabling query debugging or tuning the cluster configuration. The ParaTimer system [94, 95] has been proposed to tackle this challenge. In particular, ParaTimer provides techniques for handling several challenges including failures and data skew. To handle unexpected changes in query execution times, such as those due to failures, ParaTimer provides users with a set of time-remaining estimates that correspond to the predicted query execution times in different scenarios (i.e., a single worst-case failure, or data skew at an operator). Each of these indicators can be annotated with the scenario to which it corresponds, giving users a detailed picture of possible expected behaviors. To achieve this goal, ParaTimer estimates time-remaining by breaking queries into pipelines, where the time-remaining for each pipeline is estimated by considering the work to be done and the speed at which that work will be performed, taking (time-varying) parallelism into account. To get processing speeds, ParaTimer relies on earlier debug runs of the same query on input data samples generated by the user. In addition, ParaTimer identifies the critical path in a query plan and then estimates progress along that path, effectively ignoring other paths. [129] have presented an approach to estimate the progress of MapReduce tasks within clusters with heterogeneous hardware configurations. In these environments, choosing the node on which to run a speculative task is as important as choosing the task. They proposed an algorithm for speculative execution called LATE (Longest Approximate Time to End) which is based on three principles: prioritizing tasks to speculate, selecting fast nodes on which to run, and capping speculative tasks to prevent thrashing. In particular, the algorithm speculatively executes the task that it suspects will finish farthest into the future, because this task provides the greatest opportunity for a speculative copy to overtake the original and reduce the job's response time. To really get the best chance of beating the original task with the speculative task, the algorithm only launches speculative tasks on fast nodes (and not on the first available node).
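As a rough illustration of the LATE heuristic, the sketch below (Java) estimates each running task's time to completion from its progress score and elapsed time, and picks the straggler with the longest estimated time left, subject to a cap on concurrently running speculative copies. The thresholds and the Task fields are assumptions made for this example, not the exact quantities used in [129].

// Simplified LATE-style selection of a task to speculate: estimate time left as
// (1 - progress) / progressRate and pick the slowest task, up to a cap.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LateScheduler {
    static class Task {
        final String id;
        final double progress;        // 0.0 .. 1.0
        final double elapsedSeconds;
        Task(String id, double progress, double elapsedSeconds) {
            this.id = id; this.progress = progress; this.elapsedSeconds = elapsedSeconds;
        }
        double estimatedTimeLeft() {
            double rate = progress / Math.max(elapsedSeconds, 1e-9);
            return (1.0 - progress) / Math.max(rate, 1e-9);
        }
    }

    private static final int SPECULATIVE_CAP = 10;           // cap to prevent thrashing
    private static final double SLOW_TASK_THRESHOLD = 0.75;  // crude straggler filter

    // Returns the running task expected to finish farthest in the future,
    // assuming the caller only asks for work from a node it considers "fast".
    static Optional<Task> pickTaskToSpeculate(List<Task> running, int speculativeRunning) {
        if (speculativeRunning >= SPECULATIVE_CAP) {
            return Optional.empty();
        }
        return running.stream()
                .filter(t -> t.progress < SLOW_TASK_THRESHOLD)
                .max(Comparator.comparingDouble(Task::estimatedTimeLeft));
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(
                new Task("m-001", 0.90, 100),
                new Task("m-002", 0.20, 100),   // straggler: longest estimated time left
                new Task("m-003", 0.55, 100));
        pickTaskToSpeculate(tasks, 0).ifPresent(t -> System.out.println("speculate " + t.id));
    }
}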
The RAFT (Recovery Algorithms for Fast-Tracking) system [107, 108] has been introduced, as a part of the Hadoop++ system [41], for tracking and recovering MapReduce jobs under task or node failures. In particular, RAFT uses two main checkpointing mechanisms: local checkpointing and query metadata checkpointing. On the one hand, the main idea of local checkpointing is to utilize intermediate results, which are by default persisted by Hadoop, as checkpoints of ongoing task progress computation. In general, map tasks spill buffered intermediate results to local disk whenever the output buffer is about to overflow. RAFT exploits this spilling phase to piggy-back checkpointing metadata on the latest spill of each map task. For each checkpoint, RAFT stores a triplet of metadata that includes the taskID, which represents a unique task identifier, the spillID, which represents the local path to the spilled data, and the offset, which specifies the last byte of input data that was processed in that spill. To recover from a task failure, the RAFT scheduler reallocates the failed task to the same node that was running the task. Then, the node resumes the task from the last checkpoint and reuses the spills previously produced for the same task. This simulates a situation where previous spills appear as if they were just produced by the task. In case there is no local checkpoint available, the node recomputes the task from the beginning. On the other hand, the idea behind query metadata checkpointing is to push intermediate results to reducers as soon as map tasks are completed and to keep track of those incoming key-value pairs that produce local partitions and hence are not shipped to another node for processing. Therefore, in case of a node failure, the RAFT scheduler can recompute local partitions.
4. SYSTEMS OF DECLARATIVE INTERFACES FOR THE MAPREDUCE FRAMEWORK
For programmers, a key appealing feature of the MapReduce framework is that there are only two main high-level declarative primitives (map and reduce) that can be written in any programming language of choice and without worrying about the details of their parallel execution. On the other hand, the MapReduce programming model has its own limitations such as:
• Its one-input data format (key/value pairs) and two-stage data flow are extremely rigid. As we have previously discussed, performing tasks that have a different data flow (e.g. joins or n stages) would require devising inelegant workarounds.
• Custom code has to be written for even the most common operations (e.g. projection and filtering), which leads to code that is usually difficult to reuse and maintain unless the users build and maintain their own libraries with the common functions they use for processing their data.
Moreover, many programmers may be unfamiliar with the MapReduce framework and would prefer to use SQL (in which they are more proficient) as a high-level declarative language to express their task while leaving all of the execution optimization details to the backend engine. In addition, it is beyond doubt that high-level language abstractions enable the underlying system to perform automatic optimization. In the following subsections we discuss research efforts that have been proposed to tackle these problems and add SQL-like interfaces on top of the MapReduce framework.

4.1 Sawzall
Sawzall [105] is a scripting language used at Google on top of MapReduce. A Sawzall program defines the operations to be performed on a single record of the data. There is nothing in the language to enable examining multiple input records simultaneously, or even to have the contents of one input record influence the processing of another. The only output primitive in the language is the emit statement, which sends data to an external aggregator (e.g. Sum, Average, Maximum, Minimum) that gathers the results from each record, after which the results are correlated and processed. The authors argue that aggregation is done outside the language for a couple of reasons: 1) A more traditional language can use the language itself to correlate results, but some of the aggregation algorithms are sophisticated and are best implemented in a native language and packaged in some form. 2) Drawing an explicit line between filtering and aggregation enables a high degree of parallelism and hides the parallelism from the language itself.
Figure 8 depicts an example Sawzall program where the first three lines declare the aggregators count, total and sumOfSquares. The keyword table introduces an aggregator type; aggregators are called tables in Sawzall even though they may be singletons. These particular tables are sum tables which add up the values emitted to them, ints or floats as appropriate. The Sawzall language is implemented as a conventional compiler, written in C++, whose target language is an interpreted instruction set, or byte-code. The compiler and the byte-code interpreter are part of the same binary, so the user presents source code to Sawzall and the system executes it directly. It is structured as a library with an external interface that accepts source code which is then compiled and executed, along with bindings to connect to externally-provided aggregators. The datasets of Sawzall programs are often stored in the Google File System (GFS) [52]. The business of scheduling a job to run on a cluster of machines is handled by a piece of software called Workqueue, which creates a large-scale time-sharing system out of an array of computers and their disks. It schedules jobs, allocates resources, reports status and collects the results.

count: table sum of int;
total: table sum of float;
sumOfSquares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sumOfSquares <- x * x;

Figure 8: An Example Sawzall Program.
Google has also developed FlumeJava [28], a Java library for developing and running data-parallel pipelines on top of MapReduce. FlumeJava is centered around a few classes that represent parallel collections. Parallel collections support a modest number of parallel operations which are composed to implement data-parallel computations, where an entire pipeline, or even multiple pipelines, can be translated into a single Java program using the FlumeJava abstractions. To achieve good performance, FlumeJava internally implements parallel operations using deferred evaluation. The invocation of a parallel operation does not actually run the operation, but instead simply records the operation and its arguments in an internal execution plan graph structure. Once the execution plan for the whole computation has been constructed, FlumeJava optimizes the execution plan and then runs the optimized plan. When running the execution plan, FlumeJava chooses which strategy to use to implement each operation (e.g., local sequential loop vs. remote parallel MapReduce) based in part on the size of the data being processed, places remote computations near the data on which they operate, and performs independent operations in parallel.

4.2 Pig Latin
Olston et al. [100] have presented a language called Pig Latin that takes a middle position between expressing tasks using the high-level declarative querying model in the spirit of SQL and the low-level/procedural programming model using MapReduce. Pig Latin is implemented in the scope of the Apache Pig project14 and is used by programmers at Yahoo! for developing data analysis tasks. Writing a Pig Latin program is similar to specifying a query execution plan (e.g. a data flow graph). To experienced programmers, this method is more appealing than encoding their task as an SQL query and then coercing the system to choose the desired plan through optimizer hints. In general, automatic query optimization has its limits, especially with uncataloged data, prevalent user-defined functions and parallel execution, which are all features of the data analysis tasks targeted by the MapReduce framework. Figure 9 shows an example SQL query and its equivalent Pig Latin program. Given a URL table with the structure (url, category, pagerank), the task of the query is to find, for each sufficiently large category, the average pagerank of its high-pagerank urls (> 0.2):

SQL:
  SELECT category, AVG(pagerank)
  FROM urls
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 10^6

Pig Latin:
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Figure 9: An Example SQL Query and Its Equivalent Pig Latin Program [50].

A Pig Latin program is described as a sequence of steps where each step represents a single data transformation. This characteristic is appealing to many programmers. At the same time, the transformation steps are described using high-level primitives (e.g. filtering, grouping, aggregation) much like in SQL.
Pig Latin has several other features that are important for casual ad-hoc data analysis tasks. These features include support for a flexible, fully nested data model, extensive support for user-defined functions and the ability to operate over plain input files without any schema information [51]. In particular, Pig Latin has a simple data model consisting of the following four types:
1. Atom: An atom contains a simple atomic value such as a string or a number, e.g. "alice".
2. Tuple: A tuple is a sequence of fields, each of which can be any of the data types, e.g. ("alice", "lakers").
3. Bag: A bag is a collection of tuples with possible duplicates. The schema of the constituent tuples is flexible: not all tuples in a bag need to have the same number and type of fields, e.g. {("alice", "lakers"), ("alice", ("iPod", "apple"))}.
4. Map: A map is a collection of data items, where each item has an associated key through which it can be looked up. As with bags, the schema of the constituent data items is flexible. However, the keys are required to be data atoms, e.g. ["k1" → ("alice", "lakers"), "k2" → "20"].
To accommodate specialized data processing tasks, Pig Latin has extensive support for user-defined functions (UDFs). The input and output of UDFs in Pig Latin follow its fully nested data model. Pig Latin is architected such that the parsing of the Pig Latin program and the logical plan construction are independent of the execution platform. Only the compilation of the logical plan into a physical plan depends on the specific execution platform chosen. Currently, Pig Latin programs are compiled into sequences of MapReduce jobs which are executed using the Hadoop MapReduce environment.
In particular, a Pig Latin program goes through a series of transformation steps [100] before being executed, as depicted in Figure 10. The parsing step verifies that the program is syntactically correct and that all referenced variables are defined. The output of the parser is a canonical logical plan with a one-to-one correspondence between Pig Latin statements and logical operators, which are arranged in a directed acyclic graph (DAG). The logical plan generated by the parser is passed through a logical optimizer. In this stage, logical optimizations such as projection pushdown are carried out. The optimized logical plan is then compiled into a series of MapReduce jobs which are then passed through another optimization phase. The DAG of optimized MapReduce jobs is then topologically sorted and the jobs are submitted to Hadoop for execution.

Figure 10: Pig compilation and execution steps (Parser → Logical Optimizer → MapReduce Compiler → MapReduce Optimizer → Hadoop) [100].

14 http://incubator.apache.org/pig

4.3 Hive
The Hive project15 is an open-source data warehousing solution which has been built by the Facebook Data Infrastructure Team on top of the Hadoop environment [116]. The main goal of this project is to bring the familiar relational database concepts (e.g. tables, columns, partitions) and a subset of SQL to the unstructured world of Hadoop while still maintaining the extensibility and flexibility that Hadoop provides. Thus, it supports all the major primitive types (e.g. integers, floats, strings) as well as complex types (e.g. maps, lists, structs). Hive supports queries expressed in an SQL-like declarative language, HiveQL16, and therefore can be easily understood by anyone who is familiar with SQL. These queries are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom MapReduce scripts into queries [118]. For example, the canonical MapReduce word count example on a table of documents (Figure 1) can be expressed in HiveQL as depicted in Figure 11, where the MAP clause indicates how the input columns (doctext) can be transformed using a user program ('python wc_mapper.py') into output columns (word and cnt).

FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';

Figure 11: An Example HiveQL Query [116].

15 http://hadoop.apache.org/hive/
16 http://wiki.apache.org/hadoop/Hive/LanguageManual
The REDUCE clause specifies the user program to invoke ('python wc_reduce.py') on the output columns of the subquery.
HiveQL supports Data Definition Language (DDL) statements which can be used to create, drop and alter tables in a database [117]. It allows users to load data from external sources and insert query results into Hive tables via the load and insert Data Manipulation Language (DML) statements respectively. However, HiveQL currently does not support updating or deleting rows in existing tables (in particular, the INSERT INTO, UPDATE and DELETE statements), which allows the use of very simple mechanisms to deal with concurrent read and write operations without implementing complex locking protocols. The metastore component is Hive's system catalog which stores metadata about the underlying tables. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The metastore distinguishes Hive as a traditional warehousing solution when compared with similar data processing systems that are built on top of MapReduce-like architectures like Pig Latin [100].

4.4 Tenzing
The Tenzing system [30] has been presented by Google as an SQL query execution engine which is built on top of MapReduce and provides a comprehensive SQL92 implementation with some SQL99 extensions (e.g. the ROLLUP() and CUBE() OLAP extensions). Tenzing also supports querying data in different formats such as: row stores (e.g. MySQL databases), column stores, Bigtable (Google's built-in key-value store) [29], GFS (Google File System) [52], text and protocol buffers. In particular, the Tenzing system has four major components:
• The distributed worker pool: represents the execution system which takes a query execution plan and executes the MapReduce jobs. The pool consists of master and worker nodes plus an overall gatekeeper called the master watcher. The workers manipulate the data for all the tables defined in the metadata layer.
• The query server: serves as the gateway between the client and the pool. The query server parses the query, applies different optimization mechanisms and sends the plan to the master for execution. In principle, the Tenzing optimizer applies some basic rule- and cost-based optimizations to create an optimal execution plan.
• Client interfaces: Tenzing has several client interfaces including a command line client (CLI) and a Web UI. The CLI is a more powerful interface that supports complex scripting while the Web UI supports easier-to-use features such as query and table browser tools. There is also an API to directly execute queries on the pool and a standalone binary which does not need any server-side components but rather can launch its own MapReduce jobs.
• The metadata server: provides an API to store and fetch metadata such as table names and schemas and pointers to the underlying data.
A typical Tenzing query is submitted to the query server (through the Web UI, CLI or API), which is responsible for parsing the query into an intermediate parse tree and fetching the required metadata from the metadata server. The query optimizer goes through the intermediate format, applies various optimizations and generates a query execution plan that consists of one or more MapReduce jobs. For each MapReduce job, the query server finds an available master using the master watcher and submits the query to it. At this stage, the execution is physically partitioned into multiple units of work where idle workers poll the masters for available work. The query server monitors the generated intermediate results, gathers them as they arrive and streams the output back to the client. In order to increase throughput, decrease latency and execute SQL operators more efficiently, Tenzing has enhanced the MapReduce implementation with the following main changes:
• Streaming and In-memory Chaining: The implementation of Tenzing does not serialize the intermediate results of MapReduce jobs to GFS. Instead, it streams the intermediate results between the Map and Reduce tasks using the network and uses GFS only for backup purposes. In addition, it uses a memory chaining mechanism where the reducer and the mapper of the same intermediate results are co-located in the same process.
• Sort Avoidance: Certain operators such as hash join and hash aggregation require shuffling but not sorting. The MapReduce API was enhanced to automatically turn off sorting for these operations, when possible, so that the mapper feeds data to the reducer which automatically bypasses the intermediate sorting step. Tenzing also implements a block-based shuffle mechanism that combines many small rows into compressed blocks, each of which is treated as one row, in order to avoid reducer-side sorting and some of the overhead associated with row serialization and deserialization in the underlying MapReduce framework code.

4.5 SQL/MapReduce
In general, a user-defined function (UDF) is a powerful database feature that allows users to customize database functionality. Friedman et al. [49] introduced the SQL/MapReduce (SQL/MR) UDF framework which is designed to facilitate parallel computation of procedural functions across hundreds of servers working together as a single relational database. The framework is implemented as part of the Aster Data Systems17 nCluster shared-nothing relational database. The framework leverages ideas from the MapReduce programming paradigm to provide users with a straightforward API through which they can implement a UDF in the language of their choice. Moreover, it allows maximum flexibility as the output schema of the UDF is specified by the function itself at query plan-time. This means that a SQL/MR function is polymorphic: it can process arbitrary input because its behavior, as well as its output schema, are dynamically determined by information available at query plan-time. This also increases reusability, as the same SQL/MR function can be used on inputs with many different schemas or with different user-specified parameters. In particular, SQL/MR allows the user to write custom-defined functions in any programming language and insert them into queries that leverage traditional SQL functionality. A SQL/MR function is defined in a manner that is similar to MapReduce's map and reduce functions.

SELECT ...
FROM functionname(
  ON table-or-query
  [PARTITION BY expr, ...]
  [ORDER BY expr, ...]
  [clausename(arg, ...) ...]
)
...

Figure 12: Basic Syntax of the SQL/MR Query Function [49].

17 http://www.asterdata.com/
The syntax for using a SQL/MR function is depicted in Figure 12, where the SQL/MR function invocation appears in the SQL FROM clause and consists of the function name followed by a set of clauses that are enclosed in parentheses. The ON clause specifies the input to the invocation of the SQL/MR function. It is important to note that the input schema to the SQL/MR function is specified implicitly at query plan-time in the form of the output schema of the query used in the ON clause. In practice, a SQL/MR function can be either a mapper (Row function) or a reducer (Partition function). The definitions of row and partition functions ensure that they can be executed in parallel in a scalable manner. For the Row Function, each row from the input table or query will be operated on by exactly one instance of the SQL/MR function. Semantically, each row is processed independently, allowing the execution engine to control parallelism. For each input row, the row function may emit zero or more rows. For the Partition Function, each group of rows as defined by the PARTITION BY clause will be operated on by exactly one instance of the SQL/MR function. If the ORDER BY clause is provided, the rows within each partition are provided to the function instance in the specified sort order. Semantically, each partition is processed independently, allowing parallelization by the execution engine at the level of a partition. For each input partition, the SQL/MR partition function may output zero or more rows.
4.6 HadoopDB
There has been a long debate on the comparison between the MapReduce framework and parallel database systems18 [114]. Pavlo et al. [104] have conducted a large scale comparison between the Hadoop implementation of the MapReduce framework and parallel SQL database management systems in terms of performance and development complexity. The results of this comparison have shown that parallel database systems displayed a significant performance advantage over MapReduce in executing a variety of data intensive analysis tasks. On the other hand, the Hadoop implementation was very much easier and more straightforward to set up and use in comparison to that of the parallel database systems. MapReduce has also been shown to have superior performance in minimizing the amount of work that is lost when a hardware failure occurs. In addition, MapReduce (with its open source implementations) represents a very cheap solution in comparison to the very financially expensive parallel DBMS solutions (the price of an installation of a parallel DBMS cluster usually consists of 7 figures of U.S. Dollars) [114].

18 http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
The HadoopDB project19 is a hybrid system that tries to combine the scalability advantages of MapReduce with the performance and efficiency advantages of parallel databases [3]. The basic idea behind HadoopDB is to connect multiple single-node database systems (PostgreSQL) using Hadoop as the task coordinator and network communication layer. Queries are expressed in SQL but their execution is parallelized across nodes using the MapReduce framework; however, as much of the single-node query work as possible is pushed inside the corresponding node databases. Thus, HadoopDB tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job tracking implementation from Hadoop. In parallel, it tries to achieve the performance of parallel databases by doing most of the query processing inside the database engine. Figure 13 illustrates the architecture of HadoopDB, which consists of two layers: 1) A data storage layer, the Hadoop Distributed File System20 (HDFS). 2) A data processing layer, the MapReduce Framework. In this architecture, HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas. The MapReduce Framework follows a simple master-slave architecture. The master is a single JobTracker and the slaves or worker nodes are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources. The TaskTrackers regularly update the JobTracker with their status through heartbeat messages.

Figure 13: The Architecture of HadoopDB [3].

19 http://db.cs.yale.edu/hadoopdb/hadoopdb.html
20 http://hadoop.apache.org/hdfs/
File Systems Databases Streams   Task with   MapReduce has a flexible query interface; Map and Reduce func- InputFormat (hdfs, gpfs, local) (DBMS, HBase) (Web, Pipes) tions are just arbitrary computations written in a general-purpose Node 1 Node 2 Node n language. Therefore, it is possible for each task to do anything on Figure 4: ATaskTracker partitioned log file withTaskTracker “success” (white) andTaskTracker “ex- Figure 14: Jaql System Architecture [16]. its input, just as long as its output follows the conventions defined ception” (shaded) records. Figure 5: System architecture. by the model. In general, most MapReduce-based systems (such as Database DataNode Database DataNode Database DataNode Hadoop, which directly implements the systems-level details of the pound types that can be arbitrarily nested. In more detail, an array is top-levelan orderedptumblingWindow collection of values andfunction. can be used In to addition, model dataptumbling- struc- MapReduce paper) do not accept declarative SQL. However, there Example 7 The following script is very similar to the previous one that is shown n Example 6, but the read and tumblingWindow have windowtures suchis as sufficiently vectors, lists, abstract sets or so bags. that A it record can is be an applied unordered to any are some exceptions (such as Hive). been composed into a single function, ptumblingWindow, that has collection.collection of Using name-value features pairs like and the can ones model described structs, dictionaries here, we have Figure 13: 1: The The Architecture Architecture of HadoopDB of HadoopDB [3]. As shown in previous work, the biggest issue with MapReduce been designed to run in parallel: builtand parallelmaps. Despite enumeration, its simplicity, sliding JDM windows, is very sampling, flexible. It and allows various is performance [23]. By not requiring the user to first model and joinJaql algorithms, to operate with to name a variety a few. of different data representations for load data before processing, many of the performance enhancing ptumblingWindowlocalityThe JobTracker by matching(lines handles(’log’), a TaskTracker the runtimeisHeader scheduling to) Map of tasks MapReduce that process jobs databoth input and output, including delimited text files, JSON files, bi- nary files, Hadoop’s SequenceFiles, relational databases, key-value tools listed above that are used by database systems are not possible. -> localtransformand to maintains it. ItcleanRec load-balances information($) on by each ensuring TaskTracker’s all available load and TaskTrackers avail- -> arewriteable assigned(lines resources. tasks.(’clean’)); The TaskTrackersDatabase Connector regularlyis the update interface the between JobTracker7.stores JAQL’S or XML documents. SYSTEM Functions IMPLEMENTATION are first-class values in Jaql. Traditional business data analytical processing, that have standard They can be assigned to a variable and are high-order in that they withindependent their status database through systems heartbeat residing . on nodes in the cluster and At a high-level, the Jaql architecture depicted in Figure 5 is sim- reports and many repeated queries, is particularly, poorly suited for TaskTrackers. The Connector connects to the database, executes can be passed as parameters or used as a return value. Functions The new script remains at a fairly high level, but uses a new ilar to most database systems. Scripts are passed into the system the one-time query processing model of MapReduce. 
Thethe SQL InputFormat query and returns library results represents as key-value the pairs. interface The Catalog between theare the key ingredient for reusability as any Jaql expression can be tumblingWindow from the interpreter or an application, compiled by the parser and Ideally, the fault tolerance and ability to operate in heterogeneous parallelstoragecomponent version and processing of maintains metadata layers. about InputFormatimplemented the databases, implementations in their Jaql location, using parseencapsulated in a function, and a function can be parameterized in low–level primitives for split manipulation. Next, we describe the rewrite engine, and either explained or evaluated over data from the environment properties of MapReduce could be combined with the text/binaryreplica locations files (or and connect data partitioning to arbitrary properties. data sources) The Data and Loader transform powerful ways. lowthe levelcomponent data operations into key-valueis responsible needed for pairs forptumblingWindow globally that Map repartitioning tasks, but can omit data process. on the a fullgiven HadoopI/OAt layer. a high-level, Jaql modules the Jaql provide architecture organization depicted and in abstractionFigure 14 is over performance of parallel databases systems. In the following sec- reusable components, which are introspected during compilation. Jaqlprovides sourcepartition for several conciseness.key upon InputFormat loading and breaking implementations apart single nodeincluding data into one thatsimilar to most database systems. Scripts are passed into the sys- tions, we will describe our attempt to build such a hybrid system. Scriptstem from may the bind interpreter variables or an to application, values, or more compiled often by to the expressions parser Buildingallowsmultiple a Blocks: single smaller JDBC-compliant partitionsSplit manipulation or chunks. database The inSMS Jaql to planner be is accessed accom-extends by all plishedthe through HiveQL three translator low-level [116] (Section functions: 4.3) 1) andinputSplits transforms SQL, 2) into thatand serverewrite as engine, temporary and either views. explained This or section evaluated describes over data the from major tasks in one job in a given cluster. the I/O layer. The storage layer is similar to a federated database. 5. HADOOPDB readSplitMapReduce, and 3) jobsworklist that connect. The toinputSplits tables stored asfunction files in takes HDFS. components of the architecture, starting from the lowest layer. an I/O descriptorAbouzeid et as al. input [4] have (see Sectiondemonstrated 7.1) and HadoopDB returns in an action array run-of It provides an API to access data of different systems including local or distributed file systems (e.g., Hadoop’s HDFS), database In this section, we describe the design of HadoopDB. The goal of split5.2 descriptors.ning HadoopDB’s the following The readSplit two different Componentsfunction application takes types: a split 1) descriptor A semantic 7.1 I/O Layer systems (e.g., DB2, Netezza, HBase), or from streamed sources this design is to achieve all of the properties described in Section 3. as inputHadoopDBweb and application returns extends an iteratorthat provides the over Hadoop the biological split’s framework content. data analysis (see So far, of Fig. these protein 1) by pro- The storage layer is similar to a federated database. Rather than sequences. 2) A classical business data warehouse. like the Web. 
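To make the role of the Database Connector more concrete, the following sketch (in Scala, over plain JDBC) shows how a map task could iterate over the result of a query that has been pushed down to the node-local database, consuming it as key-value pairs just as it would consume an HDFS split. The class, connection parameters and query below are hypothetical illustrations and are not HadoopDB's actual API.

import java.sql.DriverManager

// Illustrative sketch of the Database Connector idea: a map task is handed a
// pushed-down SQL query plus connection parameters and iterates over the
// result set as key-value pairs. Names and parameters are placeholders.
object DatabaseConnectorSketch {

  final case class DbSplit(jdbcUrl: String, user: String, password: String, sql: String)

  // Run the SQL on the node-local database and stream (key, value) pairs to
  // the caller (which would be the map function in a real job).
  def readSplit(split: DbSplit)(emit: (String, String) => Unit): Unit = {
    val conn = DriverManager.getConnection(split.jdbcUrl, split.user, split.password)
    try {
      val rs = conn.createStatement().executeQuery(split.sql)
      val cols = rs.getMetaData.getColumnCount
      while (rs.next()) {
        // The first column acts as the key; the remaining columns form the value.
        val key = rs.getString(1)
        val value = (2 to cols).map(i => rs.getString(i)).mkString("|")
        emit(key, value)
      }
    } finally conn.close()
  }

  def main(args: Array[String]): Unit = {
    val split = DbSplit("jdbc:postgresql://localhost/sales", "hadoopdb", "secret",
      "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
    readSplit(split)((k, v) => println(s"$k -> $v"))
  }
}

The point of the sketch is only the shape of the interface: the database does the heavy relational work, while the MapReduce layer sees an ordinary stream of key-value records.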
4.7 Jaql
Jaql21 is a query language designed for JavaScript Object Notation (JSON)22, a data format that has become popular because of its simplicity and modeling flexibility. JSON is a simple yet flexible way to represent data that ranges from flat, relational data to semi-structured, XML data. Jaql is primarily used to analyze large-scale semi-structured data. It is a functional, declarative query language which rewrites high-level queries, when appropriate, into a low-level query consisting of MapReduce jobs that are evaluated using the Apache Hadoop project. Its core features include user extensibility and parallelism. Jaql consists of a scripting language and compiler as well as a runtime component [16]. It is able to process data with no schema or with only a partial schema; however, Jaql can also exploit rigid schema information when it is available, for both type checking and improved performance.

Jaql uses a very simple data model: a JDM value is either an atom, an array or a record. Most common atomic types are supported, including strings, numbers, nulls and dates. Arrays and records are compound types that can be arbitrarily nested. In more detail, an array is an ordered collection of values and can be used to model data structures such as vectors, lists, sets or bags. A record is an unordered collection of name-value pairs and can model structs, dictionaries and maps. Despite its simplicity, JDM is very flexible. It allows Jaql to operate with a variety of different data representations for both input and output, including delimited text files, JSON files, binary files, Hadoop's SequenceFiles, relational databases, key-value stores and XML documents. Functions are first-class values in Jaql: they can be assigned to a variable and are higher-order in that they can be passed as parameters or used as return values. Functions are the key ingredient for reusability, as any Jaql expression can be encapsulated in a function and a function can be parameterized in powerful ways.

Figure 14: Jaql System Architecture [16].

At a high level, the Jaql architecture depicted in Figure 14 is similar to most database systems. Scripts are passed into the system from the interpreter or an application, compiled by the parser and rewrite engine, and either explained or evaluated over data from the I/O layer. The storage layer is similar to a federated database. It provides an API to access data of different systems, including local or distributed file systems (e.g., Hadoop's HDFS), database systems (e.g., DB2, Netezza, HBase) and streamed sources like the Web. Unlike federated databases, however, most of the accessed data is stored within the same cluster, and the I/O API describes data partitioning, which enables parallelism with data affinity during evaluation. Jaql derives much of this flexibility from Hadoop's I/O API. It reads and writes many common file formats (e.g., delimited files, JSON text, Hadoop SequenceFiles), and custom adapters are easily written to map a data set to or from Jaql's data model. The input can even simply be values constructed in the script itself. The Jaql interpreter evaluates the script locally on the computer that compiled the script, but spawns interpreters on remote nodes using MapReduce. The Jaql compiler automatically detects parallelization opportunities in a Jaql script and translates it to a set of MapReduce jobs.

21 http://code.google.com/p/jaql/
22 http://www.json.org/
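As a rough illustration of how small the JDM data model is, the following Scala sketch models a JDM-like value as an algebraic data type (atoms, arrays and records) and treats a predicate over such values as an ordinary first-class function. The type and field names are illustrative only and do not correspond to Jaql's implementation.

// A minimal, illustrative model of Jaql's JDM: a value is an atom, an array or a record.
sealed trait JdmValue
case object JdmNull extends JdmValue
final case class JdmString(value: String) extends JdmValue
final case class JdmNumber(value: Double) extends JdmValue
final case class JdmArray(items: List[JdmValue]) extends JdmValue          // ordered: lists, bags, vectors
final case class JdmRecord(fields: Map[String, JdmValue]) extends JdmValue // unordered name-value pairs

object JdmExample {
  // An arbitrarily nested record, roughly what a parsed JSON log entry could look like.
  val logEntry: JdmValue = JdmRecord(Map(
    "user"   -> JdmString("alice"),
    "status" -> JdmNumber(200),
    "tags"   -> JdmArray(List(JdmString("web"), JdmString("checkout")))
  ))

  // Predicates and transformations over values are ordinary first-class values,
  // mirroring Jaql's treatment of functions as first-class citizens.
  val isError: JdmValue => Boolean = {
    case JdmRecord(fields) => fields.get("status").exists {
      case JdmNumber(code) => code >= 400
      case _               => false
    }
    case _ => false
  }

  def main(args: Array[String]): Unit =
    println(s"error record? ${isError(logEntry)}")
}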
framework A Jaql and script thus, they may do directly not utilize invoke and MapReduce nor are NameNode maintains metadata about the size and location of Hadoop5.2.2 API’s Catalog so it is unlikely to be understood by the casual jobs using Jaql’s mapReduceFn , but more often, developers use blocks and their replicas. user. The level of abstraction that is needed is comparable to high-level Jaql and depend on the compiler to rewrite the Jaql script The MapReduce Framework follows a simple master-slave ar- directlyThe programming catalog maintains a MapReduce metainformation job. However, about the physical databases. Thisinto one or more MapReduce jobs, as described in Section 7.3.2. chitecture. The master is a single JobTracker and the slaves or transparencyincludes the enabled following: a solution (i) connection to the problem parameters and functions such as database The mapReduceFn function is higher-order; it expects in- worker nodes are TaskTrackers. The JobTracker handles the run- allowedlocation, these driver details class to be and hidden credentials, in the implementation (ii) metadata of such the as dataput/output descriptors, a map function, and an optional reduce time scheduling of MapReduce jobs and maintains information on sets contained in the cluster, replica locations, and data partition- each TaskTracker’s load and available resources. Each job is bro- ing properties. ken down into Map tasks based on the number of data blocks that The current implementation of the HadoopDB catalog stores its require processing, and Reduce tasks. The JobTracker assigns tasks metainformation as an XML file in HDFS. This file is accessed1279 by to TaskTrackers based on locality and load balancing. It achieves the JobTracker and TaskTrackers to retrieve information necessary SCOPE is highly extensible. Users can easily define their own Finally, the output command writes the result to the file functions and implement their own versions of operators: extrac- “qcount.result”. tors (parsing and constructing rows from a file), processors (row- wise processing), reducers (group-wise processing), and combin- In either case, users do not need to implement any operators or ers (combining rows from two inputs). This flexibility greatly wonder how to efficiently execute the query on a large cluster. extends the scope of the language and allows users to solve prob- The SCOPE compiler and optimizer are responsible for translating lems that cannot be easily expressed in traditional SQL. a script into an efficient, parallel execution plan.

SCOPE provides functionality similar to views in SQL. This fea- The rest of the paper is organized as follows. We first give an ture greatly enhances modularity and code reusability. It can also brief overview of the software platform developed at Microsoft be used to restrict access to sensitive data. for storing and analyzing massive data sets in Section 2. We SCOPE supports writing a program using traditional nested SQL present the SCOPE scripting language in more detail in Section 3. expressions or as a series of simple data transformations. The In Section 4, we describe other SCOPE components and show latter style is often preferred by programmers who are used to how a SCOPE script is compiled, optimized, and executed. Expe- thinking of a computation as a series of steps. We illustrate the rimental evaluation using TPC-H queries is provided in Section 5. usage of SCOPE by the following example. We discuss related work in Section 6 and conclude in Section 7. Example 1 : A QCount query computes search query frequencies: 2. PLATFORM OVERVIEW how many times different query strings have occurred. There are several variants of QCount queries, for example, a QCount query SQL-Like MapReduce-Like SCOPE Script may return only SELECTthe top query, N most frequente queries = EXTRACT orquery it may return queries that have occurredCOUNT(*) more AS count than M times.FROM “search.log" Nevertheless, USING LogExtractor; all QCount queries involveFROM "search.log" simple aggregations1 = SELECTover query,a large COUNT(*) data as set, count FROM e GROUP BY query; followed by someUSING filtering LogExtractor conditions. GROUP BY query s2 = SELECT query, count FROM s1 WHERE count > 1000; SCOPE Compiler HAVING count > 1000 SCOPE In this example, we want to find from thes3 =search SELECT query, log countthe popularFROM s2 ORDER BY count DESC; queries that have been requested at least 1000ORDER times. BY count Expressing DESC; Optimizer OUTPUT TO " qcount.result "; SCOPE Runtime this in SCOPE is very easy. OUTPUT s3 TO “ qcount.result "; SELECT query, COUNT(*) AS count Cosmos Execution Environment FROM "search.log"Figure 15:USING Two EquivalentLogExtractor SCOPE Scripts in SQL-like Style and MapReduce-Like Style [27]. GROUP BY query Cosmos Storage System HAVING count > 1000 ORDER BY count DESC; they related to the infrastructure of the framework’s open source OUTPUT TO "qcount.result";implementations such as Hadoop. The select command is similar to SQL’s select command except 5.1 SCOPE that it uses a built-in extractor, LogExtractor, which parses each Cosmos log record and extractsSCOPE (Structured the requested Computations columns. Optimized By default, for Parallel a Execu- Files tion) is a scripting language which is targeted for large-scale data command takes analysisthe output and isof used the dailyprevious for a command variety of data as analysisits input. and data In this case, themining output applications command inside writes Microsoft the result [27]. of SCOPE the select is a declarative to FigureFigure 16: SCOPE/Cosmos 1: Cosmos Software Execution PlatformLayers [27]. the file “qcount.result”.language. It allows users to focus on the data transformations re- Microsoft has developed a distributed computing platform, called quired to solve the problem at hand and hides the complexity of The same query can also be written in SCOPE as a step-by-step Cosmos, for storing and analyzing massive data sets. Cosmos is the underlying platform and implementation details. 
The SCOPE The Cosmos Storage System is an append-only file system that computation. compiler and optimizer are responsible for generating an efficient designedreliably to storesrun on petabytes large clusters of data. Theconsisting system is of optimized thousands for large of com- modity servers. Disk storage is distributed with each server having e = EXTRACT executionquery plan and the runtime for executing the plan with minimal sequential I/O. All writes are append-only and concurrent writers FROM “searchoverhead..log" one orare more serialized direct-attached by the system. disks. Data is distributed and replicated for Like SQL, data is modeled as sets of rows composed of typed USING LogExtractor; High-levelfault tolerance design andobjectives compressed for the to save Cosmos storage platform and increase include: I/O columns. SCOPE is highly extensible. Users can easily define throughput. In Cosmos, an application is modeled as a dataflow s1 = SELECT theirquery, own functionsCOUNT(*) and implementas count their own versions of operators: 1. graph:Availability a directed: Cosmos acyclic graph is resilient (DAG) with to multiple vertices representing hardware fail- FROM e extractors (parsing and constructing rows from a file), processors processesures to avoid and edges whole representing system outages.data flows. File The data runtime is replicated com- (row-wise processing), reducers (group-wise processing) and com- GROUP BY query; ponentmany times of the executionthroughout engine the issystem called theand Job file Manager meta-data which is ma- biners (combining rows from two inputs). This flexibility greatly represents the central and coordinating process for all processing s2 = SELECT extendsquery, the count scope of the language and allows users to solve prob- verticesnaged by within a quorum an application. group of 2 f+1 servers so as to tolerate f FROM s1lems that cannot be easily expressed in traditional SQL. SCOPE failures.The SCOPE scripting language resembles SQL but with C# ex- WHERE countprovides > a functionality1000; which is similar to that of SQL views. This 2. pressions.Reliability Thus,: Cosmos it reduces is the architected learning curve tofor recognize users and eases transient feature enhances modularity and code reusability. It is also used to thehardware porting ofconditions existing SQL to scriptsavoid into corrupting SCOPE. Moreover, the system. SCOPE System s3 = SELECT restrictquery, access count to sensitive data. SCOPE supports writing a pro- FROM s2 expressionscomponents can are use C# check-summed libraries where custom end-to-end C# classes and can apply com- me- gram using traditional SQL expressions or as a series of simple pute functions of scalar values, or manipulate whole rowsets. A ORDER BYdata count transformations. DESC; Figure 15 illustrates two equivalent scripts chanisms to crash faulty components. The on-disk data is pe- SCOPEriodically script scrubbed consists ofto adetect sequence corrupt of commands or bit whichrot data are before data it OUTPUT s3 TOin “ theqcount.result two different styles"; (SQL-like and MapReduce-Like) to find transformation operators that take one or more rowsets as input, from the search log the popular queries that have been requested performis used by some the operation system. on the data and output a rowset. Every The script is alsoat least easy 1000 to understand. times. In the MapReduce-Like The extract command style, the ex-EXTRACT 3. 
rowsetScalability has a: well-defined Cosmos is schema designed to which from all the its ground rows must up ad- to be a tracts all query commandstring from extracts the log all query file. stringThe first from select the log command file. The first SE- here.scalable The system, SCOPE compilercapable parsesof storing the script, and processing checks the syntax petabytes counts the numberLECT ofcommand occurrences counts the of number each of query occurrences string. of The each query andof data. resolves Storage names. and The compute result of thecapacity compilation is easily is an increased internal by string. The second SELECT command retains only rows with a parseadding tree more which servers is then to translated the cluster. to a physical execution plan. A second select commandcount greater retains than 1000. only The rows third withSELECT a countcommand greater sorts the 4. physicalPerformance execution: Cosmos plan is a specification runs on clusters of Cosmos comprised job which of de- thou- than 1000. Therows third on selectcount. Finally, command the OUTPUT sorts thecommand rows onwrites count. the result to scribes a data flow DAG where each vertex is a program and each the file "qcount.result". edgesands represents of individual a data channel. servers. The Data translation is distributed into an execution among the Microsoft has developed a distributed computing platform, called plan is performed by traversing the parse tree in a bottom-up man- Cosmos, for storing and analyzing massive data sets. Cosmos is de- ner. For each operator, SCOPE has an associated default imple- signed to run on large clusters consisting of thousands of commod- mentation rules. Many of the traditional optimization rules from ity servers. Figure 16 shows the main components of the Cosmos database systems are clearly also applicable in this new context, platform which is described as follows: for example, removing unnecessary columns, pushing down selec- • Cosmos Storage: A distributed storage subsystem designed tion predicates and pre-aggregating when possible. However, the to reliably and efficiently store extremely large sequential highly distributed execution environment offers new opportunities files. and challenges, making it necessary to explicitly consider the ef- • Cosmos Execution Environment: An environment for deploy- fects of large-scale parallelism during optimization. For example, ing, executing and debugging distributed applications. choosing the right partition scheme and deciding when to partition • SCOPE: A high-level scripting language for writing data anal- are crucial for finding an optimal plan. It is also important to cor- ysis jobs. The SCOPE compiler and optimizer translate scripts rectly reason about partitioning, grouping and sorting properties to efficient parallel execution plans. and their interaction, to avoid unnecessary computations [134]. 5.2 Dryad/DryadLinq 5.3 Spark Dryad is a general-purpose distributed execution engine intro- The Spark system [131, 132] have been proposed to support the duced by Microsoft for coarse-grain data-parallel applications [65]. applications which need to reuse a working set of data across mul- A Dryad application combines computational vertices with com- tiple parallel operations (e.g. iterative machine learning algorithms munication channels to form a dataflow graph. 
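Returning to the QCount example of Figure 15, the step-by-step style can be mimicked over an in-memory collection. The following Scala sketch only spells out the semantics of the script (group, count, filter, sort); it says nothing about how SCOPE or Cosmos actually parallelize the computation, and the record type is an assumption made for the example.

object QCountSketch {
  // Each log record is assumed to carry the search query string; in the real
  // script this extraction is performed by the LogExtractor.
  final case class LogRecord(query: String)

  def qcount(log: Seq[LogRecord], threshold: Int = 1000): Seq[(String, Int)] =
    log
      .groupBy(_.query)                                  // GROUP BY query
      .map { case (q, recs) => (q, recs.size) }          // COUNT(*) AS count
      .filter { case (_, count) => count > threshold }   // HAVING count > threshold
      .toSeq
      .sortBy { case (_, count) => -count }              // ORDER BY count DESC

  def main(args: Array[String]): Unit = {
    val log = Seq("weather", "news", "weather", "maps", "weather").map(LogRecord(_))
    qcount(log, threshold = 1).foreach { case (q, c) => println(s"$q\t$c") }
  }
}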
5.2 Dryad/DryadLINQ
Dryad is a general-purpose distributed execution engine introduced by Microsoft for coarse-grain data-parallel applications [65]. A Dryad application combines computational vertices with communication channels to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes and shared-memory FIFOs. The Dryad system allows the developer fine control over the communication graph as well as the subroutines that live at its vertices. A Dryad application developer can specify an arbitrary directed acyclic graph to describe the application's communication patterns and express the data transport mechanisms (files, TCP pipes and shared-memory FIFOs) between the computation vertices. This direct specification of the graph gives the developer greater flexibility to easily compose basic common operations, leading to a distributed analogue of piping together traditional Unix utilities such as grep, sort and head. Dryad is notable for allowing graph vertices to use an arbitrary number of inputs and outputs. The overall structure of a Dryad job is determined by its communication flow. A job is a directed acyclic graph where each vertex is a program and edges represent data channels; it is a logical computation graph that is automatically mapped onto physical resources by the runtime. At run time, each channel is used to transport a finite sequence of structured items. A Dryad job is coordinated by a process called the job manager that runs either within the cluster or on a user's workstation with network access to the cluster. The job manager contains the application-specific code to construct the job's communication graph along with library code to schedule the work across the available resources. All data is sent directly between vertices, and thus the job manager is only responsible for control decisions and is not a bottleneck for any data transfers. Much of the simplicity of the Dryad scheduler and fault-tolerance model comes from the assumption that vertices are deterministic.

Dryad has its own high-level language called DryadLINQ [128]. It generalizes execution environments such as SQL and MapReduce in two ways: 1) adopting an expressive data model of strongly typed .NET objects, and 2) supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. DryadLINQ23 exploits LINQ (Language INtegrated Query24, a set of .NET constructs for programming with datasets) to provide a powerful hybrid of declarative and imperative programming. The system is designed to provide flexible and efficient distributed computation in any LINQ-enabled programming language, including C#, VB and F#25. Objects in DryadLINQ datasets can be of any .NET type, making it easy to compute with data such as image patches, vectors and matrices. In practice, a DryadLINQ program is a sequential program composed of LINQ expressions that perform arbitrary side-effect-free transformations on datasets, and it can be written and debugged using standard .NET development tools. The DryadLINQ system automatically translates the data-parallel portions of the program into a distributed execution plan which is then passed to the Dryad execution platform. A commercial implementation of Dryad and DryadLINQ was released in 2011 under the name LINQ to HPC26.

5.3 Spark
The Spark system [131, 132] has been proposed to support applications which need to reuse a working set of data across multiple parallel operations (e.g., iterative machine learning algorithms and interactive data analytics) while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Therefore, users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs do not need to be materialized at all times. RDDs achieve fault tolerance through a notion of lineage: each RDD object contains a pointer to its parent and information about how the parent was transformed. Hence, if a partition of an RDD is lost, the RDD has sufficient information about how it was derived from other RDDs to rebuild just that partition.

Spark is implemented in the Scala programming language27 [99]. It is built on top of Mesos [62], a cluster operating system that lets multiple parallel frameworks share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. Mesos provides isolation and efficient resource sharing across frameworks running on the same cluster while giving each framework freedom to implement its own programming model and fully control the execution of its jobs. Mesos uses two main abstractions: tasks and slots. A task represents a unit of work, while a slot represents a computing resource in which a framework may run a task, such as a core and some associated memory on a multicore machine. Mesos employs a two-level scheduling mechanism: at the first level, it allocates slots between frameworks using fair sharing; at the second level, each framework is responsible for dividing its work into tasks and selecting which tasks to run in each slot. This lets frameworks perform application-specific optimizations. For example, Spark's scheduler tries to send each task to one of its preferred locations using a technique called delay scheduling [130].

To use Spark, developers need to write a driver program that implements the high-level control flow of their application and launches various operations in parallel. Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset). In particular, each RDD is represented by a Scala object which can be constructed in different ways:
• From a file in a shared file system (e.g., HDFS).
• By parallelizing a Scala collection (e.g., an array) in the driver program, which means dividing it into a number of slices that will be sent to multiple nodes.
• By transforming an existing RDD. A dataset with elements of type A can be transformed into a dataset with elements of type B using an operation called flatMap.
• By changing the persistence of an existing RDD. A user can alter the persistence of an RDD through two actions:
  – The cache action leaves the dataset lazy but hints that it should be kept in memory after the first time it is computed because it will be reused.
  – The save action evaluates the dataset and writes it to a distributed filesystem such as HDFS. The saved version is used in future operations on it.
Different parallel operations can be performed on RDDs:
• The reduce operation, which combines dataset elements using an associative function to produce a result at the driver program.
• The collect operation, which sends all elements of the dataset to the driver program.
• The foreach operation, which passes each element through a user-provided function.
Spark does not currently support a grouped reduce operation as in MapReduce; the results of reduce operations are only collected at the driver process.

23 http://research.microsoft.com/en-us/projects/dryadlinq/
24 http://msdn.microsoft.com/en-us/netframework/aa904594.aspx
25 http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/
26 http://msdn.microsoft.com/en-us/library/hh378101.aspx
27 http://www.scala-lang.org/
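A minimal driver program in the style just described could look as follows. It assumes a Spark-like API with textFile, flatMap, cache, map, reduce, filter and collect; the master URL, application name and input path are placeholders, and the exact API of the Spark versions surveyed here may differ.

import org.apache.spark.SparkContext

object SparkDriverSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and application name.
    val sc = new SparkContext("local[2]", "driver-sketch")

    // An RDD constructed from a file in a shared file system, transformed with
    // flatMap and cached because it will be reused by several operations.
    val queries = sc.textFile("hdfs:///logs/search.log")
                    .flatMap(line => line.split("\t"))
                    .cache()

    // reduce: combine dataset elements with an associative function and
    // return a single result to the driver program.
    val totalChars = queries.map(_.length).reduce(_ + _)

    // collect: bring the (hopefully small) set of matching elements back
    // to the driver program.
    val weather = queries.filter(_ == "weather").collect()

    println(s"total characters: $totalChars, 'weather' occurrences: ${weather.length}")
    sc.stop()
  }
}

Because the intermediate RDD is cached, the second operation reuses the in-memory working set instead of re-reading the file, which is exactly the reuse pattern the RDD abstraction is designed for.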
5.4 Nephele/PACT
The Nephele/PACT system [13, 8] has been presented as a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele. The PACT programming model is a generalization of map/reduce, as it is based on a key/value data model and the concept of Parallelization Contracts. A PACT consists of exactly one second-order function, which is called the Input Contract, and an optional Output Contract. An Input Contract takes a first-order function with task-specific user code and one or more data sets as input parameters, and it invokes its associated first-order function on independent subsets of its input data in a data-parallel fashion. In this context, the two functions of map and reduce are just examples of Input Contracts. Other examples of Input Contracts include:
• The Cross contract, which operates on multiple inputs and builds a distributed Cartesian product over its input sets.
• The CoGroup contract, which partitions each of its multiple inputs along the key. Independent subsets are built by combining equal keys of all inputs.
• The Match contract, which operates on multiple inputs and matches key/value pairs from all input data sets with the same key (equivalent to the inner join operation).
An Output Contract is an optional component of a PACT and gives guarantees about the data that is generated by the assigned user function. The set of Output Contracts includes:
• The Same-Key contract, where each key/value pair that is generated by the function has the same key as the key/value pair(s) from which it was generated. This means the function will preserve any partitioning and order property on the keys.
• The Super-Key contract, where each key/value pair that is generated by the function has a superkey of the key/value pair(s) from which it was generated. This means the function will preserve the partitioning and the partial order on the keys.
• The Unique-Key contract, where each key/value pair that is produced has a unique key. The key must be unique across all parallel instances, so any produced data is partitioned and grouped by the key.
• The Partitioned-by-Key contract, where key/value pairs are partitioned by key. This contract has similar implications to the Super-Key contract, specifically that a partitioning by the keys is given, but there is no order inside the partitions.

Figure 17: The Nephele/PACT System Architecture [8].

Figure 17 illustrates the system architecture of Nephele/PACT, where a PACT program is submitted to the PACT Compiler, which translates the program into a dataflow execution plan that is then handed to the Nephele system for parallel execution. The Hadoop Distributed File System (HDFS) is used for storing both the input and the output data. Due to the declarative character of the PACT programming model, the PACT compiler can apply different optimization mechanisms and select from several execution plans with varying costs for a single PACT program. For example, the Match contract can be satisfied using either a repartition strategy, which partitions all inputs by keys, or a broadcast strategy, which fully replicates one input to every partition of the other input. Choosing the right strategy can dramatically reduce network traffic and execution time. Therefore, the PACT compiler applies standard SQL optimization techniques [112], exploiting information provided by the Output Contracts and applying different cost-based optimization techniques. In particular, the optimizer generates a set of candidate execution plans in a bottom-up fashion (starting from the data sources), where the more expensive plans are pruned using a set of interesting properties for the operators. These properties are also used to spare plans from pruning that come with an additional property that may amortize their cost overhead later.
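To make the notion of a second-order Input Contract concrete, the following Scala sketch expresses a Match-like contract as a second-order function: it pairs the key/value pairs of two inputs that share the same key and invokes a user-supplied first-order function on each matching pair. This is a single-machine illustration of the programming model under assumed names, not Nephele's or the PACT compiler's actual API.

object PactSketch {
  type KV[K, V] = Seq[(K, V)]

  // A Match-like Input Contract: a second-order function that takes the user's
  // first-order code and applies it to every pair of values from the two inputs
  // that share the same key (an inner join on the key).
  def matchContract[K, V1, V2, Out](left: KV[K, V1], right: KV[K, V2])
                                   (udf: (K, V1, V2) => Out): Seq[Out] = {
    val rightByKey = right.groupBy(_._1)
    for {
      (k, v1) <- left
      (_, v2) <- rightByKey.getOrElse(k, Seq.empty)
    } yield udf(k, v1, v2)
  }

  def main(args: Array[String]): Unit = {
    val orders    = Seq(("c1", 30.0), ("c2", 12.5), ("c1", 7.5))
    val customers = Seq(("c1", "Alice"), ("c2", "Bob"))
    // The user code only sees one matching pair at a time; whether the inputs
    // are repartitioned or broadcast is left to the (real) compiler and runtime.
    val lines = matchContract(orders, customers) { (id, amount, name) =>
      s"$name ($id) spent $amount"
    }
    lines.foreach(println)
  }
}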
5.5 BOOM Analytics
BOOM Analytics (Berkeley Orders of Magnitude) [9] is an API-compliant reimplementation of the HDFS distributed file system (BOOM-FS) and of the Hadoop MapReduce engine (BOOM-MR). The implementation of BOOM Analytics uses the Overlog logic language [86], which was originally presented as an event-driven language and has evolved a semantics more carefully grounded in Datalog, the standard deductive query language from database theory [119]. In general, the Datalog language is defined over relational tables as a purely logical query language that makes no changes to the stored tables. Overlog extends Datalog in three main features [32]:
1. It adds notation to specify the location of data.
2. It provides some SQL-style extensions such as primary keys and aggregation.
3. It defines a model for processing and generating changes to tables.
When Overlog tuples arrive at a node, either through rule evaluation or external events, they are handled in an atomic local Datalog timestep. Within a timestep, each node sees only locally stored tuples. Communication between Datalog and the rest of the system (Java code, networks and clocks) is modeled using events corresponding to insertions or deletions of tuples in Datalog tables. BOOM Analytics uses a Java-based Overlog runtime called JOL which compiles Overlog programs into pipelined dataflow graphs of operators. In particular, JOL provides metaprogramming support where each Overlog program is compiled into a representation that is captured in rows of tables. In BOOM Analytics, everything is data. This includes traditional persistent information like file system metadata, runtime state like TaskTracker status, summary statistics like those used by the JobTracker's scheduling policy, communication messages, system events and the execution state of the system.

The BOOM-FS component represents the file system metadata as a collection of relations (file, fqpath, fchunk, datanode, hbchunk), where file system operations are implemented by writing queries over these tables. The file relation contains a row for each file or directory stored in BOOM-FS. The set of chunks in a file is identified by the corresponding rows in the fchunk relation. The datanode and hbchunk relations contain the set of live DataNodes and the chunks stored by each DataNode, respectively. The NameNode updates these relations as new heartbeats arrive. If the NameNode does not receive a heartbeat from a DataNode within a configurable amount of time, it assumes that the DataNode has failed and removes the corresponding rows from these tables. Since a file system is naturally hierarchical, the file system queries that need to traverse it are recursive. Therefore, the parent-child relationship of files is used to compute the transitive closure of each file and store its fully qualified path in the fqpath relation. Because path information is accessed frequently, the fqpath relation is configured to be cached after it is computed. Overlog will automatically update fqpath when a file is changed, using standard relational view maintenance logic [119]. BOOM-FS also defines several views to compute derived file system metadata, such as the total size of each file and the contents of each directory. The materialization of each view can be changed via simple Overlog table definition statements without altering the semantics of the program. In general, HDFS uses three different communication protocols: the metadata protocol, which is used by clients and NameNodes to exchange file metadata; the heartbeat protocol, which is used by the DataNodes to notify the NameNode about chunk locations and DataNode liveness; and the data protocol, which is used by the clients and DataNodes to exchange chunks. BOOM-FS re-implements these three protocols using a set of Overlog rules. BOOM-FS also achieves its high-availability failover mechanism by using Overlog to implement the hot standby NameNodes feature with Lamport's Paxos algorithm [80].

BOOM-MR re-implements the MapReduce framework by replacing Hadoop's core scheduling logic with Overlog. The JobTracker tracks the ongoing status of the system and transient state, in the form of messages sent and received by the JobTracker, by capturing this information in four Overlog tables: job, task, taskAttempt and taskTracker. The job relation contains a single row for each job submitted to the JobTracker. The task relation identifies each task within a job; the attributes of this relation identify the task type (map or reduce), the input partition (a chunk for map tasks, a bucket for reduce tasks) and the current running status. The taskAttempt relation maintains the state of each task attempt (a task may be attempted more than once due to speculation or if the initial execution attempt failed). The taskTracker relation identifies each TaskTracker in the cluster with a unique name. Overlog rules are used to update the JobTracker's tables by converting inbound messages into tuples of the four Overlog tables. Scheduling decisions are encoded in the taskAttempt table, which assigns tasks to TaskTrackers. A scheduling policy is simply a set of rules that join against the taskTracker relation to find TaskTrackers with unassigned slots and schedule tasks by inserting tuples into taskAttempt. This architecture allows new scheduling policies to be defined easily.
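The claim that a scheduling policy is "simply a set of rules that join against the taskTracker relation" can be illustrated with a small relational-style sketch in Scala. The row types below loosely mirror the Overlog tables described above but are otherwise hypothetical, and a real Overlog policy would be stated declaratively rather than as the explicit matching shown here.

object SchedulingAsJoinSketch {
  // Simplified rows of the task and taskTracker relations.
  final case class Task(jobId: Int, taskId: Int, kind: String)           // pending tasks
  final case class Tracker(name: String, freeSlots: Int)                 // TaskTracker status rows
  final case class Assignment(jobId: Int, taskId: Int, tracker: String)  // tuples inserted into taskAttempt

  // A "scheduling policy" as a join-like matching: expand each tracker into one
  // row per unassigned slot, then pair pending tasks with those slot rows.
  def schedule(pending: List[Task], trackers: List[Tracker]): List[Assignment] = {
    val freeSlots = trackers.flatMap(t => List.fill(t.freeSlots)(t.name))
    pending.zip(freeSlots).map { case (task, tracker) =>
      Assignment(task.jobId, task.taskId, tracker)
    }
  }

  def main(args: Array[String]): Unit = {
    val pending  = List(Task(1, 1, "map"), Task(1, 2, "map"), Task(1, 3, "reduce"))
    val trackers = List(Tracker("tt-a", 1), Tracker("tt-b", 2))
    schedule(pending, trackers).foreach(println)
  }
}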
5.6 Hyracks/ASTERIX
Hyracks is presented as a partitioned-parallel dataflow execution platform that runs on shared-nothing clusters of computers [21]. Large collections of data items are stored as local partitions distributed across the nodes of the cluster. A Hyracks job is submitted by a client and processes one or more collections of data to produce one or more output collections (partitions). Hyracks provides a programming model and an accompanying infrastructure to efficiently divide computations on large data collections (spanning multiple machines) into computations that work on each partition of the data separately. Every Hyracks cluster is managed by a Cluster Controller process. The Cluster Controller accepts job execution requests from clients, plans their evaluation strategies and then schedules the jobs' tasks to run on selected machines in the cluster. In addition, it is responsible for monitoring the state of the cluster to keep track of the resource loads at the various worker machines, and for re-planning and re-executing some or all of the tasks of a job in the event of a failure. On the task execution side, each worker machine that participates in a Hyracks cluster runs a Node Controller process. The Node Controller accepts task execution requests from the Cluster Controller and also reports on its health via a heartbeat mechanism.

In principle, Hyracks has been designed with the goal of being a runtime platform where users can create their jobs and of serving as an efficient target for the compilers of higher-level programming languages such as Pig, Hive or Jaql. The ASTERIX project [14, 20] uses this feature with the aim of building a scalable information management system that supports the storage, querying and analysis of large collections of semi-structured nested data objects. ASTERIX data storage and query processing are based on its own semistructured model called the ASTERIX Data Model (ADM). Each individual ADM data instance is typed and self-describing. All data instances live in datasets (the ASTERIX analogy to tables), and datasets can be indexed, partitioned and possibly replicated to achieve the scalability and availability goals. External datasets which reside in files that are not under ASTERIX control are also supported. An instance of the ASTERIX data model can either be a primitive type (e.g., integer, string, time) or a derived type, which may include:
• Enum: an enumeration type whose domain is defined by listing the sequence of possible values.
• Record: a set of fields where each field is described by its name and type. A record can be either an open record, which contains fields that are not part of the type definition, or a closed record, which cannot.
• Ordered list: a sequence of values for which the order is determined by creation or insertion.
• Unordered list: an unordered sequence of values, similar to bags in SQL.
• Union: describes a choice between a finite set of types.
A dataset is a target for AQL queries and updates and is also the attachment point for indexes. A collection of datasets related to an application is grouped into a namespace called a dataverse, which is analogous to a database in the relational world. In particular, data is accessed and manipulated through the ASTERIX Query Language (AQL), which is designed to cleanly match and handle the data structuring constructs of ADM. It borrows from XQuery and Jaql their programmer-friendly declarative syntax for describing bulk operations such as iteration, filtering and sorting; therefore, AQL is comparable to those languages in terms of expressive power. The major difference with respect to XQuery is AQL's focus on data-centric use cases at the expense of built-in support for mixed content in document-centric use cases; in ASTERIX, there is no notion of document order or node identity for data instances. Differences between AQL and Jaql stem from the usage of the languages: while ASTERIX data is stored in and managed by the ASTERIX system, Jaql runs against data that is stored externally in Hadoop files or in the local file system.

Figure 18: The ASTERIX System Architecture [14].

Figure 18 presents an overview of the ASTERIX system architecture. AQL requests are compiled into jobs for an ASTERIX execution layer, Hyracks. ASTERIX concerns itself with the data details of AQL and ADM, turning AQL requests into Hyracks jobs, while Hyracks determines and oversees the utilization of parallelism based on information and constraints associated with the resulting jobs' operators as well as on the runtime state of the cluster.

6. CONCLUSIONS
The database community has always focused on dealing with the challenges of Big Data management, although the meaning of "Big" has been evolving continuously to represent different scales over time [22]. According to IBM, we are currently creating 2.5 quintillion bytes of data every day. This data comes from many different sources and in many different formats, including digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records and cell phone GPS signals. This is a new scale of Big Data which is attracting huge interest from both the industrial and research communities, with the aim of creating the best means to process and analyze this data in order to make the best use of it. In the last decade, the MapReduce framework has emerged as a popular mechanism to harness the power of large clusters of computers. It allows programmers to think in a data-centric fashion where they can focus on applying transformations to sets of data records, while the details of distributed execution and fault tolerance are transparently managed by the MapReduce framework.

In this article, we presented a survey of the MapReduce family of approaches for developing scalable data processing systems and solutions. In general, we noticed that the MapReduce framework, and its open source implementation Hadoop, are now considered sufficiently mature that they are widely used for developing many solutions by academia and industry in different application domains. Nevertheless, we believe that it is unlikely that MapReduce will substitute database systems, even for data warehousing applications; we expect that they will always coexist and complement each other in different scenarios. We are also convinced that there is still room for further optimization and advancement in different directions on the spectrum of the MapReduce framework, which is required to bring forward the vision of providing large scale data analysis as a commodity for novice end-users. For example, energy efficiency in MapReduce is an important problem which has not yet attracted enough attention from the research community. The traditional challenge of debugging large scale computations on distributed systems has also not been considered a research priority by the MapReduce research community. Related to the expressive power of the programming model, we feel that this is an area that requires more investigation. We also noticed that the over-simplicity of the MapReduce programming model has raised key challenges in dealing with complex data models (e.g., nested models, XML and hierarchical models, RDF and graphs) efficiently. This limitation has called for a next generation of big data architectures and systems that can provide the required scale and performance attributes for these domains. For example, Google has created the Dremel system [91], commercialized under the name BigQuery28, to support interactive analysis of nested data. Google has also presented the Pregel system [89], open sourced by the Apache Giraph and Apache Hama projects, which uses a BSP-based programming model for efficient and scalable processing of massive graphs on distributed clusters of commodity machines. Recently, Twitter has announced the release of the Storm29 system as a distributed and fault-tolerant platform for implementing continuous and realtime processing applications of streamed data. We believe that more of these domain-specific systems will be introduced in the future to form the new generation of big data systems. Defining the right and most convenient programming abstractions and declarative interfaces for these domain-specific Big Data systems is another important research direction that will need to be deeply investigated.

28 https://developers.google.com/bigquery/
29 https://github.com/nathanmarz/storm/

7. REFERENCES
[1] Large synoptic survey. http://www.lsst.org/.
[2] Daniel J. Abadi, Adam Marcus, Samuel Madden, and Kate Hollenbach. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J., 18(2):385-406, 2009.
[3] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922-933, 2009.
[4] A. Abouzeid, K. Bajda-Pawlikowski, J. Huang, D. Abadi, and A. Silberschatz. HadoopDB in Action: Building Real World Applications. In SIGMOD, 2010.
[5] F. Afrati and J. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010.
[6] Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya G. Parameswaran, and Jeffrey D. Ullman. Fuzzy Joins Using MapReduce. In ICDE, pages 498-509, 2012.
[7] Foto N. Afrati and Jeffrey D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE TKDE, 23(9):1282-1298, 2011.
[8] Alexander Alexandrov, Dominic Battré, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. Massively Parallel Data Analysis with PACTs on Nephele. PVLDB, 3(2):1625-1628, 2010.
[9] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell Sears. Boom analytics: exploring data-centric, declarative programming for the cloud. In EuroSys, pages 223-236, 2010.
We expect that they will always coexist and complement analytics: exploring data-centric, declarative programming each others in different scenarios. We are also convinced that there for the cloud. In EuroSys, pages 223–236, 2010. is still room for further optimization and advancement in different [10] M. Armbrust, A. Fox, G. Rean, A. Joseph, R. Katz, directions on the spectrum of the MapReduce framework that is A. Konwinski, L. Gunho, P. David, A. Rabkin, I. Stoica, required to bring forward the vision of providing large scale data analysis as a commodity for novice end-users. For example, en- 28https://developers.google.com/bigquery/ ergy efficiency in the MapReduce is an important problem which 29https://github.com/nathanmarz/storm/ and M. Zaharia. Above the clouds: A berkeley view of pages 302–319, 2009. cloud computing, Feb 2009. [27] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, [11] Shivnath Babu. Towards automatic optimization of S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel MapReduce programs. In SoCC, pages 137–142, 2010. processing of massive data sets. PVLDB, 1(2):1265–1276, [12] Andrey Balmin, Tim Kaldewey, and Sandeep Tata. 2008. Clydesdale: structured data processing on hadoop. In [28] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen SIGMOD Conference, pages 705–708, 2012. Adams, Robert R. Henry, Robert Bradshaw, and Nathan [13] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Weizenbaum. FlumeJava: easy, efficient data-parallel Volker Markl, and Daniel Warneke. Nephele/PACTs: a pipelines. In PLDI, pages 363–375, 2010. programming model and execution framework for [29] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, web-scale analytical processing. In SoCC, pages 119–130, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: 2010. A distributed storage system for structured data. ACM [14] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Trans. Comput. Syst., 26(2), 2008. Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin [30] Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Deutsch, Yannis Papakonstantinou, and Vassilis J. Tsotras. Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee ASTERIX: towards a scalable, semistructured data platform Kwon, and Michael Wong. Tenzing A SQL Implementation for evolving-world models. Distributed and Parallel On The MapReduce Framework. PVLDB, Databases, 29(3):185–216, 2011. 4(12):1318–1327, 2011. [15] Gordon Bell, Jim Gray, and Alexander S. Szalay. Petascale [31] R. Chen, X. Weng, B. He, and M. Yang. Large graph Computational Systems. IEEE Computer, 39(1):110–112, processing in the cloud. In SIGMOD, pages 1123–1126, 2006. 2010. [16] Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey [32] Tyson Condie, David Chu, Joseph M. Hellerstein, and Balmin, Mohamed Y. Eltabakh, Carl-Christian Kanne, Petros Maniatis. Evita raced: metacompilation for Fatma Özcan, and Eugene J. Shekita. Jaql: A Scripting declarative networks. PVLDB, 1(1):1153–1165, 2008. Language for Large Scale Semistructured Data Analysis. [33] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. PVLDB, 4(12):1272–1283, 2011. Hellerstein, Khaled Elmeleegy, and Russell Sears. [17] Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, MapReduce Online. In NSDI, pages 313–328, 2010. Umut A. Acar, and Rafael Pasquini. Incoop: MapReduce [34] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. for Incremental Computations. In SOCC, 2011. Hellerstein, John Gerth, Justin Talbot, Khaled Elmeleegy, [18] S. Blanas, J. Patel, V. 
Ercegovac, J. Rao, E. Shekita, and and Russell Sears. Online aggregation and continuous Y. Tian. A comparison of join algorithms for log processing query support in MapReduce. In SIGMOD Conference, in mapreduce. In SIGMOD, pages 975–986, 2010. pages 1115–1118, 2010. [19] Scott Boag, Don Chamberlin, Mary F. Fernández, Daniela [35] Robson Leonardo Ferreira Cordeiro, Caetano Traina Jr., Florescu, Jonathan Robie, and Jérôme Siméon. XQuery 1.0: Agma Juci Machado Traina, Julio López, U. Kang, and An XML Query Language, December 2010. Christos Faloutsos. Clustering very large multi-dimensional http://www.w3.org/TR/xquery. datasets with MapReduce. In KDD, pages 690–698, 2011. [20] Vinayak Borkar, Sattam Alsubaiee, Yasser Altowim, [36] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and Hotham Altwaijry, Alexander Behm, Yingyi Bu, Michael J. McPherson. Ricardo: integrating R and Hadoop. In Carey, Raman Grover, Zachary Heilbron, Young-Seok Kim, SIGMOD, pages 987–998, 2010. Chen Li, Pouria Pirzadeh, Nicola Onose, Rares Vernica, and [37] J. Dean and S. Ghemawat. MapReduce: Simplified Data Jian Wen. ASTERIX: An Open Source System for ¸SBig Processing on Large Clusters. In OSDI, pages 137–150, DataTˇ Management and Analysis. PVLDB, 5(2), 2012. 2004. [21] Vinayak R. Borkar, Michael J. Carey, Raman Grover, [38] J. Dean and S. Ghemawat. MapReduce: simplified data Nicola Onose, and Rares Vernica. Hyracks: A flexible and processing on large clusters. Commun. ACM, extensible foundation for data-intensive computing. In 51(1):107–113, 2008. ICDE, pages 1151–1162, 2011. [39] J. Dean and S. Ghemawat. MapReduce: a flexible data [22] Vinayak R. Borkar, Michael J. Carey, and Chen Li. Inside processing tool. Commun. ACM, 53(1):72–77, 2010. "Big Data management": ogres, onions, or parfaits? In [40] David J. DeWitt and Jim Gray. Parallel Database Systems: EDBT, pages 3–14, 2012. The Future of High Performance Database Systems. [23] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Commun. ACM, 35(6):85–98, 1992. Maler, and François Yergeau. Extensible Markup Language [41] J. Dittrich, J. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, (XML) 1.0 (Fifth Edition)), November 2008. and J. Schad. Hadoop++: Making a Yellow Elephant Run http://www.w3.org/TR/REC-xml/. Like a Cheetah (Without It Even Noticing). PVLDB, [24] Y. Bu, B. Howe, M. Balazinska, and M. Ernst. HaLoop: 3(1):518–529, 2010. Efficient Iterative Data Processing on Large Clusters. [42] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina PVLDB, 3(1):285–296, 2010. Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. [25] Michael J. Cafarella and Christopher Ré. Manimal: Twister: a runtime for iterative MapReduce. In HPDC, Relational Optimization for Data-Intensive Programs. In pages 810–818, 2010. WebDB, 2010. [43] Iman Elghandour and Ashraf Aboulnaga. ReStore: Reusing [26] A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on Results of MapReduce Jobs. PVLDB, 5(6):586–597, 2012. Processing Spatial Data with MapReduce. In SSDBM, [44] Iman Elghandour and Ashraf Aboulnaga. ReStore: reusing results of MapReduce jobs in pig. In SIGMOD Conference, Ion Stoica. A common substrate for cluster computing. In pages 701–704, 2012. HotCloud, USENIX Workshop, 2009. [45] Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, [63] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable Rainer Gemulla, Aljoscha Krettek, and John McPherson. SPARQL Querying of Large RDF Graphs. PVLDB, CoHadoop: Flexible Data Placement and Its Exploitation in 4(11):1123–1134, 2011. Hadoop. 
PVLDB, 4(9):575–585, 2011. [64] Mohammad Farhan Husain, James P. McGlothlin, [46] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast Mohammad M. Masud, Latifur R. Khan, and Bhavani M. clustering using MapReduce. In KDD, pages 681–689, Thuraisingham. Heuristics-Based Query Processing for 2011. Large RDF Graphs Using Cloud Computing. IEEE TKDE, [47] Leonidas Fegaras, Chengkai Li, Upa Gupta, and Jijo Philip. 23(9):1312–1327, 2011. XML Query Optimization in Map-Reduce. In WebDB, [65] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. 2011. Dryad: distributed data-parallel programs from sequential [48] Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and building blocks. In EuroSys, pages 59–72, 2007. Sandeep Tata. Column-Oriented Storage Techniques for [66] Eaman Jahani, Michael J. Cafarella, and Christopher Ré. MapReduce. PVLDB, 4(7):419–429, 2011. Automatic Optimization for MapReduce Programs. [49] E. Friedman, P. Pawlowski, and J. Cieslewicz. PVLDB, 4(6):385–396, 2011. SQL/MapReduce: A practical approach to self-describing, [67] D. Jiang, B. Chin Ooi, L. Shi, and S. Wu. The Performance polymorphic, and parallelizable user-defined functions. of MapReduce: An In-depth Study. PVLDB, 3(1):472–483, PVLDB, 2(2):1402–1413, 2009. 2010. [50] A. Gates, O. Natkovich, S. Chopra, P. Kamath, [68] David Jiang, Anthony K. H. Tung, and Gang Chen. S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and MAP-JOIN-REDUCE: Toward Scalable and Efficient Data U. Srivastava. Building a HighLevel Dataflow System on Analysis on Large Clusters. IEEE TKDE, 23(9):1299–1311, top of MapReduce: The Pig Experience. PVLDB, 2011. 2(2):1414–1425, 2009. [69] Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, and Jens [51] Alan Gates. Programming Pig. O’Reilly Media, 2011. Dittrich. Trojan Data Layouts: Right Shoes for a Running [52] S. Ghemawat, H. Gobioff, and S. Leung. The google file Elephant. In SoCC, 2011. system. In SOSP, pages 29–43, 2003. [70] Tim Kaldewey, Eugene J. Shekita, and Sandeep Tata. [53] Amol Ghoting, Prabhanjan Kambadur, Edwin P. D. Clydesdale: structured data processing on MapReduce. In Pednault, and Ramakrishnan Kannan. NIMBLE: a toolkit EDBT, pages 15–25, 2012. for the implementation of parallel data mining and machine [71] U. Kang, Brendan Meeder, and Christos Faloutsos. Spectral learning algorithms on mapreduce. In KDD, pages Analysis for Billion-Scale Graphs: Discoveries and 334–342, 2011. Implementation. In PAKDD, pages 13–25, 2011. [54] Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. [72] U. Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish and Christos Faloutsos. GBASE: a scalable and general Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. graph management system. In KDD, pages 1091–1099, SystemML: Declarative machine learning on MapReduce. 2011. In ICDE, pages 231–242, 2011. [73] U. Kang, Charalampos E. Tsourakakis, and Christos [55] Alon Y. Halevy. Answering queries using views: A survey. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System. VLDB J., 10(4):270–294, 2001. In ICDM, pages 229–238, 2009. [56] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit [74] U. Kang, Charalampos E. Tsourakakis, and Christos Jain, Xiaodong Zhang, and Zhiwei Xu. RCFile: A fast and Faloutsos. PEGASUS: mining peta-scale graphs. Knowl. space-efficient data placement structure in Inf. Syst., 27(2):303–325, 2011. MapReduce-based warehouse systems. In ICDE, pages [75] Shahan Khatchadourian, Mariano P. Consens, and Jérôme 1199–1208, 2011. Siméon. 
Having a ChuQL at XML on the Cloud. In AMW, [57] Herodotos Herodotou. Hadoop Performance Models. 2011. CoRR, abs/1106.0940, 2011. [76] HyeongSik Kim, Padmashree Ravindra, and Kemafor [58] Herodotos Herodotou and Shivnath Babu. Profiling, Anyanwu. From SPARQL to MapReduce: The Journey What-if Analysis, and Cost-based Optimization of Using a Nested TripleGroup Algebra. PVLDB, MapReduce Programs. PVLDB, 4(11):1111–1122, 2011. 4(12):1426–1429, 2011. [59] Herodotos Herodotou, Fei Dong, and Shivnath Babu. [77] Lars Kolb, Andreas Thor, and Erhard Rahm. Dedoop: MapReduce Programming and Cost-based Optimization? Efficient Deduplication with Hadoop. PVLDB, 5(12), 2012. Crossing this Chasm with Starfish. PVLDB, [78] Lars Kolb, Andreas Thor, and Erhard Rahm. Load 4(12):1446–1449, 2011. Balancing for MapReduce-based Entity Resolution. In [60] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko ICDE, pages 618–629, 2012. Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath [79] Vibhore Kumar, Henrique Andrade, Bugra Gedik, and Babu. Starfish: A Self-tuning System for Big Data Kun-Lung Wu. DEDUCE: at the intersection of Analytics. In CIDR, pages 261–272, 2011. MapReduce and stream processing. In EDBT, pages [61] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The 657–662, 2010. Fourth Paradigm: Data-Intensive Scientific Discovery. [80] Leslie Lamport. The Part-Time Parliament. ACM Trans. Microsoft Research, October 2009. Comput. Syst., 16(2):133–169, 1998. [62] Benjamin Hindman, Andy Konwinski, Matei Zaharia, and [81] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph [100] C. Olston, B. Reed, U. Srivastava, R. Kumar, and problems in MapReduce. In SPAA, pages 85–94, 2011. A. Tomkins. Pig latin: a not-so-foreign language for data [82] Harold Lim, Herodotos Herodotou, and Shivnath Babu. processing. In SIGMOD, pages 1099–1110, 2008. Stubby: A Transformation-based Optimizer for MapReduce [101] Biswanath Panda, Joshua Herbach, Sugato Basu, and Workflows. PVLDB, 5(12), 2012. Roberto J. Bayardo. PLANET: Massively Parallel Learning [83] Jimmy J. Lin. Brute force and indexed approaches to of Tree Ensembles with MapReduce. PVLDB, pairwise document similarity comparisons with 2(2):1426–1437, 2009. MapReduce. In SIGIR, pages 155–162, 2009. [102] Spiros Papadimitriou and Jimeng Sun. DisCo: Distributed [84] Yuting Lin, Divyakant Agrawal, Chun Chen, Beng Chin Co-clustering with Map-Reduce: A Case Study towards Ooi, and Sai Wu. Llama: leveraging columnar storage for Petabyte-Scale End-to-End Mining. In ICDM, pages scalable join processing in the MapReduce framework. In 512–521, 2008. SIGMOD Conference, pages 961–972, 2011. [103] David A. Patterson. Technical perspective: the data center is [85] Dionysios Logothetis and Ken Yocum. Ad-hoc data the computer. Commun. ACM, 51(1):105, 2008. processing in the cloud. PVLDB, 1(2):1472–1475, 2008. [104] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, [86] Boon Thau Loo, Tyson Condie, Joseph M. Hellerstein, S. Madden, and M. Stonebraker. A comparison of Petros Maniatis, Timothy Roscoe, and Ion Stoica. approaches to large-scale data analysis. In SIGMOD, pages Implementing declarative overlays. In SOSP, pages 75–90, 165–178, 2009. 2005. [105] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. [87] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Interpreting the data: Parallel analysis with Sawzall. Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Scientific Programming, 13(4):277–298, 2005. 
GraphLab: A New Framework For Parallel Machine [106] E. Prud’hommeaux and A. Seaborne. SPARQL Query Learning. In UAI, pages 340–349, 2010. Language for RDF, W3C Recommendation, January 2008. [88] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny http://www.w3.org/TR/rdf-sparql-query/. Bickson, Carlos Guestrin, and Joseph M. Hellerstein. [107] Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, Distributed GraphLab: A Framework for Machine Learning and Jens Dittrich. RAFT at work: speeding-up mapreduce in the Cloud. PVLDB, 5(8):716–727, 2012. applications under task and node failures. In SIGMOD [89] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, Conference, pages 1225–1228, 2011. N. Leiser, and G. Czajkowski. Pregel: a system for [108] Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, large-scale graph processing. In SIGMOD, pages 135–146, and Jens Dittrich. RAFTing MapReduce: Fast recovery on 2010. the RAFT. In ICDE, pages 589–600, 2011. [90] F. Manola and E. Miller. RDF Primer, W3C [109] Padmashree Ravindra, HyeongSik Kim, and Kemafor Recommendation, February 2004. Anyanwu. An Intermediate Algebra for Optimizing RDF http://www.w3.org/TR/REC-rdf-syntax/. Graph Pattern Matching on MapReduce. In ESWC (2), [91] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey pages 46–61, 2011. Romer, Shiva Shivakumar, Matt Tolton, and Theo [110] Sherif Sakr, Anna Liu, Daniel M. Batista, and Mohammad Vassilakis. Dremel: Interactive Analysis of Web-Scale Alomari. A Survey of Large Scale Data Management Datasets. PVLDB, 3(1):330–339, 2010. Approaches in Cloud Environments. IEEE Communications [92] Ahmed Metwally and Christos Faloutsos. V-SMART-Join: Surveys and Tutorials, 13(3):311–336, 2011. A Scalable MapReduce Framework for All-Pair Similarity [111] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Joins of Multisets and Vectors. PVLDB, 5(8):704–715, Hornung, and Georg Lausen. PigSPARQL: Mapping 2012. SPARQL to Pig Latin. In SWIM, pages 65–84, 2011. [93] G. Francisci Morales, A. Gionis, and M. Sozio. Social [112] Patricia G. Selinger, Morton M. Astrahan, Donald D. Content Matching in MapReduce. PVLDB, 4(7):460–469, Chamberlin, Raymond A. Lorie, and Thomas G. Price. 2011. Access Path Selection in a Relational Database [94] Kristi Morton, Magdalena Balazinska, and Dan Grossman. Management System. In SIGMOD Conference, pages ParaTimer: a progress indicator for MapReduce DAGs. In 23–34, 1979. SIGMOD Conference, pages 507–518, 2010. [113] M. Stonebraker. The case for shared nothing. IEEE [95] Kristi Morton, Abram Friesen, Magdalena Balazinska, and Database Eng. Bull., 9(1):4–9, 1986. Dan Grossman. Estimating the progress of MapReduce [114] M. Stonebraker, D. Abadi, D. DeWitt, S. Madden, pipelines. In ICDE, pages 681–684, 2010. E. Paulson, A. Pavlo, and A. Rasin. MapReduce and [96] Jaeseok Myung, Jongheum Yeon, and Sang goo Lee. parallel DBMSs: friends or foes? Commun. ACM, SPARQL basic graph pattern processing with iterative 53(1):64–71, 2010. MapReduce. In MDAC, 2010. [115] Philip Stutz, Abraham Bernstein, and William W. Cohen. [97] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine Signal/Collect: Graph Algorithms for the (Semantic) Web. for RDF. PVLDB, 1(1), 2008. In International Semantic Web Conference (1), pages [98] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and 764–780, 2010. N. Koudas. MRShare: Sharing Across Multiple Queries in [116] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, MapReduce. PVLDB, 3(1):494–505, 2010. S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. 
Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
[99] Martin Odersky, Lex Spoon, and Bill Venners. Programming in Scala: A Comprehensive Step-by-Step Guide. Artima Inc, 2011.
[117] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, pages 996–1005, 2010.
[118] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, and Hao Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD Conference, pages 1013–1020, 2010.
[119] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies. W. H. Freeman & Co., New York, NY, USA, 1990.
[120] Leslie G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 33(8):103–111, 1990.
[121] R. Vernica, M. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495–506, 2010.
[122] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li. MapDupReducer: detecting near duplicates over massive datasets. In SIGMOD, pages 1119–1122, 2010.
[123] Guozhang Wang, Wenlei Xie, Alan Demers, and Johannes Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.
[124] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.
[125] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.
[126] H. Yang, A. Dasdan, R. Hsiao, and D. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029–1040, 2007.
[127] H. Yang and D. Parker. Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters. In DASFAA, pages 308–322, 2009.
[128] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. Gunda, and J. Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI, pages 1–14, 2008.
[129] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI, pages 29–42, 2008.
[130] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, pages 265–278, 2010.
[131] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012.
[132] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010.
[133] Yanfeng Zhang, Qinxin Gao, Lixin Gao, and Cuirong Wang. iMapReduce: A Distributed Computing Framework for Iterative Computation. J. Grid Comput., 10(1):47–68, 2012.
[134] J. Zhou, P. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In ICDE, pages 1060–1071, 2010.
[135] Marcin Zukowski, Peter A. Boncz, Niels Nes, and Sándor Héman. MonetDB/X100 - A DBMS In The CPU Cache. IEEE Data Eng. Bull., 28(2):17–22, 2005.

APPENDIX
A. APPLICATION OF THE MAPREDUCE FRAMEWORK
MapReduce-based systems are increasingly being used for large-scale data analysis. There are several reasons for this, such as [67]:
• The interface of MapReduce is simple yet expressive. Although MapReduce only involves two functions, map and reduce, a number of data analytical tasks including traditional SQL queries, data mining, machine learning and graph processing can be expressed with a set of MapReduce jobs (a minimal sketch of this interface is given after this list).
• MapReduce is flexible. It is designed to be independent of storage systems and is able to analyze various kinds of data, structured and unstructured.
• MapReduce is scalable. An installation of MapReduce can run over thousands of nodes on a shared-nothing cluster while providing fine-grained fault tolerance whereby only tasks on failed nodes need to be restarted.
These main advantages have triggered several research efforts with the aim of applying the MapReduce framework to solving challenging data processing problems on large scale datasets in different domains. In this appendix, we provide an overview of several research efforts on developing MapReduce-based solutions for data-intensive applications of different data models such as XML (Appendix A.1), RDF (Appendix A.2) and graphs (Appendix A.3). Appendix A.4 provides an overview of several approaches for developing MapReduce-based solutions to different data-intensive and computationally expensive operations.
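To make the two-function interface referred to above concrete, the following minimal Python sketch expresses the classical word count task in map/reduce style. The in-memory run_job driver that simulates the shuffle phase, as well as all function and variable names, are illustrative assumptions introduced here for exposition; they are not taken from Hadoop or from any of the surveyed systems.

    from collections import defaultdict

    # A purely local illustration of the map/reduce interface: map() emits
    # (key, value) pairs and reduce() aggregates all values sharing a key.

    def map_fn(_, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        # Sum the partial counts produced for this word.
        yield word, sum(counts)

    def run_job(records, map_fn, reduce_fn):
        # Simulated shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for key, value in records:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        # Reduce phase: one call per distinct key.
        return [out for k, vs in sorted(groups.items()) for out in reduce_fn(k, vs)]

    if __name__ == "__main__":
        lines = enumerate(["map reduce map", "reduce shuffle map"])
        print(run_job(lines, map_fn, reduce_fn))
        # [('map', 3), ('reduce', 2), ('shuffle', 1)]

In a real deployment, the grouping of intermediate pairs by key would be performed by the framework's distributed shuffle, and the map and reduce tasks would run on different nodes of the cluster.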

A.1 MapReduce for Large Scale XML Processing
XML (eXtensible Markup Language) [23] has been acknowledged as the de facto standard for data representation and exchange over the World Wide Web. It has found practical application in numerous domains including data interchange, streaming data and data storage. In addition, it has been considered as a standard format for many industries such as government30, finance31, electronic business32 and science33. However, so far, the problem of large scale XML processing using the MapReduce framework has not been extensively considered in the research literature. Some preliminary works have been presented to tackle this problem. For example, [47] have proposed an SQL-like query language for large-scale analysis of XML data on a MapReduce platform, called MRQL (the Map-Reduce Query Language). The evaluation system of MRQL leverages relational query optimization techniques and compiles MRQL queries into an algebra which is then translated into physical plans using cost-based optimizations. In particular, the query plans are represented as trees that are evaluated using a plan interpreter, where each physical operator is implemented with a single MapReduce job which is parameterized by the functional parameters of the physical operator. The data fragmentation technique of MRQL is built on top of the general Hadoop XML input format, which is based on a single XML tag name. Hence, given a data split of an XML document, Hadoop's input format allows reading the document as a stream of string fragments, so that each string will contain a single complete element that has the requested XML tag name (a simplified sketch of this fragmentation step is given below).

30 http://www.irs.gov/efile/
31 http://www.fpml.org/
32 http://ebxml.org/
33 http://www.w3.org/TR/2010/REC-MathML3-20101021/
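The following Python sketch illustrates, under simplifying assumptions, the tag-based fragmentation idea described above: the input is scanned for complete elements with a requested tag name, and each such element is handed to a map function as a single string fragment. The regular-expression scan, the books/book/author element names and the map function are hypothetical examples; a real record reader must additionally handle elements that straddle HDFS split boundaries, which is omitted here.

    import re

    def xml_fragments(text, tag):
        # Return every complete <tag>...</tag> element as one string fragment.
        pattern = re.compile(r"<{0}\b.*?</{0}>".format(re.escape(tag)), re.DOTALL)
        return pattern.findall(text)

    def map_fn(fragment):
        # Example map function: emit (author, 1) for each book element.
        match = re.search(r"<author>(.*?)</author>", fragment, re.DOTALL)
        if match:
            yield match.group(1).strip(), 1

    if __name__ == "__main__":
        doc = """<books>
                   <book><title>A</title><author>Smith</author></book>
                   <book><title>B</title><author>Jones</author></book>
                 </books>"""
        for frag in xml_fragments(doc, "book"):
            for key, value in map_fn(frag):
                print(key, value)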

ChuQL [75] is another language that has been proposed to support distributed XML processing using the MapReduce framework. It presents a MapReduce-based extension of the syntax, grammar and semantics of XQuery [19], the standard W3C language for querying XML documents. In particular, the ChuQL implementation takes care of distributing the computation to multiple XQuery engines running in Hadoop nodes, as described by one or more ChuQL MapReduce expressions. Figure 19 illustrates the representation of the word count example program in the ChuQL language using its extended expressions, where the mapreduce expression is used to describe a MapReduce job. The input and output clauses are respectively used to read from and write onto HDFS. The rr and rw clauses are respectively used for describing the record reader and writer. The map and reduce clauses represent the standard map and reduce phases of the framework, where they process XML values or key/value pairs of XML values, specified using XQuery expressions, to match the MapReduce model.

Figure 19: The Word Count Example Program in ChuQL [75]

A.2 MapReduce for Large Scale RDF Processing
RDF (Resource Description Framework) is the standard W3C recommendation for expressing and exchanging semantic metadata about remote information sources [90]. It represents a core component of the Semantic Web initiatives as it defines a model for describing relationships among resources in terms of uniquely identified attributes and values. The basic building block in RDF is a simple tuple model, (subject, predicate, object), to express different types of knowledge in the form of fact statements. The interpretation of each statement is that subject S has property P with value O, where S and P are resource URIs and O is either a URI or a literal value. Thus, any object from one triple can play the role of a subject in another triple, which amounts to chaining two labeled edges in a graph-based structure. The SPARQL query language is the official W3C standard for querying and extracting information from RDF graphs [106]. It is based on a powerful graph matching facility which allows binding variables to components in the input RDF graph and supports conjunctions and disjunctions of triple patterns.
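A basic building block of most MapReduce-based SPARQL processors is the evaluation of a join between triple patterns that share a variable. The Python sketch below shows this idea in its simplest form for two patterns joined on a common subject variable; the in-memory shuffle, the worksAt/name predicates and the sample triples are illustrative assumptions and do not reproduce the algorithms of any particular system discussed in this appendix.

    from collections import defaultdict

    # Join the triple patterns (?p, "worksAt", ?org) and (?p, "name", ?n) on
    # the shared variable ?p: the map phase keys every matching triple by the
    # join variable, the reduce phase combines the bindings per key.

    def map_fn(triple):
        s, p, o = triple
        if p == "worksAt":
            yield s, ("worksAt", o)
        elif p == "name":
            yield s, ("name", o)

    def reduce_fn(subject, values):
        orgs = [v for tag, v in values if tag == "worksAt"]
        names = [v for tag, v in values if tag == "name"]
        for org in orgs:
            for name in names:
                yield subject, name, org

    def run(triples):
        groups = defaultdict(list)
        for t in triples:
            for k, v in map_fn(t):
                groups[k].append(v)
        return [r for k, vs in groups.items() for r in reduce_fn(k, vs)]

    if __name__ == "__main__":
        triples = [("p1", "worksAt", "acme"), ("p1", "name", "alice"),
                   ("p2", "name", "bob"), ("p2", "worksAt", "globex")]
        print(run(triples))

Multi-way join strategies, such as the one considered in [96], extend the same pattern by choosing a suitable shared join key across several triple patterns so that unnecessary iterations can be avoided.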
Some research efforts have been proposed for achieving scalable RDF processing using the MapReduce framework. PigSPARQL [111] is a system that has been introduced to process SPARQL queries using the MapReduce framework by translating them into Pig Latin programs, where each Pig Latin program is executed by a series of MapReduce jobs on a Hadoop cluster. [96] have presented a preliminary algorithm for SPARQL graph pattern matching by adopting the traditional multi-way join of the RDF triples and selecting a good join key to avoid unnecessary iterations. [64] have described a storage scheme for RDF data using HDFS where the input data are partitioned into multiple files using two main steps: 1) the Predicate Split, which partitions the RDF triples according to their predicates; and 2) the Predicate Object Split (POS), which uses the explicit type information in the RDF triples to denote that a resource is an instance of a specific class, while the remaining predicate files are partitioned according to the type of their objects. Using summary statistics for estimating the selectivities of join operations, the authors proposed an algorithm that generates a query plan whose cost is bounded by the log of the total number of variables in the given SPARQL query. An approach for optimizing RDF graph pattern matching by reinterpreting certain join tree structures as grouping operations has been presented in [76, 109]. The proposed approach represents the intermediate results as groups of triples, called TripleGroups, and uses a Nested TripleGroup Algebra for manipulating them. [4] have demonstrated an approach for storing and querying RDF data using the HadoopDB system in conjunction with a column-oriented database [2], which can provide a promising solution for supporting efficient and scalable semantic web applications. A similar approach has been presented in [63], which replaced the column-oriented back-end database with a state-of-the-art RDF query processor, RDF-3X [97].
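As a rough illustration of the first partitioning step described above (the Predicate Split of [64]), the following Python sketch groups triples by predicate and writes one file per predicate. The output directory, the file-naming scheme and the use of the local file system instead of HDFS are simplifying assumptions made only for exposition, and the Predicate Object Split refinement is omitted.

    import os
    from collections import defaultdict

    def predicate_split(triples, out_dir):
        # Group triples by predicate and write one file per predicate so that
        # queries touching few predicates read only the relevant partitions.
        # (The POS refinement that further splits type data by object class
        # is omitted in this sketch.)
        buckets = defaultdict(list)
        for s, p, o in triples:
            buckets[p].append((s, o))
        os.makedirs(out_dir, exist_ok=True)
        for predicate, pairs in buckets.items():
            safe = predicate.replace("/", "_").replace(":", "_")
            with open(os.path.join(out_dir, safe + ".nt"), "w") as f:
                for s, o in pairs:
                    f.write("{}\t{}\n".format(s, o))

    if __name__ == "__main__":
        data = [("s1", "rdf:type", "Person"), ("s1", "foaf:name", "Alice"),
                ("s2", "foaf:name", "Bob")]
        predicate_split(data, "rdf_partitions")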
A.3 MapReduce for Large Scale Graph Processing
Graphs are popular data structures which are used to model structural relationships between objects. Recently, they have been receiving increasing research attention as they are used in a wide variety of high impact applications such as social networks, computer networks, telecommunication networks, recommendation systems, protein-protein interaction networks and the World Wide Web. Some research efforts have been proposed for providing scalable processing mechanisms for massive graph datasets. For example, Surfer [31] is a large scale graph processing engine which is designed to provide two basic primitives for programmers: MapReduce and propagation. In this engine, MapReduce processes different key-value pairs in parallel, and propagation is an iterative computational pattern that transfers information along the edges from a vertex to its neighbors in the graph. In principle, these two primitives are complementary in graph processing, where MapReduce is suitable for processing flat data structures (e.g., vertex-oriented tasks) while propagation is optimized for edge-oriented tasks on partitioned graphs. [81] presented a set of MapReduce-based algorithms for a variety of fundamental graph problems such as minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers, and minimum cuts. All of the presented algorithms are parameterized by the amount of memory available on the machines, which is used to determine the number of MapReduce rounds.
GBASE [72] has been introduced as a scalable and general graph management system. It uses a graph storage method, called block compression, to efficiently store homogeneous regions of graphs. In particular, given the original raw graph, which is stored as a big edge file, GBASE first partitions it into several homogeneous blocks. According to the partition results, it reshuffles the nodes so that the nodes belonging to the same partition are put nearby, after which it compresses all non-empty blocks through standard compression such as GZip34. Finally, it stores the compressed blocks together with some meta information into the graph database. GBASE supports different types of graph queries including neighborhood, induced subgraph, egonet, K-core and cross-edges. To achieve this goal, it applies a grid selection strategy to minimize disk accesses and answers queries by applying a MapReduce-based algorithm that supports incidence matrix based queries. The key idea of the query execution engine is that it unifies the different types of inputs as query vectors and unifies the different types of operations on the graph by a unified matrix-vector multiplication. Hence, GBASE handles queries by executing appropriate block matrix-vector multiplication modules. PEGASUS [73, 74] is another system which has been introduced as a large scale graph mining library that is implemented on top of Hadoop and supports performing typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components via a generalization of matrix-vector multiplication (GIM-V). [71] have presented a MapReduce-based algorithm, built on top of Hadoop, for discovering patterns on near-cliques and triangles in large scale graphs.

34 http://www.gzip.org/
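Both GBASE and PEGASUS reduce their graph queries and mining tasks to (generalized) matrix-vector multiplications executed as MapReduce jobs. The following Python sketch shows the plain version of this primitive over a sparse edge list; the in-memory grouping and the broadcast of the full vector are simplifications made for brevity, whereas the actual systems partition the matrix into blocks and route vector entries through the shuffle.

    from collections import defaultdict

    def mapreduce_matvec(edges, vector):
        # Map: for each non-zero entry A[i][j], emit (i, A[i][j] * v[j]).
        # Reduce: sum the partial products per row i.
        partial = defaultdict(list)
        for i, j, a_ij in edges:
            partial[i].append(a_ij * vector.get(j, 0.0))
        return {i: sum(products) for i, products in partial.items()}

    if __name__ == "__main__":
        # Adjacency "matrix" of a tiny graph as sparse (row, col, value) records.
        edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0)]
        v = {0: 1.0, 1: 2.0, 2: 3.0}
        print(mapreduce_matvec(edges, v))   # {0: 2.0, 1: 4.0, 2: 2.0}

Repeating this multiplication until a fixed point is reached is the pattern that GIM-V generalizes to cover tasks such as connected components and diameter estimation.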
In general, graph algorithms can be written as a series of chained MapReduce invocations that requires passing the entire state of the graph from one stage to the next. However, this approach is ill-suited for graph processing and can lead to suboptimal performance due to the additional communication and associated serialization overhead, in addition to the need of coordinating the steps of a chained MapReduce. The Pregel system [89] has been introduced by Google as a scalable platform for implementing graph algorithms. It relies on a vertex-centric approach, which is inspired by the Bulk Synchronous Parallel (BSP) model [120], where programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices and modify its own state as well as that of its outgoing edges, or mutate the graph topology. In particular, Pregel computations consist of a sequence of iterations, called supersteps. During a superstep the framework invokes a user-defined function for each vertex, conceptually in parallel, which specifies the behavior at a single vertex V and a single superstep S. It can read messages sent to V in superstep S−1, send messages to other vertices that will be received at superstep S+1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known. Similar to the MapReduce framework, Pregel has been designed to be an efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, where the distribution-related details are hidden behind an abstraction. It keeps vertices and edges on the machine that performs the computation and uses network transfers only for messages. Hence, the model is well suited for distributed implementations as it does not expose any mechanism for detecting the order of execution within a superstep, and all communication is from superstep S to superstep S+1. The ideas of Pregel have been cloned by many open source projects such as GoldenOrb35, Apache Hama36 and Apache Giraph37. Both Hama and Giraph are implemented to be launched as typical Hadoop jobs that can leverage the Hadoop infrastructure. Other large scale graph processing systems which have been introduced that neither follow the MapReduce model nor leverage the Hadoop infrastructure include GRACE [123], GraphLab [87, 88] and Signal/Collect [115].

35 http://goldenorbos.org/
36 http://hama.apache.org/
37 http://giraph.apache.org/
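To give a flavor of the vertex-centric, superstep-based style of computation described above, the following single-machine Python sketch propagates the maximum vertex value through a graph in BSP fashion: in each superstep a vertex consumes the messages sent in the previous superstep, updates its value and sends messages along its outgoing edges. The dictionaries used for message passing, the termination test and the example graph are illustrative assumptions; the actual Pregel API additionally provides vote-to-halt semantics, combiners, aggregators and fault tolerance.

    def bsp_max_value(graph, values, max_supersteps=30):
        # graph: vertex -> list of neighbor vertices (outgoing edges)
        # values: vertex -> initial value
        # Superstep 0: every vertex sends its value to its neighbors.
        outbox = {v: [] for v in graph}
        for v in graph:
            for neighbor in graph[v]:
                outbox[neighbor].append(values[v])
        inbox = outbox
        for _ in range(max_supersteps):
            outbox = {v: [] for v in graph}
            changed = False
            for v in graph:
                if inbox[v]:
                    new_value = max([values[v]] + inbox[v])
                    if new_value != values[v]:
                        values[v] = new_value
                        changed = True
                        for neighbor in graph[v]:
                            outbox[neighbor].append(new_value)
            inbox = outbox
            if not changed:
                # All vertices are stable; this plays the role of vote-to-halt.
                break
        return values

    if __name__ == "__main__":
        graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
        print(bsp_max_value(graph, {"a": 3, "b": 7, "c": 2, "d": 5}))
        # {'a': 7, 'b': 7, 'c': 7, 'd': 5}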
A.4 Other MapReduce Applications
Several approaches have been proposed in the literature for tackling different data-intensive and computationally expensive operations using the MapReduce framework. For example, the Dedoop system (Deduplication with Hadoop) [77, 78] has been presented as an entity resolution framework based on MapReduce. It supports the ability to define complex entity resolution workflows that can include different matching steps and/or apply machine learning mechanisms for the automatic generation of match classifiers. The defined workflows are then automatically translated into MapReduce jobs for parallel execution on Hadoop clusters. The MapDupReducer [122] is another system that has been proposed as a MapReduce-based solution for supporting the problem of near duplicate detection over massive datasets, using the PPJoin (Positional and Prefix filtering) algorithm [125]. An approach to efficiently perform set-similarity joins in parallel using the MapReduce framework has been proposed in [121]. In particular, the authors propose a 3-stage approach for end-to-end set-similarity joins. The approach takes as input a set of records and outputs a set of joined records based on a set-similarity condition. It partitions the data across nodes in order to balance the workload and minimize the need for replication. [83] has presented three MapReduce algorithms for computing pairwise similarity on document collections. The first algorithm is based on brute force, the second algorithm treats the problem as large-scale ad hoc retrieval and the third algorithm is based on the Cartesian product of postings lists. V-SMART-Join [92] is a MapReduce-based framework for discovering all pairs of similar entities which is applicable to sets, multisets, and vectors. It presents a family of 2-stage algorithms where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. [6] have provided a theoretical analysis of various MapReduce-based similarity join algorithms in terms of various parameters including map and reduce costs, number of reducers and communication cost.
The DisCo (Distributed Co-clustering) framework [102] has been introduced as an approach for distributed data pre-processing and co-clustering from the raw data to the end clusters using the MapReduce framework. [35] have presented an approach for finding subspace clusters in very large moderate-to-high dimensional data that typically has more than 5 axes. [46] described the design and the MapReduce-based implementations of the k-median and k-center clustering algorithms. PLANET (Parallel Learner for Assembling Numerous Ensemble Trees) [101] is a distributed framework for learning tree models over large datasets. It defines tree learning as a series of distributed computations and implements each one using the MapReduce model. SystemML [54] provides a framework for expressing machine learning algorithms using a declarative higher-level language. The algorithms expressed in SystemML are then automatically compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. NIMBLE [53] provides an infrastructure that has been specifically designed to enable the rapid implementation of parallel machine learning and data mining algorithms. The infrastructure allows its users to compose parallel machine learning algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using the MapReduce framework. Mahout38 is an Apache project which is designed with the aim of building scalable machine learning libraries using the MapReduce framework. Ricardo [36] is presented as a scalable platform for applying sophisticated statistical methods over huge data repositories. It is designed to facilitate the trading between R (a well-known statistical software package39) and Hadoop where each trading partner performs the tasks that it does best. In particular, this trading is done in a way where R sends aggregation-processing queries to Hadoop while Hadoop sends aggregated data to R for advanced statistical processing or visualization.

38 http://mahout.apache.org/
39 http://www.r-project.org/

[26] presented an approach for applying the MapReduce model in the domain of spatial data management. In particular, they focus on the bulk construction of R-Trees and aerial image quality computation, which involves vector and raster data. [93] have presented two matching algorithms, GreedyMR and StackMR, which are geared for the MapReduce paradigm with the aim of distributing content from information suppliers to information consumers on social media applications. In particular, they seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity.
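Several of the similarity-join approaches surveyed in this appendix (e.g., the PPJoin-based MapDupReducer and the set-similarity joins of [121]) rely on prefix filtering to prune candidate pairs before exact verification. The Python sketch below shows that idea in a local, simplified form: only records whose ordered token prefixes share at least one token become candidates, and candidates are then verified with the exact Jaccard measure. The lexicographic token ordering, the in-memory buckets and the jaccard/similarity_join names are assumptions made for illustration; production algorithms typically order tokens by global frequency and distribute both phases as MapReduce jobs.

    import math
    from collections import defaultdict
    from itertools import combinations

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / float(len(a | b))

    def prefix_length(size, threshold):
        # With Jaccard threshold t, two sets can only match if their ordered
        # prefixes of length |r| - ceil(t * |r|) + 1 share at least one token.
        return size - int(math.ceil(threshold * size)) + 1

    def similarity_join(records, threshold=0.6):
        # "Map": emit (prefix_token, record_id) pairs; "reduce": verify the
        # candidate pairs grouped under each token with the exact measure.
        ordered = {rid: sorted(tokens) for rid, tokens in records.items()}
        buckets = defaultdict(set)
        for rid, tokens in ordered.items():
            for token in tokens[:prefix_length(len(tokens), threshold)]:
                buckets[token].add(rid)
        results = set()
        for token, rids in buckets.items():
            for r1, r2 in combinations(sorted(rids), 2):
                if jaccard(ordered[r1], ordered[r2]) >= threshold:
                    results.add((r1, r2))
        return results

    if __name__ == "__main__":
        recs = {"d1": ["map", "reduce", "hadoop"],
                "d2": ["map", "reduce", "hive"],
                "d3": ["graph", "pregel", "bsp"]}
        print(similarity_join(recs, threshold=0.5))   # {('d1', 'd2')}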