
Optimizing SQL Query Execution over Map-Reduce

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research in Computer Science

by

Bharath Vissapragada 200702012 bharat [email protected]

Center for Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
September 2014

Copyright © Bharath Vissapragada, 2013
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Optimizing SQL Query Execution over Map-Reduce” by Bharath Vissapragada, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Prof. Kamalakar Karlapalem

To my uncle
Late Ravi Sanker Ganti

Acknowledgments

Firstly I would like to thank my dad, mom, sister and grandparents for believing in me and letting me pursue my interests. I would never have completed my thesis without the support of my advisers, Kamal sir and Satya sir. They were always open for discussions and I am really lucky to have their support. I would like to thank my closest pals Abilash, Chaitanya, Phani, Ravali, Ronanki and Vignesh for their constant support, especially when I was let down by something. I really miss my uncle, Late Ravi Sanker Ganti, who was responsible for what I am today. Thanks to the Almighty for blessing me with good luck and mental peace.

Abstract

Query optimization in relational systems is a topic that has been studied in depth, in both stand-alone and distributed settings. Modern-day optimizers have become complex, focusing on improving the quality of optimization and reducing query execution times. They employ a wide range of heuristics and consider a set of possible plans, called the plan space, to find the best plan to execute. However, with the advent of big data, more and more organizations are moving towards map-reduce based processing systems for managing their large datasets, since map-reduce outperforms traditional techniques for processing very large amounts of data while running on commodity hardware, thus reducing maintenance costs. In this thesis, we describe the design and implementation of a query optimizer tailor-made for efficient execution of SQL queries over the map-reduce framework. We rely on traditional relational database query optimization principles and extend them to address this problem. Our major contributions can be summarized as follows.

1. We propose a statistics-based approach for optimizing SQL queries on top of map-reduce.

2. We design cost formulae to predict the run time of joins before executing the query.

3. We extend the traditional plan space to include the bushy plan space, which leverages the massively parallel architecture of map-reduce systems. We design three algorithms to explore this plan space and use our cost formulae to select the plan with the least execution cost.

4. We develop a task scheduler for the map-reduce shuffle, based on the max-flow min-cut algorithm, that minimizes the overall network IO during joins. The scheduler uses the statistics collected from the data, formulates task assignment as a max-flow min-cut problem on a flow graph over the cluster nodes, and solves it to obtain the minimal overall IO in the shuffle phase.

5. We show the performance improvements from the above features using TPCH workloads of scales 100 and 300, on both TPCH benchmark queries and synthetic queries.

Our experiments show improvements of up to 2x in query execution time and up to a 33% reduction in shuffle network IO during map-reduce jobs.

Contents


1 Introduction and Background
  1.1 Map-Reduce and Hadoop
    1.1.1 Hadoop Distributed File System
    1.1.2 HDFS Architecture
    1.1.3 Replication factor and replica placement
    1.1.4 Data reads, writes and deletes
    1.1.5 Map-Reduce
  1.2 Hive
    1.2.1 Hive Anatomy
    1.2.2 Joins in Hive
  1.3 Query Optimization in Databases
  1.4 Problem statement and scope
  1.5 Contributions
  1.6 Organization of thesis

2 Related Work
  2.1 Related Work

3 Query Plan Space
  3.1 Overview of Query Planspace - An Example
  3.2 Exploring Bushy Planspace
    3.2.1 Finding minimum cost n-fully balanced tree - FindMin
    3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin
    3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin
  3.3 Choosing the value of n for an n-balanced tree

4 Cost Based Optimization Problem
  4.1 Distributed Statistics Store
  4.2 Cost Formulae
    4.2.1 Join map phase
    4.2.2 Join shuffle phase
    4.2.3 Join Reducer phase
  4.3 Scheduling - An example
  4.4 Scheduling strategy
  4.5 Shuffle algorithm - Proof of minimality


  4.6 Shuffle algorithm - A working example

5 Experimental Evaluation and Results
  5.1 Experimental setup
  5.2 Plan space evaluation and Complexity
    5.2.1 Algorithm FindMin
    5.2.2 Algorithm FindRecMin
    5.2.3 Algorithm FindRecDeepMin
  5.3 Algorithms performance
  5.4 Cost Formulae accuracy
  5.5 Efficiency of scheduling algorithm

6 Conclusions and Future Work
  6.1 Conclusions and contributions
  6.2 Future work

Appendix A: Query execution plans for TPCH queries
  A.1 q2
    A.1.1 Postgres
    A.1.2 Hive
    A.1.3 FindRecDeepMin
  A.2 q3
    A.2.1 Postgres
    A.2.2 Hive
    A.2.3 FindRecDeepMin
  A.3 q10
    A.3.1 Postgres
    A.3.2 Hive
    A.3.3 FindRecDeepMin
  A.4 q11
    A.4.1 Postgres
    A.4.2 Hive
    A.4.3 FindRecDeepMin
  A.5 q16
    A.5.1 Postgres
    A.5.2 Hive
    A.5.3 FindRecDeepMin
  A.6 q18
    A.6.1 Postgres
    A.6.2 Hive
    A.6.3 FindRecDeepMin

Bibliography

List of Figures


1.1 Hive query - Example
1.2 Pig script - Example
1.3 HDFS architecture
1.4 Map Reduce Architecture
1.5 Hive Architecture
1.6 Block diagram of a query optimizer

3.1 Possible join orderings
3.2 Time line for the execution of left deep QEP
3.3 Time line for the execution of bushy Query plan
3.4 An example of 4-fullybalanced tree
3.5 An example of 4-balanced tree
3.6 8-balanced tree
3.7 4-balanced tree
3.8 Running Algorithm 1 on 9 tables
3.9 Running Algorithm 2 on 9 tables

4.1 Statistics store architecture
4.2 Query scheduling example
4.3 Flow network

5.1 Query execution plan for q2
5.2 Query execution plan for q3
5.3 Query execution plan for q10
5.4 Query execution plan for q11
5.5 Query execution plan for q16
5.6 Query execution plan for q18
5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance
5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance
5.7 Query execution plan for q20
5.8 Algorithms runtime in seconds on TPCH benchmark queries on a 10 node cluster on a TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries have less number of join tables
5.9 Algorithms performance evaluation on 100 GB dataset and synthetic queries
5.10 Map phase cost formulae evaluation


5.11 Reduce and shuffle cost formulae evaluation
5.12 Comparison of default vs optimal shuffle IO

List of Tables


4.1 Notations for cost formulae - Map phase
4.2 Notations for cost formulae - Shuffle phase
4.3 Key to Node assignment with optimized scheduler
4.4 Key to Node assignment with default scheduler

5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14
5.2 Algorithms average runtime in seconds on queries with increasing number of joins on a 10 node cluster and a TPCH 100 GB dataset using 4 and 8-balanced trees
5.3 Algorithms runtime in seconds on TPCH benchmark queries on a 10 node cluster on a TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries have less number of join tables
5.4 Summary of query plans for TPCH Dataset
5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale 100 dataset
5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance
5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance

Chapter 1

Introduction and Background

In the internet age, data is wealth. Most organizations rely on their data warehouses for analytics, based on which the management takes strategic decisions that are important to the organization's growth. A decade after the internet bubble, the amount of data each of these organizations possesses has scaled up to tens of petabytes, which cannot be handled by centralized servers. Owing to these needs, the infrastructure has rapidly changed over the past few years from high-end servers holding vast amounts of data to sets of commodity machines holding data in a distributed fashion. The map-reduce programming paradigm [15, 1] from Google has facilitated this transformation by providing highly scalable and fault tolerant distributed systems for production-quality application software.

In the internet age, big data has become a buzzword. Every company, ranging from small startups to internet giants like Google and Facebook, manages data of unbelievable scale [5]. The data sources are mainly web crawlers, user forms on websites, data uploaded to social networking sites, web server logs, and so on. For example, Facebook, the largest photo-sharing site today, holds about 10 billion photos, and this collection grows at the rate of 2-3 TB per day [6]. The same holds true in other areas of science: the Large Hadron Collider (LHC) produces enormous amounts of data from its detectors, and this data is stored in a grid consisting of 200,000 processing cores and 150 PB of disk space in a distributed network [4]. Yahoo has built a Hadoop cluster of 4000 nodes [11] with 16 PB of storage to perform their daily data crunching tasks, and these numbers clearly show the power of map-reduce programming. These huge data sizes and distributed systems create a whole new set of challenges in managing the data and performing large computations such as machine learning tasks. Most firms spend a lot of money managing and extracting the important information present in this data. Performing these large-scale computations efficiently is therefore very important, and even the slightest improvement in these processing techniques can save a lot of money and time.

Processing huge datasets is a complex task: it cannot be done on a single machine, and distributed implementations pose a variety of problems in terms of synchronization, fault tolerance and reliability. Fortunately, with the introduction of Google's Map-Reduce programming

paradigm [16, 15], this process has become fairly simple: the user only needs to think as if programming for a single machine, and the framework takes care of distributing the work across the cluster while providing fault tolerance and reliability. Hadoop [1] is an open-source implementation of Google's map-reduce paradigm that has been widely adopted in academia and industry [3] over the past few years for its ability to process large amounts of data using commodity hardware while hiding the details of parallelism from end users. Hadoop is widely used in production for building search indexes, crunching web server logs, recommender systems and a variety of other tasks that require huge processing capabilities on data of large scale. A number of packages have been developed on top of the Hadoop infrastructure to provide an SQL or similar interface so that users can perform analytics on the data using traditional SQL queries. Such efforts include Pig Latin and Hive [42, 33]. All these packages rely on the same basic principles for converting an SQL-like query into map-reduce jobs, but there are some minor differences in the way they work. For example, Hive takes SQL-like input from the user and converts the query into a directed acyclic graph (DAG) of map-reduce jobs, whereas Pig is a scripting language: the user supplies the entire execution plan as a script, which is then converted into a set of map-reduce jobs. An example showing the difference in joining three tables in Hive and Pig is given in figures 1.1 and 1.2.

select * from A join B on A.a = B.b join C on B.b = C.c;

Figure 1.1 Hive query - Example

temp = join A by a, B by b;
result = join temp by B::b, C by c;
dump result;

Figure 1.2 Pig script - Example

1.1 Map-Reduce and Hadoop

In this section we describe map-reduce in detail, in terms of its open-source implementation Hadoop and the file system it relies on, the Hadoop Distributed File System (HDFS).

HDFS [38] is very similar to the Google File System, the basis for the Map-Reduce paradigm described in the Google paper. Map-Reduce is a parallel programming paradigm that relies on a basic principle: “Moving computation is cheaper than moving data”.

1.1.1 Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a highly fault tolerant distributed file system designed to provide high-throughput access for applications that have large data sets, while running on cheap commodity hardware. It is designed to handle hardware failures well by replicating data across machines in a distributed fashion. HDFS does not follow all POSIX requirements and relaxes a few of them to enable streaming access to file system data. HDFS has been built with the following goals [7].

• Hardware Failure Since HDFS runs on a large set of commodity machines, there is a significant probability that some of them fail at any moment. HDFS is designed to overcome this by intelligently placing replicas of data and maintaining the replica count in case of failures by copying blocks to other machines in the cluster.

• Streaming Data Access Since HDFS has been designed for applications requiring high throughput, a few of the unnecessary POSIX requirements have been relaxed to increase efficiency.

• Large amounts of Data HDFS is tailor-made for holding very large amounts of data, ranging from tens of terabytes to a few petabytes. It also has the ability to scale linearly with the amount of data simply by adding new machines on the fly, while still providing fault tolerance and fast access. This makes it a de-facto choice for big data needs.

• Simple Coherency Model HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A map-reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

• Moving Computation is Cheaper than Moving Data Since the data being dealt with is huge, it is better to move the code to the location where the data resides instead of moving the data across machines, and HDFS provides methods to do this efficiently by exposing the locations of data blocks and supporting the movement of code executables.

• Portability Across Heterogeneous Hardware and Software Platforms Since HDFS supports a very diverse set of platforms and is written in Java, users can include a wide variety of machines in their cluster as long as they meet the basic requirements.

1.1.2 HDFS Architecture

HDFS has a master/slave architecture. An HDFS cluster consists of a master node responsible for maintaining the filesystem namespace and a set of worker nodes called datanodes where the actual data is stored. The namenode stores the filesystem namespace, exposes the data to the end user as a file and

directory hierarchy, and provides a set of basic utilities to add, modify and delete data. It also exposes a Java API for all these features, and bindings for other languages are in widespread use. All instructions are passed via the namenode to the datanodes, which execute them in an orderly fashion. The datanodes report block information to the master and send a constant heartbeat to notify it of their health. If the namenode does not receive heartbeats from a datanode for some time, that datanode is declared dead, and the blocks that fall below the minimum replication factor are replicated to other nodes. All this block mapping information is stored in the namenode's local file system as the “fs-image”, and all changes to it are tracked by recording them in a file called the editLog.

Figure 1.3 HDFS architecture

1.1.3 Replication factor and replica placement

A replication factor is set by the user per file; this is the minimum number of replicas for each block of data and it ensures fault tolerance. The higher the replication factor, the greater the fault tolerance of the cluster. The location of the replicas for each block is decided by the namenode and is chosen intelligently so as to maximize fault tolerance. With a replication factor of three, the first replica is written

to the local node, ensuring maximum speed; the second to a node on a different rack; and the third to a different node on the same rack as the second. The second replica is useful when a whole rack goes down because of a switch failure. Greater replication can also enable faster, parallel reads, since the namenode has the option of choosing the replica closest to the client.

1.1.4 Data reads, writes and deletes

• Data is written to HDFS in a pipelined fashion. When a client writes data to a file, it is first accumulated in a local file; when it amounts to a full block of user data, the client contacts the namenode for a list of datanodes to hold the replicas of this block. The block of data is then flushed to the first datanode in 4 KB streams. The first datanode writes this data locally and forwards it to the next datanode in the list. This process continues until the last node in the list has written the whole block; it then notifies the namenode and returns the block map. Thus the whole process is pipelined and parallel.

• When the namenode receives a read request, it tries to fulfill it by selecting a replica closest to the client to save bandwidth and reduce response time. A same-rack replica is preferred whenever one exists. The information about the network topology is fed to the system via rack awareness scripts.

• During file deletes, the data is only marked as deleted; it is not removed from the datanodes immediately but is moved to the /trash folder. The data can be restored as long as it is in the /trash folder (this time window is configurable). Once it crosses the configured time limit, the data is deleted and all its blocks are freed.

1.1.5 Map-Reduce

A map-reduce program consists of three main phases: map, shuffle and reduce. The user specifies the map and reduce functionality via an API and submits the executable to the processing framework. A map-reduce job takes a file or a set of files stored in HDFS as input, and the map function is applied to each block of input (in reality, the map function is applied to a FileSplit, which can span multiple blocks; for simplicity we use FileSplit and block interchangeably). Each instance of the map function takes (key, value) pairs as input and emits a new set of (key, value) pairs as output, and the framework shuffles all pairs with the same key to a single machine. This is called the shuffle phase. The user can control which keys go to the same machine via a partitioner function that can be plugged into the executable. All the (key, value) pairs shuffled to the same machine are then sorted, a reducer function is applied to the whole group, and the output emitted by the reducers, a new set of (key, value) pairs, is written to HDFS. The intermediate data of each map phase is sorted, merged in multiple rounds and

written to the disk local to the map execution. The following equations outline the map and reduce phases, and the whole job flow is summarized in figure 1.4 [8].

map(k1, v1) → list(k2, v2)    (1.1)

reduce(k2, list(v2)) → list(v2)    (1.2)

Figure 1.4 Map Reduce Architecture
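To make the (key, value) contract above concrete, here is a minimal word-count mapper and reducer in the style of Hadoop Streaming, written in Python. The script layout, the word-count task and the command-line dispatch are illustrative assumptions, not part of the thesis.

import sys
from itertools import groupby

def mapper(lines):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the input split
    for line in lines:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer(lines):
    # reduce(k2, list(v2)) -> list(v2): the framework delivers the pairs sorted by key,
    # so consecutive lines with the same key form one reduce group
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        print("%s\t%d" % (key, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    # invoked by Hadoop Streaming as the -mapper or -reducer command
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)

The sketch only mirrors the data flow; in Hadoop the sorting, grouping and shuffling between the two functions are done by the framework.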

Following are the salient features of the map-reduce framework.

• The map-reduce programming framework lets programmers write code as if for a single machine, while the framework takes care of distributing the logic and scaling it to thousands of machines

• Users write the logic for the map and reduce functions and can control various components of the framework, such as file splitting, secondary sorting, the partitioner and the combiner, via pluggable components

• Map-Reduce is highly fault tolerant in the sense that failed tasks (map, reduce or a subset of them) do not stop the whole job. Only the tasks that failed are restarted and the job is resumed. This saves a lot of time for jobs on large amounts of data. The framework can be made to work through both namenode and datanode failures

• One more notable feature of the map-reduce framework is task localization. The task scheduler always tries to reduce network IO by assigning map tasks as close to the input splits as possible. Setting the split size and the HDFS block size to the same value makes this even easier and gives 100% task localization

• The scale at which map-reduce programs work is enormous and has been shown to reach tens of thousands of nodes. This brings a very high degree of parallelism to data processing, resulting in faster throughput

1.2 Hive

Hive [2] is a data warehouse infrastructure built on top of Hadoop. It provides tools to perform offline batch-processing ETL tasks on structured data of petabyte scale. It provides an SQL-like query interface called Hive-QL through which the user can query the data. Since Hive is built on top of Hadoop, users can expect a latency of a few minutes even on small amounts of data, as Hadoop takes time to submit the jobs, schedule them across the cluster and initialize their JVMs. This prevents Hive from being used as an OLTP engine; it also cannot answer real-time queries or perform individual row-level updates as in a normal relational database. The functionality of Hive can be extended using user defined functions (UDFs). The notion of UDFs is not new to relational databases and has been in practice for a long time. In Hive we can also plug in our own custom mapper and reducer scripts to perform operations on the results of the original query. These functionalities of Hive, along with its ability to scale to tens of thousands of nodes, make it a very useful ETL engine.

1.2.1 Hive Anatomy

Hive stores tables in the warehouse as flat files on HDFS, and they can be partitioned based on the value of a particular column. Each partition can be further bucketed based on other columns (other than the one used for partitioning). A query executed by a user is parsed and converted into an abstract syntax tree (AST) where each node corresponds to a logical operator, which is then mapped to a physical operator. Figure 1.5 depicts the Hive architecture.

1.2.2 Joins in Hive

As with most relational database systems, executing a join in Hive is costlier than the other operators in terms of both query execution time and resource consumption. This problem is more significant in Hive because the data is sharded and distributed across a network, and performing a join requires matching tuples to be moved from one machine to another, resulting in a lot of network IO overhead. Hive implements joins over the map-reduce framework and supports the following three types of joins.

• Common Join : Common join is the default join mechanism in Hive and is implemented as a single map-reduce job. It can be thought of as a distributed union of cartesian products. Suppose

we are joining table A on column a with table B on column b, and d_i ranges over the distinct values appearing in the join columns a and b. The common join operator can then be described by the following equation, where A_{a=d_i} is the set of all rows of A whose column a has the value d_i:

\bigcup_{i=0}^{n} ( A_{a=d_i} \bowtie B_{b=d_i} )

The data of both tables is read in the mappers and the rows are shuffled across the network in such a way that all rows with the same join key reach the same machine. To identify the table a row belongs to, rows are tagged during the map phase and differentiated in the reduce phase according to these tags. A cartesian product is then applied to the rows of both tables that share the same join column value, and the output is written to disk for further processing (a sketch of this reduce-side procedure is given at the end of this section).

Figure 1.5 Hive Architecture

• Map Join : Map join is an optimization over the common join and is used when one of the tables is very small and can fit in the main memory of the slaves. In a map join, the smaller table is copied into the distributed cache before the map-reduce job and the larger table is fed to the mappers. The larger table is streamed row by row, the join is done with the rows of the smaller table, and the results are written to disk. Map join eliminates the need for the shuffle and reduce phases of the map-reduce job, which makes it very fast compared to the other join types

• Bucket Join : Bucket map join is a special case of map join in which both the tables to be joined are bucketed (all the rows for a given column value are stored in a single place). The larger table is given as input to the mappers; for each value of the join column, the corresponding buckets of the smaller table are fetched and the join is performed. This is an improvement over map join in the sense that, instead of copying the whole smaller table, only the required buckets are copied to the mappers of the larger table.

All the join conditions in the parse tree are converted into operators corresponding to one of the common, map or bucket joins. The user can provide hints about the sizes of the tables as part of the query, and this information is used at query processing time.
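The following is a small Python sketch of the reduce-side common join described above, in the same Hadoop Streaming style as the earlier example. The tagging and row formats are illustrative assumptions; Hive's actual implementation performs the tagging, sorting and grouping inside the map-reduce framework itself.

import sys
from itertools import groupby

def join_mapper(lines, table_tag, join_col):
    # Emit (join key, tag, row) so the reducer can tell the two tables apart.
    for line in lines:
        row = line.rstrip("\n").split(",")
        print("%s\t%d\t%s" % (row[join_col], table_tag, ",".join(row)))

def join_reducer(lines):
    # All triples with the same join key arrive at the same reducer, sorted by key.
    parsed = (line.rstrip("\n").split("\t", 2) for line in lines)
    for key, group in groupby(parsed, key=lambda t: t[0]):
        rows_a, rows_b = [], []
        for _, tag, row in group:
            (rows_a if tag == "0" else rows_b).append(row)
        # Cartesian product of the matching rows of A (tag 0) and B (tag 1).
        for ra in rows_a:
            for rb in rows_b:
                print("%s,%s" % (ra, rb))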

1.3 Query Optimization in Databases

In recent times, SQL has become the de-facto standard for writing analytical queries over data. The process of query optimization takes an SQL query, builds a basic query plan, applies some transformations and gives a query execution plan (QEP) as output. The transformations applied to the query plan depend on the logic of the query optimizer. In general, the query optimizer first determines the logical query plans, which are generic, and then decides on the physical operators to be executed. Overall, the procedure of query optimization can be broken down into the following steps, shown in figure 1.6 [10].

• Generating the search space for the query

• Building a cost model for query execution

• Devising a search strategy to enumerate the plans in the search space

Search Space : There exist many ways of executing the same query, and query optimizers consider only a subset of the possible plans, assigning a cost to each and picking the best one for execution. Since this problem has been proved to be NP-hard [22], heuristics are applied to reduce the search space and prune out non-optimal plans. The whole set of plans that a query optimizer considers when picking the best plan is called the search space of that optimizer. A lot of research exists on the search space of query optimizers, ranging from left-deep trees [14, 25, 18] to bushy trees [27]. Each plan space has its own merits and drawbacks, but finding the truly optimal plan has been proved to be NP-hard and is not feasible for optimizers.

Cost Model : Since we consider a search space and select the best plan from it, we need a function that quantifies the cost of executing a plan in terms of known parameters of the cluster. We minimize or maximize some objective function based on the costs given out by the cost model. This cost model relies on (i) the statistics on the relations and indexes, (ii) the formulae to estimate the selectivity of various predicates and the sizes of the output of each operator in the query plan, and (iii) the formulae to estimate the CPU and IO cost of every operator in the query plan. The statistics on the tables include various facts about the actual data, such as the number of rows, the number of disk pages of the relations and indexes, and the number of distinct values of a column. Query optimizers use these statistics to estimate the selectivity factor of a given query or predicate, i.e., the number of rows that actually qualify the predicate. The most well known way of doing this is by using histograms [35, 31, 23]. Using statistics, query optimizers predict the overall cost of executing a query, which mainly includes CPU, IO and network costs. Estimating these costs for a query operator is non-trivial, since it must take into consideration various system properties and internal implementation details, such as the flow of data from cache to buffers to disk [14]. Other factors, like the number of concurrently running queries and the available buffer space, also affect the cost values. Many detailed IO cost models have been developed to estimate the cost of IO operations such as seek, latency and data transfer [21, 20].

Search Strategy : Various approaches have been studied for searching a given space of plans. The dynamic programming algorithm proposed in [37] is an exhaustive search strategy that enumerates the query plans in a bottom-up manner and prunes expensive plans based on the cost formulae. Work has been done on heuristic and randomized optimizations for walking through the search space [40, 39], and much work compares various search strategies such as left-deep, right-deep and bushy trees [27, 40, 41, 24]. Some query optimizers employ a dynamic query optimization technique where the decisions about the query operators to be used and their physical algorithms are taken at run time. In this case, the optimizer designs a decision tree which is evaluated at run time. This type of plan enumeration is more suited to top-down architectures like Volcano [19, 17].

Figure 1.6 Block diagram of a query optimizer

1.4 Problem statement and scope

In this thesis, we design and implement a cost-based query optimizer for join queries over map-reduce that modifies the naive query plans produced by these translators, based on statistics collected about the data. Our query optimizer calculates the query execution cost of the various possible plans for a given query, based on the statistics and our cost formulae, and then chooses the plan with the least cost to execute on a cluster of machines. It uses a linear combination of the communication, IO and CPU costs involved in query execution to compute the total cost. Our query optimizer considers a new plan space of plans that suit the highly parallelizable map-reduce framework on huge datasets. It then follows an optimized approach to assigning tasks to the slave machines, which minimizes the total network shuffle of data and also distributes data processing evenly among the slaves to increase overall query throughput.

1.5 Contributions

Contributions of the thesis can be summed up as follows.

• Proposed a statistics-based approach for optimizing SQL queries on top of map-reduce. We borrowed this approach from existing query optimization techniques in relational database systems and extended it to work for the current use case.

• Extended the traditional plan space to include bushy trees, which leverage the massively parallel architecture of map-reduce systems. We explore a subset of the bushy plan space that provides parallelism and exploits the massively parallel map-reduce framework.

• Designed cost formulae to predict the runtime of joins before executing the query. We use these cost formulae to find the optimal query plan from the above plan space.

• Designed and implemented a new task scheduler for map-reduce that minimizes the overall network IO during joins. This scheduler is based on our statistics store and converts the task assignment problem into a max-flow min-cut formulation. Our experiments showed up to a 33% reduction in shuffle data size.

• Our experiments show that this query optimizer runs join queries up to 2 times faster on TPCH datasets of scales 100 and 300, on both TPCH benchmark queries and synthetic queries. We ran various SQL queries with select, project and join predicates and tested all our approaches by building a distributed statistics store on the dataset.

1.6 Organization of thesis

The rest of the thesis is organized as follows. Chapter 2 discusses related work, and chapter 3 discusses in depth the query plan space we deal with and its advantages. Chapter 4 describes the statistics engine and the cost formulae required to evaluate joins in the map-reduce scenario, as well as our scheduling algorithm for map-reduce. We then discuss the results of our research in chapter 5 and conclude the thesis with our observations and possible directions for future work in chapter 6.

Chapter 2

Related Work

2.1 Related Work

A lot of research has gone into the field of query optimization in databases, both stand-alone and distributed. Many techniques to estimate the cost of query plans using statistics have been proposed. The most popular query optimizer, System R from IBM, was extended to System R* [30] to work for distributed databases. We extend the ideas from these systems to join optimization over map-reduce based systems. Many heuristics have been developed to reduce the plan space for joins. Not much work has gone into optimizing SQL queries over map-reduce using traditional database techniques. However, work has been done on improving the runtime of map-reduce based SQL systems by reducing the number of map-reduce jobs per query and removing redundant IO during scans, in [28] and [43].

In [28], Lee et al. developed a correlation-aware translator from SQL to map-reduce which considers various correlations among the query operators to build a graph of map-reduce operators instead of a one-operation-to-one-job translation. This reduces the overall number of map-reduce jobs required to run a query and also minimizes scans of the datasets by exploiting correlations. In [43], Vernica et al. worked on improving set-similarity joins on the map-reduce framework. The main idea is to balance the workload across the nodes in the cluster by efficiently partitioning the data.

Afrati et al. worked on optimizing chain and star joins in a map-reduce environment and on optimizing shares between map and reduce tasks [13]. They also consider special cases of chain and star joins to test their theory of optimal allocation of shares, and they observed that their approach is useful in star schema joins where a single large fact table is joined with a large number of small dimension tables. Work has been done on theta-joins over map-reduce [32] using randomized algorithms, and [12] addresses the special case of fuzzy joins over map-reduce and quantifies its cost. Google has also implemented a framework called Tenzing [29] which has built-in optimizations for improving the runtime of queries in a map-reduce environment, using techniques such as sort avoidance, block shuffle and

local execution, among others. These techniques result in efficient execution of joins at query runtime: depending on various parameters, such as the sizes of the relations and the volume of shuffle data, the operators are scheduled to execute on the nodes.

Chapter 3

Query Plan Space

3.1 Overview of Query Planspace - An Example

Let us consider a simple query with three joins over four relations A, B, C and D, as follows.

select * from A,B,C,D where A.a = B.b and B.b = C.c and C.c = D.d

Figure 3.1 lists two possible join orderings for the query.

Figure 3.1 Possible join orderings

The plan on the left, (((A join B) join C) join D), is a left-deep plan considered by the Hive optimizer for execution, whereas the plan on the right, ((A join B) join (C join D)), is a bushy tree. Let us assume, for the purposes of this analysis, that the sizes of the intermediate relations are large and map join is not a possible operator for the plan execution, so figure 3.1 is also the operator tree, where CJ represents a Common Join operator. The left-deep plan is serial by nature and must be executed level after level, whereas bushy trees are inherently parallel and the left and right children of the root node can be executed in parallel.

Suppose the above query is executed using Hive on a Hadoop cluster. Hive chooses the left-deep plan, which is broken down into 3 joins as follows:

A common join B -> result: temp1
temp1 common join C -> result: temp2
temp2 common join D -> result: temp3 (final output)

The entire query execution is serial and there will be three map-reduce jobs corresponding to the three common join operators. Relations A and B are joined and the result is stored in a table temp1 by the first map-reduce job, and the rest of the map-reduce jobs must wait until the entire result temp1 is written to HDFS. Given this scenario, two possible cases can occur.

• Case 1 : The map-reduce job takes up all the slots of mappers and reducers in the cluster or

• Case 2 : The job takes up a subset of map and reduce slots in the cluster

In both cases there is underutilization of cluster resources, because mapper slots remain idle until the job completes its shuffle and reduce phases; this is due to the restriction of the left-deep plan that the entire procedure is serial. However, in case 2 the underutilization is more pronounced, because many slots remain idle right from the beginning of the job. Considering the example query above, the map-reduce job corresponding to the join (temp1 join C) cannot start even though there are free mapper slots in the cluster. This increases the query runtime, and most of the machines remain idle until the whole job is completed. The query execution timeline is depicted in figure 3.2. It is clear that none of the phases overlap and everything is perfectly sequential; from t = a to t = c all the map slots in the cluster and the CPU remain idle.

Suppose the bushy tree execution plan is considered; it can be broken down into 3 joins as follows:

A common join B -> result: temp1
C common join D -> result: temp2
temp1 common join temp2 -> result: temp3 (final output)

Even in this execution plan, the two cases discussed above are valid. Suppose the two map-reduce jobs corresponding to (A join B) and (C join D) are launched in parallel, and the map-reduce job for (A join B) is executed first by the default FIFO scheduler of Hadoop.

Figure 3.2 Time line for the execution of left deep QEP

In case 1, the tasks corresponding to (C join D) will be waiting in the queue, since there are no slots available for them to be launched. As soon as the mappers corresponding to the job (A join B) complete, tasks belonging to the job (C join D) get launched. This increases cluster resource utilization and also increases intra-query parallelism. Case 2 is even simpler and faster, because both map-reduce jobs run in parallel due to the abundance of resources, which improves performance. The query execution timeline for this QEP is depicted in figure 3.3; we can clearly see the overlap in task assignment. The figure represents the worst-case task assignment, where each job has to wait for all the reducers; in reality the situation is better, because some reducers complete quickly and give way to others. Given the benefits of bushy tree parallelization, we might be tempted to say that bushy trees are always more efficient than left-deep trees. Though this is partially true, the plan space of bushy trees is very large, and exploring the whole of it takes a lot of time for the optimizer, sometimes much more than the query runtime. This is the reason most modern optimizers do not consider bushy trees in their search space and still rely on left-deep or right-deep trees. In the rest of this chapter, we describe three different algorithms that build bushy trees for joins in map-reduce, and we explore their plan space in detail.

Figure 3.3 Time line for the execution of bushy Query plan

3.2 Exploring Bushy Planspace

In this section we present 3 novel approaches for building bushy trees that benefit join queries over the map-reduce framework. We explore a subset of the bushy plan space called n-balanced trees, defined as follows.

n-fullybalanced tree: An n-fullybalanced tree is a perfectly balanced binary tree with n leaf nodes. This forces n to be a power of 2. An example of a 4-fullybalanced tree is shown in figure 3.4.

n-balanced tree: An n-balanced tree is obtained by replacing the left-most leaf node of an n-fullybalanced tree with another n-fullybalanced tree and repeating this procedure with the newly obtained tree.

18 Figure 3.4 An example of 4-fullybalanced tree

However, the condition that the first tree is n-fullybalanced is relaxed in the case of query plans, depending on the number of joins. In the resulting n-balanced tree, all the internal nodes are join operators and all the leaf nodes are relations.

An example of 4-balanced tree is shown in the figure 3.5. The number of levels in the tree is decided by the number of joins in the query.

Figure 3.5 An example of 4-balanced tree

The rationale behind selecting n-balanced trees is that, at any level during execution, there can be n/2 parallel map-reduce jobs (unlike 1 for left-deep trees) executing in the cluster. Choosing the value of n carefully gives very good resource utilization in the cluster and also increases query performance. On the other hand, too much parallelization can be an overkill for the query, since the waiting queue becomes very long, a non-FIFO scheduler such as the fair scheduler cannot meet the requirements, and there will be too many task context switches. For example, figures 3.6 and 3.7 show two possible execution plans for a query with 8 tables; figure 3.6 is an 8-balanced tree and figure 3.7 is a 4-balanced tree.

Figure 3.6 8-balanced tree
Figure 3.7 4-balanced tree

We ran both query plans on the cluster with the fair scheduler configured, and their runtimes were 923 seconds and 613 seconds respectively. Since we are running on a 10 node cluster with large input data, it cannot run 4 map-reduce jobs at a single time, so the jobs wait in the queue until they have sufficient slots to complete. In the 4-balanced tree, since only 2

jobs run at a single time, the waiting time is less and the job completes quickly. Carefully choosing the value of n also reduces the plan space of bushy trees, and this solution has the benefits of both parallelization and higher query performance, as we see in the later sections. We start by explaining the algorithm FindMin, which builds an n-fullybalanced bushy tree to be executed; FindRecMin and FindRecDeepMin then explore the n-balanced bushy plan space.

3.2.1 Finding minimum cost n-fully balanced tree - FindMin

In this section, we describe an algorithm, FindMin, that builds an n-fullybalanced bushy tree. We build the tree level by level, finding a minimum at each level. The algorithm takes as input a query tree Q corresponding to the parsed SQL statement given by the user and the value n, and gives an n-fullybalanced join operator tree as output. The algorithm is described below.

Algorithm 1: FindMin to build an n-fullybalanced bushy tree
Input : A query tree Q corresponding to the parsed SQL statement
Output: An operator tree to be executed
 1 begin
 2     J ← getJoinTables(Q)
 3     s ← sizeOf(J)
 4     while s ≠ 1 do
 5         x ← getPowerOf2(n)      // fetches the power of 2 less than n
 6         P ← selectMinPairs(x)   // selects x/2 pairs of joins with least cost
 7         for y ∈ P do
 8             Remove tables in y from J
 9             Add y to J
10         end
11         s ← sizeOf(J)           // update the size of J
12     end
13     return makeOperatorTree(J)
14 end

The heart of the algorithm is the call to selectMinPairs(x), where we select the top x/2 pairs from the table list based on the cost functions we define in the next chapter. We then remove the individual tables in those minimum pairs and add each pair as a whole. This is the greedy step of the algorithm: the local minimum cost may not lead to a global minimum cost for the whole query tree, but we proceed assuming it does. An example run of this algorithm on a query with 8 joins (9 tables) to build 4-balanced trees is shown in figure 3.8.

Figure 3.8 Running Algorithm1 on 9 tables

At each step, the bracketed tables are the minimum cost x/2 join pairs. Once they are bracketed, the whole bracket is considered as a single table for the subsequent step and each bracket converts to a join in the query tree.
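To illustrate the greedy pairing step, here is a small Python sketch of the idea behind FindMin; the flat list-of-tables representation and the join_cost callable are assumptions made for this example, whereas the actual implementation works on Hive operator trees with the cost formulae of chapter 4.

from itertools import combinations

def find_min(tables, n, join_cost):
    # tables: list of relations; join_cost(x, y) estimates the cost of joining two
    # (possibly already combined) inputs. Returns a nested-tuple join tree.
    items = list(tables)
    while len(items) > 1:
        pairs_wanted = max(1, min(n, len(items)) // 2)   # at most n/2 joins per level
        used, chosen = set(), []
        # Greedily pick the cheapest disjoint pairs for this level.
        for i, j in sorted(combinations(range(len(items)), 2),
                           key=lambda ij: join_cost(items[ij[0]], items[ij[1]])):
            if len(chosen) == pairs_wanted:
                break
            if i not in used and j not in used:
                chosen.append((i, j))
                used.update((i, j))
        # Each chosen pair becomes a join node; untouched items move up unchanged.
        items = ([(items[i], items[j]) for i, j in chosen]
                 + [it for k, it in enumerate(items) if k not in used])
    return items[0]

# Example with a toy size-based cost on four relations
sizes = {"A": 10, "B": 200, "C": 15, "D": 300}
def total_size(t):
    return sizes[t] if isinstance(t, str) else sum(total_size(x) for x in t)
print(find_min(["A", "B", "C", "D"], n=4,
               join_cost=lambda x, y: total_size(x) + total_size(y)))
# prints (('A', 'C'), ('B', 'D'))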

3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin

The algorithm described in this section generates n-balanced trees. It takes the query plan Q corresponding to the parsed SQL query and a value n as input, and gives the join operator tree as output. The algorithm considers all possible combinations of size n at each level; once it finds a minimum-cost combination it finalizes it and reflects it in the final result, even though this local minimum combination of size n may not lead to a global minimum. An example run of this algorithm on 8 joins (9 tables) is shown in figure 3.9. At each stage of bracketing, all possible n-combinations are considered and the combination with the least execution cost is shown in the figure. Not all combinations are shown in the figure due to space constraints.

3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin

We now describe an approach to generating n-balanced trees that is similar to algorithm FindRecMin; however, instead of finalizing a local minimum of size n, we recursively go up the order to see if this

Algorithm 2: FindRecMin algorithm for building n-balanced trees
Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
 1 begin
 2     J ← getJoinTables(Q)
 3     breakFlag ← True
 4     while breakFlag do
 5         s ← sizeOf(J)
 6         comb ← n
 7         if s

Figure 3.9 Running Algorithm 2 on 9 tables

n-combination gives a global minimum. The algorithm is described in Algorithm 3. An example run of this exhaustive algorithm is similar to the example run of Algorithm 2 in figure 3.9, except that for every bracketing another recursive call is applied if it gives a global minimum. So we might get a set of bracketings different from those in the figure, but the overall shape of the tree remains the same. We analyze the runtime, search space and performance of each of these algorithms in detail in the results chapter.

3.3 Choosing the value of n for an n-balanced tree

Choosing the value of n is not as trivial as it seems. Setting a wrong value may not produce desirable results, as one of the following two cases may occur.

• A very small value of n might leave the cluster underutilized: there will be empty task slots in the cluster but no more jobs that can be run in parallel

• A very large value of n might keep many jobs in the waiting queue and also bring down cluster efficiency through too much multitasking on the nodes

These problems can be avoided with a few simple techniques. One way is to try a few values and fix the value that gives the best results. This takes experimentation and may take some time before

Algorithm 3: FindRecDeepMin algorithm for building n-balanced trees
Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
 1 begin
 2     J ← getJoinTables(Q)
 3     breakFlag ← True
 4     while breakFlag do
 5         s ← sizeOf(J)
 6         comb ← n
 7         if s

we find a suitable value. Another way is to make an estimate by calculating the maximum capacity of the cluster as follows.

1. Get the total number of map slots in the cluster (M). This can be done by summing up the map slots on each machine of the cluster

2. Get the average input size for a single map-reduce job (I). This is twice the average table size (since two tables are involved per join)

3. Get the average number of map tasks per map-reduce job (A) by dividing the average input size (I) by the block size (B)

4. Now a good estimate of n is a power of 2 closest to the value of M/A

This estimate of n can be further improved based on the past history of queries, yielding an average number of jobs that can run at any time. The number of reduce tasks is not involved in the computation, since the map tasks are the ones that define the size of the input and the number of reducers is generally constant per job.
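A short Python sketch of this estimate follows; the helper name and the sample numbers are assumptions made for illustration.

def estimate_n(map_slots_per_node, num_nodes, avg_table_size, block_size):
    M = map_slots_per_node * num_nodes       # total map slots in the cluster
    I = 2 * avg_table_size                   # average input of one join job (two tables)
    A = max(1, I // block_size)              # average number of map tasks per job
    ratio = max(1, M // A)
    n = 1
    while n * 2 <= ratio:                    # largest power of 2 not exceeding M/A
        n *= 2
    return n if (ratio - n) <= (2 * n - ratio) else 2 * n   # power of 2 closest to M/A

# Example: 10 nodes with 8 map slots each, 2 GB average table size, 256 MB blocks
GB, MB = 1024 ** 3, 1024 ** 2
print(estimate_n(8, 10, 2 * GB, 256 * MB))   # prints 4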

Chapter 4

Cost Based Optimization Problem

All the algorithms described in the previous chapter assume that we have a cost estimator for the join operators over the map-reduce framework. In this chapter we describe in detail:

1. Estimating the cost of the operators in terms of disk IO operations, CPU usage and network movement of data and

2. Optimizing the data movement across machines to reduce the network IO

To give accurate cost estimates, the majority of traditional relational database systems rely on histograms of the data. Histograms give an overview of the data by partitioning its range into a set of bins. The number of bins and the partitioning method determine the accuracy of histograms, and this has been studied in depth. Traditionally, two types of histograms have been in common use [36]:

• Equi-width histograms : Equi-width histograms, or frequency histograms, are the most common histograms used in databases; the value range is divided into n buckets of equal size. Values are bucketed based on the range they fall in, and the number of rows per bucket depends on the data distribution.

• Equi-depth histograms : In equi-depth histograms the number of rows per bucket is fixed and the size of each bucket depends on the data distribution.

Each of these techniques has its own advantages and disadvantages. We borrow the idea of using histograms for accurate cost estimation from relational databases and extend it to build our own distributed statistics store and cost estimator functions on top of it. The rest of this chapter is organized as follows: first we describe our distributed statistics store and the methods it exposes to the query optimizer, and then we describe the cost formulae for each of the join operators based on these methods.
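As a concrete illustration of the equi-depth idea, the following Python sketch builds a per-column equi-depth histogram and uses it for a coarse selectivity estimate; the class and method names are assumptions for this example and do not reflect the thesis' actual statistics store API.

import bisect

class EquiDepthHistogram:
    # Every bucket holds roughly the same number of rows; the bucket boundaries
    # are taken from the sorted column values.
    def __init__(self, values, num_buckets):
        data = sorted(values)
        self.total_rows = len(data)
        self.rows_per_bucket = self.total_rows / float(num_buckets)
        self.boundaries = [data[min(self.total_rows - 1, int((i + 1) * self.rows_per_bucket) - 1)]
                           for i in range(num_buckets)]   # boundaries[i] = upper bound of bucket i

    def estimate_le(self, v):
        # Count the buckets whose upper bound is below v, plus half of the bucket
        # that v falls into (a common coarse assumption).
        full = bisect.bisect_left(self.boundaries, v)
        partial = 0.5 if full < len(self.boundaries) else 0.0
        return min(self.total_rows, (full + partial) * self.rows_per_bucket)

    def selectivity_le(self, v):
        return self.estimate_le(v) / self.total_rows

# Example: estimate the selectivity of "col <= 42" over a uniform 0..99 column
hist = EquiDepthHistogram(values=range(100), num_buckets=10)
print(hist.selectivity_le(42))   # 0.45, against an exact selectivity of 0.43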

4.1 Distributed Statistics Store

We designed and implemented a statistics store distributed across the machines in the Hadoop cluster. Each node maintains an equi-depth histogram for the data local to that machine. These histograms are used for cardinality estimations local to that site. Methods have been written to compute these histograms from the data during map-reduce jobs. Since the data is likely to be updated, we update the histograms too, but not by reading all the data again: we plug in our code in such a way that when modified data is read as part of other map-reduce jobs, the histograms also get updated. Though this might slightly increase the runtime of that query, we do not need to put extra load on the cluster by re-reading the whole data, and our experiments show that this slight increase in runtime is not significant. Initial statistics can be computed while loading the data into the cluster or by separately running a job that reads the whole data and updates all the local histograms. We further consolidate the distributed statistics to maintain the global per-table statistics that we use in our cost model. These histograms are serialized to disk as tables in MySQL, and APIs have been written on top of them so that the query optimizer can fetch cardinality estimates. The query optimizer uses these APIs to get the cardinality estimates needed to calculate the cost of executing a query. The architecture is summarized in figure 4.1.

Figure 4.1 Statistics store architecture

4.2 Cost Formulae

In this section we describe the cost formulae we use for estimating the cost of executing the join operators on top of the map-reduce framework. The actual procedure inside the map-reduce framework is quite complex; its implementation performs multiple scans of disk while shuffling and sorting. For example, with the Hadoop framework there are many settings the user can tune, and a small tweak can greatly affect the performance of a map-reduce job. One such example is io.sort.mb, the amount of buffer memory used by the map task JVM to sort the incoming streams and spill them to disk; tuning this parameter has been shown to greatly improve job performance, depending on the size of the map output data. There are many such knobs, set in the job configuration classes, that can be tuned on a per-job basis and that can greatly affect the job execution time. However, for the problem of query optimization it is sufficient if our cost model predicts a runtime proportional to the actual execution time. Our cost model divides the whole join map-reduce job into three phases, map, shuffle and reduce, evaluates the cost of each phase separately, and adds them up to obtain the whole cost. We now describe each of these phases in detail and the cost formulae we use to predict their runtime.

4.2.1 Join map phase

In the map phase, each HDFS block is read and parsed with an appropriate RecordReader implementation, and the select predicates are applied to prune unnecessary rows. The rows are divided into partitions based on the join column value and are spilled to disk whenever memory fills up. So, once all the data is read, there may be multiple spills on disk, and these are merged in multiple passes. This whole process is complex and involves multiple round trips of data between disk and memory. However, we identified the parts of this phase that incur the most cost (in terms of runtime) and included them in our cost analysis. They are as follows.

1. Reading the whole data block from HDFS through network IO. Even though the data is local, HDFS uses sockets to read the data. If the data is not local, it is read from a remote machine; however, HDFS tries to maintain data locality most of the time by reading local replicas, which speeds up the whole process. The cost for this step is BLOCK_SIZE / R_hdfs

2. Once the data is read, the selectivity filters are applied and the remaining data is written back to disk so that it can be merged later. The cost of spilling this to disk is (BLOCK_SIZE * selectivity_factor) / W_local_write. We calculate the selectivity factor based on the selection predicates in the input query

3. All the written data is then read back to do an in-memory merge in multiple rounds. The cost of doing this is (BLOCK_SIZE * selectivity_factor) / R_local_read

4. All the merged data is written back to the local disk so that it can be read in the reduce part. We include this cost as (BLOCK_SIZE * selectivity_factor) / W_local_write

BLOCK_SIZE           Configured block size in HDFS
R_hdfs               Read throughput from HDFS
selectivity_factor   Selectivity factor of the select predicates in the query for that block
R_local_read         Read throughput of the local machine where the map task runs
W_local_write        Write throughput of the local machine where the map task runs

Table 4.1 Notations for cost formulae - Map phase

Summing up all the costs, the total cost of the map phase can be written as follows.

BLOCK_SIZE / R_hdfs + (BLOCK_SIZE * selectivity_factor) / W_local_write
    + (BLOCK_SIZE * selectivity_factor) / R_local_read
    + (BLOCK_SIZE * selectivity_factor) / W_local_write

The values of R_hdfs, R_local_read and W_local_write are computed beforehand by running a few experiments on the cluster, and BLOCK_SIZE is the user-configured block size. These are computed only once, provided the cluster is not changed by adding or removing nodes. The value of selectivity_factor is calculated from the distributed statistics using the selection predicates of the query. The formula above assumes that all the spills are merged in a single pass; this is valid for join map-reduce jobs, since not much extra data (apart from the block content scaled by the selectivity factor) is written to disk, and current memory sizes are large enough that the io.sort.mb setting (the Hadoop setting controlling the heap available for this merging process) can be configured appropriately to speed up this process. Since this cost function is tailored to the map phase of join jobs, the necessary modifications should be made before extending it to other map-reduce jobs. Multiple such map tasks run on each node in parallel, in batches. Since the total map time of the job is limited by the node with the last and slowest map task, we take the total map time to be the time taken to complete the last map task on the slowest node.
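A direct Python transcription of the map-phase cost formula above; the throughput figures in the example are made-up assumptions, whereas in the thesis they are measured on the cluster.

def map_phase_cost(block_size, sel_factor, r_hdfs, w_local_write, r_local_read):
    # Read the block from HDFS, spill the filtered rows, read them back for the
    # merge, and write the merged output for the reduce phase to pick up.
    return (block_size / r_hdfs
            + (block_size * sel_factor) / w_local_write
            + (block_size * sel_factor) / r_local_read
            + (block_size * sel_factor) / w_local_write)

# Example: 256 MB block, 40% of the rows survive the select predicates,
# 100 MB/s HDFS reads, 80 MB/s local writes, 120 MB/s local reads
MB = 1024 ** 2
print(round(map_phase_cost(256 * MB, 0.4, 100 * MB, 80 * MB, 120 * MB), 2))   # about 5.97 seconds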

4.2.2 Join shuffle phase

A reducer starts copying data after a configured number of map tasks are complete. We quantify the total time taken to complete the shuffle phase for a reducer running on machine k that has been assigned partition i as follows; the notations are in table 4.2. The value of NR_jk is calculated beforehand by performing a series of experiments that transfer data from each node to every other node. SP_ij can be calculated easily from the histograms and a given partition assignment.

Σ_{j ∈ nodes} SP_ij / NR_jk

SP_ij: size of partition i on machine j
NR_jk: network read throughput of data on machine j from machine k

Table 4.2 Notations for cost formulae - Shuffle phase
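To illustrate the shuffle estimate, the sketch below computes the copy time for one reducer from the per-partition sizes SP and the measured node-to-node throughputs NR; the matrices here are small hypothetical values, not measurements from the thesis cluster.

def shuffle_cost(SP, NR, partition_i, machine_k):
    # Time for the reducer on machine_k to pull partition_i from every
    # node j: sum_j SP[i][j] / NR[j][k].
    return sum(SP[partition_i][j] / NR[j][machine_k]
               for j in range(len(NR)))

# Toy example: two machines, two partitions (sizes in MB, throughputs in MB/s).
SP = [[120.0, 30.0],
      [10.0, 200.0]]
NR = [[1000.0, 100.0],
      [100.0, 1000.0]]
print(shuffle_cost(SP, NR, partition_i=0, machine_k=1))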

4.2.3 Join Reducer phase

In the reduce phase, all the shuffled data is read and the actual join of the tables is performed. The result is then written to HDFS so that the subsequent join task can read it. Since HDFS replicates each block to multiple nodes, this last step is limited by the HDFS write throughput. So we estimate the total cost of the reduce phase to be the sum of the time taken to read the whole partition data (SP_i) into the reducer's memory (after writing it to local disk at the end of the shuffle) and the time taken to write the result (result_j) of the join back to HDFS from reducer j. The equation can be written as follows

SP_i / W_write_local + SP_i / R_read_local + result_j / W_write_hdfs

W_write_local and W_write_hdfs are the write throughputs of the local machine and of HDFS respectively. We estimate the value of result_j using the global histograms as follows.

result_j = Σ_{keys k ∈ partition i} (sel_factor(k, join_table1) ∗ size(join_table1)) ∗ (sel_factor(k, join_table2) ∗ size(join_table2))

For map joins, the cost of scheduling and shuffle is automatically set to 0, and the mapper cost includes writing the result and excludes the cost of writing intermediate data to local disks.
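The reduce-phase estimate, including the histogram-based guess of the join result size, can be sketched as follows. The per-key selectivities, table sizes, row width and throughputs below are made-up numbers for illustration only.

def result_size(keys, sel1, sel2, size_table1, size_table2):
    # Estimated number of output rows for the keys of one partition,
    # using per-key selectivities from the global histograms.
    return sum((sel1[k] * size_table1) * (sel2[k] * size_table2)
               for k in keys)

def reduce_phase_cost(sp_i, result_mb, w_write_local, r_read_local, w_write_hdfs):
    # Partition written to and read back from local disk, result written to HDFS.
    return sp_i / w_write_local + sp_i / r_read_local + result_mb / w_write_hdfs

keys = [0, 1, 2]
sel1 = {0: 1e-4, 1: 2e-4, 2: 5e-5}           # selectivities in join table 1
sel2 = {0: 3e-4, 1: 1e-4, 2: 2e-4}           # selectivities in join table 2
rows = result_size(keys, sel1, sel2, size_table1=5e6, size_table2=8e6)
result_mb = rows * 200 / 1e6                 # assumed 200-byte output rows
print(reduce_phase_cost(512.0, result_mb, w_write_local=60,
                        r_read_local=90, w_write_hdfs=30))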

4.3 Scheduling - An example

Consider the execution plan (on a two-machine cluster) of the CommonJoin operator in MapReduce for the tables A and B in Figure 4.2, joined according to the following query

SELECT * FROM A JOIN B ON (A.a = B.b)

Both tables A and B are read by mappers, and for every row a triple of the form (join column value, table tag, row) is emitted, as shown in Figure 4.2. The tag 0 or 1 is used to classify a triple as coming from table A or B so that the join can be done in the reducers. All the triples with the same join column value are moved to the same machine in the shuffle phase. There they are separated according to their tags, a Cartesian product is taken between the two groups, and the result is written to disk. The join operator is considered costly because it involves network movement of data from one node to another, which incurs a large latency. In the above example there are the following two ways of scheduling the join keys in the reduce phase.

Figure 4.2 Query scheduling example

1. Join value x on machine 1 and y on machine 2

2. Join value x on machine 2 and y on machine 1

In case 1 the total network cost (in terms of rows) is 2 ((3,y) moves from machine 1 to 2 and (4,x) from machine 2 to 1), whereas in case 2 it is 4, twice that of case 1. Considering the amount of data the Hadoop ecosystem manages, these figures become very large for tables at terabyte scale, and such communication costs can heavily impact query performance. So the problem finally boils down to partitioning m reduce keys across n machines so as to minimize the network movement of data, as illustrated by the sketch below.
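To make the key-assignment problem concrete, the brute-force sketch below enumerates every mapping of reduce keys to machines and picks the one with the least data movement. The per-key row counts are hypothetical placeholders for the histogram statistics; the thesis solves the same problem with the flow formulation of the next section rather than by enumeration.

from itertools import product

# rows[key][machine]: number of rows for that key stored on that machine.
rows = {'x': {1: 3, 2: 1},
        'y': {1: 1, 2: 3}}
machines = [1, 2]

def network_cost(assignment):
    # Rows that must move because their key is reduced on another machine.
    return sum(cnt for key, target in assignment.items()
               for machine, cnt in rows[key].items() if machine != target)

best = min((dict(zip(rows, placement))
            for placement in product(machines, repeat=len(rows))),
           key=network_cost)
print(best, network_cost(best))   # {'x': 1, 'y': 2} with cost 2 for these counts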

4.4 Scheduling strategy

In this section we show a novel approach for scheduling a CommonJoin operator by modeling it as a max-flow min-cut problem. Consider a Hive query which joins two tables A and B on columns a and b respectively. The rest of the description assumes that there are m distinct values of A.a and B.b combined and n machines in the cluster.

We define a variable Xij as follows.

X_ij = 1 if reducer key i is assigned to machine j, and 0 otherwise. Since a key can only be assigned to a single machine,

∀i: Σ_j X_ij = 1   (4.1)

Also, we put a limit on the number of keys a reducer can process. This depends on the processing capability of the machine; let it be l_j for node j.

∀j: Σ_i X_ij ≤ l_j   (4.2)

The n×n matrix C is obtained by calculating the average time taken to transfer a unit of data from every machine to every other machine through a simple experiment. Suppose key i is assigned to the reducer on machine k and P_ij is the size of key i on machine j; then the total cost of data transfer from all machines to machine k because of key i, together with the runtime estimate of the reducer, W_ik, can be written as

W_ik = Σ_j (P_ij ∗ C_jk)

The total shuffling cost can now be written as

C_total = Σ_i Σ_k W_ik ∗ X_ik   (4.3)

The above cost function can be generalized according to the scheduling requirement. For example, to schedule tasks in a heterogeneous environment we can add additional parameters that capture the cost of running a query on machine k. Since we are optimizing communication costs in this case, only network latencies are considered, but the model can be extended to any general scheduling problem on the map-reduce framework. We now model this problem as a flow network by following the steps below [26].

1. Create two nodes, a source (S) and a sink (T).

2. n nodes S_1 to S_n are created, one for each machine.

3. m nodes K_1 to K_m are created, one for each key.

4. n edges are created from the source S to each of S_1 to S_n, with capacities l_1 to l_n and cost 0.

5. Every pair of S_i (1 ≤ i ≤ n) and K_j (1 ≤ j ≤ m) is connected by an edge with cost W_ji and capacity 1. X_ij is the flow on the edge connecting S_i and K_j.

6. m edges are created from every node K_j (1 ≤ j ≤ m) to the target T, with capacity 1 and cost 0.

The maximum flow in the above flow network is clearly m, since the capacities of the inbound edges to the target T add up to m. At maximum flow, since the capacities of the outgoing edges from K_1 to K_m are 1, only one of the multiple incoming edges of each key node carries flow and the rest of the flows become 0. This corresponds to assigning the key K_j to the machine S_i whose incoming edge carries flow 1. The above procedure makes sure that each key is assigned to a single machine, and since the capacities of the edges into the machine nodes S_1 to S_n are l_1 to l_n, the number of keys assigned to a particular machine does not exceed that limit. The flows on these edges determine the number of keys assigned to each machine once the flow network is solved for minimum cost at maximum flow. Once we have the values of P_ij from our statistics, we solve the above max-flow min-cut graph to obtain the optimal key allocation and feed it to the map-reduce program. This flow network can be solved using an algorithm that has a strongly polynomial complexity [34]. In the next chapter we discuss the experimentation and results for the approaches discussed so far.

Figure 4.3 Flow network
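The construction above maps directly onto an off-the-shelf minimum-cost flow solver. The sketch below uses networkx purely for illustration (an assumption; the thesis only notes that a strongly polynomial min-cost flow algorithm [34] can be used), with small hypothetical values for the cost matrix W and the per-machine key limits l.

import networkx as nx

W = [[5, 9],    # W[i][k]: shuffle cost of placing key i on machine k
     [7, 3],
     [6, 6]]
l = [2, 2]      # l[k]: maximum number of keys machine k may process

G = nx.DiGraph()
n_keys, n_machines = len(W), len(l)
for k in range(n_machines):
    G.add_edge('S', f'S{k}', capacity=l[k], weight=0)             # source -> machine
for i in range(n_keys):
    for k in range(n_machines):
        G.add_edge(f'S{k}', f'K{i}', capacity=1, weight=W[i][k])  # machine -> key
    G.add_edge(f'K{i}', 'T', capacity=1, weight=0)                # key -> sink

flow = nx.max_flow_min_cost(G, 'S', 'T')
assignment = {i: k for k in range(n_machines) for i in range(n_keys)
              if flow[f'S{k}'][f'K{i}'] == 1}
print(assignment)   # e.g. {0: 0, 1: 1, 2: 0} for these numbers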

4.5 Shuffle algorithm - Proof of minimality

We use proof by contradiction to show that our algorithm gives the optimal shuffle. Assume that the algorithm outputs a non-optimal partition assignment. This means that over keys i and machines j there exists another possible allocation of X_ij that has a lower shuffle cost than the allocation chosen by the algorithm. Let us call this Plan_X and the plan chosen by the algorithm Plan_opt. If we prove that Plan_X has a lower flow cost in the network than Plan_opt, it means that the max-flow min-cost solution does not choose the plan with minimum cost, which is a contradiction. From equation 4.3, the total shuffle cost for a given allocation X_ij is

C_total = Σ_i Σ_k W_ik ∗ X_ik   (4.4)

From the flow network in figure 4.3, the total flow cost can be written as

C_flow = Σ_{i ∈ nodes(S)} Σ_{k ∈ nodes(K)} W_ik ∗ X_ik   (4.5)

which is the same as 4.4, since nodes(S) represents the machines and nodes(K) represents the keys by the way the flow network is built. This implies that the total cost of an allocation and the corresponding cost in the flow network are the same, so the flow cost of Plan_X would be lower than that of Plan_opt, which is a contradiction. Hence the shuffle algorithm always chooses the minimum shuffle cost allocation.

4.6 Shuffle algorithm - A working example

Let us consider a query joining the tables customer and supplier from the TPCH dataset. The query is as follows

select * from customer join supplier on customer.c_nationkey = supplier.s_nationkey where s_nationkey < 5;

We add the where clause to reduce the key set size so that the flow diagram is small and easy to understand. We ran the query on a 7-node cluster, and the values in the matrices P_ij for the tables customer and supplier are obtained from the histograms.

P_ij (customer) =
  135464 163672 206312 225336 179908   3280 397208
  242064 256660  15252 111028 586464 271748 133496
  163672 184664 169904  19680 188928  22632 182696
   94300  53300 234192  49364  73636 127264 131364
  197948 581052  29520  69372 191880 123492 113816

P_ij (supplier) =
  256878 318364      0      0 559906      0  18318
   66030  83354      0      0  29678      0 232312
   27122   4118      0      0  40044      0 208030
  312542 188008      0      0 503106      0  60918
    2272  66314      0      0 123398      0  89176

So, the total P_ij for the flow matrix is the sum of the above two matrices, which is as follows,

P_ij =
  392342 482036 206312 225336 739814   3280 415526
  308094 340014  15252 111028 616142 271748 365808
  190794 188782 169904  19680 228972  22632 390726
  406842 241308 234192  49364 576742 127264 192282
  200220 647366  29520  69372 315278 123492 202992

In our testing, all the machines are under the same switch, and because of this we have a stable ping time across all machines. So the normalized matrix C_ij looks as follows,

C_ij =
  0 1 1 1 1 1 1
  1 0 1 1 1 1 1
  1 1 0 1 1 1 1
  1 1 1 0 1 1 1
  1 1 1 1 0 1 1
  1 1 1 1 1 0 1
  1 1 1 1 1 1 0

So, the matrix W_ij evaluates to the following,

W_ij =
  2072304 1982610 2258334 2239310 1724832 2461366 2049120
  1719992 1688072 2012834 1917058 1411944 1756338 1662278
  1020696 1022708 1041586 1191810  982518 1188858  820764
  1421152 1586686 1593802 1778630 1251252 1700730 1635712
  1388020  940874 1558720 1518868 1272962 1464748 1385248

Solving the flow network with these W_ij values, we get the optimal assignment of keys shown in table 4.3, with a shuffle volume of 618 megabytes. As we can see, multiple keys can be assigned to a single node; this limit (the node capacity) can be configured per node while solving the flow graph.

Key name | Assigned node
0 | node5
1 | node5
2 | node7
3 | node5
4 | node2

Table 4.3 Key to Node assignment with optimized scheduler

The same query, when run with the default Hadoop scheduler, had the assignment in table 4.4, with a shuffle volume of 726 megabytes.

Key name | Assigned node
0 | node3
1 | node2
2 | node4
3 | node5
4 | node7

Table 4.4 Key to Node assignment with default scheduler

For building the matrix W_ij, we perform the standard matrix multiplication n times, where n is the number of nodes in the cluster and is generally in the range of low hundreds even for medium to large clusters. So this whole process takes a fraction of a second on an ordinary dual-core CPU machine. We discuss the performance evaluation of our approach in detail in the next chapter, where the best, average and worst case performances of our algorithm are presented along with the results.
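As a quick cross-check of the W computation in this example, the numpy snippet below rebuilds C for the 7-node, single-switch setup and multiplies it with the combined P matrix given above; the first entry agrees with the W matrix shown earlier.

import numpy as np

# Combined per-key, per-node partition sizes (customer + supplier) from above.
P = np.array([
    [392342, 482036, 206312, 225336, 739814,   3280, 415526],
    [308094, 340014,  15252, 111028, 616142, 271748, 365808],
    [190794, 188782, 169904,  19680, 228972,  22632, 390726],
    [406842, 241308, 234192,  49364, 576742, 127264, 192282],
    [200220, 647366,  29520,  69372, 315278, 123492, 202992],
])
# All nodes sit under one switch: unit cost everywhere except the diagonal.
C = np.ones((7, 7), dtype=int) - np.eye(7, dtype=int)

W = P @ C                    # W[i][k] = sum_j P[i][j] * C[j][k]
assert W[0][0] == 2072304    # matches the first entry of W_ij above
print(W)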

Chapter 5

Experimental Evaluation and Results

In this chapter we discuss the experimental evaluation we conducted on the theory presented so far and present the results obtained.

5.1 Experimental setup

We conducted all the experiments on a 10-node cluster comprising 1 master and 9 slaves. We used TPCH datasets of scales 100 (100 gigabytes) and 300 (300 gigabytes) for testing the join queries. The input queries we tested include both synthetic queries and the TPCH benchmark query set. Each machine is equipped with 3.5 gigabytes of RAM and a 2.4 GHz dual-core CPU, and the machines are connected by a 10 Gbps network. This setup qualifies as a network of commodity hardware machines, loaded with Linux and running a stable version of Hadoop. We patched Hive with each of the above features to test our algorithms. We used MySQL to serialize the histograms per node, as discussed.

5.2 Plan space evaluation and Complexity

In this section we describe in detail the plan space that each of our algorithms explores and compare it with the overall bushy plan space. For the purpose of this discussion, we assume the queries to be multi-way joins over m tables and that we are trying to build n-balanced trees.

5.2.1 Algorithm FindMin

In each iteration of the algorithm with i nodes remaining, we consider i² pairs to find the 2^⌊log₂ i⌋ tables (the largest power of 2 not exceeding i) whose joins have the least cost. This means that after an iteration with i tables, the number of tables remaining will be i − 2^⌊log₂ i⌋ + 1. So the total plan space of this algorithm is

Σ (2^⌊log₂ i⌋)², where the sum runs over the iterations of the algorithm, starting from i = m and updating i ← i − 2^⌊log₂ i⌋ + 1 while i > 0.

Coming to the complexity, with n input tables the algorithm calls selectMinPairs() ⌊log₂ n⌋ times, and each call takes O(n²). So the total complexity of the algorithm is O(n² log₂ n).

5.2.2 Algorithm FindRecMin

In this algorithm, we find the minimum-cost n-fully-balanced tree and fix it in order to build the rest of the tree. In each iteration the number of tables decreases by n − 1, starting from the initial value of m. The number of ways of choosing an n-fully-balanced subtree from m tables is C(m, n) ∗ n!. So the total plan space of the algorithm is as follows,

Σ_{i=0}^{i | (m − i(n−1)) > n} C(m − i(n−1), n) ∗ n!

In the algorithm, each iteration of the main loop selects an n-fully-balanced tree. Given m tables, this loop runs ⌈(m−1)/(n−1)⌉ times, reducing the number of tables by n − 1 in each iteration. Each iteration calls generateCombinations(x, n), which generates all n-sized combinations of the join set x. The complexity of this call is O(x ∗ n) for linear chain joins, since we need to consider only n adjacent tables for each input table. However, for cyclic or non-linear joins the complexity of generateCombinations(x, n) is C(x, n), and we iterate through all possible permutations to find the best plan. So the total complexity of each iteration of the loop is C(x, n) ∗ n!, which turns out to be O(mⁿ), while for linear chain joins each iteration of the loop takes O(m ∗ n). The total complexity of the algorithm is therefore O(mⁿ ∗ ⌈(m−1)/(n−1)⌉) for cyclic or non-linear joins and O(m ∗ n ∗ ⌈(m−1)/(n−1)⌉) for linear chain joins. n generally takes small values like 4 or 8 depending on the size of the cluster.

5.2.3 Algorithm FindRecDeepMin

In this algorithm we try to build the tree bottom-up, keeping it n-fully-balanced in each iteration. In the beginning we have m tables and we build an n-fully-balanced tree (n < m); the number of ways of doing this is C(m, n) ∗ n!. The number of tables then decreases by n − 1 in each iteration, and we recursively apply the same procedure. So the total plan space for this algorithm is

Π_{i=0}^{i | (m − i(n−1)) > n} C(m − i(n−1), n) ∗ n!

Using the above formulae, the following table lists the plan spaces for various values of m, for n = 4 and n = 8.

m | FindMin | FindRecMin (n=4) | FindRecMin (n=8) | FindRecDeepMin (n=4) | FindRecDeepMin (n=8) | Total bushy trees
3 | 4 | 6 | 6 | 6 | 6 | 12
4 | 7 | 25 | 24 | 24 | 24 | 120
5 | 14 | 122 | 120 | 240 | 120 | 1680
6 | 22 | 366 | 720 | 2160 | 720 | 30240
7 | 35 | 865 | 5040 | 20160 | 5040 | 665280
8 | 35 | 1802 | 40321 | 403200 | 40320 | 17297280
9 | 50 | 3390 | 362882 | 6531840 | 725760 | 518918400
10 | 67 | 5905 | 1814406 | 101606400 | 10886400 | 17643225600
11 | 90 | 9722 | 6652824 | 3193344000 | 159667200 | 670442572800
12 | 101 | 15270 | 19958520 | 77598259200 | 2395008000 | 28158588057600
13 | 128 | 23065 | 51892560 | 1743565824000 | 37362124800 | 1.30E+15
14 | 158 | 33746 | 121086000 | 76716896256000 | 610248038400 | 6.48E+16

Table 5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14
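The counting behind Table 5.1 can be sketched by iterating the recurrences above: at each step an n-way join is chosen from the i remaining tables (C(i, n) ∗ n! ways) and replaces them with one intermediate result. The following is only a transcription of that idea, not the algorithm implementations that produced Table 5.1, so individual entries may differ slightly from the table.

from math import comb, factorial

def findrecmin_space(m, n):
    # Sum of the per-iteration choices (FindRecMin).
    total, i = 0, m
    while i > 1:
        step = min(n, i)
        total += comb(i, step) * factorial(step)
        i -= step - 1
    return total

def findrecdeepmin_space(m, n):
    # Product of the per-iteration choices (FindRecDeepMin).
    total, i = 1, m
    while i > 1:
        step = min(n, i)
        total *= comb(i, step) * factorial(step)
        i -= step - 1
    return total

for m in (5, 7, 10):
    print(m, findrecmin_space(m, 4), findrecdeepmin_space(m, 4))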

Number of tables | Hive | FindMin | FindRecMin (n=4) | FindRecDeepMin (n=4) | FindRecMin (n=8) | FindRecDeepMin (n=8)
3 | 178 | 121.2 | 121.2 | 121.2 | 121.2 | 121.2
4 | 318 | 199 | 199 | 199 | 199 | 199
5 | 342.199 | 293 | 255 | 255 | 255 | 255
6 | 504.452 | 396 | 363 | 305 | 300 | 300
7 | 637.73 | 547 | 547 | 547 | 580 | 580
8 | 718.808 | 590 | 590 | 590 | 602 | 602
9 | 773.317 | 488 | 496 | 458 | 378 | 378
10 | 870.155 | 644 | 626 | 561 | 402 | 402
11 | 994.457 | 684 | 656 | 656 | 664 | 664
12 | 995.565 | 712 | 698 | 608 | 428 | 428

Table 5.2 Algorithms average runtime in seconds on queries with increasing number of joins on a 10 node cluster and a TPCH 100 GB dataset using 4 and 8-balanced trees

5.3 Algorithms performance

In this section we describe in detail the performance of our algorithms on queries with an increasing number of joins, on our experimental setup. For this evaluation we ran SQL queries with 3 to 12 join tables. For each such query we ran it on Hive using the default left-deep execution plans, and also with the three algorithms mentioned above. Execution times are tabulated in table 5.2 and plotted in figure 5.9. From the graph we can see that the n-balanced trees perform well compared to the default Hive execution, especially as the number of joins increases. Since we work with 4-balanced trees, the performance of all the algorithms remains the same until the number of tables exceeds 4. Beyond that we can see algorithms 2 and 3 performing well, as they consider a bigger plan space than algorithm 1. It is interesting to see the performance of the 8-balanced trees: the graph for n = 8 is relatively ragged compared to n = 4, owing to the fact that too much parallelization is an overkill in some cases and suits the cluster in other cases, as described in the previous chapters. This is reflected in the performance figures too.

TPCH Query | Hive | FindMin | FindRecMin | FindRecDeepMin
q2 | 362.443 | 321.634 | 266.234 | 266.234
q3 | 1238.635 | 1069.725 | 1069.725 | 1069.725
q10 | 1247.157 | 995.087 | 995.087 | 995.087
q11 | 321.236 | 274 | 274 | 274
q16 | 325.129 | 325.129 | 305.966 | 305.966
q18 | 1004.821 | 1002.882 | 1002.882 | 1002.882
q20 | 994.485 | 832.663 | 622.234 | 622.234

Table 5.3 Algorithms runtime in seconds on tpch benchmark queries on a 10 node cluster on a TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries have less number of join tables

We ran the standard TPCH benchmark queries on the same cluster of 10 nodes with the scale 300 dataset; the performance figures are tabulated in table 5.3 and plotted in figure 5.8. We included only the queries with joins and selects since they are the only ones relevant to our current work. Also, since all these queries have 4 or fewer join tables, we need not test 8-balanced trees as they essentially give the same output as n = 4; hence all the results in figure 5.8 correspond to n = 4. This is also the reason queries like q16 and q18 show similar performance between Hive and our algorithms: they have only 3 tables (2 joins), so there is no search space for our algorithms to explore. We show the execution plans for each of these queries in figures 5.1 to 5.7 for the Commercial DBMS [9], Postgres, Hive and FindRecDeepMin query planners, and the 'EXPLAIN' command output in appendix A. Each figure is followed by a description that explains the performance improvement of our approach compared to the non-parallelizable left-deep plans chosen by the other optimizers, and the whole summary is tabulated in table 5.4.

TPCH Query | Commercial DBMS | Postgres | Hive | FindRecDeepMin
q2 | left deep: ((((ps ⋈ s) ⋈ n) ⋈ r) ⋈ p) | left deep: ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p) | left deep: ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p) | 4-balanced: (((r ⋈ n) ⋈ (ps ⋈ p)) ⋈ s)
q3 | left deep: ((c ⋈ o) ⋈ l) | left deep: ((l ⋈ o) ⋈ c) | left deep: ((c ⋈ o) ⋈ l) | left deep: ((l ⋈ o) ⋈ c)
q10 | left deep: (((l ⋈ o) ⋈ c) ⋈ n) | left deep: (((c ⋈ n) ⋈ o) ⋈ l) | left deep: (((c ⋈ o) ⋈ n) ⋈ l) | 4-fully balanced: ((c ⋈ n) ⋈ (l ⋈ o))
q11 | left deep: ((s ⋈ n) ⋈ p) | left deep: ((p ⋈ s) ⋈ n) | left deep: ((p ⋈ s) ⋈ n) | left deep: ((s ⋈ n) ⋈ p)
q16 | left deep: ((p ⋈ ps) ⋈ s) | left deep: ((p ⋈ ps) ⋈ s) | left deep: ((p ⋈ s) ⋈ ps) | left deep: ((p ⋈ ps) ⋈ s)
q18 | left deep: ((l ⋈ o) ⋈ c) | left deep: ((c ⋈ o) ⋈ l) | left deep: ((c ⋈ o) ⋈ l) | left deep: ((c ⋈ o) ⋈ l)
q20 | left deep: (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n) | left deep: (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n) | left deep: (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n) | 4-balanced: (((ps ⋈ p) ⋈ (ps ⋈ l)) ⋈ (s ⋈ n))

Table 5.4 Summary of query plans for the TPCH dataset (c = customer, s = supplier, ps = partsupp, r = region, l = lineitem, o = orders, n = nation)

Figure 5.1 Query execution plan for q2: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

We can notice that the execution plans chosen by the Commercial DBMS, Postgres and Hive are all left deep, whereas the algorithm FindRecDeepMin has chosen a 4-balanced tree with more parallelization: it can run two joins (and hence two map-reduce jobs, (region join nation) and (partsupp join part)) at the same time, resulting in effective utilization of the cluster. Also, at each join node we optimize the reducer allocation based on the statistics.

Figure 5.2 Query execution plan for q3: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

In query q3 we have only 2 joins, so there are not many possible join orders and no parallelization is possible. Each of the optimizers chooses a plan based on its statistics, and for the FindRecDeepMin algorithm we optimally schedule the reduce tasks based on the statistics.

Figure 5.3 Query execution plan for q10: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

This query shows the difference between the parallel and non-parallel plans by building a fully balanced tree with four tables. As we can see from the figure, the Commercial DBMS, Postgres and Hive build left-deep plans, but FindRecDeepMin's plan can be easily parallelized, with two joins running at the same time ((customer join nation) and (lineitem join orders)). This way we can optimally utilize the task slots and schedule the tasks as and when a slot is available, thus using the cluster resources properly.

Figure 5.4 Query execution plan for q11: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

This query is similar to query q3 in figure 5.2. It has three tables and two joins, so each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules the reduce tasks based on the statistics.

5.4 Cost Formulae accuracy

In this section we analyze in detail the accuracy of the cost formulae we designed for the query optimizer. Our main aim in designing the cost formulae is to quantify the cost of each phase of a join map-reduce job. As explained already, a real map-reduce job is quite complex under the hood; we picked the parts of each phase that incur significant costs, included them in the cost formulae, and assumed worst-case scenarios to get an approximate upper bound for each phase. We ran 5 different map-reduce join jobs with increasing input sizes, measured the runtime of the phases, and compared them with the results from the cost formulae. The results are in figures 5.10 and 5.11. From figure 5.10 it is clear that the estimated runtime using the cost formulae is roughly proportional to the actual map phase execution time, but for all the inputs the estimated cost is high. This is because HDFS uses the block locality principle in most cases to read the data, while the formulae use a read throughput whose calibration includes reading remote blocks too. We also use the time taken by the slowest machine to complete its map wave, since the slowest machine limits the total speed of the cluster; in reality, however, tasks may get assigned to faster machines, resulting in faster execution times. Unlike the map phase, the estimates for the shuffle and reduce phases are lower than the actual runtime, since the actual runtime includes the JVM startup time, which is not present in the cost formulae and which varies from machine to machine depending on factors such as JVM reuse, caching, etc.

Figure 5.5 Query execution plan for q16: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

This query is similar to queries q3 and q11 in figures 5.2 and 5.4. It has three tables and two joins, so each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules the reduce tasks based on the statistics.

5.5 Efficiency of scheduling algorithm

Figure 5.6 Query execution plan for q18: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

This query is similar to queries q3, q11 and q16 in figures 5.2, 5.4 and 5.5. It has three tables and two joins, so each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules the reduce tasks based on the statistics.

In this section we discuss the efficiency of our scheduling algorithm from chapter 4. We ran various join queries on a TPCH scale 100 dataset with input sizes of up to 100 gigabytes. We ran eighteen different queries to demonstrate the best-case and worst-case behaviour of the algorithm. Whether the algorithm hits its best or worst case (compared to the default scheduler) depends on the distribution of keys across the machines in the dataset, since that is what determines the shuffling of keys across machines. To demonstrate the worst and best cases, we explain the data distributions that lead to such performance and show them experimentally on sample distributions. We ran these queries twice, once with the default Hadoop scheduling and once with the optimal scheduling allocation in place. We measured the shuffle data each machine in the cluster received from the other machines in both executions and plotted the graph in figure 5.12. The X-axis denotes the size of the input tables, reaching up to 100 gigabytes, whereas the Y-axis measures the shuffle size in megabytes. The red plot corresponds to the default scheduler in Hadoop and the blue plot to the optimized shuffle. The actual shuffle values are tabulated in table 5.5. From the graph and the data we can see that the optimized shuffle data size is smaller than the default Hadoop shuffle; this is because the optimized shuffle algorithm minimizes the network IO by assigning keys so as to cause the least network movement of data. We can also notice that the advantage of the algorithm grows with the input data size, because the algorithm gets more flexibility to try out possible allocations as the data is spread out.

Input size (GB) | Optimized shuffle volume (MB) (A) | Default shuffle volume (MB) (H) | Num shuffles (optimized) | Num shuffles (default) | H/A
2.42 | 24.07 | 37.33 | 177787 | 275688 | 1.55
12.1 | 122.39 | 181.28 | 903841 | 1338694 | 1.48
24.2 | 218.43 | 314.43 | 1612961 | 2321897 | 1.43
36.3 | 361.81 | 462.07 | 2671741 | 3412088 | 1.27
48.4 | 424.99 | 690.52 | 3138280 | 5099067 | 1.62
60.5 | 495.46 | 686.55 | 3658709 | 5069751 | 1.38
72.6 | 597.86 | 891.49 | 4414868 | 6583109 | 1.49
84.7 | 682.94 | 1373.80 | 5043081 | 10144671 | 2.01
96.8 | 958.08 | 1476.46 | 7074826 | 10902697 | 1.54

Table 5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale 100 dataset

The algorithm works best when, for every join key, all the rows corresponding to that key from both join tables lie on the same machine, so that the algorithm allocates the reduce task on that machine; if this holds for all the join keys, the total shuffle IO is zero, since no join key needs to be moved to any other machine. We ran 9 different queries demonstrating this best-case behaviour; the results are in table 5.6 and plotted in figure 5.13 (the shuffle algorithm's IO coincides with the X-axis, as the shuffle volume is zero due to perfect data locality). Since Hadoop does not take data locality into account while assigning reduce tasks, we can clearly see that data is unnecessarily shipped to other machines, resulting in shuffle IO.

Figure 5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance (X-axis: input sizes in gigabytes; Y-axis: shuffle size in megabytes; series: Hadoop, Shuffle Algo)

Input size (GB) | Default shuffle volume (MB) | Num shuffles (default) | Optimized shuffle (MB) | Num shuffles (optimized)
2.42 | 131917296 | 821508 | 0 | 0
12.1 | 659591856 | 4107576 | 0 | 0
24.2 | 1539049806 | 9584358 | 0 | 0
36.3 | 1978778388 | 12322746 | 0 | 0
48.4 | 3078100760 | 19168723 | 0 | 0
60.5 | 3847626734 | 23960909 | 0 | 0
72.6 | 3957557628 | 24645498 | 0 | 0
84.7 | 5386676540 | 33545267 | 0 | 0
96.8 | 4397286800 | 27383890 | 0 | 0

Table 5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance

Figure 5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance (X-axis: input table sizes in gigabytes; Y-axis: shuffle size in megabytes; series: Hadoop, Shuffle Algo)

Input size (GB) | Default shuffle volume (MB) (H) | Num shuffles (default) | Optimized shuffle volume (MB) (A) | Num shuffles (optimized) | H/A
2.42 | 131901300 | 821400 | 131901300 | 821400 | 1
12.1 | 659573700 | 4107450 | 659573700 | 4107450 | 1
24.2 | 1319168700 | 8215050 | 1319168700 | 8215050 | 1
36.3 | 1978763700 | 12322650 | 1978763700 | 12322650 | 1
48.4 | 2638358700 | 16430250 | 2638358700 | 16430250 | 1
60.5 | 3297932400 | 20537700 | 3297932400 | 20537700 | 1
72.6 | 3957527400 | 24645300 | 3957527400 | 24645300 | 1
84.7 | 4617122400 | 28752900 | 4617122400 | 28752900 | 1
96.8 | 5276717400 | 32860500 | 5276717400 | 32860500 | 1

Table 5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance

Coming to the worst-case performance of the algorithm, it happens when each key of the join column is equally distributed across all the machines. In this case it does not matter which machine we ship a key to; we get the same overall amount of shuffle IO. This is equivalent to a perfectly uniform distribution of rows per machine per join column value. We ran queries exhibiting this worst case, and our scheduler performs exactly the same as Hadoop's default scheduler. The results are tabulated in table 5.7 and plotted in figure 5.14.

Figure 5.7 Query execution plan for q20: (a) Commercial DBMS, (b) Postgres, (c) Hive, (d) FindRecDeepMin

This query highlights the parallelization achieved by the algorithm FindRecDeepMin compared to the other optimizers. As seen in the figure, the Commercial DBMS, Postgres and Hive stick to the left-deep tree approach, whereas FindRecDeepMin provides multiple levels of parallelization. Instead of two parallel joins in the first step, this query plan provides 3 join map-reduce jobs in parallel ((partsupp join part), (partsupp join lineitem) and (supplier join nation)). This utilizes the cluster to its maximum, taking up slots as soon as they are released by other jobs. At the next level we again have two join jobs in parallel (between the intermediate output tables). In addition, we schedule the tasks optimally based on the statistics, and this gave this query a good performance improvement over the other plans.

Figure 5.8 Algorithms runtime in seconds on TPCH benchmark queries on a 10 node cluster on a TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries have few join tables. (X-axis: TPCH query numbers q2, q3, q10, q11, q16, q18, q20; Y-axis: runtime of the queries in seconds; series: Hive, FindMin, FindRecMin, FindRecDeepMin)

Figure 5.9 Algorithms performance evaluation on 100GB dataset and synthetic queries (X-axis: number of tables in the join query, 3 to 12; Y-axis: run time of the query in seconds; series: Hive, FindMin, FindRecMin (n=4), FindRecDeepMin (n=4), FindRecMin (n=8), FindRecDeepMin (n=8))

Figure 5.10 Map phase cost formulae evaluation (X-axis: size of input data in gigabytes; Y-axis: map phase runtime; series: Actual, Estimate)

Figure 5.11 Reduce and shuffle cost formulae evaluation (X-axis: size of input tables in gigabytes; Y-axis: runtime of shuffle and reduce; series: Actual, Estimate)

Figure 5.12 Comparison of default vs optimal shuffle IO (X-axis: input table sizes in gigabytes; Y-axis: shuffle size in megabytes; series: Hadoop, Shuffle Algo)

Chapter 6

Conclusions and Future Work

6.1 Conclusions and contributions

The database infrastructure has changed rapidly over the past few years, from high-end servers holding vast amounts of data to sets of commodity hardware machines holding data in a distributed fashion. The map-reduce programming paradigm from Google has facilitated this transformation by providing highly scalable and fault-tolerant distributed systems of production quality. The increase in dataset sizes has posed various problems to system designers in terms of complexity and scalability. Since many companies still rely on SQL standards for analytics, SQL has been ported to map-reduce based systems too and has become a de facto standard there. This posed new problems for query optimizers owing to the complexity and the scale at which such systems work. As a part of this work, we have built a new optimizer from scratch for joins over the map-reduce framework, based on traditional relational database style optimizations. Our contributions can be summed up as follows.

• We built a query optimizer from scratch for map-reduce based SQL systems, following traditional database style optimizations using statistics and cost formulae.

• We designed a distributed statistics store for data in HDFS and built cost formulae for the join operators in Hive.

• We explored a new subset of the bushy plan space, called n-balanced trees, based on the fact that their inherent parallelization suits the map-reduce framework.

• We formulated the shuffle in a map-reduce job as a max-flow min-cut problem and found the optimal assignments to reduce network IO.

Chapter 3 discusses the plan space of n-balanced trees and their applicability to the map-reduce framework based on statistics. The results show that accurate statistics can be very useful for predicting the output sizes of intermediate results and can also help us pick the best plan from the search space of n-balanced trees. The results also show the benefits of using n-balanced trees for a massively parallel framework like map-reduce compared to the traditional left-deep join trees used in Hive. But one should be careful while choosing the value of n, as too much parallelization can be an overkill, since every job ends up waiting for others to complete. Also, using our optimal shuffle max-flow min-cut formulation reduces the network IO from other nodes and helps increase query performance. The ideas presented in this work can be used in any standard map-reduce based SQL system with a few changes. As a proof of concept we modified Hive for our experimental analysis, and a similar approach is possible for other systems too.

6.2 Future work

In this section we discuss possible directions to extend this work.

• This work focuses on the most basic problem of query optimization, joins with selection and projection operators, since a join operator is considered relatively costly compared to other operators. We can extend this work to include other SQL operators like aggregation, group-by, subqueries, etc. For that we need to design appropriate cost formulae for each of them and follow similar techniques from the relational world.

• Another important direction is to improve the cost formulae for the join operators. Currently we obtain an upper-bound cost based on statistics. Since a map-reduce system is very complex in terms of design, we need to take into account many factors, ranging from cache and buffer sizes to network speeds, and also track all the round trips from disk to memory in each phase of the job. Taking every factor into account and designing accurate cost formulae for each operator would enhance the query optimizer.

• A map-reduce job is designed to be fault tolerant and can sustain task failures. The current optimizer does not take failures into account and calculates the costs of operators without considering such scenarios. However, such failures can increase the cost of each SQL operator, and including them in the optimizer might give us more accurate costs and thus better plans. We need to consider factors like the load on each machine (number of mappers and reducers), scheduler properties like preemption, the mean time to failure (MTTF) of nodes, etc. to include failure scenarios in the optimizer.

• The performance of a map-reduce cluster depends on how well the scheduler works. There are many schedulers available today that decide whether to launch a task or job, or kill an existing one, based on various factors like priority, load, fairness, etc. We need to take such scheduler-specific properties into consideration and build our optimizer accordingly, since the cost of each operator depends on how the scheduler works.

• Block placement in a distributed system plays an important role in the cost of processing data. Placing blocks closer to the code reduces the network IO and increases query performance. The same principle applies to HDFS too. Since data is split into blocks, we can design an optimal block allocation for a set of queries that reduces the overall cost of processing them. So the problem translates to finding an optimal HDFS block allocation given a set of queries and the costs to process them.

Appendix A

Query execution plans for TPCH queries

A.1 q2

A.1.1 Postgres

Nested Loop (cost=36.91..61.34 rows=1 width=730) Join Filter: (n.n_nationkey = s.s_nationkey) -> Hash Join (cost=12.14..24.48 rows=1 width=108) Hash Cond: (n.n_regionkey = r.r_regionkey) -> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=112) -> Hash (cost=12.12..12.12 rows=1 width=4) -> Seq Scan on region r (cost=0.00..12.12 rows=1 width=4) Filter: (r_name = ’EUROPE’::bpchar) -> Hash Join (cost=24.77..36.84 rows=1 width=630) Hash Cond: (s.s_suppkey = ps.ps_suppkey) -> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=510) -> Hash (cost=24.76..24.76 rows=1 width=128) -> Hash Join (cost=12.41..24.76 rows=1 width=128) Hash Cond: (ps.ps_partkey = p.p_partkey) -> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width =24) -> Hash (cost=12.40..12.40 rows=1 width=108) -> Seq Scan on part p (cost=0.00..12.40 rows=1 width =108) Filter: (((p_type)::text ˜˜ ’\%BRASS’::text) AND ( p_size = 15))

A.1.2 Hive

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-1 depends on stages: Stage-4 Stage-2 depends on stages: Stage-1 Stage-3 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-4 Map Reduce Alias -> Map Operator Tree: nation TableScan alias: n Reduce Output Operator region TableScan alias: r Filter Operator (r_name = ’EUROPE’) Reduce Operator Tree: Join Operator

Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: \$INTNAME Reduce Output Operator supplier TableScan alias: s Reduce Output Operator Reduce Operator Tree: Join Operator condition map:

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: \$INTNAME

Reduce Output Operator partsupp TableScan alias: ps Reduce Output Operator

Stage: Stage-3 Map Reduce Alias -> Map Operator Tree: \$INTNAME Reduce Output Operator part TableScan alias: p Filter Operator predicate: expr: ((p_size = 15) and (p_type like ’\%BRASS’)) Reduce Output Operator

Reduce Operator Tree: Join Operator, File Output Operator

Stage: Stage-0 Fetch Operator limit: -1

A.1.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 is a root stage Stage-3 depends on stages - Stage-1,Stage-2 Stage-4 depends on stages - Stage-3 Stage-0 is a root stage

Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: nation TableScan alias: n Reduce Output Operator region TableScan alias: r Filter Operator (r_name = 'EUROPE') Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: partsupp TableScan alias: ps Reduce Output Operator part TableScan alias: p Filter Operator predicate: expr: ((p_size = 15) and (p_type like '%BRASS')) Reduce Output Operator Reduce Operator Tree: Join Operator condition map:

Stage: Stage-3 Map Reduce Alias -> Map Operator Tree: Stage-1-out TableScan alias: s1 Stage-2-out TableScan alias: s2 Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-4 Map Reduce Alias -> Map Operator Tree: Stage-3-out TableScan alias: s3 supplier TableScan alias: s Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.2 q3

A.2.1 Postgres

Hash Join (cost=24.67..37.43 rows=1 width=12)

Hash Cond: (l.l_orderkey = o.o_orderkey) -> Seq Scan on lineitem l (cost=0.00..12.50 rows=67 width=4) Filter: (l_shipdate > '1995-03-15'::date) -> Hash (cost=24.66..24.66 rows=1 width=12) -> Hash Join (cost=11.76..24.66 rows=1 width=12) Hash Cond: (o.o_custkey = c.c_custkey) -> Seq Scan on orders o (cost=0.00..12.62 rows=70 width=16) Filter: (o_orderdate < '1995-03-15'::date) -> Hash (cost=11.75..11.75 rows=1 width=4) -> Seq Scan on customer c (cost=0.00..11.75 rows=1 width=4) Filter: (c_mktsegment = 'BUILDING'::bpchar)

A.2.2 Hive

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: customer TableScan alias: c Filter Operator (c\_mktsegment = ’BUILDING’) Reduce Output Operator orders TableScan alias: o Filter Operator (o\_orderdate < ’1995-03-15’) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree:

$INTNAME Reduce Output Operator

lineitem TableScan alias: l Filter Operator (l\_shipdate > ’1995-03-15’) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.2.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: lineitem TableScan alias: l Filter Operator (l\_shipdate > ’1995-03-15’) Reduce Output Operator orders TableScan alias: o Filter Operator (o\_orderdate < ’1995-03-15’) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: $INTNAME Reduce Output Operator

customer TableScan alias: customer Filter Operator (c\_mktsegment = ’BUILDING’) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.3 q10

A.3.1 Postgres

Nested Loop (cost=25.11..49.97 rows=1 width=606) Join Filter: (o.o\_orderkey = l.l\_orderkey) -> Hash Join (cost=25.11..37.46 rows=1 width=610) Hash Cond: (n.n\_nationkey = c.c\_nationkey) -> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=108) -> Hash (cost=25.10..25.10 rows=1 width=510) -> Hash Join (cost=13.16..25.10 rows=1 width=510) Hash Cond: (c.c\_custkey = o.o\_custkey) -> Seq Scan on customer c (cost=0.00..11.40 rows=140 width =506) -> Hash (cost=13.15..13.15 rows=1 width=8) -> Seq Scan on orders o (cost=0.00..13.15 rows=1 width =8) Filter: ((o\_orderdate >= ’1993-10-01’::date) AND (o\_orderdate < ’1994-01-01’::date))

-> Seq Scan on lineitem l (cost=0.00..12.50 rows=1 width=4) Filter: (l.l_returnflag = 'R'::bpchar)

A.3.2 Hive

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-3 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: customer TableScan alias: c Reduce Output Operator orders TableScan alias: o Filter Operator ((o\_orderdate >= ’1993-10-01’) and (o\_orderdate < ’1994-01-01’)) Reduce Output Operator Reduce Operator Tree: Join Operator Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: \$INTNAME Reduce Output Operator nation TableScan alias: n Reduce Output Operator

Reduce Operator Tree:Join Operator

Stage: Stage-3 Map Reduce Alias -> Map Operator Tree:

63 \$INTNAME Reduce Output Operator lineitem TableScan alias: l Filter Operator (l\_returnflag = ’R’) Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.3.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-3 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: customer TableScan alias: c Reduce Output Operator nation TableScan alias: n Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: lineitem TableScan alias: l Filter Operator (l_returnflag = 'R') Reduce Output Operator orders TableScan alias: o Filter Operator ((o_orderdate >= '1993-10-01') and (o_orderdate < '1994-01-01')) Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-3 Map Reduce Alias -> Map Operator Tree: Stage-1-out TableScan alias: s1 Stage-2-out TableScan alias: s2 Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.4 q11

A.4.1 Postgres

Hash Join (cost=24.22..36.57 rows=1 width=20) Hash Cond: (ps.ps\_suppkey = s.s\_suppkey) -> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=24) -> Hash (cost=24.21..24.21 rows=1 width=4) -> Hash Join (cost=12.14..24.21 rows=1 width=4) Hash Cond: (s.s\_nationkey = n.n\_nationkey) -> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=8) -> Hash (cost=12.12..12.12 rows=1 width=4) -> Seq Scan on nation n (cost=0.00..12.12 rows=1 width=4) Filter: (n\_name = ’GERMANY’::bpchar)

A.4.2 Hive

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: partsupp TableScan alias: ps Reduce Output Operator supplier TableScan alias: s Reduce Output Operator

Reduce Operator Tree:Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: \$INTNAME Reduce Output Operator nation TableScan alias: n Filter Operator (n\_name = ’GERMANY’) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.4.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: supplier TableScan alias: s Reduce Output Operator nation TableScan alias: n Filter Operator (n_name = 'GERMANY') Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: \$INTNAME Reduce Output Operator partsupp TableScan alias: ps Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.5 q16

A.5.1 Postgres

Hash Join (cost=27.57..44.13 rows=139 width=120)

Hash Cond: (ps.ps_suppkey = s.s_suppkey) -> Hash Join (cost=13.82..28.40 rows=158 width=120) Hash Cond: (p.p_partkey = ps.ps_partkey) -> Seq Scan on part p (cost=0.00..12.40 rows=158 width=120) Filter: ((p_brand <> 'Brand45'::bpchar) AND ((p_type)::text !~~ 'MEDIUM POLISHED%'::text)) -> Hash (cost=11.70..11.70 rows=170 width=8) -> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=8) -> Hash (cost=11.88..11.88 rows=150 width=4) -> Seq Scan on supplier s (cost=0.00..11.88 rows=150 width=4) Filter: ((s_comment)::text !~~ '%Customer%Complaints%'::text)

A.5.2 Hive

STAGE DEPENDENCIES: Stage-2 is a root stage Stage-1 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: part TableScan alias: p Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’MEDIUM POLISHED\%’)))

Reduce Output Operator supplier TableScan alias: s Filter Operator (not (s\_comment like ’\%Customer\%Complaints\%’)) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-1 Map Reduce

Alias -> Map Operator Tree: $INTNAME Reduce Output Operator partsupp TableScan alias: ps Reduce Output Operator Reduce Operator Tree: Join Operator Stage: Stage-0 Fetch Operator limit: -1

A.5.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: supplier TableScan alias: ps Reduce Output Operator part TableScan alias: p Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’ MEDIUM POLISHED\%’))) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-2 Map Reduce

Alias -> Map Operator Tree: $INTNAME Reduce Output Operator supplier TableScan alias: s Filter Operator (not (s_comment like '%Customer%Complaints%')) Reduce Output Operator

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.6 q18

A.6.1 Postgres

Hash Join (cost=43.02..57.26 rows=49 width=96) Hash Cond: (l.l\_orderkey = o.o\_orderkey) -> Seq Scan on lineitem l (cost=0.00..12.00 rows=200 width=4) -> Hash (cost=42.40..42.40 rows=49 width=100) -> Hash Join (cost=26.49..42.40 rows=49 width=100) Hash Cond: (o.o\_custkey = c.c\_custkey) -> Hash Join (cost=13.34..28.50 rows=70 width=32) Hash Cond: (o.o\_orderkey = t.l\_orderkey) -> Seq Scan on orders o (cost=0.00..12.10 rows=210 width=28) -> Hash (cost=12.50..12.50 rows=67 width=4) -> Seq Scan on lineitem t (cost=0.00..12.50 rows=67 width=4) Filter: (l\_quantity > 300::numeric) -> Hash (cost=11.40..11.40 rows=140 width=72) -> Seq Scan on customer c (cost=0.00..11.40 rows=140 width =72)

A.6.2 Hive

STAGE DEPENDENCIES: Stage-2 is a root stage Stage-1 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: customer TableScan alias: c Reduce Output Operator orders TableScan alias: o Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: \$INTNAME l TableScan alias: l Reduce Output Operator key expressions: expr: l\_orderkey type: int sort order: + Map-reduce partition columns: expr: l\_orderkey type: int tag: 2 t

TableScan alias: t Filter Operator (l_quantity > 300.0)

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

A.6.3 FindRecDeepMin

STAGE DEPENDENCIES: Stage-2 is a root stage Stage-1 depends on stages: Stage-2 Stage-0 is a root stage

STAGE PLANS: Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: customer TableScan alias: c Reduce Output Operator orders TableScan alias: o Reduce Output Operator Reduce Operator Tree: Join Operator

Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: \$INTNAME l TableScan alias: l

Reduce Output Operator key expressions: expr: l_orderkey type: int sort order: + Map-reduce partition columns: expr: l_orderkey type: int tag: 2 t TableScan alias: t Filter Operator (l_quantity > 300.0)

Reduce Operator Tree: Join Operator

Stage: Stage-0 Fetch Operator limit: -1

Bibliography

[1] Apache Hadoop. http://hadoop.apache.org/. [2] Apache Hive. http://hive.apache.org/. [3] Apache PoweredBy. http://wiki.apache.org/hadoop/PoweredBy. [4] Cern data scale. http://home.web.cern.ch/about/updates/2013/04/ animation-shows-lhc-data-processing. [5] Data Scale. http://www.comparebusinessproducts.com/fyi/ 10-largest-databases-in-the-world. [6] Facebook scale. https://www.facebook.com/notes/paul-yang/ moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/ 10150246275318920. [7] HDFS architecture. http://hadoop.apache.org/docs/stable1/hdfs_design.html. [8] Map Reduce architecture diagram . http://biomedicaloptics.spiedigitallibrary.org/ article.aspx?articleid=1167145. [9] SQL server tpch plans. http://researchweb.iiit.ac.in/˜bharath.v/sql_server_plans.pdf. [10] Towards Join Order Templates based Query Optimization: An Empirical Evaluation. http://web2py. iiit.ac.in/research_centres/publications/view_publication/phdthesis/12. [11] Yahoo scale. http://developer.yahoo.com/blogs/hadoop/ scaling-hadoop-4000-nodes-yahoo-410.html. [12] F. Afrati, A. Sarma, D. Menestrina, A. Parameswaran, and J. Ullman. Fuzzy joins using mapreduce. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 498–509, 2012. [13] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10, pages 99–110, New York, NY, USA, 2010. ACM. [14] S. Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pages 34–43, New York, NY, USA, 1998. ACM.

74 [15] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008. [16] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, Oct. 2003. [17] G. Graefe. Volcano— an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1):120–135, Feb. 1994. [18] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC, USA, 1993. IEEE Computer Society. [19] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC, USA, 1993. IEEE Computer Society. [20] L. M. Haas, M. J. Carey, M. Livny, and A. Shukla. Seeking the truth about ad hoc join costs. The VLDB Journal, 6(3):241–256, 1997. [21] E. P. Harris and K. Ramamohanarao. Join algorithm costs revisited. The VLDB Journal, 5(1):064–084, Jan. 1996. [22] T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst., 9(3):482–502, Sept. 1984. [23] Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 19–30. VLDB Endowment, 2003. [24] Y. E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, Mar. 1996. [25] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization. SIGMOD Rec., 20(2):168–177, Apr. 1991. [26] K. Karlaplem and N. M. Pun. Query driven data allocation algorithms for distributed database systems. In in 8th International Conference on Database and Expert Systems Applications (DEXA’97), Toulouse, Lecture Notes in Computer Science 1308, pages 347–356, 1997. [27] R. S. G. Lanzelotte, P. Valduriez, and M. Za¨ıt. On the effectiveness of optimization search strategies for parallel execution spaces. In Proceedings of the 19th International Conference on Very Large Data Bases, VLDB ’93, pages 493–504, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. [28] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. Ysmart: Yet another sql-to-mapreduce translator. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 25–36, 2011. [29] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, and M. Wong. Tenzing a sql implementation on the mapreduce framework. [30] L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB ’86, pages 149–159, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.

75 [31] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Rec., 27(2):448–459, June 1998. [32] A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 949–960, New York, NY, USA, 2011. ACM. [33] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM. [34] J. B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In OPERATIONS RESEARCH, pages 377–387, 1988. [35] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD thesis, Madison, WI, USA, 1997. UMI Order No. GAX97-16074. [36] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY, USA, 3 edition, 2003. [37] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, SIGMOD ’79, pages 23–34, New York, NY, USA, 1979. ACM. [38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, 2010. [39] M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized optimization for the join ordering problem. The VLDB Journal, 6(3):191–208, Aug. 1997. [40] A. Swami. Optimization of large join queries: Combining heuristics and combinatorial techniques. SIGMOD Rec., 18(2):367–376, June 1989. [41] A. Swami and B. Iyer. A polynomial time algorithm for optimizing join queries. In Data Engineering, 1993. Proceedings. Ninth International Conference on, pages 345–354, 1993. [42] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–1005, 2010. [43] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 495–506, New York, NY, USA, 2010. ACM.
