HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid (Yale University), Kamil Bajda-Pawlikowski (Yale University), Daniel Abadi (Yale University), Avi Silberschatz (Yale University), Alexander Rasin (Brown University)
{azza,kbajda,dna,avi}@cs.yale.edu; [email protected]

ABSTRACT

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.

There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

1. INTRODUCTION

The analytical database market currently consists of $3.98 billion [25] of the $14.6 billion database software market [21] (27%) and is growing at a rate of 10.3% annually [25].
As business "best-practices" trend increasingly towards basing decisions off data and hard facts rather than instinct and theory, the corporate thirst for systems that can manage, process, and granularly analyze data is becoming insatiable. Venture capitalists are very much aware of this trend, and have funded no fewer than a dozen new companies in recent years that build specialized analytical data management software (e.g., Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kickfire, Dataupia, ParAccel, and Exasol), and continue to fund them, even in pressing economic times [18].

At the same time, the amount of data that needs to be stored and processed by analytical database systems is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical system and claiming data warehouses of size more than a petabyte [19].

Given the exploding data problem, all but three of the above mentioned analytical database start-ups deploy their DBMS on a shared-nothing architecture (a collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network). This architecture is widely believed to scale the best [17], especially if one takes hardware cost into account. Furthermore, data analysis workloads tend to consist of many large scan operations, multidimensional aggregations, and star schema joins, all of which are fairly easy to parallelize across nodes in a shared-nothing network. The analytical DBMS vendor leader, Teradata, uses a shared-nothing architecture. Oracle and Microsoft have recently announced shared-nothing analytical DBMS products in their Exadata(1) and Madison projects, respectively. For the purposes of this paper, we will call analytical DBMS systems that deploy on a shared-nothing architecture parallel databases(2).

(1) To be precise, Exadata is only shared-nothing in the storage layer.
(2) This is slightly different than textbook definitions of parallel databases, which sometimes include shared-memory and shared-disk architectures as well.

Parallel databases have been proven to scale really well into the tens of nodes (near-linear scalability is not uncommon). However, there are very few known parallel database deployments consisting of more than one hundred nodes, and, to the best of our knowledge, there exists no published deployment of a parallel database with nodes numbering into the thousands. There are a variety of reasons why parallel databases generally do not scale well into the hundreds of nodes. First, failures become increasingly common as one adds more nodes to a system, yet parallel databases tend to be designed with the assumption that failures are a rare event. Second, parallel databases generally assume a homogeneous array of machines, yet it is nearly impossible to achieve pure homogeneity at scale. Third, until recently, there have only been a handful of applications that required deployment on more than a few dozen nodes for reasonable performance, so parallel databases have not been tested at larger scales, and unforeseen engineering hurdles await.

As the data that needs to be analyzed continues to grow, the number of applications that require more than one hundred nodes is beginning to multiply. Some argue that MapReduce-based systems [8] are best suited for performing analysis at this scale since they were designed from the beginning to scale to thousands of nodes in a shared-nothing architecture, and have had proven success in Google's internal operations and on the TeraSort benchmark [7]. Despite being originally designed for a largely different application (unstructured text data processing), MapReduce (or one of its publicly available incarnations such as open source Hadoop [1]) can nonetheless be used to process structured data, and can do so at tremendous scale.
For example, Hadoop is being used to manage Facebook's 2.5 petabyte data warehouse [20].

Unfortunately, as pointed out by DeWitt and Stonebraker [9], MapReduce lacks many of the features that have proven invaluable for structured data analysis workloads (largely due to the fact that MapReduce was not originally designed to perform structured data analysis), and its immediate gratification paradigm precludes some of the long term benefits of first modeling and loading data before processing. These shortcomings can cause an order of magnitude slower performance than parallel databases [23].

Ideally, the scalability advantages of MapReduce could be combined with the performance and efficiency advantages of parallel databases to achieve a hybrid system that is well suited for the analytical DBMS market and can handle the future demands of data intensive applications. In this paper, we describe our implementation of and experience with HadoopDB, whose goal is to serve as exactly such a hybrid system. The basic idea behind HadoopDB is to use MapReduce as the communication layer above multiple nodes running single-node DBMS instances. Queries are expressed in SQL, translated into MapReduce by extending existing tools, and as much work as possible is pushed into the higher performing single-node databases.

One of the advantages of MapReduce relative to parallel databases not mentioned above is cost. There exists an open source version of MapReduce (Hadoop) that can be obtained and used without cost. Yet all of the parallel databases mentioned above have a nontrivial cost, often coming with seven figure price tags for large installations. Since it is our goal to combine all of the advantages of both data analysis approaches in our hybrid system, we decided to build our prototype completely out of open source components in order to achieve the cost advantage as well. Hence, we use PostgreSQL as the database layer, Hadoop as the communication layer, Hive as the translation layer, and all code we add we release as open source [2].
One side effect of such a design is a shared-nothing version of PostgreSQL. We are optimistic that our approach has the potential to help transform any single-node DBMS into a shared-nothing parallel database.

Given our focus on cheap, large scale data analysis, our target platform is virtualized public or private "cloud computing" deployments, such as Amazon's Elastic Compute Cloud (EC2) or VMware's private VDC-OS offering. Such deployments significantly reduce up-front capital costs, in addition to lowering operational, facilities, and hardware costs (through maximizing current hardware utilization). Public cloud offerings such as EC2 also yield tremendous economies of scale [14], and pass on some of these savings to the customer. All experiments we run in this paper are on Amazon's EC2 cloud offering; however, our techniques are applicable to non-virtualized cluster computing grid deployments as well.

In summary, the primary contributions of our work include:

• We extend previous work [23] that showed the superior performance of parallel databases relative to Hadoop. While this previous work focused only on performance in an ideal setting, we add fault tolerance and heterogeneous node experiments to demonstrate some of the issues with scaling parallel databases.

• We describe the design of a hybrid system that is designed to yield the advantages of both parallel databases and MapReduce. This system can also be used to allow single-node databases to run in a shared-nothing environment.

• We evaluate this hybrid system on a previously published benchmark to determine how close it comes to parallel DBMSs in performance and Hadoop in scalability.

2. RELATED WORK

There has been some recent work on bringing together ideas from MapReduce and database systems; however, this work focuses mainly on language and interface issues. The Pig project at Yahoo [22], the SCOPE project at Microsoft [6], and the open source Hive project [11] aim to integrate declarative query constructs from the database community into MapReduce-like software to allow greater data independence, code reusability, and automatic query optimization. Greenplum and Aster Data have added the ability to write MapReduce functions (instead of, or in addition to, SQL) over data stored in their parallel database products [16].

Although these five projects are without question an important step in the hybrid direction, there remains a need for a hybrid solution at the systems level in addition to at the language and interface levels. This paper focuses on such a systems-level hybrid.

3. DESIRED PROPERTIES

In this section we describe the desired properties of a system designed for performing data analysis at the (soon to be more common) petabyte scale. In the following section, we discuss how parallel database systems and MapReduce-based systems do not meet some subset of these desired properties.

Performance. Performance is the primary characteristic that commercial database systems use to distinguish themselves from other solutions, with marketing literature often filled with claims that a particular solution is many times faster than the competition. A factor of ten can make a big difference in the amount, quality, and depth of analysis a system can do.

High performance systems can also sometimes result in cost savings. Upgrading to a faster software product can allow a corporation to delay a costly hardware upgrade, or avoid buying additional compute nodes as an application continues to scale. On public cloud computing platforms, pricing is structured in a way such that one pays only for what one uses, so the vendor price increases linearly with the requisite storage, network bandwidth, and compute power. Hence, if data analysis software product A requires an order of magnitude more compute units than data analysis software product B to perform the same task, then product A will cost (approximately) an order of magnitude more than B. Efficient software has a direct effect on the bottom line.

Fault Tolerance. Fault tolerance in the context of analytical data workloads is measured differently than fault tolerance in the context of transactional workloads. For transactional workloads, a fault tolerant DBMS can recover from a failure without losing any data or updates from recently committed transactions, and, in the context of distributed databases, can successfully commit transactions and make progress on a workload even in the face of worker node failures. For read-only queries in analytical workloads, there are neither write transactions to commit, nor updates to lose upon node failure. Hence, a fault tolerant analytical DBMS is simply one that does not have to restart a query if one of the nodes involved in query processing fails.

Given the proven operational benefits and resource consumption savings of using cheap, unreliable commodity hardware to build a shared-nothing cluster of machines, and the trend towards extremely low-end hardware in data centers [14], the probability of a node failure occurring during query processing is increasing rapidly. This problem only gets worse at scale: the larger the amount of data that needs to be accessed for analytical queries, the more nodes are required to participate in query processing. This further increases the probability of at least one node failing during query execution. Google, for example, reports an average of 1.2 failures per analysis job [8]. If a query must restart each time a node fails, then long, complex queries are difficult to complete.
Ability to run in a heterogeneous environment. As described above, there is a strong trend towards increasing the number of nodes that participate in query execution. It is nearly impossible to get homogeneous performance across hundreds or thousands of compute nodes, even if each node runs on identical hardware or on an identical virtual machine. Part failures that do not cause complete node failure, but result in degraded hardware performance, become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. On virtualized machines, concurrent activities performed by different virtual machines located on the same physical machine can cause 2-4% variation in performance [5].

If the amount of work needed to execute a query is equally divided among the nodes in a shared-nothing cluster, then there is a danger that the time to complete the query will be approximately equal to the time for the slowest compute node to complete its assigned task. A node with degraded performance would thus have a disproportionate effect on total query time. A system designed to run in a heterogeneous environment must take appropriate measures to prevent this from occurring.

Flexible query interface. There are a variety of customer-facing business intelligence tools that work with database software and aid in the visualization, query generation, result dash-boarding, and advanced data analysis. These tools are an important part of the analytical data management picture since business analysts are often not technically advanced and do not feel comfortable interfacing with the database software directly. Business intelligence tools typically connect to databases using ODBC or JDBC, so databases that want to work with these tools must accept SQL queries through these interfaces.

Ideally, the data analysis system should also have a robust mechanism for allowing the user to write user defined functions (UDFs), and queries that utilize UDFs should automatically be parallelized across the processing nodes in the shared-nothing cluster. Thus, both SQL and non-SQL interface languages are desirable.
4. BACKGROUND AND SHORTFALLS OF AVAILABLE APPROACHES

In this section, we give an overview of the parallel database and MapReduce approaches to performing data analysis, and list the properties described in Section 3 that each approach meets.

4.1 Parallel DBMSs

Parallel database systems stem from research performed in the late 1980s, and most current systems are designed similarly to the early Gamma [10] and Grace [12] parallel DBMS research projects. These systems all support standard relational tables and SQL, and implement many of the performance enhancing techniques developed by the research community over the past few decades, including indexing, compression (and direct operation on compressed data), materialized views, result caching, and I/O sharing. Most (or even all) tables are partitioned over multiple nodes in a shared-nothing cluster; however, the mechanism by which data is partitioned is transparent to the end-user. Parallel databases use an optimizer tailored for distributed workloads that turns SQL commands into a query plan whose execution is divided equally among multiple nodes.

Of the desired properties of large scale data analysis workloads described in Section 3, parallel databases best meet the "performance property" due to the performance push required to compete on the open market, and the ability to incorporate decades worth of performance tricks published in the database research community. Parallel databases can achieve especially high performance when administered by a highly skilled DBA who can carefully design, deploy, tune, and maintain the system, but recent advances in automating these tasks and bundling the software into appliance (pre-tuned and pre-configured) offerings have given many parallel databases high performance out of the box.

Parallel databases also score well on the flexible query interface property. Implementation of SQL and ODBC is generally a given, and many parallel databases allow UDFs (although the ability for the query planner and optimizer to parallelize UDFs well over a shared-nothing cluster varies across different implementations).

However, parallel databases generally do not score well on the fault tolerance and ability to operate in a heterogeneous environment properties. Although particular details of parallel database implementations vary, their historical assumptions that failures are rare events and that "large" clusters mean dozens of nodes (instead of hundreds or thousands) have resulted in engineering decisions that make it difficult to achieve these properties.

Furthermore, in some cases, there is a clear tradeoff between fault tolerance and performance, and parallel databases tend to choose the performance extreme of these tradeoffs. For example, frequent check-pointing of completed sub-tasks increases the fault tolerance of long-running read queries, yet this check-pointing reduces performance. In addition, pipelining intermediate results between query operators can improve performance, but can result in a large amount of work being lost upon a failure.

4.2 MapReduce

MapReduce was introduced by Dean et al. in 2004 [8]. Understanding the complete details of how MapReduce works is not a necessary prerequisite for understanding this paper. In short, MapReduce processes data distributed (and replicated) across many nodes in a shared-nothing cluster via three basic operations. First, a set of Map tasks are processed in parallel by each node in the cluster without communicating with other nodes. Next, data is repartitioned across all nodes of the cluster. Finally, a set of Reduce tasks are executed in parallel by each node on the partition it receives. This can be followed by an arbitrary number of additional Map-repartition-Reduce cycles as necessary. MapReduce does not create a detailed query execution plan that specifies which nodes will run which tasks in advance; instead, this is determined at runtime. This allows MapReduce to adjust to node failures and slow nodes on the fly by assigning more tasks to faster nodes and reassigning tasks from failed nodes. MapReduce also checkpoints the output of each Map task to local disk in order to minimize the amount of work that has to be redone upon a failure.

Of the desired properties of large scale data analysis workloads, MapReduce best meets the fault tolerance and ability to operate in heterogeneous environment properties. It achieves fault tolerance by detecting and reassigning Map tasks of failed nodes to other nodes in the cluster (preferably nodes with replicas of the input data). It achieves the ability to operate in a heterogeneous environment via redundant task execution. Tasks that are taking a long time to complete on slow nodes get redundantly executed on other nodes that have completed their assigned tasks. The time to complete the task becomes equal to the time for the fastest node to complete the redundantly executed task. By breaking tasks into small, granular tasks, the effect of faults and "straggler" nodes can be minimized.

MapReduce has a flexible query interface; Map and Reduce functions are just arbitrary computations written in a general-purpose language. Therefore, it is possible for each task to do anything on its input, just as long as its output follows the conventions defined by the model. In general, most MapReduce-based systems (such as Hadoop, which directly implements the systems-level details of the MapReduce paper) do not accept declarative SQL. However, there are some exceptions (such as Hive).

As shown in previous work, the biggest issue with MapReduce is performance [23]. By not requiring the user to first model and load data before processing, many of the performance enhancing tools listed above that are used by database systems are not possible. Traditional business data analytical processing, which has standard reports and many repeated queries, is particularly poorly suited for the one-time query processing model of MapReduce.

Ideally, the fault tolerance and ability to operate in heterogeneous environment properties of MapReduce could be combined with the performance of parallel database systems. In the following sections, we will describe our attempt to build such a hybrid system.

5. HADOOPDB

In this section, we describe the design of HadoopDB. The goal of this design is to achieve all of the properties described in Section 3. The basic idea behind HadoopDB is to connect multiple single-node database systems using Hadoop as the task coordinator and network communication layer. Queries are parallelized across nodes using the MapReduce framework; however, as much of the single-node query work as possible is pushed inside of the corresponding node databases. HadoopDB achieves fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job tracking implementation from Hadoop, yet it achieves the performance of parallel databases by doing much of the query processing inside of the database engine.

[Figure 1: The Architecture of HadoopDB. An SQL query is turned into MapReduce jobs by the SMS Planner and submitted to the Hadoop core: a master node running the HDFS NameNode and the MapReduce JobTracker, plus the Catalog, Data Loader, and InputFormat Implementations (including the Database Connector). Each worker node (Node 1 through Node n) runs a TaskTracker, an HDFS DataNode, and a local database, with tasks reading data through an InputFormat.]

5.1 Hadoop Implementation Background

At the heart of HadoopDB is the Hadoop framework. Hadoop consists of two layers: (i) a data storage layer, the Hadoop Distributed File System (HDFS), and (ii) a data processing layer, the MapReduce Framework.

HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas.

The MapReduce Framework follows a simple master-slave architecture. The master is a single JobTracker and the slaves or worker nodes are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources. Each job is broken down into Map tasks, based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing. It achieves locality by matching a TaskTracker to Map tasks that process data local to it. It load-balances by ensuring all available TaskTrackers are assigned tasks. TaskTrackers regularly update the JobTracker with their status through heartbeat messages.

The InputFormat library represents the interface between the storage and processing layers. InputFormat implementations parse text/binary files (or connect to arbitrary data sources) and transform the data into key-value pairs that Map tasks can process. Hadoop provides several InputFormat implementations, including one that allows a single JDBC-compliant database to be accessed by all tasks in one job in a given cluster.
5.2 HadoopDB's Components

HadoopDB extends the Hadoop framework (see Fig. 1) by providing the following four components:

5.2.1 Database Connector

The Database Connector is the interface between independent database systems residing on nodes in the cluster and TaskTrackers. It extends Hadoop's InputFormat class and is part of the InputFormat Implementations library. Each MapReduce job supplies the Connector with an SQL query and connection parameters such as which JDBC driver to use, the query fetch size, and other query tuning parameters. The Connector connects to the database, executes the SQL query, and returns results as key-value pairs. The Connector could theoretically connect to any JDBC-compliant database that resides in the cluster. However, different databases require different read query optimizations. We implemented connectors for MySQL and PostgreSQL. In the future we plan to integrate other databases, including open-source column-store databases such as MonetDB and InfoBright. By extending Hadoop's InputFormat, we integrate seamlessly with Hadoop's MapReduce Framework. To the framework, the databases are data sources similar to data blocks in HDFS.
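To make the Connector's role concrete, the following is a minimal sketch, in Java against Hadoop's old (0.19-era) mapred API, of a record reader that wraps a JDBC query and streams rows to a Map task as key-value pairs. It is illustrative only: the class name, connection parameters, and the choice of encoding each row as delimited text are our own assumptions, not HadoopDB's actual implementation, which returns richer record objects.

import java.io.IOException;
import java.sql.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical reader: runs one SQL query against a local database chunk and
// hands each row to the Map task as (row number, delimited row text).
public class ChunkRecordReader implements RecordReader<LongWritable, Text> {
  private final Connection conn;
  private final ResultSet rs;
  private final int numCols;
  private long row = 0;

  public ChunkRecordReader(String jdbcUrl, String user, String pass,
                           String pushedSql, int fetchSize) throws SQLException {
    conn = DriverManager.getConnection(jdbcUrl, user, pass);
    Statement stmt = conn.createStatement();
    stmt.setFetchSize(fetchSize);      // fetch-size hint taken from the job parameters
    rs = stmt.executeQuery(pushedSql); // the SQL pushed down by the MapReduce job
    numCols = rs.getMetaData().getColumnCount();
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    try {
      if (!rs.next()) return false;
      StringBuilder sb = new StringBuilder();
      for (int i = 1; i <= numCols; i++) {
        if (i > 1) sb.append('|');
        sb.append(rs.getString(i));
      }
      key.set(row++);
      value.set(sb.toString());
      return true;
    } catch (SQLException e) { throw new IOException(e.toString()); }
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return row; }
  public float getProgress() { return 0.0f; }  // unknown without a row count
  public void close() throws IOException {
    try { rs.close(); conn.close(); }
    catch (SQLException e) { throw new IOException(e.toString()); }
  }
}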
5.2.2 Catalog

The catalog maintains metainformation about the databases. This includes the following: (i) connection parameters such as database location, driver class, and credentials, and (ii) metadata such as the data sets contained in the cluster, replica locations, and data partitioning properties.

The current implementation of the HadoopDB catalog stores its metainformation as an XML file in HDFS. This file is accessed by the JobTracker and TaskTrackers to retrieve information necessary to schedule tasks and process data needed by a query. In the future, we plan to deploy the catalog as a separate service that would work in a way similar to Hadoop's NameNode.

5.2.3 Data Loader

The Data Loader is responsible for (i) globally repartitioning data on a given partition key upon loading, (ii) breaking apart single-node data into multiple smaller partitions, or chunks, and (iii) finally bulk-loading the single-node databases with the chunks.

The Data Loader consists of two main components: the Global Hasher and the Local Hasher. The Global Hasher executes a custom-made MapReduce job over Hadoop that reads in raw data files stored in HDFS and repartitions them into as many parts as the number of nodes in the cluster. The repartitioning job does not incur the sorting overhead of typical MapReduce jobs.

The Local Hasher then copies a partition from HDFS into the local file system of each node and secondarily partitions the file into smaller-sized chunks based on the maximum chunk size setting.

The hashing functions used by the Global Hasher and the Local Hasher differ to ensure that chunks are of a uniform size. They also differ from Hadoop's default hash-partitioning function to ensure better load balancing when executing MapReduce jobs over the data.
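As an illustration of this two-level partitioning, the sketch below hashes a record's partition key first to a node and then to a chunk within that node, using two deliberately different hash functions. The specific function choices (Java's String.hashCode and an FNV-style mix) and the class name are our own assumptions for illustration; the paper only states that the two hashers use different functions and that both differ from Hadoop's default partitioner.

// Hypothetical sketch of the two-level hashing applied at load time.
public final class TwoLevelHasher {
  // Global Hasher: picks the node that will own this key.
  public static int nodeFor(String partitionKey, int numNodes) {
    return (partitionKey.hashCode() & Integer.MAX_VALUE) % numNodes;
  }

  // Local Hasher: picks a chunk within a node, using a different hash
  // function so that chunk sizes stay uniform within each node partition.
  public static int chunkFor(String partitionKey, int chunksPerNode) {
    int h = 0x811C9DC5;                      // FNV-1a style mixing
    for (int i = 0; i < partitionKey.length(); i++) {
      h ^= partitionKey.charAt(i);
      h *= 0x01000193;
    }
    return (h & Integer.MAX_VALUE) % chunksPerNode;
  }

  public static void main(String[] args) {
    String url = "http://example.com/page";  // hypothetical partition key value
    System.out.println("node  = " + nodeFor(url, 100));
    System.out.println("chunk = " + chunkFor(url, 20)); // e.g. 20 x 1GB chunks per node (Section 6.2.1)
  }
}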
5.2.4 SQL to MapReduce to SQL (SMS) Planner

HadoopDB provides a parallel database front-end to data analysts, enabling them to process SQL queries.

The SMS planner extends Hive [11]. Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS. The MapReduce jobs consist of DAGs of relational operators (such as filter, select (project), join, and aggregation) that operate as iterators: each operator forwards a data tuple to the next operator after processing it. Since each table is stored as a separate file in HDFS, Hive assumes no collocation of tables on nodes. Therefore, operations that involve multiple tables usually require most of the processing to occur in the Reduce phase of a MapReduce job. This assumption does not completely hold in HadoopDB, as some tables are collocated and, if they are partitioned on the same attribute, the join operation can be pushed entirely into the database layer.

To understand how we extended Hive for SMS, as well as the differences between Hive and SMS, we first describe how Hive creates an executable MapReduce job for a simple GroupBy-Aggregation query. Then, we describe how we modify the execution plan for HadoopDB by pushing most of the query processing logic into the database layer.

Consider the following query:

SELECT YEAR(saleDate), SUM(revenue)
FROM sales GROUP BY YEAR(saleDate);

Hive processes the above SQL query in a series of phases:

(1) The parser transforms the query into an Abstract Syntax Tree.

(2) The Semantic Analyzer connects to Hive's internal catalog, the MetaStore, to retrieve the schema of the sales table. It also populates different data structures with meta information such as the Deserializer and InputFormat classes required to scan the table and extract the necessary fields.

(3) The logical plan generator then creates a DAG of relational operators, the query plan.

(4) The optimizer restructures the query plan to create a more optimized plan. For example, it pushes filter operators closer to the table scan operators. A key function of the optimizer is to break up the plan into Map or Reduce phases. In particular, it adds a Repartition operator, also known as a Reduce Sink operator, before Join or GroupBy operators. These operators mark the Map and Reduce phases of a query plan. The Hive optimizer is a simple, naïve, rule-based optimizer. It does not use cost-based optimization techniques. Therefore, it does not always generate efficient query plans. This is another advantage of pushing as much as possible of the query processing logic into DBMSs that have more sophisticated, adaptive or cost-based optimizers.

(5) Finally, the physical plan generator converts the logical query plan into a physical plan executable by one or more MapReduce jobs. The first and every other Reduce Sink operator marks a transition from a Map phase to a Reduce phase of a MapReduce job, and the remaining Reduce Sink operators mark the start of new MapReduce jobs. The above SQL query results in a single MapReduce job with the physical query plan illustrated in Fig. 2(a). The boxes stand for the operators and the arrows represent the flow of data.

(6) Each DAG enclosed within a MapReduce job is serialized into an XML plan. The Hive driver then executes a Hadoop job. The job reads the XML plan and creates all the necessary operator objects that scan data from a table in HDFS, and parse and process one tuple at a time.

[Figure 2: (a) The MapReduce job generated by Hive for the query above; (b) the MapReduce job generated by SMS assuming sales is partitioned by YEAR(saleDate) (this feature is still unsupported); (c) the MapReduce job generated by SMS assuming no partitioning of sales. Each plan is a DAG of operators (Table Scan, Select, Group By, Reduce Sink, and a File Sink writing to file=annual_sales_revenue), split into Map and Reduce phases.]

The SMS planner modifies Hive. In particular, we intercept the normal Hive flow in two main areas:

(i) Before any query execution, we update the MetaStore with references to our database tables. Hive allows tables to exist externally, outside HDFS. The HadoopDB catalog (Section 5.2.2) provides information about the table schemas and the required Deserializer and InputFormat classes to the MetaStore. We implemented these specialized classes.

(ii) After the physical query plan generation and before the execution of the MapReduce jobs, we perform two passes over the physical plan. In the first pass, we retrieve the data fields that are actually processed by the plan and we determine the partitioning keys used by the Reduce Sink (Repartition) operators. In the second pass, we traverse the DAG bottom-up from the table scan operators to the output or File Sink operator. All operators until the first repartition operator with a partitioning key different from the database's key are converted into one or more SQL queries and pushed into the database layer. SMS uses a rule-based SQL generator to recreate SQL from the relational operators. The query processing logic that can be pushed into the database layer ranges from none (each table is scanned independently and tuples are pushed one at a time into the DAG of operators) to all (only a Map task is required to output the results into an HDFS file).
Given the above GroupBy query, SMS produces one of two different plans. If the sales table is partitioned by YEAR(saleDate), it produces the query plan in Fig. 2(b): this plan pushes the entire query processing logic into the database layer, and only a Map task is required to output results into an HDFS file. Otherwise, SMS produces the query plan in Fig. 2(c), in which the database layer partially aggregates data and eliminates the selection and group-by operators used in the Map phase of the Hive-generated query plan (Fig. 2(a)). The final aggregation step in the Reduce phase of the MapReduce job, however, is still required in order to merge partial results from each node.

For join queries, Hive assumes that tables are not collocated. Therefore, the Hive-generated plan scans each table independently and computes the join after repartitioning data by the join key. In contrast, if the join key matches the database partitioning key, SMS pushes the entire join sub-tree into the database layer.

So far, we only support filter, select (project), and aggregation operators. Currently, the partitioning features supported by Hive are extremely naïve and do not support expression-based partitioning. Therefore, we cannot detect whether the sales table is partitioned by YEAR(saleDate) or not, and we have to make the pessimistic assumption that the data is not partitioned by this attribute. The Hive build [15] we extended is a little buggy; as explained in Section 6.2.5, it fails to execute the join task used in our benchmark, even when running over HDFS tables(3). However, we use the SMS planner to automatically push SQL queries into HadoopDB's DBMS layer for all other benchmark queries presented in our experiments for this paper.

(3) The Hive team resolved these issues in June, after we completed the experiments. We plan to integrate the latest Hive with the SMS Planner.
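To make the two SMS outcomes concrete, the snippet below spells out, under our own assumptions, the kind of SQL the planner would hand to the Database Connector in each case. The strings are illustrative only: they keep the paper's YEAR() syntax, whereas an actual push-down into PostgreSQL would need the equivalent EXTRACT expression, and the class and column alias names are hypothetical.

public class SmsPushdownExamples {
  // Fig. 2(b): sales is partitioned by YEAR(saleDate), so the whole query runs
  // inside each node database and the Map task only writes results to HDFS.
  static final String FULLY_PUSHED =
      "SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);";

  // Fig. 2(c): partitioning is unknown, so each database returns partial sums
  // and a single Reduce phase in Hadoop merges the partial results per year.
  static final String PARTIALLY_PUSHED =
      "SELECT YEAR(saleDate) AS saleYear, SUM(revenue) AS partialRevenue "
    + "FROM sales GROUP BY YEAR(saleDate);";

  public static void main(String[] args) {
    System.out.println(FULLY_PUSHED);
    System.out.println(PARTIALLY_PUSHED);
  }
}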
5.3 Summary

HadoopDB does not replace Hadoop. Both systems coexist, enabling the analyst to choose the appropriate tools for a given dataset and task. Through the performance benchmarks in the following sections, we show that using an efficient database storage layer cuts down on data processing time, especially on tasks that require complex query processing over structured data, such as joins. We also show that HadoopDB is able to take advantage of the fault tolerance and the ability to run in heterogeneous environments that come naturally with Hadoop-style systems.

6. BENCHMARKS

In this section we evaluate HadoopDB, comparing it with a MapReduce implementation and two parallel database implementations, using a benchmark first presented in [23](4). This benchmark consists of five tasks. The first task is taken directly from the original MapReduce paper [8], whose authors claim it is representative of common MR tasks. The next four tasks are analytical queries designed to be representative of the traditional structured data analysis workloads that HadoopDB targets.

(4) We are aware of the writing law that references shouldn't be used as nouns. However, to save space, we use [23] not as a reference, but as a shorthand for "the SIGMOD 2009 paper by Pavlo et al."

We ran our experiments on Amazon EC2 "large" instances (zone: us-east-1b). Each instance has 7.5 GB memory, 4 EC2 Compute Units (2 virtual cores), 850 GB instance storage (2 x 420 GB plus a 10 GB root partition), and runs 64-bit Linux Fedora 8.

We observed that disk I/O performance on EC2 nodes was initially quite slow (25 MB/s). Consequently, we initialized some additional space on each node so that intermediate files and output of the tasks did not suffer from this initial write slow-down. Once disk space is initialized, subsequent writes are much faster (86 MB/s). Network speed is approximately 100-110 MB/s. We execute each task three times and report the average of the trials. The final results from all parallel database queries are piped from the shell command into a file. Hadoop and HadoopDB store results in Hadoop's distributed file system (HDFS). In this section, we only report results using trials where all nodes are available, operating correctly, and have no concurrent tasks during benchmark execution (we drop these requirements in Section 7). For each task, we benchmark performance on cluster sizes of 10, 50, and 100 nodes.

6.1 Benchmarked Systems

Our experiments compare the performance of Hadoop, HadoopDB (with PostgreSQL(5) as the underlying database), and two commercial parallel DBMSs.

(5) Initially, we experimented with MySQL (MyISAM storage layer). However, we found that while simple table scans are up to 30% faster, more complicated SQL queries are much slower due to the lack of clustered indices and poor join algorithms.

6.1.1 Hadoop

Hadoop is an open-source version of the MapReduce framework, implemented by directly following the ideas described in the original MapReduce paper, and is used today by dozens of businesses to perform data analysis [1]. For our experiments in this paper, we use Hadoop version 0.19.1 running on Java 1.6.0. We deployed the system with several changes to the default configuration settings. Data in HDFS is stored using 256MB data blocks instead of the default 64MB. Each MR executor ran with a maximum heap size of 1024MB. We allowed two Map instances and a single Reduce instance to execute concurrently on each node. We also allowed more buffer space for file read/write operations (132MB) and increased the sort buffer to 200MB with 100 concurrent streams for merging. Additionally, we set the number of parallel transfers run by Reduce during the shuffle phase and the number of worker threads for each TaskTracker's HTTP server to 50. These adjustments follow the guidelines on high-performance Hadoop clusters [13]. Moreover, we enabled task JVMs to be reused.

For each benchmark trial, we stored all input and output data in HDFS with no replication (we add replication in Section 7). After benchmarking a particular cluster size, we deleted the data directories on each node, then reformatted and reloaded HDFS to ensure uniform data distribution across all nodes.

We present results of both hand-coded Hadoop and Hive-coded Hadoop (i.e., Hadoop plans generated automatically via Hive's SQL interface). These separate results for Hadoop are displayed as split bars in the graphs. The bottom, colored segment of each bar represents the time taken by Hadoop when hand-coded, and the rest of the bar indicates the additional overhead resulting from the automatic plan-generation by Hive, and from operator function calls and dynamic data type resolution through Java's Reflection API for each tuple processed in Hive-coded jobs.
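For readers who want to reproduce a similar setup, the following sketch shows how these settings could be expressed programmatically. The property keys are our best guess at the Hadoop 0.19-era configuration names (block size, replication, and slot counts are normally set cluster-wide in hadoop-site.xml rather than per job), so treat the names and values as an assumption-laden illustration of the tuning described above, not as the authors' actual configuration files.

import org.apache.hadoop.mapred.JobConf;

public class BenchmarkHadoopConf {
  public static JobConf tunedConf() {
    JobConf conf = new JobConf(BenchmarkHadoopConf.class);
    conf.set("dfs.block.size", Long.toString(256L * 1024 * 1024));       // 256MB HDFS blocks
    conf.set("dfs.replication", "1");                                    // no replication (Section 7 adds it)
    conf.set("mapred.child.java.opts", "-Xmx1024m");                     // 1024MB heap per MR executor
    conf.set("mapred.tasktracker.map.tasks.maximum", "2");               // two concurrent Map slots
    conf.set("mapred.tasktracker.reduce.tasks.maximum", "1");            // one Reduce slot
    conf.set("io.file.buffer.size", Integer.toString(132 * 1024 * 1024)); // 132MB read/write buffers
    conf.set("io.sort.mb", "200");                                       // 200MB sort buffer
    conf.set("io.sort.factor", "100");                                   // merge 100 streams at once
    conf.set("mapred.reduce.parallel.copies", "50");                     // shuffle-phase parallel transfers
    conf.set("tasktracker.http.threads", "50");                          // TaskTracker HTTP server threads
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");                    // reuse task JVMs
    return conf;
  }
}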
6.1.2 HadoopDB

The Hadoop part of HadoopDB was configured identically to the description above, except for the number of concurrent Map tasks, which we set to one. Additionally, on each worker node, PostgreSQL version 8.2.5 was installed. We increased the memory used by the PostgreSQL shared buffers to 512 MB and the working memory size to 1 GB. We did not compress data in PostgreSQL.

Analogous to what we did for Hadoop, we present results of both hand-coded HadoopDB and SMS-coded HadoopDB (i.e., entire query plans created by HadoopDB's SMS planner). These separate results for HadoopDB are displayed as split bars in the graphs. The bottom, colored segment of each bar represents the time taken by HadoopDB when hand-coded, and the rest of the bar indicates the additional overhead of the SMS planner (e.g., SMS jobs need to serialize tuples retrieved from the underlying database and deserialize them before further processing in Hadoop).

6.1.3 Vertica

Vertica is a relatively new parallel database system (founded in 2005) [3] based on the C-Store research project [24]. Vertica is a column-store, which means that each attribute of each table is stored (and accessed) separately, a technique that has proven to improve performance for read-mostly workloads.

Vertica offers a "cloud" edition, which we used for the experiments in this paper. Vertica was also used in the performance study of previous work [23] on the same benchmark, so we configured Vertica identically to the previous experiments(6). The Vertica configuration is therefore as follows: all data is compressed; Vertica operates on compressed data directly; Vertica implements primary indexes by sorting the table by the indexed attribute; and none of Vertica's default configuration parameters were changed.

(6) In fact, we asked the same person who ran the queries for this previous work to run the same queries on EC2 for our paper.
This effect DBMS-X is the same commercial parallel row-oriented database is especially visible when loading UserVisits, where a single slow as was used for the benchmark in [23]. Since at the time of our node in the 100-node cluster pushed the overall load time to 4355 VLDB submission this DBMS did not offer a cloud edition, we seconds and to 2600 seconds on the 10-node cluster, despite the did not run experiments for it on EC2. However, since our Vertica average load time of only 1100 seconds per node. numbers were consistently 10-15% slower on EC2 than on the Wis- HadoopDB: We set the maximum chunk size to 1GB. Each chunk consin cluster presented in [23]7 (this result is expected since the is located in a separate PostgreSQL database within a node, and virtualization layer is known to introduce a performance overhead), processes SQL queries independently of other chunks. We report we reproduce the DBMS-X numbers from [23] on our figures as a the maximum node load time as the entire load time for both Grep best case performance estimate for DBMS-X if it were to be run on and UserVisits. EC2. Since the Grep dataset does not require any preprocessing and is 6.2 Performance and Scalability Benchmarks only 535MB of data per node, the entire data was loaded using the standard SQL COPY command into a single chunk on each node. The first benchmark task (the “Grep task”) requires each sys- The Global Hasher partitions the entire UserVisits dataset across tem to scan through a data set of 100-byte records looking for a all nodes in the cluster. Next, the Local Hasher on each node re- three character pattern. This is the only task that requires process- trieves a 20GB partition from HDFS and hash-partitions it into 20 ing largely unstructured data, and was originally included in the smaller chunks, 1GB each. Each chunk is then bulk-loaded using benchmark by the authors of [23] since the same task was included COPY. Finally, a clustered index on visitDate is created for each in the original MapReduce paper [8]. chunk. To explore more complex uses of the benchmarked systems, the The load time for UserVisits is broken down into several phases. benchmark includes four more analytical tasks related to log-file The first repartition carried out by Global Hasher is the most ex- analysis and HTML document processing. Three of these tasks op- pensive step in the process. It takes nearly half the total load time, erate on structured data; the final task operates on both structured 14,000 s. Of the remaining 16,000 s, locally partitioning the data and unstructured data. into 20 chunks takes 2500 s (15.6%), the bulk copy into tables takes The datasets used by these four tasks include a UserVisits table 5200 s (32.5%), creating clustered indices, which includes sort- meant to model log files of HTTP server traffic, a Documents table ing, takes 7100 s (44.4%), finally vacuuming the databases takes containing 600,000 randomly generated HTML documents, and a 1200 s (7.5%). All the steps after global repartitioning are executed Rankings table that contains some metadata calculated over the data in parallel on all nodes. We observed individual variance in load in the Documents table. The schema of the tables in the benchmark times. Some nodes required as little as 10,000 s to completely load data set is described in detail in [23]. In summary, the UserVisits UserVisits after global repartitioning was completed. 
6.2.1 Data Loading

We report load times for two data sets, Grep and UserVisits, in Fig. 3 and Fig. 4. While the Grep data is randomly generated and requires no preprocessing, UserVisits needs to be repartitioned by destinationURL and indexed by visitDate for all databases during the load in order to achieve better performance on analytical queries (Hadoop would not benefit from such repartitioning). We briefly describe the loading procedures for all systems:

Hadoop: We loaded each node with an unaltered UserVisits data file. HDFS automatically breaks the file into 256MB blocks and stores the blocks on a local DataNode. Since all nodes load their data in parallel, we report the maximum node load time from each cluster. Load time is greatly affected by stragglers. This effect is especially visible when loading UserVisits, where a single slow node in the 100-node cluster pushed the overall load time to 4355 seconds, and to 2600 seconds on the 10-node cluster, despite an average load time of only 1100 seconds per node.

HadoopDB: We set the maximum chunk size to 1GB. Each chunk is located in a separate PostgreSQL database within a node, and processes SQL queries independently of other chunks. We report the maximum node load time as the entire load time for both Grep and UserVisits.

Since the Grep dataset does not require any preprocessing and is only 535MB of data per node, the entire data set was loaded using the standard SQL COPY command into a single chunk on each node.

The Global Hasher partitions the entire UserVisits dataset across all nodes in the cluster. Next, the Local Hasher on each node retrieves a 20GB partition from HDFS and hash-partitions it into 20 smaller chunks, 1GB each. Each chunk is then bulk-loaded using COPY. Finally, a clustered index on visitDate is created for each chunk.

The load time for UserVisits is broken down into several phases. The first repartition, carried out by the Global Hasher, is the most expensive step in the process: it takes nearly half the total load time, 14,000 s. Of the remaining 16,000 s, locally partitioning the data into 20 chunks takes 2500 s (15.6%), the bulk copy into tables takes 5200 s (32.5%), creating clustered indices, which includes sorting, takes 7100 s (44.4%), and finally vacuuming the databases takes 1200 s (7.5%). All the steps after global repartitioning are executed in parallel on all nodes. We observed individual variance in load times; some nodes required as little as 10,000 s to completely load UserVisits after global repartitioning was completed.

Vertica: The loading procedure for Vertica is analogous to the one described in [23]. The loading time has improved since then because a newer version of Vertica (3.0) was used for these experiments. The key difference is that the bulk-load COPY command now runs on all nodes in the cluster completely in parallel.

DBMS-X: We report the total load time, including data compression and indexing, from [23].
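The per-chunk steps described for HadoopDB (bulk COPY, clustered index on visitDate, and vacuuming) could be scripted per node roughly as below. This is a sketch under our own assumptions: the database, table, index, and file names are hypothetical, the COPY form requires the chunk file to be readable by the local PostgreSQL server, and the CLUSTER statement uses the old 8.2-era "CLUSTER index ON table" syntax.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadUserVisitsChunk {
  public static void main(String[] args) throws Exception {
    Class.forName("org.postgresql.Driver");
    // One PostgreSQL database per 1GB chunk (20 per node, as in Section 6.2.1).
    Connection c = DriverManager.getConnection(
        "jdbc:postgresql://localhost/chunk0", "hadoopdb", "secret");
    Statement s = c.createStatement();
    s.execute("COPY uservisits FROM '/local/chunks/uservisits_chunk0.dat' WITH DELIMITER '|'");
    s.execute("CREATE INDEX uv_visitdate_idx ON uservisits (visitDate)");
    s.execute("CLUSTER uv_visitdate_idx ON uservisits");  // physically sort the chunk by visitDate
    s.execute("VACUUM ANALYZE uservisits");               // reclaim space and refresh statistics
    s.close();
    c.close();
  }
}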
clause WHERE the pushes planner SMS HadoopDB’s sim- (a [23] to identically executed was (hand-coded) Hadoop executed all Hadoop(Hive) and HadoopDB, DBMS-X, Vertica, fol- bytes, 10 first the in key unique a of consists record Each Hadoop, of features load parallel the DBMS-X, to contrast In seconds 1000 1200 1400 1600 iue3 odGe (0.5GB/node) Grep Load 3: Figure 200 400 600 800 0

* 92 10 nodes RMDt HR il IE‘%XYZ%’; LIKE field WHERE Data FROM 139 V ertica 47 DB-X 141 50 nodes HadoopDB 100 43 Hadoop

100 nodes 164

161 77 8 . iue4 odUeVst (20GB/node) UserVisits Load 4: Figure seconds 10000 15000 20000 25000 30000 35000 40000 45000 50000 5000 0 10 nodes V ertica DB-X 50 nodes HadoopDB h dnia SQL: identical the groups. unique 2,500,000 are there sourceIP, the group- entire on When the grouping groups. on unique ing When 2000 (so are calculated). cluster there be the prefix, can seven-character in aggregate nodes final different the intermediate between that requires exchanged task be this to tasks, results previous the the either sourceIP Unlike entire by the or column. grouped column table, sourceIP the UserVisits of the prefix seven-character in sourceIP each from which systems these low. of very otherwise costs is time network query Vertica increased when entire dominate and to the DBMS-X to scan due relative sequentially better mainly scales to index HadoopDB clustered need set. a the data of eliminates use this pageRank the on because on HadoopDB Hadoop 1GB. outperforms entire set the data containing table Rankings the data of 1GB performance. only HadoopDB’s process decreases to significantly tasks node over- Map per The twenty scheduling chunk). (col- UserVisits of MB 1GB head 50 corresponding only the is with chunk repartition located Rankings to Each Hasher pageURL. Local by and Rankings UserVisits Global and the pageURL causes Rankings destinationURL between relationship key us- eign from general Hadoop. benefit in outperform to however, Hence, able are systems, DBMSs column. parallel other pageRank the and the The HadoopDB on indices of file. clustered scan a ing complete in brute-force, a data performs all Hive) without and (with instances. PostgreSQL the into clauses This succeeds. function. predicate Reduce the a require if not pair does key/value task new a as pageRank etn l r-grgtdsm rmec otrSLinstance). PostgreSQL col- each (after Reduce from aggregation sums to final pre-aggregated the sent all perform then lecting that is Hadoop output of The inside jobs instances. PostgreSQL the into for gets aggregation which sum query) the sourceIP). performs larger (or prefix which the each function in Reduce field a to whole sent the the of (or characters seven field first sourceIP the and adRevenue the outputs function .. grgto Task Aggregation 6.2.4 EETsucI,SMaRvne RMUserVisits sourceIP; FROM BY 7); SUM(adRevenue) GROUP 1, sourceIP, SUBSTR(sourceIP, SELECT BY query: SUM(adRevenue) GROUP Larger 7), UserVisits 1, FROM SUBSTR(sourceIP, SELECT query: Smaller etc,DM-,HdoD,adHdo(ie l executed all Hadoop(Hive) and HadoopDB, DBMS-X, Vertica, generated adRevenue total the computing involves task next The of copy non-chunked additional, an maintain therefore, We, for- the destinationURL, UserVisits by partitioned is data Since Hadoop 6. Fig. in presented is system each of performance projection The and selection the pushes planner SMS HadoopDB’s h M lne o aopBpse h nieSLquery SQL entire the pushes HadoopDB for planner SMS The Map a [23]: to identically executed was (hand-coded) Hadoop Hadoop 100 nodes

The performance numbers for each benchmarked system are displayed in Fig. 7 and 8. Similar to the Grep task, this query is limited by reading data off of disk. Thus, both commercial systems benefit from compression and outperform HadoopDB and Hadoop.

Figure 7: Large Aggregation Task
Figure 8: Small Aggregation Task

We observe a reversal of the general rule that Hive adds an overhead cost to hand-coded Hadoop in the "small" (substring) aggregation task (the time taken by Hive is represented by the lower part of the Hadoop bar in Fig. 8). Hive performs much better than Hadoop because it uses a hash aggregation execution strategy (it maintains an internal hash-aggregate map in the Map phase of the job), which proves to be optimal when there is a small number of groups. In the large aggregation task, Hive switches to sort-based aggregation upon detecting that the number of groups is more than half the number of input rows per block. In contrast, in our hand-coded Hadoop plan we (and the authors of [23]) failed to take advantage of hash aggregation for the smaller query (using Combiners) because sort-based aggregation is standard practice in MapReduce.

These results illustrate the benefit of exploiting optimizers present in database systems and relational query systems like Hive, which can use statistics from the system catalog or simple optimization rules to choose between hash aggregation and sort aggregation.

Unlike Hadoop's Combiner, Hive serializes partial aggregates into strings instead of maintaining them in their natural binary representation. Hence, Hive performs much worse than Hadoop on the larger query.

PostgreSQL chooses hash aggregation for both tasks, as it can easily fit the entire hash aggregate table for each 1GB chunk in memory. Hence, HadoopDB outperforms Hadoop on both tasks due to its efficient aggregation implementation.

This query is well-suited for systems that use column-oriented storage, since the two attributes accessed in this query (sourceIP and adRevenue) consist of only 20 out of the more than 200 bytes in each UserVisits tuple. Vertica is thus able to significantly outperform the other systems due to the commensurate I/O savings.
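To make the hash-versus-sort aggregation point above concrete, the sketch below shows the in-Map hash-aggregation pattern that corresponds to the strategy Hive uses for the 2000-group query: partial sums are kept in a hash map inside the Map task and flushed once in close(), so very little data passes through the sort-based shuffle. This is an illustrative sketch only, not Hive's implementation; it reuses the assumed record layout from the earlier sketch and can replace AggMapper there.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Drop-in replacement for AggMapper in the previous sketch: a hash map of
// prefix -> running sum is maintained per Map task instead of emitting one
// intermediate pair per input row.
public class HashAggMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {
  private final HashMap<String, Double> partial = new HashMap<String, Double>();
  private OutputCollector<Text, DoubleWritable> out;

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, DoubleWritable> collector, Reporter rep) {
    out = collector;                                   // remembered for close()
    String[] f = record.toString().split("\\|");       // assumed '|'-delimited layout
    String prefix = f[0].substring(0, Math.min(7, f[0].length()));
    Double sum = partial.get(prefix);
    partial.put(prefix, (sum == null ? 0.0 : sum) + Double.parseDouble(f[3]));
  }

  public void close() throws IOException {
    // Emit roughly 2000 partially aggregated pairs instead of one pair per row.
    for (Map.Entry<String, Double> e : partial.entrySet())
      out.collect(new Text(e.getKey()), new DoubleWritable(e.getValue()));
  }
}

The trade-off is that the in-memory map must accommodate the number of groups, which is why a hash strategy of this kind is abandoned for the 2,500,000-group variant of the query.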

6.2.5 Join Task

The join task involves finding the average pageRank of the set of pages visited from the sourceIP that generated the most revenue during the week of January 15-22, 2000. The key difference between this task and the previous tasks is that it must read in two different data sets and join them together (pageRank information is found in the Rankings table and revenue information is found in the UserVisits table). There are approximately 134,000 records in the UserVisits table that have a visitDate value inside the requisite date range.

Unlike the previous three tasks, we were unable to use the same SQL for the parallel databases and for the Hadoop-based systems. This is because the Hive build we extended was unable to execute this query. Although this build accepts a SQL query that joins and aggregates tuples from two tables, such a query fails during execution. Additionally, we noticed that the query plan for joins of this type uses a highly inefficient execution strategy; in particular, the filtering operation is planned after joining the tables. Hence, we are only able to present results for hand-coded Hadoop and HadoopDB for this query.

In HadoopDB, we push the selection, join, and partial aggregation into the PostgreSQL instances with the following SQL:

SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue)
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY UV.sourceIP;

We then use a single Reduce task in Hadoop that gathers all of the partial aggregates from each PostgreSQL instance to perform the final aggregation.
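That single final Reduce has to total the per-chunk partial aggregates for each sourceIP and then report the sourceIP with the highest total adRevenue together with its average pageRank. The sketch below shows one way this could be written against the old org.apache.hadoop.mapred API; it is not the authors' code, and the text encoding of the partial-aggregate values is an assumption made for illustration.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the single final Reduce: each input value is one partial aggregate
// row "count|sumPageRank|sumAdRevenue" produced by a PostgreSQL instance for
// the sourceIP given as the key. The reducer totals the partials per sourceIP,
// tracks the sourceIP with the highest total adRevenue, and emits it in close().
public class JoinFinalReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private OutputCollector<Text, Text> out;
  private String bestIP = null;
  private double bestRevenue = -1.0, bestAvgPageRank = 0.0;

  public void reduce(Text sourceIP, Iterator<Text> partials,
                     OutputCollector<Text, Text> collector, Reporter rep) {
    out = collector;                      // kept so close() can emit the winner
    long count = 0;
    double sumPageRank = 0.0, sumAdRevenue = 0.0;
    while (partials.hasNext()) {
      String[] p = partials.next().toString().split("\\|");  // assumed encoding
      count += Long.parseLong(p[0]);
      sumPageRank += Double.parseDouble(p[1]);
      sumAdRevenue += Double.parseDouble(p[2]);
    }
    if (sumAdRevenue > bestRevenue) {
      bestRevenue = sumAdRevenue;
      bestAvgPageRank = sumPageRank / count;
      bestIP = sourceIP.toString();
    }
  }

  public void close() throws IOException {
    // With a single reduce task, close() has seen every sourceIP and can
    // therefore emit the global maximum.
    if (bestIP != null)
      out.collect(new Text(bestIP), new Text(bestAvgPageRank + "\t" + bestRevenue));
  }
}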

The parallel databases all execute the SQL query specified in [23].

Although Hadoop has support for a join operator, this operator requires that both input datasets be sorted on the join key. Such a requirement limits the utility of the join operator since, in many cases, including the query above, the data is not already sorted, and performing a sort before the join adds significant overhead. We found that even if we sorted the input data (and did not include the sort time in the total query time), query performance using the join operator was lower than the performance of the three-phase MR program used in [23], which used standard 'Map' and 'Reduce' operators. Hence, for the MR program numbers we report below, we used an identical MR program as was used (and described in detail) in [23].

Fig. 9 summarizes the results of this benchmark task. For Hadoop, we observed similar results as found in [23]: its performance is limited by completely scanning the UserVisits dataset on each node in order to evaluate the selection predicate. HadoopDB, DBMS-X, and Vertica all achieve higher performance by using an index to accelerate the selection predicate and having native support for joins. These systems see slight performance degradation with a larger number of nodes due to the final single-node aggregation of, and sorting by, adRevenue.

Figure 9: Join Task

6.2.6 UDF Aggregation Task

The final task computes, for each document, the number of inward links from other documents that appear in the Documents table. URL links are extracted from every document and aggregated. HTML documents are concatenated into large files for Hadoop (256MB each) and Vertica (56MB each) at load time. HadoopDB was able to store each document separately in the Documents table using the TEXT data type. DBMS-X processed each HTML document file separately, as described below.

The parallel databases should theoretically be able to use a user-defined function, F, to parse the contents of each document and emit a list of all URLs found in the document. A temporary table would then be populated with this list of URLs, and a simple count/group-by query would be executed that finds the number of instances of each unique URL.

Unfortunately, [23] found that, in practice, it was difficult to implement such a UDF inside the parallel databases. In DBMS-X, it was impossible to store each document as a character BLOB inside the DBMS and have the UDF operate on it directly, due to "a known bug in [the] version of the system". Hence, the UDF was implemented inside the DBMS, but the data was stored in separate HTML documents on the raw file system and the UDF made external calls accordingly.

Vertica does not currently support UDFs, so a simple document parser had to be written in Java externally to the DBMS. This parser is executed on each node in parallel, parsing the concatenated documents file and writing the found URLs into a file on the local disk. This file is then loaded into a temporary table using Vertica's bulk-loading tools, and a second query is executed that counts, for each URL, the number of inward links.

In Hadoop, we employed the standard TextInputFormat and parsed each document inside a Map task, outputting a list of URLs found in each document. Both a Combine and a Reduce function sum the number of instances of each unique URL.

In HadoopDB, since text processing is more easily expressed in MapReduce, we decided to take advantage of HadoopDB's ability to accept queries in either SQL or MapReduce, and we used the latter option in this case. The complete contents of the Documents table on each PostgreSQL node is passed into Hadoop with the following SQL:

SELECT url, contents FROM Documents;

Next, we process the data using an MR job. In fact, we used identical MR code for both Hadoop and HadoopDB.
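The MR job itself is essentially URL counting. Below is a minimal sketch of what the Map, Combine, and Reduce functions could look like (old org.apache.hadoop.mapred API). It is not the code used in the experiments: the class names and the href-style regular expression for extracting links are assumptions, and how documents map onto input records (lines of a concatenated file for Hadoop versus rows returned by the SQL query for HadoopDB) is glossed over here.

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InlinkCount {
  // Map: emit (url, 1) for every link found in the document text in 'value'.
  public static class LinkMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Pattern HREF =
        Pattern.compile("href=\"(http[^\"]+)\"");   // assumed link syntax
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter rep)
        throws IOException {
      Matcher m = HREF.matcher(value.toString());
      while (m.find()) out.collect(new Text(m.group(1)), ONE);
    }
  }

  // Used as both the Combine and the Reduce function: sum counts per unique URL.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text url, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter rep)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(url, new LongWritable(sum));
    }
  }
}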

Fig. 10 illustrates the power of using a hybrid system like HadoopDB. The database layer provides an efficient storage layer for HTML text documents, and the MapReduce framework provides arbitrary processing expression power.

Hadoop outperforms HadoopDB as it processes merged files of multiple HTML documents. HadoopDB, however, does not lose the original structure of the data by merging many small files into larger ones. Note that the total merge time was about 6000 seconds per node; this overhead is not included in Fig. 10.

DBMS-X and Vertica perform worse than the Hadoop-based systems since the input files are stored outside of the database. Moreover, for this task both commercial databases do not scale linearly with the size of the cluster.

Figure 10: UDF Aggregation task

6.3 Summary of Results Thus Far

In the absence of failures or background processes, HadoopDB consistently outperforms Hadoop (except for the UDF aggregation task, since we did not count the data merging time against Hadoop). While HadoopDB's load time is about 10 times longer than Hadoop's, this cost is amortized across all queries that process this data. For certain tasks, such as the Join task, the factor of 10 load cost is immediately translated into a factor of 10 performance benefit.

HadoopDB is able to approach the performance of the parallel database systems. The reason the performance is not equal is due to the following facts: (1) PostgreSQL is not a column-store, (2) DBMS-X results are overly optimistic by approximately a factor of 15%, (3) we did not use data compression in PostgreSQL, and (4) there is some overhead in the interaction between Hadoop and PostgreSQL which gets proportionally larger as the number of chunks increases. We believe some of this overhead can be removed with additional engineering time.

7. FAULT TOLERANCE AND HETEROGENEOUS ENVIRONMENT

As described in Section 3, in large deployments of shared-nothing machines, individual nodes may experience high rates of failure or slowdown. While running our experiments for this research paper on EC2, we frequently experienced both node failure and node slowdown (e.g., some notifications we received: "4:12 PM PDT: We are investigating a localized issue in a single US-EAST Availability Zone. As a result, a small number of instances are unreachable. We are working to restore the instances.", and "Starting at 11:30PM PDT today, we will be performing maintenance on parts of the Amazon EC2 network. This maintenance has been planned to minimize the probability of impact to Amazon EC2 instances, but it is possible that some customers may experience a short period of elevated packet loss as the change takes effect.").

For parallel databases, query processing time is usually determined by the time it takes the slowest node to complete its task. In contrast, in MapReduce, each task can be scheduled on any node as long as input data is transferred to or already exists on a free node. Also, Hadoop speculatively executes redundant copies of tasks that are being performed on a straggler node to reduce the slow node's effect on query time.

Hadoop achieves fault tolerance by restarting tasks of failed nodes on other nodes. The JobTracker receives heartbeats from TaskTrackers. If a TaskTracker fails to communicate with the JobTracker for a preset period of time, the TaskTracker expiry interval, the JobTracker assumes failure and schedules all map/reduce tasks of the failed node on other TaskTrackers. This approach is different from most parallel databases, which abort unfinished queries upon a node failure and restart the entire query processing (using a replica node instead of the failed node).

By inheriting the scheduling and job tracking features of Hadoop, HadoopDB yields similar fault-tolerance and straggler handling properties as Hadoop.

To test the effectiveness of HadoopDB in failure-prone and heterogeneous environments in comparison to Hadoop and Vertica, we executed the aggregation query with 2000 groups (see Section 6.2.4) on a 10-node cluster and set the replication factor to two for all systems. For Hadoop and HadoopDB we set the TaskTracker expiry interval to 60 seconds. The following lists system-specific settings for the experiments.

Hadoop (Hive): HDFS managed the replication of data. HDFS replicated each block of data on a different node selected uniformly at random.

HadoopDB (SMS): As described in Section 6, each node contains twenty 1GB-chunks of the UserVisits table. Each of these 20 chunks was replicated on a different node selected at random.

Vertica: In Vertica, replication is achieved by keeping an extra copy of every table segment. Each table is hash partitioned across the nodes and a backup copy is assigned to another node based on a replication rule. On node failure, this backup copy is used until the lost segment is rebuilt.

For fault-tolerance tests, we terminated a node at 50% query completion. For Hadoop and HadoopDB, this is equivalent to failing a node when 50% of the scheduled Map tasks are done. For Vertica, this is equivalent to failing a node after 50% of the average query completion time for the given query.

To measure the percentage increase in query time in heterogeneous environments, we slow down a node by running an I/O-intensive background job that randomly seeks values from a large file and frequently clears OS caches. This file is located on the same disk where data for each system is stored.

We observed no differences in percentage slowdown between HadoopDB with or without SMS and between Hadoop with or without Hive. Therefore, we only report results of HadoopDB with SMS and Hadoop with Hive and refer to both systems as HadoopDB and Hadoop from now on.

The results of the experiments are shown in Fig. 11. Node failure caused HadoopDB and Hadoop to have smaller slowdowns than Vertica. Vertica's increase in total query execution time is due to the overhead associated with query abortion and complete restart.

Figure 11: Fault tolerance and heterogeneity experiments on 10 nodes. Bars show percentage slowdown, (f − n)/n × 100 for a single node failure and (s − n)/n × 100 for a single slow node, where n is the normal execution time. Measured times in seconds: Vertica n = 58, f = 133, s = 159; HadoopDB + SMS n = 755, f = 854, s = 954; Hadoop + Hive n = 1102, f = 1350, s = 1181.

In both HadoopDB and Hadoop, the tasks of the failed node are distributed over the remaining available nodes that contain replicas of the data. HadoopDB slightly outperforms Hadoop. In Hadoop, TaskTrackers assigned blocks not local to them will copy the data first (from a replica) before processing. In HadoopDB, however, processing is pushed into the (replica) database. Since the number of records returned after query processing is less than the raw size of the data, HadoopDB does not experience Hadoop's network overhead on node failure.

In an environment where one node is extremely slow, HadoopDB and Hadoop experience less than a 30% increase in total query execution time, while Vertica experiences more than a 170% increase in query running time. Vertica waits for the straggler node to complete processing. HadoopDB and Hadoop run speculative tasks on TaskTrackers that completed their tasks. Since the data is chunked (HadoopDB has 1GB chunks, Hadoop has 256MB blocks), multiple TaskTrackers concurrently process different replicas of unprocessed blocks assigned to the straggler. Thus, the delay due to processing those blocks is distributed across the cluster.

In our experiments, we discovered an assumption made by Hadoop's task scheduler that contradicts the HadoopDB model. In Hadoop, TaskTrackers will copy data not local to them from the straggler or the replica. HadoopDB, however, does not move PostgreSQL chunks to new nodes. Instead, the TaskTracker of the redundant task connects to either the straggler's database or the replica's database. If the TaskTracker connects to the straggler's database, the straggler needs to concurrently process an additional query, leading to further slowdown. Therefore, the same feature that causes HadoopDB to have slightly better fault tolerance than Hadoop causes a slightly higher percentage slowdown in heterogeneous environments for HadoopDB. We plan to modify the current task scheduler implementation to provide hints to speculative TaskTrackers to avoid connecting to a straggler node and to connect to replicas instead.
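For reference, the replication factor of two, the 60-second TaskTracker expiry interval, and speculative execution used in these tests map onto ordinary Hadoop configuration properties. The sketch below shows one way they could be set programmatically; the property names follow the Hadoop 0.19/0.20-era releases and are given here as an assumption, not as configuration taken from the paper (in practice such settings would normally live in the cluster's configuration files).

import org.apache.hadoop.mapred.JobConf;

// Sketch of the experiment settings described above, expressed as Hadoop
// configuration properties (names assumed from the Hadoop releases of that era).
public class ExperimentSettings {
  public static JobConf apply(JobConf conf) {
    conf.setInt("dfs.replication", 2);                           // two replicas of each block/chunk
    conf.setLong("mapred.tasktracker.expiry.interval", 60000L);  // 60 s heartbeat timeout, in ms
    conf.setBoolean("mapred.map.tasks.speculative.execution", true); // speculative (redundant) tasks
    return conf;
  }
}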
7.1 Discussion

It should be pointed out that although Vertica's percentage slowdown was larger than Hadoop and HadoopDB, its total query time (even with the failure or the slow node) was still lower than that of Hadoop or HadoopDB. Furthermore, Vertica's performance in the absence of failures is an order of magnitude faster than Hadoop and HadoopDB (mostly because its column-oriented layout of data is a big win for the small aggregation query). This order of magnitude of performance could be translated into the same performance as Hadoop and HadoopDB, but using an order of magnitude fewer nodes. Hence, failures and slow nodes become less likely for Vertica than for Hadoop and HadoopDB. Furthermore, eBay's 6.5 petabyte database (perhaps the largest known data warehouse worldwide as of June 2009) [4] uses only 96 nodes in a shared-nothing cluster. Failures are still reasonably rare at fewer than 100 nodes.

We argue that in the future, 1000-node clusters will be commonplace for production database deployments, and 10,000-node clusters will not be unusual. There are three trends that support this prediction. First, data production continues to grow faster than Moore's law (see Section 1). Second, it is becoming clear that from both a price/performance and (an increasingly important) power/performance perspective, many low-cost, low-power servers are far better than fewer heavy-weight servers [14]. Third, there is now, more than ever, a requirement to perform data analysis inside of the DBMS, rather than pushing data to external systems for analysis. Disk-heavy architectures such as the eBay 96-node DBMS do not have the necessary CPU horsepower for analytical workloads [4].

Hence, awaiting us in the future are heavy-weight analytic database jobs, requiring more time and more nodes. The probability of failure in these next generation applications will be far larger than it is today, and restarting entire jobs upon a failure will be unacceptable (failures might be common enough that long-running jobs never finish!). Thus, although Hadoop and HadoopDB pay a performance penalty for runtime scheduling, block-level restart, and frequent checkpointing, such an overhead to achieve robust fault tolerance will become necessary in the future. One feature of HadoopDB is that it can elegantly transition between both ends of the spectrum. Since one chunk is the basic unit of work, it can play in the high-performance/low-fault-tolerance space of today's workloads (like Vertica) by setting the chunk size to be infinite, or in the high-fault-tolerance space by using more granular chunks (like Hadoop). In future work, we plan to explore the fault-tolerance/performance tradeoff in more detail.

8. CONCLUSION

Our experiments show that HadoopDB is able to approach the performance of parallel database systems while achieving similar scores on fault tolerance, an ability to operate in heterogeneous environments, and software license cost as Hadoop. Although the performance of HadoopDB does not in general match the performance of parallel database systems, much of this was due to the fact that PostgreSQL is not a column-store and we did not use data compression in PostgreSQL. Moreover, Hadoop and Hive are relatively young open-source projects. We expect future releases to enhance performance. As a result, HadoopDB will automatically benefit from these improvements.

HadoopDB is therefore a hybrid of the parallel DBMS and Hadoop approaches to data analysis, achieving the performance and efficiency of parallel databases, yet still yielding the scalability, fault tolerance, and flexibility of MapReduce-based systems.

The ability of HadoopDB to directly incorporate Hadoop and open source DBMS software (without code modification) makes HadoopDB particularly flexible and extensible for performing data analysis at the large scales expected of future workloads.

9. ACKNOWLEDGMENTS

We'd like to thank Sergey Melnik and the three anonymous reviewers for their extremely insightful feedback on an earlier version of this paper, which we incorporated into the final version. We'd also like to thank Eric McCall for helping us get Vertica running on EC2. This work was sponsored by the NSF under grants IIS-0845643 and IIS-0844480.

Disclosure: Authors Daniel Abadi and Alexander Rasin have a small financial stake in Vertica due to their involvement in the predecessor C-Store project.

10. REFERENCES

[1] Hadoop. Web page. hadoop.apache.org/core/.
[2] HadoopDB Project. Web page. db.cs.yale.edu/hadoopdb/hadoopdb.html.
[3] Vertica. www.vertica.com/.
[4] D. Abadi. What is the right way to measure scale? DBMS Musings Blog. dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html.
[5] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proc. of SOSP, 2003.
[6] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and efficient parallel processing of massive data sets. In Proc. of VLDB, 2008.
[7] G. Czajkowski. Sorting 1PB with MapReduce. googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.
[8] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[9] D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. DatabaseColumn Blog. www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html.
[10] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, 1986.
[11] Facebook. Hive. Web page. issues.apache.org/jira/browse/HADOOP-3601.
[12] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, 1986.
[13] Hadoop Project. Hadoop Cluster Setup. Web page. hadoop.apache.org/core/docs/current/cluster_setup.html.
[14] J. Hamilton. Cooperative expendable micro-slice servers (CEMS): Low cost, low power servers for internet-scale services. In Proc. of CIDR, 2009.
[15] Hive Project. Hive SVN Repository. Accessed May 19th, 2009. svn.apache.org/viewvc/hadoop/hive/.
[16] J. N. Hoover. Start-Ups Bring Google's Parallel Processing To Data Warehousing. InformationWeek, August 29th, 2008.
[17] S. Madden, D. DeWitt, and M. Stonebraker. Database parallelism choices greatly impact scalability. DatabaseColumn Blog. www.databasecolumn.com/2007/10/database-parallelism-choices.html.
[18] M. Bawa. A $5.1M Addendum to our Series B. www.asterdata.com/blog/index.php/2009/02/25/a-51m-addendum-to-our-series-b/.
[19] C. Monash. The 1-petabyte barrier is crumbling. www.networkworld.com/community/node/31439.
[20] C. Monash. Cloudera presents the MapReduce bull case. DBMS2 Blog. www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/.
[21] C. Olofson. Worldwide RDBMS 2005 vendor shares. Technical Report 201692, IDC, May 2006.
[22] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc. of SIGMOD, 2008.
[23] A. Pavlo, A. Rasin, S. Madden, M. Stonebraker, D. DeWitt, E. Paulson, L. Shrinivas, and D. J. Abadi. A Comparison of Approaches to Large Scale Data Analysis. In Proc. of SIGMOD, 2009.
[24] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, 2005.
[25] D. Vesset. Worldwide data warehousing tools 2005 vendor shares. Technical Report 203229, IDC, August 2006.