Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology

Azza Abouzied (New York University Abu Dhabi) [email protected]
Daniel J. Abadi (University of Maryland, College Park) [email protected]
Kamil Bajda-Pawlikowski (Starburst Data) [email protected]
Avi Silberschatz (Yale University) [email protected]

ABSTRACT

In 2009 we explored the feasibility of building a hybrid SQL data analysis system that takes the best features from two competing technologies: large-scale data processing systems (such as Google MapReduce and Hadoop) and parallel database management systems (such as Greenplum and Vertica). We built a prototype, HadoopDB, and demonstrated that it can deliver the high SQL query performance and efficiency of parallel database management systems while still providing the scalability, fault tolerance, and flexibility of large-scale data processing systems. Subsequently, HadoopDB grew into a commercial product, Hadapt, whose technology was eventually acquired by Teradata. In this paper, we provide an overview of HadoopDB's original design, and its evolution during the subsequent ten years of research and development effort. We describe how the project innovated both in the research lab, and as a commercial product at Hadapt and Teradata. We then discuss the current vibrant ecosystem of software projects (most of which are open source) that continued HadoopDB's legacy of implementing a systems-level integration of large-scale data processing systems and parallel database technology.

PVLDB Reference Format:
Azza Abouzied, Daniel J. Abadi, Kamil Bajda-Pawlikowski, Avi Silberschatz. Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology. PVLDB, 12(12): 2290-2299, 2019.
DOI: https://doi.org/10.14778/3352063.3352145

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 12, ISSN 2150-8097. DOI: https://doi.org/10.14778/3352063.3352145

1. INTRODUCTION

In the first few years of this century, several papers were published on large-scale data processing systems: systems that partition large amounts of data over potentially thousands of machines and provide a straightforward language in which to express complex transformations and analyses of this data. The key feature of these systems is that the user does not have to be explicitly aware of how data is partitioned or how machines work together to process the transformations or analyses, yet these systems provide fault tolerant, parallel processing of user programs. Most notable of these efforts was a paper published in 2004 by Dean and Ghemawat that described Google's MapReduce framework for data processing on large clusters [28]. The MapReduce programming model for expressing data transformations, along with the underlying system that supported fault tolerant, parallel processing of these transformations, was at the time widely used across Google's many business operations, and subsequently became widely used across hundreds of thousands of other businesses through the open-source Hadoop implementation. Today, companies that package, distribute, support, and train companies to use Hadoop combine to form a multi-billion dollar industry.

MapReduce, along with other large-scale data processing systems such as Microsoft's Dryad/LINQ project [35, 47], were originally designed for processing unstructured data. One of their most famous use cases within Google and Microsoft was the creation of the indexes needed to power their respective Internet search capabilities—which requires processing large amounts of unstructured text found in Web pages. The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data [29].

For example, MapReduce provided fault-tolerant, parallel execution of only two simple functions¹: Map, which reads key-value pairs within a partition of a distributed file in parallel, applies a filter or transform to these local key-value pairs, and then outputs the result as key-value pairs; and Reduce, which reads the key-value pairs output by the Map function (after the system partitions the pairs across machines by hashing the keys), and performs some arbitrary per-key computation such as applying an aggregation function over all values associated with the same key. After performing the reduce function, the results are materialized and replicated to a distributed file system. The model presents several inefficiencies for parallel structured query processing, such as: (1) Complex SQL queries can require a large number of operators. Although it is possible to express these operators as a sequence of Map and Reduce functions, database systems are most efficient when they can pipeline data between operators. The forced materialization of intermediate data by MapReduce—especially when data is replicated to a distributed file system after each Reduce function—is extremely inefficient and slows down query processing. (2) MapReduce naturally provides support for one type of distributed join operation: the partitioned hash join. In parallel database systems, broadcast joins and co-partitioned joins—when eligible to be used—are frequently chosen by the query optimizer, since they can improve performance significantly. Unfortunately, no implementation of broadcast and co-partitioned joins fit naturally into the MapReduce programming model. (3) Optimizations for structured data at the storage level—such as column-orientation, compression in formats that can be operated on directly (without decompression), and indexing—were hard to leverage via the execution framework of the MapReduce model.

¹ This limitation is not shared by Dryad. Nonetheless, Hadoop implemented MapReduce instead of Dryad.
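To make the programming model concrete, the following is a minimal, self-contained sketch (plain Python, not Hadoop code) of how a per-key aggregation such as the one described above can be expressed as a Map function and a Reduce function. The revenue-per-year data, record layout, and in-memory "shuffle" are illustrative assumptions.

```python
from collections import defaultdict

# Map: read key-value pairs from one partition of the input, apply a local
# transform/filter, and emit intermediate key-value pairs.
def map_fn(record):
    yield (record["year"], record["revenue"])

# Reduce: receive all values shuffled to the same key and apply an arbitrary
# per-key computation (here, a SUM aggregation).
def reduce_fn(year, revenues):
    yield (year, sum(revenues))

def run_mapreduce(partitions):
    # The framework hashes Map output keys across machines; here the shuffle
    # is simulated with an in-memory dictionary.
    shuffled = defaultdict(list)
    for partition in partitions:
        for record in partition:
            for key, value in map_fn(record):
                shuffled[key].append(value)
    # After Reduce, results would be materialized and replicated to a
    # distributed file system.
    return [out for k, vs in shuffled.items() for out in reduce_fn(k, vs)]

sales = [[{"year": 2008, "revenue": 10.0}, {"year": 2009, "revenue": 5.0}],
         [{"year": 2009, "revenue": 7.5}]]
print(run_mapreduce(sales))  # [(2008, 10.0), (2009, 12.5)]
```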

Even as studies continued to find that Hadoop performed poorly on structured data processing tasks when compared to shared-nothing parallel DBMSs [40, 43], widely respected technical teams—such as the team at Facebook—continued to use Hadoop for traditional SQL data analysis workloads. Although it is impossible to fully explain the reasoning behind the continued popularity of Hadoop for structured data processing, possible explanations include the following:

• The parallel database industry had no free and open source equivalent to the thriving Hadoop community.

• Hadoop's adoption at well-known large Web companies, such as Yahoo, Facebook, and Twitter, gave it an additional level of credibility in terms of scalability and ability to run over massive, heterogeneous, shared-nothing clusters of commodity servers.

• Hadoop had a level of fault tolerance that was unmatched by even the most fault tolerant parallel database system. For example, Hadoop handled server failure in the middle of query processing without having to restart a query. As clusters scale, this level of fault tolerance becomes increasingly important.

• Many workloads contained a mix of structured and unstructured data processing. Using a single system for all types of query processing, that had the ability to parallelize user defined functions over unstructured data, was convenient and typically reduced data transfer and management costs.

The reasons behind Hadoop's popularity thus included both technical and non-technical considerations. However, the technical reasons that contributed to the rise of Hadoop often tended to be under-appreciated. HadoopDB was architected with the intention of taking Hadoop's technical contributions seriously. For example, Hadoop's level of fault tolerance during run-time query processing, its ability to handle heterogeneous clusters, and its ability to parallelize user defined functions were legitimate reasons behind Hadoop's success. HadoopDB was designed to be a hybrid system capable of achieving these advantages of Hadoop, while also achieving the high performance and efficiency of traditional parallel database systems on structured SQL queries.

In the next section, we give a technical overview of HadoopDB, according to the way it was described in the original paper. In Section 3 we describe how the project evolved in the research lab over the past decade after the original paper was published, as HadoopDB expanded to handle additional use-cases and workloads. In Section 4, we discuss the commercialization of HadoopDB and how the commercial product impacted the development of additional features and capabilities. In Section 5, we describe several other integration efforts of large-scale data processing systems with parallel database technology and the current commercial landscape and open-source software tools in this area. In Section 6, we conclude with a few ideas for future work.

2. HADOOPDB

Figure 1: The HadoopDB System Architecture [18]

HadoopDB placed a local DBMS on every node in the cluster (Figure 1). Structured data were stored in tables, sharded across these database systems. Data was indexed within each DBMS as appropriate. The original HadoopDB implementation used PostgreSQL as the database system on each node; however, the HadoopDB paper pointed out that improved performance could be gained via using column-store systems. This capability of using an underlying column-store system was added to the codebase shortly after the initial release, and its associated performance gains² were published in SIGMOD 2011 [22] as part of a larger HadoopDB follow-up paper. Regardless of whether HadoopDB used an underlying row-store or column-store, storing data in systems optimized for managing structured data enabled significant speedup in the map functions of MapReduce tasks over structured data. HadoopDB pushed filtering, projection, transformation, and even some join and partial aggregations into the database systems on each node.

² See also Figure 3, which we will discuss below, in which HadoopDB-VW leverages a column-oriented DBMS.

Figure 2 shows an example of how query processing work is pushed into the single node database systems. A SQL query is submitted to the system; in our example this query requests the total revenue per year from a sales table. The left side of the figure shows how this query may be converted to Map and Reduce operators in a MapReduce job. The leaf operator is a scan of the sales table. For each tuple, a projection operator (called Select) extracts the relevant attributes for this query (year and revenue). An aggregation operator (called Group By) groups all tuples with the same year and sums the total revenue within each group. Each of these operators—the scanning, the projecting of relevant attributes, and the aggregating—can be done on a per-partition basis in an embarrassingly parallel fashion. Therefore, systems that automate the conversion of our example SQL query to MapReduce tasks (such as the original version of Hive) will perform all these tasks inside Map functions. For efficiency, all three of these operations can be combined into a single Map task (instead of requiring one Map task per SQL operator), since each of these operators occur consecutively. The aggregation that is done inside the Map task is done on a per-partition basis. However, our example query required a global aggregation of revenue per year. Therefore, additional aggregation is necessary to complete the processing of this query, since each partition may contain tuples associated with the same year. A Reduce task is used for this global aggregation. Prior to the Reduce task, the aggregated data produced by the Map task is partitioned by year (so that all Map output associated with the same year ends up on the same physical machine), and then the aggregation is performed again in a Reduce task (which sums each intermediate sum associated with the same year). In truth, the aggregation done by the Reduce task is the only one that is necessary. The "pre-aggregation" done in the Map task could have been dropped without changing the semantics of the query. However, the network is often a bottleneck in large-scale data processing systems, and pre-aggregating data enables a reduction of the data that must be shipped over the network. Therefore, the pre-aggregation in the Map phase is added by systems such as Hive as an optimization.

Figure 2: Pushing Map functions into the DBMS

The right side of Figure 2 shows how the same query is performed in HadoopDB. The query processing operators that were performed in the Map task in the query plan from the left side of the figure are converted into a single operator that is pushed down into the shards of the database system and performed there as a series of traditional relational operators. The results of this query are then partitioned by year and the final aggregation performed inside a Reduce task, in the same way as the left side of the figure.
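As a rough illustration of this split, the sketch below shows the example query, the per-shard SQL that a HadoopDB-style system could push into each local database, and the final cross-partition aggregation left for the Reduce phase. The table and column names, and the Python driver, are illustrative assumptions rather than HadoopDB's actual planner output.

```python
# Logical query over the (assumed) sales(year, revenue) table:
QUERY = "SELECT year, SUM(revenue) FROM sales GROUP BY year"

# Pushed into each single-node DBMS shard: scan, projection, and
# per-partition pre-aggregation run as ordinary relational operators.
PER_SHARD_SQL = "SELECT year, SUM(revenue) AS partial_sum FROM sales GROUP BY year"

def reduce_final(partials):
    """Reduce task: sum the per-shard partial sums for each year."""
    totals = {}
    for year, partial_sum in partials:
        totals[year] = totals.get(year, 0) + partial_sum
    return sorted(totals.items())

# Example: partial results returned by the shards after running PER_SHARD_SQL.
shard_outputs = [(2008, 100.0), (2009, 40.0), (2009, 60.0), (2008, 5.0)]
print(reduce_final(shard_outputs))  # [(2008, 105.0), (2009, 100.0)]
```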

At a high level, HadoopDB was able to achieve the following desirable properties in a data processing framework:

1. Flexibility with minimal cognitive overload. Querying data in HadoopDB could be done in SQL, MapReduce, or combinations thereof. HadoopDB automatically pushed as much processing as possible into the underlying database systems, without forcing the user to be aware of where processing was being performed or how data was sharded.

2. Ability to run in a heterogeneous environment. By leveraging the Hadoop framework for scheduling operators within a query plan, HadoopDB overcame one of the key scalability hurdles found in parallel database systems that must run over thousands of machines. In such environments, it is impossible to achieve performance homogeneity across machines. Even if every single machine within the cluster consists of the exact same hardware, it is usually the case at that level of scale that at least one of the machines is running slower than the others—either because of malfunctioning hardware that allows the machine to stay alive, but at reduced performance, or because of some software issue that prevents the operating system or database system from giving the query the same level of resources as the other nodes in the cluster. In these heterogeneous conditions, there is a danger that the run-time of a task is lower-bounded by the run-time of the slowest node in the cluster. HadoopDB leveraged Hadoop to side-step such stragglers, by preemptively re-scheduling query operators on other nodes (usually a replica) within a cluster.

3. Fault-tolerance. Another scalability hurdle in parallel database systems is frequent machine failure. In traditional clusters containing fewer than a hundred machines, failure is a rare event—usually occurring less frequently than one failed machine per day. Therefore, parallel database systems typically did not support mid-query fault tolerance. If any machine involved in processing a query failed in the middle of (or shortly after) its task, the query would abort and start from scratch using a different replica of the data. As long as queries take orders of magnitude less time to execute than the mean time to (at least one) machine failure in the cluster, this lack of support for mid-query fault tolerance is not problematic. However, as the data scales, and the number of machines in a cluster scales accordingly, two things change: First, queries tend to take longer to run, since it is generally impossible to get perfectly linear speedup on all queries. Second, the mean time to (at least one) machine failure decreases: for very large clusters of thousands of cheap, commodity machines, it is not uncommon for at least one machine to be failing on the order of minutes (instead of hours or days). This combination of increased query latency with lower mean time to failure results in a requirement for mid-query fault tolerance. HadoopDB leveraged Hadoop's checkpointing of intermediate data to disk after Map tasks, along with the determinism of Map and Reduce tasks in the MapReduce model, to implement mid-query fault tolerance and thereby scale to very large deployments.

Figure 3: HadoopDB performance on TPC-H queries vs. Hive and commercial system DBMS-X [22]

4. Performance. Across a variety of data processing tasks, HadoopDB outperformed simple SQL-into-MapReduce translation layers (such as Hive), often by orders of magnitude [18, 22]. Figure 3 shows a performance benchmark comparison on TPC-H (see Bajda-Pawlikowski et al. for more details on this benchmark, experimental set-up, and results [22]) in which the performance of HadoopDB is compared with Hive and a commercial parallel row-oriented database system (which is anonymized as required by its license agreement and therefore called DBMS-X). Two versions of HadoopDB are compared: one in which PostgreSQL (a row-store) is used as the single-node database system on each node, and one in which VectorWise (an optimized column-store) is used instead of PostgreSQL. HadoopDB clearly outperformed Hive, and the column-oriented version of HadoopDB was even able to outperform DBMS-X.

3. RESEARCH GROWTH

After the initial prototype and paper, HadoopDB continued to develop—both in the research lab and in industry. In this section, we discuss how it developed in the research lab, and in the next section we discuss how it was commercialized and developed in the real world.

3.1 Split Execution

We further improved the performance of HadoopDB for join and aggregation queries using strategies designed specifically for HadoopDB's split execution environment [22], where the goal is to push as much query execution as possible into the higher performing underlying database systems. Examples of these improvements include:

1. Split MapReduce/Database Joins. When it was possible to use a broadcast join, HadoopDB was improved to choose optimally among multiple implementations: (i) a Map-side join where, at each node, the smaller table is read from HDFS and transformed into an in-memory hash table, which is then probed by tuples accessed sequentially from the larger table within a Map task (see the sketch following this list), or (ii) the smaller table is inserted into a temporary table within each HadoopDB node's underlying database system and the join is pushed entirely into the DBMS. Furthermore, HadoopDB leveraged semi-joins, especially when the projected column from the semi-join is small, and implemented them via selection predicates using the SQL IN clause, which was pushed down and performed by the underlying database system.

2. Post-join partial aggregations. When a query involves both joining tables and aggregating results of the join in a way that requires multiple MapReduce iterations—such as when the join key is different from the group-by aggregation key—HadoopDB computed partial aggregations at the end of the first Reduce task. Such partial aggregations saved significant IO (and runtime) costs by preventing unnecessary intermediate data from being written to HDFS.

3. Pre-join partial aggregations. HadoopDB transformed aggregation operators into multiple stages of partial aggregation operators and computed these partial aggregations before a join when the product of the cardinalities of the group-by and join-key columns was smaller than the cardinality of the entire table.
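The following is a minimal sketch of strategy (i) above, the Map-side broadcast join, written in plain Python rather than the actual HadoopDB implementation; the table contents and column names are assumed for illustration.

```python
def map_side_broadcast_join(small_table_rows, large_table_partition):
    """Build an in-memory hash table from the small (broadcast) table, then
    probe it with rows scanned sequentially from the local partition of the
    large table (as a Map task would)."""
    hash_table = {}
    for row in small_table_rows:            # e.g., the small table read from HDFS
        hash_table.setdefault(row["key"], []).append(row)
    for probe in large_table_partition:     # local partition of the large table
        for match in hash_table.get(probe["key"], []):
            yield {**match, **probe}        # joined output tuple

small = [{"key": 1, "region": "EU"}, {"key": 2, "region": "US"}]
large = [{"key": 1, "amount": 10}, {"key": 2, "amount": 7}, {"key": 1, "amount": 3}]
print(list(map_side_broadcast_join(small, large)))
```

Strategy (ii), by contrast, avoids any Map-side work: the small table is loaded into a temporary table and the join itself is expressed as SQL executed by each node's local DBMS.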
3.2 Invisible Loading

A key selling point of Hadoop was its low time-to-first analysis: as soon as data is produced, the data could be dumped into Hadoop's distributed file system and be immediately available for analysis via MapReduce jobs. In contrast, database systems typically require data to be loaded and tuned prior to being available for SQL queries. This initial data preparation for database systems usually involves a non-trivial human cost of data modeling and schematizing, in addition to the tunable computational costs of copying, clustering, and indexing the data. Hadoop and NoSQL systems allowed users, especially those who were not entirely familiar with the data, to trade cumulative long-term performance benefits for quick initial analysis.

This presented another opportunity for a systems-level hybrid between large-scale data processing systems and traditional database systems. We extended HadoopDB with an Invisible Loading (IL) feature that achieves the low time-to-first analysis of MapReduce jobs over a distributed file system while still yielding the long-term performance benefits of database systems [19]. IL seamlessly moves data from a file system into a database system (i) with minimal human intervention: users only need to write their MapReduce jobs using a fixed parsing API and optionally use a library of standard processing operators (e.g. filter, join, group by, etc.); and (ii) without any human detection: IL piggybacks on running MapReduce jobs and moves data into the database systems without a visible increase in response time, by incrementally loading vertical and horizontal partitions of the data into the database system and then incrementally reorganizing the data.

In more detail, the important technical features of invisible loading include:

1. Laziness and opportunism. Only data that is actually accessed by MapReduce tasks gets loaded into the database system. As data is accessed by a MapReduce job, invisible loading takes advantage of the data already being in cache to opportunistically load it into the database system. This results in the more frequently accessed attributes being more fully loaded into the database system.

2. No requirement of complete schema knowledge. MapReduce users generally write MapReduce jobs over datasets for which they do not have complete knowledge of the semantics or types of the various attributes within the records/key-value pairs that these jobs access. The invisible loading technique therefore did not require users to specify the schema of a file a priori. Rather, the code simply injects itself into the data parsing logic contained within Map tasks to infer the parts of the schema that are present in those tasks.

3. Utilization of column-stores to align multiple attributes. The above described features of laziness, opportunism, and incomplete schema knowledge imply that some columns will be loaded before others. Invisible loading leveraged column-oriented storage to enable increased flexibility for incremental loading of different attributes from the same HDFS file.

4. Incremental data reorganization. Invisible loading included an implementation of an Incremental Merge Sort (IMS) that gradually sorted columns on which users often executed selection predicates. IMS had the added advantage of only performing a fixed amount of work per MapReduce job, which kept overhead costs low in comparison with other adaptive organization techniques of the time.

5. A polymorphic library of data access and processing operators. These operators were designed to work correctly over data spread arbitrarily across the file system and the underlying database systems of HadoopDB.

From a performance perspective, the overhead of incrementally loading data from the file system to the database system was barely detectable by the end user. The primary detectable side-effects were the performance improvements that resulted from an increased amount of data being inside the higher performing database systems [19].
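A highly simplified sketch of the piggybacking idea is shown below: a wrapper around the job's parse function both feeds the running MapReduce job and opportunistically appends the already-parsed tuples to a local database, loading at most a fixed number per job to bound the overhead. The parser, budget, and storage calls are illustrative assumptions, not the actual HadoopDB API.

```python
class InvisiblyLoadingParser:
    """Wraps a user-supplied parse function; tuples parsed for the running
    job are opportunistically copied into a local database as a side effect."""

    def __init__(self, parse_fn, local_db, load_budget_per_job=10000):
        self.parse_fn = parse_fn            # fixed parsing API supplied by the user
        self.local_db = local_db            # stand-in for the node's DBMS
        self.budget = load_budget_per_job   # bounds the extra work per job

    def parse(self, raw_record):
        tup = self.parse_fn(raw_record)     # record is already in cache here
        if self.budget > 0:                 # lazily load only accessed data
            self.local_db.append(tup)
            self.budget -= 1
        return tup                          # hand the tuple to the Map function

db = []
parser = InvisiblyLoadingParser(lambda line: tuple(line.split(",")), db)
for line in ["2008,10.0", "2009,5.0"]:      # records streamed by a Map task
    parser.parse(line)
print(db)   # the same tuples are now also inside the local database
```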
3.3 Sinew

The prerequisite to use invisible loading is that data found on the file system is relational and straightforward to load into a relational database system. The reason why the data had not already been loaded into a database system was just for convenience: the user did not want to go to the effort of understanding all the data semantics and attribute types to the point where a schema can be declared, or spend the time waiting for the data to be loaded into a database system. But if a user had the prerequisite time and effort, the data could be loaded up front.

However, given Hadoop's usage as a data lake that stores arbitrary data, non-relational data is frequently found: key-value, nested, and other semi-structured and self-describing data types are common. While the invisible loading technique was a good solution for relational data, a different technique was necessary for "multi-structured" data.

This observation led to the development of Sinew [44]—a technique similar in motivation to invisible loading, but designed for multi-structured data. The basic idea was to present to the user a "universal relation" for each entity-set. This universal relation contained one column corresponding to each attribute that is defined in at least one entity within the entity-set. If an entity contains a nested object, the nested keys are flattened and referenceable as distinct columns using a dot-delimited name, with the nested key preceded by the key of the parent object. The less structured an entity-set, the more sparse its universal relation. This relation is queryable via SQL, where NULL is returned for any column value that is undefined for a specific entity.

Figure 4: Sinew architecture [44]

As data is loaded into the database system, it starts off being serialized using a Sinew-specific semi-structured serialization format into a binary column in the DBMS called a "column reservoir". This column reservoir is parsed on the fly at query time to extract the relevant columns for a given input query. A schema analyzer (see Figure 4) tracks the query workload and the statistical properties of each column and decides which columns—if they were to be migrated from "virtual columns" in the column reservoir to physical columns within the RDBMS table—will result in a performance improvement. Dense (i.e., frequently appearing) attributes and those with a cardinality that significantly differs from the RDBMS optimizer's default assumption are typically good candidates for migration. The primary impetus in selecting which columns to migrate to physical columns is to reduce the system cost of on-the-fly materialization of virtual columns at query time. Migrating dense columns will usually lead to overall improved system performance, but if the RDBMS is a row-store, the schema analyzer has to balance the potential benefits with the associated system overhead of maintaining tables with many attributes.

Once a column is chosen for migration, a "column materializer" (Figure 4) incrementally migrates the column into a physical column in the DBMS, similar to the "invisible loading" technique described above, such that this migration does not need to occur all at once. This allows migration to have no discernible effect on system performance (aside from the performance benefit of reduced parsing costs to access the column at query time).
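The sketch below illustrates the universal-relation idea on a small example: nested keys are flattened into dot-delimited column names, and every entity is exposed as a row over the union of all observed columns, with None standing in for SQL NULL where an attribute is undefined. This is an illustrative reconstruction, not Sinew's actual serialization format or schema analyzer.

```python
def flatten(entity, prefix=""):
    """Flatten nested objects into dot-delimited attribute names."""
    flat = {}
    for key, value in entity.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

def universal_relation(entities):
    rows = [flatten(e) for e in entities]
    columns = sorted({c for r in rows for c in r})       # one column per attribute
    return columns, [[r.get(c) for c in columns] for r in rows]  # None ~ NULL

docs = [{"id": 1, "name": "a", "address": {"city": "NYC"}},
        {"id": 2, "tags": "x"}]
print(universal_relation(docs))
# (['address.city', 'id', 'name', 'tags'],
#  [['NYC', 1, 'a', None], [None, 2, None, 'x']])
```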

3.4 Automatic Schema Generation

Neither the work on invisible loading nor Sinew attempted to optimize the initial version of a generated schema. The work on invisible loading inferred the schema of a dataset based on how it was used. As mentioned above, the invisible loading technique intercepted Map tasks and looked for parsing logic within these tasks. It used this parsing logic to detect the schema of the raw files. In many cases, the way data was organized in raw files is based solely on the whims of the processes that generated the data, and little concern is given to organizing data in a way that facilitates query processing. Meanwhile, for nested and other types of semi-structured data, Sinew used a single universal relation per entity-set instead of considering whether decomposition of this initial relation would lead to improved semantics and performance. In practice, we found that more careful consideration of the optimal schema for a dataset could lead to many benefits—especially for nested data sets, for which the initial data organization (or the universal relation) is rarely optimal for data analysis. Consequently, significant effort was spent on the automatic generation of optimized relational schemas for datasets commonly found in Hadoop. This work culminated in a technique that analyses raw nested files and uses the statistical properties of particular attribute expressions to automatically generate a normalized schema that can be used in HadoopDB (or any other relational database system) [30].

3.5 HadoopDB for Graph Datasets

HadoopDB was originally designed for relational datasets, and the work on Sinew and automatic schema generation extended its applicability to nested and other semi-structured datasets. However, in 2011, we discovered that the HadoopDB architecture is well-suited for an additional use-case: graph datasets. In particular, we found that HadoopDB could scale graph analysis algorithms beyond existing limitations at the time [33]. The basic idea was to replace the single-node database systems from the original HadoopDB architecture with single-node optimized graph database systems³, as shown in Figure 5.

³ Our initial prototype used RDF-3X as the single-node graph database system on each HadoopDB node.

Figure 5: Using HadoopDB for graph datasets

We also found that the way graph data is partitioned across the single-node graph database systems is important. In 2011, existing distributed graph analysis systems generally hash partitioned graph vertices across the machines in a cluster. This partitioning algorithm worked well for simple index lookup queries, where queries focus on small numbers of vertices or edges. However, for more involved queries, such as sub-graph pattern matching, efficiency was far from optimal. This is because many graph processing algorithms require traversing the graph along its edges. If data is hash-partitioned, neighboring vertices will usually be found on different physical machines, and traversing the edge between vertices requires communication across machines. The (potentially multiple) rounds of communication over the network needed for non-trivial graph processing queries can quickly become a performance bottleneck, leading to high query latencies.

Therefore, instead of using hash partitioning, we used a graph partitioning algorithm. This enabled vertices that are close to each other in the graph to be stored on the same machine, which resulted in a smaller amount of network communication at query time.
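The difference between the two partitioning strategies can be sketched as follows: with hash partitioning, an edge traversal usually crosses machines, whereas a locality-aware (graph) partitioning keeps most neighbors co-located. The tiny graph, the two-machine cluster, and the hand-picked placement are illustrative assumptions; the actual graph partitioner is not shown.

```python
def cross_partition_edges(edges, placement):
    """Count edges whose endpoints live on different machines: each such
    edge forces network communication during a traversal."""
    return sum(1 for u, v in edges if placement[u] != placement[v])

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]   # two triangles

hash_placement = {v: v % 2 for v in range(6)}              # hash vertices over 2 nodes
graph_placement = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}     # keep each triangle together

print(cross_partition_edges(edges, hash_placement))   # 4 edges cross machines
print(cross_partition_edges(edges, graph_placement))  # 0 edges cross machines
```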
An additional complication relative to HadoopDB was how data is replicated. The original HadoopDB architecture replicated entire shards. A replica of a shard contained all data within that shard, and no data from any other shard. The original partitioning of data across shards was disjoint. Thus, a data item could be found only inside the shard that it was partitioned to, along with all the replicas of that shard. This disjoint partitioning and complete shard replication allowed management and accounting of the replicas to be straightforward. However, for graph data, we found that more complicated replication algorithms could improve performance. The intuition is that, in practice, some vertices are much more broadly connected than other vertices. Placing such well-connected vertices on only a single machine caused two problems:

1. It is challenging to generate a balanced partitioning across nodes.

2. The large number of vertices they are connected to results in a likelihood that a non-trivial percentage of them are stored in a different partition.

Instead, we allowed additional replication at the borders of partitioned subgraphs such that the same vertex can be replicated to multiple partitions, thereby allowing well-connected vertices to be stored within the same partition as a larger percentage of their neighbors.

The architecture of this approach is shown in Figure 5. Graph vertices are loaded into the system by feeding them into the data partitioner module, which performs an initially disjoint partitioning of the graph by vertex. The output of the partitioning algorithm is then used to assign vertices to worker machines according to a placement algorithm that determines which vertices are on the boundary of the graph partitioning output and are good candidates for extra replication across partitions. Each partition is then loaded into the optimized graph database system on each node and indexed as appropriate.

The master node serves as the interface for graph analysis queries. It accepts queries and decomposes them into operators that can be performed in isolation across the single node graph database systems, and ships these operators to the worker nodes. All cross-partition operations are performed using the Hadoop processing framework, just as in the original HadoopDB model. Query plans are a composition of parallel operators that work in isolation across the single node database systems and cross-partition steps implemented within Hadoop.

This work continued with follow-up research on graph query optimization [34], graph compression [37], and dynamic graph partitioning and replication [32].

4. THE INDUSTRIAL EVOLUTION

The original HadoopDB codebase was released open source alongside the original VLDB publication. Since several thousand downloads shortly followed its release, it became clear that there was significant commercial interest in the HadoopDB technology. This created an opportunity to develop the prototype into a production-ready system that could be deployed by real enterprises. Two of the authors of the original HadoopDB paper (Daniel Abadi and Kamil Bajda-Pawlikowski) joined forces with a student at the Yale School of Management (Justin Borgman) to start a company called Hadapt, whose goal was to transform HadoopDB from a research idea into a usable system. The expectation was that commercialization, in addition to demonstrating the applicability of the original design, would inspire further research within the context of the HadoopDB project.

Hadapt was founded in 2010. It used the success of the initial prototype, along with patents on the HadoopDB architecture [14], split execution [13], invisible loading [12], Sinew [17], and its application to graph data [16], to raise over $16 million in two rounds of financing [7]. Over the next two years, Hadapt hired approximately 30 employees, mostly engineers who worked on improving and expanding the HadoopDB-based technology. Hadapt succeeded in taking the HadoopDB technology from academic prototype to production-ready software.

Along the way, the company gained customers who requested additional features that had not been anticipated in the original HadoopDB design. Most of these features related to the data lake nature of Hadoop deployments. These customers found the Hadoop distributed file system (HDFS) to be cost-efficient in storing all kinds of data (structured, semi-structured, and unstructured). Although HadoopDB was able to achieve high performance and fault tolerance when querying structured and semi-structured data, it lacked sufficient native capabilities to interact with the unstructured data that sat in the same file system. Hadapt customers wanted to access unstructured data directly from the Hadapt engine, via extensions to Hadapt's SQL interface.

Consequently, Hadapt added a scalable full-text search capability by leveraging SOLR [5]. Each node ran a shard of SOLR in addition to the HadoopDB DBMS shard. Hadapt supported indexing columns in a table via SOLR, thereby enabling full-text search using SQL syntax extensions, typically in the WHERE clause of SQL queries.
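The snippet below gives a hypothetical example of the kind of SQL syntax extension described above; the predicate name, table, and columns are invented for illustration and do not reproduce Hadapt's actual grammar. The idea is that a full-text predicate in the WHERE clause is answered by the node-local SOLR shard, while the remaining predicates are evaluated by the node-local DBMS shard.

```python
# Hypothetical SQL with a full-text predicate in the WHERE clause
# (illustrative syntax only; not Hadapt's actual extension).
FULL_TEXT_QUERY = """
SELECT ticket_id, created_at
FROM   support_tickets
WHERE  text_matches(body, 'timeout AND "connection reset"')  -- answered by SOLR
  AND  created_at >= DATE '2013-01-01'                        -- answered by the DBMS
"""
```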
Later, as the product expanded its applications to more latency-sensitive BI analytics, Hadapt allowed customers to trade fault tolerance for reduced query runtime using the Hadapt "Interactive Query" capability [15]. Along the way, native support for modern HDFS file formats such as ORC and Parquet was introduced. Finally, Hadapt built a cost-based optimizer that leveraged table and column statistics to reorder joins and choose appropriate distribution methods.

In 2014 Hadapt was acquired by Teradata and became a new division called the Teradata Center for Hadoop. Hadapt's impact began with a significant improvement of the Teradata QueryGrid for Hadoop. By applying the principles of split execution from HadoopDB, leaf parts of the query plan were pushed down from Teradata into Hadoop. This optimization substantially reduced expensive CPU cycles on Teradata clusters and minimized network traffic.

Later, in order to further extend the reach of QueryGrid, Teradata decided to back the open source project Presto [11], which supported many data sources beyond Hadoop. The ex-Hadapt team embarked on a multi-year roadmap to contribute to Presto and bring an enterprise-ready distribution of the project to the market. What followed were many enhancements that descended from Hadapt, especially in the areas of security, performance, ANSI SQL compatibility, BI tool support, and data source connectivity.

As a result of these efforts, Presto experienced unprecedented global growth in popularity in both on-premise and cloud deployments. In late 2017 many of the original Hadapt team members founded Starburst Data [9], an independent company focused on developing and providing commercial support for its enterprise-grade distribution of Presto.
THE SQL/HADOOP ECOSYSTEM initial prototype, along with the patents associated with During the time period in which HadoopDB was devel- the research innovations described in Section 3—patents on oped, much debate and research effort focused on improving the interface to large-scale data processing systems. At one This next generation of Hadoop-based query engines used extreme was the original MapReduce paper which expressed many of the ideas from HadoopDB in the way they were ar- all transformations using a procedural interface via Map and chitected using systems-level integration of low-latency par- Reduce functions. At the other extreme was Hive [45, 2] allel database techniques within large-scale data processing which provided a declarative SQL interface4. There were systems. We now give several examples of such SQL engine several popular interfaces in between these two extremes, projects subsequent to the original HadoopDB paper. such as Pig [39], SCOPE [26], and MapReduce extensions As mentioned above, Hive existed solely as a language- in commercial parallel database systems such as Greenplum based hybrid at the time the HadoopDB paper was pub- and Aster Data. lished. Subsequently, its community transformed the project HadoopDB aimed to be a hybrid not at the language or into a more systems-level hybrid. First, Hive evolved to sup- interface level but at the systems level. To achieve this, it in- port pluggable execution engines and proposed Apache Tez tegrated certain features of MapReduce-style systems (fault [41] as an alternative to MapReduce. Tez, which is similar tolerance, handling heterogeneous commodity clusters, abil- to Dryad [35], represents data processing as DAGs, allow- ity to parallelize user-defined functions) with capabilities of ing more efficient execution of SQL operators. Next, Hive parallel database systems (storage-level optimizations, effi- gained an additional processing layer called LLAP (Live cient query processing). Perhaps the most important contri- Long and Process) [24] that introduced per-node daemons bution of HadoopDB was breaking down the illusory divide responsible for local query execution and caching hot data. between the two technologies and embracing the important In essence, LLAP instances served a similar purpose in Hive technical advantages of both. as local DBMS servers in HadoopDB. To further improve the HadoopDB led the way in bringing systems implementa- performance of complex queries, Hive incorporated Apache tion techniques from the parallel database systems commu- Calcite [23] that provided cost-based optimization using nity into the large-scale data processing community. Several statistics kept in the Hive Metastore. Finally, transactional subsequent projects continued in this direction aiming for table support using ORC ACID completed Hive’s journey even higher performance and efficiency in analyzing struc- towards Big Data Warehousing on Hadoop. tured data. Spark [48] is a distributed general-purpose processing One way that HadoopDB gained a performance advantage engine comparable to Tez and Dryad in the way it ex- over the state of the art was by (optionally) storing struc- pands data processing operators beyond ‘Map’ and ‘Reduce’. 
Parquet and ORCFile are not query execution engines. Rather, they are file formats that enable column-oriented storage and compression. In order to get the equivalent benefit that HadoopDB was able to achieve when leveraging an underlying column-oriented DBMS, Parquet and ORCFile need to be combined with a query execution engine that can leverage column-oriented storage. Indeed, Parquet and ORCFile were introduced exactly for this purpose: so that a new wave of SQL engines for Hadoop could benefit from column-oriented storage and processing just as HadoopDB had done in 2009.

This next generation of Hadoop-based query engines used many of the ideas from HadoopDB in the way they were architected, using systems-level integration of low-latency parallel database techniques within large-scale data processing systems. We now give several examples of such SQL engine projects subsequent to the original HadoopDB paper.

As mentioned above, Hive existed solely as a language-based hybrid at the time the HadoopDB paper was published. Subsequently, its community transformed the project into a more systems-level hybrid. First, Hive evolved to support pluggable execution engines and proposed Apache Tez [41] as an alternative to MapReduce. Tez, which is similar to Dryad [35], represents data processing as DAGs, allowing more efficient execution of SQL operators. Next, Hive gained an additional processing layer called LLAP (Live Long and Process) [24] that introduced per-node daemons responsible for local query execution and caching hot data. In essence, LLAP instances served a similar purpose in Hive as local DBMS servers in HadoopDB. To further improve the performance of complex queries, Hive incorporated Apache Calcite [23], which provided cost-based optimization using statistics kept in the Hive Metastore. Finally, transactional table support using ORC ACID completed Hive's journey towards Big Data Warehousing on Hadoop.

Spark [48] is a distributed general-purpose processing engine comparable to Tez and Dryad in the way it expands data processing operators beyond Map and Reduce. Spark maintained the MapReduce system capability of providing fault tolerance during query processing, but provides users with more control in specifying the required level of fault tolerance. Compared to MapReduce, Spark achieves significant speedup for iterative data processing workloads and therefore became popular for machine learning, graph analytics, and ETL use cases. Spark supports data analytics via a variety of end user interfaces such as Scala, Python, and R. Analogously to how Hive and HadoopDB brought SQL to Hadoop, the Shark project [46] brought SQL to Spark. By using Spark instead of MapReduce, Shark was able to achieve higher performance than the original MapReduce-based version of Hive. In 2014, Shark was replaced by SparkSQL [21]. SparkSQL leveraged a new DataFrame API, featured a query optimizer called Catalyst, and introduced a number of execution engine improvements collectively referred to as Project Tungsten. More recently, Databricks open-sourced Delta [6], a transactional table storage for Spark built on top of Parquet.

Flink is similar to Spark, but is less connected to the Hadoop ecosystem and more commonly used for continuous streaming use cases relative to Spark [25].

Drill [1] is a distributed SQL engine inspired by Google Dremel [38]. The project was created by MapR (now part of HPE) but never got embraced by the leading Hadoop vendors. Apache Drill provided connectivity to data sources beyond HDFS, including plain JSON files, MongoDB, HBase, and Object stores.

Impala [36] and HAWQ [27] went a step farther than HadoopDB. Like HadoopDB, they include specialized single-node query execution engines on each node in a Hadoop cluster, and execute query operators in parallel across these engines. However, in HadoopDB, only the lower parts of a query plan were pushed down into the single-node database systems, while communication across nodes was managed using Hadoop's MapReduce framework. By contrast, Impala and HAWQ built full parallel database execution engines within a Hadoop cluster, thereby allowing entire query plans to avoid MapReduce or the other execution frameworks available in Hadoop. This resulted in the query processing engine losing some of the fault tolerance, heterogeneous cluster tolerance, and thus scalability advantages of HadoopDB. Nonetheless, Impala and HAWQ were able to achieve low latency queries via fully pipelined relational operators and native integration with HDFS, which reduces the need for mid-query fault tolerance.

Presto [42] is another full parallel database execution engine that avoids the need for other Hadoop-based generic data processing engines during query processing. The project was originally created at Facebook as the successor to Hive to allow for concurrent interactive queries over large datasets. Similar to Impala, Presto fully pipelines query execution for performance and therefore does not support mid-query fault tolerance. The open source project features an efficient execution engine employing vectorized columnar data processing, runtime query bytecode compilation, optimized data structures, and multi-threaded processing that leverages multi-core CPUs efficiently.

Presto includes native abilities to query multiple data sources including Object Stores, HDFS, NoSQL systems, and RDBMSs. Presto's inherent separation of compute and storage makes Presto well-suited for deployments in the cloud and in cloud-like environments such as Kubernetes. Among the key members of Presto's development community is the ex-Hadapt team at Starburst Data [10] (see Section 4) that recently contributed a Cost-Based Optimizer [8].

By being complete implementations of parallel execution engines, Impala, HAWQ, and Presto are somewhat independent systems that integrate with Hadoop mostly at the storage level (although Impala and HAWQ both also integrate with Hadoop's resource management tools). To that end, they provide similar (albeit more native) functionality to a large number of commercial parallel database systems that have "connectors" to Hadoop that enable them to read data from HDFS. These systems thus serve as a SQL interface to data stored in HDFS, and use their own execution engines for non-trivial parts of query processing. Almost every parallel DBMS vendor now has a Hadoop connector. Some implementations that are more closely integrated with Hadoop include Oracle Big Data SQL, Redshift Spectrum, Teradata QueryGrid, Vertica on Hadoop, and IBM Big SQL.

6. CONCLUSION

Hellerstein et al. note that the "unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage" [31]. While unified frameworks like Presto and central metadata repositories like Hive Metastore help ameliorate some of these issues, several end-user problems require novel solutions. These include (i) discoverability: allowing analysts to determine whether certain data exists and, if so, how it is processed, (ii) minimizing wasted and duplicated effort involved in data cleaning, pre-processing, and data analysis, as well as sharing learned insights, (iii) better governance through tracking data from source to use, and (iv) better automation of schema inference, data cleaning, repair, and pattern finding. These problems require human-in-the-loop tools that visualize metadata, processes, provenance, and results across an entire data lake.

Much effort has been spent by Hadoop distributors and members of the ecosystem to integrate HadoopDB's performance improvements into the thriving current ecosystem of query processing tools that run on Hadoop. However, many of HadoopDB's usability innovations—such as incremental movement of data from HDFS into file formats that are optimized for query processing, as pioneered by the invisible loading [19], Sinew [44], and schema generation [30] projects—remain in their infancy in the Hadoop ecosystem. As Hadoop crosses the chasm of wide-spread adoption, we expect usability will become the next front of rapid innovation and competitive differentiation.

7. ACKNOWLEDGEMENTS

This work was sponsored by the National Science Foundation under grants IIS-1718581 and IIS-1910613. We thank Alessandro Andrioni, Milind Bhandarkar, Jorn Franke, Alan Gates, Marty Gubar, Christian Jensen, Julien Le Dem, Mark Lyons, Frank McSherry, and Carl Steinbach for their comments on earlier drafts of this paper.

8. REFERENCES

[1] Apache Drill. https://drill.apache.org.
[2] Apache Hive. https://hive.apache.org/.
[3] Apache ORC. https://orc.apache.org.
[4] Apache Parquet. https://parquet.apache.org.
[5] Apache Solr. https://lucene.apache.org/solr/.
[6] Delta Lake. https://delta.io/.
[7] Hadapt Crunchbase profile. https://www.crunchbase.com/organization/hadapt.
[8] Introduction to Presto Cost-Based Optimizer. Presto Blog https://prestosql.io/blog/2019/07/04/cbo-introduction.html.
[9] Presto - Next Chapter. Starburst Blog https://www.starburstdata.com/technical-blog/2017-12-13-presto-next-chapter/.
[10] Starburst Data. https://www.starburstdata.com.
[11] Teradata Bets Big on Presto for Hadoop SQL. Datanami https://www.datanami.com/2015/06/08/teradata-bets-big-on-presto-for-hadoop-/.
[12] D. Abadi and A. Abouzied. Data loading systems and methods. US Patent 9336263, 2011.
[13] D. Abadi and K. Bajda-Pawlikowski. Query execution systems and methods. US Patent 8935232, 2011.
[14] D. Abadi, K. Bajda-Pawlikowski, A. Abouzied, and A. Silberschatz. Processing of data using a database system in communication with a data processing framework. US Patent 9495427, 2011.
[15] D. Abadi, K. Bajda-Pawlikowski, R. Schlussel, and P. Wickline. Systems and methods for fault tolerant, adaptive execution of arbitrary queries at low latency. US Patent 9934276, 2013.
[16] D. Abadi and J. Huang. Query execution systems and methods. US Patent 8886631, 2012.
[17] D. Abadi, D. Tahara, and T. Diamond. Schema-less access to stored data. US Patent 9471711, 2013.
[18] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922-933, 2009.
[19] A. Abouzied, D. J. Abadi, and A. Silberschatz. Invisible loading: Access-driven data transfer from raw files into database systems. EDBT '13, pages 1-10, 2013.
[20] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. VLDB '01, pages 169-180, 2001.
[21] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. SIGMOD '15, pages 1383-1394, 2015.
[22] K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and E. Paulson. Efficient processing of data warehousing queries in a split execution environment. SIGMOD '11, pages 1165-1176, 2011.
[23] E. Begoli, J. Camacho-Rodriguez, J. Hyde, M. J. Mior, and D. Lemire. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. SIGMOD '18, pages 221-230, 2018.
[24] J. Camacho-Rodriguez, A. Chauhan, A. Gates, E. Koifman, O. O'Malley, V. Garg, Z. Haindrich, S. Shelukhin, P. Jayachandran, S. Seth, D. Jaiswal, S. Bouguerra, N. Bangarwa, S. Hariappan, A. Agarwal, J. Dere, D. Dai, T. Nair, N. Dembla, G. Vijayaraghavan, and G. Hagleitner. Apache Hive: From MapReduce to enterprise-grade big data warehousing. SIGMOD '19, pages 1773-1786, 2019.
[25] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 38(4):28-38, 2015.
[26] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.
[27] L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, J. Cohen, C. Welton, G. Sherry, and M. Bhandarkar. HAWQ: A massively parallel processing SQL engine in Hadoop. SIGMOD '14, pages 1223-1234, 2014.
[28] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137-150, San Francisco, CA, 2004.
[29] D. J. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 2008.
[30] M. DiScala and D. J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. SIGMOD '16, pages 295-310, 2016.
[31] J. Hellerstein, J. Gonzalez, V. Sreekanti, et al. Ground: A data context service. CIDR '17, 2017.
[32] J. Huang and D. J. Abadi. Leopard: Lightweight edge-oriented partitioning and replication for dynamic graphs. PVLDB, 9(7):540-551, 2016.
[33] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123-1134, 2011.
[34] J. Huang, K. Venkatraman, and D. J. Abadi. Query optimization of distributed pattern matching. ICDE, 2014.
[35] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. EuroSys '07, 2007.
[36] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source SQL engine for Hadoop. CIDR, 2015.
[37] A. Maccioni and D. J. Abadi. Scalable pattern matching over compressed graphs via dedensification. KDD '16, pages 1755-1764, 2016.
[38] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1-2):330-339, 2010.
[39] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. SIGMOD '08, pages 1099-1110, 2008.
[40] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. SIGMOD '09, pages 165-178, 2009.
[41] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino. Apache Tez: A unifying framework for modeling and building data processing applications. SIGMOD '15, pages 1357-1369, 2015.
[42] R. Sethi, M. Traverso, D. Sundstrom, D. Phillips, W. Xie, Y. Sun, N. Yegitbasi, H. Jin, E. Hwang, N. Shingte, and C. Berner. Presto: SQL on everything. ICDE '19, pages 1802-1813, 2019.
[43] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: Friends or foes? Commun. ACM, 53(1):64-71, Jan. 2010.
[44] D. Tahara, T. Diamond, and D. J. Abadi. Sinew: A SQL system for multi-structured data. SIGMOD '14, pages 815-826, 2014.
[45] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626-1629, 2009.
[46] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. SIGMOD '13, pages 13-24, 2013.
[47] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI '08, pages 1-14, 2008.
[48] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. Apache Spark: A unified engine for big data processing. Commun. ACM, 59(11):56-65, Oct. 2016.