Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology

Azza Abouzied (New York University Abu Dhabi), Daniel J. Abadi (University of Maryland, College Park), Kamil Bajda-Pawlikowski (Starburst Data), Avi Silberschatz (Yale University)

ABSTRACT

In 2009 we explored the feasibility of building a hybrid SQL data analysis system that takes the best features from two competing technologies: large-scale data processing systems (such as Google MapReduce and Apache Hadoop) and parallel database management systems (such as Greenplum and Vertica). We built a prototype, HadoopDB, and demonstrated that it can deliver the high SQL query performance and efficiency of parallel database management systems while still providing the scalability, fault tolerance, and flexibility of large-scale data processing systems. Subsequently, HadoopDB grew into a commercial product, Hadapt, whose technology was eventually acquired by Teradata. In this paper, we provide an overview of HadoopDB's original design, and its evolution during the subsequent ten years of research and development effort. We describe how the project innovated both in the research lab and as a commercial product at Hadapt and Teradata. We then discuss the current vibrant ecosystem of software projects (most of which are open source) that continue HadoopDB's legacy of implementing a systems-level integration of large-scale data processing systems and parallel database technology.

PVLDB Reference Format:
Azza Abouzied, Daniel J. Abadi, Kamil Bajda-Pawlikowski, Avi Silberschatz. Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology. PVLDB, 12(12): 2290-2299, 2019.
DOI: https://doi.org/10.14778/3352063.3352145

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 12. ISSN 2150-8097.

1. INTRODUCTION

In the first few years of this century, several papers were published on large-scale data processing systems: systems that partition large amounts of data over potentially thousands of machines and provide a straightforward language in which to express complex transformations and analyses of this data. The key feature of these systems is that the user does not have to be explicitly aware of how data is partitioned or how machines work together to process the transformations or analyses, yet these systems provide fault-tolerant, parallel processing of user programs. Most notable of these efforts was a paper published in 2004 by Dean and Ghemawat that described Google's MapReduce framework for data processing on large clusters [28]. The MapReduce programming model for expressing data transformations, along with the underlying system that supported fault-tolerant, parallel processing of these transformations, was at the time widely used across Google's many business operations, and subsequently became widely used across hundreds of thousands of other businesses through the open-source Hadoop implementation. Today, the companies that package, distribute, support, and provide training for Hadoop combine to form a multi-billion dollar industry.

MapReduce, along with other large-scale data processing systems such as Microsoft's Dryad/LINQ project [35, 47], was originally designed for processing unstructured data. One of the most famous use cases within Google and Microsoft was the creation of the indexes needed to power their respective Internet search capabilities, which requires processing large amounts of unstructured text found in Web pages. The success of these systems in processing unstructured data led to a natural desire to also use them for processing structured data. However, the final result was a major step backward relative to the decades of research in parallel database systems that provide similar capabilities of parallel query processing over structured data [29].

For example, MapReduce provided fault-tolerant, parallel execution of only two simple functions¹: Map, which reads key-value pairs within a partition of a distributed file in parallel, applies a filter or transform to these local key-value pairs, and then outputs the result as key-value pairs; and Reduce, which reads the key-value pairs output by the Map function (after the system partitions the pairs across machines by hashing the keys) and performs some arbitrary per-key computation, such as applying an aggregation function over all values associated with the same key. After the Reduce function completes, the results are materialized and replicated to a distributed file system.

¹ This limitation is not shared by Dryad. Nonetheless, Hadoop implemented MapReduce instead of Dryad.
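To make the programming model concrete, the following is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API. It is an illustrative example, not code from the paper: the Map function tokenizes its local partition of the input and emits (word, 1) pairs, and the Reduce function sums the values that the system has grouped under each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: reads key-value pairs (byte offset, line of text) from a
  // partition of a distributed file and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: receives all values for one key (the system hash-partitions
  // Map output by key) and applies a per-key aggregation, here a sum.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // The final Reduce output is materialized and replicated in HDFS.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that even this trivial job, essentially SELECT word, COUNT(*) ... GROUP BY word, requires a full scan, a shuffle, and a materialized, replicated output in HDFS, which is at the root of the inefficiencies discussed next.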
The model presents several inefficiencies for parallel structured query processing, such as: (1) Complex SQL queries can require a large number of operators. Although it is possible to express these operators as a sequence of Map and Reduce functions, database systems are most efficient when they can pipeline data between operators. The forced materialization of intermediate data by MapReduce, especially when data is replicated to a distributed file system after each Reduce function, is extremely inefficient and slows down query processing. (2) MapReduce naturally provides support for only one type of distributed join operation: the partitioned hash join. In parallel database systems, broadcast joins and co-partitioned joins, when eligible to be used, are frequently chosen by the query optimizer, since they can improve performance significantly. Unfortunately, neither broadcast joins nor co-partitioned joins fit naturally into the MapReduce programming model. (3) Optimizations for structured data at the storage level, such as column-orientation, compression in formats that can be operated on directly (without decompression), and indexing, were hard to leverage via the execution framework of the MapReduce model.

Even as studies continued to find that Hadoop performed poorly on structured data processing tasks when compared to shared-nothing parallel DBMSs [40, 43], widely respected technical teams, such as the team at Facebook, continued to use Hadoop for traditional SQL data analysis workloads. Although it is impossible to fully explain the reasoning behind [...]

[...] achieving the high performance and efficiency of traditional parallel database systems on structured SQL queries. In the next section, we give a technical overview of HadoopDB, according to the way it was described in the original paper. In Section 3 we describe how the project evolved in the research lab over the past decade after the [...]

Of the desired properties, MapReduce best meets fault tolerance and the ability to operate in a heterogeneous environment. It achieves fault tolerance by detecting and reassigning the Map tasks of failed nodes to other nodes in the cluster (preferably nodes with replicas of the input data). It achieves the ability to operate in a heterogeneous environment via redundant task execution: tasks that are taking a long time to complete on slow nodes get redundantly executed on other nodes that have completed their assigned tasks, and the time to complete the task becomes equal to the time for the fastest node to complete the redundantly executed task. By breaking work into small, granular tasks, the effect of faults and "straggler" nodes can be minimized.

MapReduce has a flexible query interface: Map and Reduce functions are just arbitrary computations written in a general-purpose language. Therefore, it is possible for each task to do anything on its input, just as long as its output follows the conventions defined by the model. In general, most MapReduce-based systems (such as Hadoop, which directly implements the systems-level details of the MapReduce paper) do not accept declarative SQL. However, there are some exceptions, such as Hive.

As shown in previous work, the biggest issue with MapReduce is performance [23]. By not requiring the user to first model and load data before processing, many of the performance-enhancing tools listed above that are used by database systems are not possible. Traditional business data analytical processing, which involves standard reports and many repeated queries, is particularly poorly suited to the one-time query processing model of MapReduce. Ideally, the fault tolerance and ability to operate in heterogeneous [...]

[Figure 1: The HadoopDB System Architecture [18]. The figure shows a SQL query entering the SMS Planner, which emits MapReduce jobs for the Hadoop core: a Master node hosting the MapReduce framework (JobTracker), HDFS (NameNode), a Catalog, and a Data Loader, plus worker Nodes 1 through n, each running a TaskTracker, a DataNode, and a local Database reached through the Database Connector and InputFormat implementations.]

[...] locality by matching a TaskTracker to Map tasks that process data local to it. It load-balances by ensuring all available TaskTrackers are assigned tasks. TaskTrackers regularly update the JobTracker with their status through heartbeat messages.

The InputFormat library represents the interface between the storage and processing layers. InputFormat implementations parse [...]
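HadoopDB's Database Connector is an InputFormat implementation that lets Map tasks read their input from a database residing on each node rather than from HDFS files. As a rough illustration of how such a connector looks to the framework, here is a minimal sketch built on Hadoop's stock DBInputFormat rather than HadoopDB's actual connector code; the sales table, the connection settings, and the SalesRecord class are all hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One row of a hypothetical "sales" table, as seen by Map tasks.
public class SalesRecord implements Writable, DBWritable {
  long customerId;
  double amount;

  // DBWritable: how the connector materializes a JDBC result row.
  public void readFields(ResultSet rs) throws SQLException {
    customerId = rs.getLong("customer_id");
    amount = rs.getDouble("amount");
  }

  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, customerId);
    ps.setDouble(2, amount);
  }

  // Writable: how Hadoop serializes the record between tasks.
  public void readFields(DataInput in) throws IOException {
    customerId = in.readLong();
    amount = in.readDouble();
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(customerId);
    out.writeDouble(amount);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the connector at a (hypothetical) local PostgreSQL instance.
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://localhost/warehouse", "hadoop", "secret");
    Job job = Job.getInstance(conf, "db scan");
    job.setInputFormatClass(DBInputFormat.class);
    // Map tasks now receive (LongWritable row id, SalesRecord) pairs.
    DBInputFormat.setInput(job, SalesRecord.class,
        "sales",                 // table
        "amount > 100",          // WHERE condition evaluated by the DBMS
        "customer_id",           // ORDER BY column
        "customer_id", "amount"  // projected fields
    );
    // ... set Mapper/Reducer classes and an output path as usual ...
  }
}
```

HadoopDB's SMS planner takes this idea further, pushing as much of each SQL query as possible (selections, projections, and eligible join and aggregation work) into the per-node databases, so that the storage-level optimizations of a DBMS apply while Hadoop supplies scheduling, parallelization, and fault tolerance.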