Research Article: Improving I/O Efficiency in Hadoop-Based Massive Data Analysis Programs
16 pages, PDF, 1,020 KB
Recommended publications
- File Formats for Big Data Storage Systems
  Samiya Khan, Mansaf Alam. International Journal of Engineering and Advanced Technology (IJEAT), ISSN 2249-8958, Volume 9, Issue 1, October 2019.
  Abstract: Big data is one of the most influential technologies of the modern era. However, supporting the maturity of big data systems requires the development and sustenance of heterogeneous environments, which in turn requires the integration of technologies as well as concepts. Computing and storage are the two core components of any big data system, and big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the facet of big data file formats into the picture. The paper classifies available big data file formats into five categories, namely text-based, row-based, column-based, in-memory, and data storage services, and compares their advantages, shortcomings, and possible use cases. Hadoop allows the user to store data in its repository in several ways; commonly available human-readable formats include XML [9], CSV [10], and JSON [11]. Although JSON, XML, and CSV are human-readable, they are not the best way to store data on a Hadoop cluster: in some cases, storing data in such raw formats is highly inefficient, and parallel storage is not possible for data stored this way. Given that storage efficiency and parallelism are the two leading advantages of using Hadoop, the use of raw file formats may defeat the whole purpose.
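The row-versus-column distinction the survey above draws between raw text formats and Hadoop-native ones can be illustrated with a minimal sketch in plain Python. This is not code from the paper; all names and values are illustrative assumptions.

```python
# Sketch: why column-oriented layouts beat raw row-oriented text
# formats like CSV for analytical scans. Illustrative only.
import csv
import io

rows = [
    {"id": i, "name": f"user{i}", "score": i * 10}
    for i in range(5)
]

# Row-oriented storage (CSV): every field of every record is written
# together, so reading one column still touches the whole file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
row_oriented = buf.getvalue()

# Column-oriented storage: values of one column are stored
# contiguously, so a query over "score" can skip "id" and "name".
column_oriented = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "score": [r["score"] for r in rows],
}

# Scanning a single column only deserializes that column's block.
total_score = sum(column_oriented["score"])
print(total_score)  # 0+10+20+30+40 = 100
```

Columnar formats such as ORC and Parquet additionally compress each column block and split files into chunks that can be read in parallel, which is what the raw text formats discussed above give up.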
- Impala: A Modern, Open-Source SQL Engine for Hadoop
  Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder (Cloudera). http://impala.io/
  Abstract: Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, which batch frameworks such as Apache Hive do not deliver, with performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. The paper presents Impala from a user's perspective, gives an overview of its architecture and main components, and demonstrates its performance against other popular SQL-on-Hadoop systems: as Section 7 shows, for single-user queries Impala is up to 13x faster than alternatives, and 6.7x faster on average, and it is the highest-performing SQL-on-Hadoop system especially under multi-user workloads. The highest performance achievable today requires using HDFS as the underlying storage manager, and that is therefore the focus of the paper; where there are notable differences in how certain technical aspects are handled in conjunction with HBase, the authors note that in the text without going into detail.
- Hortonworks Data Platform: Data Movement and Integration (December 15, 2017)
  Hortonworks, Inc. docs.cloudera.com. Copyright 2012-2017 Hortonworks, Inc.; except where otherwise noted, licensed under the Creative Commons Attribution-ShareAlike 4.0 License (http://creativecommons.org/licenses/by-sa/4.0/legalcode).
  The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable, 100% open-source platform for storing, processing, and analyzing large volumes of data from many sources and formats in a quick, easy, and cost-effective manner. It consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari, integrated and tested as part of the release process, with installation and configuration tools included. Unlike other providers of platforms built on Apache Hadoop, Hortonworks contributes 100% of its code back to the Apache Software Foundation and sells only expert technical support, training, and partner-enablement services.
- Major Technical Advancements in Apache Hive
  Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, Xiaodong Zhang (The Ohio State University, Hortonworks Inc., and Microsoft).
  Abstract: Apache Hive is a widely used data warehouse system for Apache Hadoop that has been adopted by many organizations for various big data analytics applications. More than 100 developers have worked to improve Hive across more than 3,000 issues, and with this rapid development pace Hive has been significantly updated by new innovations and research since the original Hive paper [45] was published four years ago. Hive was originally designed as a translation layer on top of Hadoop MapReduce: it exposes its own dialect of SQL to users and translates data manipulation statements (queries) into a directed acyclic graph (DAG) of MapReduce jobs, so that users do not need to write tedious and sometimes difficult MapReduce programs to manipulate data stored in the Hadoop Distributed Filesystem (HDFS). Working closely with many users and organizations, the authors identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining its performance. To keep Hive satisfying the requirements of processing increasingly high volumes of data in a scalable and efficient way, they set two goals related to storage and runtime performance: maximizing effective storage capacity and accelerating data access.
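The SQL-to-MapReduce translation described above can be sketched as a toy word count in plain Python. This is a hypothetical illustration of the map/shuffle/reduce pattern, not Hive's actual generated code: a `SELECT word, COUNT(*) ... GROUP BY word` query compiles into a map phase (emit key/value pairs), a shuffle (group by key), and a reduce phase (aggregate each group).

```python
# Sketch: the kind of MapReduce program Hive's SQL layer spares
# users from writing by hand. Illustrative only.
from collections import defaultdict

def map_phase(records):
    # Emit (word, 1) for every word, like a generated mapper would.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle stage does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group, like a generated reducer for COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

lines = ["hive on hadoop", "hadoop stores data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # appears in both lines -> 2
```

In real Hive the plan is a DAG of such jobs, with joins and multi-stage aggregations chained together, but each stage follows this same map/shuffle/reduce shape.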
- HDFS File Formats: Study and Performance Comparison
  Álvaro Alonso Isla. Master's thesis (Trabajo Fin de Máster), Máster Universitario en Investigación en Tecnologías de la Información y las Comunicaciones, Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad de Valladolid, June 29, 2018. Supervised by Dr. Miguel Ángel Martínez Prieto and Dr. Aníbal Bregón Bregón; committee: Dr. Pablo de la Fuente Redondo (president), Dr. Alejandra Martínez Monés, Dr. Guillermo Vega Gorgojo.
  Summary (translated from Spanish): The Hadoop distributed system is becoming increasingly popular for storing and processing large amounts of data (Big Data). Because it is composed of many machines, its file system, HDFS (Hadoop Distributed File System), is also distributed. Since HDFS differs from traditional storage systems, new file storage formats have been developed to fit its particular characteristics. This work studies these new file formats, paying special attention to their distinguishing features, in order to anticipate which of them will perform best given the needs of the data at hand. To reach this goal, it proposes a theoretical comparison framework with which to easily recognize which formats fit a given set of requirements.
- Albis: High-Performance File Format for Big Data Systems
  Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler (IBM Research, Zurich). Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC '18), July 11-13, 2018, Boston, MA, USA. ISBN 978-1-939133-02-1. https://www.usenix.org/conference/atc18/presentation/trivedi
  Abstract: Over the last decade, a variety of external file formats such as Parquet, ORC, and Arrow have been developed to store large volumes of relational data in the cloud. As high-performance networking and storage devices are used pervasively to process this data in frameworks like Spark and Hadoop, the authors observe that none of the popular file formats can deliver data access rates close to the hardware. Their analysis suggests that multiple antiquated notions about the nature of I/O in a distributed setting, and the preference for "storage efficiency" over performance, are the key reasons for this gap. The paper presents Albis, a high-performance file format for storing relational data on modern hardware. Relational data processing systems typically do not manage their own storage; they leverage a variety of external file formats to store and access data (Figure 1 in the paper shows a typical such stack in the cloud). Albis is built upon two key principles: (i) reduce the CPU ...
- Building High Performance Data Analytics Systems Based on Scale-out Models
  Yin Huai. Doctoral dissertation, Graduate Program in Computer Science and Engineering, The Ohio State University, 2015. Committee: Xiaodong Zhang (advisor), Feng Qin, Spyros Blanas. Copyright 2015 Yin Huai.
  Abstract: In response to the data explosion, new system infrastructures have been built on scale-out models for the purposes of high data availability and reliable large-scale computation. With the increasing adoption of data analytics systems, users continuously demand high throughput and high performance across various applications. The dissertation identifies three critical issues for achieving these: efficient table placement methods (i.e., methods for placing structured data), generating high-quality distributed query plans without unnecessary data movements, and effective support for out-of-band communications. To address them, the author conducts a comprehensive study of the design choices of table placement methods, generalizing their basic structure and evaluating their effects on I/O performance; designs and implements two optimizations that remove unnecessary data movements from distributed query plans; and introduces a system facility called SideWalk that facilitates the implementation of out-of-band communications.
- SAS with Hadoop: Performance Considerations & Monitoring Strategies
  Raphael Poumarede, Principal Technical Architect, Global Enablement and Learning. June 2016.
  This technical paper covers data management with SAS/ACCESS (pushing data management down to Hadoop, optimizing SELECT statements, the "ORDER BY" hurdle, creating a Hive table from SAS (CTAS), choosing between MapReduce and Tez, file formats and partitioning, and other tricks); in-Hadoop processing (pushing processing down to Hadoop, file formats and partitioning, and resource management and Hadoop tuning); and lessons learned.
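The "push processing down to Hadoop" idea running through the paper above can be sketched in plain Python. This is an illustrative analogy only, with made-up names and data; no SAS or Hadoop APIs are involved. The point is that applying a filter where the data lives moves far fewer rows to the client than pulling raw tables across the network.

```python
# Sketch: pushdown vs. client-side filtering. Illustrative only.
def cluster_side_query(table, predicate):
    # Runs next to the data: only matching rows leave the "cluster".
    return [row for row in table if predicate(row)]

# A hypothetical 1000-row table, alternating between two regions.
table = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1000)]

# Naive client-side approach: transfer all 1000 rows, then filter.
transferred_naive = len(table)

# Pushed-down approach: the WHERE clause runs cluster-side, so only
# the 500 matching rows are transferred.
pushed = cluster_side_query(table, lambda r: r["region"] == "EU")
transferred_pushed = len(pushed)

print(transferred_naive, transferred_pushed)  # 1000 500
```

In the SAS/Hadoop setting the same principle shows up as generating HiveQL (or in-database procedures) so that joins, filters, and aggregations execute on the cluster rather than after extraction.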
- Hortonworks Data Platform (HDP) Developer Quick Start (DEV-201), 4 days
  This 4-day training course is designed for developers and data engineers who need to create applications that analyze big data stored in Apache Hadoop using Apache Pig and Apache Hive, and to develop applications on Apache Spark. Topics include an essential understanding of HDP and its capabilities, Hadoop, YARN, HDFS, MapReduce/Tez, data ingestion, using Pig and Hive to perform data analytics on big data, and an introduction to Spark Core, Spark SQL, Apache Zeppelin, and additional Spark features. Prerequisites: familiarity with programming principles and software development experience; SQL and light scripting knowledge is helpful; no prior Hadoop knowledge is required. Format: 50% lecture/discussion, 50% hands-on labs. Agenda: Day 1, HDP essentials and an introduction to Pig; Day 2, Apache Hive; Days 3-4, Apache Spark. Day 1 objectives include describing the case for Hadoop; the trends of volume, velocity, and variety; the Hadoop ecosystem frameworks across five architectural categories (data management, data access, data governance and integration, security, and operations); and the function, purpose, and major architectural components of the Hadoop Distributed File System (HDFS).
- Apache Spark Guide
  Cloudera, Inc. Copyright 2010-2021 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and other product or service names in the document are trademarks of Cloudera and its suppliers or licensors; Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. Code included in the documentation is made available under the Apache License, Version 2.0 (https://opensource.org/licenses/Apache-2.0). No part of the document may be reproduced or transmitted without the express written permission of Cloudera.
- Cloudera Introduction
  Cloudera, Inc. Copyright 2010-2021 Cloudera, Inc. All rights reserved. Carries the same Cloudera legal notice as the Apache Spark Guide above.
- DMX Install Guide, Version 9.10
  Syncsort Incorporated. Copyright 1990, 2020 Syncsort Incorporated. All rights reserved. Last updated May 15, 2020. The document contains unpublished, confidential, and proprietary information of Syncsort Incorporated; no portion may be disclosed or used without Syncsort's express written consent. Customers with a valid maintenance contract can get technical assistance via MySupport, which offers product downloads, documentation, and an extensive knowledge base.
  Contents: a DMX overview; installing DMX/DMX-h (DMX-h overview, prerequisites, step-by-step installation, configuring the DMX run-time service, and applying a new license key to an existing installation); and running DMX (graphical user interfaces, ...).