9781782161097 Preview.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
[ 1 ] Apache Hive Cookbook Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world Hanish Bansal Saurabh Chauhan Shrey Mehrotra BIRMINGHAM - MUMBAI Apache Hive Cookbook Copyright © 2016 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: April 2016 Production reference: 1260416 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78216-108-0 www.packtpub.com Credits Authors Project Coordinator Hanish Bansal Bijal Patel Saurabh Chauhan Shrey Mehrotra Proofreader Safs Editing Reviewer Aristides Villarreal Bravo Indexer Priya Sane Commissioning Editor Wilson D'souza Graphics Kirk D'Penha Acquisition Editor Tushar Gupta Production Coordinator Shantanu N. Zagade Content Development Editor Anish Dhurat Cover Work Shantanu N. Zagade Technical Editor Vishal K. Mewada Copy Editor Dipti Mankame About the Authors Hanish Bansal is a software engineer with over 4 years of experience in developing big data applications. He loves to study emerging solutions and applications mainly related to big data processing, NoSQL, natural language processing, and neural networks. He has worked on various technologies such as Spring Framework, Hibernate, Hadoop, Hive, Flume, Kafka, Storm, and NoSQL databases, which include HBase, Cassandra, MongoDB, and search engines such as Elasticsearch. In 2012, he completed his graduation in Information Technology stream from Jaipur Engineering College and Research Center, Jaipur, India. He was also the technical reviewer of the book Apache Zookeeper Essentials. In his spare time, he loves to travel and listen to music. You can read his blog at http://hanishblogger.blogspot.in/ and follow him on Twitter at https://twitter.com/hanishbansal786. I would like to thank my parents for their love, support, encouragement and the amazing chances they've given me over the years. Saurabh Chauhan is a module lead with close to 8 years of experience in data warehousing and big data applications. He has worked on multiple Extract, Transform and Load tools, such as Oracle Data Integrator and Informatica as well as on big data technologies such as Hadoop, Hive, Pig, Sqoop, and Flume. He completed his bachelor of technology in 2007 from Vishveshwarya Institute of Engineering and Technology. In his spare time, he loves to travel and discover new places. He also has a keen interest in sports. I would like to thank everyone who has supported me throughout my life. Shrey Mehrotra has 6 years of IT experience and, since the past 4 years, in designing and architecting cloud and big data solutions for the governmental and fnancial domains. Having worked with big data R&D Labs and Global Data and Analytical Capabilities, he has gained insights into Hadoop, focusing on HDFS, MapReduce, and YARN. His technical strengths also include Hive, Pig, Spark, Elasticsearch, Sqoop, Flume, Kafka, and Java. He likes spending time performing R&D on different big data technologies. He is the co- author of the book Learning YARN, a certifed Hadoop developer, and has also written various technical papers. In his free time, he listens to music, watches movies, and spending time with friends. I would like to thank my mom and dad for giving me support to accomplish anything I wanted. Also, I would like to thank my friends, who bear with me while I am busy writing. About the Reviewer Aristides Villarreal Bravo is a Java developers, a member of the NetBeans Dream Team, and a Java User Groups leader. He has organized and participated in various conferences and seminars related to Java, JavaEE, NetBeans, NetBeans Platform, free software, and mobile devices, nationally and internationally. He has written tutorials and blogs about Java, NetBeans, and web development. He has participated in several interviews on sites such as NetBeans, NetBeans Dzone, and JavaHispano. He has developed plugins for NetBeans. He has been a technical reviewer for the book PrimeFaces Blueprints. Aristides is the CEO of Javscaz Software Developers. He lives in Panamá To my mother, father, and all family and friends. www.PacktPub.com eBooks, discount offers, and more Did you know that Packt offers eBook versions of every book published, with PDF and ePub fles available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. TM https://www2.packtpub.com/books/subscription/packtlib Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books. Why Subscribe? f Fully searchable across every book published by Packt f Copy and paste, print, and bookmark content f On demand and accessible via a web browser Table of Contents Preface v Chapter 1: Developing Hive 1 Introduction 1 Deploying Hive on a Hadoop cluster 2 Deploying Hive Metastore 3 Installing Hive 6 Confguring HCatalog 10 Understanding different components of Hive 11 Compiling Hive from source 13 Hive packages 15 Debugging Hive 16 Running Hive 17 Changing confgurations at runtime 18 Chapter 2: Services in Hive 19 Introducing HiveServer2 19 Understanding HiveServer2 properties 21 Confguring HiveServer2 high availability 22 Using HiveServer2 Clients 24 Introducing the Hive metastore service 34 Confguring high availability of metastore service 36 Introducing Hue 36 Chapter 3: Understanding the Hive Data Model 43 Introduction 43 Using numeric data types 45 Using string data types 46 Using Date/Time data types 47 Using miscellaneous data types 48 Using complex data types 48 i Table of Contents Using operators 50 Partitioning 57 Partitioning a managed table 58 Partitioning an external table 65 Bucketing 65 Chapter 4: Hive Data Defnition Language 69 Introduction 70 Creating a database schema 70 Dropping a database schema 72 Altering a database schema 73 Using a database schema 74 Showing database schemas 74 Describing a database schema 75 Creating tables 76 Dropping tables 78 Truncating tables 79 Renaming tables 80 Altering table properties 80 Creating views 81 Dropping views 82 Altering the view properties 83 Altering the view as select 83 Showing tables 84 Showing partitions 85 Show the table properties 85 Showing create table 86 HCatalog 87 WebHCat 88 Chapter 5: Hive Data Manipulation Language 89 Introduction 89 Loading fles into tables 90 Inserting data into Hive tables from queries 93 Inserting data into dynamic partitions 96 Writing data into fles from queries 98 Enabling transactions in Hive 99 Inserting values into tables from SQL 101 Updating data 104 Deleting data 105 ii Table of Contents Chapter 6: Hive Extensibility Features 107 Introduction 108 Serialization and deserialization formats and data types 108 Exploring views 112 Exploring indexes 114 Hive partitioning 115 Creating buckets in Hive 119 Analytics functions in Hive 121 Windowing in Hive 125 File formats 132 Chapter 7: Joins and Join Optimization 137 Understanding the joins concept 137 Using a left/right/full outer join 141 Using a left semi join 145 Using a cross join 147 Using a map-side join 149 Using a bucket map join 152 Using a bucket sort merge map join 154 Using a skew join 155 Chapter 8: Statistics in Hive 157 Bringing statistics in to Hive 157 Table and partition statistics in Hive 158 Column statistics in Hive 163 Top K statistics in Hive 165 Chapter 9: Functions in Hive 167 Using built-in functions 167 Using the built-in User-defned Aggregation Function (UDAF) 184 Using the built-in User Defned Table Function (UDTF) 188 Creating custom User-Defned Functions (UDF) 191 Chapter 10: Hive Tuning 193 Enabling predicate pushdown optimizations in Hive 193 Optimizations to reduce the number of map 198 Sampling 200 Chapter 11: Hive Security 207 Securing Hadoop 207 Authorizing Hive 212 Confguring the SQL standards-based authorization 215 Authenticating Hive 222 iii Table of Contents Chapter 12: Hive Integration with Other Frameworks 229 Working with Apache Spark 229 Working with Accumulo 233 Working with HBase 234 Working with Google Drill 240 Index 243 iv Preface Hive is an open source big data framework in the Hadoop ecosystem. It provides an SQL-like interface to query data stored in HDFS. Underlying it runs MapReduce programs corresponding to the SQL query. Hive was initially developed by Facebook and later added to the Hadoop ecosystem. Hive is currently the most preferred framework to query data in Hadoop. Because most of the historical data is stored in RDBMS data stores, including Oracle and Teradata. It is convenient for the developers to run similar SQL statements in Hive to query data. Along with simple SQL statements, Hive supports wide variety of windowing and analytical functions, including rank, row num, dense rank, lead, and lag.