HBase High Performance Cookbook

Exciting projects that will teach you how complex data can be exploited to gain maximum insights

Ruchir Choudhry

BIRMINGHAM - MUMBAI

HBase High Performance Cookbook

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: January 2017

Production reference: 1250117

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-306-3

www.packtpub.com

Credits

Author: Ruchir Choudhry
Reviewer: Vinay Kumar Boddula
Commissioning Editor: Amarabha Banerjee
Acquisition Editor: Larissa Pinto
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Manisha Singh
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Arvindkumar Gupta
Cover Work: Arvindkumar Gupta

About the Author

Ruchir Choudhry is a principal architect at one of the largest e-commerce companies. He specializes in leading and articulating technology vision, strategizing, and implementing very large-scale, software engineering-driven technology changes, with a track record of over 16 years of success.

He was responsible for leading the strategy, architecture, engineering, and operations of multi-tenant e-commerce sites and platforms in the US, UK, Brazil, and other major markets for Walmart. The sites helped Walmart enter new markets and/or grow its market share. Combined, the sites serve millions of customers and take in orders with annual revenues exceeding $2.0 billion.

His personal interest is in performance and scalability. Recently, he has become obsessed with technology as a vehicle to drive and prioritize optimization across organizations and in the world.

He is a core team member in conceptualizing, designing, and reshaping a new platform that will serve the next generation of frontend engineering, based on the cutting-edge technology in Walmart.com and NBC/GE/VF Image Ware. He has led some of the most complex and technologically challenging R&D and innovative projects at VF Image Ware, Walmart.com, GE/NBC (China and Vancouver Olympic websites), and Hiper World Cyber Tech Limited (which created India's first wireless-based payment gateway that worked on non-smartphones, presented in Berlin in 1999).
He is the author of more than 8 white papers, which span topics from biometric-based single sign-on to Java Cards, performance tuning, and JVM tuning, among others. He has presented on the JVM and on performance optimization using JBoss, in Berlin and various other places.

Ruchir Choudhry did his BE at Bapuji Institute of Technology, his MBA in information technology at the National Institute of Engineering and Technology, and his MS Systems at BITS Pilani. He is currently working and consulting on HBase, Spark, and Cassandra. He can be reached at [email protected]

About the Reviewer

Vinay Boddula is pursuing his PhD in computer science, focusing on machine learning, big data analytics, and data integration. He is experienced with many open source platforms for data storage, programming, and machine learning, and enjoys applying his research skills in addressing sustainability and smarter planet challenges. His master's degree in embedded system design helps him develop sensor systems for environment monitoring using SoC and microcontrollers. Vinay has worked and interned as a software developer at companies based in India and the United States. His work focuses on developing real-time alert engines using complex event processing on historic and streaming data.

I would like to thank my teacher, colleagues, friends, and family for their patience and unconditional love.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt.
Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why Subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously. That's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us, but more importantly, it will also help others in the community to make an informed decision about the resources that they invest in to learn.

You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Table of Contents

Preface

Chapter 1: Configuring HBase
  Introduction
  Configuring and deploying HBase
  Using the filesystem
  Administering clusters
  Managing clusters

Chapter 2: Loading Data from Various DBs
  Introduction
  Extracting data from Oracle
  Loading data using Oracle Big data connector
  Bulk utilities
  Using Hive with Apache HBase
  Using Sqoop

Chapter 3: Working with Large Distributed Systems Part I
  Introduction
  Scaling elastically or Auto Scaling with built-in fault tolerance
  Auto Scaling HBase using AWS
  Works on different VM/physical, cloud hardware

Chapter 4: Working with Large Distributed Systems Part II
  Introduction
  Read path
  Write Path
  Snappy
  LZO compression
  LZ4 compressor
  Replication

Chapter 5: Working with Scalable Structure of tables
  Introduction
  HBase data model part 1
  HBase data model part 2
  How HBase truly scales on key and schema design

Chapter 6: HBase Clients
  Introduction
  HBase REST and Java Client
  Working with Apache Thrift
  Working with Apache Avro
  Working with Protocol buffer
  Working with Pig and using Shell

Chapter 7: Large-Scale MapReduce
  Introduction

Chapter 8: HBase Performance Tuning
  Introduction
  Working with infrastructure/operating systems
  Working with Java virtual machines
  Changing the configuration of components
  Working with HDFS

Chapter 9: Performing Advanced Tasks on HBase
  Machine learning using HBase
  Real-time data analysis using HBase and Mahout
  Full text indexing using HBase

Chapter 10: Optimizing HBase for Cloud
  Introduction
  Configuring HBase for the Cloud
  Connecting to an HBase cluster using the command line
  Backing up and restoring HBase
  Terminating an HBase cluster
  Accessing HBase data with Hive
  Viewing the HBase user interface
  Monitoring HBase with CloudWatch
  Monitoring HBase with Ganglia

Chapter 11: Case Study
  Introduction
  Configuring Lily Platform
  Integrating Elasticsearch with HBase
  Configuring

Index

Preface

The objective of this book is to guide the reader through setting up, developing, and integrating different clients. It also shows the internals of read/write operations and how to maintain and scale HBase clusters in production. This book also allows engineers to kick-start their projects for recommendations, relevancy, and machine learning.

What this book covers

Chapter 1, Configuring HBase, provides in-depth knowledge of how to set up, configure, administer, and manage a large and scalable cluster.
Chapter 2, Loading Data, takes a deep dive into how we can extract data from various input sources using different processes, such as bulk load, put, and MapReduce.
Chapter 3, Working with Large Distributed Systems I, talks about the internal architecture of HBase, how it connects, and how it provides a very scalable model.
Chapter 4, Working with Large Distributed Systems II, extends Chapter 3 and goes into more detail.
Chapter 5, Working with the Scalable Structure of Tables, explains how data can be modeled and how to design a scalable data model using HBase.
Chapter 6, HBase Clients, explains how to communicate with core HBase using various types of clients.
Chapter 7, Large-Scale MapReduce, shows how to design a large-scale MapReduce job using HBase, how its internals work, and how to optimize the HBase framework for it.
Chapter 8, HBase Performance Tuning, walks you through the process of fine-tuning reads and writes to run at consistent speeds, regardless of the scale at which the cluster is running.
Chapter 9, Performing Advanced Tasks, discusses some advanced topics about machine learning using Mahout libraries and real-time text data analysis.