Hadoop Operations and Cluster Management Cookbook
Total Page:16
File Type:pdf, Size:1020Kb
Hadoop Operations and Cluster Management Cookbook Over 60 recipes showing you how to design, configure, manage, monitor, and tune a Hadoop cluster Shumin Guo BIRMINGHAM - MUMBAI Hadoop Operations and Cluster Management Cookbook Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: July 2013 Production Reference: 1170713 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78216-516-3 www.packtpub.com Cover Image by Girish Suryavanshi ([email protected]) Credits Author Project Coordinator Shumin Guo Anurag Banerjee Reviewers Proofreader Hector Cuesta-Arvizu Lauren Tobon Mark Kerzner Harvinder Singh Saluja Indexer Hemangini Bari Acquisition Editor Kartikey Pandey Graphics Abhinash Sahu Lead Technical Editor Madhuja Chaudhari Production Coordinator Nitesh Thakur Technical Editors Sharvari Baet Cover Work Nitesh Thakur Jalasha D'costa Veena Pagare Amit Ramadas About the Author Shumin Guo is a PhD student of Computer Science at Wright State University in Dayton, OH. His research fields include Cloud Computing and Social Computing. He is enthusiastic about open source technologies and has been working as a System Administrator, Programmer, and Researcher at State Street Corp. and LexisNexis. I would like to sincerely thank my wife, Min Han, for her support both technically and mentally. This book would not have been possible without encouragement from her. About the Reviewers Hector Cuesta-Arvizu provides consulting services for software engineering and data analysis with over eight years of experience in a variety of industries, including financial services, social networking, e-learning, and Human Resources. Hector holds a BA in Informatics and an MSc in Computer Science. His main research interests lie in Machine Learning, High Performance Computing, Big Data, Computational Epidemiology, and Data Visualization. He has also helped in the technical review of the book Raspberry Pi Networking Cookbook by Rick Golden, Packt Publishing. He has published 12 scientific papers in International Conferences and Journals. He is an enthusiast of Lego Robotics and Raspberry Pi in his spare time. You can follow him on Twitter at https://twitter.com/hmCuesta. Mark Kerzner holds degrees in Law, Math, and Computer Science. He has been designing software for many years and Hadoop-based systems since 2008. He is the President of SHMsoft, a provider of Hadoop applications for various verticals, and a co-author of the book/project Hadoop Illuminated. He has authored and co-authored books and patents. I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and last but not least, my multitalented family. Harvinder Singh Saluja has over 20 years of software architecture and development experience, and is the co-founder of MindTelligent, Inc. He works as Oracle SOA, Fusion MiddleWare, and Oracle Identity and Access Manager, and Oracle Big Data Specialist and Chief Integration Specialist at MindTelligent, Inc. Harvinder's strengths include his experience with strategy, concepts, and logical and physical architecture and development using Java/JEE/ADF/SEAM, SOA/AIA/OSB/OSR/OER, and OIM/OAM technologies. He leads and manages MindTelligent's onshore and offshore and Oracle SOA/OSB/AIA/OSB/OER/OIM/OAM engagements. His specialty includes the AIA Foundation Pack – development of custom PIPS for Utilities, Healthcare, and Energy verticals. His integration engagements include CC&B (Oracle Utilities Customer Care and Billing), Oracle Enterprise Taxation and Policy, Oracle Utilities Mobile Workforce Management, Oracle Utilities Meter Data Management, Oracle eBusiness Suite, Siebel CRM, and Oracle B2B for EDI – X12 and EDIFACT. His strengths include enterprise-wide security using Oracle Identity and Access Management, OID/OVD/ODSM/OWSM, including provisioning, workflows, reconciliation, single sign-on, SPML API, Connector API, and Web Services message and transport security using OWSM and Java cryptography. He was awarded JDeveloper Java Extensions Developer of the Year award in 2003 by Oracle magazine. www.PacktPub.com Support files, eBooks, discount offers and more You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. http://PacktLib.PacktPub.com Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? f Fully searchable across every book published by Packt f Copy and paste, print and bookmark content f On demand and accessible via web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access. Table of Contents Preface 1 Chapter 1: Big Data and Hadoop 7 Introduction 7 Defining a Big Data problem 8 Building a Hadoop-based Big Data platform 9 Choosing from Hadoop alternatives 13 Chapter 2: Preparing for Hadoop Installation 17 Introduction 17 Choosing hardware for cluster nodes 19 Designing the cluster network 21 Configuring the cluster administrator machine 23 Creating the kickstart file and boot media 29 Installing the Linux operating system 35 Installing Java and other tools 39 Configuring SSH 44 Chapter 3: Configuring a Hadoop Cluster 49 Introduction 49 Choosing a Hadoop version 50 Configuring Hadoop in pseudo-distributed mode 51 Configuring Hadoop in fully-distributed mode 60 Validating Hadoop installation 70 Configuring ZooKeeper 80 Installing HBase 83 Installing Hive 87 Installing Pig 88 Installing Mahout 89 Table of Contents Chapter 4: Managing a Hadoop Cluster 93 Introduction 94 Managing the HDFS cluster 94 Configuring SecondaryNameNode 103 Managing the MapReduce cluster 105 Managing TaskTracker 107 Decommissioning DataNode 111 Replacing a slave node 112 Managing MapReduce jobs 114 Checking job history from the web UI 127 Importing data to HDFS 133 Manipulating files on HDFS 136 Configuring the HDFS quota 140 Configuring CapacityScheduler 142 Configuring Fair Scheduler 146 Configuring Hadoop daemon logging 150 Configuring Hadoop audit logging 155 Upgrading Hadoop 157 Chapter 5: Hardening a Hadoop Cluster 163 Introduction 163 Configuring service-level authentication 163 Configuring job authorization with ACL 166 Securing a Hadoop cluster with Kerberos 169 Configuring web UI authentication 176 Recovering from NameNode failure 180 Configuring NameNode high availability 185 Configuring HDFS federation 192 Chapter 6: Monitoring a Hadoop Cluster 199 Introduction 199 Monitoring a Hadoop cluster with JMX 200 Monitoring a Hadoop cluster with Ganglia 207 Monitoring a Hadoop cluster with Nagios 217 Monitoring a Hadoop cluster with Ambari 224 Monitoring a Hadoop cluster with Chukwa 235 ii Table of Contents Chapter 7: Tuning a Hadoop Cluster for Best Performance 245 Introduction 246 Benchmarking and profiling a Hadoop cluster 247 Analyzing job history with Rumen 259 Benchmarking a Hadoop cluster with GridMix 262 Using Hadoop Vaidya to identify performance problems 269 Balancing data blocks for a Hadoop cluster 274 Choosing a proper block size 277 Using compression for input and output 278 Configuring speculative execution 281 Setting proper number of map and reduce slots for the TaskTracker 285 Tuning the JobTracker configuration 286 Tuning the TaskTracker configuration 289 Tuning shuffle, merge, and sort parameters 293 Configuring memory for a Hadoop cluster 297 Setting proper number of parallel copies 300 Tuning JVM parameters 301 Configuring JVM Reuse 302 Configuring the reducer initialization time 304 Chapter 8: Building a Hadoop Cluster with Amazon EC2 and S3 307 Introduction 307 Registering with Amazon Web Services (AWS) 308 Managing AWS security credentials 312 Preparing a local machine for EC2 connection 316 Creating an Amazon Machine Image (AMI) 317 Using S3 to host data 336 Configuring a Hadoop cluster with the new AMI 337 Index 345 iii Table of Contents iv Preface Today, many organizations are facing the Big Data problem. Managing and processing Big Data