Hadoop Real-World Solutions Cookbook
Realistic, simple code examples to solve problems at scale with Hadoop and related technologies

Jonathan R. Owens
Jon Lentz
Brian Femiano

BIRMINGHAM - MUMBAI

Hadoop Real-World Solutions Cookbook
Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2013
Production Reference: 1280113

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-84951-912-0
www.packtpub.com

Cover Image by iStockPhoto

Credits

Authors: Jonathan R. Owens, Jon Lentz, Brian Femiano
Reviewers: Edward J. Cody, Daniel Jue, Bruce C. Miller
Acquisition Editor: Robin de Jongh
Lead Technical Editor: Azharuddin Sheikh
Technical Editor: Dennis John
Copy Editors: Brandt D'Mello, Insiya Morbiwala, Aditya Nair, Alfida Paiva, Ruta Waghmare
Project Coordinator: Abhishek Kori
Proofreader: Stephen Silk
Indexer: Monica Ajmera Mehta
Graphics: Conidon Miranda
Layout Coordinator: Conidon Miranda
Cover Work: Conidon Miranda

About the Authors

Jonathan R. Owens has a background in Java and C++, and has worked in both the private and public sectors as a software engineer. Most recently, he has been working with Hadoop and related distributed processing technologies. Currently, he works for comScore, Inc., a widely regarded digital measurement and analytics company. At comScore, he is a member of the core processing team, which uses Hadoop and other custom distributed systems to aggregate, analyze, and manage over 40 billion transactions per day.

I would like to thank my parents, James and Patricia Owens, for their support and for introducing me to technology at a young age.

Jon Lentz is a Software Engineer on the core processing team at comScore, Inc., an online audience measurement and analytics company. He prefers to do most of his coding in Pig. Before working at comScore, he developed software to optimize supply chains and allocate fixed-income securities.

To my daughter, Emma, born during the writing of this book. Thanks for the company on late nights.

Brian Femiano has a B.S. in Computer Science and has been programming professionally for over 6 years, the last two of which have been spent building advanced analytics and Big Data capabilities using Apache Hadoop. He has worked in the commercial sector in the past, but the majority of his experience comes from the government contracting space. He currently works for Potomac Fusion in the DC/Virginia area, where they develop scalable algorithms to study and enhance some of the most advanced and complex datasets in the government space. Within Potomac Fusion, he has taught courses and conducted training sessions to help teach Apache Hadoop and related cloud-scale technologies.

I'd like to thank my co-authors for their patience and hard work building the code you see in this book. Also, my various colleagues at Potomac Fusion, whose talent and passion for building cutting-edge capabilities and promoting knowledge transfer have inspired me.

About the Reviewers

Edward J. Cody is an author, speaker, and industry expert in data warehousing, Oracle Business Intelligence, and Hyperion EPM implementations. He is the author and co-author, respectively, of two books with Packt Publishing, titled The Business Analyst's Guide to Oracle Hyperion Interactive Reporting 11 and The Oracle Hyperion Interactive Reporting 11 Expert Guide. He has consulted for both commercial and federal government clients throughout his career, and is currently managing large-scale EPM, BI, and data warehouse implementations.

I would like to commend the authors of this book for a job well done, and would like to thank Packt Publishing for the opportunity to assist in the editing of this publication.

Daniel Jue is a Sr. Software Engineer at Sotera Defense Solutions and a member of the Apache Software Foundation. He has worked in peace and conflict zones to showcase the hidden dynamics and anomalies in the underlying "Big Data", with clients such as ACSIM, DARPA, and various federal agencies. Daniel holds a B.S. in Computer Science from the University of Maryland, College Park, where he also specialized in Physics and Astronomy. His current interests include merging distributed artificial intelligence techniques with adaptive heterogeneous cloud computing.

I'd like to thank my beautiful wife, Wendy, and my twin sons, Christopher and Jonathan, for their love and patience while I research and review. I owe a great deal to Brian Femiano, Bruce Miller, and Jonathan Larson for exposing me to many great ideas, points of view, and zealous inspiration.

Bruce Miller is a Senior Software Engineer for Sotera Defense Solutions, currently employed at DARPA, with most of his 10-year career focused on Big Data software development. His non-work interests include functional programming in languages such as Haskell and Lisp dialects, and their application to real-world problems.
www.packtpub.com

Support files, eBooks, discount offers, and more

You might want to visit www.packtpub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packtpub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packtpub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://packtLib.packtPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.packtpub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Table of Contents

Preface

Chapter 1: Hadoop Distributed File System – Importing and Exporting Data
- Introduction
- Importing and exporting data into HDFS using Hadoop shell commands
- Moving data efficiently between clusters using Distributed Copy
- Importing data from MySQL into HDFS using Sqoop
- Exporting data from HDFS into MySQL using Sqoop
- Configuring Sqoop for Microsoft SQL Server
- Exporting data from HDFS into MongoDB
- Importing data from MongoDB into HDFS
- Exporting data from HDFS into MongoDB using Pig
- Using HDFS in a Greenplum external table
- Using Flume to load data into HDFS

Chapter 2: HDFS
- Introduction
- Reading and writing data to HDFS
- Compressing data using LZO
- Reading and writing data to SequenceFiles
- Using Apache Avro to serialize data
- Using Apache Thrift to serialize data
- Using Protocol Buffers to serialize data
- Setting the replication factor for HDFS
- Setting the block size for HDFS

Chapter 3: Extracting and Transforming Data
- Introduction
- Transforming Apache logs into TSV format using MapReduce
- Using Apache Pig to filter bot traffic from web server logs
- Using Apache Pig to sort web server log data by timestamp
- Using Apache Pig to sessionize web server log data
- Using Python to extend Apache Pig functionality
- Using MapReduce and secondary sort to calculate page views
- Using Hive and Python to clean and transform geographical event data
- Using Python and Hadoop Streaming to perform a time series analytic
- Using MultipleOutputs in MapReduce to name output files
- Creating custom Hadoop Writable and InputFormat to read geographical event data

Chapter 4: Performing Common Tasks Using Hive, Pig, and MapReduce
- Introduction
- Using Hive to map an external table over weblog data in HDFS
- Using Hive to dynamically create tables from the results of a weblog query
- Using the Hive string UDFs to concatenate fields in weblog data
- Using Hive to intersect weblog IPs and determine the country
- Generating n-grams over news archives using MapReduce
- Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives
- Using Pig to load a table and perform a SELECT operation with GROUP BY

Chapter 5: Advanced Joins
- Introduction
- Joining data in the Mapper using MapReduce
- Joining data using Apache Pig replicated join
- Joining sorted data using Apache Pig merge join
- Joining skewed data using Apache Pig skewed join
- Using a map-side join in Apache