MapReduce Design Patterns
MapReduce Design Patterns
Donald Miner and Adam Shook

MapReduce Design Patterns
by Donald Miner and Adam Shook

Copyright © 2013 Donald Miner and Adam Shook. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Andy Oram and Mike Hendrickson
Proofreader: Dawn Carelli
Production Editor: Christopher Hearse
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

December 2012: First Edition

Revision History for the First Edition:
2012-11-20 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449327170 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. MapReduce Design Patterns, the image of Père David’s deer, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32717-0
[LSI]

For William

Table of Contents

Preface
1. Design Patterns and MapReduce
    Design Patterns
    MapReduce History
    MapReduce and Hadoop Refresher
    Hadoop Example: Word Count
    Pig and Hive
2. Summarization Patterns
    Numerical Summarizations
        Pattern Description
        Numerical Summarization Examples
    Inverted Index Summarizations
        Pattern Description
        Inverted Index Example
    Counting with Counters
        Pattern Description
        Counting with Counters Example
3. Filtering Patterns
    Filtering
        Pattern Description
        Filtering Examples
    Bloom Filtering
        Pattern Description
        Bloom Filtering Examples
    Top Ten
        Pattern Description
        Top Ten Examples
    Distinct
        Pattern Description
        Distinct Examples
4. Data Organization Patterns
    Structured to Hierarchical
        Pattern Description
        Structured to Hierarchical Examples
    Partitioning
        Pattern Description
        Partitioning Examples
    Binning
        Pattern Description
        Binning Examples
    Total Order Sorting
        Pattern Description
        Total Order Sorting Examples
    Shuffling
        Pattern Description
        Shuffle Examples
5. Join Patterns
    A Refresher on Joins
    Reduce Side Join
        Pattern Description
        Reduce Side Join Example
        Reduce Side Join with Bloom Filter
    Replicated Join
        Pattern Description
        Replicated Join Examples
    Composite Join
        Pattern Description
        Composite Join Examples
    Cartesian Product
        Pattern Description
        Cartesian Product Examples
6. Metapatterns
    Job Chaining
        With the Driver
        Job Chaining Examples
        With Shell Scripting
        With JobControl
    Chain Folding
        The ChainMapper and ChainReducer Approach
        Chain Folding Example
    Job Merging
        Job Merging Examples
7. Input and Output Patterns
    Customizing Input and Output in Hadoop
        InputFormat
        RecordReader
        OutputFormat
        RecordWriter
    Generating Data
        Pattern Description
        Generating Data Examples
    External Source Output
        Pattern Description
        External Source Output Example
    External Source Input
        Pattern Description
        External Source Input Example
    Partition Pruning
        Pattern Description
        Partition Pruning Examples
8. Final Thoughts and the Future of Design Patterns
    Trends in the Nature of Data
        Images, Audio, and Video
        Streaming Data
    The Effects of YARN
    Patterns as a Library or Component
    How You Can Help
A. Bloom Filters
Index

Preface

Welcome to MapReduce Design Patterns! This book will be unique in some ways and familiar in others. First and foremost, this book is about design patterns, which are templates or general guides to solving problems. We looked to earlier design patterns books for inspiration, particularly Design Patterns: Elements of Reusable Object-Oriented Software, by Gamma et al. (1995), which is commonly referred to as “The Gang of Four” book. For each pattern, you’ll see a template that we reuse throughout and that is loosely based on theirs. Repeatedly seeing a similar template will help you get to the specific information you need. This will be especially useful later, when you use this book as a reference.

This book is a bit more open-ended than a book in the “cookbook” series of texts, as we don’t call out specific problems. However, as in the cookbooks, the lessons in this book are short and categorized. You’ll have to go a bit further than just copying and pasting our code to solve your problems, but we hope that you will find a pattern that gets you at least 90% of the way for just about all of your challenges.

This book is mostly about the analytics side of Hadoop or MapReduce.
We intentionally try not to dive into too much detail on how Hadoop or MapReduce works or talk too long about the APIs that we are using. These topics have been written about quite a few times, both online and in print, so we decided to focus on analytics.

In this preface, we’ll talk about how to read this book, since its format might be a bit different from that of most books you’ve read.

Intended Audience

Our motivation for writing this book was to fill a gap we saw in a lot of new MapReduce developers. They had learned how to use the system and had become comfortable writing MapReduce, but lacked the experience to understand how to do things right or well. The intent of this book is to spare you from making some of those mistakes yourself by showing you how experts have solved problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think beginners and gurus alike will also find it useful.

This book is also intended for anyone wanting to learn more about the MapReduce paradigm. The book goes deeply into the technical side of MapReduce with code examples and detailed explanations of the inner workings of a MapReduce system, which will help software engineers develop MapReduce analytics. However, quite a bit of time is spent discussing the motivation for some patterns and their common use cases, which could be interesting to someone who just wants to know what a system like Hadoop can do.

To get the most out of this book, we suggest you have some knowledge of Hadoop, as all of the code examples are written for Hadoop and many of the patterns are discussed in a Hadoop context. A brief refresher will be given in the first chapter, along with some suggestions for additional reading material.

Pattern Format

The patterns in this book follow a single template format so they are easier to read in succession.
Some patterns will omit some of the sections if they don’t make sense in the context of that pattern.

Intent
    This section is a quick description of the problem the pattern is intended to solve.

Motivation
    This section explains why you would want to solve this problem or where it would appear. Some use cases are typically discussed in brief.

Applicability
    This section contains a set of criteria that must be true for you to be able to apply this pattern to a problem. Sometimes these are limitations in the design of the pattern, and sometimes they help you make sure this pattern will work in your situation.

Structure
    This section explains the layout of the MapReduce job itself. It explains what the map phase does and what the reduce phase does, and notes whether the job uses any custom partitioners, combiners, or input formats. This is the meat of the pattern and explains how to solve the problem.

Consequences
    This short section explains what the output of the pattern will be, which is the end goal of applying the pattern.

Resemblances
    For readers who have some experience with SQL or Pig, this section shows analogies of how this problem would be solved in those languages. You may even find yourself reading this section first, as it gets straight to the point of what this pattern does. Sometimes SQL, Pig, or both are omitted if what we are doing with MapReduce is truly unique.

Known Uses
    This section outlines some common use cases for this pattern.

Performance Analysis
    This section explains the performance profile of the analytic produced by the pattern. Understanding this is important because every MapReduce analytic needs to be tweaked and configured properly to maximize performance. Without knowing what resources it is using on your cluster, it would be difficult to do this.

The Examples in This Book

All of the examples in this book are written for Hadoop version 1.0.3.
MapReduce is a paradigm that is seen in a number of open source and commercial systems these days, but we had to pick one to make our examples consistent and easy to follow, so we picked Hadoop.