BIG DATA & ANALYTICS
MAPREDUCE/HADOOP – A PROGRAMMER’S PERSPECTIVE

Tushar Telichari ([email protected])
Principal Engineer – NetWorker Development
EMC Proven Specialist - Data Center Architect
EMC Corporation

Table of Contents

Introduction
MapReduce Framework
    What is MapReduce?
    MapReduce programming model and constructs
        Steps in MapReduce
        “Hello World”: Word Count Program
        Analysis of the Word Count Program
Hadoop Setup & Maintenance
    Lab environment
    Prerequisites
    Setting up Hadoop environment on a single machine
        Installing Hadoop
        Configuring Hadoop
        Creating the HDFS filesystem
        Starting the Hadoop environment
        Hadoop daemons
        Verifying the Hadoop environment
        Hadoop Web Interfaces
    Setting up a Hadoop cluster
        Installing Hadoop
        Configuring Hadoop
        Starting the Hadoop environment
        Verifying the Hadoop environment
MapReduce/Hadoop Programming
    Developing custom applications based on MapReduce paradigm
        WordCount program
        Program analysis
Interacting with the Hadoop Distributed File System (HDFS)
References

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

2012 EMC Proven Professional Knowledge Sharing

Introduction

The volume of data being generated is growing exponentially, and enterprises are struggling to manage and analyze it. Most existing tools and methodologies for filtering and analyzing this data are too slow to yield meaningful results. Big Data has significant potential to create value for both businesses and consumers, and a growing number of technologies are now used to aggregate, manipulate, manage, and analyze it. This article covers two of the most prominent technologies: MapReduce and Hadoop.

1. MapReduce is a software framework introduced by Google for processing huge datasets for certain kinds of problems on a distributed system.
2. Hadoop is an open source software framework inspired by Google’s MapReduce and Google File System.

This Knowledge Sharing article takes an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem. It covers, among others, the following areas:

1. Hadoop Setup and Maintenance
       Setting up Hadoop environment on a single machine
       Setting up a Hadoop cluster
2. MapReduce/Hadoop Programming
       Developing custom applications based on MapReduce paradigm
3. Interacting with the Hadoop Distributed File System (HDFS)

MapReduce Framework

What is MapReduce?
MapReduce is a parallel programming model developed by Google as a mechanism for processing large amounts of raw data, e.g., web pages the search engine has crawled. This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing, because the same computation is performed on each CPU, each with a different portion of the data. MapReduce is an abstraction that allows simple computations to be expressed while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.

MapReduce programming model and constructs

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

Steps in MapReduce

1. Map works independently on each input record to convert input data to key-value pairs.
2. Reduce works independently on all values for a given key and transforms them into a single output set per key (possibly even an empty one).

Step      Input              Output
map       <k1, v1>           list(<k2, v2>)
reduce    <k2, list(v2)>     list(<k3, v3>)

Sequence of operations in MapReduce

1. Input to the application must be structured as a list of (key/value) pairs, list(<k1, v1>). The input format for processing multiple files is usually list(<String filename, String file_content>). The input format for processing one large file, such as a log file, is list(<Integer line_number, String log_event>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair, <k1, v1>, is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each <k1, v1> pair into a list of <k2, v2> pairs. The details of this transformation largely determine what the MapReduce program
   does. The (key/value) pairs are processed in arbitrary order. The transformation must be self-contained in that its output depends only on one single (key/value) pair.

“Hello World”: Word Count Program

Word count is the traditional “hello world” program for MapReduce. The problem is to count the number of times each word occurs in a set of documents. The word count program is used extensively throughout the document. The map function reads in a stream of text and emits each word as a key with a value of 1; the reduce function sums the values emitted for each word.

Pseudo-code:

    Map(String input_key, String input_value) {
        // input_key: document name
        // input_value: document contents
        for each word w in input_value {
            EmitIntermediate(w, "1");
        }
    }

    Reduce(String key, Iterator intermediate_values) {
        // key: a word, same for input and output
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values {
            result += ParseInt(v);
        }
        Emit(AsString(result));
    }

Analysis of the Word Count Program

1. Map function – The mapper takes <String input_key, String input_value> as input and ignores the key (the filename). It could output a list of <String word, Integer count> pairs, but it can be even simpler: because the counts are aggregated in a later stage, the output can be a list of <String word, Integer 1> pairs with repeated entries. That is, the output list can contain the (key/value) pair <“some_text”, 3> once, or it can contain the pair <“some_text”, 1> three times.
2. Reduce function – The map output for one document may be a list containing the pair <“some_text”, 1> three times, and the map output for another document may be a list containing the pair <“some_text”, 1> twice. The aggregated pair the reducer will see is <“some_text”, list(1, 1, 1, 1, 1)>.
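
The word count flow described above can be simulated on a single machine. The following is a minimal Python sketch, not Hadoop code. The names map_fn, map_fn_preaggregated, reduce_fn, and run_mapreduce are hypothetical, splitting on whitespace is an assumed tokenization, and the in-memory grouping step only stands in for the aggregation a real framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """Map: emit <word, 1> for every word; input_key (document name) is ignored."""
    for word in input_value.split():
        yield word, 1


def map_fn_preaggregated(input_key, input_value):
    """Alternative map: emit <word, count> once per distinct word in a document."""
    counts = defaultdict(int)
    for word in input_value.split():
        counts[word] += 1
    yield from counts.items()


def reduce_fn(key, intermediate_values):
    """Reduce: sum all counts collected for one word."""
    yield key, sum(intermediate_values)


def run_mapreduce(documents, mapper, reducer):
    """Run map, group values by key, then reduce, all in memory (illustration only)."""
    grouped = defaultdict(list)
    for name, contents in documents:            # list(<k1, v1>)
        for k2, v2 in mapper(name, contents):   # list(<k2, v2>)
            grouped[k2].append(v2)              # <k2, list(v2)>
    results = {}
    for k2, values in grouped.items():
        for k3, v3 in reducer(k2, values):      # list(<k3, v3>)
            results[k3] = v3
    return results


# Hypothetical sample input: "some_text" occurs three times, then twice.
documents = [
    ("doc1.txt", "some_text a some_text b some_text"),
    ("doc2.txt", "some_text c some_text"),
]

counts = run_mapreduce(documents, map_fn, reduce_fn)
counts_pre = run_mapreduce(documents, map_fn_preaggregated, reduce_fn)
print(counts["some_text"])   # 5
print(counts == counts_pre)  # True
```

With these two sample documents, both mappers lead to the same totals. This is why the mapper is free to emit <word, 1> repeatedly or a single pre-aggregated <word, count> pair: the reducer sums whatever values arrive for a key.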
