BIG DATA & ANALYTICS
MAPREDUCE/HADOOP – A PROGRAMMER'S PERSPECTIVE
Tushar Telichari ([email protected])
Principal Engineer – NetWorker Development
EMC Proven Specialist – Data Center Architect
EMC Corporation

Table of Contents
Introduction
MapReduce Framework
  What is MapReduce?
  MapReduce programming model and constructs
  Steps in MapReduce
  "Hello World": Word Count Program
  Analysis of the Word Count Program
Hadoop Setup and Maintenance
  Lab environment
  Prerequisites
  Setting up Hadoop environment on a single machine
    Installing Hadoop
    Configuring Hadoop
    Creating the HDFS filesystem
    Starting the Hadoop environment
    Hadoop daemons
    Verifying the Hadoop environment
    Hadoop Web Interfaces
  Setting up a Hadoop cluster
    Installing Hadoop
    Configuring Hadoop
    Starting the Hadoop environment
    Verifying the Hadoop environment
MapReduce/Hadoop Programming
  Developing custom applications based on MapReduce paradigm
  WordCount program
  Program analysis
Interacting with the Hadoop Distributed File System (HDFS)
References

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation's views, processes, or methodologies.


Introduction The volume of data being generated is growing exponentially and enterprises are struggling to manage and analyze it. Most existing tools and methodologies to filter and analyze this data offer inadequate speed and performance to yield meaningful results.

Big Data has significant potential to create value for both businesses and consumers. A growing number of technologies are now used to aggregate, manipulate, manage, and analyze Big Data.

This article covers two of the most prominent technologies: MapReduce and Hadoop.

1. MapReduce is a software framework introduced by Google for processing huge datasets on certain kinds of problems on a distributed system.
2. Hadoop is an open source software framework inspired by Google's MapReduce and the Google File System (GFS).

This Knowledge Sharing article takes an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem and will cover, but is not limited to, the following areas:

1. Hadoop Setup and Maintenance

 Setting up Hadoop environment on a single machine

 Setting up a Hadoop cluster

2. MapReduce/Hadoop Programming

 Developing custom applications based on MapReduce paradigm

3. Interacting with the Hadoop Distributed File System (HDFS)


MapReduce Framework
What is MapReduce?
MapReduce is a parallel programming model developed by Google as a mechanism for processing large amounts of raw data, e.g., web pages the search engine has crawled. This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel processing, since the same computations are performed on each CPU, but with a different dataset.

MapReduce is an abstraction that allows simple computations to be performed while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.

MapReduce programming model and constructs
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

Steps in MapReduce
1. Map works independently to convert input data to key-value pairs
2. Reduce works independently on all values for a given key and transforms them into a single output set (possibly even an empty one) per key

Step      Input                 Output
map       <k1, v1>              list(<k2, v2>)
reduce    <k2, list(v2)>        list(<k3, v3>)
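The table maps directly onto the type parameters of Hadoop's Mapper and Reducer base classes in the org.apache.hadoop.mapreduce API (the API used by the WordCount example later in this article). The following skeleton is only a sketch: the class names and the choice of <k1, v1> = <byte offset of a line, line text> are illustrative assumptions, not code from the Hadoop distribution.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// <k1, v1> = <byte offset of a line, line text>; <k2, v2> and <k3, v3> = <word, count>
public class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit zero or more intermediate <k2, v2> pairs here, e.g.
        // context.write(new Text("someWord"), new IntWritable(1));
    }
}

class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // collapse all values seen for one key into a single <k3, v3> output pair
        // context.write(key, new IntWritable(someTotal));
    }
}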

Sequence of operations in MapReduce

1. Input to the application must be structured as a list of (key/value) pairs, list(<k1, v1>). The input format for processing multiple files is usually list(<String filename, String file_content>). The input format for processing one large file, such as a log file, is list(<Integer line_number, String log_event>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair, <k1, v1>, is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each <k1, v1> pair into a list of <k2, v2> pairs. The details of this transformation largely determine what the MapReduce program does. The (key/value) pairs are processed in arbitrary order, and the transformation must be self-contained in that its output depends only on one single (key/value) pair.
3. The output of all the mappers is then aggregated: all <k2, v2> pairs sharing the same key k2 are grouped into a single pair <k2, list(v2)>, and the reduce function is called once for each of these aggregated pairs to produce the final list(<k3, v3>) output.

"Hello World": Word Count Program
Word count is the traditional "hello world" program for MapReduce. The problem definition is to count the number of times each word occurs in a set of documents. The word count program is used extensively throughout this article.

The program reads in a stream of text and emits each word as a key with a value of 1.

Pseudo-code:

Map(String input_key, String input_value) {
  // input_key: document name
  // input_value: document contents
  for each word w in input_value {
    EmitIntermediate(w, "1");
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(AsString(result));
}


Analysis of the Word Count Program
1. Map function – The mapper takes <filename, file_content> as input parameters and ignores the filename. It can output a list of <String word, Integer count> pairs, but it can be even simpler. Because the counts are aggregated in a later stage, the output can be a list of <String word, Integer 1> pairs with repeated entries; the complete aggregation is done later in the program lifecycle. In other words, in the output list we can have the (key/value) pair <"some_text", 3> once, or we can have the pair <"some_text", 1> three times.
2. Reduce function – The map output for one document may be a list with the pair <"some_text", 1> three times, and the map output for another document may be a list with the pair <"some_text", 1> twice. The aggregated pair the reducer will see is <"some_text", list(1,1,1,1,1)>. The output of the reducer function is <"some_text", 5>, which is the total number of times "some_text" has occurred in the document set. Each reducer works on a different word.
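As noted above, the map output may contain the pair <"some_text", 1> repeatedly or the pre-aggregated pair <"some_text", 3> once. The following is a hedged sketch of the second approach, assuming the org.apache.hadoop.mapreduce API used later in this article; the class name and the use of the cleanup() hook are illustrative, not taken from the bundled examples.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: aggregates counts inside the mapper so each word is
// emitted once per map task instead of once per occurrence.
public class AggregatingWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the locally aggregated <word, n> pairs once the map task finishes
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

Either way, the reducer described above produces the same final counts; pre-aggregation only reduces the amount of intermediate data shuffled between the map and reduce phases.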


Hadoop Setup and Maintenance
The next few sections explain how to get a Hadoop installation up and running, first on a single machine and then on multiple nodes.

Lab environment

1. Two virtual machines running SUSE Linux Enterprise Server 11 (i586) SP1
2. Java SE Runtime Environment (build 1.6.0_30-b12)
3. Apache Hadoop distribution version 0.20.203.0

Prerequisites

1. Create a dedicated user group and user for Hadoop
   groupadd Hadoop
   useradd -G Hadoop hduser
   mkdir -p /home/hduser
   chown hduser:Hadoop /home/hduser
2. Java Development Kit version 1.6.x – download from http://www.java.com/en/download/manual.jsp
   ncdqd110:/usr/java/jdk1.6.0_30/bin # java -version
   java version "1.6.0_30"
   Java SE Runtime Environment (build 1.6.0_30-b12)
   Java HotSpot Client VM (build 20.5-b03, mixed mode, sharing)
3. Apache Hadoop distribution version 0.20.203.0 – download from http://www.apache.org/dyn/closer.cgi/Hadoop/common/
4. Ensure the SSH service is up and running
   Generate an SSH key for the Hadoop user:
   ssh-keygen -t rsa -P ""
   Enable SSH access to the local machine:
   cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
   Save the host key fingerprint:
   hduser@ncdqd110:~> ssh ncdqd110
   The authenticity of host 'ncdqd110 ()' can't be established
   RSA key fingerprint is
   Are you sure you want to continue connecting (yes/no)? yes


Warning: Permanently added 'ncdqd110, )' (RSA) to the list of known hosts
Last login: Tue Feb 7 13:38:08 2012 from ncdqd110.site

Setting up Hadoop environment on a single machine

Installing Hadoop
The following steps explain how to install the downloaded Hadoop package.

1. Extract the Hadoop distribution tar file (hadoop-0.20.203.0rc1.tar.gz) to /usr/local or any desired directory and give appropriate permissions to the Hadoop directory
   cd /usr/local
   tar -xzf hadoop-0.20.203.0rc1.tar.gz
   mv hadoop-0.20.203.0 Hadoop
   chown -R hduser:Hadoop Hadoop
2. Set up the .bashrc file for the hduser user
   hduser@ncdqd110:~> cat .bashrc
   export JAVA_HOME=/usr/java/jdk1.6.0_30
   export PATH=$JAVA_HOME/bin:$PATH
   export HADOOP_HOME=/usr/local/Hadoop
   export HADOOP_VERSION=0.20.203.0
   export PATH=$HADOOP_HOME/bin:$PATH

Configuring Hadoop
The following steps explain how to configure the installed Hadoop package. Four configuration files must be modified to match the installation and host configuration.

1. In the file /usr/local/Hadoop/conf/hadoop-env.sh, set JAVA_HOME to the appropriate Java installation directory.
   ----SNIP----
   # The java implementation to use. Required.
   # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
   export JAVA_HOME=/usr/java/jdk1.6.0_30
   ----SNIP----
2. Create a directory on disk to be used for HDFS
   mkdir /space/hdfs
   chown hduser:Hadoop /space/hdfs


3. Modify /usr/local/Hadoop/conf/core-site.xml as follows
   hduser@ncdqd110:/usr/local/Hadoop/conf> cat core-site.xml
   <configuration>
     <property>
       <name>hadoop.tmp.dir</name>
       <value>/space/hdfs</value>
       <description>A base directory for HDFS and other temporary directories.</description>
     </property>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://ncdqd110:54310</value>
       <description>The name of the default file system.</description>
     </property>
   </configuration>
4. Modify /usr/local/Hadoop/conf/mapred-site.xml as follows
   hduser@ncdqd110:/usr/local/Hadoop/conf> cat mapred-site.xml
   <configuration>
     <property>
       <name>mapred.job.tracker</name>
       <value>ncdqd110:54311</value>
       <description>The host and port that the MapReduce job tracker runs at.</description>
     </property>
   </configuration>


5. Modify /usr/local/Hadoop/conf/hdfs-site.xml as follows
   hduser@ncdqd110:/usr/local/Hadoop/conf> cat hdfs-site.xml
   <configuration>
     <property>
       <name>dfs.replication</name>
       <value>1</value>
       <description>Default block replication.</description>
     </property>
   </configuration>
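The property names set above are the same ones the Java API exposes through the Configuration class (used later in the WordCount driver). A minimal sketch follows, assuming the conf directory above is on the classpath; exactly which *-site.xml files a bare Configuration loads by default varies by Hadoop version, so treat the output as illustrative.

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads the default resources plus any site files found on the classpath
        Configuration conf = new Configuration();
        // The second argument is a fallback used if the property is not set
        System.out.println("fs.default.name    = " + conf.get("fs.default.name", "file:///"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
        System.out.println("dfs.replication    = " + conf.get("dfs.replication", "3"));
    }
}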


Creating the HDFS filesystem
Like any filesystem, the HDFS filesystem (backed here by /space/hdfs) must be formatted (initialized) before use.
hduser@ncdqd110:~> hadoop namenode -format

12/02/07 14:31:51 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG: host = ncdqd110.site/

STARTUP_MSG: args = [-format]

STARTUP_MSG: version = 0.20.203.0

STARTUP_MSG: build = http://svn.apache.org/repos/asf/Hadoop/common/branches/branch-0.20- security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011

************************************************************/

12/02/07 14:31:51 INFO util.GSet: VM type = 32-bit

12/02/07 14:31:51 INFO util.GSet: 2% max memory = 19.33375 MB

12/02/07 14:31:51 INFO util.GSet: capacity = 2^22 = 4194304 entries

12/02/07 14:31:51 INFO util.GSet: recommended=4194304, actual=4194304

12/02/07 14:31:51 INFO namenode.FSNamesystem: fsOwner=hduser

12/02/07 14:31:51 INFO namenode.FSNamesystem: supergroup=supergroup

12/02/07 14:31:51 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/02/07 14:31:51 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/02/07 14:31:51 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/02/07 14:31:51 INFO namenode.NameNode: Caching file names occuring more than 10 times

12/02/07 14:31:51 INFO common.Storage: Image file of size 112 saved in 0 seconds.

12/02/07 14:31:52 INFO common.Storage: Storage directory /space/hdfs/dfs/name has been successfully formatted.

12/02/07 14:31:52 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ncdqd110.site/

************************************************************/


Starting the Hadoop environment
The Hadoop daemons can be started using the start-all.sh script bundled with the Hadoop package.
hduser@ncdqd110:~> start-all.sh
starting namenode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-namenode-ncdqd110.out
ncdqd110: starting datanode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-datanode-ncdqd110.out
ncdqd110: starting secondarynamenode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-secondarynamenode-ncdqd110.out
starting jobtracker, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-jobtracker-ncdqd110.out
ncdqd110: starting tasktracker, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-tasktracker-ncdqd110.out

The list of running Hadoop processes can be obtained as follows:
hduser@ncdqd110:~> jps

22593 TaskTracker

22455 JobTracker

22379 SecondaryNameNode

23106 Jps

22251 DataNode

22112 NameNode


Hadoop daemons
The Hadoop environment comprises the following daemons:

1. NameNode – The NameNode is the master of HDFS and directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of how files are broken down into blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
2. DataNode – Each slave machine in a Hadoop cluster hosts a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem. When an HDFS file operation is required, the file is broken into blocks and the NameNode tells the client which DataNode each block resides on. The client then communicates directly with the DataNode daemons to process the local files corresponding to the blocks. DataNodes may also communicate with each other to replicate data blocks for redundancy.
3. Secondary NameNode – The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine; no other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that it does not receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
4. JobTracker – The JobTracker daemon is the liaison between the application and Hadoop. Once a client submits code to the Hadoop cluster, the JobTracker builds the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they run. Should a task fail, the JobTracker automatically relaunches it, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster; it typically runs on a server acting as the master node of the cluster.
5. TaskTracker – The JobTracker is the master overseeing the overall execution of a MapReduce job, while the TaskTrackers manage the execution of individual tasks on each slave node.
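The division of labor described above can be observed from a client program: the org.apache.hadoop.fs.FileSystem API asks the NameNode which DataNodes hold the blocks of a file. The following is a minimal sketch, assuming the single-node setup above and an illustrative path that already exists in HDFS.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // talks to the NameNode
        Path file = new Path("/user/hduser/sample.txt"); // illustrative path
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers with the DataNodes that hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + ", length " + block.getLength()
                    + ", hosts " + Arrays.toString(block.getHosts()));
        }
    }
}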


Verifying the Hadoop environment
The Hadoop environment can be verified by running any of the various examples bundled with the Hadoop package.

The following example estimates the value of π:
hduser@ncdqd110:/usr/local/Hadoop> hadoop jar hadoop-examples-0.20.203.0.jar pi 2 10

Number of Maps = 2

Samples per Map = 10

Wrote input for Map #0

Wrote input for Map #1

Starting Job

12/02/07 14:55:30 INFO mapred.FileInputFormat: Total input paths to process : 2

12/02/07 14:55:30 INFO mapred.JobClient: Running job: job_201202071444_0001

12/02/07 14:55:31 INFO mapred.JobClient: map 0% reduce 0%

12/02/07 14:55:44 INFO mapred.JobClient: map 100% reduce 0%

12/02/07 14:55:59 INFO mapred.JobClient: map 100% reduce 100%

12/02/07 14:56:04 INFO mapred.JobClient: Job complete: job_201202071444_0001

12/02/07 14:56:04 INFO mapred.JobClient: Counters: 26

12/02/07 14:56:04 INFO mapred.JobClient: Job Counters

12/02/07 14:56:04 INFO mapred.JobClient: Launched reduce tasks=1

12/02/07 14:56:04 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14425

12/02/07 14:56:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

12/02/07 14:56:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0

12/02/07 14:56:04 INFO mapred.JobClient: Launched map tasks=2

12/02/07 14:56:04 INFO mapred.JobClient: Data-local map tasks=2

12/02/07 14:56:04 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12877

12/02/07 14:56:04 INFO mapred.JobClient: File Input Format Counters

12/02/07 14:56:04 INFO mapred.JobClient: Bytes Read=236


12/02/07 14:56:04 INFO mapred.JobClient: File Output Format Counters

12/02/07 14:56:04 INFO mapred.JobClient: Bytes Written=97

12/02/07 14:56:04 INFO mapred.JobClient: FileSystemCounters

12/02/07 14:56:04 INFO mapred.JobClient: FILE_BYTES_READ=50

12/02/07 14:56:04 INFO mapred.JobClient: HDFS_BYTES_READ=482

12/02/07 14:56:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=63540

12/02/07 14:56:04 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215

12/02/07 14:56:04 INFO mapred.JobClient: Map-Reduce Framework

12/02/07 14:56:04 INFO mapred.JobClient: Map output materialized bytes=56

12/02/07 14:56:04 INFO mapred.JobClient: Map input records=2

12/02/07 14:56:04 INFO mapred.JobClient: Reduce shuffle bytes=56

12/02/07 14:56:04 INFO mapred.JobClient: Spilled Records=8

12/02/07 14:56:04 INFO mapred.JobClient: Map output bytes=36

12/02/07 14:56:04 INFO mapred.JobClient: Map input bytes=48

12/02/07 14:56:04 INFO mapred.JobClient: Combine input records=0

12/02/07 14:56:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=246

12/02/07 14:56:04 INFO mapred.JobClient: Reduce input records=4

12/02/07 14:56:04 INFO mapred.JobClient: Reduce input groups=4

12/02/07 14:56:04 INFO mapred.JobClient: Combine output records=0

12/02/07 14:56:04 INFO mapred.JobClient: Reduce output records=0

12/02/07 14:56:04 INFO mapred.JobClient: Map output records=4

Job Finished in 34.332 seconds

Estimated value of Pi is 3.80000000000000000000
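The estimate is coarse because the job sampled only 2 maps × 10 points. The bundled pi example distributes the sampling across map tasks; the following single-machine Monte Carlo sketch only illustrates the underlying dartboard idea and is not the code the example actually runs.

import java.util.Random;

// Simplified illustration: the fraction of random points in the unit square
// that fall inside a quarter circle approaches pi/4 as the sample count grows.
public class SimplePi {
    public static void main(String[] args) {
        long samples = args.length > 0 ? Long.parseLong(args[0]) : 20; // 2 maps x 10 samples above
        Random random = new Random();
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        System.out.println("Estimated value of Pi is " + (4.0 * inside / samples));
    }
}

With only 20 points the estimate swings widely, which is why the job above reports 3.8; increasing the number of maps and samples per map (for example, pi 10 100000) brings the result much closer to π.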


Hadoop Web Interfaces
Hadoop comes with several web interfaces which are, by default (see conf/Hadoop-default.xml), available at these locations:

1. http://<hostname>:50030/ – Monitoring for MapReduce job tracker(s)


2. http://<hostname>:50060/ – Task tracker(s)


3. http://<hostname>:50070/ – HDFS name node(s)


Setting up a Hadoop cluster
A Hadoop cluster can be set up by configuring a single-machine Hadoop environment on two or more individual machines and then linking them together in a master/slave model.


Installing Hadoop
Install Hadoop on all the participating machines as described in the "Setting up Hadoop environment on a single machine" section.

Configuring Hadoop
In addition to the configuration steps mentioned in the "Setting up Hadoop environment on a single machine" section, the following changes are required:

1. Verify all machines are reachable via SSH and can be resolved by name
2. Modify the masters and slaves files. In this example, ncdqd110 is the master node and ncdqd107 is the slave node
   hduser@ncdqd110:/usr/local/Hadoop/conf> cat masters
   ncdqd110
   hduser@ncdqd110:/usr/local/Hadoop/conf> cat slaves
   ncdqd110
   ncdqd107
3. Update hdfs-site.xml to reflect the number of participating nodes (2, in this example)
   <property>
     <name>dfs.replication</name>
     <value>2</value>
     <description>Default block replication.</description>
   </property>
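Note that dfs.replication is only the default applied to newly created files; replication is tracked per file by the NameNode. The following minimal sketch (illustrative path, assuming the HDFS Java API) reads and changes a file's replication factor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hduser/sample.txt"); // illustrative path
        FileStatus status = fs.getFileStatus(file);
        System.out.println(file + " is stored with replication factor " + status.getReplication());
        // Individual files can be given a different replication factor than the default
        fs.setReplication(file, (short) 2);
    }
}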


Starting the Hadoop environment
The Hadoop processes can be started using the start-all.sh script bundled with the Hadoop package. The datanode process is started on the slave nodes.
hduser@ncdqd110:~> start-all.sh
starting namenode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-namenode-ncdqd110.out
ncdqd107: starting datanode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-datanode-ncdqd107.out
ncdqd110: starting datanode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-datanode-ncdqd110.out
ncdqd110: starting secondarynamenode, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-secondarynamenode-ncdqd110.out
starting jobtracker, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-jobtracker-ncdqd110.out
ncdqd107: starting tasktracker, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-tasktracker-ncdqd107.out
ncdqd110: starting tasktracker, logging to /usr/local/Hadoop/bin/../logs/Hadoop-hduser-tasktracker-ncdqd110.out

Daemons running on the master and slave nodes:
hduser@ncdqd110:/usr/local/hadoop> jps

3930 JobTracker

3852 SecondaryNameNode

3555 NameNode

3698 DataNode

17505 Jps

4063 TaskTracker

hduser@ncdqd107:~> jps

16044 DataNode

28125 Jps

16160 TaskTracker


Verifying the Hadoop environment
The π example can be run on the master node (ncdqd110) to verify the cluster environment. The log file on the slave machine (ncdqd107) shows the communication between the two nodes.

Sample entries from the slave's log file ("Hadoop-hduser-datanode-ncdqd107.log")

----SNIP----

2012-02-07 16:40:12,076 INFO org.apache.Hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.31.227.107:50010, dest: /10.31.227.107:45306, bytes: 37, op: HDFS_READ, cliID: DFSClient_attempt_201202071639_0001_m_000002_0, offset: 0, srvID: DS-1302676866-10.31.227.107- 50010-1328611292566, blockid: blk_-1910190658400991473_1059, duration: 19738620

2012-02-07 16:40:20,939 INFO org.apache.Hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-383433043823652373_1088 src: /10.31.227.110:41630 dest: /10.31.227.107:50010

2012-02-07 16:40:20,949 INFO org.apache.Hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.31.227.110:41630, dest: /10.31.227.107:50010, bytes: 50, op: HDFS_WRITE, cliID: DFSClient_attempt_201202071639_0001_r_000000_0, offset: 0, srvID: DS-1302676866-10.31.227.107- 50010-1328611292566, blockid: blk_-383433043823652373_1088, duration: 3065972

----SNIP----


MapReduce/Hadoop Programming
The following section analyzes the WordCount example bundled with the Hadoop distribution.

Developing custom applications based on MapReduce paradigm

WordCount program
Source code – /usr/local/Hadoop/src/examples/org/apache/hadoop/examples/WordCount.java

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Program analysis
The WordCount program is divided into the following logical sections.

1. Job configuration
    Identify the classes implementing the Mapper and Reducer interfaces:
     job.setMapperClass(TokenizerMapper.class);
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);
    Specify the inputs and outputs:
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
2. Job submission
   Submit the job to the cluster and wait for it to finish: job.waitForCompletion
3. Mapper class
   The Mapper implementation, via the TokenizerMapper class, processes one line at a time. It splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <word, 1> (context.write(word, one)).
4. Reducer class
   The Reducer implementation, via the IntSumReducer class, simply sums the values, which are the occurrence counts for each key (i.e., each word in this example).
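Writing a custom application usually amounts to supplying a different Mapper and/or Reducer and wiring it into the same driver skeleton. The sketch below is an illustrative variation, not part of the Hadoop distribution: it lower-cases each token and skips one-character tokens before emitting <word, 1>, and can reuse IntSumReducer unchanged.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative variation of TokenizerMapper: normalizes case and filters short tokens
public class NormalizingMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken().toLowerCase();
            if (token.length() > 1) {       // skip one-character tokens
                word.set(token);
                context.write(word, one);
            }
        }
    }
}

Only the driver line job.setMapperClass(TokenizerMapper.class) would change to job.setMapperClass(NormalizingMapper.class); the rest of the job configuration stays as shown above.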


Interacting with the Hadoop Distributed File System (HDFS)
HDFS operations are performed via the "hadoop dfs" command.
hduser@ncdqd110:/usr/local/Hadoop> hadoop dfs
Usage: java FsShell
           [-ls <path>]
           [-lsr <path>]
           [-du <path>]
           [-dus <path>]
           [-count[-q] <path>]
           [-mv <src> <dst>]
           [-cp <src> <dst>]
           [-rm [-skipTrash] <path>]
           [-rmr [-skipTrash] <path>]
           [-expunge]
           [-put <localsrc> ... <dst>]
           [-copyFromLocal <localsrc> ... <dst>]
           [-moveFromLocal <localsrc> ... <dst>]
           [-get [-ignoreCrc] [-crc] <src> <localdst>]
           [-getmerge <src> <localdst> [addnl]]
           [-cat <src>]
           [-text <src>]
           [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
           [-moveToLocal [-crc] <src> <localdst>]
           [-mkdir <path>]
           [-setrep [-R] [-w] <rep> <path/file>]
           [-touchz <path>]
           [-test -[ezd] <path>]
           [-stat [format] <path>]
           [-tail [-f] <file>]
           [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
           [-chown [-R] [OWNER][:[GROUP]] PATH...]
           [-chgrp [-R] GROUP PATH...]
           [-help [cmd]]

Generic options supported are
-conf <configuration file>                   specify an application configuration file
-D <property=value>                          use value for given property
-fs <local|namenode:port>                    specify a namenode
-jt <local|jobtracker:port>                  specify a job tracker
-files <comma separated list of files>       specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>      specify comma separated jar files to include in the classpath
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
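The same operations are also available programmatically through the org.apache.hadoop.fs.FileSystem API, which is what the shell uses underneath. A minimal sketch follows, assuming the cluster configured above; the paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/hduser/demo");       // illustrative path
        fs.mkdirs(dir);                                 // equivalent of -mkdir

        FSDataOutputStream out = fs.create(new Path(dir, "hello.txt")); // roughly -put
        out.writeBytes("hello hdfs\n");
        out.close();

        for (FileStatus status : fs.listStatus(dir)) {  // equivalent of -ls
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.delete(new Path(dir, "hello.txt"), false);   // equivalent of -rm
        fs.close();
    }
}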


References
MapReduce tutorial on the Hadoop site: http://Hadoop.apache.org/common/docs/current/mapred_tutorial.html
Hadoop API: http://hadoop.apache.org/common/docs/current/api/index.html

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
