Hadoop Install and Basic Use Guide
Authors: Daniel Lust, Anthony Tallercio

Introduction

This guide will show you how to install Apache Hadoop in a Linux environment. Hadoop allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task. It is written in Java and is used by large companies running clusters with many nodes, such as Facebook [1] and Yahoo [2], but it can run on nearly any hardware at any scale. This guide covers the requirements for Hadoop, the full installation process, and how to execute a job (a MapReduce word count).

Your average Hadoop cluster consists of two major parts: a single master node and multiple worker nodes. The master node runs four components: the JobTracker, TaskTracker, NameNode, and DataNode. A worker node, also known as a slave node, can act as both a DataNode and a TaskTracker, or as just one of the two; in other words, a worker node can be a data-only node or a compute-only node. JRE 1.6 or higher is required on every machine to run Hadoop, and SSH access between all machines in the cluster is required so the startup and shutdown scripts can run properly.

Hadoop Tutorial

Required:
- Multiple computers or VMs running Ubuntu
- SSH installed
- Sun Java 6
- A stable version of Hadoop: http://www.bizdirusa.com/mirrors/apache//hadoop/common/stable/

The following tutorial is a combination of tutorials, based primarily on Michael G. Noll's Running Hadoop On Ubuntu Linux guides and the Apache Hadoop documentation.

The steps below will set up a single node. The next few steps require a profile with administrator access.

If Ubuntu is a clean install, it will be necessary to set a root password:

$ sudo passwd root
[sudo] password for user:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

[1] http://en.wikipedia.org/wiki/Apache_Hadoop#Facebook
[2] http://en.wikipedia.org/wiki/Apache_Hadoop#Yahoo.21

Install Sun Java 6 with the following commands in a terminal:

$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk

Accept the license agreement by scrolling down and pressing Enter on "OK", then select the Sun JVM as the default:

$ sudo update-java-alternatives -s java-6-sun
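To confirm that the Sun JDK is now the active Java, you can check the reported version (a quick optional check; the exact update number will vary from system to system):

$ java -version

The output should identify a 1.6.0_xx Java(TM) SE Runtime Environment with the HotSpot VM.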

Install SSH with the following command in a terminal:

$ sudo apt-get install ssh
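If you want to verify the installation before continuing (optional, and the exact output depends on your Ubuntu release), you can check the installed OpenSSH version and the state of the ssh service:

$ ssh -V
$ sudo service ssh status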

Download and extract Hadoop in the desired location. To make it easy to find, we will use the local folder at /usr/local/. The following commands extract the Hadoop tar file and rename the folder to hadoop in /usr/local.

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.203.0rc1.tar.gz
$ sudo mv hadoop-0.20.203.0 hadoop

For security purposes we will create a dedicated Hadoop user account, and we will also create a tmp directory for Hadoop:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser     (this creates a new user named "hduser")
$ cd /usr/local/
$ sudo chown -R hduser:hadoop hadoop
$ cd
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
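If you want to double-check the ownership and permissions just set (an optional sanity check), list both locations:

$ ls -ld /usr/local/hadoop /app/hadoop/tmp

Both entries should be owned by hduser:hadoop, and /app/hadoop/tmp should show permissions drwxr-x--- (the result of chmod 750).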

Depending on which version of Hadoop you are using, you may hit a common issue, labeled HADOOP-7261, which describes:

“IPV6 addresses not handles currently in the common library methods. IPV6 can return address as "0:0:0:0:0:0:port". Some utility methods such as NetUtils#createSocketAddress(), NetUtils#normalizeHostName(), NetUtils#getHostNameOfIp() to name a few, do not handle IPV6 address and expect address to be of format host:port.

Until IPV6 is formally supported, I propose disabling IPV6 for junit tests to avoid problems seen in HDFS-“

To disable IPv6 system-wide, type the following command to open the configuration file in gedit:

$ sudo gedit /etc/sysctl.conf

Enter the following lines at the end of the file, then save and close. This change requires a reboot to take effect.

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

To check whether IPv6 is enabled, type the following in the terminal:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A returned value of 0 means IPv6 is still enabled; a value of 1 means IPv6 is disabled.
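Alternatively, if you prefer not to disable IPv6 system-wide, Michael Noll's tutorial notes that you can restrict just Hadoop to IPv4 by adding the following line to conf/hadoop-env.sh (the file we edit later in this guide):

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true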

Update the bash startup file. (Michael Noll's tutorial edits $HOME/.bashrc for hduser; here we edit the system-wide /etc/bash.bashrc so the settings apply to every user.)

$ sudo gedit /etc/bash.bashrc

Enter the following information at the end of the file:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
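After opening a new terminal (or logging in again) so the file is re-read, a quick way to confirm these settings took effect is to check that the variables resolve and the hadoop command is on the PATH (this assumes the release was extracted to /usr/local/hadoop as above):

$ echo $HADOOP_HOME
/usr/local/hadoop
$ hadoop version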

At this point we should be able to log in as the "hduser" account we created:

$ su - hduser

(The su - command can also be used to switch to other usernames.)

From here we need to generate an SSH key for hduser:

hduser@computerName:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@computerName
The key's randomart image is:
[...snipp...]

Enable SSH access to your local machine with:

hduser@computerName:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Log in over SSH and accept the public key:

hduser@computerName:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]

Exit the SSH connection:

hduser@computerName:~$ exit

Next, edit the following configuration files:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml

hadoop-env.sh: we must change this file so that it points Hadoop at the correct Java installation.

hduser@computerName:~$ gedit /usr/local/hadoop/conf/hadoop-env.sh

Change:

# The Java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to:

# The Java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

The following files will be re-edited later to accommodate multiple nodes; for now we will focus on a single node.

core-site.xml:

hduser@computerName:~$ gedit /usr/local/hadoop/conf/core-site.xml

Add the following properties between the <configuration> ... </configuration> tags:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme
  and authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>

mapred-site.xml:

hduser@computerName:~$ gedit /usr/local/hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>

hdfs-site.xml:

hduser@computerName:~$ gedit /usr/local/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.</description>
</property>

The first time we start our Hadoop cluster, it is necessary to format the Hadoop filesystem; like any format, this erases all data currently in HDFS. HDFS is the primary storage system used by Hadoop applications: it keeps multiple replicas of each data block and distributes them to compute nodes throughout the cluster, which makes computation reliable and extremely fast. The following command formats HDFS for our new cluster:

hduser@computerName:~$ /usr/local/hadoop/bin/hadoop namenode -format

At this point we should be ready to start Hadoop:

hduser@computerName:~$ /usr/local/hadoop/bin/start-all.sh

The output displays which SSH connections each process makes as it starts. To see the current processes (there should be six entries, including jps itself), simply type:

hduser@computerName:~$ jps
2287 TaskTracker
1992 JobTracker
1121 DataNode
9912 SecondaryNameNode
91921 Jps
1812 NameNode

The number in front of each process is simply its PID; it will vary from run to run and is not important here.

If you do not see all six processes listed above, it may be necessary to SSH into localhost. Before you do so, stop all the processes; after SSHing into localhost, start the processes again and check them with jps.

After the stop-all.sh command, Jps should be the only process still listed.

hduser@computerName:~$ /usr/local/hadoop/bin/stop-all.sh
------ exits running processes ------
hduser@computerName:~$ ssh localhost
hduser@computerName:~$ /usr/local/hadoop/bin/start-all.sh

This completes the tutorial on how to set up a single node for Hadoop. If you wish to add additional computers to the cluster, follow the tutorial above on each computer. The overall objective is to set up at least two computers running Hadoop in order to demonstrate its potential power.

Multi-node cluster

Once all of your computers are set up according to the single-node guide, we can configure them to work as one. The cluster will be set up so that one computer is recognized as the MASTER, while all other computers are set up as SLAVEs.

Before we begin, make sure all Hadoop processes are stopped. The next step requires logging into an account with admin access in order to edit system files. You must also know the local IP address of each computer set up for Hadoop on the network.

Open the list of known hosts on each computer:

hduser@computerName:~$ su - root     (or any admin account)
$ sudo gedit /etc/hosts

Update the hosts file with the known IP address of each computer and a designated name. After trial and error with several different configurations, I found it necessary to include not only the Master and Slave names with their IPs but also each computer's hostname with its local IP, and to delete the localhost entry. This file must be identical on the MASTER and ALL SLAVE computers.

If the hosts file is not set up properly, it can lead to a "Too many fetch-failures" issue, where the Reduce phase takes an extremely long time to complete. Each demonstration in this tutorial should take no longer than about 2 minutes 30 seconds.

More information on this conflict: https://issues.apache.org/jira/browse/HADOOP-1930

Example of my hosts setup:

# /etc/hosts
190.111.2.10 Master
190.111.2.10 ComputerName

190.111.2.11 Slave
190.111.2.11 ComputerName2

190.111.2.12 Slave2
190.111.2.12 ComputerName3

Once the hosts file is set up, we can SSH into each of the computers by name. Log back into "hduser" and SSH into Master.

$ su - hduser
hduser@computerName:~$ ssh Master
---- accept the necessary public keys ----

Now we need to authorize the Master machine to log into each slave over SSH without the redundancy of entering passwords each time:

hduser@Master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@Slave
hduser@Master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@Slave2

For our own peace of mind, we will SSH into each of the slaves to check connectivity; this also gives us an opportunity to accept the public keys / RSA key fingerprints.

hduser@Master:~$ ssh Master
hduser@Master:~$ ssh Slave
hduser@Slave:~$ exit
hduser@Master:~$ ssh Slave2
hduser@Slave2:~$ exit
hduser@Master:~$ exit
hduser@computerName:~$

Now that we have tested our SSH connections, it is time to configure the designated MASTER computer. On the Master, enter the following command to open the masters file:

hduser@computerName:~$ gedit /usr/local/hadoop/conf/masters

Delete the localhost entry and add "Master" to the file. Save and close. Next, open the slaves file:

hduser@computerName:~$ gedit /usr/local/hadoop/conf/slaves

In this file we will add the Master and all slaves (the names must match the /etc/hosts entries):

Master
Slave
Slave2
(additional slaves if you have them)

Save and close.
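For reference, after these edits the two files on the Master should look roughly like this (host names taken from the /etc/hosts example above):

hduser@computerName:~$ cat /usr/local/hadoop/conf/masters
Master

hduser@computerName:~$ cat /usr/local/hadoop/conf/slaves
Master
Slave
Slave2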

Example hosts, masters, and slaves file edits are shown above. The next steps must be performed on all computers, MASTER AND SLAVES: we will open the files we edited for the single-node setup and change them to a multi-node configuration.

core-site.xml: in this file we change the value from "hdfs://localhost:54310" to "hdfs://Master:54310".

hduser@computerName:~$ gedit /usr/local/hadoop/conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://Master:54310</value>
  <description>The name of the default file system. A URI whose scheme
  and authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>

mapred-site.xml: like the previous file, the value here also changes from localhost to Master.

hduser@computerName:~$ gedit /usr/local/hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>Master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>

hdfs-site.xml: in this file we change the replication value from 1 to however many computers are set up to use Hadoop. For this tutorial I am using 3 computers, or nodes.

hduser@computerName:~$ gedit /usr/local/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.</description>
</property>

Once you have updated all the files on the Master and slaves, we need to format HDFS again. First, SSH into the Master computer:

hduser@computerName:~$ ssh Master
hduser@Master:~$ /usr/local/hadoop/bin/hadoop namenode -format

Now it is time to run our multi-node cluster. Simply use the start command, then use jps to make sure each process started successfully:

hduser@Master:~$ /usr/local/hadoop/bin/start-all.sh
hduser@Master:~$ jps

All six processes should be running on the Master. If everything started correctly, the Master will have used SSH to log into each slave computer and start two to three processes there, depending on the task; if more are needed, Hadoop will start them. Type the following commands on the slave computers to check their running processes.

$ su - hduser
hduser@computerName2:~$ jps
1121 DataNode
1812 TaskTracker
91921 Jps

If you would like to stop everything, simply type /usr/local/hadoop/bin/stop-all.sh. This command shuts down the processes on the Master and all slaves, ending every process associated with Hadoop.

The next step assumes that all computers are communicating correctly and all necessary processes are running. It demonstrates a word-counting program executed by Hadoop. There are two programs written in Python, named mapper.py and reducer.py. Copy the code and save the files in the directory /usr/local/hadoop.

Save this code using any text editor as mapper.py in the directory /usr/local/hadoop:

#!/usr/bin/env python

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

Save this code using any text editor as reducer.py in the directory /usr/local/hadoop:

#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
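Before running the job on the cluster, it is worth making both scripts executable and testing the pipeline locally with a simple pipe, a sanity check suggested in Michael Noll's Python MapReduce tutorial:

$ cd /usr/local/hadoop
$ sudo chmod +x mapper.py reducer.py
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar     1
foo     3
labs    1
quux    2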

Next we want to get our books. Create a folder named gutenberg in the /tmp directory and save the following eBooks there as Plain Text UTF-8. Links are provided below.

http://www.gutenberg.org/etext/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
http://www.gutenberg.org/etext/132
http://www.gutenberg.org/etext/1661
http://www.gutenberg.org/etext/972
http://www.gutenberg.org/etext/19699
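For example, assuming you are logged in as hduser and have downloaded each book's Plain Text UTF-8 file, the folder can be created and checked like this:

$ mkdir -p /tmp/gutenberg
$ ls -l /tmp/gutenberg

The listing should show one .txt file per book.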

With all the processes running and the eBooks saved, we need to copy the books into HDFS:

hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
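To confirm the upload worked, you can list the directory in HDFS (the same check used in Michael Noll's tutorial); one entry should appear for each eBook you copied:

hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/gutenberg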

Finally, it is time to run Hadoop on the eBooks to do a word count:

hduser@Master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /usr/local/hadoop/mapper.py -mapper /usr/local/hadoop/mapper.py \
    -file /usr/local/hadoop/reducer.py -reducer /usr/local/hadoop/reducer.py \
    -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

This command runs Hadoop with the Python code we created; you should see output similar to the example below.

During and after the run, you can keep track of all running and completed jobs through the following links:

http://localhost:50030/ - web UI for the MapReduce JobTracker
http://localhost:50060/ - web UI for the TaskTracker
http://localhost:50070/ - web UI for the HDFS NameNode

The MapReduce JobTracker page is a good way to confirm that all computers are communicating with one another: under the Nodes section it should show however many computers were set up.

(In this example setup, the JobTracker web UI reports 3 nodes.)

You can also display the names of the computers and statistics for the job.

Once the job is complete, you can print the result file from the terminal using the following command:

/usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000

This prints an example of the counted-words output file.

The result files can also be copied out of HDFS into a local folder such as /tmp/gutenberg-output, as shown below.
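Michael Noll's tutorial shows how to merge all of the job's part files from HDFS into a single local file with the -getmerge option; a minimal version of that, assuming the output directory used above, is:

hduser@Master:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@Master:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@Master:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output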

To test the processing power of Hadoop, I ran the same job with a different number of nodes.

With 1 node, the job completed in 1 minute 45 seconds.
With 2 nodes, the job completed in 1 minute 24 seconds.
With 3 nodes, the job completed in 1 minute.

END OF TUTORIAL

Conclusion

The above tutorials are everything you need to get started with Apache Hadoop. After installing and running Hadoop, you have probably discovered that running into an issue is always a possibility and that some troubleshooting may be necessary. You have probably also discovered that getting past those problems and achieving the task at hand is not too difficult when taking the correct steps. Hadoop can be used at very large scale, and hopefully this guide gives you everything you need for a good start.

Resources

http://hadoop.apache.org/hdfs/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://en.wikipedia.org/wiki/Apache_Hadoop#Facebook
http://en.wikipedia.org/wiki/Apache_Hadoop#Yahoo.21
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
https://issues.apache.org/jira/browse/HADOOP-1930
http://hadoop.apache.org/common/docs/current/
http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html
http://hadoop.apache.org/common/docs/current/single_node_setup.html
http://hadoop.apache.org/common/docs/current/cluster_setup.html