How to Set Up a Hadoop Cluster Using Oracle Solaris
Hands-On Labs of the System Admin and Developer Community of OTN

by Orgad Kimchi with contributions from Jeff Taylor

How to set up a Hadoop cluster using the Oracle Solaris Zones, ZFS, and network virtualization technologies.

Lab Introduction

This hands-on lab presents exercises that demonstrate how to set up an Apache Hadoop cluster using Oracle Solaris 11 technologies such as Oracle Solaris Zones, ZFS, and network virtualization. Key topics include the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model.

We will also cover the Hadoop installation process and the cluster building blocks: NameNode, a secondary NameNode, and DataNodes. In addition, you will see how you can combine the Oracle Solaris 11 technologies for better scalability and data security, and you will learn how to load data into the Hadoop cluster and run a MapReduce job.

Prerequisites

This hands-on lab is appropriate for system administrators who will be setting up or maintaining a Hadoop cluster in production or development environments. Basic Linux or Oracle Solaris system administration experience is a prerequisite. Prior knowledge of Hadoop is not required.

System Requirements

This hands-on lab is run on Oracle Solaris 11 in Oracle VM VirtualBox. The lab is self-contained. All you need is in the Oracle VM VirtualBox instance.

For those attending the lab at Oracle OpenWorld, your laptops are already preloaded with the correct Oracle VM VirtualBox image.

If you want to try this lab outside of Oracle OpenWorld, you will need an Oracle Solaris 11 system. Do the following to set up your machine:

1. If you do not have Oracle Solaris 11, download it here.
2. Download the Oracle Solaris 11.1 VirtualBox Template (file size 1.7 GB).
3. Install the template as described here. (Note: On step 4 of Exercise 2 for installing the template, set the RAM size to 4 GB in order to get good performance.)
Notes for Oracle OpenWorld Attendees

- Each attendee will have his or her own laptop for the lab.
- In this lab, we are going to use the "welcome1" password for all the user accounts.
- Oracle Solaris 11 uses the GNOME desktop. If you have used the desktops on Linux or other UNIX operating systems, the interface should be familiar. Here are some quick basics in case the interface is new for you.
  - To open a terminal window in the GNOME desktop system, right-click the background of the desktop, and select Open Terminal in the pop-up menu.
  - The following source code editors are provided on the lab machines: vi (type vi in a terminal window) and emacs (type emacs in a terminal window).

Summary of Lab Exercises

This hands-on lab consists of the following exercises covering various Oracle Solaris and Apache Hadoop technologies:

- Download and Install Hadoop
- Configure the Network Time Protocol
- Create the Scripts
- Create the NameNodes, DataNodes, and ResourceManager Zones
- Configure the Active NameNode
- Set Up SSH
- Set Up the Standby NameNode and the ResourceManager
- Set Up the DataNode Zones
- Verify the SSH Setup
- Verify Name Resolution
- Format the Hadoop File System
- Start the Hadoop Cluster
- About Hadoop High Availability
- Configure Manual Failover
- About Apache ZooKeeper and Automatic Failover
- Configure Automatic Failover
- Conclusion

The Case for Hadoop

The Apache Hadoop software is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

To store data, Hadoop uses the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data and is suitable for applications that have large data sets.

For more information about Hadoop and HDFS, see http://hadoop.apache.org/.
The Hadoop cluster building blocks are as follows:

- Active NameNode: The centerpiece of HDFS, which stores file system metadata and is responsible for all client operations
- Standby NameNode: A secondary NameNode that synchronizes its state with the active NameNode in order to provide fast failover if the active NameNode goes down
- ResourceManager: The global resource scheduler, which directs the slave NodeManager daemons to perform the low-level I/O tasks
- DataNodes: Nodes that store the data in the HDFS file system and are also known as slaves; these nodes run the NodeManager process that communicates with the ResourceManager
- History Server: Provides REST APIs that allow the user to get the status of finished applications, and provides information about finished jobs

In earlier Hadoop versions, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Hadoop version 2.2 provides the ability to build an HDFS cluster with high availability (HA), and this article describes the steps involved in building such a configuration.

In the example presented in this article, all the Hadoop cluster building blocks are installed using Oracle Solaris Zones, ZFS, and Unified Archive. Figure 1 shows the architecture:

Figure 1

Exercise 1: Install Hadoop

1. In Oracle VM VirtualBox, enable a bidirectional "shared clipboard" between the host and the guest in order to enable copying and pasting text from this file.

Figure 2

In this lab, we will use the Apache Hadoop "15 October, 2013: Release 2.2.0" release.

Note: Oracle OpenWorld attendees can skip the following step (because the preloaded Oracle VM VirtualBox image already provides the Hadoop image).

Download the Hadoop binary file using a web browser. Open the Firefox web browser from the desktop and download the file.

Figure 3

Open a terminal window by right-clicking any point in the background of the desktop and selecting Open Terminal in the pop-up menu.
Figure 4

Important: In the examples presented in this article, the command prompt indicates which user needs to run each command and the environment where the command should be run. For example, the command prompt root@global_zone:~# indicates that user root needs to run the command from the global zone.

Note: For Oracle OpenWorld attendees, the root password has been provided in the one-pager associated with this lab. For those running this lab outside of Oracle OpenWorld, enter the root password you entered when you followed the steps in the "System Requirements" section.

    oracle@global_zone:~$ su -
    Password:
    Oracle Corporation      SunOS 5.11      11.1    September 2012

Set up the virtual network interface card (VNIC) in order to enable network access to the global zone from the non-global zones.

    root@global_zone:~# dladm create-vnic -l net0 vnic0
    root@global_zone:~# ipadm create-ip vnic0
    root@global_zone:~# ipadm create-addr -T static -a local=192.168.1.100/24 vnic0/addr

Verify the VNIC creation:

    root@global_zone:~# ipadm show-addr vnic0
    ADDROBJ           TYPE     STATE        ADDR
    vnic0/addr        static   ok           192.168.1.100/24

In the global zone, create the /usr/local directory if it doesn't exist.

    root@global_zone:~# mkdir -p /usr/local

Note: The cluster configuration will share the Hadoop directory structure (/usr/local/hadoop) across the zones as a read-only file system. Because of this, every Hadoop cluster node needs to be able to write its logs to a directory of its own. The /var/log directory is the best-practice location for logs in every Oracle Solaris Zone.

1. Copy the Hadoop tarball to /usr/local:

    root@global_zone:~# cp /export/home/oracle/hadoop-2.2.0.tar.gz /usr/local

Unpack the tarball:

    root@global_zone:~# cd /usr/local
    root@global_zone:~# tar -xzf /usr/local/hadoop-2.2.0.tar.gz

2. Create the hadoop group:

    root@global_zone:~# groupadd -g 200 hadoop

3. Create a symlink for the Hadoop binaries:

    root@global_zone:~# ln -s /usr/local/hadoop-2.2.0 /usr/local/hadoop

4. Give ownership to the hadoop group:

    root@global_zone:~# chown -R root:hadoop /usr/local/hadoop-2.2.0

5. Change the permissions:

    root@global_zone:~# chmod -R 755 /usr/local/hadoop-2.2.0

6. Edit the Hadoop configuration files, which are shown in Table 1:

Table 1. Hadoop Configuration Files

    File Name        Description
    ---------------  ----------------------------------------------------------------
    hadoop-env.sh    Specifies environment variable settings used by Hadoop
    yarn-env.sh      Specifies environment variable settings used by YARN
    mapred-env.sh    Specifies environment variable settings used by MapReduce
    slaves           Contains a list of machine names that run the DataNode and
                     NodeManager pair of daemons
    core-site.xml    Specifies parameters relevant to all Hadoop daemons and clients
    hdfs-site.xml    Specifies parameters used by the HDFS daemons and clients
    mapred-site.xml  Specifies parameters used by the MapReduce daemons and clients
    yarn-site.xml    Specifies the configurations for the ResourceManager and
                     NodeManager

7. Run the following commands to change the hadoop-env.sh script:

    root@global_zone:~# export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    root@global_zone:~# cd $HADOOP_CONF_DIR

Append the following lines to the hadoop-env.sh script:

    root@global_zone:~# echo "export JAVA_HOME=/usr/java" >> hadoop-env.sh
    root@global_zone:~# echo "export HADOOP_LOG_DIR=/var/log/hadoop/hdfs" >> hadoop-env.sh

Append the following lines to the yarn-env.sh script:

    root@global_zone:~# vi yarn-env.sh

    export JAVA_HOME=/usr/java
    export YARN_LOG_DIR=/var/log/hadoop/yarn
    export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

Append the following lines to the mapred-env.sh script:

    root@global_zone:~# echo "export JAVA_HOME=/usr/java" >> mapred-env.sh
    root@global_zone:~# echo "export HADOOP_MAPRED_LOG_DIR=/var/log/hadoop/mapred" >> mapred-env.sh
    root@global_zone:~# echo "export HADOOP_MAPRED_IDENT_STRING=mapred" >> mapred-env.sh

Edit the slaves file to replace the localhost entry with the following lines:

    root@global_zone:~# vi slaves

    data-node1
    data-node2
    data-node3

Edit the core-site.xml file so it looks like the following:

Note: fs.defaultFS is the URI that describes the NameNode address (protocol specifier, hostname, and port) for the cluster.
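To make the three parts of that URI concrete, the following minimal sketch splits a sample fs.defaultFS value using POSIX shell parameter expansion. The host name name-node1 and the port 8020 are illustrative assumptions only; they are not taken from this lab's core-site.xml, and the value used in your cluster may differ.

```shell
#!/bin/sh
# Hypothetical fs.defaultFS value, for illustration only.
# The host name and port are assumptions, not the lab's actual settings.
uri="hdfs://name-node1:8020"

# Protocol specifier: everything before "://".
scheme=${uri%%://*}

# Remainder after "://", which holds "hostname:port".
hostport=${uri#*://}

# Hostname: everything before the first ":" in the remainder.
host=${hostport%%:*}

# Port: everything after the last ":" in the remainder.
port=${hostport##*:}

printf 'scheme=%s host=%s port=%s\n' "$scheme" "$host" "$port"
# prints: scheme=hdfs host=name-node1 port=8020
```

The script runs in any POSIX shell and touches no files, so it can be tried in the global zone without affecting the Hadoop configuration.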