
Hortonworks Data Platform: Using Apache Hadoop

Copyright © 2012, 2015 Hortonworks, Inc. Some rights reserved.

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source.

Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs.

Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 3.0 License. http://creativecommons.org/licenses/by-sa/3.0/legalcode


Table of Contents

1. Using Apache Hadoop
   1.1. Hadoop Common
   1.2. Using Hadoop HDFS
   1.3. Using Hadoop MapReduce on YARN
        1.3.1. Running MapReduce on Hadoop YARN
        1.3.2. MapReduce Version 2 Troubleshooting Guide
        1.3.3. YARN Components
        1.3.4. Using Hadoop YARN


1. Using Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

• Hadoop Common: The set of common utilities that support other Hadoop modules

• Hadoop Distributed File System (HDFS): A distributed file system that provides high- throughput access to application data

• Hadoop YARN: Framework for job scheduling and cluster resource management

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets

To learn more about Apache Hadoop, see Apache Hadoop.

In this section:

• Hadoop Common

• Using Hadoop HDFS

• Using Hadoop MapReduce on YARN

• Using Hadoop YARN

1.1. Hadoop Common

Hadoop Common is the set of common utilities that support other Hadoop modules.

In this section:

• Setting up a Single Node Cluster

• Cluster Setup

• CLI MiniCluster

• Using File System (FS) Shell

• Hadoop Commands Reference

1.2. Using Hadoop HDFS

HDFS is a distributed file system that provides high-throughput access to data. It provides a limited interface for managing the file system to allow it to scale and provide high throughput. HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.


Use the following resources to learn more about Hadoop HDFS:

• HDFS High Availability Using the Quorum Journal Manager

• HDFS High Availability with NFS

• WebHDFS REST API

1.3. Using Hadoop MapReduce on YARN

In Hadoop version 1, MapReduce was responsible for both processing and cluster resource management. In Apache Hadoop version 2, cluster resource management has been moved from MapReduce into YARN, thus enabling other application engines to utilize YARN and Hadoop, while also improving the performance of MapReduce.

Use the following resources to learn more about using Hadoop MapReduce on YARN:

• Running MapReduce on Hadoop YARN

• MapReduce Version 2 Troubleshooting Guide

• YARN Components

• YARN Capacity Scheduler

• Encrypted Shuffle

• Pluggable Shuffle and Pluggable Sort

1.3.1. Running MapReduce on Hadoop YARN

The introduction of Hadoop Version 2 has changed the way MapReduce applications run on a cluster. Unlike the monolithic MapReduce scheduler in Hadoop Version 1, Hadoop YARN has generalized the cluster resources available to users. To ensure compatibility with Hadoop Version 1, the YARN team has written a MapReduce framework that works on top of YARN. The framework is highly compatible with Hadoop Version 1, with only a small number of issues to consider. As with Hadoop Version 1, Hadoop YARN includes virtually the same MapReduce examples and benchmarks, which help demonstrate how Hadoop YARN functions.

Use the following resources to learn more about Running MapReduce on YARN:

• Running MapReduce Examples on Hadoop YARN

• MapReduce Compatibility

• The MapReduce Application Master

• Calculating the Capacity of a Node

• Changes to the Shuffle Service

• Using a Windows Client to Submit MapReduce Jobs

• Running Existing Hadoop Version 1 Applications on YARN

• Running Existing Version 1 Code on YARN


• Uber Jobs (Technical Preview)

1.3.1.1. Running MapReduce Examples on Hadoop YARN

The MapReduce examples are located in hadoop-[VERSION]/share/hadoop/mapreduce. Depending on where you installed Hadoop, this path may vary. For the purposes of this example, let's define:

export YARN_EXAMPLES=$YARN_HOME/share/hadoop/mapreduce

$YARN_HOME should be defined as part of your installation. Also, the following examples have a version tag, in this case “2.1.0-beta.” Your installation may have a different version tag.
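Because the version tag varies between installations, it can be convenient to locate the examples .jar without hard-coding the version. The following is a minimal sketch, assuming a Bash shell and the $YARN_EXAMPLES variable defined above; adjust it for your installation:

# List whichever examples .jar is present, regardless of version tag
ls $YARN_EXAMPLES/hadoop-mapreduce-examples-*.jar

# Optionally capture the path in a variable for reuse in later commands
export EXAMPLES_JAR=$(ls $YARN_EXAMPLES/hadoop-mapreduce-examples-*.jar | head -n 1)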

The following sections provide some examples of Hadoop YARN programs and benchmarks.

Listing Available Examples

Using our $YARN_HOME environment variable, we can get a list of the available examples by running:

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar

This command returns a list of the available examples:

An example program must be given as the first argument. Valid program names are:
  aggregatewordcount: An Aggregate-based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate-based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that counts the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets.
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile-laying program that finds solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort.
  terasort: Run the terasort.
  teravalidate: Check the results of the terasort.
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

To illustrate several features of Hadoop YARN, we will show you how to run the pi and terasort examples, as well as the TestDFSIO benchmark.

Running the pi Example

To run the pi example with 16 maps and 100000 samples, run the following command:

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar pi 16 100000

This command should return the following result (after the Hadoop messages):

13/10/14 20:10:01 INFO mapreduce.Job: map 0% reduce 0%
13/10/14 20:10:08 INFO mapreduce.Job: map 25% reduce 0%
13/10/14 20:10:16 INFO mapreduce.Job: map 56% reduce 0%
13/10/14 20:10:17 INFO mapreduce.Job: map 100% reduce 0%
13/10/14 20:10:17 INFO mapreduce.Job: map 100% reduce 100%
13/10/14 20:10:17 INFO mapreduce.Job: Job job_1381790835497_0003 completed successfully
13/10/14 20:10:17 INFO mapreduce.Job: Counters: 44
        File System Counters
                FILE: Number of bytes read=358
                FILE: Number of bytes written=1365080
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=4214
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=67
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=16
                Launched reduce tasks=1
                Data-local map tasks=14
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=174725
                Total time spent by all reduces in occupied slots (ms)=7294
        Map-Reduce Framework
                Map input records=16
                Map output records=32
                Map output bytes=288
                Map output materialized bytes=448
                Input split bytes=2326
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=448
                Reduce input records=32
                Reduce output records=0
                Spilled Records=64
                Shuffled Maps =16
                Failed Shuffles=0
                Merged Map outputs=16
                GC time elapsed (ms)=195
                CPU time spent (ms)=7740
                Physical memory (bytes) snapshot=6143696896
                Virtual memory (bytes) snapshot=23140454400
                Total committed heap usage (bytes)=4240769024
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1888
        File Output Format Counters
                Bytes Written=97
Job Finished in 20.854 seconds
Estimated value of Pi is 3.14127500000000000000

Note that the MapReduce progress is shown -- as is the case with MapReduce V1 -- but the application statistics are different. Most of the statistics are self-explanatory. The one important item to note is that the YARN “Map-Reduce Framework” is used to run the program. The use of this framework, which is designed to be compatible with Hadoop V1, will be discussed further in subsequent sections.

Using the Web GUI to Monitor Examples

The Hadoop YARN web Graphical User Interface (GUI) has been updated for Hadoop Version 2. This section shows you how to use the web GUI to monitor and find information about YARN jobs. In the following examples, we use the pi application, which can run quickly and be finished before you have explored the GUI. A longer running application -- such as terasort -- may be helpful when exploring all the various links in the GUI.

The following figure shows the main YARN web interface (http://hostname:8088).


If you look at the Cluster Metrics table, you will see some new information. First, you will notice that rather than Hadoop Version 1 “Map/Reduce Task Capacity,” there is now information on the number of running Containers. If YARN is running a MapReduce job, these Containers will be used for both map and reduce tasks. Unlike Hadoop Version 1, in Hadoop Version 2 the number of mappers and reducers is not fixed. There are also memory metrics and a link to node status. To display a summary of the node activity, click Nodes. The following image shows the node activity while the pi application is running. Note again the number of Containers, which are used by the MapReduce framework as either mappers or reducers.

If you navigate back to the main Running Applications window and click the application_138… link, the Application status page appears. This page provides information similar to that on the Running Applications page, but only for the selected job.


Clicking the ApplicationMaster link on the Application status page opens the MapReduce Application page shown in the following figure. Note that the link to the ApplicationMaster is also on the main Running Applications screen in the last column.

Details about the MapReduce process can be observed on the MapReduce Application page. Instead of Containers, the MapReduce application now refers to Maps and Reduces. Clicking the job_138… link opens the MapReduce Job page:


The MapReduce Job page provides more detail about the status of the job. When the job is finished, the page is updated as shown in the following figure:

If you click the Node used to run the ApplicationMaster (n0:8042 above), a NodeManager summary page appears, as shown in the following figure. Again, the NodeManager only tracks Containers. The actual tasks that the Containers run are determined by the ApplicationMaster.


If you navigate back to the MapReduce Job page, you can access log files for the ApplicationMaster by clicking the logs link:

If you navigate back to the main Cluster page and select Applications > Finished, and then select the completed job, a summary page is displayed:


There are a few things to take note of based on our movement through the preceding GUI. First, because YARN manages applications, all input from YARN refers to an “application.” YARN has no data about the actual application. Data from the MapReduce job is provided by the MapReduce Framework. Thus there are two clearly different data streams that are combined in the web GUI: YARN applications and Framework jobs. If the Framework does not provide job information, certain parts of the web GUI will have nothing to display.

Another interesting aspect to note is the dynamic nature of the mapper and reducer tasks. These are executed as YARN Containers, and their numbers will change as the application runs. This feature provides much better cluster utilization, because mappers and reducers are dynamic rather than fixed resources.

Finally, there are other links in the preceding GUI that can be explored. With the MapReduce framework, it is possible to drill down to the individual map and reduce tasks. If log aggregation is enabled, the individual logs for each map and reduce task can be viewed.

Running the Terasort Test

To run the terasort benchmark, three separate steps are required. In general the rows are 100 bytes long, so the total amount of data written is 100 times the number of rows (i.e. to write 100 GB of data, use 1000000000 rows). You will also need to specify input and output directories in HDFS; a complete example follows the steps below.

1. Run teragen to generate rows of random data to sort.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar teragen

2. Run terasort to sort the database.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar terasort


3. Run teravalidate to validate the sorted output.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar teravalidate
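Putting the three steps together, the following is a sketch of an end-to-end run. The row count (ten million rows, about 1 GB of data) and the HDFS directories are illustrative placeholders only, not values prescribed by this guide; size them for your cluster:

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar teragen 10000000 /user/hdp/teragen-out
yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar terasort /user/hdp/teragen-out /user/hdp/terasort-out
yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.1.0-beta.jar teravalidate /user/hdp/terasort-out /user/hdp/teravalidate-out

The teravalidate step writes its report to the last directory given; any ordering errors are recorded there.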

Run the TestDFSIO Benchmark

YARN also includes an HDFS benchmark application named TestDFSIO. As with terasort, it requires several steps. Here we will write and read ten 1 GB files.

1. Run TestDFSIO in write mode and create data.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

Example results are as follows (date and time removed):

fs.TestDFSIO: ----- TestDFSIO ----- : write
fs.TestDFSIO: Date & time: Wed Oct 16 10:58:20 EDT 2013
fs.TestDFSIO: Number of files: 10
fs.TestDFSIO: Total MBytes processed: 10000.0
fs.TestDFSIO: Throughput mb/sec: 10.124306231915458
fs.TestDFSIO: Average IO rate mb/sec: 10.125661849975586
fs.TestDFSIO: IO rate std deviation: 0.11729341192174683
fs.TestDFSIO: Test exec time sec: 120.45
fs.TestDFSIO:

2. Run TestDFSIO in read mode.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Example results are as follows (date and time removed):

fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO: Date & time: Wed Oct 16 11:09:00 EDT 2013
fs.TestDFSIO: Number of files: 10
fs.TestDFSIO: Total MBytes processed: 10000.0
fs.TestDFSIO: Throughput mb/sec: 40.946519750553804
fs.TestDFSIO: Average IO rate mb/sec: 45.240928649902344
fs.TestDFSIO: IO rate std deviation: 18.27387874605978
fs.TestDFSIO: Test exec time sec: 47.937
fs.TestDFSIO:

3. Clean up the TestDFSIO data.

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -clean

1.3.1.2. MapReduce Compatibility

MapReduce was the original use case for which Hadoop was built. In order to graph the World Wide Web and track how it changes over time, MapReduce was developed to process a graph with billions of nodes and trillions of edges. Moving this technology to YARN made MapReduce a complex application to build, due to its requirements for data locality, fault tolerance, and application priorities.


To provide data locality, the MapReduce Application Master must locate the blocks to be processed and then request Containers on the nodes that hold those blocks. To implement fault tolerance, it must be able to handle failed map or reduce tasks and request them again on other nodes. Fault tolerance goes hand-in-hand with the complex intra-application priorities.

The logic to handle complex intra-application priorities for map and reduce tasks had to be built into the Application Master. For example, there is no need to start idle reducers before the mappers have finished processing enough data. Reducers are now under the control of the Application Master and are not fixed, as they were in Hadoop Version 1. One rather unique failure mode occurs when a node fails after all of the maps have finished. When this happens, the map tasks that ran on that node must be repeated because their results are unavailable. In many cases all available Containers are being used by reducer tasks, preventing the spawning of another mapper task to reprocess the missing data. Logically this would create a deadlock, with reducers waiting for missing mapper data. The MapReduce Application Master is designed to detect this situation and, while not ideal, will kill enough reducers to free enough resources for mappers to reprocess the missing data. The killed reducers will start again, allowing the job to complete.

1.3.1.3. The MapReduce Application Master

The MapReduce Application Master is implemented as a composition of loosely coupled services. The services interact with each other via events. Each component acts on the received events and sends out any required events to other components. This design keeps it highly concurrent, with minimal or no synchronization required. The events are dispatched by a central Dispatch mechanism. All components register with the Dispatcher. The information is shared across different components using AppContext.

In Hadoop 1, the death of the Job Tracker would result in the loss of all jobs -- both running and queued. With a YARN MapReduce job, the equivalent to the Job Tracker is the Application Master. Because the Application Master now runs on compute nodes, this can lead to an increase in failure scenarios. To address MapReduce Application Master failures, YARN can restart the Application Master a specified number of times, and can recover completed tasks. Additionally, much like the Job Tracker, the Application Master keeps metrics for jobs that are currently running. Typically the Application Master tracking URL makes these available, and these metrics can be found in the YARN web UI (see the previous pi example). The following settings can enable MapReduce recovery in YARN.

Enabling Application Master Restarts

To enable Application Master restarts:

1. You can adjust the yarn.resourcemanager.am.max-retries property in the yarn-site.xml file. The default setting is 2.

2. You can more directly tune how many times a MapReduce Application Master should restart by adjusting the mapreduce.am.max-attempts property in the mapred-site.xml file. The default setting is 2.

Enabling Recovery of Completed Tasks

You can use the yarn.app.mapreduce.am.job.recovery.enable property in the yarn-site.xml file to enable recovery of completed tasks. The default setting is "true".
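As a minimal sketch, the three properties described above can be set as follows; the file placement mirrors the text above, and the values shown are the documented defaults:

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.am.max-retries</name>
  <value>2</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.job.recovery.enable</name>
  <value>true</value>
</property>

In mapred-site.xml:

<property>
  <name>mapreduce.am.max-attempts</name>
  <value>2</value>
</property>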


The Job History Server

With the Application Master now taking the place of the Job Tracker, a centralized location to store the history of all MapReduce jobs was required. The Job History Server helps fill the void left by the transitory Application Master by hosting these completed job metrics and logs. This new history daemon is unrelated to the services provided by YARN, and is directly related to the MapReduce application framework.

1.3.1.4. Calculating the Capacity of a Node

Because YARN has now removed the hard partitioned mapper and reducer slots of Hadoop Version 1, new capacity calculations are required. There are eight important parameters for calculating a node's capacity that are specified in mapred-site.xml and yarn-site.xml:

In mapred-site.xml:

• mapreduce.map.memory.mb mapreduce.reduce.memory.mb

These are the hard limits enforced by Hadoop on each mapper or reducer task.

• mapreduce.map.java.opts mapreduce.reduce.java.opts

The heap size of the JVM (-Xmx) for the mapper or reducer task. Remember to leave room for the JVM Permanent Generation and native libraries. This value should always be lower than mapreduce.[map|reduce].memory.mb.

In yarn-site.xml:

• yarn.scheduler.minimum-allocation-mb The smallest container that YARN will allow.

• yarn.scheduler.maximum-allocation-mb The largest container that YARN will allow.

• yarn.nodemanager.resource.memory-mb The amount of physical memory (RAM) for Containers on the compute node. It is important that this is not equal to the total amount of RAM on the node, as other Hadoop services also require RAM.

• yarn.nodemanager.vmem-pmem-ratio The amount of virtual memory that each Container is allowed. This can be calculated with:

containerMemoryRequest*vmem-pmem-ratio

As an example, consider a configuration with the settings in the following table.

Example YARN MapReduce Settings

Property                               Value
mapreduce.map.memory.mb                1536
mapreduce.reduce.memory.mb             2560
mapreduce.map.java.opts                -Xmx1024m
mapreduce.reduce.java.opts             -Xmx2048m
yarn.scheduler.minimum-allocation-mb   512
yarn.scheduler.maximum-allocation-mb   4096
yarn.nodemanager.resource.memory-mb    36864
yarn.nodemanager.vmem-pmem-ratio       2.1

With these settings, each map and reduce task has a generous 512MB of overhead for the Container, as evidenced by the difference between the mapreduce.[map|reduce].memory.mb and the mapreduce.[map|reduce].java.opts settings.

Next, YARN has been configured to allow a Container no smaller than 512MB and no larger than 4GB. The compute nodes have 36GB of RAM available for Containers. With a virtual memory ratio of 2.1 (the default value), each map Container can use up to 3225.6MB of virtual memory, and each reduce Container up to 5376MB.

This means that the compute node configured for 36GB of Container space can support up to 24 maps or 14 reducers, or any combination of mappers and reducers allowed by the available resources on the node.
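Expressed as configuration, the example above corresponds to entries like the following. This is a sketch only; tune the values to your own hardware:

In mapred-site.xml:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2560</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>

In yarn-site.xml:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>36864</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>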

For more information about calculating memory settings, see Determine YARN and MapReduce Memory Configuration Settings.

1.3.1.5. Changes to the Shuffle Service

As in Hadoop Version 1, the shuffle is required for parallel MapReduce job operation. Reducers fetch the outputs from all of the maps by "shuffling" map-output data from the corresponding nodes where map-tasks have run. The MapReduce Shuffle functionality is implemented as an Auxiliary Service in the Node Manager. This service starts up a Netty Web Server in the Node Manager address space, and knows how to handle MapReduce-specific shuffle requests from Reduce tasks. The MapReduce Application Master specifies the service ID for the shuffle service, along with security tokens that may be required when it starts any Container. In the returning response, the Node Manager provides the Application Master with the port on which the shuffle service is running, which is then passed on to the reduce tasks.

Hadoop Version 2 also provides the option for Encrypted Shuffle. With Encrypted Shuffle, the ability to use HTTPS with optional client authentication is provided. This feature is implemented with a toggle for HTTP or HTTPS, keystore/truststore properties, and the distribution of these stores to new or existing nodes. For details about the multi-step configuration of Encrypted Shuffle, review the Encrypted Shuffle documentation on the Apache Hadoop website.

1.3.1.6. Using a Windows Client to Submit MapReduce Jobs

To use a Windows client to submit MapReduce jobs to an HDP 2.1 Linux cluster, take the following steps:

• Deploy and incorporate the HDP 2.1 Hadoop client .jar files on the Windows client node. You can use the HDP 2.1 Hadoop client .jar files from the HDP 2.1 (Linux) deployment; you do not need to use HDP 2.1 for Windows.


• Set configurations in mapred-site.xml and yarn-site.xml on the client node to describe how to interact with the HDP 2.1 Linux cluster.

To submit jobs from Windows clients, add/modify the following:

• Set the mapreduce.app-submission.cross-platform property in mapred-site.xml to true.

• Set mapreduce.application.classpath in mapred-site.xml to use platform-agnostic notation.

• Use {{ENV}} notation instead of %ENV% to expand the environment variables. For example, instead of %HADOOP_MAPRED_HOME%/share/hadoop/mapreduce use:

{{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce

• Set the yarn.application.classpath property in yarn-site.xml to use platform-agnostic notation. Similarly, use {{ENV}} notation instead of %ENV% to expand the environment variables. A configuration sketch follows this list.
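The following is a sketch of the client-side settings described above. The classpath values are illustrative assumptions and must be adjusted to match the layout of your HDP 2.1 cluster:

In mapred-site.xml on the Windows client:

<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>{{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/*,{{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/lib/*</value>
</property>

In yarn-site.xml on the Windows client:

<property>
  <name>yarn.application.classpath</name>
  <value>{{HADOOP_CONF_DIR}},{{HADOOP_COMMON_HOME}}/share/hadoop/common/*,{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*,{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*,{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*,{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*,{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/*</value>
</property>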

1.3.1.7. Running Existing Hadoop Version 1 Applications on YARN

To ease the transition from Hadoop Version 1 to YARN, a major goal of YARN and the MapReduce framework implementation on top of YARN was to ensure that existing MapReduce applications that were programmed and compiled against previous MapReduce APIs (we'll call these MRv1 applications) can continue to run with little or no modification on YARN (we'll refer to these as MRv2 applications).

Binary Compatibility of org.apache.hadoop.mapred APIs

For the vast majority of users who use the org.apache.hadoop.mapred APIs, MapReduce on YARN ensures full binary compatibility. These existing applications can run on YARN directly, without recompilation. You can use .jar files from your existing application that are coded against the mapred APIs, and use bin/hadoop to submit them directly to YARN.
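For example, an existing MRv1 application can be submitted unchanged with bin/hadoop. The .jar name, class name, and HDFS paths below are placeholders for illustration only:

hadoop jar my-mrv1-app.jar com.example.MyMRv1Job /user/hdp/input /user/hdp/output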

Source Compatibility of org.apache.hadoop.mapreduce APIs

Unfortunately, it was difficult to ensure full binary compatibility with existing applications that were compiled against the MRv1 org.apache.hadoop.mapreduce APIs. These APIs have gone through many changes; for example, several classes stopped being abstract classes and were changed to interfaces. Therefore, the YARN community compromised by supporting only source compatibility for the org.apache.hadoop.mapreduce APIs. Existing applications that use MapReduce APIs are source-compatible and can run on YARN either with no changes, with simple recompilation against the MRv2 .jar files that are shipped with Hadoop 2, or with minor updates.

Compatibility of Command-line Scripts

Most of the command-line scripts from Hadoop 1.x should simply work. The only exception is MRAdmin, which was removed from MRv2 because the JobTracker and TaskTracker no longer exist. The MRAdmin functionality has been replaced with RMAdmin. The suggested method to invoke MRAdmin (and RMAdmin) is through the command line, even though the APIs can be invoked directly. In YARN, when MRAdmin commands are executed, warning messages will appear reminding users to use YARN commands (i.e., RMAdmin commands). On the other hand, if applications programmatically invoke MRAdmin, they will break when running on YARN. There is no support for either binary or source compatibility.

Compatibility Trade-off Between MRv1 and Early MRv2 (0.23.x) Applications

Unfortunately, there are some APIs that are compatible either with MRv1 applications or with early MRv2 applications (in particular, applications compiled against Hadoop 0.23), but not both. Some of the APIs were exactly the same in both MRv1 and MRv2, except for a change in the return type of the method signature. Therefore, it was necessary to make compatibility trade-offs between the two.

• The mapred APIs are compatible with MRv1 applications, which have a larger user base.

• If mapreduce APIs didn’t significantly break Hadoop 0.23 applications, they were made to be compatible with 0.23, but only source compatible with 1.x.

The following table lists the APIs that are incompatible with Hadoop 0.23. If early Hadoop 2 adopters using 0.23.x used the following methods in their custom routines, they will need to modify the code accordingly. For some problematic methods, an alternative method is provided with the same functionality and a method signature similar to MRv2 applications.

MRv2 Incompatible APIs

Problematic Method (org.apache.hadoop)                     Incompatible Return Type Change                                    Alternative Method
util.ProgramDriver#drive                                   void -> int                                                        run
mapred.jobcontrol.Job#getMapredJobID                       String -> JobID                                                    getMapredJobId
mapred.TaskReport#getTaskId                                String -> TaskID                                                   getTaskID
mapred.ClusterStatus#UNINITIALIZED_MEMORY_VALUE            long -> int                                                        N/A
mapreduce.filecache.DistributedCache#getArchiveTimestamps  long[] -> String[]                                                 N/A
mapreduce.filecache.DistributedCache#getFileTimestamps     long[] -> String[]                                                 N/A
mapreduce.Job#failTask                                     void -> boolean                                                    killTask(TaskAttemptID, boolean)
mapreduce.Job#killTask                                     void -> boolean                                                    killTask(TaskAttemptID, boolean)
mapreduce.Job#getTaskCompletionEvents                      mapred.TaskCompletionEvent[] -> mapreduce.TaskCompletionEvent[]    N/A

1.3.1.8. Running Existing MapReduce Version 1 Code on YARN

Running the Examples

Most of the MRv1 examples continue to work on YARN, but they are now in a new version of the .jar file. One exception worth mentioning is that the sleep example that used to be in hadoop-examples-1.x.x.jar is not in hadoop-mapreduce-examples-2.x.x.jar, but was moved into the test file hadoop-mapreduce-client-jobclient-2.x.x-tests.jar.

That exception aside, users may want to try running hadoop-examples-1.x.x.jar directly on YARN. Running hadoop jar hadoop-examples-1.x.x.jar will still pick up the classes in hadoop-mapreduce-examples-2.x.x.jar. This behavior occurs because Java first searches for the desired class in the system .jar files; if the class is not found, it goes on to search the user .jar files in the classpath. hadoop-mapreduce-examples-2.x.x.jar is installed with the other MRv2 .jar files in the Hadoop classpath, so the desired class (e.g., WordCount) is picked up from that 2.x.x .jar file instead. However, it is possible to let Java pick the classes from the .jar file which is specified after the -jar option. There are two options:

• Add HADOOP_USER_CLASSPATH_FIRST=true and HADOOP_CLASSPATH=...:hadoop-examples-1.x.x.jar as environment variables, and add mapreduce.job.user.classpath.first=true in mapred-site.xml (see the sketch after this list).

• Remove the 2.x.x .jar file from the classpath. If it is a multiple-node cluster, the .jar file needs to be removed from the classpath on all the nodes.
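A sketch of the first option, assuming a Bash shell; replace /path/to with the actual location of the 1.x examples .jar on the client:

export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/hadoop-examples-1.x.x.jar

and, in mapred-site.xml:

<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>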

Running Apache Pig Scripts on YARN

Apache Pig is one of the two major data processing applications in the Hadoop ecosystem, the other being Hive. Due to significant efforts from the Pig community, existing Pig scripts do not require any modifications. Pig on YARN in Hadoop 0.23 has been supported since Pig version 0.10.0, and Pig working with Hadoop 2.x has been supported starting with Pig version 0.10.1.

Existing Pig scripts that work with Pig version 0.10.1 and beyond will work just fine on YARN. However, versions earlier than Pig 0.10.x may not run directly on YARN, due to some incompatible MapReduce APIs and configuration.

Running Hive Queries on YARN

Existing Hive queries do not need any changes to work on YARN starting with Hive 0.10.0, thanks to the work done by the Hive community. Support for Hive on YARN in Hadoop 0.23 and 2.x releases has been in place since Hive 0.10.0. Queries that work on Hive 0.10.0 and beyond will work without changes on YARN. However, as with Pig, earlier versions of Hive may not run directly on YARN, as those Hive releases do not support Hadoop 0.23 and 2.x.

Running Apache Oozie Workflows on YARN

As with Pig and Hive, the Apache Oozie community worked to make sure that existing Oozie workflows run in a completely backward-compatible manner. Support for Hadoop 0.23 and 2.x is available starting with Oozie release 3.2.0. Existing Oozie workflows can start taking advantage of YARN in 0.23 and 2.x with Oozie 3.2.0 and above.

1.3.1.9. Uber Jobs (Technical Preview)

Note

This feature is a technical preview and considered under development. Do not use this feature in your production systems. If you have questions regarding this feature, contact Support by logging a case on our Hortonworks Support Portal at https://support.hortonworks.com.

An Uber Job runs multiple mappers and reducers within a single Container. There are four core settings for configuring Uber Jobs; these are mapred-site.xml properties, described below.

Configuration options for Uber Jobs

• mapreduce.job.ubertask.enable -- Whether to enable the small-jobs "ubertask" optimization, which runs "sufficiently small" jobs sequentially within a single JVM. "Small" is defined by the following maxmaps, maxreduces, and maxbytes settings. Users can override this value. Default = false

• mapreduce.job.ubertask.maxmaps -- The threshold for the number of maps beyond which a job is considered too large for the ubertasking optimization. Users can override this value, but only downward. Default = 9

• mapreduce.job.ubertask.maxreduces -- The threshold for the number of reduces beyond which a job is considered too large for the ubertasking optimization. CURRENTLY THE CODE CANNOT SUPPORT MORE THAN ONE REDUCE and will ignore larger values (zero is a valid maximum value, however). Users can override this value, but only downward. Default = 1

• mapreduce.job.ubertask.maxbytes -- The threshold for the number of input bytes beyond which a job is considered too large for the ubertasking optimization. If no value is specified, dfs.block.size is used as the default. Be sure to specify a default value in mapred-site.xml if the underlying file system is not HDFS. Users can override this value, but only downward. Default = HDFS block size
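A minimal mapred-site.xml sketch that enables the ubertask optimization with the thresholds described above (the threshold values shown are the documented defaults; remember that this feature is a technical preview):

<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>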

1.3.2. MapReduce Version 2 Troubleshooting Guide

This section presents basic information to assist with troubleshooting MapReduce Version 2 (MRv2) issues on the Hortonworks Data Platform (HDP). This document is not meant to be a comprehensive guide for all MRv2 issues. It provides basic techniques that can be used to isolate issues and help lead to a solution.

Use the following resources to learn more about Troubleshooting MapReduce Version 2:

• MRv2 Overview

• MRv2 Terminology

• Configuration Files

• Process ID Files

• Log Files

• Ports

• Gathering General Information

• Common MRv2 Commands

• MRv2 Troubleshooting Actions Checklist

• Common Server-Side Issues

• Common Client-Side Issues


1.3.2.1. MRv2 Overview

In Hadoop version 1, MapReduce was responsible for both processing and cluster resource management. In Apache Hadoop version 2, cluster resource management has been moved from MapReduce into YARN, thus enabling other application engines to utilize YARN and Hadoop, while also improving the performance of MapReduce.

The fundamental idea of YARN is to split up the two major responsibilities of scheduling jobs and tasks that were handled by the version 1 MapReduce Job Tracker and Task Tracker into separate entities:

• Resource Manager -- replaces the Job Tracker.

• Application Master -- a new YARN component

• Node Manager -- replaces the Task Tracker

• Container running on a Node Manager -- replaces the MapReduce slots

Resource Manager

In YARN, the Resource Manager is primarily a pure scheduler. In essence, it is strictly limited to arbitrating available resources in the system among the competing applications. It optimizes for cluster utilization (keeps all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs. To allow for different policy constraints the Resource Manager has a pluggable scheduler that allows for different algorithms, such as capacity, to be used as necessary. The daemon runs as the “yarn” user.

Node Manager


The Node Manager is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to-date with the Resource Manager, life-cycle management of application Containers, monitoring resource usage (memory, CPU) of individual Containers, monitoring node health, and managing logs and other auxiliary services that can be utilized by YARN applications. The Node Manager will launch Containers ranging from simple shell scripts to C, Java, or Python processes on Unix or Windows, or even full-fledged virtual machines (e.g. KVMs). The daemon runs as the “yarn” user.

Application Master

The Application Master is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the Containers and their resource consumption. It has the responsibility of negotiating appropriate resource Containers from the Resource Manager and monitoring their status. The Application Master runs as a single Container that is monitored by the Resource Manager.

Container

A Container is a resource allocation, which is the successful result of the Resource Manager granting a specific Resource Request. A Container grants rights to an application to use a specific amount of resources (memory, CPU, etc.) on a specific host. In order to utilize Container resources, the Application Master must take the Container and present it to the Node Manager managing the host on which the Container was allocated. The Container allocation is verified in secure mode to ensure that Application Master(s) cannot fake allocations in the cluster.

Job History Service

This is a daemon that serves historical information about completed jobs. We recommend running it as a separate daemon. Running this daemon consumes considerable HDFS space as it saves job history information. This daemon runs as the "mapred" user.

1.3.2.2. MRv2 Terminology

Resource Model

YARN supports a very general resource model for applications. An application (via the Application Master) can request resources with highly specific requirements such as:

• Resource name (more complex network topologies that would include host name and rack name are currently under development)

• Memory (in MB)

• CPU (cores, for now)

Container Specification During Launch:

While a Container is merely a right to use a specified amount of resources on a specific machine (Node Manager) in the cluster, the Application Master has to provide considerably more information to the Node Manager to actually launch the Container. This information can include .jar files, memory requested, input data, and number of CPUs.


Resource Request

YARN is designed to allow individual applications (via the Application Master) to utilize cluster resources in a shared, secure, and multi-tenant manner. YARN also remains aware of cluster topology in order to efficiently schedule and optimize data access (i.e. reduce data motion) for applications to the greatest possible extent.

In order to meet those goals, the central Scheduler (in the Resource Manager) has extensive information about an application’s resource needs, which allows it to make better scheduling decisions across all applications in the cluster. This leads to the Resource Request and the resulting Container.

Resource-requirement

These are the resources required to start a Container, such as memory, CPU, etc.

Number-of-containers

This is essentially the number of Containers required for each MRv2 job. Keep in mind that the Application Master itself runs as a Container. Up to 10% of the capacity can be allocated to Application Master Containers.
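The 10% limit is governed by the Capacity Scheduler. A minimal sketch, assuming the standard yarn.scheduler.capacity.maximum-am-resource-percent property in capacity-scheduler.xml (expressed as a fraction, so 0.1 means 10%):

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.1</value>
</property>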

Map Phase

The first phase of an MRv2 job, which runs locally with respect to the data. It takes the input data, processes it into key-value pairs, and outputs it to the reduce phase. The data is passed into the Mapper as a key-value pair generated by an InputFormat instance. InputFormat determines where the input data needs to be split between the Mappers, and then generates an InputSplit instance for each split. The Partitioner creates the partition for the record, which determines which reducer will process the record.

After processing this data, the map method of the Mapper class outputs a key-value pair that is stored in an unsorted buffer in memory. When the buffer fills up, or when the map task is complete, the key-value pairs in the buffer are sorted and then spilled to disk. If more than one spill file was created, these files are merged into a single file of sorted key-value pairs. The sorted records in the spill file wait to be retrieved by a Reducer.

Mapper

The individual task of the map phase as run by the Containers. MRv2 spawns a map task for each InputSplit generated by the InputFormat.

Reduce Phase

The final phase of a MRv2 job, the Reduce phase gathers and combines the output of all of the mappers. The Reducer fetches the records from the Mapper. The Reduce phase can be broken down into three phases:

• Shuffle: This occurs when MapReduce retrieves the output of the Mappers (via HTTP) and sends the data to the Reducers. All records with the same key are combined and sent to the same Reducer.

• Sort: This phase happens simultaneously with the shuffle phase. As the records are fetched and merged, they are sorted by key.


• Reduce: The reduce method is invoked for each key.

Reducer

The individual task of the Reduce phase run by the Containers after the map phase has completed for at least one of the mappers.

Task

The partition of a job that will be run on a Container. The Container keeps track of the status of the MRv2 tasks that have been assigned to it.

Container Executor

Controls initialization, finalization, clean up, launching, and killing of the task’s JVM. This is the equivalent of the Task Tracker controller in MRv1.

The Container Executor is set up in two files:

container-executor.cfg -- a file which has the following configuration by default:

• Conf name yarn.nodemanager.linux-container-executor.group has a value of “hadoop”

• Conf name banned.users has a value of “hdfs,yarn,mapred,bin”

• Conf name min.user.id has a value of “1000”

yarn-site.xml -- a file with the following configuration:

• Conf name yarn.nodemanager.container-executor.class has a value of “org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor”

• Conf name yarn.nodemanager.linux-container-executor.group has a value of “hadoop”
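Putting the defaults listed above together, a sketch of the two files (values exactly as described above; adjust for your environment):

container-executor.cfg:

yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000

yarn-site.xml:

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>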

Capacity Scheduler

The Capacity Scheduler is designed to run Hadoop MRV2 as a shared, multi-tenant cluster in an operator-friendly manner, while also maximizing the throughput and utilization of the cluster when running MRv2 applications.

Users submit jobs to Queues. Queues, as a collection of jobs, allow the system to provide specific functionality. HDP comes configured with a single mandatory queue named "default".
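A minimal capacity-scheduler.xml sketch for the single mandatory "default" queue; the property names are the standard Capacity Scheduler ones, and the 100% capacity value is illustrative, since real deployments usually define additional queues:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>100</value>
</property>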

For more information about the Scheduler, see the Capacity Scheduler guide.

ACL

The Access Control List (ACL) can be enabled on the cluster for job-level and queue-level authorization. When enabled, access control checks are performed by:

• The Resource Manager, before allowing users to submit jobs to queues and administer these jobs.


• The Resource Manager and the Node Manager before allowing users to view job details or to modify a job using MRv2 APIs, CLI, or Web user interfaces.

There is no need for the mapred-queue-acls.xml file, since ACLs are now configured in the capacity-scheduler.xml file.

Data Compression

MapReduceV2 provides facilities for the application writer to specify compression for both intermediate map outputs and the job outputs (i.e., the output of the reducers). Compression can be set up with a CompressionCodec implementation, such as the codec for the zlib compression algorithm. The gzip file format is also supported.
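A hedged mapred-site.xml sketch that compresses both intermediate map output and final job output with gzip. The property names below are the standard Hadoop 2 compression settings and are an assumption, since this guide does not list them explicitly:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>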

Concatenation

MapReduceV2 can concatenate multiple small files into one block size that is more efficient in storage and data movement.

Distributed Cache

DistributedCache efficiently distributes large application-specific, read-only files. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. It is also very common to use the DistributedCache by using the GenericOptionsParser. For example:

$ bin/hadoop jar -libjars testlib.jar -files file.txt args

The above command will copy the file.txt file to the cluster.

Hadoop Pipes

Hadoop Pipes is the name of the C++ interface to Hadoop MRv2. Hadoop Pipes uses sockets as the channel over which the Node Manager communicates with the process running the C++ map or reduce function. JNI is not used.

Hadoop Streaming

This is a Hadoop API to MRv2 that allows the user to write map and reduce functions in languages other than Java (Perl, Python, etc.). Hadoop Streaming uses standard streams (stdin and stdout) as the interface between Hadoop and the program, so the user can use any language that can read standard input and write to standard output to write the MapReduce program. Streaming is naturally suited for text processing.
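A minimal streaming sketch that uses standard Unix tools as the mapper and reducer. The streaming .jar path shown is typical of an HDP install but may differ on your system, and the HDFS paths are placeholders:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/hdp/streaming-in \
    -output /user/hdp/streaming-out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc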

Security

MRv2 supports Kerberos security and can run on Kerberos-secured clusters. Extra configuration may be required for MRv2 to function correctly on a secured cluster.

Job .jar Files

MRv2 sets a value for the replication level for submitted job .jar files. As a best practice, set this value to approximately the square root of the number of nodes. The default value is 10.
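This replication level is typically controlled by the mapreduce.client.submit.file.replication property in mapred-site.xml; treat the property name as an assumption, since this guide does not name it explicitly. For a roughly 100-node cluster, the square-root guideline gives a value of about 10:

<property>
  <name>mapreduce.client.submit.file.replication</name>
  <value>10</value>
</property>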


Note: The path for hadoop .jar files is different:

/usr/lib/hadoop-mapreduce/

1.3.2.3. Configuration Files

These files are used to configure MapReduce jobs.

Note

The default paths shown are those from an HDP installation. Users may choose to change these locations as needed.

• /etc/hadoop/conf/yarn-site.xml

This file contains configuration settings for YARN. It is used by the Client, the Node Manager, and the Resource Manager. The following table lists some important yarn-site.xml properties.

• yarn.resourcemanager.webapp.address = <hostname>:8088 -- Resource Manager host and port address.
• yarn.log.server.url = <hostname>:19888/jobhistory/logs -- History server address.
• yarn.resourcemanager.hostname = <hostname> -- Resource Manager host name.
• yarn.nodemanager.linux-container-executor.group = hadoop -- Equivalent of the v1 Task Tracker controller group, which can run AMs.
• yarn.nodemanager.log.retain-seconds = 604800 -- The unit is seconds.
• yarn.log-aggregation-enable = true -- Aggregate all of the logs in one location.
• yarn.nodemanager.container-monitor.interval-ms = 3000 -- Containers must send a heartbeat to the Node Manager every 3 seconds.
• yarn.nodemanager.log-aggregation.compression-type = gz -- Log files will be compressed in gz format.

• /etc/hadoop/conf/core-site.xml

This file contains configuration settings for Hadoop Core, such as I/O settings that are common to HDFS2 and MRv2. It is used by all Hadoop daemons and clients, because all daemons need to know the location of the Name Node. Hence this file should have a copy in each node running a Hadoop daemon or client.

• /etc/hadoop/conf/mapred-site.xml

This file contains configuration settings for MRv2 properties such as io.sort and memory settings for the Containers. The following table lists some important mapred-site.xml properties.

• mapreduce.map.memory.mb = 1024 (#1) -- Overall heap for the mapper task.
• mapreduce.reduce.memory.mb = 1024 (#2) -- Overall heap for the reducer task.
• mapreduce.map.java.opts = -Xmx756m -- The JVM heap size (-Xmx) for the mapper task; approximately 0.8 of #1.
• mapreduce.reduce.java.opts = -Xmx756m -- The JVM heap size (-Xmx) for the reducer task; approximately 0.8 of #2.
• mapreduce.reduce.log.level = INFO -- log4j log level variables are supported.
• mapreduce.jobhistory.done-dir = /mr-history/done -- The location is in HDFS.
• mapreduce.shuffle.port = 13562 -- Ensure that this port is open in the firewall.
• yarn.app.mapreduce.am.staging-dir = /user -- The location is in HDFS.
• mapreduce.reduce.shuffle.parallelcopies = 30 -- Scale this for a large cluster.
• mapreduce.framework.name = yarn -- Basic configuration.

• /etc/hadoop/conf/capacity-scheduler.xml

This is the configuration file for the Capacity Scheduler component in the Hadoop Resource Manager. You can use this file to configure various scheduling parameters related to queues.

• /etc/hadoop/conf/hadoop-env.sh

Java is required by Hadoop, so this file is used by the HDFS daemons to locate JAVA_HOME. This file also specifies memory settings for all of the HDFS daemons, so it is the file to modify if you need to tweak those settings or investigate memory or garbage-collection errors in the HDFS daemons.

• /etc/hadoop/conf/yarn-env.sh

Java is required by Hadoop, so this file is used by the YARN daemons to locate JAVA_HOME. This file also specifies memory settings for all of the YARN daemons, so it is the file to modify if you need to tweak those settings or investigate memory or garbage-collection errors in the YARN daemons.
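As a hedged sketch, daemon heap sizes are usually adjusted through environment variables in these files; the variable names below are the conventional ones and the 1024 MB values are placeholders only:

In hadoop-env.sh:

export HADOOP_HEAPSIZE=1024

In yarn-env.sh:

export YARN_RESOURCEMANAGER_HEAPSIZE=1024
export YARN_NODEMANAGER_HEAPSIZE=1024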

• /etc/hadoop/conf/log4j.properties

This file is used to modify the log purging intervals of the MapReduce log files. It defines the logging for all of the Hadoop daemons, and includes information related to appenders used for logging and layout.

Configuration File Permissions

Listed below are the proper HDFS-related permissions and user/groups for folders and files for a working HDP cluster.

drwxr-xr-x 3 root          root   4096 /etc/hadoop
lrwxrwxrwx 1 hadoop_deploy hadoop   29 conf -> /etc/alternatives/hadoop-conf
-rw-r--r-- 1 hdfs          hadoop 2316 core-site.xml
-rw-r--r-- 1 mapred        hadoop 7632 mapred-site.xml
-rw-r--r-- 1 mapred        hadoop 7632 yarn-site.xml
-rw-r--r-- 1 mapred        hadoop 2033 mapred-queue-acls.xml
-rw-r--r-- 1 hdfs          hadoop  928 taskcontroller.cfg
-rw-r--r-- 1 root          root   9406 capacity-scheduler.xml
-rw-r--r-- 1 root          root    327 fair-scheduler.xml
-rw-r--r-- 1 hdfs          hadoop 4867 hadoop-env.sh
-rw-r--r-- 1 hdfs          hadoop 4867 yarn-env.sh

1.3.2.4. Process ID Files

These are the process ID (.pid) files that are generated when the daemons are created:

/var/run/hadoop-yarn/yarn/yarn-yarn-resourcemanager.pid
/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
/var/run/hadoop-mapreduce/mapred/mapred-mapred-historyserver.pid

Without the proper permissions, the MapReduce daemons may not start or stop. The proper permissions are listed below.

drwxr-xr-x 2 yarn   hadoop 4096 Oct 31 11:54 .
drwxrwxr-x 3 yarn   hadoop 4096 Oct 20 15:15 ..
-rw-r--r-- 1 yarn   hadoop    5 Oct 31 11:54 yarn-yarn-nodemanager.pid
-rw-r--r-- 1 yarn   hadoop    5 Oct 31 11:54 yarn-yarn-resourcemanager.pid
-rw-r--r-- 1 mapred hadoop    5 Oct 31 11:54 /var/run/hadoop-mapreduce/mapred/mapred-mapred-historyserver.pid

1.3.2.5. Log Files

These files record the activity of the YARN and MRv2 daemons and jobs.

The following environment files define the log location for YARN and MRv2 for the daemons.

• yarn-env.sh: export YARN_LOG_DIR=/var/log/hadoop-yarn/$USER

• hadoop-env.sh: export HADOOP_LOG_DIR=/var/log/hadoop-mapred/$USER

Note

• The Job History Server runs as the “mapred” user, and it will use the hadoop-env.sh file.

• The Resource Manager and the Node Manager run as the “yarn” user, and they will use the yarn-env.sh file.

• The Application Master and Container log files are removed from the local directories once the Job finishes, and are moved to the HDFS2 directory.

• The Job History Server will display the Application Master and Containers log files, which are stored in HDFS2 for 30 days.

Daemon .out Files

The log files with the .out extension for MRv2 and YARN daemons are located in /var/log/hadoop-mapred/mapred and in /var/log/hadoop-yarn/yarn. These files have the following naming convention:

• hadoop-mapred-historyserver-<hostname>.out

• yarn-yarn-nodemanager-<hostname>.out

• yarn-yarn-resourcemanager-<hostname>.out

These .out files are created and written to during start-up of the MRv2 and YARN daemons. It is very rare that these files get populated, but they can be helpful when trying to determine why Resource Manager, Node Manager, or the Job History Server daemons are not starting up.

Daemon .log Files


The log files with the .log extension for MRv2 and YARN daemons are located in /var/log/hadoop/mapred and in /var/log/hadoop-yarn/yarn. These files have the following naming convention:

• hadoop-mapred-historyserver-<hostname>.log

• yarn-yarn-nodemanager-<hostname>.log

• yarn-yarn-resourcemanager-<hostname>.log

These files show the log messages for the running daemons. If there are any errors encountered while the daemon is running, the stack trace of the error is logged in these files.

Note

Log files are rotated daily by default, but can be adjusted by modifying the /etc/hadoop/conf/log4j.properties file.

Daemon .log.<date> Files

The .log.<date> files have the following format:

• hadoop-mapred-historyserver-<hostname>.log.<date>

• yarn-yarn-nodemanager-<hostname>.log.<date>

• yarn-yarn-resourcemanager-<hostname>.log.<date>

When .log files are rotated, the file name is appended with a date. An example of the file name would be:

mapred-mapred-historyserver-sandbox.log.2013-10-26

This indicates that the file was rotated on Oct 26, 2013. These files are useful when an issue has occurred multiple times, and comparing these older log files with the most recent log file can help uncover patterns of occurrence.

MapReduce V2 Container Log Files

The log files for specific applications and their Containers are located in HDFS. You can access Container log files using either the command line or the Resource Manager UI.

To access the Container log files using the command line, you first need to obtain the application ID by running the following command:

yarn application -list

This command returns a list of the applications, along with their application IDs:

yarn application -list
13/11/04 23:39:09 INFO client.RMProxy: Connecting to Resource Manager at sandbox/10.11.2.159:8050
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                  Application-Name  Application-Type  User  Queue    State     Final-State  Progress  Tracking-URL
application_1383601692319_0008  QuasiMonteCarlo   MAPREDUCE         hdfs  default  ACCEPTED  UNDEFINED    0%        N/A

You can then use the application ID in the following command to access the Container log files:

yarn logs -applicationId application_1383601692319_0008

Note

The log files are stored in HDFS under the following path:

/app-logs/hdfs/logs//_4545

Information about using the Resource Manager UI to access the Container log files is available on this page under "Using the Web GUI to Monitor Examples."

1.3.2.6. Ports

This section provides information about the ports used by the MapReduce services. Because these services run in a distributed system, this port information can help isolate issues caused by communication problems between nodes.

HTTP Ports

The Resource Manager, Job History Server, and Node Manager services each have a web interface, and therefore listen on a Hypertext Transfer Protocol (HTTP) port. This makes it possible for any client with network access to the nodes to access the web page and view specific information on the MRv2 jobs submitted. Each of the MapReduce services can be configured to listen on a specific port. These ports are configured within the yarn-site.xml file. The following table lists the HTTP ports for the MRv2 and YARN services.

MapReduce Service   Configuration Property Name from yarn-site.xml  Default Value
Resource Manager    yarn.resourcemanager.webapp.address             8088
Job History Server  yarn.log.server.url                             19888
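To confirm that one of these web interfaces is reachable from a client, a quick check such as the following can be used; the host name is a placeholder for your Resource Manager node:

curl -s http://<resource manager host>:8088/cluster | head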

IPC Ports

Interprocess Communication (IPC) is used by the Resource Manager to submit jobs. The following table lists the IPC ports used by the MRv2 and YARN services, including the port that the Resource Manager uses for job submission.

MapReduce service  Configuration Property Name from yarn-site.xml  Default value
Resource Manager   yarn.resourcemanager.address                    8050
RM APIs webapp     yarn.resourcemanager.webapp.address             8025
RM scheduler       yarn.resourcemanager.scheduler.address          8030
Node Manager       yarn.nodemanager.address                        45454
RM Admin           yarn.resourcemanager.admin.address              8141
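A quick way to confirm that a daemon is actually listening on its expected IPC port -- for example, the Resource Manager job submission port 8050 from the table above -- is to filter the netstat output shown later in this section:

netstat -an | grep 8050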

For more information on IPC, see http://wiki.apache.org/hadoop/ipc.


1.3.2.7. Gathering General Information

Operating System Information

The following commands will provide the Linux Kernel version and type. With this information, you can determine if HDP is running on a supported platform.

Command:

uname -a

Example:

$ uname -a
Linux test63.localdomain 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Command:

cat /proc/version

Example:

$ cat /proc/version
Linux version 2.6.32-279.el6.x86_64 ([email protected]) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Fri Jun 22 12:19:21 UTC 2012

Command:

cat /etc/*-release

Example:

$ cat /etc/*-release
CentOS release 6.3 (Final)
CentOS release 6.3 (Final)
CentOS release 6.3 (Final)

Determine Installed Software

This information is helpful when troubleshooting performance-related issues, or when there is unexpected behavior occurring on one specific machine. One example would be a MapReduce job that suddenly starts running for 20 minutes rather than the expected 1 minute. The following command does not list any tarball-type installations, so you should keep in mind the possibility that some programs may have been installed outside of the system package manager.

Command:

rpm -qa

Example:

# rpm -qa | egrep "hadoop|yarn"
hadoop-hdfs-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-mapreduce-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-lzo-native-0.5.0-1.x86_64


hadoop-mapreduce-historyserver-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-lzo-0.5.0-1.x86_64
hadoop-yarn-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-libhdfs-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-yarn-resourcemanager-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-client-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-yarn-nodemanager-2.2.0.2.0.6.0-76.el6.x86_64

Detect Running Processes

This information is helpful when troubleshooting performance-related issues, or when there is unexpected behavior occurring on one specific machine.

Command:

ps -aux

Example:

$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY  STAT START TIME COMMAND
root         1  0.0  0.0  19348   620 ?    Ss   Sep25 0:06 /sbin/init
postgres  6705  0.0  0.0 214952  2936 ?    Ss   09:18 0:00 postgres: mapred ambarirca 10.10.3.27(60031) idle
root         3  0.0  0.0      0     0 ?    S    Sep25 0:00 [migration/0]
root         4  0.0  0.0      0     0 ?    S    Sep25 0:07 [ksoftirqd/0]

Detect Java Running Processes

The command below lists the Java processes that are running on the machine. Since most Hadoop code is based on Java, using this command can also help verify that the Hadoop processes are running on a specific machine.

Note

The jps command may not be within the PATH variable for the user that is logged in. If not, you must first set the PATH variable, or you can simply provide the full path, for example:

/usr/jdk64/jdk1.6.0_31/bin/jps

Command:

jps

Example output:

10528 ResourceManager
25185 Jps
9202 RunJar
10141 Bootstrap
8001 QuorumPeerMain
7357 NameNode
8358 HMaster
12474 HRegionServer
9605 RunJar
1921 NodeManager


8857 JobHistoryServer
5612 DataNode
17667 RunJar
2943 AmbariServer
11103 SecondaryNameNode

Show Open Files Linked to a Process ID

This information is helpful in determining which process has a lock on a specific file, such as issues where errors state that a file is locked, and hence a process cannot start because it cannot write to a file.

Command:

lsof -p <process id> | grep <file or directory name>

Example:

$ lsof -p 8857 | grep var
java 8857 mapred   1w REG 253,0  2031 542470 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.out
java 8857 mapred   2w REG 253,0  2031 542470 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.out
java 8857 mapred 159w REG 253,0 95452 542286 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.log

Verifying Well-formed XML

The following command can help determine if a configuration file in XML format is well-formed. If the XML file is well-formed, it will simply be opened. If there are problems with the file, a list of errors will be displayed. This command can help uncover any syntax errors that might occur as a result of manually editing configuration files in Hadoop. The example below shows the list of errors returned when an XML file is not well-formed.

Command:

xmllint <file name>

Example:

$ xmllint ./hdfs-site.xml
./hdfs-site.xml:187: parser error : Opening and ending tag mismatch: property line 6 and configuration
^
./hdfs-site.xml:188: parser error : Premature end of data in tag property line 3
^
./hdfs-site.xml:188: parser error : Premature end of data in tag configuration line 2
^
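To check all of the Hadoop configuration files in one pass, a loop such as the following can be used. The /etc/hadoop/conf location matches the configuration path mentioned earlier in this section and is an assumption for your configuration directory; the --noout option suppresses the file contents so that only errors are printed:

for f in /etc/hadoop/conf/*.xml; do xmllint --noout "$f" || echo "Problem found in $f"; done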

Detect Auto-start Processes

Ideally the cluster administrator should have this information. Information about auto-start processes can help in determining why a certain behavior is specific to a machine.


For instance, it is possible that a process that auto-starts on boot-up is preventing one of the HDP components from launching after a node is rebooted, due to a port conflict. Below is the command that returns a list of cron jobs.

Command:

for user in $(cut -f1 -d: /etc/passwd); do crontab -u $user -l; done

Example:

$ for user in $(cut -f1 -d: /etc/passwd); do crontab -u $user -l; done
no crontab for root
no crontab for bin
no crontab for daemon
no crontab for adm
no crontab for lp
no crontab for sync
no crontab for shutdown
no crontab for halt
no crontab for mail
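Cron jobs are only one source of auto-started processes. On CentOS/RHEL 6 (the platform used in the examples above), services configured to start at boot can also be listed with chkconfig; filtering on ':on' shows only services enabled for at least one runlevel:

chkconfig --list | grep ':on'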

Get a List of All Mounts on a Machine

Gathering this information will help in determining what drives are actually mounted or available for use on the node. The following commands are helpful in determining if the system is hitting any storage limitations. Unexpected behavior can occur when disks are full.

Command:

df

Example:

$ df
Filesystem                    1K-blocks    Used Available Use% Mounted on
/dev/mapper/VolGroup-lv_root   11272464 4729432   5970412  45% /
tmpfs                            961928     272    961656   1% /dev/shm
/dev/sda1                        495844   37433    432811   8% /boot

Command:

cat /etc/fstab

Example:

root@a2nn:~> cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Wed Mar 20 15:03:22 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/vg_a2nn-lv_root                /      ext4  defaults  1 1
UUID=8bbdbae7-9cb8-4b66-af1c-4f904f047501  /boot  ext4  defaults  1 2


/dev/mapper/vg_a2nn-lv_swap  swap      swap    defaults        0 0
tmpfs                        /dev/shm  tmpfs   defaults        0 0
devpts                       /dev/pts  devpts  gid=5,mode=620  0 0
sysfs                        /sys      sysfs   defaults        0 0
proc                         /proc     proc    defaults        0 0
root@a2nn:~>

Operating System Log Files

These files can help in determining if a machine was rebooted or shut down at a particular time. The log files can help determine why some HDP services were not working or not operational at a specific time.

• /var/log/messages

Contains global system messages, including messages that are logged during system start-up.

• /var/log/audit/audit.log

This file is only available if the /etc/init.d/auditd daemon has been started. To check status, execute /etc/init.d/auditd status. This file can be used to check which user executed a command at a particular time.

Hardware Information

The following commands provide information about the hardware components installed on the machine. This will help in isolating issues related to hardware.

The following command lists information about the PCI buses and devices in the system:

lspci

Example:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:02.0 VGA compatible controller: InnoTek Systemberatung GmbH VirtualBox Graphics Adapter
00:03.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:04.0 System peripheral: InnoTek Systemberatung GmbH VirtualBox Guest Service
00:05.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:06.0 USB controller: Apple Computer Inc. KeyLargo/Intrepid USB
00:07.0 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:08.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:0d.0 SATA controller: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) SATA Controller [AHCI mode] (rev 02)

The following command returns CPU information for the machine:

cat /proc/cpuinfo


Example:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping        : 9
cpu MHz         : 2283.256
cache size      : 6144 KB

Network Information

This set of commands can be helpful when troubleshooting network issues.

The following command provides the IP address and validates that the network interfaces are up, and that there is an IP address tied to the interfaces. When dealing with issues involving communication between nodes, this is a good place to start.

ifconfig

Example:

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:76:CD:33
          inet addr:10.10.3.27  Bcast:10.10.3.255  Mask:255.255.254.0
          inet6 addr: fe80::a00:27ff:fe76:cd33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1217765 errors:0 dropped:0 overruns:0 frame:0
          TX packets:336245 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:308949876 (294.6 MiB)  TX bytes:128725650 (122.7 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1609854 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1609854 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:619945138 (591.2 MiB)  TX bytes:619945138 (591.2 MiB)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:EB:5E:B7
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:175 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:10172 (9.9 KiB)

The following command returns a list of the ports used within the system. This is helpful in determining if a specific port is already in use, and therefore another application cannot bind and listen on the port. This command can be useful when trying to isolate why certain HDP Master processes are not able to start up.

netstat -an

Example:


$ netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 10.10.10.157:51111          0.0.0.0:*                   LISTEN
tcp        0      0 127.0.0.1:199               0.0.0.0:*                   LISTEN
tcp        0      0 10.10.10.157:50090          0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:8010                0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:3306                0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:8651                0.0.0.0:*                   LISTEN

You can use the following command to start, stop, and check the status of the firewall service on a CentOS or RHEL operating system. The firewall service status represents another possible reason why communication cannot be established between nodes.

service iptables [stop | status | start]

Example:

$ service iptables status
iptables: Firewall is not running.

1.3.2.8. Common MRv2 Commands

This section provides some helpful MapReduce Version 2 commands. The full list of commands is available on the apache.org website at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/CommandsManual.html.

Getting the Hadoop Version

Command:

hadoop version

Example:

[hdfs@sandbox run]$ hadoop version
Hadoop 2.2.0.2.0.6.0-76
Subversion [email protected]:hortonworks/hadoop.git -r 8656b1cfad13b03b29e98cad042626205e7a1c86
Compiled by jenkins on 2013-10-18T00:19Z
Compiled with protoc 2.5.0
From source with checksum d23ee1d271c6ac5bd27de664146be2
This command was run using /usr/lib/hadoop/hadoop-common-2.2.0.2.0.6.0-76.jar

Running the Pi Job

Command:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar pi 10 10

Getting the Running Job List


Command:

yarn application -list

Example:

$ yarn application -list
13/11/04 12:08:40 INFO client.RMProxy: Connecting to Resource Manager at sandbox/10.11.2.159:8050
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                  Application-Name  Application-Type  User  Queue    State     Final-State  Progress  Tracking-URL
application_1383594295029_0005  QuasiMonteCarlo   MAPREDUCE         hdfs  default  ACCEPTED  UNDEFINED    0%        N/A

Getting the Queue List

Command:

hadoop queue -list

Example:

[hdfs@sandbox run]$ hadoop queue -list
DEPRECATED: Use of this script to execute mapred command is deprecated. Instead use the mapred command for it.

13/10/31 14:07:55 INFO client.RMProxy: Connecting to Resource Manager at sandbox.hortonworks.com/10.0.2.15:8050
======================
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 100.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0

Getting the Access Control List for the Current User

Command:

hadoop queue -showacls

Example:

root@a2nn:~> hadoop queue -showacls
Queue acls for user : root

Queue          Operations
=====================================
default        submit-job,administer-jobs
data-analysis  submit-job,administer-jobs

Kill an Application

Command:

yarn application -kill <application-id>


Kill a Task

Command:

hadoop job -kill-task <task-attempt-id>

Fail a Task

Command:

hadoop job -fail-task <task-attempt-id>

List Attempts

Command:

hadoop job -list-attempt-ids <job-id> <task-type> <task-state>

1.3.2.9. MRv2 Troubleshooting Actions Checklist

1. Understand the issue. Check to see if the issue exists with all MRv2 jobs. Determine when a particular job or script last worked successfully. Understand the actual and expected behavior and formulate a problem statement.

2. Verify that all components related to MRv2 are running. You can use the ps and jps commands to check whether the processes for dependent components are running. Ensure that all ports are listening, are bound to a process, and accept connections (i.e., rule out firewall issues).

3. Look at the job details in the Resource Manager UI.

a. Use the UI to navigate to the job attempt.

b. Look at the log file for the failed attempt.

4. Look at the Resource Manager and the Node Manager log files in the Resource Manager UI, or on the specific nodes.

5. Use the yarn logs command to collect all of the logs of the Containers.

6. Check the Job Configuration in the Resource Manager UI to make sure that all of the desired parameters were actually passed on to the job.

7. Run the MRv2 pi job provided with the HDP examples to see if that job succeeds:

• If it succeeds, check to see if there is a problem with the client or the data.

• If it fails, there is probably some basic problem with the process or the configuration.

8. If the job is run through streaming or pipes, run a similar job to troubleshoot.

9. If the job is started by one of the other HDP components, look at the component-specific guide.

10. Look for the operating system information and verify that it is supported.

11. Search the Hortonworks Knowledge Base for a possible solution.


12. If the issue is still not resolved, log a case in the Hortonworks Support Portal:

a. Provide all of the information gathered in the preceding steps, along with the information in the “Checklist of Items to Collect” list in the following section.

b. Tar the configuration files and the log files and attach them to the case (see the example command following this checklist).

c. Inform Hortonworks if it is a Production, Development, or POC environment.
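As referenced in step 12b above, a minimal sketch of bundling the configuration and log files for a support case, assuming the default HDP locations used elsewhere in this section (adjust the paths and the archive name to match your cluster):

tar czf mrv2-support-bundle.tar.gz /etc/hadoop/conf /var/log/hadoop-yarn/yarn /var/log/hadoop-mapreduce/mapred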

Checklist of Items to Collect

1. Collect the most recent log files for all of the MRv2 daemons.

2. Get copies of the configuration files (for example: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml).

3. Provide the number of Data Nodes in the cluster, as well as the total number of nodes.

4. Use the yarn logs command to collect the log files of the Containers for all of the tasks.

5. How was HDP installed -- with Ambari, or manually with RPM?

6. Provide hardware specifications: CPU, memory, disk drives, number of network interfaces.

1.3.2.10. Common Server-Side Issues

Resource Manager or Node Manager: Fails to Start or Crashes

Symptoms may include:

• Process appears to start, but then disappears from the process list.

• Node Manager cannot bind to interface.

• Kernel Panic, Halt.

Potential Root Cause: Existing Process Bound to the Port

Troubleshooting Steps:

• Examine the bound ports to ensure that no other process has already bound to the port.

Resolution Steps:

• Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.

Information to Collect:

• List of bound interfaces/ports and the process.

• Resource Manager log.

Potential Root Cause: Incorrect File Permissions

Troubleshooting Steps:


• Verify that all Hadoop file system permissions are set properly.

• Verify the Hadoop configurations.

Resolution Steps:

• Follow the procedures for handling failure due to file permissions (see Hortonworks KB Solutions/Articles).

• Fix any incorrect configuration.

Information to Collect:

• Dump of the file system permissions, ownership, and flags for the directories listed in the yarn.nodemanager.local-dirs property in the yarn-site.xml file. In this case, the property has a value of "/hadoop/yarn/local". From the command line, run:

ls -lR /hadoop/yarn/local

• Resource Manager log.

• Node Manager log.

Potential Root Cause: Incorrect Name-to-IP Resolution

Troubleshooting Steps:

• Verify that the name/IP resolution is correct for all nodes in the cluster.

Resolution Steps:

• Fix any incorrect configuration.

Information to Collect:

• Local hosts file for all hosts on the system (/etc/hosts).

• Resolver configuration (/etc/resolv.conf).

• Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX, where X is the number of the interface card).

Potential Root Cause: Java Heap Space Too Low

Troubleshooting Steps:

• Examine the heap space property in yarn-env.sh.

• Examine the settings in Ambari cluster management.

Resolution Steps:

• Adjust the heap space property until the Resource Manager resumes running.

Information to Collect:

• yarn-env.sh from cluster.

• Screen-shot of Ambari cluster management mapred settings screen.

• Resource Manager log.

• Node Manager log.

Potential Root Cause: Permissions Not Set Correctly on the Local File System

Troubleshooting Steps:

• Examine the permissions on the various directories on the local file system.

• Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).

Resolution Steps:

• Use the chmod command to change the permissions of the directories to 755.

• Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred).

• Relaunch the Hadoop daemons using the correct user.
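A minimal sketch of these resolution steps, assuming the /hadoop/yarn/local directory from the earlier example and a yarn owner with a hadoop group (verify the correct directories, owner, and group for your cluster before running them):

chmod -R 755 /hadoop/yarn/local
chown -R yarn:hadoop /hadoop/yarn/local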

Information to Collect:

• core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

• Permissions listing for the directories listed in the above configuration files.

Potential Root Cause: Insufficient Disk Space

Troubleshooting Steps:

• Verify that there is sufficient space on all system, log, and HDFS partitions.

• Run the df -k command on the Name/DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode or HDFS data.

Resolution Steps:

• Free up disk space on all nodes in the cluster.


-OR-

• Add additional capacity.

Information to Collect:

• Core dumps.

• Linux command: last (history).

• Dump of file system information.

• Output of df -k command.

Potential Root Cause: Reserved Disk Space is Set Higher than Free Space

Troubleshooting Steps:

• In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the available free space on the drive.

Resolution Steps:

• Configure an appropriate value, or increase free space.
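For reference, the property discussed above is set in hdfs-site.xml in the following form; the value shown (1073741824 bytes, i.e. 1 GB) is purely illustrative:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>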

Information to Collect:

• HDFS configuration files.

Node Java Process Exited Abnormally

Potential Root Cause: Improper Shutdown

Troubleshooting Steps:

• Investigate the OS history and the Hadoop audit logs.

• Verify that no edit log or fsimage corruption occurred.

Resolution Steps:

• Investigate the cause, and take measures to prevent future occurrence.

Information to Collect:

• Hadoop audit logs.

• Linux command: last (history).

• Linux user command history.

Potential Root Cause: Incorrect Memory Configuration

Troubleshooting Steps:

• Verify the values in the configuration files.

• Check logs for stack traces -- out of heap space, or similar.

Resolution Steps:

• Fix configuration values and restart job/Resource Manager/Node Manager.

Information to Collect:

• Resource Manager log.

• Node Manager log.

• MapReduce v2 configuration files.

Node Manager Denied Communication with Resource Manager

Potential Root Cause: Hostname is in the Exclude File or Does Not Exist in the Include File

Troubleshooting Steps:

• Verify the contents of the files referenced in the yarn.resourcemanager.nodes.exclude-path property or the yarn.resourcemanager.nodes.include-path property.

• Verify that the host for this Node Manager is not being decommissioned.

Resolution Steps:

• If the hostname for the Node Manager is in the exclude file and it is not meant to be decommissioned, remove it.

Information to Collect:

• Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.
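For reference, the exclude-path property takes the following form in yarn-site.xml; the file location shown is only an assumption and may differ in your cluster:

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/yarn.exclude</value>
</property>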

Potential Root Cause: Node was Decommissioned and/or Reinserted into the Cluster

Troubleshooting Steps:

• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause: Resource Manager is Refusing the Connection

Troubleshooting Steps:

• Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.

Resolution Steps:

• Ensure that the DataNode is not in the "exclude" list.


• Ensure that the DataNode host is in the "include" list.

Information to Collect:

• NameNode slaves file.

• NameNode hosts.deny file (or the file specified as the blacklist in HDFS configuration).

• NameNode hosts.allow file (or the file specified as the whitelist in HDFS configuration).

• HDFS Configuration.

Potential Root Cause: NameNode is Refusing the Connection

Troubleshooting Steps:

• Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.

Resolution Steps:

• Ensure that the DataNode is not in the "exclude" list.

Information to Collect:

• NameNode slaves file.

• NameNode hosts.deny file (or the file specified as the blacklist in HDFS configuration).

• NameNode hosts.allow file (or the file specified as the whitelist in HDFS configuration).

• HDFS Configuration.

Error: Could Only be Replicated to x Nodes, Instead of n

Potential Root Cause: At Least One DataNode is Nonfunctional

Troubleshooting Steps:

• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause: One or More DataNodes Are Out of Space on Their Currently Available Disk Drives

Troubleshooting Steps:

• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

1.3.2.11. Common Client-Side Issues

Job Fails to Start

Symptom: Exception When Job Submitted

Potential Root Cause: Mistake in the Job's User Code

Troubleshooting Steps:

• Examine the Node Manager/Resource Manager logs and the task logs to find the exact exception.

Resolution Steps:

• Examine the stack-trace for the thrown exception.

• Examine the user code to see if you can spot the error.

Information to Collect:

• Resource Manager log.

• Node Manager log.

• The exception trace that the user has mentioned, and the task logs.

• If possible, get at least a snippet of Java code from the area where the exception was thrown.

Symptom: "No Class Def Found" Troubleshooting Steps: or Similar Exception When Trying to Start Job, Potential • Verify that the exception is ClassNotFound, Root Cause - 1: Job's .jar File NoSuchMethodError, or a similar exception. -- or Other .jar File -- Not on Resolution Steps: Classpath • Find the .jar file that contains the missing class and add it to the classpath.

Information to Collect:

• The entire command used to submit the job.

• The stack-trace from the Node Manager logs.

Potential Root Cause - 2: Main Class or Method of the Job Code is not "Public Static"

Troubleshooting Steps:

• Examine the code for the main MRv2 class.

Resolution Steps:

• Set access modifiers to "public static"

• Recompile and re-test.

Information to Collect:

• The exact exception thrown by Hadoop.

• The job source code.

Job Seems to Hang in Setup


Symptom: Job Seems to Hang and Node Manager Becomes "Blacklisted"

Potential Root Cause: Too Many Allowed Slots Configured for the System Memory on the Node

Troubleshooting Steps:

• Verify the amount of system memory.

• Calculate the required memory for each configured Container.

• Take into account any other processes running on the node.

Resolution Steps:

• Add all of the above. If the total is greater than the total available on the node, you will need to reduce the amount configured in the Container properties.

Symptom: Job Seems to Hang Without "Blacklisting"

Potential Root Cause: No Node Managers Currently Available

Troubleshooting Steps:

• Verify the number of available MRv2 tasks by looking at:

<resource manager host>:8088/cluster/nodes

Resolution Steps:

• Wait until more Node Managers become available, then see if the job runs.

Information to Collect:

• None until the job actually fails to run, then troubleshoot based on the failure symptom.

1.3.3. YARN Components

In Hadoop version 1, MapReduce was responsible for both processing and cluster resource management.

In Apache Hadoop version 2, cluster resource management has been moved from MapReduce into YARN, thus enabling other application engines to utilize YARN and Hadoop, while also improving MapReduce performance.

Use the following resources to learn more about YARN Components:

• YARN Overview

• Resource Manager

• Node Manager

• Application Master

• Containers

• Job History Service

• YARN Walk-through

1.3.3.1. YARN Overview

The MapReduce Distributed Processing Model

The MapReduce distributed processing model consists of a highly parallel map phase where input data is split into discrete chunks for processing. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple -- and fairly restricted -- nature of this programming model lends itself to very efficient implementations on an extremely large scale across thousands of cheap, commodity nodes.

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model. In particular, when MapReduce is paired with a distributed file system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling -- a key factor in the popularity of Hadoop.

One of the key features in MapReduce is the lack of data motion, i.e., move the computation to the data, rather than moving data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data resides in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack, which is a key advantage.
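As a concrete illustration of the two phases, the examples jar shipped with HDP includes a wordcount job that maps each word in the input to a count and reduces by summing the counts. The jar path matches the pi example shown earlier in this section; the input and output paths are placeholders:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar wordcount <input directory> <output directory>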

MapReduce on Hadoop Version 1

Apache Hadoop MapReduce is an open-source Apache Software Foundation project, and is an implementation of the MapReduce programming model described above. Version 1 of the Apache Hadoop MapReduce project can be broken down into the following major components:

• The end-user MapReduce API for programming the desired MapReduce application.

• The MapReduce framework, which is the run-time implementation of various phases such as the map phase, the sort/shuffle/merge aggregation, and the reduce phase.

• The MapReduce system, which is the back-end infrastructure required to run the MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.

This separation of elements offered significant benefits, particularly for end-users – they could completely focus on the application via the API, and allow the combination of the MapReduce Framework and the MapReduce System to deal with details such as resource management, fault-tolerance, scheduling, etc.

The version 1 Apache Hadoop MapReduce System was composed of the Job Tracker, which was the master, and the per-node slaves referred to as Task Trackers.


In Hadoop version 1, the Job Tracker is responsible for resource management (managing the worker nodes via the Task Trackers), tracking resource consumption and availability, and also job life-cycle management (scheduling individual tasks of a job, tracking progress, providing fault-tolerance for tasks, etc.).

The Task Tracker has simple responsibilities: executing launch and tear-down tasks on orders from the Job Tracker, and periodically providing task status information to the Job Tracker.

MapReduce works well for many applications, but not all. Other programming models are better suited for applications such as graph processing (Google Pregel, Apache Giraph) and iterative modeling using MPI (Message Passing Interface). It also became clear that there was room for improvement in scalability, utilization, and user agility.

MapReduce on Hadoop YARN

In Apache Hadoop version 2, cluster resource management has been moved from MapReduce into YARN, thus enabling other application engines to utilize YARN and Hadoop. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management layer.


The fundamental idea of YARN is to split up the two major responsibilities of the Job Tracker -- resource management and job scheduling/monitoring -- into separate daemons: a global Resource Manager and a per-application Application Master. The Resource Manager and its per-node slave, the Node Manager, form the new operating system for managing applications in a distributed manner.


The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application Application Master is a framework-specific entity, and is tasked with negotiating resources from the Resource Manager and working with the Node Managers to execute and monitor the component tasks.

The Resource Manager has a pluggable Scheduler component, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application or hardware failures. The scheduler performs its scheduling function based on the resource requirements of an application by using the abstract notion of a resource Container, which incorporates resource dimensions such as memory, CPU, disk, network, etc.

The Node Manager is the per-machine slave, and is responsible for launching the applications’ Containers, monitoring their resource usage (CPU, memory, disk, network), and reporting this information to the Resource Manager.

The per-application Application Master has the responsibility of negotiating appropriate resource Containers from the Scheduler, tracking their status, and monitoring their progress. From the system perspective, the Application Master runs as a normal Container.

One of the crucial implementation details for MapReduce within the new YARN system is the reuse of the existing MapReduce framework without any major changes. This step was very important to ensure compatibility with existing MapReduce applications.

1.3.3.2. Resource Manager

The YARN Resource Manager is the master that arbitrates the available cluster resources, and thus helps manage the distributed applications running on the YARN system. It works together with the per-node Node Managers and the per-application Application Masters.

In YARN, the Resource Manager is primarily limited to scheduling, i.e., only arbitrating available resources in the system among the competing applications, and not concerning itself with per-application state management. The scheduler only handles an overall resource profile for each application, ignoring local optimizations and internal application flow. In fact, YARN completely departs from the static assignment of map and reduce slots because it treats the cluster as a resource pool. Because of this clear separation of responsibilities, the Resource Manager is able to address the important design requirements of scalability and support for alternate programming models.

The Resource Manager has a pluggable Scheduler component, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application or hardware failures. The scheduler performs its scheduling function based on the resource requirements of an application by using the abstract notion of a resource Container, which incorporates resource dimensions such as memory, CPU, disk, network, etc. For more information about the Scheduler, see the Capacity Scheduler guide.

In contrast to many other workflow schedulers, the Resource Manager also has the ability to symmetrically request resources back from a running application. This situation typically


happens when cluster resources become scarce and the scheduler decides to revoke some (but not all) of the resources that were given to an application.

In YARN, Resource Requests can be strict or negotiable. This feature provides Application Masters with a great deal of flexibility on how to fulfill the requests, e.g., by picking Containers to yield back that are less crucial for the computation, by check-pointing the state of a task, or by migrating the computation to other running Containers. Overall, this allows applications to preserve work, in contrast to platforms that kill Containers to satisfy resource constraints. If the application is non-collaborative, the Resource Manager can, after waiting a certain amount of time, obtain the needed resources by instructing the Node Managers to forcibly terminate Containers.

Resource Manager failures remain significant events affecting cluster availability. As of this writing, the Resource Manager will restart running Application Masters as it recovers its state. If the framework supports restart -- and most will for routine fault tolerance -- the platform will automatically restore users' pipelines.

In contrast to the Hadoop 1.0 Job Tracker, it is important to mention the tasks for which the Resource Manager is not responsible. While the Resource Manager will track application execution flow and task fault-tolerance, it will not provide access to the application status servlet, which is now part of the Application Master, nor will it track previously executed jobs, as that is now delegated to the Job History Service (a daemon running on a separate node). This is consistent with the view that the Resource Manager should only handle live resource scheduling, and helps YARN core components scale better than the Hadoop 1.0 Job Tracker.

1.3.3.3. Node Manager

The Node Manager manages the individual compute nodes in a Hadoop cluster. This includes keeping up-to-date with the Resource Manager, managing the life-cycle of application Containers, monitoring resource usage (memory, CPU) of individual Containers, monitoring node health, and managing logs and other auxiliary services that can be utilized by YARN applications.

On start-up, the Node Manager registers with the Resource Manager, and then sends heartbeats with its status and waits for instructions. Its primary goal is to manage application Containers assigned to it by the Resource Manager.

YARN Containers are described by a Container Launch Context (CLC). This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for Node Manager services, and the command necessary to create the process. After validating the authenticity of the Container lease, the Node Manager configures the environment for the Container, including initializing its monitoring subsystem with the resource constraints specified by the application. The Node Manager will also kill Containers as directed by the Resource Manager.

1.3.3.4. Application Master

The Application Master is the process that coordinates the execution of an application in the cluster. Each application has its own unique Application Master that is tasked with negotiating resources (Containers) from the Resource Manager and working with the Node Managers to execute and monitor the tasks. In the YARN design, MapReduce is just one


application framework; the design permits building and deploying distributed applications using other frameworks. For example, YARN ships with a Distributed Shell application that permits running a shell script on multiple nodes in a YARN cluster.
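As a rough sketch of how such a non-MapReduce application is launched (the jar location is an assumption and varies by HDP version; the Client class and the -jar, -shell_command, and -num_containers options come from the Apache distributed shell example):

hadoop jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-*.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-*.jar \
    -shell_command uptime -num_containers 2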

Once the Application Master is started (as a Container), it will periodically send heartbeats to the Resource Manager to affirm its health, and to update the record of its resource demands. After building a model of its requirements, the Application Master encodes its preferences and constraints in a heartbeat message to the Resource Manager. In response to subsequent heartbeats, the Application Master will receive a lease on Containers bound to an allocation of resources at a particular node in the cluster. Depending on the Containers it receives from the Resource Manager, the Application Master may update its execution plan to accommodate the excess or lack of resources. Container allocation/de-allocation can take place in a dynamic fashion as the application progresses.

1.3.3.5. Containers

A Container is a collection of physical resources on a single node, such as memory (RAM), CPU cores, and disks. There can be multiple Containers on a single Node (or a single large one). Every node in the system is considered to be composed of multiple Containers of minimum memory size (512MB or 1 GB, for example). The Application Master can request any Container as a multiple of the minimum memory size.
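The minimum Container size mentioned above is controlled by a YARN scheduler property. An illustrative yarn-site.xml entry follows; the 1024 MB value is only an example:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>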

A Container thus represents a resource (memory, CPU) on a single node in a given cluster. A Container is supervised by the Node Manager and scheduled by the Resource Manager.

Each application starts out as an Application Master, which is itself a Container (often referred to as container-0). Once started, the Application Master must negotiate with the Resource Manager for more Containers. Container requests (and releases) can take place in a dynamic manner at run-time. For instance, a MapReduce job may request a certain amount of mapper Containers, and as they finish, release them and request that more reducer containers be started.

1.3.3.6. Job History Service

In Hadoop version 1, the Job Tracker was the master scheduler and tracker of all MapReduce jobs on the cluster. The new Resource Manager in Hadoop version 2, acting as a scheduler, does not track job status. This task is now delegated to the JobHistoryService daemon, which is usually run on a separate node.

1.3.3.7. YARN Walk-through

YARN application execution consists of the following steps:

• Application submission.

• Bootstrapping the Application Master instance for the application.

• Application execution managed by the Application Master instance.

Let’s walk through an application execution sequence (steps are illustrated in the diagram):

1. A client program submits the application, including the necessary specifications to launch the application-specific Application Master.


2. The Resource Manager assumes responsibility for negotiating a specified Container in which to start the Application Master, and then launches the Application Master.

3. On boot-up, the Application Master registers with the Resource Manager. The registration allows the client program to query the Resource Manager for details, which allows it to directly communicate with its own Application Master.

4. During normal operation, the Application Master negotiates appropriate resource Containers via the resource-request protocol.

5. Upon successful Container allocations, the Application Master launches the Container by providing the Container launch specification to the Node Manager. The launch specification typically includes the necessary information to allow the Container to communicate with the Application Master.

6. The application code executing within the Container then provides necessary information (progress, status etc.) to its Application Master via an application-specific protocol.

7. During the application execution, the client that submitted the program communicates directly with the Application Master to get status, progress updates, etc. via an application-specific protocol.

8. Once the application is complete, the Application Master de-registers with the Resource Manager and shuts down, allowing its own Container to be repurposed.

1.3.4. Using Hadoop YARN

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker -- resource management and job scheduling/monitoring -- into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).

The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.


The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, CPU, disk, network etc.

The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring their progress. From the system perspective, the ApplicationMaster itself runs as a normal container.

Use the following resources to learn more about YARN:

• YARN architecture

• Writing YARN applications

• Capacity Scheduler

• Web Application Proxy

• Using YARN REST APIs

• Hadoop Auth
