Amazon Elastic MapReduce Developer Guide
API Version 2009-03-31

Amazon Elastic MapReduce: Developer Guide Copyright © 2014 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

The following are trademarks of Amazon Web Services, Inc.: Amazon, Amazon Web Services Design, AWS, Amazon CloudFront, Cloudfront, CloudTrail, Amazon DevPay, DynamoDB, ElastiCache, Amazon EC2, Amazon Elastic Compute Cloud, Amazon Glacier, Kinesis, Kindle, Kindle Fire, AWS Marketplace Design, Mechanical Turk, Amazon Redshift, Amazon Route 53, Amazon S3, Amazon VPC. In addition, Amazon.com graphics, logos, page headers, button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other countries. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon.

All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

What is Amazon EMR? ...... 1 What Can You Do with Amazon EMR? ...... 2 Hadoop Programming on Amazon EMR ...... 2 Data Analysis and Processing on Amazon EMR ...... 3 Data Storage on Amazon EMR ...... 3 Move Data with Amazon EMR ...... 3 Amazon EMR Features ...... 3 Resizeable Clusters ...... 4 Pay Only for What You Use ...... 4 Easy to Use ...... 4 Use Amazon S3 or HDFS ...... 4 Parallel Clusters ...... 4 Hadoop Application Support ...... 4 Save Money with Spot Instances ...... 5 AWS Integration ...... 5 Instance Options ...... 5 MapR Support ...... 5 Business Intelligence Tools ...... 5 User Control ...... 5 Management Tools ...... 6 Security ...... 6 How Does Amazon EMR Work? ...... 6 Hadoop ...... 6 Nodes ...... 7 Steps ...... 8 Cluster ...... 9 What Tools are Available for Amazon EMR? ...... 10 Learn more about Amazon EMR ...... 12 Learn More About Hadoop and AWS Services Used with Amazon EMR: ...... 13 Get Started: Count Words with Amazon EMR ...... 14 Sign up for the Service ...... 15 Launch the Cluster ...... 15 Monitor the Cluster ...... 22 View the Results ...... 25 View the Debug Logs (Optional) ...... 25 Clean Up ...... 27 Where Do I Go From Here? ...... 29 Next Steps ...... 29 Plan an Amazon EMR Cluster ...... 30 Choose an AWS Region ...... 30 Choose a Region using the Console ...... 31 Specify a Region using the AWS CLI or Amazon EMR CLI ...... 32 Choose a Region Using an SDK or the API ...... 32 Choose the Number and Type of Virtual Servers ...... 33 Calculate the HDFS Capacity of a Cluster ...... 33 Guidelines for the Number and Type of Virtual Servers ...... 33 Instance Groups ...... 34 Virtual Server Configurations ...... 35 Ensure Capacity with Reserved Instances (Optional) ...... 36 Lower Costs with Spot Instances (Optional) ...... 37 Configure the Virtual Server Software ...... 53 Choose an Amazon Machine Image (AMI) ...... 53 Choose a Version of Hadoop ...... 86 File Systems compatible with Amazon EMR ...... 95 Create Bootstrap Actions to Install Additional Software (Optional) ...... 110


Choose the Cluster Lifecycle: Long-Running or Transient ...... 123 Prepare Input Data (Optional) ...... 125 Types of Input Amazon EMR Can Accept ...... 125 How to Get Data Into Amazon EMR ...... 126 Prepare an Output Location (Optional) ...... 137 Create and Configure an Amazon S3 Bucket ...... 137 What formats can Amazon EMR return? ...... 139 How to write data to an Amazon S3 bucket you don©t own ...... 139 Compress the Output of your Cluster ...... 141 Configure Access to the Cluster ...... 142 Create SSH Credentials for the Master Node ...... 142 Configure IAM User Permissions ...... 144 Set Access Policies for IAM Users ...... 147 Configure IAM Roles for Amazon EMR ...... 150 Setting Permissions on the System Directory ...... 160 Configure Logging and Debugging (Optional) ...... 160 Default Log Files ...... 161 Archive Log Files to Amazon S3 ...... 161 Enable the Debugging Tool ...... 164 Select a Amazon VPC Subnet for the Cluster (Optional) ...... 166 Clusters in a VPC ...... 167 Setting Up a VPC to Host Clusters ...... 168 Launching Clusters into a VPC ...... 170 Restricting Permissions to a VPC Using IAM ...... 171 Tagging Amazon EMR Clusters ...... 172 Tag Restrictions ...... 173 Tagging Resources for Billing ...... 174 Adding Tags to a New Cluster ...... 174 Adding Tags to an Existing Cluster ...... 175 Viewing Tags on a Cluster ...... 177 Removing Tags from a Cluster ...... 178 Use Third Party Applications With Amazon EMR (Optional) ...... 179 Use Business Intelligence Tools with Amazon EMR ...... 180 Parse Data with HParser ...... 180 Using the MapR Distribution for Hadoop ...... 181 Run a Hadoop Application to Process Data ...... 193 Build Binaries Using Amazon EMR ...... 193 JAR Requirements ...... 197 Run a Script in a Cluster ...... 197 Submitting a Custom JAR Step Using the AWS CLI or the Amazon EMR CLI ...... 197 Process Data with Streaming ...... 198 Using the Hadoop Streaming Utility ...... 199 Launch a Cluster and Submit a Streaming Step ...... 200 Process Data Using Cascading ...... 210 Launch a Cluster and Submit a Cascading Step ...... 210 Multitool Cascading Application ...... 219 Process Data with a Custom JAR ...... 226 Launch a Cluster and Submit a Custom JAR Step ...... 227 Analyze Data with Hive ...... 236 How Amazon EMR Hive Differs from ...... 236 Input Format ...... 237 Combine Splits Input Format ...... 237 Log files ...... 237 Thrift Service Ports ...... 238 Hive Authorization ...... 239 Hive File Merge Behavior with Amazon S3 ...... 239 Additional Features of Hive in Amazon EMR ...... 239 Supported Hive Versions ...... 247


Display the Hive Version ...... 258 Share Data Between Hive Versions ...... 258 Using Hive Interactively or in Batch Mode ...... 259 Launch a Cluster and Submit Hive Work ...... 260 Launch a Cluster and Submit Hive Work Using the Amazon EMR Console ...... 260 Launch a Cluster and Submit Hive Work Using the AWS CLI or the Amazon EMR CLI ...... 267 Create a Hive Metastore Outside the Cluster ...... 268 Use the Hive JDBC Driver ...... 271 Analyze Data with Impala ...... 274 What Can I Do With Impala? ...... 274 Differences from Traditional Relational Databases ...... 275 Differences from Hive ...... 275 Tutorial: Launching and Querying Impala Clusters on Amazon EMR ...... 276 Sign up for the Service ...... 276 Launch the Cluster ...... 276 Generate Test Data ...... 283 Create and Populate Impala Tables ...... 283 Query Data in Impala ...... 284 Impala Examples Included on the Amazon EMR AMI ...... 285 TPCDS ...... 285 Wikipedia ...... 286 Supported Impala Versions ...... 288 Updates for Impala 1.2.4 ...... 288 Impala Memory Considerations ...... 289 Using Impala with JDBC ...... 289 Accessing Impala Web User Interfaces ...... 289 Impala-supported File and Compression Formats ...... 290 Impala SQL Dialect ...... 290 Statements ...... 290 Functions ...... 291 Data Types ...... 291 Operators ...... 292 Clauses ...... 292 Impala User-Defined Functions ...... 292 Impala Performance Testing and Query Optimization ...... 292 Database Schema ...... 293 Sample Data ...... 293 Table Size ...... 294 Queries ...... 294 Performance Test Results ...... 296 Optimizing Queries ...... 298 Process Data with Pig ...... 300 Supported Pig Versions ...... 300 Pig Version Details ...... 304 Additional Pig Functions ...... 306 Interactive and Batch Pig Clusters ...... 306 Run Pig in Interactive Mode Using the Amazon EMR CLI ...... 306 Launch a Cluster and Submit Pig Work ...... 307 Launch a Cluster and Submit Pig Work Using the Amazon EMR Console ...... 307 Launch a Cluster and Submit Pig Work Using the AWS CLI or the Amazon EMR CLI ...... 313 Call User Defined Functions from Pig ...... 314 Call JAR files from Pig ...... 314 Call Python/Jython Scripts from Pig ...... 315 Store Data with HBase ...... 316 What Can I Do with HBase? ...... 316 Supported HBase Versions ...... 317 HBase Cluster Prerequisites ...... 317 Install HBase on an Amazon EMR Cluster ...... 318


Connect to HBase Using the Command Line ...... 325 Create a Table ...... 326 Put a Value ...... 326 Get a Value ...... 326 Back Up and Restore HBase ...... 326 Back Up and Restore HBase Using the Console ...... 327 Back Up and Restore HBase Using the AWS CLI or the Amazon EMR CLI ...... 329 Terminate an HBase Cluster ...... 340 Configure HBase ...... 340 Configure HBase Daemons ...... 340 Configure HBase Site Settings ...... 342 HBase Site Settings to Optimize ...... 345 Access HBase Data with Hive ...... 346 View the HBase User Interface ...... 347 View HBase Log Files ...... 348 Monitor HBase with CloudWatch ...... 348 Monitor HBase with Ganglia ...... 349 Analyze Amazon Kinesis Data ...... 352 What Can I Do With Amazon EMR and Amazon Kinesis Integration? ...... 352 Checkpointed Analysis of Amazon Kinesis Streams ...... 352 Provisioned IOPS Recommendations for Amazon DynamoDB Tables ...... 353 Performance Considerations ...... 354 Tutorial: Analyzing Amazon Kinesis Streams with Amazon EMR and Hive ...... 354 Sign Up for the Service ...... 354 Create an Amazon Kinesis Stream ...... 355 Create an DynamoDB Table ...... 357 Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File ...... 358 Start Amazon Kinesis Publisher Sample Application ...... 359 Launch the Cluster ...... 360 Run the Ad-hoc Hive Query ...... 366 Running Queries with Checkpoints ...... 369 Scheduling Scripted Queries ...... 370 Tutorial: Analyzing Amazon Kinesis Streams with Amazon EMR and Pig ...... 371 Sign Up for the Service ...... 372 Create an Amazon Kinesis Stream ...... 372 Create an DynamoDB Table ...... 374 Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File ...... 375 Start Amazon Kinesis Publisher Sample Application ...... 376 Launch the Cluster ...... 377 Run the Pig Script ...... 383 Scheduling Scripted Queries ...... 386 Schedule Amazon Kinesis Analysis with Amazon EMR Clusters ...... 388 Extract, Transform, and Load (ETL) Data with Amazon EMR ...... 389 Distributed Copy Using S3DistCp ...... 389 S3DistCp Options ...... 390 Adding S3DistCp as a Step in a Cluster ...... 394 S3DistCp Versions Supported in Amazon EMR ...... 399 Export, Query, and Join Tables in DynamoDB ...... 399 Prerequisites for Integrating Amazon EMR ...... 400 Step 1: Create a Key Pair ...... 401 Step 2: Create a Cluster ...... 401 Step 3: SSH into the Master Node ...... 409 Step 4: Set Up a Hive Table to Run Hive Commands ...... 411 Hive Command Examples for Exporting, Importing, and Querying Data ...... 416 Optimizing Performance ...... 423 Store Avro Data in Amazon S3 Using Amazon EMR ...... 425


Analyze Elastic Load Balancing Log Data ...... 428 Tutorial: Query Elastic Load Balancing Access Logs with Amazon Elastic MapReduce ...... 428 Sign Up for the Service ...... 429 Launch the Cluster and Run the Script ...... 430 Interactively Query the Data in Hive ...... 438 Using AWS Data Pipeline to Schedule Access Log Processing ...... 440 Manage Clusters ...... 441 View and Monitor a Cluster ...... 441 View Cluster Details ...... 442 View Log Files ...... 449 View Cluster Instances in Amazon EC2 ...... 454 Monitor Metrics with CloudWatch ...... 455 Logging Amazon Elastic MapReduce API Calls in AWS CloudTrail ...... 471 Monitor Performance with Ganglia ...... 472 Connect to the Cluster ...... 481 Connect to the Master Node Using SSH ...... 481 View Web Interfaces Hosted on Amazon EMR Clusters ...... 487 Control Cluster Termination ...... 499 Terminate a Cluster ...... 499 Managing Cluster Termination ...... 503 Resize a Running Cluster ...... 508 Resize a Cluster Using the Console ...... 509 Resize a Cluster Using the AWS CLI or the Amazon EMR CLI ...... 510 Arrested State ...... 512 Legacy Clusters ...... 515 Cloning a Cluster Using the Console ...... 516 Submit Work to a Cluster ...... 517 Add Steps Using the CLI and Console ...... 518 Submit Hadoop Jobs Interactively ...... 522 Add More than 256 Steps to a Cluster ...... 525 Associate an Elastic IP Address with a Cluster ...... 526 Assign an Elastic IP Address to a Cluster Using the Amazon EMR CLI ...... 526 View Allocated Elastic IP Addresses using Amazon EC2 ...... 528 Manage Elastic IP Addresses using Amazon EC2 ...... 528 Automate Recurring Clusters with AWS Data Pipeline ...... 529 Troubleshoot a Cluster ...... 530 What Tools are Available for Troubleshooting? ...... 530 Tools to Display Cluster Details ...... 530 Tools to View Log Files ...... 531 Tools to Monitor Cluster Performance ...... 531 Troubleshoot a Failed Cluster ...... 532 Step 1: Gather Data About the Issue ...... 532 Step 2: Check the Environment ...... 533 Step 3: Look at the Last State Change ...... 534 Step 4: Examine the Log Files ...... 534 Step 5: Test the Cluster Step by Step ...... 535 Troubleshoot a Slow Cluster ...... 536 Step 1: Gather Data About the Issue ...... 536 Step 2: Check the Environment ...... 537 Step 3: Examine the Log Files ...... 538 Step 4: Check Cluster and Instance Health ...... 539 Step 5: Check for Arrested Groups ...... 540 Step 6: Review Configuration Settings ...... 540 Step 7: Examine Input Data ...... 542 Common Errors in Amazon EMR ...... 542 Input and Output Errors ...... 542 Permissions Errors ...... 545 Memory Errors ...... 546


Resource Errors ...... 547 Streaming Cluster Errors ...... 551 Custom JAR Cluster Errors ...... 552 Hive Cluster Errors ...... 552 VPC Errors ...... 553 GovCloud-related Errors ...... 554 Write Applications that Launch and Manage Clusters ...... 555 End-to-End Amazon EMR Java Source Code Sample ...... 555 Common Concepts for API Calls ...... 559 Endpoints for Amazon EMR ...... 559 Specifying Cluster Parameters in Amazon EMR ...... 559 Availability Zones in Amazon EMR ...... 560 How to Use Additional Files and Libraries in Amazon EMR Clusters ...... 560 Amazon EMR Sample Applications ...... 560 Use SDKs to Call Amazon EMR APIs ...... 561 Using the AWS SDK for Java to Create an Amazon EMR Cluster ...... 561 Using the AWS SDK for .Net to Create an Amazon EMR Cluster ...... 562 Using the Java SDK to Sign an API Request ...... 563 Hadoop Configuration Reference ...... 564 JSON Configuration Files ...... 564 Node Settings ...... 564 Cluster Configuration ...... 566 Configuration of hadoop-user-env.sh ...... 568 Hadoop 2.2.0 and 2.4.0 Default Configuration ...... 569 Hadoop Configuration (Hadoop 2.2.0, 2.4.0) ...... 569 HDFS Configuration (Hadoop 2.2.0) ...... 578 Task Configuration (Hadoop 2.2.0) ...... 578 Intermediate Compression (Hadoop 2.2.0) ...... 591 Hadoop 1.0.3 Default Configuration ...... 592 Hadoop Configuration (Hadoop 1.0.3) ...... 593 HDFS Configuration (Hadoop 1.0.3) ...... 602 Task Configuration (Hadoop 1.0.3) ...... 603 Intermediate Compression (Hadoop 1.0.3) ...... 606 Hadoop 20.205 Default Configuration ...... 607 Hadoop Configuration (Hadoop 20.205) ...... 607 HDFS Configuration (Hadoop 20.205) ...... 611 Task Configuration (Hadoop 20.205) ...... 611 Intermediate Compression (Hadoop 20.205) ...... 614 Hadoop Memory-Intensive Configuration Settings (Legacy AMI 1.0.1 and earlier) ...... 614 Hadoop Default Configuration (AMI 1.0) ...... 617 Hadoop Configuration (AMI 1.0) ...... 618 HDFS Configuration (AMI 1.0) ...... 621 Task Configuration (AMI 1.0) ...... 622 Intermediate Compression (AMI 1.0) ...... 624 Hadoop 0.20 Streaming Configuration ...... 625 Command Line Interface Reference for Amazon EMR ...... 626 Install the Amazon EMR Command Line Interface ...... 626 Installing Ruby ...... 626 Verifying the RubyGems package management framework ...... 627 Installing the Command Line Interface ...... 628 Configuring Credentials ...... 628 SSH Setup and Configuration ...... 631 How to Call the Command Line Interface ...... 632 Command Line Interface Options ...... 632 Common Options ...... 633 Uncommon Options ...... 634 Options Common to All Step Types ...... 634 Short Options ...... 634


Adding and Modifying Instance Groups ...... 635 Adding JAR Steps to Job Flows ...... 635 Adding JSON Steps to Job Flows ...... 635 Adding Streaming Steps to Job Flows ...... 635 Assigning an Elastic IP Address to the Master Node ...... 636 Contacting the Master Node ...... 636 Creating Job Flows ...... 637 HBase Options ...... 638 Hive Options ...... 640 Impala Options ...... 640 Listing and Describing Job Flows ...... 641 Passing Arguments to Steps ...... 642 Pig Options ...... 642 Specific Steps ...... 643 Specifying Bootstrap Actions ...... 643 Tagging ...... 643 Terminating job flows ...... 643 Command Line Interface Releases ...... 644 Document History ...... 646


What is Amazon EMR?

With Amazon Elastic MapReduce (Amazon EMR) you can analyze and process vast amounts of data. It does this by distributing the computational work across a cluster of virtual servers running in the Amazon cloud. The cluster is managed using an open-source framework called Hadoop.

Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced to a single output set. One node, designated as the master node, controls the distribution of tasks. The following diagram shows a Hadoop cluster with the master node directing a group of slave nodes that process the data.

Amazon EMR has made enhancements to Hadoop and other open-source applications to work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. The cluster and the control software that manages it are together referred to as an Amazon EMR cluster.

The following diagram illustrates how Amazon EMR interacts with other AWS services.


Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR. The most popular applications, such as Hive, Pig, HBase, DistCp, and Ganglia, are already integrated with Amazon EMR.

By running Hadoop on Amazon EMR you get the benefits of the cloud:

· The ability to provision clusters of virtual servers within minutes.
· You can scale the number of virtual servers in your cluster to manage your computation needs, and only pay for what you use.
· Integration with other AWS services.

What Can You Do with Amazon EMR?

Amazon EMR simplifies running Hadoop and related big-data applications on AWS. You can use it to manage and analyze vast amounts of data. For example, a cluster can be configured to process petabytes of data.

Topics
· Hadoop Programming on Amazon EMR (p. 2)
· Data Analysis and Processing on Amazon EMR (p. 3)
· Data Storage on Amazon EMR (p. 3)
· Move Data with Amazon EMR (p. 3)

Hadoop Programming on Amazon EMR

In order to develop and deploy custom Hadoop applications, you used to need access to a lot of hardware for your Hadoop programs. Amazon EMR makes it easy to spin up a set of EC2 instances as virtual servers to run your Hadoop cluster. You can run various server configurations, such as fully loaded production servers and temporary testing servers, without having to purchase or reconfigure hardware. Amazon EMR makes it easy to configure and deploy always-on production clusters, and also to terminate unused testing clusters after your development and testing phase is complete.


Amazon EMR provides several methods of running Hadoop applications, depending on the type of program you are developing and the libraries you intend to use.

Custom JAR

Run your custom MapReduce program written in Java. Running a custom JAR gives you low-level access to the MapReduce API. You are responsible for defining and implementing the MapReduce tasks in your Java application.

Cascading

Run your application using the Cascading Java library, which provides features such as splitting and joining data streams. Using the Cascading Java library can simplify application development. With Cascading you can still access the low-level MapReduce APIs, as you can with a custom JAR application.

Streaming

Run a Hadoop job based on Map and Reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.

Data Analysis and Processing on Amazon EMR

You can also use Amazon EMR to analyze and process data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run MapReduce jobs and manipulate data using either a SQL-like syntax or a specialized language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig.

Data Storage on Amazon EMR

Distributed storage is a way to store large amounts of data over a distributed network of computers with redundancy to protect against data loss. Amazon EMR is integrated with the Hadoop Distributed File System (HDFS) and Apache HBase.

Move Data with Amazon EMR

You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), DynamoDB, and Apache HBase.

Amazon EMR Features

Using Amazon Elastic MapReduce (Amazon EMR) to run Hadoop on Amazon Web Services offers many advantages.

Topics
· Resizeable Clusters (p. 4)
· Pay Only for What You Use (p. 4)
· Easy to Use (p. 4)
· Use Amazon S3 or HDFS (p. 4)
· Parallel Clusters (p. 4)
· Hadoop Application Support (p. 4)


· Save Money with Spot Instances (p. 5)
· AWS Integration (p. 5)
· Instance Options (p. 5)
· MapR Support (p. 5)
· Business Intelligence Tools (p. 5)
· User Control (p. 5)
· Management Tools (p. 6)
· Security (p. 6)

Resizeable Clusters

When you run your Hadoop cluster on Amazon EMR, you can easily expand or shrink the number of virtual servers in your cluster depending on your processing needs. Adding or removing servers takes minutes, which is much faster than making similar changes in clusters running on physical servers.
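The interfaces this guide documents for resizing are the console, the CLI, and the APIs. Purely as an illustrative sketch (not an example from this guide), the following shows how a resize request might look with the AWS SDK for Python (boto3); the cluster ID and target instance count are placeholder assumptions.

```python
# Illustrative sketch only (not from this guide): increase the size of a running
# cluster's core instance group using the AWS SDK for Python (boto3).
# The cluster ID and target instance count are placeholder values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster_id = "j-XXXXXXXXXXXX"  # placeholder cluster ID

# Find the CORE instance group of the cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core_group = next(g for g in groups if g["InstanceGroupType"] == "CORE")

# Request that the core group grow to 10 instances.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": core_group["Id"], "InstanceCount": 10}],
)
```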

Pay Only for What You Use

By running your cluster on Amazon EMR, you only pay for the computational resources you use. You do not pay ongoing overhead costs for hardware maintenance and upgrades, and you do not have to pre-purchase extra capacity to meet peak needs. For example, if the amount of data you process in a daily cluster peaks on Monday, you can increase the number of servers to 50 in the cluster that day, and then scale back to 10 servers in the clusters that run on other days of the week. You won't have to pay to maintain those additional 40 servers during the rest of the week, as you would with physical servers. For more information, see Amazon Elastic MapReduce Pricing.

Easy to Use

When you launch a cluster on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.

Use Amazon S3 or HDFS

The version of Hadoop installed on Amazon EMR clusters is integrated with Amazon S3, which means that you can store your input and output data in Amazon S3, on the cluster in HDFS, or a mix of both. Amazon S3 can be accessed like a file system from applications running on your Amazon EMR cluster.

Parallel Clusters

If your input data is stored in Amazon S3, you can have multiple clusters accessing the same data simultaneously.

Hadoop Application Support

You can use popular Hadoop applications such as Hive, Pig, and HBase with Amazon EMR. For more information, see Analyze Data with Hive (p. 236), Process Data with Pig (p. 300), and Store Data with HBase (p. 316).


Save Money with Spot Instances

Spot Instances are a way to purchase virtual servers for your cluster at a discount. Excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand. You set a maximum bid price that you are willing to pay for a certain configuration of virtual server. While the price of Spot Instances for that type of server is below your bid price, the servers are added to your cluster and you are billed the Spot Price rate. When the Spot Price rises above your bid price, the servers are terminated.

For more information about how to use Spot Instances effectively in your cluster, see Lower Costs with Spot Instances (Optional) (p. 37).
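The guide covers Spot Instances through the console and CLI; as a purely illustrative sketch (not from this guide), the following shows how adding a Spot task group might look with the AWS SDK for Python (boto3). The cluster ID, instance type, count, and bid price are placeholder assumptions.

```python
# Illustrative sketch only (not from this guide): add a task instance group that
# runs on Spot Instances, using the AWS SDK for Python (boto3).
# The cluster ID, instance type, count, and bid price are placeholder values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXX",       # placeholder cluster (job flow) ID
    InstanceGroups=[
        {
            "Name": "Spot task group",
            "InstanceRole": "TASK",    # task nodes can be added and removed freely
            "InstanceType": "m1.medium",
            "InstanceCount": 4,
            "Market": "SPOT",
            "BidPrice": "0.10",        # maximum bid price in USD, as a string
        }
    ],
)
```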

AWS Integration

Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline. This means that you can easily access data stored in AWS from your cluster, and you can make use of the functionality offered by other Amazon Web Services to manage your cluster and store the output of your cluster.

For example, you can use Amazon EMR to analyze data stored in Amazon S3 and output the results to Amazon RDS or DynamoDB. Using CloudWatch, you can monitor the performance of your cluster, and you can automate recurring clusters with AWS Data Pipeline. As new services are added, you'll be able to make use of those new technologies as well. For more information, see Monitor Metrics with CloudWatch (p. 455) and Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR (p. 399).

Instance Options

When you launch a cluster on Amazon EMR, you specify the size and capabilities of the virtual servers used in the cluster. This way, you can match the virtual servers to the processing needs of the cluster. You can choose virtual server instances to reduce cost, speed up performance, or store large amounts of data.

For example, you might launch one cluster with high-storage virtual servers to host a data warehouse, and launch a second cluster on virtual servers with high memory to improve performance. Because you are not locked into a given hardware configuration as you are with physical servers, you can adjust each cluster to your requirements. For more information about the server configurations available using Amazon EMR, see Choose the Number and Type of Virtual Servers (p. 33).

MapR Support

Amazon EMR supports several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

Business Intelligence Tools

Amazon EMR integrates with popular business intelligence (BI) tools such as Tableau, MicroStrategy, and Datameer. For more information, see Use Business Intelligence Tools with Amazon EMR (p. 180).

User Control

When you launch a cluster using Amazon EMR, you have root access to the cluster and can install software and configure the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


Management Tools

You can manage your clusters using the Amazon EMR console (a web-based user interface), a command line interface, web service APIs, and a variety of SDKs. For more information, see What Tools are Available for Amazon EMR? (p. 10).

Security

You can run Amazon EMR in an Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles, which you can use to control access to your cluster and set permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster (p. 142).

How Does Amazon EMR Work?

Amazon Elastic MapReduce (Amazon EMR) is a service you can use to run managed Hadoop clusters on Amazon Web Services. A Hadoop cluster is a set of servers that work together to perform computational tasks by distributing the work and data among the servers. The task might be to analyze data, store data, or to move and transform data. By using several computers linked together in a cluster, you can run tasks that process or store vast amounts (petabytes) of data.

When Amazon EMR launches a Hadoop cluster, it runs the cluster on virtual servers provided by Amazon EC2. Amazon EMR has made enhancements to the version of Hadoop it installs on the servers to work seamlessly with AWS. This provides several advantages, as described in Amazon EMR Features (p. 3).

In addition to integrating Hadoop with AWS, Amazon EMR adds some new concepts to distributed processing, such as nodes and steps.

Topics
· Hadoop (p. 6)
· Nodes (p. 7)
· Steps (p. 8)
· Cluster (p. 9)

Hadoop

Apache Hadoop is an open-source Java software framework that supports massive data processing across a cluster of servers. It can run on a single server or on thousands of servers. Hadoop uses a programming model called MapReduce to distribute processing across multiple servers. It also implements a distributed file system called HDFS that stores data across multiple servers. Hadoop monitors the health of servers in the cluster and can recover from the failure of one or more nodes. In this way, Hadoop provides not only increased processing and storage capacity, but also high availability.

For more information, see http://hadoop.apache.org.

Topics
· MapReduce (p. 7)
· HDFS (p. 7)
· Jobs and Tasks (p. 7)
· Hadoop Applications (p. 7)


MapReduce

MapReduce is a programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic except the Map and Reduce functions. The Map function maps data to sets of key/value pairs called intermediate results. The Reduce function combines the intermediate results, applies additional algorithms, and produces the final output.

For more information, see http://wiki.apache.org/hadoop/HadoopMapReduce.
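To make the Map and Reduce roles concrete, the following minimal, self-contained Python sketch (plain Python, not Hadoop and not from this guide) counts words the way a MapReduce job would: the map step emits intermediate key/value pairs, the values are grouped by key, and the reduce step combines them into the final output.

```python
# Minimal word-count sketch illustrating the MapReduce model (plain Python,
# not Hadoop): map emits intermediate (word, 1) pairs; reduce sums them per key.
from collections import defaultdict

def map_function(line):
    """Emit an intermediate (key, value) pair for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_function(word, counts):
    """Combine all intermediate values for one key into a final result."""
    return (word, sum(counts))

lines = ["Hello Hadoop", "hello Amazon EMR"]

# Shuffle/sort phase: group intermediate values by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_function(line):
        grouped[word].append(count)

# Reduce phase: produce the final output set.
results = [reduce_function(word, counts) for word, counts in grouped.items()]
print(sorted(results))  # [('amazon', 1), ('emr', 1), ('hadoop', 1), ('hello', 2)]
```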

HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system for Hadoop. HDFS distributes the data it stores across servers in the cluster, storing multiple copies of data on different servers to ensure that no data is lost if an individual server fails. HDFS is ephemeral storage that is reclaimed when you terminate the cluster. HDFS is useful for caching intermediate results during MapReduce processing or as the basis of a data warehouse for long-running clusters.

For more information, see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html.

Amazon EMR extends Hadoop to add the ability to reference data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. If you store intermediate results in Amazon S3, however, be aware that data will stream between every slave node in the cluster and Amazon S3. This could potentially overrun the limit of 200 transactions per second to Amazon S3. Most often, Amazon S3 is used to store input and output data, and intermediate results are stored in HDFS.

Jobs and Tasks

In Hadoop, a job is a unit of work. Each job may consist of one or more tasks, and each task may be attempted one or more times until it succeeds. Amazon EMR adds a new unit of work to Hadoop, the step, which may contain one or more Hadoop jobs. For more information, see Steps (p. 8).

You can submit work to your cluster in a variety of ways. For more information, see How to Send Work to a Cluster (p. 9). Hadoop Applications

Hadoop is a popular open-source distributed computing architecture. Other open-source applications such as Hive, Pig, and HBase run on top of Hadoop and extend its functionality by adding features such as queries of data stored on a cluster and data warehouse functionality.

For more information, see Analyze Data with Hive (p. 236), Process Data with Pig (p. 300), and Store Data with HBase (p. 316).

Nodes

Amazon EMR defines three roles for the servers in a cluster. These different roles are referred to as node types. The Amazon EMR node types map to the master and slave roles defined in Hadoop.

· Master node - Manages the cluster, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. There is only one master node in a cluster. This maps to the Hadoop master node.
· Core nodes - Run tasks and store data using the Hadoop Distributed File System (HDFS). This maps to a Hadoop slave node.


· Task nodes (optional) - Run tasks. This maps to a Hadoop slave node.

For more information, see Instance Groups (p. 34). For details on mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Steps

Amazon EMR defines a unit of work called a step, which can contain one or more Hadoop jobs. A step is an instruction that manipulates the data. For example, a cluster that processes encrypted data might contain the following steps.

Step 1 Decrypt data

Step 2 Process data

Step 3 Encrypt data

Step 4 Save data

You can track the progress of steps by checking their state. The following diagram shows the processing of a series of steps.

A cluster contains one or more steps. Steps are processed in the order in which they are listed in the cluster. Steps are run following this sequence: all steps have their state set to PENDING. The first step is run and its state is set to RUNNING. When the step is completed, its state changes to COMPLETED. The next step in the queue is run, and its state is set to RUNNING. After each step completes, its state is set to COMPLETED and the next step in the queue is run. Steps are run until there are no more steps. Processing flow then returns to the cluster.


If a step fails, the step state is FAILED and all remaining steps with a PENDING state are marked as CANCELLED. No further steps are run and processing returns to the cluster.

Data is normally communicated from one step to the next using files stored on the cluster's Hadoop Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When the cluster is shut down, all data is deleted. The final step in a cluster typically stores the processing results in an Amazon S3 bucket.

For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.
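As a purely illustrative sketch (not from this guide), the following checks the state of each step in a cluster with the AWS SDK for Python (boto3); the cluster ID is a placeholder.

```python
# Illustrative sketch only (not from this guide): list the steps of a cluster and
# print each step's current state using the AWS SDK for Python (boto3).
# The cluster ID is a placeholder value.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.list_steps(ClusterId="j-XXXXXXXXXXXX")  # placeholder cluster ID

# Print each step's name and state (for example PENDING, RUNNING, COMPLETED,
# CANCELLED, or FAILED).
for step in response["Steps"]:
    print(step["Name"], step["Status"]["State"])
```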

Beginning with AMI 3.1.1 (Hadoop 2.x) and AMI 2.4.8 (Hadoop 1.x), the maximum number of PENDING and ACTIVE steps allowed in a cluster is 256 (this includes system steps such as install Pig, install Hive, install HBase, and configure debugging). You can submit an unlimited number of steps over the lifetime of a long-running cluster created using these AMIs, but only 256 steps can be ACTIVE or PENDING at any given time. For more information about adding steps to a cluster, see Submit Work to a Cluster (p. 517).

Cluster

A cluster is a set of servers that perform the work. In Amazon EMR, the cluster is a set of virtual servers running as EC2 instances.

How to Send Work to a Cluster

When you run your cluster on Amazon EMR, you have several options for specifying the work that needs to be done.

· Provide the entire definition of the work to be done in the Map and Reduce functions. This is typically done for clusters that process a set amount of data and then terminate when processing is complete. For more information, see Run a Hadoop Application to Process Data (p. 193).
· Create a long-running cluster and use the console, the Amazon EMR API, the AWS CLI, or the Amazon EMR CLI to submit steps, which may contain one or more Hadoop jobs (see the sketch following this list). For more information, see Submit Work to a Cluster (p. 517).
· Create a cluster with a Hadoop application such as Hive, Pig, or HBase installed, and use the interface provided by the application to submit queries, either scripted or interactively. For more information, see Analyze Data with Hive (p. 236), Process Data with Pig (p. 300), and Store Data with HBase (p. 316).
· Create a long-running cluster, connect to it, and submit Hadoop jobs using the Hadoop API. For more information, see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobClient.html.
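As a purely illustrative sketch of the second option (not from this guide), the following submits a streaming step to an existing cluster with the AWS SDK for Python (boto3). The cluster ID and output bucket are placeholders; the mapper, reducer, and input locations are the word count sample used later in this guide, and the streaming JAR path may vary by AMI.

```python
# Illustrative sketch only (not from this guide): submit a streaming step to an
# existing cluster with the AWS SDK for Python (boto3). The cluster ID and output
# bucket are placeholders; the streaming JAR location may vary by AMI/Hadoop version.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",          # placeholder cluster (job flow) ID
    Steps=[
        {
            "Name": "Word count streaming step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
                "Args": [
                    "-mapper", "s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
                    "-reducer", "aggregate",
                    "-input", "s3n://elasticmapreduce/samples/wordcount/input",
                    "-output", "s3://mybucket/wordcount/output",  # placeholder bucket
                ],
            },
        }
    ],
)
```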

Life Cycle of a Cluster

The following diagram shows the life cycle of a cluster and how each stage maps to a particular cluster state.


A successful Amazon Elastic MapReduce (Amazon EMR) cluster follows this process: Amazon EMR first provisions a Hadoop cluster. During this phase, the cluster state is STARTING. Next, any user-defined bootstrap actions are run. During this phase, the cluster state is BOOTSTRAPPING. After all bootstrap actions are completed, the cluster state is RUNNING. The job flow sequentially runs all cluster steps during this phase.

If you configured your cluster as a long-running cluster by enabling keep alive, the cluster will go into a WAITING state after processing is done and wait for the next set of instructions. For more information, see How to Send Work to a Cluster (p. 9) and Choose the Cluster Lifecycle: Long-Running or Transient (p. 123). You will have to manually terminate the cluster when you no longer require it.

If you configured your cluster as a transient cluster, it will automatically shut down after all of the steps complete.

When a cluster terminates without encountering an error, the state transitions to SHUTTING_DOWN and the cluster shuts down, terminating the virtual server instances. All data stored on the cluster is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists. Finally, when all cluster activity is complete, the cluster state is marked as COMPLETED.

Unless termination protection is enabled, any failure during the cluster process terminates the cluster and all its virtual server instances. Any data stored on the cluster is deleted. The cluster state is marked as FAILED. For more information, see Managing Cluster Termination (p. 503).

For a complete list of cluster states, see the JobFlowExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.

What Tools are Available for Amazon EMR?

There are several ways you can interact with Amazon EMR:

· Console - A graphical interface that you can use to launch and manage clusters. With it, you fill out web forms to specify the details of clusters to launch, view the details of existing clusters, and debug and terminate clusters. Using the console is the easiest way to get started with Amazon EMR; no programming knowledge is required. The console is available online at https://console.aws.amazon.com/elasticmapreduce/.
· AWS CLI (Command Line Interface) - A client application you run on your local machine to connect to Amazon EMR and create and manage clusters. The AWS CLI contains a feature-rich set of commands specific to Amazon EMR. With it, you can write scripts that automate the process of launching and managing clusters. Using the AWS CLI is the best option if you prefer working from a command line. For more information on using the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
· Amazon EMR CLI - A legacy client application you run on your local machine to connect to Amazon EMR and create and manage clusters. With it, you can write scripts that automate the process of launching and managing clusters. The Amazon EMR CLI is no longer under feature development. Customers using the Amazon EMR CLI are encouraged to migrate to the AWS CLI; new users should download the AWS CLI, not the Amazon EMR CLI. For more information on using the Amazon EMR CLI, see Command Line Interface Reference for Amazon EMR (p. 626).
· Software Development Kit (SDK) - AWS provides an SDK with functions that call Amazon EMR to create and manage clusters. With it, you can write applications that automate the process of creating and managing clusters (see the sketch following this list). Using the SDK is the best option if you want to extend or customize the functionality of Amazon EMR. You can download the AWS SDK for Java from http://aws.amazon.com/sdkforjava/. For more information about the AWS SDKs, refer to the list of current AWS SDKs. Libraries are available for Java, C#, VB.NET, and PHP. For more information, see Sample Code & Libraries (http://aws.amazon.com/code/Elastic-MapReduce).
· Web Service API - AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls Amazon EMR. For more information, see the Amazon EMR API Reference.
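The SDK examples in this guide use the AWS SDK for Java; purely as an illustrative sketch (not from this guide), the following shows how launching a small cluster might look with the AWS SDK for Python (boto3). The key pair name, log bucket, AMI version, and role names are placeholder assumptions.

```python
# Illustrative sketch only (not from this guide): launch a small cluster with the
# AWS SDK for Python (boto3). Key pair, log URI, AMI version, and role names are
# placeholder values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="My development cluster",
    AmiVersion="3.1.1",                       # placeholder AMI version
    LogUri="s3://mybucket/emr-logs/",         # placeholder log bucket
    Instances={
        "MasterInstanceType": "m1.medium",
        "SlaveInstanceType": "m1.medium",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "Ec2KeyName": "my-key-pair",          # placeholder EC2 key pair
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # assumed default instance profile
    ServiceRole="EMR_DefaultRole",            # assumed default EMR service role
    VisibleToAllUsers=True,
)

print("Started cluster:", response["JobFlowId"])
```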

The Amazon EMR interfaces (the console, the AWS CLI, and the API, SDK, and libraries) differ in which of the following functions they support:

· Create multiple clusters
· Define bootstrap actions in a cluster
· View logs for Hadoop jobs, tasks, and task attempts using a graphical interface
· Implement Hadoop data processing programmatically
· Monitor clusters in real time
· Provide verbose cluster details
· Resize running clusters
· Run clusters with multiple steps
· Select version of Hadoop, Hive, and Pig
· Specify the MapReduce executable in multiple computer languages
· Specify the number and type of EC2 instances that process the data
· Transfer data to and from Amazon S3 automatically
· Terminate clusters


Learn more about Amazon EMR

The following resources are useful as you work with Amazon EMR.

· AWS Blog - The AWS big data blog contains technical articles designed to help you collect, store, clean, process, and visualize big data.
· Amazon Elastic MapReduce Developer Guide - This document. Provides conceptual information about Amazon EMR and describes how to use Amazon EMR features.
· Amazon Elastic MapReduce API Reference - Contains a technical description of all Amazon EMR APIs.
· Getting Started Analyzing Big Data with AWS - This tutorial explains how to use Amazon EMR and Apache Hive to analyze web server log files and query them for information without writing any code at all.
· Amazon EMR Technical FAQ - Covers the top questions developers have asked about this product.
· Amazon EMR Release Notes - Gives a high-level overview of the current release, and notes any new features, corrections, and known issues.
· Amazon EMR Articles and Tutorials - A list of articles, tutorials, and videos about Amazon EMR. Topics include tutorials that walk you through using Amazon EMR to solve a specific business problem and using third-party applications with Amazon EMR.
· AWS Developer Resource Center - A central starting point to find documentation, code samples, release notes, and other information to help you build innovative applications with AWS.
· Amazon EMR console - Enables you to perform most of the functions of Amazon EMR and other AWS products without programming.
· Amazon EMR Forum - A community-based forum for developers to discuss technical questions related to Amazon EMR.
· AWS Support Center - The home page for AWS Support, including access to our Developer Forums, Technical FAQs, Service Status page, and AWS Premium Support (if you are subscribed to this program).
· Amazon EMR Product Information - The primary web page for information about Amazon EMR.
· Contact Us - A form for questions related to your AWS account. This form is only for account questions. For technical questions, use the Discussion Forums.

Learn More About Hadoop and AWS Services Used with Amazon EMR:

· Hadoop. For more information, go to http://hadoop.apache.org/core/.
· Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon SimpleDB, and CloudWatch. For more information, see the Amazon EC2 User Guide for Linux Instances, the Amazon Simple Storage Service Developer Guide, the Amazon SimpleDB Developer Guide, and the Amazon CloudWatch Developer Guide, respectively.


Get Started: Count Words with Amazon EMR

Now that you know what Amazon EMR can do, let's walk through a tutorial that uses mapper and reducer functions to analyze data in a streaming cluster. In this example, you'll use Amazon EMR to count the frequency of words in a text file. The mapper logic is written as a Python script, and you'll use the built-in aggregate function provided by Hadoop as the reducer. Using the Amazon EMR console, you'll launch a cluster of virtual servers to process the data in a distributed fashion, according to the logic in the Python script and the aggregate function.
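The tutorial uses the sample mapper script wordSplitter.py stored in a public Amazon S3 bucket; the exact contents of that script are not shown in this guide. As a rough sketch of what such a streaming mapper can look like (the actual sample may differ), the following Python reads lines from standard input and emits one key/value line per word in the format that Hadoop's aggregate reducer understands: the LongValueSum prefix asks the reducer to sum the values emitted for each key.

```python
#!/usr/bin/env python
# Hedged sketch of a streaming word-count mapper (the actual wordSplitter.py
# sample may differ). Hadoop Streaming feeds input lines on stdin; each output
# line is a tab-separated key/value pair. The LongValueSum: prefix tells Hadoop's
# built-in aggregate reducer to sum the values emitted for each word.
import re
import sys

WORD = re.compile(r"[\w']+")

def main():
    for line in sys.stdin:
        for word in WORD.findall(line):
            sys.stdout.write("LongValueSum:" + word.lower() + "\t1\n")

if __name__ == "__main__":
    main()
```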

In addition to the console used in this tutorial, Amazon EMR provides a command-line client, a REST-like API set, and several SDKs that you can use to launch and manage clusters. For more information about these interfaces, see What Tools are Available for Amazon EMR? (p. 10).

For console access, use your IAM user name and password to sign in to the AWS Management Console using the IAM sign-in page. IAM lets you securely control access to AWS services and resources in your AWS account. For more information about creating access keys, see How Do I Get Security Credentials? in the AWS General Reference.

How Much Does it Cost to Run this Tutorial?

The AWS service charges incurred by working through this tutorial are the cost of running an Amazon EMR cluster containing three m1.small instances for one hour. These prices vary by region and storage used. If you are a new customer within your first year of using AWS, the Amazon S3 storage charges are potentially waived, provided you have not used the capacity allowed in the Free Usage Tier. Amazon EMR charges are not included in the Free Usage Tier.

AWS service pricing is subject to change. For current pricing information, see the AWS Service Pricing Overview and use the AWS Simple Monthly Calculator to estimate your bill.

Topics
· Sign up for the Service (p. 15)
· Launch the Cluster (p. 15)
· Monitor the Cluster (p. 22)
· View the Results (p. 25)
· Clean Up (p. 27)
· Where Do I Go From Here? (p. 29)


Sign up for the Service

If you do not have an AWS account, use the following procedure to create one.

To sign up for AWS

1. Open http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use. Your AWS account gives you access to all services, but you are charged only for the resources that you use. For this example walk-through, the charges will be minimal.


Launch the Cluster

The next step is to launch the cluster. When you do, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These EC2 instances are preloaded with an Amazon Machine Image (AMI) that has been customized for Amazon EMR and which has Hadoop and other big data applications preloaded.

To launch the Amazon EMR cluster

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, click Configure sample application and complete the fields as follows:

Select sample application: Select Word count.

Output location: Type the path of an Amazon S3 bucket to store your output. If the bucket does not exist, the Amazon EMR console creates it for you.

Logging: Choose Enabled. This determines whether Amazon EMR captures detailed log data to Amazon S3. When logging is enabled, you need to enter the output location. For more information, see View Log Files (p. 449).

Debugging: Check the box to enable debugging. This option creates a debug log index in Amazon SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

When you have finished configuring the sample Word Count application, click OK.
4. In the Software Configuration section, verify the fields as follows:

Hadoop distribution: Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version: Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list. The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields as follows:

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Network: Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide. Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone: Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone. For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master: Accept the default instance type. The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster. This specifies the EC2 instance type to use for the master node. The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances: Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core: Accept the default instance type. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. This specifies the EC2 instance types to use as core nodes. The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count: Choose 2.

Request Spot Instances: Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task: Accept the default instance type. Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon. This specifies the EC2 instance types to use as task nodes. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count: Choose 0.

Request Spot Instances: Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields as follows:

EC2 key pair: Choose your Amazon EC2 key pair from the list. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142). Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access: Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144). Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role: Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role. This role allows Amazon EMR to access other AWS services on your behalf. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile: You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default role. This controls application access to the Amazon EC2 instances in the cluster. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110). 8. In the Steps section, note the step that Amazon EMR configured for you by choosing the sample application.You can modify these settings to meet your needs. Complete the fields according to the following table.

Field Action

Add step: Leave this option set to Select a step. For more information, see Steps (p. 8).



Auto-terminate: Choose No.

This determines what the cluster does after its last step. Yes means the cluster auto-terminates after the last step completes. No means the cluster runs until you manually terminate it.

Remember to terminate the cluster when it is done so you do not continue to accrue charges on an idle cluster.

The preceding image shows the step details section with a red circle around the edit step (pencil) and delete step (X) buttons. If you click the edit button, you can edit the following settings.

Field Action

Mapper: Set this field to s3n://elasticmapreduce/samples/wordcount/wordSplitter.py.

Reducer: Set this field to aggregate.

Input S3 location: Set this field to s3n://elasticmapreduce/samples/wordcount/input.

Output S3 location: Set this field to s3://example-bucket/wordcount/output/2013-11-11/11-07-05.

Arguments: Leave this field blank.

Action on failure: Set this field to Terminate cluster.

9. Review your configuration and if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.

Next, Amazon EMR begins to count the words in the text of the CIA World Factbook, which is pre-configured in an Amazon S3 bucket as the input data for demonstration purposes. When the cluster is finished processing the data, Amazon EMR copies the word count results into the output Amazon S3 bucket that you chose in the previous steps.
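If you later want to submit the same word count step programmatically rather than through the console, the following is a minimal sketch using the AWS SDK for Java. It assumes the bucket name and output path shown in the table above, and the streaming jar path on the cluster can vary by AMI version.

import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class WordCountStep {
    // Builds the same streaming step the console configures for the sample application.
    public static StepConfig buildWordCountStep() {
        HadoopJarStepConfig streamingJar = new HadoopJarStepConfig()
            // Streaming jar location on the cluster; the exact path depends on the AMI version.
            .withJar("/home/hadoop/contrib/streaming/hadoop-streaming.jar")
            .withArgs("-input", "s3n://elasticmapreduce/samples/wordcount/input",
                      "-output", "s3://example-bucket/wordcount/output/2013-11-11/11-07-05",
                      "-mapper", "s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
                      "-reducer", "aggregate");

        return new StepConfig()
            .withName("Word count")
            .withActionOnFailure("TERMINATE_JOB_FLOW")   // equivalent of "Terminate cluster" in the console
            .withHadoopJarStep(streamingJar);
    }
}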


Monitor the Cluster

There are several ways to gain information about your cluster while it is running.

· Query Amazon EMR using the console, command-line interface (CLI), or programmatically (a minimal SDK sketch follows this list).
· Amazon EMR automatically reports metrics about the cluster to CloudWatch. These metrics are provided free of charge. You can access them either through the CloudWatch interface or in the Amazon EMR console. For more information, see Monitor Metrics with CloudWatch (p. 455).
· Create an SSH tunnel to the master node and view the Hadoop web interfaces. Creating an SSH tunnel requires that you specify a value for Amazon EC2 Key Pair when you launch the cluster. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).
· Run a bootstrap action when you launch the cluster to install the Ganglia monitoring application. You can then create an SSH tunnel to view the Ganglia web interfaces. Creating an SSH tunnel requires that you specify a value for Amazon EC2 Key Pair when you launch the cluster. For more information, see Monitor Performance with Ganglia (p. 472).
· Use SSH to connect to the master node and browse the log files. Creating an SSH connection requires that you specify a value for Amazon EC2 Key Pair when you launch the cluster.
· View the archived log files on Amazon S3. This requires that you specify a value for Amazon S3 Log Path when you create the cluster. For more information, see View Log Files (p. 449).
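For the programmatic option, the following is a minimal sketch using the AWS SDK for Java that lists your clusters and prints their states; the cluster ID shown is a placeholder, and the client and credential setup may differ in your environment.

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.ClusterSummary;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.ListClustersRequest;

public class MonitorCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr =
            new AmazonElasticMapReduceClient(new ProfileCredentialsProvider());

        // List the clusters visible to this account and print the current state of each.
        for (ClusterSummary summary : emr.listClusters(new ListClustersRequest()).getClusters()) {
            System.out.println(summary.getId() + " : " + summary.getStatus().getState());
        }

        // Fetch the full details of a single cluster; "j-XXXXXXXXXXXXX" is a placeholder ID.
        System.out.println(emr.describeCluster(
                new DescribeClusterRequest().withClusterId("j-XXXXXXXXXXXXX"))
            .getCluster().getStatus().getState());
    }
}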

In this tutorial, you'll monitor the cluster using the Amazon EMR console.

To monitor the cluster using the Amazon EMR console

1. Click Cluster List in the Amazon EMR console. This shows a list of clusters to which your account has access and the status of each. In this example, you see a cluster in the Starting status. There are other possible status messages, for example Running, Waiting, Terminated (All steps completed), Terminated (User request), Terminated with errors (Validation error), etc.

2. Click the details icon next to your cluster to see the cluster details page.

In this example, the cluster is in Starting status, provisioning the compute resources needed for the Word Count application. When the cluster finishes, it will sit idle in the Waiting status because we did not configure the cluster to terminate automatically. Remember to terminate your cluster to avoid additional charges.


3. The Monitoring section displays metrics about the cluster. These metrics are reported to CloudWatch and can also be viewed from the CloudWatch console. The charts track various cluster statistics over time, for example:

· Number of jobs the cluster is running
· Status of each node in the cluster
· Number of remaining map and reduce tasks
· Number of Amazon S3 and HDFS bytes read/written

Note
The statistics in the Monitoring section may take several minutes to populate. In addition, the Word Count sample application runs very quickly and may not generate highly detailed runtime information.

For more information about these metrics and how to interpret them, see Monitor Metrics with CloudWatch (p. 455).

4. In the Software Configuration section, you can see details about the software configuration of the cluster; for example:

· The AMI version of the nodes in the cluster
· The Hadoop Distribution
· The Log URI to store output logs

5. In the Hardware Configuration section, you can see details about the hardware configuration of the cluster, for example:

· The Availability Zone the cluster runs within


· The number of master, core, and task nodes including their instance sizes and status

In addition, you can control Termination Protection and resize the cluster.

6. In the Steps section, you can see details about each step in the cluster. In addition, you can add steps to the cluster.

In this example, you can see that the cluster had two steps: the Word count step (streaming step) and the Setup hadoop debugging step (script-runner step). If you enable debugging, Amazon EMR automatically adds the Setup hadoop debugging step to copy logs from the cluster to Amazon S3.

For more information about how steps are used in a cluster, see Life Cycle of a Cluster (p. 9).

· Click the arrow next to the Word Count step to see more information about the step.

In this example, you can determine the following:

· The step uses a streaming JAR located on the cluster
· The input is files in an Amazon S3 location
· The output writes to an Amazon S3 location
· The mapper is a Python script named wordSplitter.py
· The final output compiles using the aggregate reducer
· The cluster will terminate if it encounters an error


7. Lastly, the Bootstrap Actions section lists the bootstrap actions run by the cluster, if any. In this example, the cluster has not run any bootstrap actions to initialize the cluster. For more information about how to use bootstrap actions in a cluster, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

View the Results

After the cluster is complete, the results of the word frequency count are stored in the folder you specified on Amazon S3 when you launched the cluster.

To view the output of the cluster

1. From the Amazon S3 console, select the bucket you used for the output location when you configured the sample application.
2. Select the output folder, click Actions, and then select Open.

The results of running the cluster are stored in text files. The first file in the listing is an empty file titled according to the result of the cluster. In this case, it is titled "_SUCCESS" to indicate that the cluster succeeded.

3. To download each file, right-click on it and select Download.
4. Open the text files using a text editor such as Notepad (Windows), TextEdit (Mac OS), or gEdit (Linux). In the output files, you should see a column that displays each word found in the source text followed by a column that displays the number of times that word was found.
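As an alternative to downloading the files by hand, the following is a minimal sketch using the AWS SDK for Java that copies everything under the output prefix to the local working directory. The bucket name and prefix are the example values from this tutorial; substitute your own.

import java.io.File;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class DownloadResults {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(new ProfileCredentialsProvider());

        String bucket = "example-bucket";                           // example bucket from this tutorial
        String prefix = "wordcount/output/2013-11-11/11-07-05/";    // example output folder

        // Download every object under the output prefix; large listings may need paging.
        for (S3ObjectSummary obj : s3.listObjects(bucket, prefix).getObjectSummaries()) {
            String name = obj.getKey().substring(obj.getKey().lastIndexOf('/') + 1);
            if (!name.isEmpty()) {
                s3.getObject(new GetObjectRequest(bucket, obj.getKey()), new File(name));
            }
        }
    }
}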

The other output generated by the cluster is a set of log files that detail the progress of the cluster. Viewing the log files can provide insight into the workings of the cluster, and can help you troubleshoot any problems that arise.

View the Debug Logs (Optional)

If you encounter any errors, you can use the debug logs to gather more information and troubleshoot the problem.

To view cluster logs using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. From the Cluster List page, click the details icon next to the cluster you want to view.

This brings up the Cluster Details page. In the Steps section, the links to the right of each step display the various types of logs available for the step. These logs are generated by Amazon EMR.

3. To view a list of the Hadoop jobs associated with a given step, click the View Jobs link to the right of the step.


4. To view a list of the Hadoop tasks associated with a given job, click the View Tasks link to the right of the job.

5. To view a list of the attempts a given task has run while trying to complete, click the View Attempts link to the right of the task.


6. To view the logs generated by a task attempt, click the stderr, stdout, and syslog links to the right of the task attempt.

Clean Up

Now that you've completed the tutorial, you should delete the Amazon S3 bucket that you created to ensure that your account does not accrue additional storage charges.

You do not need to delete the completed cluster. After a cluster ends, it terminates the associated EC2 instances and no longer accrues Amazon EMR maintenance charges. Amazon EMR preserves metadata information about completed clusters for your reference, at no charge, for two months. The console does not provide a way to delete completed clusters; they are automatically removed for you after two months.

Buckets with objects in them cannot be deleted. Before deleting a bucket, all objects within the bucket must be deleted.

You should also disable logging for your Amazon S3 bucket. Otherwise, logs might be written to your bucket immediately after you delete your bucket's objects.


To disable logging

1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Right-click your bucket and select Properties.
3. Click the Logging tab.
4. Deselect the Enabled check box to disable logging.

To delete an object

1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Click the bucket where the objects are stored.
3. Right-click the object to delete.
Tip
You can use the SHIFT and CTRL keys to select multiple objects and perform the same action on them simultaneously.
4. Click Delete.
5. Confirm the deletion when the console prompts you.

To delete a bucket, you must first delete all of the objects in it.

To delete a bucket

1. Right-click the bucket to delete.
2. Click Delete.
3. Confirm the deletion when the console prompts you.

You have now deleted your bucket and all its contents.
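If you prefer to clean up programmatically, the following is a minimal sketch using the AWS SDK for Java that empties and then deletes a bucket. The bucket name is the example value from this tutorial, and buckets with many objects may require paging through the listing.

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class CleanUpBucket {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(new ProfileCredentialsProvider());
        String bucket = "example-bucket";   // example bucket from this tutorial

        // A bucket cannot be deleted until it is empty, so delete the objects first.
        for (S3ObjectSummary obj : s3.listObjects(bucket).getObjectSummaries()) {
            s3.deleteObject(bucket, obj.getKey());
        }
        s3.deleteBucket(bucket);
    }
}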

The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the cluster. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

To delete Amazon EMR security groups

1. In the Amazon EC2 console Navigation pane, click Security Groups.
2. In the Security Groups pane, click ElasticMapReduce-slave.
3. In the details pane for the ElasticMapReduce-slave security group, delete all rules that reference ElasticMapReduce. Click Apply Rule Changes.
4. In the right pane, select ElasticMapReduce-master.
5. In the details pane for the ElasticMapReduce-master security group, delete all rules that reference Amazon EMR. Click Apply Rule Changes.
6. With ElasticMapReduce-master security group still selected in the Security Groups pane, click Delete. Click Yes, Delete to confirm.
7. In the Security Groups pane, click ElasticMapReduce-slave, and then click Delete. Click Yes, Delete to confirm.


Where Do I Go From Here?

The following provides several possible next steps to learn more about Amazon EMR.

Next Steps

· Figure out how Amazon EMR can meet your computational needs. For some ideas of how to use Amazon EMR to solve business problems, see What Can You Do with Amazon EMR? (p. 2).
· Choose an interface for working with Amazon EMR. The main interfaces are the Amazon EMR console, the command line interface (CLI), the REST APIs, and the SDKs published by Amazon Web Services.
· Start working with Amazon EMR. For more information, see Manage Clusters (p. 441).


Plan an Amazon EMR Cluster

Before you launch an Amazon EMR cluster, you need to figure out how you want the cluster configured. You'll make choices based on the type of data processing you want to do, the amount of data, speed, cost, and the software you want available on the cluster. The following topics walk you through the process of planning a cluster.

Some steps are optional. If you are new to Amazon EMR, we recommend that you also read through the optional steps to familiarize yourself with all that Amazon EMR can do.

Topics
· Choose an AWS Region (p. 30)
· Choose the Number and Type of Virtual Servers (p. 33)
· Configure the Virtual Server Software (p. 53)
· Choose the Cluster Lifecycle: Long-Running or Transient (p. 123)
· Prepare Input Data (Optional) (p. 125)
· Prepare an Output Location (Optional) (p. 137)
· Configure Access to the Cluster (p. 142)
· Configure Logging and Debugging (Optional) (p. 160)
· Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166)
· Tagging Amazon EMR Clusters (p. 172)
· Use Third Party Applications With Amazon EMR (Optional) (p. 179)

Choose an AWS Region

Amazon Web Services run on servers in data centers around the world. These are organized by geographical region. When you launch an Amazon EMR cluster, you must specify a region. You might choose a region to reduce latency, minimize costs, or address regulatory requirements. For the list of regions and endpoints supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.

For best performance, you should launch the cluster in the same region as your data. For example, if the Amazon S3 bucket storing your input data is in the US West (Oregon) region, you should launch your cluster in the US West (Oregon) region to avoid cross-region data transfer fees. If you use an Amazon S3 bucket to receive the output of the cluster, you would also want to create it in the US West (Oregon) region.


If you plan to associate an Amazon EC2 key pair with the cluster (required for using SSH to log on to the master node), the key pair must be created in the same region as the cluster. Similarly, the security groups that Amazon EMR creates to manage the cluster are created in the same region as the cluster.

If you signed up for an AWS account on or after October 10, 2012, the default region when you access a resource from the AWS Management Console is US West (Oregon) (us-west-2); for older accounts, the default region is US East (Virginia) (us-east-1). For more information, see Regions and Endpoints.

Some AWS features are available only in limited regions. For example, Cluster Compute instances are available only in the US-East (Northern Virginia) Region, and the Asia Pacific (Sydney) region supports only Hadoop 1.0.3 and later. When choosing a region, check that it supports the features you want to use.

For best performance, use the same region for all of your AWS resources that will be used with the cluster. The following table maps the region names between services.

Amazon EMR Region | Amazon EMR CLI and API Region | Amazon S3 Region | Amazon EC2 Region

US East (N. Virginia) | us-east-1 | US Standard | US East (Virginia)

US West (Oregon) | us-west-2 | Oregon | US West (Oregon)

US West (N. California) | us-west-1 | Northern California | US West (N. California)

EU (Ireland) | eu-west-1 | Ireland | EU (Ireland)

EU (Frankfurt) | eu-central-1 | Frankfurt | EU (Frankfurt)

Asia Pacific (Singapore) | ap-southeast-1 | Singapore | Asia Pacific (Singapore)

Asia Pacific (Tokyo) | ap-northeast-1 | Tokyo | Asia Pacific (Tokyo)

Asia Pacific (Sydney) | ap-southeast-2 | Sydney | Asia Pacific (Sydney)

South America (Sao Paulo) | sa-east-1 | Sao Paulo | South America (Sao Paulo)

AWS GovCloud (US) | us-gov-west-1 | GovCloud | AWS GovCloud (US)

Note
To use the AWS GovCloud Region, contact your AWS business representative. You can't create an AWS GovCloud account on the AWS website. You must engage directly with AWS and sign an AWS GovCloud (US) Enterprise Agreement. For more information, go to the AWS GovCloud (US) Product Page.

Choose a Region using the Console

To choose a region using the console

· Click the region drop-down to the right of your account information on the navigation bar to switch regions. Your default region is displayed automatically.

Specify a Region using the AWS CLI or Amazon EMR CLI

To specify a region using the AWS CLI

Specify a default region in the AWS CLI using either the aws configure command or the AWS_DEFAULT_REGION environment variable. For more information, see Configuring the AWS Region in the AWS Command Line Interface User Guide.

To specify a region using the Amazon EMR CLI

· Specify the region with the --region parameter, as in the following example.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --region eu-west-1

· Windows users:

ruby elastic-mapreduce --create --region eu-west-1

Tip
To reduce the number of parameters required each time you issue a command from the CLI, you can store information such as region in your credentials.json file. For more information about creating a credentials.json file, go to Configuring Credentials (p. 628).

Choose a Region Using an SDK or the API

To choose a region using an SDK

· Configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown in the following example:

client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");

After your application has specified a region by setting the endpoint, you can set the Availability Zone for your cluster's EC2 instances. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same region. A region contains one or more Availability Zones. To optimize performance and reduce latency, all resources should be located in the same Availability Zone as the cluster that uses them.
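As a rough illustration of both settings together, the following is a minimal sketch using the AWS SDK for Java that points the client at the EU (Ireland) endpoint and requests a specific Availability Zone for the cluster's instances; the cluster name, instance types, and zone are placeholder values.

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.PlacementType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class RegionAndZone {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient client =
            new AmazonElasticMapReduceClient(new ProfileCredentialsProvider());

        // Point the client at the EU (Ireland) regional endpoint.
        client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");

        // Request a specific Availability Zone for the cluster's EC2 instances.
        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("My Cluster")
            .withInstances(new JobFlowInstancesConfig()
                .withPlacement(new PlacementType("eu-west-1a"))
                .withInstanceCount(3)
                .withMasterInstanceType("m1.large")
                .withSlaveInstanceType("m1.large")
                .withKeepJobFlowAliveWhenNoSteps(true));

        // client.runJobFlow(request);   // submit once the rest of the configuration is filled in
    }
}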


Choose the Number and Type of Virtual Servers

One of the most important considerations when launching a cluster is the number and type of virtual servers to launch in the cluster. This will determine the processing power and storage capacity of the cluster. Tuning the cluster to the data you need to process is important. Too little capacity can cause your cluster to run slowly; too much capacity results in unnecessary cost.

The servers you run in an Amazon EMR cluster are EC2 virtual server instances. These come in different configurations such as m1.large and m1.xlarge, with different CPU, input/output, and storage capacity. For more information, see Virtual Server Configurations (p. 35). You can specify a different instance type for each server role in the Amazon EMR cluster.

Amazon EMR allows you to launch a cluster on dedicated hardware that you manage within a virtual private cloud (VPC). To launch a cluster using dedicated instances, mark instances launched in a VPC with the dedicated tenancy attribute or mark the entire VPC at creation time. For more information, see the Amazon VPC User Guide.

There are up to three types of server roles in an Amazon EMR cluster: master, core, and task. These are referred to as nodes and run on EC2 virtual server instances. A single master node manages the cluster. Core nodes both process data and store data in HDFS. Task nodes are optional, and only process data. For more information, see Instance Groups (p. 34). When selecting the number and type of instances for each role, keep in mind the different capacity needs of each role. The master node, which assigns tasks, doesn't require much processing power. Core nodes, which process tasks and store data in HDFS, need both processing power and storage. Task nodes, which don't store data, need only processing power.

One way to plan the instances of your cluster is to run a test cluster with a representative sample set of data and monitor the utilization of the nodes in the cluster. For more information, see View and Monitor a Cluster (p. 441). Another way is to calculate the capacity of the instances you are considering and compare that value against the size of your data.

Calculate the HDFS Capacity of a Cluster

The amount of HDFS storage available to your cluster depends on three factors: the number of core nodes, the storage capacity of the type of EC2 instance you specify for the core nodes, and the replication factor. The replication factor is the number of times each data block is stored in HDFS for RAID-like redundancy. By default, the replication factor is 3 for a cluster of 10 or more nodes, 2 for a cluster of 4-9 nodes, and 1 for a cluster with 3 or fewer nodes.

To calculate the HDFS capacity of a cluster, multiply the storage capacity of the EC2 instance type you've selected by the number of nodes and divide the total by the replication factor for the cluster. For example, a cluster with 10 core nodes of type m1.large would have 2833 GB of space available to HDFS: ( 10 nodes x 850 GB per node ) / replication factor of 3.
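The same arithmetic in code form, as a minimal sketch; the 850 GB per node figure is the value the example above assumes for an m1.large, and actual instance storage varies by instance type.

public class HdfsCapacity {
    // HDFS capacity = (core nodes x local storage per node) / replication factor.
    static double hdfsCapacityGB(int coreNodes, double storagePerNodeGB, int replicationFactor) {
        return (coreNodes * storagePerNodeGB) / replicationFactor;
    }

    public static void main(String[] args) {
        // The example from the text: 10 m1.large core nodes with 850 GB each, replication factor 3.
        System.out.println(hdfsCapacityGB(10, 850, 3));   // prints roughly 2833.33
    }
}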

If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage by adding more core nodes, choosing an EC2 instance type with greater storage capacity, using data compression, or changing the Hadoop configuration settings to reduce the replication factor. Reducing the replication factor should be used with caution as it reduces the redundancy of HDFS data and the ability of the cluster to recover from lost or corrupted HDFS blocks.

Guidelines for the Number and Type of Virtual Servers

The following guidelines apply to most Amazon EMR clusters.


· The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m1.small for Hadoop 1 clusters and an m1.large for Hadoop 2 clusters. For clusters of more than 50 nodes, consider using an m1.large for Hadoop 1 clusters and an m1.xlarge for Hadoop 2 clusters.
· The computational needs of the core and task nodes depend on the type of processing your application will perform. Many jobs can be run on m1.large instance types, which offer balanced performance in terms of CPU, disk space, and input/output. If your application has external dependencies that introduce delays (such as web crawling to collect data), you may be able to run the cluster on m1.small instances to reduce costs while the instances are waiting for dependencies to finish. For improved performance, consider running the cluster using m1.xlarge instances for the core and task nodes. If different phases of your job flow have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your job flow's varying capacity requirements.
· Most Amazon EMR clusters can run on standard EC2 instance types such as m1.large and m1.xlarge. Computation-intensive clusters may benefit from running on High CPU instances, which have proportionally more CPU than RAM memory. Database and memory-caching applications may benefit from running on High Memory instances. Network-intensive and CPU-intensive applications like parsing, NLP, and machine learning may benefit from running on Cluster Compute instances, which provide proportionally high CPU resources and increased network performance. For more information about these instance types, see Virtual Server Configurations (p. 35).
· The amount of data you can process depends on the capacity of your core nodes and the size of your data as input, during processing, and as output. The input, intermediate, and output data sets all reside on the cluster during processing.
· By default, the total number of EC2 instances you can run on a single AWS account is 20. This means that the total number of nodes you can have in a cluster is 20. For more information about how to request that this limit be increased for your account, see AWS Limits.
· In Amazon EMR, m1.small and m1.medium instances are recommended only for testing purposes, and m1.small is not supported on Hadoop 2 clusters.

Topics
· Instance Groups (p. 34)
· Virtual Server Configurations (p. 35)
· Ensure Capacity with Reserved Instances (Optional) (p. 36)
· Lower Costs with Spot Instances (Optional) (p. 37)

Instance Groups

Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server infrastructure to run the Hadoop cluster. Amazon EMR defines the concept of instance groups, which are collections of EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task.

Each Amazon EMR cluster includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes.

If the cluster is run on a single node, then that instance is simultaneously a master and a core node. For clusters running on more than one node, one instance is the master node and the remaining are core or task nodes.

For more information about instance groups, see Resize a Running Cluster (p. 508).


Master Instance Group

The master instance group manages the cluster: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the cluster, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. For more information, see View Log Files (p. 449).

As the cluster progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node. The master node runs the NameNode and JobTracker daemons. In the case of a single-node cluster, the master node also runs the TaskTracker and DataNode daemons.
Caution
The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the instance controller will be unable to launch and manage instances.

Core Instance Group

The core instance group contains all of the core nodes of a cluster. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

The EC2 instances you assign as core nodes are capacity that must be allotted for the entire cluster run. Because core nodes store data, you can't remove them from a cluster. However, you can add more core nodes to a running cluster. Core nodes run both the DataNode and TaskTracker Hadoop daemons.
Caution
Removing HDFS from a running node runs the risk of losing data.

For more information about core instance groups, see Resize a Running Cluster (p. 508).

Task Instance Group

The task instance group contains all of the task nodes in a cluster. The task instance group is optional. You can add it when you start the cluster or add a task instance group to a cluster in progress.

Task nodes are managed by the master node. While a cluster is running, you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a cluster, you can use task nodes to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

It can be beneficial to configure task nodes as Spot Instances because of the potential cost savings. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

For more information about task instance groups, see Resize a Running Cluster (p. 508).

Virtual Server Configurations

Amazon EMR enables you to choose the number and kind of EC2 instances that comprise the cluster that processes your data. Amazon EC2 offers several basic types.

· General purpose: You can use Amazon EC2 general purpose instances for most applications.
· Compute optimized: These instances have proportionally more CPU resources than memory (RAM) for compute-intensive applications. Cluster compute instances also have increased network performance.
· Memory optimized: These instances offer large memory sizes for high throughput applications, including database and memory caching applications.


· Storage optimized: These instances provide proportionally high storage resources. They are well suited for data warehouse applications.
· GPU instances: These instances provide compute resources for increased parallel processing performance with GPU-assisted applications.

The following table describes the instance types supported with Amazon EMR.

Instance Family Instance Types

General purpose m1.small (Hadoop 1 only) | m1.medium | m1.large | m1.xlarge | m3.xlarge | m3.2xlarge

Compute optimized c1.medium | c1.xlarge | c3.xlarge | c3.2xlarge | c3.4xlarge | c3.8xlarge | cc1.4xlarge | cc2.8xlarge

Memory optimized m2.xlarge | m2.2xlarge | m2.4xlarge | r3.xlarge | r3.2xlarge | r3.4xlarge | r3.8xlarge | cr1.8xlarge

Storage optimized hi1.4xlarge | hs1.2xlarge | hs1.4xlarge | hs1.8xlarge | i2.xlarge | i2.2xlarge | i2.4xlarge | i2.8xlarge

GPU instances g2.2xlarge | cg1.4xlarge

For more information on these instance types and families, see Instance Type Details.
Note
Amazon EMR does not support micro instances at this time.
Note
I2, G2, CC1, CC2, and R3 instance types are only supported on AMI 2.4.5 or later (Hadoop 1) and AMI 3.0.4 or later (Hadoop 2).

Ensure Capacity with Reserved Instances (Optional)

Reserved Instances provide reserved capacity and are an additional Amazon EC2 pricing option. You make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually consume, and if you don't use an instance, you don't pay usage charges for it.

Reserved Instances can also provide a considerable cost savings. For more information, see Amazon EC2 Reserved Instances.

To use a Reserved Instance with Amazon EMR, launch your cluster in the same Availability Zone as your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East. If you launch a cluster that uses two m1.small instances in the same Availability Zone in Region US-East, one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you have a sufficient number of available Reserved Instances for the total number of instances you want to launch, you are provisioned the appropriate capacity. Your Reserved Instances are used before any On-Demand Instances are created.

You can use Reserved Instances by using either the Amazon EMR console, the command line interface (CLI), Amazon EMR API actions, or the AWS SDKs.


Related Topics

· Amazon EC2 Reserved Instances

Lower Costs with Spot Instances (Optional)

When Amazon EC2 has unused capacity, it offers Amazon EC2 instances at a reduced cost, called the Spot Price. This price fluctuates based on availability and demand. You can purchase Spot Instances by placing a request that includes the highest bid price you are willing to pay for those instances. When the Spot Price is below your bid price, your Spot Instances are launched and you are billed the Spot Price. If the Spot Price rises above your bid price, Amazon EC2 terminates your Spot Instances.

For more information about Spot Instances, see Using Spot Instances in the Amazon EC2 User Guide for Linux Instances.

Additional video instruction includes:

· Amazon EC2 - Deciding on your Spot Bidding Strategy, describes strategies to use when setting a bid price for Spot Instances.
· Amazon EC2 - Managing Interruptions for Spot Instance Workloads, describes ways to handle Spot Instance termination.

If your workload is flexible in terms of time of completion or required capacity, Spot Instances can significantly reduce the cost of running your clusters. Workloads that are ideal for using Spot Instances include: application testing, time-insensitive workloads, and long-running clusters with fluctuations in load.
Note
We do not recommend Spot Instances for master and core nodes unless the cluster is expected to be short-lived and the workload is non-critical. Also, Spot Instances are not recommended for clusters that are time-critical or that need guaranteed capacity. These clusters should be launched using on-demand instances.

Topics
· When Should You Use Spot Instances? (p. 37)
· Choose What to Launch as Spot Instances (p. 38)
· Spot Instance Pricing in Amazon EMR (p. 39)
· Availability Zones and Regions (p. 40)
· Launch Spot Instances in a Cluster (p. 40)
· Change the Number of Spot Instances in a Cluster (p. 47)
· Troubleshoot Spot Instances (p. 52)

When Should You Use Spot Instances?

There are several scenarios in which Spot Instances are useful for running an Amazon EMR cluster.

Long-Running Clusters and Data Warehouses

If you are running a persistent Amazon EMR cluster, such as a data warehouse, that has a predictable variation in computational capacity, you can handle peak demand at lower cost with Spot Instances. Launch your master and core instance groups as on-demand to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements.

Cost-Driven Workloads


If you are running transient clusters for which lower cost is more important than the time to completion, and losing partial work is acceptable, you can run the entire cluster (master, core, and task instance groups) as Spot Instances to benefit from the largest cost savings.

Data-Critical Workloads

If you are running a cluster for which lower cost is more important than time to completion, but losing partial work is not acceptable, launch the master and core instance groups as on-demand and supplement with a task instance group of Spot Instances. Running the master and core instance groups as on-demand ensures that your data is persisted in HDFS and that the cluster is protected from termination due to Spot market fluctuations, while providing cost savings that accrue from running the task instance group as Spot Instances.

Application Testing

When you are testing a new application in order to prepare it for launch in a production environment, you can run the entire cluster (master, core, and task instance groups) as Spot Instances to reduce your testing costs.

Choose What to Launch as Spot Instances

When you launch a cluster in Amazon EMR, you can choose to launch any or all of the instance groups (master, core, and task) as Spot Instances. Because each type of instance group plays a different role in the cluster, the implications of launching each instance group as Spot Instances vary.

When you launch an instance group either as on-demand or as Spot Instances, you cannot change its classification while the cluster is running. In order to change an on-demand instance group to Spot Instances or vice versa, you must terminate the cluster and launch a new one.

The following table shows launch configurations for using Spot Instances in various applications.

Project | Master Instance Group | Core Instance Group | Task Instance Group

Long-running clusters | on-demand | on-demand | spot

Cost-driven workloads | spot | spot | spot

Data-critical workloads | on-demand | on-demand | spot

Application testing | spot | spot | spot

Master Instance Group as Spot Instances

The master node controls and directs the cluster. When it terminates, the cluster ends, so you should only launch the master node as a Spot Instance if you are running a cluster where sudden termination is acceptable. This might be the case if you are testing a new application, have a cluster that periodically persists data to an external store such as Amazon S3, or are running a cluster where cost is more important than ensuring the cluster's completion.

When you launch the master instance group as a Spot Instance, the cluster does not start until that Spot Instance request is fulfilled. This is something to take into consideration when selecting your bid price.

You can only add a Spot Instance master node when you launch the cluster. Master nodes cannot be added or removed from a running cluster.

Typically, you would only run the master node as a Spot Instance if you are running the entire cluster (all instance groups) as Spot Instances.


Core Instance Group as Spot Instances

Core nodes process data and store information using HDFS. You typically only run core nodes as Spot Instances if you are either not running task nodes or running task nodes as Spot Instances.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. This means that if you request a core instance group with six nodes, the instance group does not launch if there are only five nodes available at or below your bid price. In this case, Amazon EMR continues to wait until all six core nodes are available at your Spot Price or until you terminate the cluster.

You can add Spot Instance core nodes either when you launch the cluster or later to add capacity to a running cluster. You cannot shrink the size of the core instance group in a running cluster by reducing the instance count. However, it is possible to terminate an instance in the core instance group using the AWS CLI or the API. This should be done with caution. Terminating an instance in the core instance group risks data loss, and the instance is not automatically replaced.

Task Instance Group as Spot Instances

The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot Price has risen above your bid price, no data is lost and the effect on your cluster is minimal.

When you launch the task instance group as Spot Instances, Amazon EMR provisions as many task nodes as it can at your bid price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at your bid price, Amazon EMR launches the instance group with five nodes, adding the sixth later if it can.

Launching the task instance group as Spot Instances is a strategic way to expand the capacity of your cluster while minimizing costs. If you launch your master and core instance groups as on-demand instances, their capacity is guaranteed for the run of the cluster and you can add task instances to the instance group as needed to handle peak traffic or to speed up data processing.

You can add and remove task nodes from a running cluster using the console, the AWS CLI, or the API.

Spot Instance Pricing in Amazon EMR

There are two components in Amazon EMR billing: the cost of the EC2 instances launched by the cluster and the charge Amazon EMR adds for managing the cluster. When you use Spot Instances, the Spot Price may change due to fluctuations in supply and demand, but the Amazon EMR rate remains fixed.

When you purchase Spot Instances, you can set the bid price only when you launch the instance group. It can't be changed later. This is something to consider when setting the bid price for an instance group in a long-running cluster.

You can launch different instance groups at different bid prices. For example, in a cluster running entirely on Spot Instances, you might choose to set the bid price for the master instance group at a higher price than the task instance group since if the master terminates, the cluster ends, but terminated task instances can be replaced.

If you manually start and stop instances in the cluster, partial hours are billed as full hours. If instances are terminated by AWS because the Spot Price rose above your bid price, you are not charged either the Amazon EC2 or Amazon EMR charges for the partial hour.

You can look up the current Spot Price and the on-demand price for instances on the Amazon EC2 Pricing page.
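As a rough, purely illustrative estimate of how the two billing components combine, the following sketch uses hypothetical rates; the actual Spot Price and Amazon EMR rate depend on the instance type, region, and current market, so look them up on the pricing pages before bidding.

public class SpotCostEstimate {
    public static void main(String[] args) {
        // Hypothetical rates for illustration only; look up current prices before bidding.
        double spotPricePerHour = 0.03;    // assumed EC2 Spot Price for the instance type
        double emrRatePerHour   = 0.015;   // assumed Amazon EMR charge for the same instance type
        int    instances        = 4;
        int    hours            = 3;       // partial hours you start and stop yourself bill as full hours

        double total = instances * hours * (spotPricePerHour + emrRatePerHour);
        System.out.printf("Estimated charge: $%.2f%n", total);
    }
}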


Availability Zones and Regions

When you launch a cluster, you have the option to specify a region and an Availability Zone within that region.

If you do not specify an Availability Zone when you launch a cluster, Amazon EMR selects the Availability Zone with the lowest Spot Instance pricing and the largest available capacity of the EC2 instance type specified for your core instance group, and then launches the master, core, and task instance groups in that Availability Zone.

Because of fluctuating Spot Prices between Availability Zones, selecting the Availability Zone with the lowest initial price (or allowing Amazon EMR to select it for you) might not result in the lowest price for the life of the cluster. For optimal results, you should study the history of Availability Zone pricing before choosing the Availability Zone for your cluster.
Note
Because Amazon EMR selects the Availability Zone based on the free capacity of the Amazon EC2 instance type you specified for the core instance group, your cluster may end up in an Availability Zone with less capacity in other EC2 instance types. For example, if you are launching your core instance group as Large and the master instance group as Extra Large, you may launch into an Availability Zone with insufficient unused Extra Large capacity to fulfill a Spot Instance request for your master node. If you run into this situation, you can launch the master instance group as on-demand, even if you are launching the core instance group as Spot Instances.

If you specify an Availability Zone for the cluster, Amazon EMR launches all of the instance groups in that Availability Zone.

All instance groups in a cluster are launched into a single Availability Zone, regardless of whether they are on-demand or Spot Instances. A single Availability Zone is used because the additional data transfer costs and performance overhead make running instance groups in multiple Availability Zones undesirable.
Note
Selecting the Availability Zone is currently not available in the Amazon EMR console. Amazon EMR assigns an Availability Zone to clusters launched from the Amazon EMR console as described above.

Launch Spot Instances in a Cluster

When you launch a new instance group, you can launch the Amazon EC2 instances in that group either as on-demand or as Spot Instances. The procedure for launching Spot Instances is the same as launching on-demand instances, except that you specify the bid price.

Launch a Cluster with Spot Instances Using the Amazon EMR Console

To launch an entire cluster with Spot Instances

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Hardware Configuration section, to run the master node as a Spot Instance, select Request Spot in the Master row and enter the maximum hourly rate you are willing to pay per instance in the Bid price field that appears. You can look up the current Spot Price for instances on the Amazon EC2 Pricing page. In most cases, you will want to enter a price higher than the current Spot Price.
4. To run the core nodes as Spot Instances, select the Request Spot check box in the Core row and enter the maximum hourly rate you are willing to pay per instance in the Bid Price field.
5. To run the task nodes as Spot Instances, select the Request spot check box in the Task row and enter the maximum hourly rate you are willing to pay per instance in the Bid Price field.


6. Proceed to create the cluster as described in Plan an Amazon EMR Cluster (p. 30).

Launch a Cluster with Spot Instances Using the AWS CLI or Amazon EMR CLI

You can launch an entire cluster on Spot Instances using the AWS CLI or Amazon EMR CLI, or you can launch specific instance groups on Spot Instances.

To launch an entire cluster with Spot Instances using the AWS CLI

· To specify that an instance group should be launched with Spot Instances in the AWS CLI, use the --instance-groups subcommand and the BidPrice argument:

aws emr create-cluster --ami-version string --no-auto-terminate \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer,BidPrice=string \
InstanceGroupType=string,BidPrice=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,BidPrice=string,InstanceType=string,InstanceCount=integer

For example, to create a cluster where the master, core, and task instance groups all use Spot Instances, type:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1,BidPrice=0.25 \
InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m1.large,InstanceCount=2 \
InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m1.large,InstanceCount=3

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch an entire cluster with Spot Instances using the Amazon EMR CLI

Note
The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· To specify that an instance group should be launched as Spot Instances, use the --bid-price parameter. The following example shows how to create a cluster where the master, core, and task instance groups are all running as Spot Instances. The following code launches a cluster only after the requests for the master and core instances have been completely fulfilled. In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Spot Cluster" \
--instance-group master --instance-type m1.large --instance-count 1 --bid-price 0.25 \
--instance-group core --instance-type m1.large --instance-count 4 --bid-price 0.03 \
--instance-group task --instance-type c1.medium --instance-count 2 --bid-price 0.10

· Windows users:

ruby elastic-mapreduce --create --alive --name "Spot Cluster" --instance-group master --instance-type m1.large --instance-count 1 --bid-price 0.25 --instance-group core --instance-type m1.large --instance-count 4 --bid-price 0.03 --instance-group task --instance-type c1.medium --instance-count 2 --bid-price 0.10

Launch a Cluster with Spot Instances Using the Java SDK

To launch an entire cluster with Spot Instances using the Java SDK

· To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code shows how to define master, core, and task instance groups that run as Spot Instances.

InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
    .withInstanceCount(1)
    .withInstanceRole("MASTER")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.25");

InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
    .withInstanceCount(4)
    .withInstanceRole("CORE")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.03");

InstanceGroupConfig instanceGroupConfigTask = new InstanceGroupConfig()
    .withInstanceCount(2)
    .withInstanceRole("TASK")
    .withInstanceType("c1.medium")
    .withMarket("SPOT")
    .withBidPrice("0.10");
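To actually submit a cluster built from these three instance group definitions, something along the following lines should work with the AWS SDK for Java; the key pair name, log URI, and AMI version are placeholder values, and this is a sketch rather than a complete application.

import java.util.Arrays;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class LaunchSpotCluster {
    // Submits a cluster built from the three instance group definitions shown above.
    static String launch(InstanceGroupConfig master,
                         InstanceGroupConfig core,
                         InstanceGroupConfig task) {
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
            .withInstanceGroups(Arrays.asList(master, core, task))
            .withEc2KeyName("myKeyPair")                 // placeholder key pair name
            .withKeepJobFlowAliveWhenNoSteps(true);

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("Spot Cluster")
            .withAmiVersion("3.2.0")
            .withLogUri("s3://example-bucket/logs/")     // placeholder log bucket
            .withInstances(instances);

        AmazonElasticMapReduce emr =
            new AmazonElasticMapReduceClient(new ProfileCredentialsProvider());
        return emr.runJobFlow(request).getJobFlowId();
    }
}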

Launch a Cluster with Spot Instances Using the API

To launch an entire cluster with Spot Instances using the API

· To specify that an instance group should be launched as Spot Instances, set the BidPrice and Market properties of the InstanceGroupDetail members of the InstanceGroupDetailList. The following code shows how to define master, core, and task instance groups that run as Spot Instances.

Example Sample Request

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.large
&Instances.SlaveInstanceType=m1.large
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Instances.InstanceGroups.member.1.InstanceRole=MASTER
&Instances.InstanceGroups.member.1.Market=SPOT
&Instances.InstanceGroups.member.1.BidPrice=.25
&Instances.InstanceGroups.member.2.InstanceRole=CORE
&Instances.InstanceGroups.member.2.Market=SPOT
&Instances.InstanceGroups.member.2.BidPrice=.03
&Instances.InstanceGroups.member.3.InstanceRole=TASK
&Instances.InstanceGroups.member.3.Market=SPOT
&Instances.InstanceGroups.member.3.BidPrice=.03
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams


Example Sample Response

<RunJobFlowResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <RunJobFlowResult>
    <JobFlowId>j-3UNXXXXXXO2AG</JobFlowId>
  </RunJobFlowResult>
  <ResponseMetadata>
    <RequestId>8296d8b8-ed85-11dd-9877-6fad448a8419</RequestId>
  </ResponseMetadata>
</RunJobFlowResponse>

Launch the Task Group on Spot Instances Using the Amazon EMR Console

To launch only the task instance group on Spot Instances

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Hardware Configuration section, to run only the task nodes as Spot Instances, select Request spot in the Task row and enter the maximum hourly rate you are willing to pay per instance in the Bid price field. You can look up the current Spot Price for instances on the Amazon EC2 Pricing page. In most cases, you will want to enter a price higher than the current Spot Price.

4. Proceed to create the cluster as described in Plan an Amazon EMR Cluster (p. 30).

Launch the Task Group on Spot Instances Using the AWS CLI or Amazon EMR CLI

Rather than launching all instance groups using Spot Instances, a more typical use case is to launch the task group using Spot Instances. The following sections demonstrate launching the task group with Spot Instances using the console, the AWS CLI, the Amazon EMR CLI, the SDK, and the API.


To launch the task instance group on Spot Instances using the AWS CLI

· To launch the task instance group on Spot Instances using the AWS CLI, specify the BidPrice for the task instance group:

aws emr create-cluster --ami-version string --no-auto-terminate \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,BidPrice=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2 \
InstanceGroupType=TASK,BidPrice=0.03,InstanceType=m1.large,InstanceCount=4

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch the task instance group on Spot Instances using the Amazon EMR CLI

Note
The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· To specify that an instance group should be launched as Spot Instances, use the --bid-price parameter. The following example shows how to create a cluster where only the task instance group uses Spot Instances. The command launches a cluster even if the request for Spot Instances cannot be fulfilled. In that case, Amazon EMR adds task nodes to the cluster if it is still running when the Spot Price falls below the bid price. In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Spot Task Group" \
--instance-group master --instance-type m1.large --instance-count 1 \
--instance-group core --instance-type m1.large --instance-count 2 \
--instance-group task --instance-type m1.large --instance-count 4 --bid-price 0.03

· Windows users:

ruby elastic-mapreduce --create --alive --name "Spot Task Group" --instance-group master --instance-type m1.large --instance-count 1 --instance-group core --instance-type m1.large --instance-count 2 --instance-group task --instance-type m1.small --instance-count 4 --bid-price 0.03


Launch the Task Group on Spot Instances Using the Java SDK

To launch the task instance group on Spot Instances using the Java SDK

· To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code creates a task instance group of type m1.large with an instance count of 10. It specifies $0.35 as the maximum bid price, which launches the task group on Spot Instances.

InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
    .withInstanceCount(1)
    .withInstanceRole("MASTER")
    .withInstanceType("m1.large");

InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
    .withInstanceCount(4)
    .withInstanceRole("CORE")
    .withInstanceType("m1.large");

InstanceGroupConfig instanceGroupConfig = new InstanceGroupConfig()
    .withInstanceCount(10)
    .withInstanceRole("TASK")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.35");

Launch the Task Group on Spot Instances Using the API

To launch the task instance group on Spot Instances using the API

· To specify that an instance group should be launched as Spot Instances, set the BidPrice and Market properties of the TASK InstanceGroupDetail member of the InstanceGroupDetailList. The following code shows how to format the request to define the task instance group using Spot Instances.


Example Sample Request

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow &Name=MyJobFlowName &LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir &Instances.MasterInstanceType=m1.large &Instances.SlaveInstanceType=m1.large &Instances.InstanceCount=4 &Instances.Ec2KeyName=myec2keyname &Instances.Placement.AvailabilityZone=us-east-1a &Instances.KeepJobFlowAliveWhenNoSteps=true &Instances.TerminationProtected=true &Instances.InstanceGroups.member.1.InstanceRole=MASTER &Instances.InstanceGroups.member.2.InstanceRole=CORE &Instances.InstanceGroups.member.3.InstanceRole=TASK &Instances.InstanceGroups.member.3.Market=SPOT &Instances.InstanceGroups.member.3.BidPrice=.03 &Steps.member.1.Name=MyStepName &Steps.member.1.ActionOnFailure=CONTINUE &Steps.member.1.HadoopJarStep.Jar=MyJarFile &Steps.member.1.HadoopJarStep.MainClass=MyMainClass &Steps.member.1.HadoopJarStep.Args.member.1=arg1 &Steps.member.1.HadoopJarStep.Args.member.2=arg2 &AuthParams

Example Sample Response

The abbreviated response includes the ID of the new cluster (job flow) and the request ID:
JobFlowId: j-3UNXXXXXXO2AG
RequestId: 8296d8b8-ed85-11dd-9877-6fad448a8419

Change the Number of Spot Instances in a Cluster

With some restrictions, you can modify the number of Spot Instances in a cluster. For example, because the master instance group contains only one instance, you cannot modify the master group's instance count whether or not the instance group is using a Spot Instance.

If you are running a cluster that contains only a master node, you cannot add instance groups or instances to that cluster. A cluster must have one or more core instances for you to be able to add or modify instance groups.


You can only define an instance group using Spot Instances when it is created. For example, if you launch the core instance group using on-demand instances, you cannot change them to Spot Instances later.

The following examples show how to change the number of Spot Instances in an instance group using the console, CLI, SDK, and API.

Change the Number of Spot Instances Using the Amazon EMR Console

If you launch a cluster using Spot Instances for one or more instance groups, you can change the number of requested Spot Instances for those groups using the Amazon EMR console. Note that you can only increase the number of core instances in your cluster, but you can either increase or decrease the number of task instances. Setting the number of task instances to zero removes all Spot Instances for that instance group.

To change the number of Spot Instances in an instance group using the console

1. In the Amazon EMR console, on the Cluster List page, click the link for your cluster.
2. Click Resize.
3. In the Resize Cluster dialog, in the Count column, click the Resize link for the instance groups to which you wish to add Spot Instances.
4. To add core or task instances, type a higher number in the Count column and click the check mark to confirm the change. To reduce the number of task instances, type a lower number in the Count column and click the check mark to confirm the change. After changing the instance count, click Close.

Note You cannot change the bid price when adding or removing Spot Instances from instance groups using the console. Also note, changing the instance count of the task group to 0 removes all Spot Instances but not the instance group.

Change the Number of Spot Instances Using the AWS CLI or Amazon EMR CLI

Using the AWS CLI, you can: increase the number of Spot Instances in the core instance group, increase or decrease the number of Spot Instances in the task instance group, add a task instance group containing Spot Instances, or terminate a Spot Instance in the core instance group.

Using the Amazon EMR CLI, you can: increase the number of Spot Instances in the core instance group, increase or decrease the number of Spot Instances in the task instance group, or add a task instance group containing Spot Instances.


To change the number of Spot Instances for instance groups using the AWS CLI

· You can add Spot Instances to the core group or task group, and you can remove instances from the task group, using the AWS CLI modify-instance-groups subcommand with the InstanceCount parameter:

aws emr modify-instance-groups --instance-groups InstanceGroupId=string,InstanceCount=integer

To add instances to the core or task groups, increase the InstanceCount. To reduce the number of instances in the task group, decrease the InstanceCount. Changing the instance count of the task group to 0 removes all Spot Instances but not the instance group. For example, to increase the number of Spot Instances in the task instance group from 3 to 4, type:

aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4

To retrieve the InstanceGroupId, use the describe-cluster subcommand. The output is a JSON object called Cluster that contains the ID of each instance group. To use this command, you need the cluster ID (which you can retrieve using the aws emr list-clusters command or the console):

aws emr describe-cluster --cluster-id string

You can also pipe the JSON object to a program like grep or jq. To retrieve a cluster's instance group name and ID using jq on Unix-based systems, type the following:

aws emr describe-cluster --cluster-id string | jq ".Cluster.InstanceGroups[] | {name,Id}"

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add a task instance group with Spot Instances to a cluster using the AWS CLI

· Using the AWS CLI, you can add a task instance group of Spot Instances to a cluster with the add-instance-groups subcommand:

aws emr add-instance-groups --cluster-id string --instance-groups InstanceCount=integer,BidPrice=string,InstanceGroupType=string,InstanceType=string

For example:

aws emr add-instance-groups --cluster-id j-JXBXXXXXX37R --instance-groups InstanceCount=6,BidPrice=.25,InstanceGroupType=task,InstanceType=m1.large

Note You cannot add a task instance group to a cluster if the cluster was created with a task group, even if the instance count for the group is 0.


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To terminate a Spot Instance in the core instance group using the AWS CLI

· Using the AWS CLI, you can terminate a Spot Instance in the core instance group with the modify-instance-groups subcommand. This should be done with caution. Terminating an instance in the core instance group risks data loss, and the instance is not automatically replaced. To terminate a specific instance you need the instance group ID (returned by the aws emr describe-cluster --cluster-id command) and the instance ID (returned by the aws emr list-instances --cluster-id command):

aws emr modify-instance-groups --instance-groups InstanceGroupId=string,EC2InstanceIdsToTerminate=string

For example:

aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-6RXXXXXX07SA,EC2InstanceIdsToTerminate=i-f9XXXXf2

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add a task instance group with Spot Instances to a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· You can use --add-instance-group to add a task instance group of Spot Instances to a running cluster. In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowId \
--add-instance-group task --instance-type m1.small \
--instance-count 5 --bid-price 0.05

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowId --add-instance-group task --instance-type m1.small --instance-count 5 --bid-price 0.05

To change the number of Spot Instances in instance groups using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· You can change the number of requested Spot Instances in a cluster using the --modify-instance-group and --instance-count parameters. Note that you can only increase the number of core instances in your cluster, but you can either increase or decrease the number of task instances. Setting the number of task instances to zero removes all Spot Instances (but not the instance group). In the directory where you installed the Amazon EMR CLI, type the following command to increase the number of requested Spot Instances for the task instance group from 4 to 5. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowId \
--modify-instance-group task --instance-count 5

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowId --modify-instance-group task --instance-count 5

Add Spot Instances to Instance Groups Using the Java SDK

To add Spot Instances to a cluster using the Java SDK

· To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code creates a task instance group of type m1.large with an instance count of 10. It specifies $0.35 as the maximum bid price, and runs as Spot Instances. When you make the call to add the instance group, pass this object instance:

InstanceGroupConfig instanceGroupConfig = new InstanceGroupConfig()
    .withInstanceCount(10)
    .withInstanceRole("TASK")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.35");
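The call that submits this configuration is not shown in this procedure. As a minimal sketch, assuming the AWS SDK for Java and a placeholder cluster (job flow) ID, the task group could be added to a running cluster with AddInstanceGroups:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddInstanceGroupsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddInstanceGroupsResult;

// Snippet only; assumes the instanceGroupConfig object shown above
// and credentials available to the default provider chain.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

AddInstanceGroupsRequest addRequest = new AddInstanceGroupsRequest()
    .withJobFlowId("j-3UNXXXXXXO2AG")          // placeholder cluster ID
    .withInstanceGroups(instanceGroupConfig);

AddInstanceGroupsResult addResult = emr.addInstanceGroups(addRequest);
System.out.println("Added instance group(s): " + addResult.getInstanceGroupIds());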

Add Spot Instances to Instance Groups Using the API

To add Spot Instances to a running cluster using the API

· The following sample request increases the number of task nodes in the task instance group to eight and requests that they be launched as Spot Instances with an hourly bid price of .35. The following is an example of the request you would send to Amazon EMR.


Sample Request

https://elasticmapreduce.amazonaws.com?Operation=AddInstanceGroups &InstanceGroups.member.1.InstanceGroupId=i-3UN6WX5RRO2AG &InstanceGroups.member.1.InstanceRequestCount=8 &InstanceGroups.member.1.InstanceRole=TASK &InstanceGroups.member.1.Market=SPOT &InstanceGroups.member.1.BidPrice=.35 &AuthParams

Sample Response

The abbreviated response contains the request ID:
RequestId: 2690d7eb-ed86-11dd-9877-6fad448a8419

Troubleshoot Spot Instances

The following topics address issues that may arise when you use Spot Instances. For additional information on how to debug cluster issues, see Troubleshoot a Cluster (p. 530).

Why haven't I received my Spot Instances?

Spot Instances are provisioned based on availability and bid price. If you haven't received the Spot Instances you requested, that means either your bid price is lower than the current Spot Price, or there is not enough supply at your bid price to fulfill your request.

Master and core instance groups are not fulfilled until all of the requested instances can be provisioned. Task nodes are fulfilled as they become available.

One way to address unfulfilled Spot Instance requests is to terminate the cluster and launch a new one, specifying a higher bid price. Reviewing the price history on Spot Instances tells you which bids have been successful in the recent past and can help you determine which bid is the best balance of cost savings and likelihood of being fulfilled. To review the Spot Instance price history, go to Spot Instances on the Amazon EC2 Pricing page.

Another option is to change the type of instance you request. For example, if you requested four Extra Large instances and the request has not been filled after a period of time, you might consider relaunching the cluster and placing a request for four Large instances instead. Because the base rate is different for each instance type, you would want to adjust your bid price accordingly. For example, if you bid 80% of the on-demand rate for an Extra Large instance you might choose to adjust your bid price on the new Spot Instance request to reflect 80% of the on-demand rate for a Large instance.
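For illustration only, using hypothetical figures rather than actual Amazon EC2 prices: if the Extra Large on-demand rate were $0.50 per hour, an 80% bid would be $0.40 per hour; switching the request to Large instances at a hypothetical on-demand rate of $0.25 per hour, the equivalent 80% bid would be $0.20 per hour.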

The final variable in the fulfillment of a Spot Instance request is whether there is unused capacity in your region. You can try launching the Spot Instance request in a different region. Before selecting this option, however, consider the implications of data transfer across regions. For example, if the Amazon Simple Storage Service (Amazon S3) bucket housing your data is in region us-east-1 and you launch a cluster as Spot Instances in us-west-1, the additional cross-region data transfer costs may outweigh any cost savings from using Spot Instances.


Why did my Spot Instances terminate?

By design, Spot Instances are terminated by Amazon EC2 when the Spot Instance price rises above your bid price.

If your bid price is equal to or lower than the Spot Instance price, the instances might have terminated normally at the end of the cluster, or they might have terminated because of an error. For more information about how to debug cluster errors, go to Troubleshoot a Cluster (p. 530).

How do I check the price history on Spot Instances?

To review the Spot Instance price history, go to Spot Instances on the Amazon EC2 Pricing page. This pricing information is updated at regular intervals.

Configure the Virtual Server Software

Amazon EMR uses an Amazon Machine Image (AMI) to install Linux, Hadoop, and other software on the virtual servers that it launches in the cluster. New versions of the Amazon EMR AMI are released on a regular basis, adding new features and fixing issues. We recommend that you use the latest AMI to launch your cluster whenever possible. The latest version of the AMI is the default when you launch a cluster from the console.

The AWS version of Hadoop installed by Amazon EMR is based on Apache Hadoop, with patches and improvements added that make it work efficiently with AWS. Each Amazon EMR AMI has a default version of Hadoop associated with it. If your application requires a different version of Hadoop than the default, specify that Hadoop version when you launch the cluster.

In addition to the standard software installed on the cluster, you can use bootstrap actions to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the virtual servers when Amazon EMR launches the cluster. You can write custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change the Hadoop configuration settings.
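As a hedged illustration of that last point (the bootstrap action script location and arguments below are placeholders, not scripts shipped by Amazon EMR), a custom bootstrap action can be attached to a cluster request with the AWS SDK for Java roughly as follows:

import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

// Snippet only; the script path and argument are hypothetical examples.
BootstrapActionConfig installExtras = new BootstrapActionConfig()
    .withName("Install additional software")
    .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
        .withPath("s3://mybucket/bootstrap/install-my-software.sh")  // placeholder script
        .withArgs("--verbose"));                                     // placeholder argument

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Cluster with bootstrap action")
    .withAmiVersion("3.2.1")
    // ... instance configuration omitted for brevity ...
    .withBootstrapActions(installExtras);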

For more information, see the following topics:

Topics
· Choose an Amazon Machine Image (AMI) (p. 53)
· Choose a Version of Hadoop (p. 86)
· File Systems compatible with Amazon EMR (p. 95)
· Create Bootstrap Actions to Install Additional Software (Optional) (p. 110)

Choose an Amazon Machine Image (AMI)

Amazon Elastic MapReduce (Amazon EMR) uses Amazon Machine Images (AMIs) to initialize the EC2 instances it launches to run a cluster. The AMIs contain the Linux operating system, Hadoop, and other software used to run the cluster. These AMIs are specific to Amazon EMR and can be used only in the context of running a cluster. Periodically, Amazon EMR updates these AMIs with new versions of Hadoop and other software, so users can take advantage of improvements and new features. If you create a new cluster using an updated AMI, you must ensure that your application will work with it.


For general information about AMIs, see Amazon Machine Images in the Amazon EC2 User Guide for Linux Instances. For more information about the software versions included in the Amazon EMR AMIs, see AMI Versions Supported in Amazon EMR (p. 60).

AMI versioning gives you the option to choose the specific AMI your cluster uses to launch EC2 instances. If your application depends on a specific version or configuration of Hadoop, you might want to delay upgrading to a new AMI until you have tested your application on it.

Specifying the AMI version during cluster creation is required when you use the console or the AWS CLI and is optional in the Amazon EMR CLI, API, and SDK. If you specify an AMI version when you create a cluster, your instances will be created using that AMI. This provides stability for long-running or mission-critical applications. The trade-off is that your application will not have access to new features on more up-to-date AMI versions unless you launch a new cluster using a newer AMI. For more information on specifying the AMI version, see AMI Version Numbers (p. 54).

In the Amazon EMR CLI, API, and SDK, the AMI version is optional; if you do not provide an AMI version parameter, and you are using the Amazon EMR CLI, API, or SDK, your clusters will run on the default AMI version for the tool you are using. In the Amazon EMR CLI, you can also specify the "latest" value with the AMI version parameter.

Topics
· AMI Version Numbers (p. 54)
· Default AMI and Hadoop Versions (p. 55)
· Specifying the AMI Version for a New Cluster (p. 55)
· View the AMI Version of a Running Cluster (p. 58)
· Amazon EMR AMIs and Hadoop Versions (p. 60)
· Amazon EMR AMI Deprecation (p. 60)
· AMI Versions Supported in Amazon EMR (p. 60)

AMI Version Numbers

AMI version numbers are composed of three parts: major-version.minor-version.patch. There are several ways to specify which version of the AMI to use to launch your cluster, depending on the tool you use to launch the cluster: the SDK, API, AWS CLI, or Amazon EMR CLI.

· Fully specified: If you specify the AMI version using all three parts (for example, --ami-version 2.0.1), your cluster will be launched on exactly that version. This is useful if you are running an application that depends on a specific AMI version. All tools support this option.
· Major-minor version specified: If you specify just the major and minor version for the AMI (for example, --ami-version 2.0), your cluster will be launched on the AMI that matches those specifications and has the latest patches. If the latest patch for the 2.0 AMI series is .4, the preceding example would launch a cluster using AMI 2.0.4. This option ensures that you receive the benefits of the latest patches for the AMI series specified. All tools support this option.
· Latest version specified: In the Amazon EMR CLI, if you use the keyword latest instead of a version number for the AMI (for example, --ami-version latest), the cluster is launched with the AMI currently listed as the "latest" AMI version (currently AMI version 2.4.2). This configuration is suitable for prototyping and testing, and is not recommended for production environments. This option is not supported by the AWS CLI, SDK, or API.
· No version specified: In the Amazon EMR CLI, API, or SDK, if you do not specify an AMI version, the cluster is launched with the default AMI for that tool. This option is not supported by the AWS CLI. For more information on the default AMI versions, see Default AMI and Hadoop Versions (p. 55).


Default AMI and Hadoop Versions

If you do not specify the AMI for the cluster (using the API, SDK, or Amazon EMR CLI), Amazon EMR launches your cluster with the default version. The default versions returned depend on the interface you use to launch the cluster. The AWS CLI requires you to specify the AMI version; it does not have a default AMI version.
Note The default AMI is unavailable in the Asia Pacific (Sydney) Region. Instead, use --ami-version latest (in the Amazon EMR CLI), fully specify the AMI, or use the major-minor version.

Default AMI and Hadoop versions by interface:
· API: AMI 1.0.1 with Hadoop 0.18
· SDK: AMI 1.0.1 with Hadoop 0.18
· Amazon EMR CLI (version 2012-07-30 and later): AMI 2.4.2 with Hadoop 1.0.3
· Amazon EMR CLI (versions 2011-12-08 to 2012-07-09): AMI 2.1.3 with Hadoop 0.20.205
· Amazon EMR CLI (version 2011-12-11 and earlier): AMI 1.0.1 with Hadoop 0.18

To determine which version of the Amazon EMR CLI you have installed

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the CLI, type the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --version

· Windows users:

ruby elastic-mapreduce --version

Specifying the AMI Version for a New Cluster

You can specify which AMI version a new cluster should use when you launch it. For details about the default configuration and applications available on AMI versions, see AMI Versions Supported in Amazon EMR (p. 60).

To specify an AMI version during cluster launch using the AWS CLI

· When you create a cluster using the AWS CLI, add the --ami-version parameter to the create-cluster subcommand. The --ami-version parameter is required when you create a cluster using the AWS CLI. The following example fully specifies the AMI and launches a cluster using AMI version 2.4.8.


aws emr create-cluster --name "Static AMI Version" --ami-version 2.4.8 \ --instance-count 5 --instance-type m3.xlarge

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

The following example specifies the AMI using the major and minor version. The cluster is launched on the AMI that matches those specifications and has the latest patches. For example, if the most recent AMI version is 2.4.8, specifying --ami-version 2.4 would launch a cluster using AMI 2.4.8.

aws emr create-cluster --name "Static AMI Major Version" --ami-version 2.4 \ --instance-count 5 --instance-type m3.xlarge

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To specify an AMI version using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· When creating a cluster using the CLI, add the --ami-version parameter. If you do not specify this parameter, or if you specify --ami-version latest, the most recent AMI version is used.

The following example specifies the AMI completely and will launch a cluster on AMI 2.4.8.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Static AMI Version" \ --ami-version 2.4.8 \ --num-instances 5 --instance-type m1.large

· Windows users:

ruby elastic-mapreduce --create --alive --name "Static AMI Version" --ami- version 2.4.8 --num-instances 5 --instance-type m1.large

The following example specifies the AMI using just the major and minor version. It will launch the cluster on the AMI that matches those specifications and has the latest patches. For example, if the most recent AMI version is 2.4.8, specifying --ami-version 2.4 would launch a cluster using AMI 2.4.8.

· Linux, UNIX, and Mac OS X users:


./elastic-mapreduce --create --alive --name "Major-Minor AMI Version" \ --ami-version 2.4 \ --num-instances 5 --instance-type m1.large

· Windows users:

ruby elastic-mapreduce --create --alive --name "Major-Minor AMI Version" --ami-version 2.4 --num-instances 5 --instance-type m1.large

The following example specifies that the cluster should be launched with the latest AMI (Amazon EMR CLI only).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Latest AMI Version" \ --ami-version latest \ --num-instances 5 --instance-type m1.large

· Windows users:

ruby elastic-mapreduce --create --alive --name "Latest AMI Version" --ami- version latest --num-instances 5 --instance-type m1.large

To specify an AMI version using the API

· When creating a cluster using the API, add the AmiVersion and the HadoopVersion parameters to the request string, as shown in the following example. If you do not specify these parameters, Amazon EMR will create the cluster using AMI version 1.0.1 and Hadoop 0.18. For more information, see RunJobFlow in the Amazon Elastic MapReduce API Reference.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow &Name=MyJobFlowName &LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir &AmiVersion=2.4.0 &HadoopVersion=1.03 &Instances.MasterInstanceType=m1.large &Instances.SlaveInstanceType=m1.large &Instances.InstanceCount=4 &Instances.Ec2KeyName=myec2keyname &Instances.Placement.AvailabilityZone=us-east-1a &Instances.KeepJobFlowAliveWhenNoSteps=true &Steps.member.1.Name=MyStepName &Steps.member.1.ActionOnFailure=CONTINUE &Steps.member.1.HadoopJarStep.Jar=MyJarFile &Steps.member.1.HadoopJarStep.MainClass=MyMainClass &Steps.member.1.HadoopJarStep.Args.member.1=arg1 &Steps.member.1.HadoopJarStep.Args.member.2=arg2 &AuthParams
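The AWS SDK sets the same AmiVersion value through the request object rather than the query string. As a minimal sketch using the AWS SDK for Java (the cluster name and instance settings are placeholders), the AMI version can be specified like this:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

// Snippet only; instance configuration shortened to the fields relevant here.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Static AMI Version")                 // placeholder cluster name
    .withAmiVersion("2.4.8")                        // fully specified AMI version
    .withInstances(new JobFlowInstancesConfig()
        .withInstanceCount(5)
        .withMasterInstanceType("m1.large")
        .withSlaveInstanceType("m1.large")
        .withKeepJobFlowAliveWhenNoSteps(true));

emr.runJobFlow(request);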


View the AMI Version of a Running Cluster

If you need to find out which AMI version a cluster is running, you can retrieve this information using the console, the CLI, or the API.

To view the current AMI version using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click a cluster link on the Cluster List page. The AMI Version and other details about the cluster are displayed on the Cluster Details page.

To view the current AMI version using the AWS CLI

· Type the describe-cluster subcommand with the cluster-id parameter to retrieve information about a cluster, including the AMI version. The cluster identifier is required to use the describe-cluster subcommand. You can retrieve the cluster ID using the console or the list-clusters subcommand:

aws emr describe-cluster --cluster-id string

For example:

aws emr describe-cluster --cluster-id j-3QKHXXXXXXARD

The output shows the AMI version for the cluster:


{ "Cluster": { "Status": { "Timeline": { "EndDateTime": 1412976409.738, "CreationDateTime": 1412976134.334 }, "State": "TERMINATED", "StateChangeReason": { "Message": "Terminated by user request", "Code": "USER_REQUEST" } }, "Ec2InstanceAttributes": { "Ec2AvailabilityZone": "us-west-2c" }, "Name": "Static AMI Version", "Tags": [], "TerminationProtected": false, "RunningAmiVersion": "2.4.8", "NormalizedInstanceHours": 0, "InstanceGroups": [ { ... }

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To view the current AMI version using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --describe parameter to retrieve the AMI version on a cluster. In the following example, JobFlowID is the identifier of the cluster. The AMI version will be returned along with other information about the cluster.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --describe --jobflow JobFlowID

· Windows users:

ruby elastic-mapreduce --describe --jobflow JobFlowID


To view the current AMI version using the API

· Call DescribeJobFlows to check which AMI version a cluster is using. The version will be returned as part of the response data, as shown in the following example. For the complete response syntax, go to DescribeJobFlows in the Amazon Elastic MapReduce API Reference.

The abbreviated response shows the AMI version and the request ID:
AmiVersion: 2.1.3
RequestId: 9cea3229-ed85-11dd-9877-6fad448a8419
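The AWS SDK for Java exposes the same information. As a minimal sketch (the cluster ID is a placeholder), the running AMI version can be read through the DescribeCluster operation:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;

// Snippet only; assumes credentials from the default provider chain.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

Cluster cluster = emr.describeCluster(
        new DescribeClusterRequest().withClusterId("j-3QKHXXXXXXARD"))  // placeholder ID
    .getCluster();

// RunningAmiVersion corresponds to the field shown in the AWS CLI output above.
System.out.println("AMI version: " + cluster.getRunningAmiVersion());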

Amazon EMR AMIs and Hadoop Versions

An AMI can contain multiple versions of Hadoop. If the AMI you specify has multiple versions of Hadoop available, you can select the version of Hadoop you want to run as described in Hadoop Configuration Reference (p. 564). You cannot specify a Hadoop version that is not available on the AMI. For a list of the versions of Hadoop supported on each AMI, go to AMI Versions Supported in Amazon EMR (p. 60).

Amazon EMR AMI Deprecation

Eighteen months after an AMI version is released, the Amazon EMR team might choose to deprecate that AMI version and no longer support it. In addition, the Amazon EMR team might deprecate an AMI before eighteen months have elapsed if a security risk or other issue is identified in the software or operating system of the AMI. If a cluster is running when its AMI is deprecated, the cluster will not be affected. You will not, however, be able to create new clusters with the deprecated AMI version. The best practice is to plan for AMI obsolescence and move to new AMI versions as soon as is practical for your application.

Before an AMI is deprecated, the Amazon EMR team will send out an announcement specifying the date on which the AMI version will no longer be supported.

AMI Versions Supported in Amazon EMR

Amazon EMR supports the AMI versions listed in the following tables. You must specify the AMI version to use when you create a cluster using the AWS CLI and the console. If you do not specify an AMI version in the SDK, API, or Amazon EMR CLI, Amazon EMR creates the cluster using the default AMI version. For information about default AMI configurations, see Default AMI and Hadoop Versions (p. 55).


Hadoop 2 AMI Versions

AMI Version 3.2.1 (released 16 September 2014)
Includes: Amazon Linux version 2014.03; Hadoop 2.4.0 (p. 88); Hive 0.13.1; Pig 0.12 (p. 302); Hbase 0.94.18 (p. 317); Impala 1.2.4 (p. 288); Mahout 0.9 (p. 94); Java: Oracle/Sun jdk-7u65; Perl 5.16.3; PHP 5.3.28; Python 2.6.9; R 3.0.2; Ruby 2.0; Scala 2.11.1
Notes: This Amazon EMR AMI version provides the following features and fixes:
Major Feature Updates
· Added EMRFS consistent view, which allows customers to track the consistent view of objects written by Amazon EMR to Amazon S3. For more information, see the section called "Consistent View" (p. 97).
Major Bug Fixes
· Fixes a port forwarding issue encountered with Hive.
· Fixes an issue encountered with HiveMetaStoreChecker.
· Included a fix for HIVE-7085.

AMI Version 3.1.2 (released 16 September 2014)
Includes: Amazon Linux version 2014.03; Hadoop 2.4.0 (p. 88); Hive 0.13.1; Pig 0.12 (p. 302); Hbase 0.94.18 (p. 317); Impala 1.2.4 (p. 288); Mahout 0.9 (p. 94); Java: Oracle/Sun jdk-7u65; Perl 5.16.3; PHP 5.3.28; Python 2.6.9; R 3.0.2; Ruby 2.0; Scala 2.11.1
Notes: This Amazon EMR AMI version provides the following bug fixes:
Major Bug Fixes
· Fixes an issue encountered when using the AWS CLI on the image.
· Fixes a port forwarding bug encountered with Hive.
· Fixes an issue encountered with HiveMetaStoreChecker.
· Includes the following patches: MAPREDUCE-5956, YARN-2026, YARN-2187, YARN-2214, YARN-2111, YARN-2053, YARN-2074, YARN-2030, YARN-2155, YARN-2128, YARN-1913, YARN-596, YARN-2073, YARN-2017.

AMI Version 3.2.0 (released 3 September 2014)
Includes: Amazon Linux version 2014.03; Hadoop 2.4.0 (p. 88); Hive 0.13.1; Pig 0.12 (p. 302); Hbase 0.94.18 (p. 317); Impala 1.2.4 (p. 288); Mahout 0.9 (p. 94); Java: Oracle/Sun jdk-7u65; Perl 5.16.3; PHP 5.3.28; Python 2.6.9; R 3.0.2; Ruby 2.0; Scala 2.11.1
Notes: This Amazon EMR AMI version provides the following features:
Major Feature Updates
· Added Apache Hive 0.13.1. For more information, go to the Hive version documentation (p. 255) and http://hive.apache.org/downloads.html.
· Provided a change to the connector for Amazon Kinesis that takes a flag, kinesis.iteration.timeout.ignore.failure, to allow a job to continue checkpointing even if it has reached the timeout value.
Major Bug Fixes
· Included the following YARN patches: YARN-1718, YARN-1923, YARN-1889.

AMI Version 3.1.1 (released 15 August 2014)
Includes: Amazon Linux version 2014.03; Hadoop 2.4.0 (p. 88); Hive 0.11.0.2 (p. 255); Pig 0.12 (p. 302); Hbase 0.94.18 (p. 317); Impala 1.2.4 (p. 288); Mahout 0.9 (p. 94); Java: Oracle/Sun jdk-7u65; Perl 5.16.3; PHP 5.3.28; Python 2.6.9; R 3.0.2; Ruby 2.0; Scala 2.11.1
Notes: This Amazon EMR AMI version provides the following bug fixes and enhancements:
Major Feature Updates
· Allows unlimited steps over the lifetime of the cluster, with up to 256 ACTIVE or PENDING steps at a given time and display of up to 1,000 step records (including system steps). For more information, see the section called "Submit Work to a Cluster" (p. 517).
· Added Enhanced Networking for C3, R3, and I2 instance types. For more information, see the Enhanced Networking section in the AWS EC2 User Guide.
· Updated the Java version to 7u65. For more information, go to JDK 7 Update 65 Release.
· Added support for AWS SDK 1.7.8 for all components except Impala.
· Improved scalability of CloudWatch metrics support for Amazon EMR.
· Enabled fuse_dfs. For more information, see the MountableHDFS website.
Major Bug Fixes
· Fixed a bug that placed Application Master services on instances in the Task group, which may have resulted in the undesired termination of certain Application Master daemons.
· Fixed a bug that prevented clusters from moving or copying files larger than 5 GB.
· Fixed a bug that prevented users from launching Hive in local mode.
· Fixed a bug that prevented NodeManager from using all available mountpoints, which resulted in issues using ephemeral drives on certain instances.
· Fixed an issue in ResourceManager, which prevented users from accessing user interfaces on localhost.
· Fixed a bug in JobHistory that may have prevented storage of JobHistory logs in Amazon S3 buckets.
· Included the following Hadoop patches: HDFS-6701, HDFS-6460, HADOOP-10456, HDFS-6268, MAPREDUCE-5900.
· Backport of YARN-1864 to Hadoop 2.
· Fixed a performance regression in Hive.
· Hive is compiled against JDK 1.7.
· Included the following Hive patches: HIVE-6938, HIVE-7429.
· Fixed several Hbase bugs.
· The connector for Amazon Kinesis now supports all regions where Kinesis is available.

AMI Version 3.1.0 (released 15 May 2014)
Includes: Amazon Linux version 2014.03; Hadoop 2.4.0 (p. 88); Hive 0.11.0.2 (p. 255); Pig 0.12 (p. 302); Hbase 0.94.18 (p. 317); Impala 1.2.4 (p. 288); Mahout 0.9 (p. 94); Java: Oracle/Sun jdk-7u60; Perl 5.16.3; PHP 5.3.28; Python 2.6.9; R 3.0.2; Ruby 2.0
Notes: This Amazon EMR AMI version provides the following features:
Major Feature Updates
· Significant updates and improvements to support new packages and Amazon EMR features. For more information, see the following: Hadoop 2.4.0 (p. 88), Pig 0.12 (p. 302), Impala 1.2.4 (p. 288), Hbase 0.94.18 (p. 317), Mahout 0.9 (p. 94), and Ruby 2.0.
· Enabled Amazon S3 server-side encryption with Hadoop. For more information, see Create a Cluster With Amazon S3 Server-Side Encryption Enabled (p. 109).
· Updated the Java version to 7u60 (early access release). For more information, go to JDK 7 Update 60 Early Access Release.
· Updated Jetty to version 6.1.26.emr to fix Hadoop MapReduce issue MAPREDUCE-2980.
· Changed the log level for Hive UDAFPercentile to DEBUG.
Major Bug Fixes
· Fixed an issue encountered when no log-uri value is specified at cluster creation.
· Fixed the version utility to accurately display the Amazon Hadoop Distribution version.
· Fixed Hadoop to accept HADOOP_NAMENODE_HEAPSIZE and HADOOP_DATANODE_HEAPSIZE memory setting values.
· Replaced YARN_HEAPSIZE with YARN_RESOURCEMANAGER_HEAPSIZE, YARN_NODEMANAGER_HEAPSIZE, and YARN_PROXYSERVER_HEAPSIZE to allow more granularity when configuring. For more information, see Configuration of hadoop-user-env.sh (p. 568).
· Added the memory setting HADOOP_JOB_HISTORYSERVER_HEAPSIZE.
· Fixed an issue encountered with hdfs -get when used with an Amazon S3 path.
· Fixed an issue with the HTTPFS service for Hadoop.
· Fixed an issue that caused job failures after a previous job was killed.
· Other improvements and bug fixes.

AMI Version 3.0.4 (released 19 February 2014)
Includes: Amazon Linux version 2013.09; Hadoop 2.2.0 (p. 89); Hive 0.11.0.2 (p. 255); Pig 0.11.1.1 (p. 302); Hbase 0.94.7 (p. 316); Impala 1.2.1 (p. 288); Mahout 0.8; Java: Oracle/Sun jdk-7u60; Perl 5.10.1; PHP 5.3.28; Python 2.6.9; R 3.0.1; Ruby 1.8.7
Notes: This Amazon EMR AMI version provides the following features:
· Adds a connector for Amazon Kinesis, which allows users to process streaming data using standard Hadoop and ecosystem tools within Amazon EMR clusters. For more information, see Analyze Amazon Kinesis Data (p. 352).
· Fixes an issue in the yarn-site.xml configuration file, which resulted in the JobHistory server not being fully configured.
· Adds support for AWS SDK 1.7.0.

AMI Version 3.0.3 (released 11 February 2014)
Includes: Amazon Linux version 2013.03; Hadoop 2.2.0 (p. 89); Hive 0.11.0.2 (p. 255); Pig 0.11.1.1 (p. 302); Hbase 0.94.7 (p. 316); Impala 1.2.1 (p. 288); Mahout 0.8; Oracle/Sun jdk-7u45; Perl 5.10.1; PHP 5.3.28; Python 2.6.9; R 3.0.1; Ruby 1.8.7
Notes: This Amazon EMR AMI version provides the following features:
· Adds support for AWS SDK 1.6.10.
· Upgrades HttpClient to version 4.2 to be compatible with AWS SDK 1.6.10.
· Fixes a problem related to orphaned Amazon EBS volumes.
· Adds support for Hive 0.11.0.2.
· Upgrades Protobuf to version 2.5. Note: The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool.

AMI Version 3.0.2 (released 12 December 2013)
Includes: Amazon Linux version 2013.03; Hadoop 2.2.0 (p. 89); Hive 0.11.0.2 (p. 255); Pig 0.11.1.1 (p. 302); Hbase 0.94.7 (p. 316); Impala 1.2.1 (p. 288); Mahout 0.8; Oracle/Sun jdk-7u45; Perl 5.10.1; PHP 5.3.28; Python 2.6.9; R 3.0.1; Ruby 1.8.7
Notes: This Amazon EMR AMI version provides the following features:
· Adds support for Impala 1.2.1 with Hadoop 2. For more information, see Analyze Data with Impala (p. 274).
· Changes the uploadMultiParts function to use a retry policy.

AMI Version 3.0.1 (released 8 November 2013)
Includes: Amazon Linux version 2013.03; Hadoop 2.2.0 (p. 89); Hive 0.11.0.1 (p. 255); Pig 0.11.1.1 (p. 302); Hbase 0.94.7 (p. 316); Impala 1.2.1 (p. 288); Mahout 0.8; Oracle/Sun jdk-7u45; Perl 5.10.1; PHP 5.3.28; Python 2.6.9; R 3.0.1; Ruby 1.8.7
Notes: This Amazon EMR AMI version provides the following features:
· Adds support for viewing Hadoop 2 task attempt logs in the EMR console.
· Fixes an issue with R 3.0.1.

AMI Version 3.0.0 (released 28 October 2013)
Includes: Amazon Linux version 2013.03; Hadoop 2.2.0 (p. 89); Hive 0.11.0.1 (p. 255); Pig 0.11.1.1 (p. 302); Hbase 0.94.7 (p. 316); Impala 1.2.1 (p. 288); Mahout 0.8; Oracle/Sun jdk-7u45; Perl 5.10.1; PHP 5.3.28; Python 2.6.9; R 3.0.1; Ruby 1.8.7
Notes: This new major Amazon EMR AMI version provides the following features:
· This Amazon EMR AMI is based on the Amazon Linux Release 2012.09. For more information, see the Amazon Linux AMI 2012.09 Release Notes.
· Adds support for Hadoop 2.2.0. For more information, see Supported Hadoop Versions (p. 86).
· Adds support for HBase 0.94.7. For more information, go to the Apache web site.
· Adds Java 7 support for Hadoop, HBase, and Pig.


Hadoop 1 and Earlier AMIs

AMI Version Description Release Date

2.4.8 This Amazon EMR AMI version provides the following feature and bug fixes: 16 September 2014

Major Feature Update

· Allows unlimited steps over the lifetime of the cluster with up to 256 ACTIVE or PENDING steps at a given time and display of up to 1,000 step records (including system steps). For more information, see the section called "Submit Work to a Cluster" (p. 517).

Major Bug Fixes

· Fixes an issue with hbase-user-env.sh, which resulted in Hbase ignoring any settings made by this script. · Fixes an issue with s3distcp where using the .* regular expression causes an error.

2.4.7 In addition to other enhancements and bug fixes, this version of the Amazon EMR AMI corrects the following problems: 30 July 2014

· Fixes an issue with logs stored in /mnt/var/log, which may consume all of the volume's disk space. · Fixes a deadlock issue encountered when adding steps.

2.4.6 This Amazon EMR AMI version provides the following features: 15 May 2014

· Adds support for Cascading 2.5. · Adds support for new instance types. · Fixed a permissions issue with Hadoop. · Various other bug fixes and enhancements.



2.4.5 This Amazon EMR AMI version provides the following features: 27 March 2014

· Adds support for HVM AMIs in us-east-1, us-west-2, us-west-1, eu-west-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, and sa-east-1 regions. · Adds support for AWS SDK 1.7.0. · Adds support for Python 2.7. · Adds support for Hive 0.11.0.2. · Upgrades Protobuf to version 2.5. Note The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool. · Updates the Java version to 7u60 (early access release). For more information, go to JDK 7 Update 60 Early Access Release. · Updates Jetty to version 6.1.26.emr.1, which fixes Hadoop MapReduce issue MAPREDUCE-2980. · Fixes an issue encountered when no log-uri is specified at cluster creation. · Fixes version utility to accurately display Amazon Hadoop Distribution version. · Other improvements and bug fixes.

2.4.3 This Amazon EMR AMI version provides the following features: 3 January 2014

· Adds support for Python 2.7. · Updates Jetty to version 6.1.26.emr.1, which fixes the Hadoop MapReduce issue MAPREDUCE-2980. · Updates the Java version to 7u60 (early access release). For more information, go to JDK 7 Update 60 Early Access Release. · Adds support for Hive 0.11.0.2. · Upgrades Protobuf to version 2.5. Note The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool.



2.4.2 Same as the previous AMI version, with the following additions: 7 October 2013

· Fixed a bug in host resolution that limited map-side local data optimization. Customers who use Fair Scheduler may observe a change in job execution due to the emphasis the system puts on data locality. The schedule may now hold back tasks to run them locally. · Includes Hadoop 1.0.3, Java 1.7, Perl 5.10.1, Python 2.6.6, and R 2.11

2.4.1 Same as the previous AMI version, with the following additions: 20 August 2013

· Fixes a bug that causes the HBase shell not to work properly. · Fixes a bug that causes some clusters to fail with the error `concurrent modifications exception'. · Adds new logic in the instance controller to detect and reboot instances that have been blacklisted by Hadoop for an extended period of time. · Includes Hadoop 1.0.3, Java 1.7, Perl 5.10.1, Python 2.6.6, and R 2.11

2.4 Same as the previous AMI version, with the following additions: 1 August 2013

· Adds support for Java 7 with Hadoop and HBase. Other Amazon EMR features, such as Hive and Pig, continue to require Java 6. · Improved JobTracker detection and response time when reducers become stuck due to a problematic mapper. · Fixes a problem that some Hadoop reducers are unable to fetch map output data due to a bad mapper, causing job delays. · Adds FetchStatusMap to keep track of all fetch errors and success along with their time stamp. · Fixes a problem with "Text File Busy" errors when launching tasks. For more information, go to MAPREDUCE-2374.

2.3.6 Same as 2.3.5, with the following additions: 17 May 2013

· Fixes a problem in the Debian sources.list and preferences files that caused certain bootstrap actions to fail, including Ganglia. Customers using AMI versions 2.0.0 to 2.3.5 may notice an additional bootstrap action in their list named EMR Debian Patch.



2.3.5 Same as 2.3.3, with the following additions: 26 April 2013

· Fixes an S3DistCp bug which created invalid manifest file entries for certain URL encoded file names. · Improves log pushing functionality and adds a 7 day retention policy for on-cluster log files. Log files not modified for 7 or more days are deleted from the cluster. · Adds a streaming configuration option for not emitting the mapper key. For more information, go to MAPREDUCE-1785. · Adds the --s3ServerSideEncryption option to the S3DistCp tool. For more information, see S3DistCp Options (p. 390).

2.3.4 Deprecated 16 April 2013

2.3.3 Same as 2.3.2, with the following additions: 01 March 2013

· Improved CloudWatch LiveTaskTracker metric to take into account expired Hadoop TaskTrackers and minor improvements in Hadoop.

2.3.2 Same as 2.3.1, with the following additions: 07 February 2013

· Fixes an issue which prevented customers from using the debugging feature in the Amazon EMR console.

2.3.1 Same as 2.3.0, with the following additions: 24 December 2012

· Improves support for clusters running on hs1.8xlarge instances.

2.3.0 Same as 2.2.4, with the following additions: 20 December 2012

· Adds support for IAM roles. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

2.2.4 Same as 2.2.3, with the following additions: 6 December 2012

· Improves error handling in the Snappy decompressor. For more information, go to HADOOP-8151. · Fixes an issue with MapFile.Reader reading LZO or Snappy compressed files. For more information, go to HADOOP-8423. · Updates the kernel to the AWS version of 3.2.30-49.59.



2.2.3 Same as 2.2.1, with the following additions: 30 November 2012

· Improves HBase backup functionality. · Updates the AWS SDK for Java to version 1.3.23. · Resolves issues with the job tracker user interface. · Improves Amazon S3 file system handling in Hadoop. · Improves NameNode functionality in Hadoop.

2.2.2 Deprecated 23 November 2012

2.2.1 Same as 2.2.0, with the following additions: 30 August 2012

· Fixes an issue with HBase backup functionality. · Enables multipart upload by default for files larger than the Amazon S3 block size specified by fs.s3n.blockSize. For more information, see Configure Multipart Upload for Amazon S3 (p. 128).

2.2.0 Same as 2.1.3, with the following additions: 6 August 2012

· Adds support for Hadoop 1.0.3. · No longer includes Hadoop 0.18 and Hadoop 0.20.205.

Operating system: Debian 6.0.5 (Squeeze). Applications: Hadoop 1.0.3, Hive 0.8.1.3, Pig 0.9.2.2, HBase 0.92.0. Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7. File system: ext3 for root, xfs for ephemeral.

2.1.4 Same as 2.1.3, with the following additions: 30 August 2012

· Fixes issues in the Native Amazon S3 file system. · Enables multipart upload by default. For more information, see Configure Multipart Upload for Amazon S3 (p. 128).

2.1.3 Same as 2.1.2, with the following additions: 6 August 2012

· Fixes issues in HBase.

2.1.2 Same as 2.1.1, with the following additions: 6 August 2012

· Support for CloudWatch metrics when using MapR.

Improves the reliability of reporting metrics to CloudWatch.



2.1.1 Same as 2.1.0, with the following additions: 3 July 2012

· Improves the reliability of log pushing. · Adds support for HBase in Amazon VPC. · Improves DNS retry functionality.

2.1.0 Same as AMI 2.0.5, with the following additions: 12 June 2012

· Supports launching HBase clusters. For more information, see Store Data with HBase (p. 316). · Supports running MapR Edition M3 and Edition M5. For more information, see Using the MapR Distribution for Hadoop (p. 181). · Enables HDFS append by default; dfs.support.append is set to true in hdfs/hdfs-default.xml. The default value in code is also set to true. · Fixes a race condition in instance controller. · Changes mapreduce.user.classpath.first to default to true. This configuration setting indicates whether to load classes first from the cluster's JAR file or the Hadoop system lib directory. This change was made to provide a way for you to easily override classes in Hadoop. · Uses Debian 6.0.5 (Squeeze) as the operating system.

2.0.5 Note: Because of an issue with AMI 2.0.5, this version is deprecated. We recommend that you use a different AMI version instead. 19 April 2012

Same as AMI 2.0.4, with the following additions:

· Improves Hadoop performance by reinitializing the recycled compressor object for mappers only if they are configured to use the GZip compression codec for output. · Adds a configuration variable to Hadoop called mapreduce.jobtracker.system.dir.permission that can be used to set permissions on the system directory. For more information, see Setting Permissions on the System Directory (p. 160). · Changes InstanceController to use an embedded database rather than the MySQL instance running on the box. MySQL remains installed and running by default. · Improves the collectd configuration. For more information about collectd, go to http://collectd.org/. · Fixes a rare race condition in InstanceController. · Changes the default shell from dash to bash. · Uses Debian 6.0.4 (Squeeze) as the operating system.



2.0.4 Same as AMI 2.0.3, with the following additions: 30 January 2012

· Changes the default for fs.s3n.blockSize to 33554432 (32MiB). · Fixes a bug in reading zero-length files from Amazon S3.

2.0.3 Same as AMI 2.0.2, with the following additions: 24 January 2012

· Adds support for Amazon EMR metrics in CloudWatch. · Improves performance of seek operations in Amazon S3.

2.0.2 Same as AMI 2.0.1, with the following additions: 17 January 2012

· Adds support for the Python API Dumbo. For more information about Dumbo, go to https://github.com/klbostee/dumbo/wiki/. · The AMI now runs the Network Time Protocol Daemon (NTPD) by default. For more information about NTPD, go to http://en.wikipedia.org/wiki/Ntpd. · Updates the Amazon Web Services SDK to version 1.2.16. · Improves the way Amazon S3 file system initialization checks for the existence of Amazon S3 buckets. · Adds support for configuring the Amazon S3 block size to facilitate splitting files in Amazon S3. You set this in the fs.s3n.blockSize parameter. You set this parameter by using the configure-hadoop bootstrap action. The default value is 9223372036854775807 (8 EiB). · Adds a /dev/sd symlink for each /dev/xvd device. For example, /dev/xvdb now has a symlink pointing to it called /dev/sdb. Now you can use the same device names for AMI 1.0 and 2.0.

2.0.1 Same as AMI 2.0 except for the following bug fixes: 19 December 2011

· Task attempt logs are pushed to Amazon S3. · Fixed /mnt mounting on 32-bit AMIs. · Uses Debian 6.0.3 (Squeeze) as the operating system.

2.0.0 Operating system: Debian 6.0.2 (Squeeze). Applications: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1. Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7. File system: ext3 for root, xfs for ephemeral. 11 December 2011

Note: Added support for the Snappy compression/decompression library.



1.0.1 Same as AMI 1.0 except for the following change: 3 April 2012

· Updates sources.list to the new location of the Lenny distribution in archive.debian.org.

1.0.0 Operating system: Debian 5.0 (Lenny). Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20). Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7. File system: ext3 for root and ephemeral. Kernel: Red Hat. 26 April 2011

Note: This was the last AMI released before the CLI was updated to support AMI versioning. For backward compatibility, job flows launched with versions of the CLI downloaded before 11 December 2011 use this version.

Note The cc2.8xlarge instance type is supported only on AMI 2.0.0 or later. The hi1.4xlarge and hs1.8xlarge instance types are supported only on AMI 2.3 or later.

Choose a Version of Hadoop

The AWS version of Hadoop installed by Amazon EMR is based on Apache Hadoop, with patches and improvements added that make it work efficiently with AWS. Each Amazon EMR AMI has a default version of Hadoop associated with it. We recommend that you launch a cluster with the latest AMI version, running the default version of Hadoop whenever possible, as this gives you access to the most recent features and latest bug fixes.

If your application requires a different version of Hadoop than the default, you can specify that version of Hadoop when you launch the cluster. You can also choose to launch an Amazon EMR cluster using a MapR distribution of Hadoop. For more information, see Using the MapR Distribution for Hadoop (p. 181).

Topics · Supported Hadoop Versions (p. 86) · How Does Amazon EMR Hadoop Differ from Apache Hadoop? (p. 91) · Hadoop Patches Applied in Amazon EMR (p. 91) · Supported Mahout Versions (p. 94)

Supported Hadoop Versions

Amazon Elastic MapReduce (Amazon EMR) allows you to choose which version of Hadoop to run. You do this using the CLI and setting the --ami-version parameter, as shown in the following table. We recommend using the latest version of Hadoop to take advantage of performance enhancements and new functionality.


Note The AMI version determines the Hadoop version and the --hadoop-version parameter is no longer supported.

Hadoop Version Configuration Parameters

2.4.0 --ami-version 3.1.0 | 3.1.1 | 3.1.2 | 3.2.0 | 3.2.1

2.2.0 --ami-version 3.0.4 | 3.0.3 | 3.0.2 | 3.0.1

1.0.3 --ami-version 2.4.5 | 2.4.3 | 2.4.2

0.20.205 --ami-version 2.1.4

0.20 --ami-version 1.0

For details about the default configuration and software available on AMIs used by Amazon Elastic MapReduce (Amazon EMR), see Choose an Amazon Machine Image (AMI) (p. 53).
Note
The Asia Pacific (Sydney) Region and AWS GovCloud (US) support only Hadoop 1.0.3 and later. AWS GovCloud (US) additionally requires AMI 2.3.0 and later.

To specify the Hadoop version using the AWS CLI

· To specify the Hadoop version using the AWS CLI, type the create-cluster subcommand with the --ami-version parameter. The AMI version determines the version of Hadoop for Amazon EMR to use. For details about the version of Hadoop available on an AMI, see AMI Versions Supported in Amazon EMR (p. 60).

aws emr create-cluster --ami-version string \ --instance-count integer --instance-type string

For example, the following command launches a cluster running Hadoop 2.4.0 using AMI version 3.1.0:

aws emr create-cluster --ami-version 3.1.0 \ --instance-count 5 --instance-type m3.xlarge

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.
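If you need different counts or types for the master and core nodes, you can use the --instance-groups parameter instead. The following is a minimal sketch of that form; the instance types and counts shown here are placeholders rather than recommendations:

aws emr create-cluster --ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge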

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To specify the Hadoop version using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add the --ami-version option and specify the version number. The AMI version determines the version of Hadoop for Amazon EMR to use. The following example creates a waiting cluster running


Hadoop 2.4.0. Amazon EMR then launches the appropriate AMI for that version of Hadoop. For details about the version of Hadoop available on an AMI, see AMI Versions Supported in Amazon EMR (p. 60).

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Hadoop" \ --ami-version 3.1.0 \ --num-instances 5 --instance-type m1.large

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Hadoop" --ami-version 3.1.0 --num-instances 5 --instance-type m1.large

Hadoop 2.4.0 New Features

Hadoop 2.4.0 enhancements were primarily focused on HDFS and YARN. Please see below for the release highlights:

HDFS

· Full HTTPS support for HDFS (HDFS-5305) · Support for Access Control Lists (ACL) in HDFS (HDFS-4685) · Usage of protocol-buffers for HDFS FSImage for smooth operational upgrades (HDFS-5698)

YARN

· Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530) · Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

For a full list of new features and fixes available in Hadoop 2.4.0, see the Hadoop 2.4.0 Release Notes.

The following changes included in Hadoop 2.3.0 are relevant to Amazon EMR customers:

· Heterogeneous Storage for HDFS (HDFS-2832) · In-memory Cache for data resident in HDFS via DataNode (HDFS-4949)

Note While Amazon EMR generally supports the features listed in Hadoop Common Releases, the following features are not supported in this Amazon release of Hadoop 2.4.0:

· Native support for rolling upgrades in HDFS · HDFS Federation and HDFS NameNode High Availability (HA) · Support for Automatic Failover of the YARN ResourceManager (YARN-149)


Amazon EMR Enhancements

The following enhancements are available with AMIs 3.1.0 or later:

· Customers can now see their job logs in "Log folder S3 location" under subfolder "jobs" when logging and debugging are enabled. For more information about logging, see Configure Logging and Debugging (Optional) (p. 160). · Today, customers can specify an Amazon S3 bucket where task attempt logs are redirected. However, these logs are stored at the individual attempt level, which means a job might produce thousands of log files. Customers can now aggregate all of their task attempt logs to a smaller number of files to make it easy to view and analyze these logs. For more information about how to enable log aggregation, see the section called "Archive Log Files to Amazon S3" (p. 161).

Hadoop 2.2.0 New Features

Hadoop 2.2.0 supports the following new features:

· MapReduce NextGen (YARN) resource management system as a general big data processing platform. YARN is a new architecture that divides the two major functions of the JobTracker (resource management and job life-cycle management) into separate components and introduces a new ResourceManager. For more information, go to Apache Hadoop NextGen MapReduce (YARN) and View Web Interfaces Hosted on Amazon EMR Clusters (p. 487). · Pluggable Shuffle and Pluggable Sort. For more information, go to Pluggable Shuffle and Pluggable Sort. · Capacity Scheduler and Fair Scheduler. For more information, go to Capacity Scheduler and Fair Scheduler. · Hadoop Distributed File System (HDFS) snapshots. For more information, go to HDFS Snapshots. · Performance-related enhancements such as short-circuit read. For more information, go to HDFS Short-Circuit Local Reads. · Distributed job life cycle management by the application master · Various security improvements

For a full list of new features and fixes available in Hadoop 2.2.0, go to Hadoop 2.2.0 Release Notes. Note While Amazon EMR generally supports the features listed in Hadoop Common Releases, HDFS Federation and HDFS name node High Availability (HA) are not supported in this Amazon release of Hadoop 2.2.0. In addition, Hadoop 2.2.0 is not supported on m1.small instances.

Major Changes from Hadoop 1 to Hadoop 2

Current Hadoop 1 users should take notice of several major changes introduced in Hadoop 2:

· Updated Hadoop user interfaces with new URLs, including a new ResourceManager. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487). · Updated Hadoop configuration files. For more information, see JSON Configuration Files (p. 564). · Changes to bootstrap actions that configure Hadoop daemons. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Considerations for Moving to Hadoop 2.2.0

General availability (GA) for Hadoop 2.2 was announced on October 16, 2013. According to the Apache Software Foundation, Hadoop 2.2 "has achieved the level of stability and enterprise-readiness to earn the General Availability designation." Amazon recommends that customers continue running mission-critical applications on Hadoop 1.0.3 and make the switch only after carefully testing their applications on Hadoop 2.2. In general, the Hadoop community is moving from Hadoop 1 to Hadoop 2.

Hadoop 1.0 New Features

Hadoop 1.0.3 support in Amazon EMR includes the features listed in Hadoop Common Releases, including:

· A RESTful API to HDFS, providing a complete FileSystem implementation for accessing HDFS over HTTP. · Support for executing new writes in HBase while an hflush/sync is in progress. · Performance-enhanced access to local files for HBase. · The ability to run Hadoop, Hive, and Pig jobs as another user, similar to the following:

$ export HADOOP_USER_NAME=usernamehere

By exporting the HADOOP_USER_NAME environment variable, the job is executed by the specified username.
Note
If HDFS is used, you need to either change the permissions on HDFS to allow READ and WRITE access for the specified username, or you can disable permission checks on HDFS. To disable permission checks, set the configuration variable dfs.permissions to false in the mapred-site.xml file and then restart the namenodes, similar to the following:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

· S3 file split size variable renamed from fs.s3.blockSize to fs.s3.block.size, and the default is set to 64 MB. This is for consistency with the variable name added in patch HADOOP-5861.

Setting access permissions on files written to Amazon S3 is also supported in Hadoop 1.0.3 with Amazon EMR. For more information, see How to write data to an Amazon S3 bucket you don't own (p. 139).

For a list of the patches applied to the Amazon EMR version of Hadoop 1.0.3, see Hadoop 1.0.3 Patches (p. 91).

Hadoop 0.20 New Features

Hadoop 0.18 was not designed to efficiently handle multiple small files. The following enhancements in Hadoop 0.20 and later improve the performance of processing small files:

· Hadoop 0.20 and later assigns multiple tasks per heartbeat. A heartbeat is a method that periodically checks to see if the client is still alive. By assigning multiple tasks, Hadoop can distribute tasks to slave nodes faster, thereby improving performance. The time taken to distribute tasks is an important part of the processing time usage. · Historically, Hadoop processes each task in its own Java Virtual Machine (JVM). If you have many small files that take only a second to process, the overhead is great when you start a JVM for each task. Hadoop 0.20 and later can share one JVM for multiple tasks, thus significantly improving your processing time. · Hadoop 0.20 and later allows you to process multiple files in a single map task, which reduces the overhead associated with setting up a task. A single task can now process multiple small files.


Hadoop 0.20 and later also supports the following features:

· A new command line option, -libjars, enables you to include a specified JAR file in the class path of every task (see the sketch following this list). · The ability to skip individual records rather than entire files. In previous versions of Hadoop, failures in record processing caused the entire file containing the bad record to be skipped. Jobs that previously failed can now return partial results.
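For example, the following sketch shows one way -libjars might be used; the JAR names and class name are hypothetical, and the job's main class must parse generic options (for example, by running through ToolRunner) for -libjars to take effect:

hadoop jar my-application.jar com.example.MyJob -libjars /home/hadoop/lib/my-library.jar input output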

In addition to the Hadoop 0.18 streaming parameters, Hadoop 0.20 and later introduces the three new streaming parameters listed in the following table:

Parameter Definition

-files Specifies comma-separated files to copy to the map reduce cluster.

-archives Specifies comma-separated archives to restore to the compute machines.

-D Specifies a value for the key that you enter, in the form key=value.
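As an illustration, the following sketch shows how these parameters might appear in a streaming invocation on a Hadoop 1.x cluster; the streaming JAR path, bucket names, and script names are assumptions rather than values taken from this guide:

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=2 \
  -files s3://mybucket/scripts/mapper.py,s3://mybucket/scripts/reducer.py \
  -input s3://mybucket/input \
  -output s3://mybucket/output \
  -mapper mapper.py \
  -reducer reducer.py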

For a list of the patches applied to the Amazon EMR version of Hadoop 0.20.205, see Hadoop 0.20.205 Patches (p. 92).

How Does Amazon EMR Hadoop Differ from Apache Hadoop?

The AWS version of Hadoop installed when you launch an Amazon EMR cluster is based on Apache Hadoop, but has had several patches and improvements added to make it work efficiently on AWS. Where appropriate, improvements written by the Amazon EMR team have been submitted to the Apache Hadoop code base. For more information about the patches applied to AWS Hadoop, see Hadoop Patches Applied in Amazon EMR (p. 91).

Hadoop Patches Applied in Amazon EMR

The following sections detail the patches the Amazon Elastic MapReduce (Amazon EMR) team has applied to the Hadoop versions loaded on Amazon EMR AMIs.

Topics · Hadoop 1.0.3 Patches (p. 91) · Hadoop 0.20.205 Patches (p. 92)

Hadoop 1.0.3 Patches

The Amazon EMR team has applied the following patches to Hadoop 1.0.3 on the Amazon EMR AMI version 2.2.

Patch Description

All of the patches applied to the Amazon EMR version of Hadoop 0.20.205: See Hadoop 0.20.205 Patches (p. 92) for details.


HADOOP-5861 Files stored on the native Amazon S3 file system, those with URLs of the form s3n://, now report a block size determined by fs.s3n.block.size. For more information, go to https:// issues.apache.org/jira/browse/HADOOP-5861.

Status: Fixed Fixed in AWS Hadoop Version: 1.0.3 Fixed in Apache Hadoop Version: 0.21.0

HADOOP-6346 Supports specifying a pattern to RunJar.unJar that determines which files are unpacked. For more information, go to https:// issues.apache.org/jira/browse/HADOOP-6346.

Status: Fixed Fixed in AWS Hadoop Version: 1.0.3 Fixed in Apache Hadoop Version: 0.21.0

MAPREDUCE-967 Changes the TaskTracker node so it does not fully unjar job jars into the job cache directory. For more information, go to https:// issues.apache.org/jira/browse/MAPREDUCE-967.

Status: Fixed Fixed in AWS Hadoop Version: 1.0.3 Fixed in Apache Hadoop Version: 0.21.0

MAPREDUCE-2219 Changes the JobTracker service to remove the contents of mapred.system.dir during startup instead of removing the directory itself. For more information, go to https://issues.apache.org/jira/ browse/MAPREDUCE-2219.

Status: Fixed Fixed in AWS Hadoop Version: 1.0.3 Fixed in Apache Hadoop Version: 0.22.0

Hadoop 0.20.205 Patches

The Amazon EMR team has applied the following patches to Hadoop 0.20.205 on the Amazon EMR AMI version 2.0.

Patch Description

Add hadoop-lzo Install the hadoop-lzo third-party package. For more information about hadoop-lzo, go to https://github.com/kevinweil/hadoop-lzo

Status: Third-party Package Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: n/a


Install the hadoop-snappy library Add the hadoop-snappy library to provide access to the Snappy compression. For more information about this library, go to http://code.google.com/p/hadoop-snappy/.

Status: Third-party Library Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: n/a

MAPREDUCE-1597/2021/2046 Fixes to how CombineFileInputFormat handles split locations and files that can be split. For more information about these patches, go to https://issues.apache.org/jira/browse/MAPREDUCE-1597, https:// issues.apache.org/jira/browse/MAPREDUCE-2021, and https:// issues.apache.org/jira/browse/MAPREDUCE-2046.

Status: Resolved, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.22.0

HADOOP-6436 Remove the files generated by automake and autoconf of the native build and use the host's automake and autoconf to generate the files instead. For more information about this patch, go to https://issues.apache.org/jira/browse/HADOOP-6436.

Status: Closed, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.22.0,0.23.0

MAPREDUCE-2185 Prevent an infinite loop from occurring when creating splits using CombineFileInputFormat. For more information about this patch, go to https://issues.apache.org/jira/browse/MAPREDUCE-2185.

Status: Closed, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.23.0

HADOOP-7082 Change Configuration.writeXML to not hold a lock while outputting. For more information about this patch, go to https://issues.apache.org/ jira/browse/HADOOP-7082.

Status: Resolved, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.22.0

HADOOP-7015 Update RawLocalFileSystem#listStatus to deal with a directory that has changing entries, as in a multi-threaded or multi-process environment. For more information about this patch, go to https:// issues.apache.org/jira/browse/HADOOP-7015.

Status: Closed, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.23.0


HADOOP-4675 Update the Ganglia metrics to be compatible with Ganglia 3.1. For more information about this patch go to https://issues.apache.org/ jira/browse/HADOOP-4675.

Status: Resolved, Fixed Fixed in AWS Hadoop Version: 0.20.205 Fixed in Apache Hadoop Version: 0.22.0

Supported Mahout Versions

Amazon EMR currently supports the following Apache Mahout versions:

Mahout Version AMI Version Mahout Version Details

0.9 3.1.0 and later · New and improved Mahout website based on Apache CMS (MAHOUT-1245) · Early implementation of a Multi-Layer Perceptron (MLP) classifier (MAHOUT-1265) · Scala DSL Bindings for Mahout Math Linear Algebra (MAHOUT-1297) · Recommenders as Search (MAHOUT-1288) · Support for easy functional Matrix views and derivatives (MAHOUT-1300) · JSON output format for ClusterDumper (MAHOUT-1343) · Enabled randomised testing for all Mahout modules using Carrot RandomizedRunner (MAHOUT-1345) · Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering (MAHOUT-1361) · Upgrade to Lucene 4.6.1 (MAHOUT-1364)

0.8 3.0-3.0.4

0.8 2.2 and later (with bootstrap action installation)

For more information on Mahout releases, see: https://mahout.apache.org.


File Systems compatible with Amazon EMR

Amazon EMR and Hadoop provide a variety of file systems that you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using EMRFS. The following table lists the available file systems, with recommendations about when it's best to use them.

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and EMRFS are the two main file systems used with Amazon EMR.

HDFS (prefix: hdfs://, or no prefix)
HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information about how HDFS works, go to the Hadoop documentation.
HDFS is used by the master and core nodes. One advantage is that it's fast; a disadvantage is that it's ephemeral storage which is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate job-flow steps.

EMRFS (prefix: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
Note
Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

local file system
The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the life of its EC2 instance. Instance store volumes are ideal for storing temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information, see Amazon EC2 Instance Storage.


(Legacy) Amazon S3 block file system (prefix: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.

Access File Systems

You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.

To access a local HDFS

· Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.

hdfs:///path-to-data

/path-to-data

To access a remote HDFS

· Include the IP address of the master node in the URI, as shown in the following examples.

hdfs://master-ip-address/path-to-data

master-ip-address/path-to-data

To access Amazon S3

· Use the s3:// prefix.

s3://bucket-name/path-to-file-in-bucket


To access the Amazon S3 block file system

· Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.

The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TB in size to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated. Caution Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use EMRFS instead.

s3bfs://bucket-name/path-to-file-in-bucket

Configure EMR File System (EMRFS) (Optional)

The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your Amazon EMR cluster. EMRFS is an implementation of HDFS which allows Amazon EMR clusters to store data on Amazon S3. You can enable Amazon S3 server-side encryption and consistent view for EMRFS using the AWS Management Console or AWS CLI, or you can use a bootstrap action (with the CLI or SDK) to configure additional settings for EMRFS.

Enabling Amazon S3 server-side encryption allows you to encrypt objects written to Amazon S3 by EMRFS. Consistent view enables consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3. Enabling consistent view requires you to store EMRFS metadata in DynamoDB. If the metadata is not present, it is created for you.

Consistent View

EMR File System (EMRFS) consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.

Amazon S3 is designed for eventual consistency. For instance, buckets in the US Standard region provide eventual consistency on read-after-write and read-after-overwrite requests. Amazon S3 buckets in the US West (Oregon), US West (Northern California), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), and South America (Sao Paulo) regions provide read-after-write consistency for put requests of new objects and eventual consistency for overwrite put and delete requests. Therefore, if you are listing objects in an Amazon S3 bucket quickly after putting new objects, Amazon S3 does not provide a guarantee to return a consistent listing and it may be incomplete. This is more common in quick sequential MapReduce jobs which are using Amazon S3 as a data store.

EMRFS includes a command line utility on the master node, emrfs, which allows administrators to perform operations on metadata such as import, delete, and sync. For more information about the EMRFS CLI, see the section called "EMRFS CLI Reference" (p. 103).

For a given path, EMRFS returns the set of objects listed in the EMRFS metadata and those returned directly by Amazon S3. Because Amazon S3 is still the "source of truth" for the objects in a path, EMRFS ensures that everything in a specified Amazon S3 path is being processed regardless of whether it is tracked in the metadata. However, EMRFS consistent view only ensures that the objects in the folders which you are tracking are being checked for consistency. The following topics give further details about how to enable and use consistent view.


Note
If you directly delete objects from Amazon S3 that are being tracked in the EMRFS metadata, EMRFS sees an entry for that object in the metadata but not the object in an Amazon S3 list or get request. Therefore, EMRFS treats the object as inconsistent and throws an exception after it has exhausted retries. You should use EMRFS to delete objects in Amazon S3 that are being tracked in the consistent view, purge the entries in the metadata for objects directly deleted in Amazon S3, or sync the consistent view with Amazon S3 immediately after you delete objects directly from Amazon S3.
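For example, if you removed objects under a tracked path directly with the AWS CLI, one hedged approach is to run the emrfs sync subcommand on the master node immediately afterward so that the metadata again matches Amazon S3 (the bucket and path below are placeholders):

aws s3 rm s3://mybucket/tracked-path/old-object          # object deleted outside of EMRFS
/home/hadoop/bin/emrfs sync s3://mybucket/tracked-path   # bring the consistent view back in line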

Topics · How to Enable Consistent View (p. 98) · Objects Tracked By EMRFS (p. 99) · Retry Logic (p. 99) · EMRFS Metadata (p. 100) · Configuring Consistent View (p. 102) · EMRFS CLI Reference (p. 103)

How to Enable Consistent View

You can enable Amazon S3 server-side encryption or consistent view for EMRFS using the AWS Management Console or AWS CLI, or you can use a bootstrap action to configure additional settings for EMRFS.

To configure consistent view using the console

1. Click Create Cluster.
2. Navigate to the File System Configuration section.
3. To enable Consistent view, click Enabled.
4. For EMRFS Metadata store, type the name of your metadata store. The default value is EmrFSMetadata. If the EmrFSMetadata table does not exist, it is created for you in DynamoDB.
Note
Amazon EMR does not automatically remove the EMRFS metadata from DynamoDB when the cluster is terminated.
5. For Number of retries, type an integer value. This value represents the number of times EMRFS retries calling Amazon S3 if an inconsistency is detected. The default value is 5.
6. For Retry period (in seconds), type an integer value. This value represents the amount of time that lapses before EMRFS retries calling Amazon S3. The default value is 10.
Note
Subsequent retries use an exponential backoff.

To launch a cluster with consistent view enabled using the AWS CLI

Note
You need to install the current version of the AWS CLI. To download the latest release, see http://aws.amazon.com/cli/.

Type the following command to launch an Amazon EMR cluster with consistent view enabled.

aws emr create-cluster --instance-type m1.large --instance-count 3 --emrfs Consistent=true --ami-version=3.2.1 --ec2-attributes KeyName=myKey

To check if consistent view is enabled using the AWS Management Console


To check whether consistent view is enabled in the console, navigate to the Cluster List and click on your cluster name to view Cluster Details. The "EMRFS consistent view" field has a value of Enabled or Disabled.

To check if consistent view is enabled by examining the emrfs-site.xml file

You can check if consistency is enabled by inspecting the emrfs-site.xml configuration file on the master node of the cluster. If the Boolean value for fs.s3.consistent is set to true then consistent view is enabled for file system operations involving Amazon S3.
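For example, a quick way to inspect this on the master node is to search the file for the property; this is a sketch that assumes the name and value elements appear on adjacent lines, as is typical for Hadoop-style XML configuration files:

grep -A 1 "fs.s3.consistent<" /home/hadoop/conf/emrfs-site.xml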

Objects Tracked By EMRFS

EMRFS creates a consistent view of objects in Amazon S3 by adding information about those objects to the EMRFS metadata. EMRFS adds these listings to its metadata when:

· An object is written by EMRFS during the course of an Amazon EMR job · An object is synced with or imported to EMRFS metadata by using the EMRFS CLI

Objects read by EMRFS are not automatically added to the metadata. When an object is deleted by EMRFS, a listing still remains in the metadata with a deleted state until that listing is purged using the EMRFS CLI. To learn more about the CLI, see the section called "EMRFS CLI Reference" (p. 103). To learn more about purging listings in the EMRFS metadata, please see the section called "EMRFS Metadata" (p. 100).

For every Amazon S3 operation, EMRFS checks the metadata for information about the set of objects in consistent view. If EMRFS finds that Amazon S3 is inconsistent during one of these operations, it will retry the operation according to parameters defined in emrfs-site.xml. After retries are exhausted, it will either throw a ConsistencyException or log the exception and continue the workflow. For more information about this retry logic, see the section called "Retry Logic" (p. ?). You can find ConsistencyExceptions in your logs, for example:

· listStatus: No s3 object for metadata item /S3_bucket/dir/object · getFileStatus: Key dir/file is present in metadata but not s3

If you delete an object that is being tracked in the EMRFS consistent view directly from Amazon S3, EMRFS will treat that object as inconsistent because it will still be listed in the metadata as present in Amazon S3. If your metadata becomes out of sync with the objects it is tracking in Amazon S3, you can use the sync subcommand on the EMRFS CLI to reset the listings in the metadata to reflect what is currently in Amazon S3. To find out whether there is a discrepancy between the metadata and Amazon S3, you can use the diff subcommand on the EMRFS CLI to compare them. Finally, EMRFS only has a consistent view of the objects referenced in the metadata; there can be other objects in the same Amazon S3 path that are not being tracked. When EMRFS lists the objects in an Amazon S3 path, it will return the superset of the objects being tracked in the metadata and those in that Amazon S3 path.

Retry Logic

EMRFS will try to verify list consistency for objects tracked in its metadata for a specific number of retries. The default is 5. If the number of retries is exceeded, the originating job returns a failure unless fs.s3.consistent.throwExceptionOnInconsistency is set to false, in which case it only logs the objects tracked as inconsistent. EMRFS uses an exponential backoff retry policy by default, but you can also set it to a fixed policy. Users may also want to retry for a certain period of time before proceeding with the rest of their job without throwing an exception. They can achieve this by setting fs.s3.consistent.throwExceptionOnInconsistency to false, fs.s3.consistent.retryPolicyType to fixed, and fs.s3.consistent.retryPeriodSeconds to the desired value. The following example creates a cluster with consistency enabled, which logs inconsistencies and sets a fixed retry interval of 10 seconds:

Setting retry period to a fixed amount


aws emr create-cluster --ami-version 3.2.1 --instance-type m1.large \
--emrfs Consistent=true,Args=[fs.s3.consistent.throwExceptionOnInconsistency=false,\
fs.s3.consistent.retryPolicyType=fixed,fs.s3.consistent.retryPeriodSeconds=10] \
--ec2-attributes KeyName=myKey

For more information, see the section called "Configuring Consistent View" (p. ?).

EMRFS Metadata

Amazon EMR tracks consistency using a DynamoDB table to store object state. EMRFS consistent view creates and uses EMRFS metadata stored in a DynamoDB table to maintain a consistent view of Amazon S3 and this consistent view can be shared by multiple clusters. EMRFS creates and uses this metadata to track objects in Amazon S3 folders which have been synced with or created by EMRFS. The metadata is used to track all operations (read, write, update, and copy), and no actual content is stored in it. This metadata is used to validate whether the objects or metadata received from Amazon S3 matches what is expected. This confirmation gives EMRFS the ability to check list consistency and read-after-write consistency for new objects EMRFS writes to Amazon S3 or objects synced with EMRFS.

How to add entries to metadata

You can use the sync or import subcommands to add entries to metadata. sync will simply reflect the state of the Amazon S3 objects in a path while import is used strictly to add new entries to the metadata. For more information, see the section called "EMRFS CLI Reference" (p. 103).

How to check differences between metadata and objects in Amazon S3

To check for differences between the metadata and Amazon S3, use the diff subcommand of the EMRFS CLI. For more information, see the section called "EMRFS CLI Reference" (p. 103).

How to know if metadata operations are being throttled

EMRFS sets default throughput capacity limits on the metadata for its read and write operations at 500 and 100 units, respectively. Large numbers of objects or buckets may cause operations to exceed this capacity, at which point they will be throttled by DynamoDB. For example, an application may cause EMRFS to throw a ProvisionedThroughputExceededException if you are performing an operation that exceeds these capacity limits. Upon throttling, the EMRFS CLI tool attempts to retry writing to the DynamoDB table using exponential backoff until the operation finishes or it reaches the maximum retry value for writing objects from EMR to Amazon S3.

You can also view Amazon CloudWatch metrics for your EMRFS metadata in the DynamoDB console, where you can see the number of throttled read and/or write requests. If you do have a non-zero value for throttled requests, your application may potentially benefit from increasing allocated throughput capacity for read or write operations. You may also realize a performance benefit if you see that your operations are approaching the maximum allocated throughput capacity in reads or writes for an extended period of time.
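For example, one way to raise the provisioned capacity on the default metadata table is to run the EMRFS CLI on the master node, as described in the EMRFS CLI Reference later in this section; the capacity values below are placeholders:

emrfs set-metadata-capacity --metadata-name EmrFSMetadata --read-capacity 800 --write-capacity 200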

Throughput characteristics for notable EMRFS operations

The default for read and write operations is 500 and 100 throughput capacity units, respectively. The following performance characteristics will give you an idea of what throughput is required for certain operations. These tests were performed using a single-node m3.large cluster. All operations were single threaded. Performance will differ greatly based on particular application characteristics and it may take experimentation to optimize file system operations.


Operation | Average read-per-second | Average write-per-second
create (object) | 26.79 | 6.70
delete (object) | 10.79 | 10.79
delete (directory containing 1000 objects) | 21.79 | 338.40
getFileStatus (object) | 34.70 | 0
getFileStatus (directory) | 19.96 | 0
listStatus (directory containing 1 object) | 43.31 | 0
listStatus (directory containing 10 objects) | 44.34 | 0
listStatus (directory containing 100 objects) | 84.44 | 0
listStatus (directory containing 1,000 objects) | 308.81 | 0
listStatus (directory containing 10,000 objects) | 416.05 | 0
listStatus (directory containing 100,000 objects) | 823.56 | 0
listStatus (directory containing 1M objects) | 882.36 | 0
mkdir (continuous for 120 seconds) | 24.18 | 4.03
mkdir | 12.59 | 0
rename (object) | 19.53 | 4.88
rename (directory containing 1000 objects) | 23.22 | 339.34

To submit a step that purges old data from your metadata store

You may wish to remove particular entries in the DynamoDB-based metadata. This can help reduce storage costs associated with the table. You can purge particular entries manually or programmatically by using the EMRFS CLI delete subcommand. However, if you delete entries from the metadata, EMRFS no longer makes any checks for consistency.

You can purge entries programmatically after a job completes by submitting a final step to your cluster that runs a command on the EMRFS CLI. For instance, type the following command to submit a step to your cluster that deletes all entries older than two days.

aws emr add-steps --cluster-id j-2AL4XXXXXX5T9 --steps Type=CUSTOM_JAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/bin/emrfs,delete,--time,2,--time-unit,days]

{
    "StepIds": [
        "s-B12345678902"
    ]
}

Use the StepId value returned to check the logs for the result of the operation.

Configuring Consistent View

You can configure additional settings for consistent view by providing them in the /home/hadoop/conf/emrfs-site.xml file, using either the AWS CLI or a bootstrap action. For example, you can choose a different default DynamoDB throughput by supplying the following arguments to the CLI --emrfs option or bootstrap action:

Changing default metadata read and write values at cluster launch

aws emr create-cluster --ami-version 3.2.1 --instance-type m1.large \
--emrfs Consistent=true,Args=[fs.s3.consistent.metadata.read.capacity=600,\
fs.s3.consistent.metadata.write.capacity=300] --ec2-attributes KeyName=myKey

aws emr create-cluster --ami-version 3.2.1 --instance-type m1.large \
--bootstrap-actions Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=[-e,fs.s3.consistent=true,-e,fs.s3.consistent.metadata.read.capacity=600,\
-e,fs.s3.consistent.metadata.write.capacity=300] --ec2-attributes KeyName=myKey

The following options can be set using bootstrap action or AWS CLI --emrfs arguments. For information about those arguments, see the AWS Command Line Interface Reference.

emrfs-site.xml properties for consistent view

fs.s3.consistent (default: false)
When set to true, this property configures EMRFS to use DynamoDB to provide consistency.

fs.s3.consistent.retryPolicyType (default: exponential)
This property identifies the policy to use when retrying for consistency issues. Options include: exponential, fixed, or none.

fs.s3.consistent.retryPeriodSeconds (default: 10)
This property sets the length of time to wait between consistency retry attempts.

fs.s3.consistent.retryCount (default: 5)
This property sets the maximum number of retries when inconsistency is detected.

fs.s3.consistent.throwExceptionOnInconsistency (default: true)
This property determines whether to throw or log a consistency exception. When set to true, a ConsistencyException is thrown.

fs.s3.consistent.metadata.autoCreate (default: true)
When set to true, this property enables automatic creation of metadata tables.

fs.s3.consistent.metadata.tableName (default: EmrFSMetadata)
This property specifies the name of the metadata table in DynamoDB.

fs.s3.consistent.metadata.read.capacity (default: 500)
This property specifies the DynamoDB read capacity to provision when the metadata table is created.

fs.s3.consistent.metadata.write.capacity (default: 250)
This property specifies the DynamoDB write capacity to provision when the metadata table is created.

fs.s3.consistent.fastList (default: true)
When set to true, this property uses multiple threads to list a directory (when necessary). Consistency must be enabled in order to use this property.

fs.s3.consistent.fastList.prefetchMetadata (default: false)
When set to true, this property enables metadata prefetching for directories containing more than 20,000 items.

EMRFS CLI Reference

The EMRFS CLI is installed by default on all cluster master nodes created using AMI 3.2.1 or greater. You use the EMRFS CLI to manage the metadata, which tracks when objects have a consistent view. Note The emrfs command is only supported with VT100 terminal emulation. However, it may work with other terminal emulator modes.

The emrfs top-level command supports the following structure.

emrfs [describe-metadata | set-metadata-capacity | delete-metadata | create-metadata | \
list-metadata-stores | diff | delete | sync | import] [options] [arguments]

emrfs command

The emrfs command accepts the following [[options]] (with or without arguments).

-a AWS_ACCESS_KEY_ID | --access-key AWS_ACCESS_KEY_ID: The AWS access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, AWS_ACCESS_KEY_ID is set to the access key used to create the cluster. (Required: No)

-s AWS_SECRET_ACCESS_KEY | --secret-key AWS_SECRET_ACCESS_KEY: The AWS secret key associated with the access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, AWS_SECRET_ACCESS_KEY is set to the secret key associated with the access key used to create the cluster. (Required: No)

-v | --verbose: Makes output verbose. (Required: No)

-h | --help: Displays the help message for the emrfs command with a usage statement. (Required: No)

describe-metadata sub-command

The describe-metadata sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

describe-metadata example

The following example describes the default metadata table.

$ emrfs describe-metadata
EmrFSMetadata
read-capacity: 500
write-capacity: 100
status: ACTIVE
approximate-item-count (6 hour delay): 12

set-metadata-capacity sub-command

The set-metadata-capacity sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

-r READ_CAPACITY | --read-capacity READ_CAPACITY: The requested read throughput capacity for the metadata table. If the READ_CAPACITY argument is not supplied, the default value is 500. (Required: No)

-w WRITE_CAPACITY | --write-capacity WRITE_CAPACITY: The requested write throughput capacity for the metadata table. If the WRITE_CAPACITY argument is not supplied, the default value is 100. (Required: No)

set-metadata-capacity example


The following example sets the read throughput capacity to 600 and the write capacity to 150 for a metadata table named EmrMetadataAlt.

$ emrfs set-metadata-capacity --metadata-name EmrMetadataAlt --read-capacity 600 --write-capacity 150
read-capacity: 500
write-capacity: 100
status: UPDATING
approximate-item-count (6 hour delay): 0

delete-metadata sub-command

The delete-metadata sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

delete-metadata example

The following example deletes the default metadata table.

$ emrfs delete-metadata
[no output]

create-metadata sub-command

The create-metadata sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

-r READ_CAPACITY | --read-capacity READ_CAPACITY: The requested read throughput capacity for the metadata table. If the READ_CAPACITY argument is not supplied, the default value is 500. (Required: No)

-w WRITE_CAPACITY | --write-capacity WRITE_CAPACITY: The requested write throughput capacity for the metadata table. If the WRITE_CAPACITY argument is not supplied, the default value is 100. (Required: No)

create-metadata example

The following example creates a metadata table named EmrFSMetadataAlt.

$ emrfs create-metadata -m EmrFSMetadataAlt
Creating metadata: EmrFSMetadataAlt
EmrFSMetadataAlt
read-capacity: 500
write-capacity: 100
status: ACTIVE
approximate-item-count (6 hour delay): 0

list-metadata-stores sub-command

The list-metadata-stores sub-command has no options.

list-metadata-stores example

The following example lists your metadata tables.

$ emrfs list-metadata-stores
EmrFSMetadata

diff sub-command

The diff sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

s3://s3Path: The path to the Amazon S3 bucket you are tracking for consistent view that you wish to compare to the metadata table. Buckets sync recursively. (Required: Yes)

diff example

The following example compares the default metadata table to an Amazon S3 bucket.

$ emrfs diff s3://elasticmapreduce/samples/cloudfront
BOTH | MANIFEST ONLY | S3 ONLY
DIR elasticmapreduce/samples/cloudfront
DIR elasticmapreduce/samples/cloudfront/code/
DIR elasticmapreduce/samples/cloudfront/input/
DIR elasticmapreduce/samples/cloudfront/logprocessor.jar
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-14.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-15.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-16.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-17.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-18.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-19.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-20.WxYz1234
DIR elasticmapreduce/samples/cloudfront/code/cloudfront-loganalyzer.tgz

delete sub-command


The delete sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

s3://s3Path: The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively. (Required: Yes)

-t TIME | --time TIME: The expiration time (interpreted using the time unit argument). All metadata entries older than the TIME argument are deleted for the specified bucket.

-u UNIT | --time-unit UNIT: The measure used to interpret the time argument (nanoseconds, microseconds, milliseconds, seconds, minutes, hours, or days). If no argument is specified, the default value is days.

--read-consumption READ_CONSUMPTION: The requested amount of available read throughput used for the delete operation. If the READ_CONSUMPTION argument is not specified, the default value is 500. (Required: No)

--write-consumption WRITE_CONSUMPTION: The requested amount of available write throughput used for the delete operation. If the WRITE_CONSUMPTION argument is not specified, the default value is 100. (Required: No)

delete example

The following example removes all objects in an Amazon S3 bucket from the tracking metadata for consistent view.

$ emrfs delete s3://elasticmapreduce/samples/cloudfront
entries deleted: 11

import sub-command

The import sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

s3://s3Path: The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively. (Required: Yes)

--read-consumption READ_CONSUMPTION: The requested amount of available read throughput used for the import operation. If the READ_CONSUMPTION argument is not specified, the default value is 500. (Required: No)

--write-consumption WRITE_CONSUMPTION: The requested amount of available write throughput used for the import operation. If the WRITE_CONSUMPTION argument is not specified, the default value is 100. (Required: No)

import example

The following example imports all objects in an Amazon S3 bucket with the tracking metadata for consistent view. All unknown keys are ignored.

$ emrfs import s3://elasticmapreduce/samples/cloudfront

sync sub-command

The sync sub-command accepts the following [[options]] (with or without arguments).

-m METADATA_NAME | --metadata-name METADATA_NAME: METADATA_NAME is the name of the DynamoDB metadata table. If the METADATA_NAME argument is not supplied, the default value is EmrFSMetadata. (Required: No)

s3://s3Path: The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively. (Required: Yes)

--read-consumption READ_CONSUMPTION: The requested amount of available read throughput used for the sync operation. If the READ_CONSUMPTION argument is not specified, the default value is 500. (Required: No)

--write-consumption WRITE_CONSUMPTION: The requested amount of available write throughput used for the sync operation. If the WRITE_CONSUMPTION argument is not specified, the default value is 100. (Required: No)

sync example

The following example syncs all objects in an Amazon S3 bucket with the tracking metadata for consistent view. All unknown keys are deleted.

$ emrfs sync s3://elasticmapreduce/samples/cloudfront
Synching samples/cloudfront                0 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/code/          1 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/               2 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/input/         9 added | 0 updated | 0 removed | 0 unchanged
Done synching s3://elasticmapreduce/samples/cloudfront   9 added | 0 updated | 1 removed | 0 unchanged
creating 3 folder key(s)
folders written: 3

Submitting EMRFS CLI Commands as Steps

To add an Amazon S3 bucket to the tracking metadata for consistent view (AWS SDK for Python)

The following example shows how to use the emrfs utility on the master node by leveraging the AWS CLI or API and the script-runner.jar to run the emrfs command as a step. The example uses the AWS SDK for Python (Boto) to add a step to a cluster which adds objects in an Amazon S3 bucket to the default EMRFS metadata table.

from boto.emr import connect_to_region, JarStep

# Connect to Amazon EMR in the region where the cluster is running.
emr = connect_to_region("us-east-1")

# Run the emrfs sync command on the master node through script-runner.jar.
myStep = JarStep(name='Boto EMRFS Sync',
                 jar='s3://elasticmapreduce/libs/script-runner/script-runner.jar',
                 action_on_failure="CONTINUE",
                 step_args=['/home/hadoop/bin/emrfs', 'sync',
                            's3://elasticmapreduce/samples/cloudfront'])

stepId = emr.add_jobflow_steps("j-2AL4XXXXXX5T9", steps=[myStep]).stepids[0].value

You can use the stepId value returned to check the logs for the result of the operation.

Create a Cluster With Amazon S3 Server-Side Encryption Enabled

Amazon S3 server-side encryption (SSE) is supported with Amazon EMR on AMIs 3.2.1 or later. This allows customers to write directly through to Amazon S3 locations that have policies enforcing encryption. To launch a cluster with server-side encryption, you can use the AWS Management Console, AWS CLI, or the configure-hadoop bootstrap action to set fs.s3.enableServerSideEncryption to "true".

To configure server-side encryption using the console

1. Click Create Cluster.
2. Navigate to the File System Configuration section.
3. To use Server-side encryption, click Enabled.
4. Click Create cluster.

To launch a cluster with Amazon S3 server-side encryption enabled

Type the following command to launch an Amazon EMR cluster with Amazon S3 server-side encryption enabled.

aws emr create-cluster --ami-version 3.2.1 --instance-count 3 --instance-type m1.large --emrfs SSE=True


To launch a cluster with Amazon S3 server side encryption enabled that specifies an encryption algorithm

Type the following command to launch an Amazon EMR cluster with Amazon S3 server-side encryption enabled that specifies AES256 as the encryption algorithm.

aws emr create-cluster --instance-type m3.xlarge --ami-version 3.2.1 --emrfs SSE=true,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]

Note
SSE is a global setting. When enabled, all Amazon S3 write actions that happen through Hadoop use SSE.

emrfs-site.xml properties for server-side encryption

fs.s3.enableServerSideEncryption (default: false)
When set to true, objects stored in Amazon S3 are encrypted using server-side encryption.

fs.s3.serverSideEncryptionAlgorithm (default: AES256)
When using server-side encryption, this property determines the algorithm used to encrypt data.
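If you prefer to set these properties with the configure-hadoop bootstrap action rather than the --emrfs option, the following sketch shows one possible form, modeled on the consistent view example earlier in this section; treat the use of the -e flag for these particular properties as an assumption rather than a documented requirement:

aws emr create-cluster --ami-version 3.2.1 --instance-count 3 --instance-type m1.large \
--bootstrap-actions Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=[-e,fs.s3.enableServerSideEncryption=true,-e,fs.s3.serverSideEncryptionAlgorithm=AES256] \
--ec2-attributes KeyName=myKey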

Create Bootstrap Actions to Install Additional Software (Optional)

You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change the Hadoop configuration settings.

Contents · Bootstrap Action Basics (p. 110) · Using Predefined Bootstrap Actions (p. 111) · Using Custom Bootstrap Actions (p. 116)

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo.
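As an illustration, a custom bootstrap action script that needs root privileges might look like the following minimal sketch; the bucket, archive name, and installation directory are placeholders:

#!/bin/bash
# Runs as the hadoop user; use sudo for changes that require root.
set -e
hadoop fs -copyToLocal s3://mybucket/tools/my-tool.tar.gz /tmp/my-tool.tar.gz
sudo tar -xzf /tmp/my-tool.tar.gz -C /usr/local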

All Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters from the CLI or API.

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster.

When you use the CLI, you can pass references to bootstrap action scripts to Amazon EMR by adding the --bootstrap-action parameter when you create the cluster. The syntax for a --bootstrap-action parameter is as follows:


AWS CLI

--bootstrap-action Path="s3://mybucket/filename",Args=[arg1,arg2]

Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

--bootstrap-action s3://mybucket/filename --args "arg1,arg2"

If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster lastStateChangeReason error code to identify failures caused by a bootstrap action.

Using Predefined Bootstrap Actions

Amazon EMR provides predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Amazon EMR by using the bootstrap-action parameter.

Contents
· Configure Daemons (p. 111)
· Configure Hadoop Settings with a Bootstrap Action (p. 113)
· Run If (p. 115)
· Shutdown Actions (p. 116)

Configure Daemons

Use this predefined bootstrap action to specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

The following table describes the valid parameters for the script. In the table, daemon can be namenode, datanode, jobtracker, tasktracker, or client (Hadoop 1.x) or namenode, datanode, resourcemanager, nodemanager, or client (Hadoop 2.x). For example, --namenode-heap-size=2048,--namenode-opts=\"-XX:GCTimeRatio=19\"

Configuration Parameter Description

--daemon-heap-size Sets the heap size in megabytes for the specified daemon.

--daemon-opts Sets additional Java options for the specified daemon.

--replace Replaces the existing hadoop-user-env.sh file if it exists.

In Hadoop 1.x, --client-heap-size has no effect. Instead, change the client heap size using the --client-opts=\"-Xmx#####\" equivalent, where ##### is numeric.


The configure-daemons bootstrap action supports Hadoop 2.x with a configuration file, yarn-site.xml. Its configuration file keyword is yarn.

The following examples set the NameNode JVM heap size to 2048 MB and configure a JVM GC option for the NameNode.

To set the NameNode heap size using the AWS CLI

· When using the AWS CLI, specify Path and Args as a comma-separated list:

aws emr create-cluster --instance-count integer --instance-type string --ami-version string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --instance-count 5 --instance-type m3.xlarge --ami-version 3.2.1 \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-daemons,Args=["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRatio=19"]

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

Alternatively, you can supply the bootstrap action configuration in a JSON file if you have a long list of arguments or multiple bootstrap actions. For example, the JSON file configuredaemons.json would look like the following:

[ { "Path": "s3://elasticmapreduce/bootstrap-actions/configure-daemons", "Args": ["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRa tio=19"], "Name": "Configure Daemons" } ]

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To set the NameNode heap size using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command.

· Linux, UNIX, and Mac OS X:

./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

· Windows:

ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

Configure Hadoop Settings with a Bootstrap Action

You can use this bootstrap action to set cluster-wide Hadoop settings. The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This script provides the following command line options:

· --keyword-config-file: Merges the existing Hadoop configuration with a user-specified XML configuration file that you upload to Amazon S3 or the local filesystem. The user-specified file can be named anything.
· --keyword-key-value: Overrides specific key-value pairs in the Hadoop configuration files.

With both options, replace --keyword with a keyword (or use the single-character shortcut instead) that represents one of the Hadoop configuration files described in the following table. Because the single-character shortcuts can be used together in the same command, an uppercase character indicates that the shortcut refers to a configuration file and a lowercase character indicates that the shortcut refers to a key-value pair. If you specify multiple options, the later options override the earlier ones.

Configuration File Name           Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml                     core                         C                    c
hadoop-default.xml (deprecated)   default                      D                    d
hadoop-site.xml (deprecated)      site                         S                    s
hdfs-site.xml                     hdfs                         H                    h
mapred-site.xml                   mapred                       M                    m
yarn-site.xml                     yarn                         Y                    y

The following example shows how to use the configuration file keywords ('mapred' in this example) to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and set the maximum map tasks value to 2 in the mapred-site.xml file. The configuration file that you provide in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:


<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>4</value>
  </property>
</configuration>

The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

For Hadoop 0.18, the name and location of the configuration file is /conf/hadoop-site.xml.

The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.

To change the maximum number of map tasks using the AWS CLI

· When using the AWS CLI, specify Path and Args as a comma-separated list:

aws emr create-cluster --instance-count integer --instance-type string --ami-version string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --instance-count 3 --instance-type m3.xlarge --ami-version 3.2.1 \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-M","s3://myawsbucket/config.xml","-m","mapred.tasktracker.map.tasks.maximum=2"]

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

Alternatively, you can provide a JSON file if you have a long list of arguments or multiple bootstrap actions. For example, the JSON file configurehadoop.json would look like this:

[ { "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop", "Args": ["-M","s3://myawsbucket/config.xml","-m","mapred.tasktrack er.map.tasks.maximum=2"], "Name": "Configure Hadoop" } ]

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To change the maximum number of map tasks using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command.

· Linux, UNIX, and Mac OS X:

./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-M,s3://mybucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

· Windows:

ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

Run If

Use this predefined bootstrap action to run a command conditionally when an instance-specific value is found in the instance.json or job-flow.json file. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.

The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

The following example echoes the string "running on master node" if the node is a master.

To run a command conditionally using the AWS CLI

· When using the AWS CLI, specify Path and Args as a comma-separated list:

aws emr create-cluster --instance-count integer --instance-type string --ami-version string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --instance-count 3 --instance-type m3.xlarge --ami-version 3.2.1 \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo running on master node"]

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To run a command conditionally using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. Notice that the optional arguments for the --args parameter are separated with commas.

· Linux, Unix, and Mac OS X:

./elastic-mapreduce --create --alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \ --args "instance.isMaster=true,echo running on master node"

· Windows:

ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"

Shutdown Actions

A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.
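For example, a custom bootstrap action might register a shutdown action like the following sketch; the script name, log file, and bucket are hypothetical placeholders, not values from this guide.

#!/bin/bash
# Hypothetical bootstrap action: registers a shutdown action that copies a
# local log file to Amazon S3 when the cluster is terminated.
set -e
SHUTDOWN_DIR=/mnt/var/lib/instance-controller/public/shutdown-actions
mkdir -p "$SHUTDOWN_DIR"
cat > "$SHUTDOWN_DIR/copy-app-log.sh" <<'EOF'
#!/bin/bash
# Runs at cluster termination; must complete within 60 seconds.
hadoop fs -put /mnt/var/log/myapp.log s3://mybucket/shutdown-logs/ || true
EOF
chmod +x "$SHUTDOWN_DIR/copy-app-log.sh"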

Shutdown action scripts are not guaranteed to run if the node terminates with an error.

Using Custom Bootstrap Actions

In addition to predefined bootstrap actions, you can create a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

Contents
· Running Custom Bootstrap Actions Using the AWS CLI or the Amazon EMR CLI (p. 116)
· Running Custom Bootstrap Actions Using the Console (p. 117)

Running Custom Bootstrap Actions Using the AWS CLI or the Amazon EMR CLI

The following example uses a bootstrap action script to download and extract a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

The sample script looks like the following:

#!/bin/bash
set -e
bucket=elasticmapreduce
path=samples/bootstrap-actions/file.tar.gz
wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path
mkdir -p /home/hadoop/contents
tar -C /home/hadoop/contents -xzf file.tar.gz

To create a cluster with a custom bootstrap action using the AWS CLI

· When using the AWS CLI, specify the Path, and optionally the Args, as a comma-separated list:

aws emr create-cluster --instance-count 3 --instance-type m3.xlarge --ami-version 3.2.1 \
--bootstrap-action Path="s3://elasticmapreduce/bootstrap-actions/download.sh",[Args]

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a cluster with a custom bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command.

· Linux, UNIX, and Mac OS X:

./elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

· Windows:

ruby elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

Running Custom Bootstrap Actions Using the Console

The following procedure creates a predefined word count sample cluster with a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

To create a cluster with a custom bootstrap action using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.


2. Click Create cluster.
3. In the Create Cluster page, click Configure sample application.
4. In the Configure Sample Application page, in the Select sample application field, choose the Word count sample application from the list.
5. In the Output location field, type the path of an Amazon S3 bucket to store your output and then click Ok.
6. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).


Field Action

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

7. In the Software Configuration section, verify the fields according to the following table.

Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

8. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Field Action

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

9. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

10. In the Bootstrap Actions section, in the Add bootstrap action field, select Custom action and click Configure and add.

In the Add Bootstrap Action dialog box, do the following:

a. Enter the following text in the S3 location field:


s3://elasticmapreduce/bootstrap-actions/download.sh

b. (Optional) Enter any arguments in the Optional arguments field. Use spaces to separate the arguments.
c. Click Add.

11. In the Steps section, note the step that Amazon EMR configured for you by choosing the sample application.

You do not need to change any of the settings in this section.
12. Review your configuration and if you are satisfied with the settings, click Create Cluster.
13. When the cluster starts, the console displays the Cluster Details page.

While the cluster master node is running, you can connect to the master node and see the log files that the bootstrap action script generated in the /mnt/var/log/bootstrap-actions/1 directory.
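For example, one way to look at those logs from your local machine is over SSH, as in the following sketch; the key pair file and master node public DNS name are placeholders you would replace with your own values.

ssh -i ~/mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com \
  'ls /mnt/var/log/bootstrap-actions/1 && cat /mnt/var/log/bootstrap-actions/1/stdout'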

Related Topics

· View Log Files (p. 449)

Choose the Cluster Lifecycle: Long-Running or Transient

You can run your cluster as a transient process: one that launches the cluster, loads the input data, processes the data, stores the output results, and then automatically shuts down. This is the standard model for a cluster that is performing a periodic processing task. Shutting down the cluster automatically ensures that you are only billed for the time required to process your data.

The other model for running a cluster is as a long-running cluster. In this model, the cluster launches and loads the input data. From there you might interactively query the data, use the cluster as a data warehouse, or do periodic processing on a data set so large that it would be inefficient to load the data into new clusters each time. In this model, the cluster persists even when there are no tasks queued for processing.


If you want your cluster to be long-running, you must disable auto-termination when you launch the cluster. You can do this when you launch a cluster using the console, the AWS CLI, or programmatically. Another option you might want to enable on a long-running cluster is termination protection. This protects your cluster from being terminated accidentally or in the event that an error occurs. For more information, see Managing Cluster Termination (p. 503).

To launch a long-running cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Steps section, in the Auto-terminate field, choose No, which runs the cluster until you terminate it.

Remember to terminate the cluster when it is done so you do not continue to accrue charges on an idle cluster.

4. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To launch a long-running cluster using the AWS CLI

· To launch a long-running cluster using the AWS CLI, type the create-cluster subcommand with the --no-auto-terminate parameter:

aws emr create-cluster --ami-version string --no-auto-terminate \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=4

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
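A long-running cluster is also a common candidate for termination protection. As a sketch only, assuming the --termination-protected option of the create-cluster subcommand (verify it against your installed CLI version), you might combine the two settings like this:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --termination-protected \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=4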

To launch a long-running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, launch a cluster and specify the --alive argument. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive

· Windows users:

ruby elastic-mapreduce --create --alive

Prepare Input Data (Optional)

Most clusters load input data and then process that data. In order to load data, it needs to be in a location that the cluster can access and in a format the cluster can process. The most common scenario is to upload input data into Amazon S3. Amazon EMR provides tools for your cluster to import or read data from Amazon S3.

The default input format in Hadoop is text files, though you can customize Hadoop and use tools to import data stored in other formats.

Topics
· Types of Input Amazon EMR Can Accept (p. 125)
· How to Get Data Into Amazon EMR (p. 126)

Types of Input Amazon EMR Can Accept

The default input format for a cluster is text files with each line separated by a newline (\n) character. This is the input format most commonly used.

If your input data is in a format other than the default text files, you can use the Hadoop interface InputFormat to specify other input types. You can even create a subclass of the FileInputFormat class to handle custom data types. For more information, go to http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html.


If you need to analyze data stored in a legacy format, such as PDF and Word files, you can use Informatica's HParser to convert the data to text or XML format. For more information, see Parse Data with HParser (p. 180).

If you are using Hive, you can use a serializer/deserializer (SerDe) to read data in from a given format into HDFS. For more information, see https://cwiki.apache.org/confluence/display/Hive/SerDe.

How to Get Data Into Amazon EMR

Amazon EMR provides several ways to get data onto a cluster. The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can also use the Distributed Cache feature of Hadoop to transfer files from a distributed file system to the local file system. The implementation of Hive provided by Amazon EMR (version 0.7.1.1 and later) includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. If you have large amounts of on-premises data to process, you may find the AWS Direct Connect service useful.

Topics
· Upload Data to Amazon S3 (p. 126)
· Import files with Distributed Cache (p. 130)
· How to Process Compressed Files (p. 136)
· Import DynamoDB Data into Hive (p. 137)
· Connect to Data with AWS DirectConnect (p. 137)
· Upload Large Amounts of Data with AWS Import/Export (p. 137)

Upload Data to Amazon S3

For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the Amazon Simple Storage Service Getting Started Guide. For more information about using Amazon S3 with Hadoop, go to http://wiki.apache.org/hadoop/AmazonS3.

Topics
· Create and Configure an Amazon S3 Bucket (p. 126)
· Configure Multipart Upload for Amazon S3 (p. 128)

Create and Configure an Amazon S3 Bucket

Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

This section shows you how to use the Amazon S3 AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference.

To create an Amazon S3 bucket using the console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Click Create Bucket.


The Create a Bucket dialog box opens.
3. Enter a bucket name, such as myawsbucket.

This name should be globally unique, and cannot be the same name used by another bucket.
4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same region as your cluster.

Refer to Choose an AWS Region (p. 30) for guidance on choosing a Region.
5. Click Create.

You created a bucket with the URI s3n://myawsbucket/.
Note
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR cluster logs.
Note
For more information on specifying Region-specific buckets, refer to Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs.
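If you prefer the command line, the same bucket can also be created with the AWS CLI, as in the following sketch; the bucket name and region are examples only.

aws s3 mb s3://myawsbucket --region us-east-1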

After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and authenticated users read access.

To set permissions on an Amazon S3 bucket using the console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In the Buckets pane, right-click the bucket you just created.
3. Select Properties.
4. In the Properties pane, select the Permissions tab.
5. Click Add more permissions.
6. Select Authenticated Users in the Grantee field.
7. To the right of the Grantee drop-down list, select List.
8. Click Save.

You have created a bucket and restricted permissions to authenticated users.

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations.

Information         Example Location on Amazon S3
script or program   s3://myawsbucket/script/MapperScript.py
log files           s3://myawsbucket/logs
input data          s3://myawsbucket/input
output data         s3://myawsbucket/output
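As an illustration, you might stage a script and an input file at locations like these with the AWS CLI; the local file names are placeholders.

aws s3 cp MapperScript.py s3://myawsbucket/script/MapperScript.py
aws s3 cp input.txt s3://myawsbucket/input/input.txt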


Configure Multipart Upload for Amazon S3

Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload in the Amazon S3 Developer Guide.

Multipart upload allows you to upload a single file to Amazon S3 as a set of parts. Using the AWS Java SDK, you can upload these parts incrementally and in any order. Using the multipart upload method can result in faster uploads and shorter retries than when uploading a single large file.

The Amazon EMR configuration parameters for multipart upload are described in the following table.

Configuration Parameter Name       Default Value   Description
fs.s3n.multipart.uploads.enabled   True            A boolean type that indicates whether to enable multipart uploads.
fs.s3n.ssl.enabled                 True            A boolean type that indicates whether to use http or https.

You modify the configuration parameters for multipart uploads using a bootstrap action.

Disable Multipart Upload Using the Amazon EMR Console

This procedure explains how to disable multipart upload using the Amazon EMR console.

To disable multipart uploads with a bootstrap action using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Bootstrap Actions section, in the Add bootstrap action field, select Configure Hadoop and click Configure and add.

Enter the following information:

a. In Optional arguments, replace the default value with the following:

-c fs.s3n.multipart.uploads.enabled=false

b. Click Add.


For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).
4. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

Disable Multipart Upload Using the AWS CLI

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, type the create-cluster command with the --bootstrap-action parameter.

To disable multipart upload, type the following command:

aws emr create-cluster --instance-count integer --instance-type string --ami-version string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --instance-count 5 --instance-type m3.xlarge --ami-version 3.1.1 \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-c","fs.s3n.multipart.uploads.enabled=false"]

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Disable Multipart Upload Using the Amazon EMR CLI

This procedure explains how to disable multipart upload using the Amazon EMR CLI. The command creates a cluster in a waiting state with multipart upload disabled.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "disable multipart upload" \
--args "-c,fs.s3n.multipart.uploads.enabled=false"

· Windows users:

ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "disable multipart upload" --args "-c,fs.s3n.multipart.uploads.enabled=false"

This cluster remains in the WAITING state until it is terminated.

Disable Multipart Upload Using the API

For information on using Amazon S3 multipart uploads programmatically, go to Using the AWS SDK for Java for Multipart Upload in the Amazon S3 Developer Guide.

For more information about the AWS SDK for Java, go to the AWS SDK for Java detail page.

Import files with Distributed Cache

Topics
· Supported File Types (p. 130)
· Location of Cached Files (p. 131)
· Access Cached Files From Streaming Applications (p. 131)
· Access Cached Files From Streaming Applications Using the Amazon EMR Console (p. 131)
· Access Cached Files From Streaming Applications Using the AWS CLI or the Amazon EMR CLI (p. 132)

Distributed Cache is a Hadoop feature that can boost efficiency when a map or a reduce task needs access to common data. If your cluster depends on existing applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. This feature lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes.

For more information, go to http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html.

You invoke Distributed Cache when you create the cluster. The files are cached just before starting the Hadoop job and the files remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or Amazon S3. The default size of the file cache is 10GB. To change the size of the cache, reconfigure the Hadoop parameter, local.cache.size, using the Create Bootstrap Actions to Install Additional Software (Optional) (p. 110) bootstrap action.

Supported File Types

Distributed Cache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.


Archives are one or more files packaged using a utility, such as gzip. Distributed Cache passes the compressed files to each slave node and decompresses the archive as part of caching. Distributed Cache supports the following compression formats:

· zip
· tgz
· tar.gz
· tar
· jar
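For example, you might package a data set as one of these archive types and upload it to Amazon S3 for later use with the cache; the file and bucket names below are placeholders.

tar -czf sample_dataset.tgz data/
aws s3 cp sample_dataset.tgz s3://my_bucket/sample_dataset.tgz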

Location of Cached Files

Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.

Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file location. The value of the Hadoop parameter, mapred.local.dir, specifies the location of temporary files. Amazon Elastic MapReduce (Amazon EMR) sets this parameter to /mnt/var/lib/hadoop/mapred/. Cache files are located in a subdirectory of the temporary file location at /mnt/var/lib/hadoop/mapred/taskTracker/archive/.

If you cache a single file, Distributed Cache puts the file in the archive directory. If you cache an archive, Distributed Cache decompresses the file and creates a subdirectory in /archive with the same name as the archive file name. The individual files are located in the new subdirectory.

You can use Distributed Cache only when using Streaming.

Access Cached Files From Streaming Applications

To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
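For example (a minimal sketch, not taken from this guide), a streaming mapper written as a shell script could read a file that was added to the cache under the name sample_dataset_cached.dat simply by referring to it in the working directory:

#!/bin/bash
# Hypothetical streaming mapper: looks up each input line in the cached file,
# which Distributed Cache has symlinked into the task's working directory.
while read -r line; do
  grep -F -- "$line" ./sample_dataset_cached.dat || true
done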

For more information, go to http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache.

Access Cached Files From Streaming Applications Using the Amazon EMR Console

You can use the Amazon EMR console to create clusters that use Distributed Cache.

To specify Distributed Cache files using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Steps section, in the Add step field, choose Streaming program from the list and click Configure and add.
4. In the Arguments field, include the files and archives to save to the cache and click Add.

The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.


If you want to add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. For example:

-cacheFile s3://bucket_name/file_name#cache_file_name

If you want to add an archive file to the Distributed Cache, enter -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. For example:

-cacheArchive s3://bucket_name/archive_name#cache_archive_name

5. Proceed with configuring and launching your cluster. Your cluster copies the files to the cache location before processing any cluster steps.

Access Cached Files From Streaming Applications Using the AWS CLI or the Amazon EMR CLI

You can use the CLI to create clusters that use Distributed Cache.


To specify Distributed Cache files using the AWS CLI

· To submit a Streaming step when a cluster is created, type the create-cluster command with the --steps parameter. To specify Distributed Cache files using the AWS CLI, specify the appropriate arguments when submitting a Streaming step.

If you want to add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

If you want to add an archive file to the Distributed Cache, enter -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Example 1

Type the following command to launch a cluster and submit a Streaming step that uses -cacheFile to add one file, sample_dataset_cached.dat, to the cache.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, type the following command:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"] \
--no-auto-terminate

For Hadoop 1.x, use the following command:

aws emr create-cluster --ami-version 2.4.8 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
Args=["-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"] \
--no-auto-terminate


Example 2

The following command shows the creation of a streaming cluster and uses -cacheArchive to add an archive of files to the cache.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"] \
--no-auto-terminate

For Hadoop 1.x, use the following command:

aws emr create-cluster --ami-version 2.4.8 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
Args=["-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"] \
--no-auto-terminate

To specify Distributed Cache files using the Amazon EMR CLI

You can use the Amazon EMR CLI to create clusters that use Distributed Cache. To add files or archives to the Distributed Cache using the CLI, you specify the options --cache or --cache-archive to the CLI command line.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Create a cluster and add the following parameters:

For information on how to create a cluster using the Amazon EMR CLI, go to Launch a Cluster and Submit a Streaming Step (p. 200).

The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

If you want to add an individual file to the Distributed Cache, specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

If you want to add an archive file to the Distributed Cache, enter --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.


Your cluster copies the files to the cache location before processing any job flow steps.

Example 1

The following command shows the creation of a streaming cluster and uses --cache to add one file, sample_dataset_cached.dat, to the cache.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream \
--arg "--files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" \
--input s3://my_bucket/my_input \
--output s3://my_bucket/my_output \
--mapper my_mapper.py \
--reducer my_reducer.py \
--cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

· Windows users:

ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

For Hadoop 1.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream \ --input s3://my_bucket/my_input \ --output s3://my_bucket/my_output \ --mapper s3://my_bucket/my_mapper.py \ --reducer s3://my_bucket/my_reducer.py \ --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

· Windows users:

ruby elastic-mapreduce --create --stream --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper s3://my_bucket/my_mapper.py --reducer s3://my_bucket/my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat


Example 2

The following command shows the creation of a streaming cluster and uses --cache-archive to add an archive of files to the cache.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream \
--arg "--files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" \
--input s3://my_bucket/my_input \
--output s3://my_bucket/my_output \
--mapper my_mapper.py \
--reducer my_reducer.py \
--cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached

· Windows users:

ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached

For Hadoop 1.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream \ --input s3://my_bucket/my_input \ --output s3://my_bucket/my_output \ --mapper s3://my_bucket/my_mapper.py \ --reducer s3://my_bucket/my_reducer.py \ --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached

· Windows users:

ruby elastic-mapreduce --create --stream --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper s3://my_bucket/my_mapper.py --reducer s3://my_bucket/my_reducer.py --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached

How to Process Compressed Files

Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.


To index LZO files, you can use the hadoop-lzo library which can be downloaded from https://github.com/kevinweil/hadoop-lzo. Note that because this is a third-party library, Amazon Elastic MapReduce (Amazon EMR) does not offer developer support on how to use this tool. For usage information, see the hadoop-lzo readme file.

Import DynamoDB Data into Hive

The implementation of Hive provided by Amazon EMR (version 0.7.1.1 and later) includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. This is useful if your input data is stored in DynamoDB. For more information, see Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR (p. 399).

Connect to Data with AWS DirectConnect

AWS Direct Connect is a service you can use to establish a private dedicated network connection to AWS from your datacenter, office, or colocation environment. If you have large amounts of input data, using AWS Direct Connect may reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. For more information, see the AWS Direct Connect User Guide.

Upload Large Amounts of Data with AWS Import/Export

AWS Import/Export is a service you can use to transfer large amounts of data from physical storage devices into AWS. You mail your portable storage devices to AWS and AWS Import/Export transfers data directly off of your storage devices using Amazon's high-speed internal network. Your data load typically begins the next business day after your storage device arrives at AWS. After the data export or import completes, we return your storage device. For large data sets, AWS data transfer can be significantly faster than Internet transfer and more cost effective than upgrading your connectivity. For more information, see the AWS Import/Export Developer Guide.

Prepare an Output Location (Optional)

The most common output format of an Amazon EMR cluster is text files, either compressed or uncompressed. Typically, these are written to an Amazon S3 bucket. This bucket must be created before you launch the cluster. You specify the S3 bucket as the output location when you launch the cluster.

For more information, see the following topics:

Topics
· Create and Configure an Amazon S3 Bucket (p. 137)
· What formats can Amazon EMR return? (p. 139)
· How to write data to an Amazon S3 bucket you don't own (p. 139)
· Compress the Output of your Cluster (p. 141)

Create and Configure an Amazon S3 Bucket

Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

This section shows you how to use the Amazon S3 AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference.

To create an Amazon S3 bucket using the console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Click Create Bucket.

The Create a Bucket dialog box opens.
3. Enter a bucket name, such as myawsbucket.

This name should be globally unique, and cannot be the same name used by another bucket.
4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same region as your cluster.

Refer to Choose an AWS Region (p. 30) for guidance on choosing a Region.
5. Click Create.

You created a bucket with the URI s3n://myawsbucket/.
Note
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR cluster logs.
Note
For more information on specifying Region-specific buckets, refer to Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs.

After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and authenticated users read access.

To set permissions on an Amazon S3 bucket using the console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In the Buckets pane, right-click the bucket you just created.
3. Select Properties.
4. In the Properties pane, select the Permissions tab.
5. Click Add more permissions.
6. Select Authenticated Users in the Grantee field.
7. To the right of the Grantee drop-down list, select List.
8. Click Save.

You have created a bucket and restricted permissions to authenticated users.

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations.

Information          Example Location on Amazon S3
script or program    s3://myawsbucket/script/MapperScript.py
log files            s3://myawsbucket/logs
input data           s3://myawsbucket/input
output data          s3://myawsbucket/output

What formats can Amazon EMR return?

The default output format for a cluster is text files with key-value pairs written to individual lines. This is the most commonly used output format.

If your output data needs to be written in a format other than the default text files, you can use the Hadoop interface OutputFormat to specify other output types. You can even create a subclass of the FileOutputFormat class to handle custom data types. For more information, see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html.
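As an illustration only (the class, bucket, and path names below are placeholders and are not part of the Amazon EMR samples), a job written against the older org.apache.hadoop.mapred API might select SequenceFileOutputFormat instead of the default text output like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileOutputJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SequenceFileOutputJob.class);
        conf.setJobName("sequence-file-output");

        // Write the job output as SequenceFiles instead of the default text files.
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // With the default TextInputFormat and the identity mapper/reducer,
        // the output records are LongWritable keys and Text values.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Placeholder input and output locations; replace them with your own buckets.
        FileInputFormat.setInputPaths(conf, new Path("s3://mybucket/input"));
        FileOutputFormat.setOutputPath(conf, new Path("s3://mybucket/output"));

        JobClient.runJob(conf);
    }
}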

If you are launching a Hive cluster, you can use a serializer/deserializer (SerDe) to output data from HDFS in a given format. For more information, see https://cwiki.apache.org/confluence/display/Hive/SerDe.

How to write data to an Amazon S3 bucket you don't own

When you write a file to an Amazon Simple Storage Service (Amazon S3) bucket, by default, you are the only one able to read that file. The assumption is that you will write files to your own buckets, and this default setting protects the privacy of your files.

However, if you are running a cluster, and you want the output to write to the Amazon S3 bucket of another AWS user, and you want that other AWS user to be able to read that output, you must do two things:

· Have the other AWS user grant you write permissions for their Amazon S3 bucket. The cluster you launch runs under your AWS credentials, so any clusters you launch will also be able to write to that other AWS user's bucket.
· Set read permissions for the other AWS user on the files that you or the cluster write to the Amazon S3 bucket. The easiest way to set these read permissions is to use canned access control lists (ACLs), a set of pre-defined access policies defined by Amazon S3.

For information about how the other AWS user can grant you permissions to write files to the other user's Amazon S3 bucket, see Editing Bucket Permissions in the Amazon Simple Storage Service Console User Guide.

For your cluster to use canned ACLs when it writes files to Amazon S3, set the fs.s3.canned.acl cluster configuration option to the canned ACL to use. The following table lists the currently defined canned ACLs.

Canned ACL              Description

AuthenticatedRead       Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AuthenticatedUsers group grantee is granted Permission.Read access.

BucketOwnerFullControl  Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.

BucketOwnerRead         Specifies that the owner of the bucket is granted Permission.Read. The owner of the bucket is not necessarily the same as the owner of the object.

LogDeliveryWrite        Specifies that the owner is granted Permission.FullControl and the GroupGrantee.LogDelivery group grantee is granted Permission.Write access, so that access logs can be delivered.

Private                 Specifies that the owner is granted Permission.FullControl.

PublicRead              Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read access.

PublicReadWrite         Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read and Permission.Write access.

There are many ways to set the cluster configuration options, depending on the type of cluster you are running. The following procedures show how to set the option for common cases.

To write files using canned ACLs in Hive

· From the Hive command prompt, set the fs.s3.canned.acl configuration option to the canned ACL you want to have the cluster set on files it writes to Amazon S3. To access the Hive command prompt, connect to the master node using SSH and type hive at the Hadoop command prompt. For more information, see Connect to the Master Node Using SSH (p. 481).

The following example sets the fs.s3.canned.acl configuration option to BucketOwnerFullControl, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command is case sensitive and contains no quotation marks or spaces.

hive> set fs.s3.canned.acl=BucketOwnerFullControl;
create table acl (n int) location 's3://acltestbucket/acl/';
insert overwrite table acl select count(n) from acl;

The last two lines of the example create a table that is stored in Amazon S3 and write data to the table.

To write files using canned ACLs in Pig

· From the Pig command prompt, set the fs.s3.canned.acl configuration option to the canned ACL you want to have the cluster set on files it writes to Amazon S3. To access the Pig command prompt, connect to the master node using SSH and type pig at the Hadoop command prompt. For more information, see Connect to the Master Node Using SSH (p. 481).

The following example sets the fs.s3.canned.acl configuration option to BucketOwnerFullControl, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command includes one space before the canned ACL name and contains no quotation marks.

pig> set fs.s3.canned.acl BucketOwnerFullControl;
store somedata into 's3://acltestbucket/pig/acl';

To write files using canned ACLs in a custom JAR

· Set the fs.s3.canned.acl configuration option using Hadoop with the -D flag. This is shown in the example below.

hadoop jar hadoop-examples.jar wordcount -Dfs.s3.canned.acl=BucketOwnerFullControl s3://mybucket/input s3://mybucket/output

Compress the Output of your Cluster

Topics
· Output Data Compression (p. 141)
· Intermediate Data Compression (p. 142)
· Using the Snappy Library with Amazon EMR (p. 142)

Output Data Compression

Output data compression compresses the output of your Hadoop job. If you are using TextOutputFormat, the result is a gzipped text file. If you are writing to SequenceFiles, the result is a SequenceFile which is compressed internally. This can be enabled by setting the configuration setting mapred.output.compress to true.

If you are running a streaming job you can enable this by passing the streaming job these arguments.

-jobconf mapred.output.compress=true

You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"


Finally, if you are writing a Custom JAR, you can enable output compression with the following line when creating your job.

FileOutputFormat.setCompressOutput(conf, true);
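If you also want to choose the codec explicitly, the same old-API FileOutputFormat class accepts a compressor class. The following lines are a sketch only; conf is assumed to be the JobConf shown above, and GzipCodec is one of the codecs that ships with Hadoop:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;

// Compress the job output and select the gzip codec explicitly.
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);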

Intermediate Data Compression

If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. Intermediate compression compresses the map output and decompresses it when it arrives on the slave node. The configuration setting is mapred.compress.map.output. You can enable this similarly to output compression.

When writing a Custom Jar, use the following command:

conf.setCompressMapOutput(true);

Using the Snappy Library with Amazon EMR

Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to http://code.google.com/p/snappy/. For more information about Amazon EMR AMI versions, go to Choose an Amazon Machine Image (AMI) (p. 53).
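As a sketch of how a custom JAR might select Snappy explicitly for intermediate compression (conf is assumed to be your job's old-API JobConf, and the setting only takes effect when the Snappy native library is present on the AMI you chose):

import org.apache.hadoop.io.compress.SnappyCodec;

// Compress the data shuffled from the mappers to the reducers using Snappy.
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(SnappyCodec.class);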

Configure Access to the Cluster

Amazon EMR provides several ways to control access to the resources of your cluster. You can use AWS Identity and Access Management (IAM) to create user accounts and roles and configure permissions that control which AWS features those users and roles can access. When you launch a cluster, you can associate an Amazon EC2 key pair with the cluster that you can use to connect to the cluster using SSH. You can also set permissions that allow users other than the default Hadoop user to submit jobs to your cluster.

Topics
· Create SSH Credentials for the Master Node (p. 142)
· Configure IAM User Permissions (p. 144)
· Set Access Policies for IAM Users (p. 147)
· Configure IAM Roles for Amazon EMR (p. 150)
· Setting Permissions on the System Directory (p. 160)

Create SSH Credentials for the Master Node

Create an Amazon EC2 Key Pair and PEM File

Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch. The PEM file associated with this key pair is required to ssh directly to the master node of the cluster.


To create an Amazon EC2 key pair

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. From the Amazon EC2 console, select a Region.
3. In the Navigation pane, click Key Pairs.
4. On the Key Pairs page, click Create Key Pair.
5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair.
6. Click Create.
7. Save the resulting PEM file in a safe location.

Your Amazon EC2 key pair and an associated PEM file are created.

Modify Your PEM File

Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your cluster, allowing you to test cluster steps or troubleshoot your cluster environment. To log in directly to the master node of your running cluster, you can use ssh or PuTTY. You use your PEM file to authenticate to the master node. The PEM file requires a modification based on the tool you use and your operating system. You use the CLI to connect on Linux, UNIX, or Mac OS X computers. You use PuTTY to connect on Microsoft Windows computers. For more information about how to install the Amazon EMR CLI or how to install PuTTY, go to the Amazon Elastic MapReduce Getting Started Guide.

To modify your credentials file

· Create a local permissions file:

If you are using Linux, UNIX, or Mac OS X:
Set the permissions on the PEM file for your Amazon EC2 key pair. For example, if you saved the file as mykeypair.pem, the command looks like the following:

chmod og-rwx mykeypair.pem

If you are using Microsoft Windows:
a. Download PuTTYgen.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
b. Launch PuTTYgen.
c. Click Load. Select the PEM file you created earlier.
d. Click Open.
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported.
f. Click Save private key to save the key in the PPK format.
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
h. Enter a name for your PuTTY private key, such as mykeypair.ppk.
i. Click Save.
j. Exit the PuTTYgen application.


Your credentials file is modified to allow you to log in directly to the master node of your running cluster.

Configure IAM User Permissions

Amazon EMR supports AWS Identity and Access Management (IAM) policies. IAM is a web service that enables AWS customers to manage users and user permissions. For more information on IAM, go to Using IAM in the Using IAM guide.

IAM enables you to create users under your AWS account. You can define policies that limit the actions those users can take with your AWS resources. For example, you can choose to give an IAM user the ability to view, but not to create or delete, Amazon S3 buckets in your AWS account. IAM is available at no charge to all AWS account holders; you do not need to sign up for IAM. You can use IAM through the Amazon EMR console, the Amazon EMR CLI, and programmatically through the Amazon EMR API and the AWS SDKs.

Instead of giving permissions to individual users, it can be convenient to use IAM roles and group users with certain permissions. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

Configuring Cluster Access

By default, clusters created using the console and AWS CLI are visible to all IAM users. You can change this setting when you launch a cluster or after the cluster is created. If an IAM user launches a cluster, and that cluster is hidden from other IAM users on the AWS account, only that user will see the cluster. For example, if an IAM user uses the AWS CLI to run the list-clusters command, clusters created by other IAM users with IAM user access set to "No other IAM users" are not listed. This filtering occurs on all Amazon EMR interfaces (the console, CLI, API, and SDKs) and prevents IAM users from accessing and inadvertently changing clusters created by other IAM users. It is useful for clusters that are intended to be viewed by only a single IAM user and the main AWS account.

Note
This filtering does not prevent IAM users from viewing the underlying resources of the cluster, such as EC2 instances, by using AWS interfaces outside of Amazon EMR.

The default option, launching a cluster with IAM user access set to "All other IAM users," makes a cluster visible and accessible to all IAM users under a single AWS account. Using this feature, all IAM users on your account have access to the cluster and, by configuring the policies of the IAM groups they belong to, you control how those users interact with the cluster. For example, Devlin, a developer, belongs to a group that has an IAM policy that grants full access to all Amazon EMR functionality. He could launch a cluster that is visible to all other IAM users on his company's AWS account. A second IAM user, Ann, a data analyst with the same company, could then run queries on that cluster. Because Ann does not launch or terminate clusters, the IAM policy for the group she is in would only contain the permissions necessary for her to run her queries.

To configure cluster access using the Amazon EMR console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Security and Access section, in the IAM User Access field, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. Choose No other IAM users to restrict access. For more information, see Configure IAM User Permissions (p. 144).


4. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To configure cluster access using the AWS CLI

To configure cluster access on a new cluster using the AWS CLI, type the create-cluster command with the --visible-to-all-users parameter or the --no-visible-to-all-users parameter.

· Type the following command to restrict cluster access:

aws emr create-cluster --no-auto-terminate --no-visible-to-all-users --name string \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --no-auto-terminate --no-visible-to-all-users --name "Not visible to all users" \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2

If you are configuring cluster access to an existing cluster, type the modify-cluster-attributes subcommand with the --no-visible-to-all-users parameter or the --visible-to-all-users parameter. The visibility of a running cluster can be changed only by the IAM user that created the cluster or the AWS account that owns the cluster. To use this subcommand, you need the cluster identifier available via the console or the list-clusters subcommand.

To restrict access to a running cluster, type the following command:

aws emr modify-cluster-attributes --cluster-id clusterId --no-visible-to-all-users

For example:

aws emr modify-cluster-attributes --cluster-id j-1GMZXXXXXXYMZ --no-visible-to-all-users


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To configure cluster access using the Amazon EMR CLI

· By default, clusters created using the Amazon EMR CLI are not visible to all users. If you are adding IAM user visibility to a new cluster using the Amazon EMR CLI, add the --visible-to-all-users flag to the cluster call as shown in the following example.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 2 \
--visible-to-all-users

· Windows users:

ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 2 --visible-to-all-users

If you are adding IAM user visibility to an existing cluster, you can use the --set-visible-to-all-users option of the Amazon EMR CLI, and specify the identifier of the cluster to modify. This is shown in the following example, where JobFlowId would be replaced by the cluster identifier of your cluster. The visibility of a running cluster can be changed only by the IAM user that created the cluster or the AWS account that owns the cluster.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --set-visible-to-all-users true --jobflow JobFlowId

· Windows users:

ruby elastic-mapreduce --set-visible-to-all-users true --jobflow JobFlowId

To configure cluster access using the Amazon EMR API

· To configure IAM user access on a new cluster, call RunJobFlow and set VisibleToAllUsers=true or false, as shown in the following example.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&VisibleToAllUsers=true
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.large
&Instances.SlaveInstanceType=m1.large
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams

Set Access Policies for IAM Users

The ability for users to perform certain actions with Amazon EMR is controlled by IAM policies. IAM policies provide fine-grained control over the level of access and the criteria by which Amazon EMR grants access to IAM users.

Note
At a minimum, an IAM user needs the following permission set in their IAM policy to access the Amazon EMR console:

elasticmapreduce:ListClusters

For more information, see Creating and Listing Groups in the Using IAM guide.

To add a permission to a user or group, write a policy that contains the permission and attach the policy to the user or group. You cannot specify a specific Amazon EMR resource in a policy, such as a specific cluster. You can only specify Allow or Deny access to Amazon EMR API actions.

In an IAM policy, to specify Amazon EMR actions, the action name must be prefixed with the lowercase string elasticmapreduce. You use wildcards to specify all actions related to Amazon EMR. The wildcard "*" matches zero or multiple characters. For example, elasticmapreduce:* specifies all Amazon EMR actions.

For a complete list of Amazon EMR actions, see the API action names in the Amazon EMR API Reference. For more information about permissions and policies, see Permissions and Policies in the Using IAM guide.

Users with permission to use Amazon EMR API actions can create and manage clusters as described elsewhere in this guide. Users must use their own AWS access ID and secret key to authenticate Amazon EMR commands. For more information about creating clusters, see Manage Clusters (p. 441).

Full Access Policy

The following policy gives permissions for all actions required to use Amazon EMR. This policy includes actions for Amazon EC2, Amazon S3, and CloudWatch, as well as for all Amazon EMR actions. Amazon EMR relies on these additional services to perform such actions as launching instances, writing log files, or managing Hadoop jobs and tasks. For information about attaching policies to users, see Managing IAM Policies in the Using IAM guide.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:ListRoles",
        "iam:PassRole",
        "elasticmapreduce:*",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeAccountAttributes",
        "ec2:DescribeInstances",
        "ec2:DescribeKeyPairs",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcAttributes",
        "ec2:DescribeVpcs",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "cloudwatch:*",
        "s3:*",
        "sdb:*",
        "support:CreateCase",
        "support:DescribeServices",
        "support:DescribeSeverityLevels"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Note
The ec2:TerminateInstances action enables the IAM user to terminate any of the EC2 instances associated with the IAM account, even those that are not part of an Amazon EMR cluster.

Related Topics

· How to Write a Policy (Using IAM guide)

Read-Only Access Policy

The following policy gives an IAM user read-only access to Amazon EMR. For information about attaching policies to users, see Managing IAM Policies in the Using IAM guide.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:DescribeJobFlows",
        "s3:GetObject",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "sdb:Select",
        "cloudwatch:GetMetricStatistics"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Related Topics

· How to Write a Policy (Using IAM guide)

Amazon EC2 Limited Access Policy

This example policy allows people in certain job functions, such as data analysts, to develop and run Amazon EMR clusters, but not create, modify, or terminate Amazon EC2 instances outside of Amazon EMR.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::*elasticmapreduce",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    },
    {
      "Action": [
        "elasticmapreduce:*",
        "cloudwatch:*",
        "sdb:*"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "ec2:RunInstances",
        "ec2:ModifyImageAttribute",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeKeyPairs",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSubnets",
        "iam:AddRoleToInstanceProfile",
        "iam:PassRole",
        "iam:GetInstanceProfile",
        "iam:GetRole"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Configure IAM Roles for Amazon EMR

AWS Identity and Access Management (IAM) roles provide a way to delegate access so that IAM users or AWS services have the specified permissions and can act on your AWS resources. Amazon EMR lets you specify two IAM roles for a cluster: a role for the Amazon EMR service (service role), and a role for the EC2 instances (instance profile) that Amazon EMR manages.

The service role defines the allowable actions for Amazon EMR based on the granted permissions. A user can specify the service role when a cluster is created. When the user accesses the cluster, Amazon EMR assumes the IAM role specified in the cluster definition, obtains the permissions of the assumed role, and then tries to execute requests using those permissions. The permissions determine which AWS resources a service can access, and what the service is allowed to do with those resources. Service role permissions are independent of the permissions granted to the IAM user who called the service, and they can therefore be managed separately by an AWS administrator.

Note
The user who sets up the roles for use with Amazon EMR should be an IAM user with administrative permissions. We recommend that all administrators use AWS MFA (multi-factor authentication).

The EC2 instance profile determines the permissions for applications that run on EC2 instances. For example, when Hive, an application on the cluster, needs to write output to an Amazon S3 bucket, the instance profile determines whether Hive has permissions to write to Amazon S3.

Using IAM roles with Amazon EMR allows a user to tailor a permissions policy that closely fits the usage patterns of the cluster. For more information about instance profiles, see Granting Applications that Run on Amazon EC2 Instances Access to AWS Resources in the Using IAM guide.

Topics
· Required IAM Roles for Amazon EMR (p. 151)
· Default IAM Roles for Amazon EMR (p. 151)
· Create and Use IAM Roles for Amazon EMR (p. 153)
· Custom IAM Roles for Amazon EMR (p. 155)
· Launch a Cluster with IAM Roles (p. 156)
· Access AWS Resources Using IAM Roles (p. 159)


Required IAM Roles for Amazon EMR

Existing AWS customers whose accounts were created prior to the launch of IAM roles for the Amazon EMR service are not immediately required to use a role for the Amazon EMR service in existing public regions. Existing customers are required to use roles in private regions (such as AWS GovCloud (US)) and all new public regions. Eventually, Amazon EMR roles and Amazon EC2 roles will be required for all users in all regions.

AWS customers whose accounts were created after release of Amazon EMR roles are required to use roles in all regions.

Roles are required in the AWS GovCloud (US) region. If you are launching a cluster into the AWS GovCloud (US) region, for security purposes, you must launch the cluster in a VPC and specify an IAM role.

To use federation, you must select an Amazon EMR (service) role and Amazon EC2 instance profile.

Default IAM Roles for Amazon EMR

To simplify using IAM roles, Amazon EMR can create a default EMR (service) role and EC2 instance profile. For more information, see Create and Use IAM Roles for Amazon EMR (p. 153).

The default roles allow Amazon EMR and EC2 instances access to the AWS services and resources suitable for most clusters. Your specific application may require other permissions. For more information, see Custom IAM Roles for Amazon EMR (p. 155).

If you are using an IAM user with the CLI and creating default roles for a cluster, your IAM user must have the iam:CreateRole, iam:PutRolePolicy, iam:CreateInstanceProfile, iam:AddRoleToInstanceProfile, iam:PassRole, and iam:ListInstanceProfiles permissions. However, to use a pre-existing IAM role, a user only needs the iam:GetInstanceProfile, iam:GetRole, iam:PassRole, and iam:ListInstanceProfiles permissions.

Topics
· Default Service Role for Amazon EMR (p. 151)
· Default EC2 Instance Profile for Amazon EMR (p. 152)

Default Service Role for Amazon EMR

The default IAM role for the Amazon EMR service is EMR_DefaultRole. This role is defined as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:Describe*",
        "ec2:DeleteTags",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "iam:PassRole",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles",
        "s3:Get*",
        "s3:List*",
        "s3:CreateBucket",
        "sdb:BatchPutAttributes",
        "sdb:Select"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      }
    }
  ]
}

Default EC2 Instance Profile for Amazon EMR

The default EC2 instance profile is EMR_EC2_DefaultRole. This role is defined as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:*",
        "dynamodb:*",
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "rds:Describe*",
        "s3:*",
        "sdb:*",
        "sns:*",
        "sqs:*"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    }
  ]
}

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      }
    }
  ]
}

Create and Use IAM Roles for Amazon EMR

There are three ways to create IAM roles:

· Use the Amazon EMR console to create the default roles. For more information, see Create and Use IAM Roles with the Amazon EMR Console (p. 153).
· Use the CLI to create the default roles, using the --create-default-roles option. For more information, see Create and Use IAM Roles with the AWS CLI or the Amazon EMR CLI (p. 154).
· Use the IAM console or API to create roles. For more information, see Creating a Role for an AWS Service in the Using IAM guide.

Note
When you use the IAM console to create a role where Amazon EC2 is the principal, IAM automatically creates an instance profile with the same name as the role, to contain the role and pass information to EC2 instances. For more information, see Instance Profiles in the Using IAM guide.

Create and Use IAM Roles with the Amazon EMR Console

AWS customers whose accounts were created after release of Amazon EMR roles are required to specify an EMR (service) role and an EC2 instance profile in all regions when using the console. You can create default roles at cluster launch using the console, or you can specify other roles you may already be using.

To create and use IAM roles with the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create Cluster.
3. In the Security and Access section, in the IAM Roles subsection, for EMR role, choose an existing role from the list or create the default role by clicking Create Default Role > Create role.
4. In the EC2 instance profile field, choose an existing role from the list or create the default role by clicking Create Default Role > Create role.

Note Once the roles are created, they are visible in the IAM console.


Create and Use IAM Roles with the AWS CLI or the Amazon EMR CLI

You can create the default Amazon EMR (service) role and EC2 instance profile using the CLI. Once the roles are created, they are visible in the IAM console. After creation, you can use the default roles when you launch a cluster.

To create and use IAM roles with the AWS CLI

To create default roles using the AWS CLI, type the create-default-roles subcommand.

· Type the following command to create default roles using the AWS CLI:

aws emr create-default-roles

The output of the command lists the contents of the default EMR role, EMR_DefaultRole, and the default EC2 instance profile, EMR_EC2_DefaultRole.

If the default roles already exist, you can use them when launching a cluster by typing the create-cluster subcommand with the --use-default-roles parameter.

Type the following command to use existing default roles when launching a cluster:

aws emr create-cluster --ami-version string --use-default-roles \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example:

aws emr create-cluster --ami-version 3.1.1 --use-default-roles \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

We recommend that you begin by creating the default roles, then modify those roles as needed. For more information about default roles, see Default IAM Roles for Amazon EMR (p. 151).

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create and use IAM roles with the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command to create default roles using the Amazon EMR CLI. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create-default-roles


· Windows Users:

ruby elastic-mapreduce --create-default-roles

If the default roles already exist, no output is returned.

We recommend that you begin by creating the default roles, then modify those roles as needed. For more information about default roles, see Default IAM Roles for Amazon EMR (p. 151).

In the directory where you installed the Amazon EMR CLI, type the following command to specify the default roles using the Amazon EMR CLI. This command can also be used to specify custom roles. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626):

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test cluster" \
--ami-version 2.4.8 \
--num-instances 5 --instance-type m1.large \
--service-role EMR_DefaultRole --jobflow-role EMR_EC2_DefaultRole

· Windows Users:

ruby elastic-mapreduce --create --alive --name "Test cluster" --ami-version 2.4.8 --num-instances 5 --instance-type m1.large --service-role EMR_DefaultRole --jobflow-role EMR_EC2_DefaultRole

Custom IAM Roles for Amazon EMR

If the default IAM roles provided by Amazon EMR do not meet your needs, you can create custom IAM roles instead. For example, if your application does not access Amazon DynamoDB, you can leave out DynamoDB permissions in your custom IAM role.

For more information about creating and managing IAM roles, see the following topics in the Using IAM guide.

· Creating a Role
· Modifying a Role
· Deleting a Role

We recommend that you use the permissions in EMR_DefaultRole and EMR_EC2_DefaultRole as a starting place when developing custom IAM roles to use with Amazon EMR. To ensure that you always have access to the original version of the default IAM roles, generate the default roles using the Amazon EMR CLI, copy the contents of the default roles, create new IAM roles, paste in the copied permissions, and modify the pasted permissions.

The following is an example of a custom instance profile for use with Amazon EMR. This example is for a cluster that does not use Amazon RDS, or DynamoDB.


The access to Amazon SimpleDB is included to permit debugging from the console. Access to CloudWatch is included so the cluster can report metrics. Amazon SNS and Amazon SQS permissions are included for messaging.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:*",
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "s3:*",
        "sdb:*",
        "sns:*",
        "sqs:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ]
}

Important
The IAM role name and the instance profile name must match exactly when you use either the Amazon EMR console or CLI. The console shows IAM role names in the EC2 Role list. The CLI interprets the name passed to the --jobflow-role option as an IAM role name and does not recognize instance profile names. For this reason, if you use the IAM CLI or API to create an IAM role and its associated instance profile, we recommend that you give a new IAM role the same name as its associated instance profile. By default, if you create an IAM role with the IAM console and do not specify an instance profile name, the instance profile name is the same as the IAM role name. In some situations, you might need to work with an IAM role whose associated instance profile does not have the same name as the role. This can occur if you use AWS CloudFormation to manage IAM roles for you, because AWS CloudFormation adds a suffix to the role name to create the instance profile name. In this case, you can use the Amazon EMR API to specify the instance profile name. Unlike the console or CLI, the API does not interpret an instance profile name as an IAM role name. You can call the RunJobFlow action and pass the instance profile name for the JobFlowRole parameter.

Launch a Cluster with IAM Roles

To use IAM roles, the following versions of Amazon EMR components are required:

· AMI version 2.3.0 or later.
· If you are using Hive, version 0.8.1.6 or later.
· If you are using the CLI, version 2012-12-17 or later.
· If you are using s3DistCP, use the version at s3://elasticmapreduce/libs/s3distcp/role/s3distcp.jar.


The IAM user creating clusters needs permissions to retrieve and assign roles to Amazon EMR and EC2 instances. If the IAM user lacks these permissions, you get the error User account is not authorized to call EC2. The following policy allows the IAM user to create a cluster:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:*",
        "ec2:*",
        "cloudwatch:*",
        "s3:*",
        "sdb:*",
        "iam:AddRoleToInstanceProfile",
        "iam:PassRole",
        "iam:GetInstanceProfile",
        "iam:GetRole"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

To launch a cluster with IAM roles using the console

· Create a cluster and specify the IAM roles using the EMR role and EC2 instance profile fields. For more information, see Create and Use IAM Roles with the Amazon EMR Console (p. 153)

To launch a cluster with IAM roles using the AWS CLI

You can specify an EMR role and EC2 instance profile using the AWS CLI. When launching a cluster, type the create-cluster subcommand with the --service-role and --ec2-attributes InstanceProfile parameters.

· Type the following command to specify an EMR role and EC2 instance profile when launching a cluster. This example uses the default EMR role, EMR_DefaultRole, and the default EC2 instance profile, EMR_EC2_DefaultRole:

aws emr create-cluster --ami-version string --service-role string --ec2-attributes InstanceProfile=string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example:

aws emr create-cluster --ami-version 3.1.1 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch a cluster with IAM roles using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add the --service-role and --jobflow-role parameters to the command that creates the cluster and specify the name of the IAM roles to apply to Amazon EMR and EC2 instances in the cluster. The following example shows how to create an interactive Hive cluster that uses the default IAM roles provided by Amazon EMR.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --num-instances 3 \
--instance-type m1.large \
--name "myJobFlowName" \
--hive-interactive --hive-versions 0.8.1.6 \
--ami-version 2.3.0 \
--jobflow-role EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole

· Windows users:

ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type m1.small --name "myJobFlowName" --hive-interactive --hive-versions 0.8.1.6 --ami-version 2.3.0 --jobflow-role EMR_EC2_DefaultRole --service-role EMR_DefaultRole

To set a default IAM role for the Amazon EMR CLI

If you launch most or all of your clusters with a specific IAM role, you can set that IAM role as the default for the Amazon EMR CLI, so you don't need to specify it at the command line. You can override the IAM role specified in credentials.json at any time by specifying a different IAM role at the command line as shown in the preceding procedure.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add a jobflow-role field in the credentials.json file that you created when you installed the CLI. For more information about credentials.json, see Configuring Credentials (p. 628).

The following example shows the contents of a credentials.json file that causes the CLI to always launch clusters with the user-defined IAM roles, MyCustomEC2Role and MyCustomEMRRole.


{
  "access-id": "AccessKeyID",
  "private-key": "PrivateKey",
  "key-pair": "KeyName",
  "jobflow-role": "MyCustomEC2Role",
  "service-role": "MyCustomEMRRole",
  "key-pair-file": "location of key pair file",
  "region": "Region",
  "log-uri": "location of bucket on Amazon S3"
}

Access AWS Resources Using IAM Roles

Applications running on the EC2 instances of a cluster can use the instance profile to obtain temporary security credentials when calling AWS services.

The version of Hadoop available on AMI 2.3.0 and later has already been updated to make use of IAM roles. If your application runs strictly on top of the Hadoop architecture, and does not directly call any service in AWS, it should work with IAM roles with no modification.

If your application calls services in AWS directly, you need to update it to take advantage of IAM roles. This means that instead of obtaining account credentials from /home/hadoop/conf/core-site.xml on the EC2 instances in the cluster, your application now uses an SDK to access the resources using IAM roles, or calls the EC2 instance metadata to obtain the temporary credentials.
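The following is a minimal sketch, assuming the AWS SDK for Java is on the classpath, the code runs on a cluster node, and the instance profile allows read access to the placeholder bucket; it obtains temporary credentials from the instance profile instead of reading keys from core-site.xml:

import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class ListBucketWithInstanceProfile {
    public static void main(String[] args) {
        // The provider fetches and refreshes temporary credentials from the
        // EC2 instance metadata service; no access keys are stored on the node.
        AmazonS3 s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider());

        // "mybucket" is a placeholder; replace it with a bucket the
        // instance profile can read.
        int count = s3.listObjects("mybucket").getObjectSummaries().size();
        System.out.println("Objects found: " + count);
    }
}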

To access AWS resources with IAM roles using an SDK

· The following topics show how to use several of the AWS SDKs to access temporary credentials using IAM roles. Each topic starts with a version of an application that does not use IAM roles and then walks you through the process of converting that application to use IAM roles.

· Using IAM Roles for Amazon EC2 Instances with the SDK for Java in the AWS SDK for Java Developer Guide
· Using IAM Roles for Amazon EC2 Instances with the SDK for .NET in the AWS SDK for .NET Developer Guide
· Using IAM Roles for Amazon EC2 Instances with the SDK for PHP in the AWS SDK for PHP Developer Guide
· Using IAM Roles for Amazon EC2 Instances with the SDK for Ruby in the AWS SDK for Ruby Developer Guide

To Access A Delegated Amazon EMR Cluster with an SDK

· Determine the Amazon Resource Name (ARN) for the role of the trusting account, which is delegating the resource. You view this by opening the IAM management console, clicking Roles, and selecting the Role and the Summary tab. For this example, the resource will be:

arn:aws:iam::123456789012:role/emrTestRole


To obtain temporary credentials from EC2 instance metadata

· Call the following URL from an EC2 instance that is running with the specified IAM role, which will return the associated temporary security credentials (AccessKeyId, SecretAccessKey, SessionToken, and Expiration). The example that follows uses the default instance profile for Amazon EMR, EMR_EC2_DefaultRole.

GET http://169.254.169.254/latest/meta-data/iam/security-credentials/EMR_EC2_DefaultRole

For more information about writing applications that use IAM roles, see Granting Applications that Run on Amazon EC2 Instances Access to AWS Resources.

For more information about temporary security credentials, see Using Temporary Security Credentials in the Using Temporary Security Credentials guide.

Setting Permissions on the System Directory

You can set the permissions on the release directory by using a bootstrap action to modify the mapreduce.jobtracker.system.dir.permission configuration variable. This is useful if you are running a custom configuration in which users other than "hadoop user" submit jobs to Hadoop.

Amazon Elastic MapReduce (Amazon EMR) added this configuration variable to Hadoop in AMI 2.0.5, and it is available in AMI versions 2.0.5 and later. For more information about AMI versions, see Choose an Amazon Machine Image (AMI) (p. 53).

To set permissions on the system directory

· Specify a bootstrap action that sets the permissions for the directory using numerical permissions codes (octal notation). The following example bootstrap action, with a permissions code of 777, gives full permissions to the user, group, and world. For more information about octal notation, go to http://en.wikipedia.org/wiki/File_system_permissions#Octal_notation.

s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapreduce.jobtracker.system.dir.permission=777"

Configure Logging and Debugging (Optional)

One of the things to decide as you plan your cluster is how much debugging support you want to make available. When you are first developing your data processing application, we recommend testing the application on a cluster processing a small, but representative, subset of your data. When you do this, you will likely want to take advantage of all the debugging tools that Amazon EMR offers, such as archiving log files to Amazon S3.

When you've finished development and put your data processing application into full production, you may choose to scale back debugging. Doing so can save you the cost of storing log file archives in Amazon S3 and reduce processing load on the cluster as it no longer needs to write state to Amazon S3. The trade-off, of course, is that if something goes wrong, you'll have fewer tools available to investigate the issue.

Topics
· Default Log Files (p. 161)
· Archive Log Files to Amazon S3 (p. 161)
· Enable the Debugging Tool (p. 164)

Default Log Files

By default, each cluster writes log files on the master node. These are written to the /mnt/var/log/ directory. You can access them by using SSH to connect to the master node as described in Connect to the Master Node Using SSH (p. 481). Because these logs exist on the master node, when the node terminates, either because the cluster was shut down or because an error occurred, these log files are no longer available.

You do not need to enable anything to have log files written on the master node. This is the default behavior of Amazon EMR and Hadoop.

A cluster generates several types of log files, including:

· Step logs: These logs are generated by the Amazon EMR service and contain information about the cluster and the results of each step. The log files are stored in the /mnt/var/log/hadoop/steps/ directory on the master node. Each step logs its results in a separate numbered subdirectory: /mnt/var/log/hadoop/steps/1/ for the first step, /mnt/var/log/hadoop/steps/2/ for the second step, and so on.
· Hadoop logs: These are the standard log files generated by Apache Hadoop. They contain information about Hadoop jobs, tasks, and task attempts. The log files are stored in /mnt/var/log/hadoop/ on the master node.
· Bootstrap action logs: If your job uses bootstrap actions, the results of those actions are logged. The log files are stored in /mnt/var/log/bootstrap-actions/ on the master node. Each bootstrap action logs its results in a separate numbered subdirectory: /mnt/var/log/bootstrap-actions/1/ for the first bootstrap action, /mnt/var/log/bootstrap-actions/2/ for the second bootstrap action, and so on.
· Instance state logs: These logs provide information about the CPU, memory state, and garbage collector threads of the node. The log files are stored in /mnt/var/log/instance-state/ on the master node.

Archive Log Files to Amazon S3

You can configure a cluster to periodically archive the log files stored on the master node to Amazon S3. This ensures that the log files are available after the cluster terminates, whether this is through normal shut down or due to an error. Amazon EMR archives the log files to Amazon S3 at 5 minute intervals.

To have the log files archived to Amazon S3, you must enable this feature when you launch the cluster. You can do this using the console, the CLI, or the API.

To archive log files to Amazon S3 using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Cluster Configuration section, in the Logging field, choose Enabled. This determines whether Amazon EMR captures detailed log data to Amazon S3. You can only set this when the cluster is created. For more information, see View Log Files (p. 449).
4. In the Log folder S3 location field, type an Amazon S3 path to store your logs. If you type the name of a folder that does not exist in the bucket, it is created for you.


When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).
5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To archive log files to Amazon S3 using the AWS CLI

· To archive log files to Amazon S3 using the AWS CLI, type the create-cluster command and specify the Amazon S3 log path using the --log-uri parameter. The following example illustrates creating a long-running cluster that archives log files to Amazon S3:

aws emr create-cluster --ami-version string --no-auto-terminate --log-uri s3://string \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --log-uri s3://mybucket/logs/ \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=1

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To archive log files to Amazon S3 using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Set the --log-uri argument when you launch the cluster and specify a location in Amazon S3. Alternatively, you can set this value in the credentials.json file that you configured for the CLI. This causes all of the clusters you launch with the CLI to archive log files to the specified Amazon S3 bucket. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface (p. 626). The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --log-uri s3://mybucket

· Windows users:

ruby elastic-mapreduce --create --log-uri s3://mybucket

To aggregate logs in Amazon S3 using the AWS CLI

· Log aggregation in Hadoop 2.x compiles logs from all containers for an individual application into a single file. This option is only available on Hadoop 2.x AMIs. To enable log aggregation to Amazon S3 using the AWS CLI, you use a bootstrap action at cluster launch to enable log aggregation and to specify the bucket to store the logs.

aws emr create-cluster --no-auto-terminate --ami-version string \
--bootstrap-action Path=string,Name=string,Args=["string1","string2"] \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --no-auto-terminate --ami-version 3.1.0 \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="aggregate logs",Args=["-y","yarn.log-aggregation-enable=true","-y","yarn.log-aggregation.retain-seconds=-1","-y","yarn.log-aggregation.retain-check-interval-seconds=3000","-y","yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs"] \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.xlarge,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.xlarge,InstanceCount=1

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To aggregate logs in Amazon S3 using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Log aggregation in Hadoop 2.x compiles logs from all containers for an individual application into a single file. This option is only available on Hadoop 2.x AMIs. To enable log aggregation to Amazon S3 using the Amazon EMR CLI, you use a bootstrap action at cluster launch to enable log aggregation and to specify the bucket to store the logs.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --master-instance-type m1.xlarge --slave-instance-type m1.xlarge \
--num-instances 1 --ami-version 3.1.0 --bootstrap-action \
s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args \
"-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" \
--ssh --name "log aggregation sub-bucket name"

· Windows users:

ruby elastic-mapreduce --create --alive --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 1 --ami-version 3.1.0 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" --ssh --name "log aggregation sub-bucket name"

Enable the Debugging Tool

The debugging tool is a graphical user interface that you can use to browse the log files from the console. When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and then indexes those files. You can then use the graphical interface to browse the step, job, task, and task attempt logs for the cluster in an intuitive way. An example of using the debugging tool to browse log files is shown in View the Results (p. 25).

To be able to use the graphical debugging tool, you must enable debugging when you launch the cluster. You can do this using the console, the CLI, or the API.

To enable the debugging tool using the Amazon EMR console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Cluster Configuration section, in the Logging field, choose Enabled. You cannot enable debugging without enabling logging.
4. In the Log folder S3 location field, type an Amazon S3 path to store your logs.
5. In the Debugging field, choose Enabled.

The debugging option creates a debug log index in Amazon SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information, go to the Amazon SimpleDB product description page.

6. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To enable the debugging tool using the AWS CLI

· To enable debugging using the AWS CLI, type the create-cluster subcommand with the --enable-debugging parameter. You must also specify the --log-uri parameter when enabling debugging.

aws emr create-cluster --ami-version string --no-auto-terminate --log-uri s3://string --enable-debugging \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --log-uri s3://mybucket/logs/ --enable-debugging \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
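After the cluster starts, you can confirm that log archiving is configured as expected. The following is a minimal sketch; the cluster ID is a placeholder, and the LogUri field shows the Amazon S3 location that the cluster archives logs to.

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query Cluster.LogUri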

To enable the debugging tool using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· Use the --enable-debugging argument when you create the cluster. You must also set the --log-uri argument and specify a location in Amazon S3 because archiving the log files to Amazon S3 is a prerequisite of the debugging tool. Alternatively, you can set the --log-uri value in the credentials.json file that you configured for the CLI. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface (p. 626). The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --enable-debugging \ --log-uri s3://mybucket

· Windows users:

ruby elastic-mapreduce --create --enable-debugging --log-uri s3://mybucket

Select an Amazon VPC Subnet for the Cluster (Optional)

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables and network gateways.

The reasons to launch your cluster into a VPC include the following:

· Processing sensitive data - Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides.
· Accessing resources on an internal network - If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database.

For more information about Amazon VPC, see the Amazon VPC User Guide.

Topics
· Clusters in a VPC (p. 167)
· Setting Up a VPC to Host Clusters (p. 168)
· Launching Clusters into a VPC (p. 170)
· Restricting Permissions to a VPC Using IAM (p. 171)


Clusters in a VPC

There are two platforms on which you can launch the EC2 instances of your cluster: EC2-Classic and EC2-VPC. In EC2-Classic, your instances run in a single, flat network that you share with other customers. In EC2-VPC, your instances run in a virtual private cloud (VPC) that's logically isolated to your AWS account. Your AWS account is capable of launching clusters into either the EC2-Classic or EC2-VPC platform, or only into the EC2-VPC platform, on a region-by-region basis. For more information about EC2-VPC, see Amazon Virtual Private Cloud (Amazon VPC).

Because access to and from the AWS cloud is a requirement of the cluster, you must connect an Internet gateway to the subnet hosting the cluster. If your application has components you do not want connected to the Internet gateway, you can launch those components in a private subnet you create within your VPC.

Clusters running in a VPC use two security groups, ElasticMapReduce-master and ElasticMapReduce-slave, which control access to the master and slave nodes. Both the slave and master nodes connect to Amazon S3 through the Internet gateway. Note When you launch instances in a VPC using Amazon EMR, several Elastic IP addresses are loaned to you by the system at no cost; however, these addresses are not visible because they are not created using your account.

The following diagram shows how an Amazon EMR cluster runs in a VPC. The cluster is launched within a subnet. The cluster is able to connect to other AWS resources, such as Amazon S3 buckets, through the Internet gateway.

The following diagram shows how to set up a VPC so that a cluster in the VPC can access resources in your own network, such as an Oracle database.


Setting Up a VPC to Host Clusters

Before you can launch clusters on a VPC, you must create a VPC, a subnet, and an Internet gateway. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters using the Amazon EMR console.

To create a subnet to run Amazon EMR clusters

1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.
2. In the navigation bar, select the region where you'll be running your cluster.
3. Create a VPC by clicking Start VPC Wizard.
4. Choose the VPC configuration by selecting one of the following options:

· VPC with a Single Public Subnet - Select this option if the data used in the cluster is available on the Internet (for example, in Amazon S3 or Amazon RDS).
· VPC with Public and Private subnets and Hardware VPN Access - Select this option if the data used in the cluster is stored in your network (for example, an Oracle database).

5. Confirm the VPC settings.


· To work with Amazon EMR, the VPC must have both an Internet gateway and a subnet.
· Use a private IP address space for your VPC to ensure proper DNS hostname resolution; otherwise you may experience Amazon EMR cluster failures. This includes the following IP address ranges:
· 10.0.0.0 - 10.255.255.255
· 172.16.0.0 - 172.31.255.255
· 192.168.0.0 - 192.168.255.255
· Verify that Enable DNS hostnames is checked. You have the option to enable DNS hostnames when you create the VPC. To change the setting of DNS hostnames, select your VPC in the VPC list, then click Edit in the details pane. To create a DNS entry that does not include a domain name, you need to create a DNS Options Set, and then associate it with your VPC. You cannot edit the domain name using the console after the DNS option set has been created.

For more information, see Using DNS with Your VPC.
· It is a best practice with Hadoop and related applications to ensure resolution of the fully qualified domain name (FQDN) for nodes. To ensure proper DNS resolution, configure a VPC that includes a DHCP options set whose parameters are set to the following values:
· domain-name = ec2.internal

Use ec2.internal if your region is us-east-1. For other regions, use region-name.compute.internal. For example, in us-west-2, use us-west-2.compute.internal. For the AWS GovCloud (US) Region, use us-gov-west-1.compute.internal.
· domain-name-servers = AmazonProvidedDNS

For more information, see DHCP Options Sets in the Amazon VPC User Guide.

6. Click Create VPC. A dialog box confirms that the VPC has been successfully created. Click Close.

After the VPC is created, go to the Subnets page and note the identifier of one of the subnets of your VPC. You'll use this information when you launch the Amazon EMR cluster into the VPC.
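If you prefer to script the network setup instead of using the VPC wizard, the following sketch shows roughly equivalent AWS CLI calls. The CIDR blocks and the vpc-, subnet-, igw-, and rtb- identifiers are placeholders; substitute the values returned by each command, and note that the route table ID refers to the route table associated with your subnet.

# Create the VPC and enable DNS hostnames
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-hostnames "{\"Value\":true}"

# Create a subnet and an Internet gateway, and attach the gateway to the VPC
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxxxxxx --vpc-id vpc-xxxxxxxx

# Route Internet-bound traffic through the gateway
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx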


Launching Clusters into a VPC

After you have a subnet that is configured to host Amazon EMR clusters, launching clusters on that subnet is as simple as specifying the subnet identifier during the cluster creation.

If the subnet does not have an Internet gateway, the cluster creation fails with the error: Subnet not correctly configured, missing route to an Internet gateway.

When the cluster is launched, Amazon EMR adds two security groups to the VPC: ElasticMapReduce-slave and ElasticMapReduce-master. By default, the ElasticMapReduce-master security group allows inbound SSH connections. The ElasticMapReduce-slave group does not. If you require SSH access for slave (core and task) nodes, you can add a rule to the ElasticMapReduce-slave security group. For more information about modifying security group rules, see Adding Rules to a Security Group in the Amazon EC2 User Guide for Linux Instances.
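For example, to allow SSH to the slave nodes from a specific address range, you could add an inbound rule to the ElasticMapReduce-slave group with the AWS CLI. This is a sketch; the security group ID and CIDR range are placeholders for your own values.

aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 203.0.113.0/24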

To manage the cluster on a VPC, Amazon EMR attaches a network device to the master node and manages it through this device. You can view this device using the Amazon EC2 API DescribeInstances. If you disconnect this device, the cluster will fail.

After the cluster is created, it is able to access AWS services to connect to data stores, such as Amazon S3.

To launch a cluster into a VPC using the Amazon EMR console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Hardware Configuration section, in the Network field, choose the ID of a VPC network you created previously. When you choose a VPC, the EC2 Availability Zone field becomes an EC2 Subnet field. For more information, see What is Amazon VPC?.
4. In the EC2 Subnet field, choose the ID of a subnet that you created previously.
5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To launch a cluster into a VPC using the AWS CLI

· After your VPC is configured, you can launch Amazon EMR clusters in it by using the create-cluster subcommand with the --ec2-attributes parameter. Use the --ec2-attributes parameter to specify the VPC subnet for your cluster. This is illustrated in the following example, which creates a long-running cluster in the specified subnet:

aws emr create-cluster --ami-version string --no-auto-terminate --ec2-attributes SubnetId=subnet-string \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --ec2-attributes SubnetId=subnet-77XXXX03 \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=1
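To confirm that the cluster was placed in the intended subnet, you can inspect its EC2 attributes after launch. A minimal sketch, assuming the cluster ID returned by create-cluster:

aws emr describe-cluster --cluster-id j-2AL4XXXXXX5T9 --query Cluster.Ec2InstanceAttributes.Ec2SubnetId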


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch a cluster into a VPC using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· After your VPC is configured, you can launch Amazon EMR clusters in it by using the --subnet argument with the subnet address. This is illustrated in the following example, which creates a long-running cluster in the specified subnet. In the directory where you installed the Amazon EMR CLI, type the following command.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --subnet subnet-77XXXX03

· Windows users:

ruby elastic-mapreduce --create --alive --subnet subnet-77XXXX03

Restricting Permissions to a VPC Using IAM

When you launch a cluster into a VPC, you can use AWS Identity and Access Management (IAM) to control access to clusters and restrict actions using policies, just as you would with clusters launched into EC2-Classic. For more information about how IAM works with Amazon EMR, see Using IAM.

You can also use IAM to control who can create and administer subnets. For more information about administering policies and actions in Amazon EC2 and Amazon VPC, see IAM Policies for Amazon EC2 in the Amazon EC2 User Guide for Linux Instances.

By default, all IAM users can see all of the subnets for the account, and any user can launch a cluster in any subnet.

You can limit the ability to administer the subnet while still allowing users to launch clusters into subnets. To do so, create one user account that has permissions to create and configure subnets, and a second user account that can launch clusters but can't modify Amazon VPC settings.

To allow users to launch clusters in a VPC without the ability to modify the VPC

1. Create the VPC and launch Amazon EMR into a subnet of that VPC using an account with permissions to administer Amazon VPC and Amazon EMR.
2. Create a second user account with permissions to call the RunJobFlow, DescribeJobFlows, TerminateJobFlows, and AddJobFlowStep actions in the Amazon EMR API. You should also create an IAM policy that allows this user to launch EC2 instances. An example of this is shown below.

{ "Version": "2012-10-17", "Statement": [ { "Action": [

API Version 2009-03-31 171 Amazon Elastic MapReduce Developer Guide Tagging Amazon EMR Clusters

"ec2:AuthorizeSecurityGroupIngress", "ec2:CancelSpotInstanceRequests", "ec2:CreateSecurityGroup", "ec2:CreateTags", "ec2:DescribeAvailabilityZones", "ec2:DescribeInstances", "ec2:DescribeKeyPairs", "ec2:DescribeSubnets", "ec2:DescribeSecurityGroups", "ec2:DescribeSpotInstanceRequests", "ec2:DescribeRouteTables", "ec2:ModifyImageAttribute", "ec2:ModifyInstanceAttribute", "ec2:RequestSpotInstances", "ec2:RunInstances", "ec2:TerminateInstances" ], "Effect": "Allow", "Resource": "*" }, { "Action": [ "elasticmapreduce:AddInstanceGroups", "elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeJobFlows", "elasticmapreduce:ModifyInstanceGroups", "elasticmapreduce:RunJobFlow", "elasticmapreduce:TerminateJobFlows" ], "Effect": "Allow", "Resource": "*" }, { "Action": [ "s3:GetObject", "s3:PutObject", "s3:ListBucket" ], "Effect": "Allow", "Resource": "*" } ] }

Users with the IAM permissions set above are able to launch clusters within the VPC subnet, but are not able to change the VPC configuration. Note You should be cautious when granting ec2:TerminateInstances permissions because this action gives the recipient the ability to shut down any EC2 instance in the account, including those outside of Amazon EMR.
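One way to attach a policy like the one above to the second user account is as an inline policy using the AWS CLI. This is a sketch that assumes the policy JSON has been saved locally as emr-user-policy.json and that the user is named emr-launch-user; both names are illustrative.

# Create the restricted user and attach the inline policy
aws iam create-user --user-name emr-launch-user
aws iam put-user-policy --user-name emr-launch-user --policy-name emr-launch-in-vpc --policy-document file://emr-user-policy.json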

Tagging Amazon EMR Clusters

It can be convenient to categorize your AWS resources in different ways; for example, by purpose, owner, or environment. You can achieve this in Amazon EMR by assigning custom metadata to your Amazon EMR clusters using tags. A tag consists of a key and a value, both of which you define. For Amazon EMR, the cluster is the resource level that you can tag. For example, you could define a set of tags for your account's clusters that helps you track each cluster's owner or identify a production cluster versus a testing cluster. We recommend that you create a consistent set of tags to meet your organization requirements.

When you add a tag to an Amazon EMR cluster, the tag is also propagated to each active Amazon EC2 instance associated with the cluster. Similarly, when you remove a tag from an Amazon EMR cluster, that tag is removed from each associated active Amazon EC2 instance. Important Use the Amazon EMR console or CLI to manage tags on Amazon EC2 instances that are part of a cluster instead of the Amazon EC2 console or CLI, because changes that you make in Amazon EC2 do not synchronize back to the Amazon EMR tagging system.

You can identify an Amazon EC2 instance that is part of an Amazon EMR cluster by looking for the following system tags. In this example, CORE is the value for the instance group role and j-12345678 is an example job flow (cluster) identifier value:

· aws:elasticmapreduce:instance-group-role=CORE · aws:elasticmapreduce:job-flow-id=j-12345678

Note Amazon EMR and Amazon EC2 interpret your tags as a string of characters with no semantic meaning.

You can work with tags using the AWS Management Console, the Amazon EMR CLI, and the Amazon EMR API.

You can add tags when creating a new Amazon EMR cluster, and you can add, edit, or remove tags from a running Amazon EMR cluster. Editing a tag is a concept that applies to the Amazon EMR console; using the CLI and API, to edit a tag you remove the old tag and add a new one. You can edit tag keys and values, and you can remove tags from a resource at any time while a cluster is running. However, you cannot add, edit, or remove tags from a terminated cluster or from terminated instances that were previously associated with a cluster that is still active. In addition, you can set a tag's value to the empty string, but you can't set a tag's value to null.

If you're using AWS Identity and Access Management (IAM) with your Amazon EC2 instances for resource-based permissions by tag, your IAM policies are applied to tags that Amazon EMR propagates to a cluster's Amazon EC2 instances. For Amazon EMR tags to propagate to your Amazon EC2 instances, your IAM policy for Amazon EC2 needs to allow permissions to call the Amazon EC2 CreateTags and DeleteTags APIs. Also, propagated tags can affect your Amazon EC2 resource-based permissions. Tags propagated to Amazon EC2 can be read as conditions in your IAM policy, just like other Amazon EC2 tags. Keep your IAM policy in mind when adding tags to your Amazon EMR clusters to avoid IAM users having incorrect permissions for a cluster. To avoid problems, make sure that your IAM policies do not include conditions on tags that you also plan to use on your Amazon EMR clusters. For more information, see Controlling Access to Amazon EC2 Resources.

Tag Restrictions

The following basic restrictions apply to tags:

· The maximum number of tags per resource is 10.
· The maximum key length is 127 Unicode characters.
· The maximum value length is 255 Unicode characters.
· Tag keys and values are case sensitive.


· Do not use the aws: prefix in tag names and values because it is reserved for AWS use. In addition, you cannot edit or delete tag names or values with this prefix.
· You cannot change or edit tags on a terminated cluster.
· A tag value can be an empty string, but not null. In addition, a tag key cannot be an empty string.

For more information about tagging using the AWS Management Console, see Working with Tags in the Console in the Amazon EC2 User Guide for Linux Instances. For more information about tagging using the Amazon EC2 API or command line, see API and CLI Overview in the Amazon EC2 User Guide for Linux Instances.

Tagging Resources for Billing

You can use tags for organizing your AWS bill to reflect your own cost structure. To do this, sign up to get your AWS account bill with tag key values included. You can then organize your billing information by tag key values, to see the cost of your combined resources. Although Amazon EMR and Amazon EC2 have different billing statements, the tags on each cluster are also placed on each associated instance so you can use tags to link related Amazon EMR and Amazon EC2 costs.

For example, you can tag several resources with a specific application name, and then organize your billing information to see the total cost of that application across several services. For more information, see Cost Allocation and Tagging.

Adding Tags to a New Cluster

You can add tags to a cluster while you are creating it.

To add tags when creating a new cluster using the console

1. In the Amazon EMR console, select the Cluster List page and click Create cluster.
2. On the Create Cluster page, in the Tags section, click the empty field in the Key column and type the name of your key. When you begin typing the new tag, another tag line automatically appears to provide space for the next new tag.
3. Optionally, click the empty field in the Value column and type the name of your value.
4. Repeat the previous steps for each tag key/value pair to add to the cluster as shown in the following image. When the cluster launches, any tags you enter are automatically associated with the cluster.


To add tags when creating a new cluster using the AWS CLI

The following example demonstrates how to add a tag to a new cluster using the AWS CLI. To add tags when you create a cluster, type the create-cluster subcommand with the --tags parameter.

· To add a tag when you create a cluster, type the following command:

aws emr create-cluster --ami-version string --no-auto-terminate --tags string \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example, to add a tag with the key costCenter and the value marketing, type:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --tags "costCenter=marketing" \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2
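The --tags parameter accepts a space-separated list, so you can apply several tags in a single launch. A sketch with illustrative tag names:

aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --tags "costCenter=marketing" "environment=test" \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2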

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add tags when creating a new cluster using the Amazon EMR CLI

The following example demonstrates how to add a tag to a new cluster using the Amazon EMR CLI.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --tag "costCenter=marketing"

· Windows users:

ruby elastic-mapreduce --create --tag "costCenter=marketing"

Adding Tags to an Existing Cluster

You can also add tags to an existing cluster.

To add tags to an existing cluster using the console

1. In the Amazon EMR console, select the Cluster List page and click a cluster to which to add tags.
2. On the Cluster Details page, in the Tags field, click View All/Edit.


3. On the View All/Edit page, click Add.
4. Click the empty field in the Key column and type the name of your key.
5. Optionally, click the empty field in the Value column and type the name of your value.
6. With each new tag you begin, another empty tag line appears under the tag you are currently editing. Repeat the previous steps on the new tag line for each tag to add.

To add tags to a running cluster using the AWS CLI

The following example demonstrates how to add tags to a running cluster using the AWS CLI. Type the add-tags subcommand with the --tag parameter to assign tags to a resource identifier (cluster ID).

· To add two tags to a running cluster (one with a key named production with no value and the other with a key named costCenter with a value of marketing), type the following command. The resource ID is the cluster identifier available via the console or the list-clusters command:

aws emr add-tags --resource-id string --tag string --tag string

Note The add-tags subcommand currently accepts only one resource ID.

For example:

aws emr add-tags --resource-id j-KT4XXXXXXXX1NM --tag "costCenter=marketing" --tag "other=accounting"

Note When tags are added using the AWS CLI, there is no output from the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add tags to a running cluster using the Amazon EMR CLI

The following example demonstrates how to add two tags to a running cluster using the Amazon EMR CLI. One tag has a key named production with no value, and the other tag has a key named costCenter with a value of marketing.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --add-tags j-1234567890123 --tag production --tag "costCenter=marketing"

· Windows users:

ruby elastic-mapreduce --add-tags j-1234567890123 --tag production --tag "costCenter=marketing"


Note Quotes are unnecessary when your tag has only a key.

If the command completes successfully, the output is similar to the following:

TAG cluster j-1234567890123 production
TAG cluster j-1234567890123 costCenter marketing

In addition, you can apply the same tags to multiple clusters by specifying more than one cluster identifier separated by a space, for example:

./elastic-mapreduce --add-tags j-1234567890123 j-9876543210987 --tag production --tag "costCenter=marketing"

Viewing Tags on a Cluster

If you would like to see all the tags associated with a cluster, you can view them in the console or at the CLI.

To view the tags on a cluster using the console

1. In the Amazon EMR console, select the Cluster List page and click a cluster to view tags.
2. On the Cluster Details page, in the Tags field, some tags are displayed here. Click View All/Edit to display all available tags on the cluster.

To view the tags on a cluster using the AWS CLI

To view the tags on a cluster using the AWS CLI, type the describe-cluster subcommand with the --query parameter.

· To view a cluster's tags, type the following command:

aws emr describe-cluster --cluster-id string --query Cluster.Tags

For example:

aws emr describe-cluster --cluster-id j-KT4XXXXXX1NM --query Cluster.Tags

The output displays all the tag information about the cluster similar to the following:

Value: accounting   Key: other
Value: marketing    Key: costCenter

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To view the tags on a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow "j-1234567890123" --list-tags

· Windows users:

ruby elastic-mapreduce --jobflow "j-1234567890123" --list-tags

The output displays all the tag information about the cluster similar to the following:

Key: id          Value: 2785
Key: costCenter  Value: marketing

Removing Tags from a Cluster

If you no longer need a tag, you can remove it from the cluster.

To remove tags from a cluster using the console

1. In the Amazon EMR console, select the Cluster List page and click a cluster from which to remove tags.
2. On the Cluster Details page, in the Tags field, click View All/Edit.
3. In the View All/Edit dialog box, click the X icon next to the tag to delete and click Save.
4. (Optional) Repeat the previous step for each tag key/value pair to remove from the cluster.

To remove tags from a cluster using the AWS CLI

To remove tags from a cluster using the AWS CLI, type the remove-tags subcommand with the --tag-keys parameter. When removing a tag, only the key name is required.

· To remove a tag from a cluster, type the following command:

aws emr remove-tags --resource-id resourceId --tag-keys string

For example:

aws emr remove-tags --resource-id j-KT4XXXXXX1NM --tag-keys "costCenter"


Note You cannot currently remove multiple tags using a single command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To remove tags from a cluster using the Amazon EMR CLI

The following example demonstrates how to remove one tag from a cluster using the Amazon EMR CLI.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --remove-tags j-1234567890123 --tag "costCenter=marketing"

· Windows users:

ruby elastic-mapreduce --remove-tags j-1234567890123 --tag "costCenter=marketing"

In addition, you can remove all tags from a cluster by specifying only the cluster identifier, as shown in the following example:

./elastic-mapreduce --remove-tags j-1234567890123

Also, you can remove a tag from a cluster using only its key name, without quotes, when the value does not matter, as shown in the following example:

./elastic-mapreduce --remove-tags j-1234567890123 --tag costCenter

Use Third Party Applications With Amazon EMR (Optional)

You can run several popular big-data applications on Amazon EMR with utility pricing. This means you pay a nominal additional hourly fee for the third-party application while your cluster is running. It allows you to use the application without having to purchase an annual license.


Product Description

HParser A tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files. For more information about running HParser with Amazon EMR, see Parse Data with HParser (p. 180).

MapR Distribution for Hadoop An open, enterprise-grade distribution that makes Hadoop easier and more dependable. For ease of use, MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. For dependability, MapR provides high availability with a self-healing no-NameNode architecture, and data protection with snapshots, disaster recovery, and cross-cluster mirroring. For more information about using MapR with Amazon EMR, see Using the MapR Distribution for Hadoop (p. 181).

Use Business Intelligence Tools with Amazon EMR

You can use popular business intelligence tools like Microsoft Excel, MicroStrategy, QlikView, and Tableau with Amazon EMR to explore and visualize your data. Many of these tools require an ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) driver.You can download and install the necessary drivers from the links below:

· http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/HiveJDBC.zip
· http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/HiveODBC.zip
· http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/ImpalaJDBC.zip
· http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/ImpalaODBC.zip
· http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/HBaseODBC.zip

For more information about how you would connect a business intelligence tool like Microsoft Excel to Hive, go to http://www.simba.com/wp-content/uploads/2013/05/Simba-Apache-Hive-ODBC-Driver-Quickstart.pdf.

Parse Data with HParser

Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files.

HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.

In our stock trade information example, you could use HParser running on top of Amazon EMR to efficiently parse the data across a cluster of machines. The cluster will automatically shut down when all of your files have been converted, ensuring you are only charged for the resources used. This makes your legacy data available for analysis, without incurring ongoing IT infrastructure expenses.

The following tutorial walks you through the process of using HParser hosted on Amazon EMR to process custom text files into an easy-to-analyze XML format. The parsing logic for this sample has been defined for you using HParser, and is stored in the transformation services file (services_basic.tar.gz). This file, along with other content needed to run this tutorial, has been preloaded onto Amazon Simple Storage Service (Amazon S3) at s3n://elasticmapreduce/samples/informatica/. You will reference these files when you run the HParser job.

For a step-by-step tutorial of how to run HParser on Amazon EMR, see Parse Data with HParser on Amazon EMR.

For more information about HParser and how to use it, go to http://www.informatica.com/us/products/b2b-data-exchange/hparser/.

Using the MapR Distribution for Hadoop

MapR is a third-party application offering an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable. For ease of use, MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. For dependability, MapR provides high availability with a self-healing no-NameNode architecture, and data protection with snapshots, disaster recovery, and cross-cluster mirroring. For more information about MapR, go to http://www.mapr.com/.

There are several editions of MapR available on Amazon EMR:

· M3 Edition (versions 3.1.1, 3.0.3, 3.0.2, 2.1.3) - The free version of a complete distribution for Hadoop. M3 delivers a fully random, read-write-capable platform that supports industry-standard interfaces (e.g., NFS, ODBC), and provides management, compression, and performance advantages.
· M5 Edition (versions 3.1.1, 3.0.3, 3.0.2, 2.1.3) - The complete distribution for Apache Hadoop that delivers enterprise-grade features for all file operations on Hadoop. M5 features include mirroring, snapshots, NFS HA, and data placement control. For more information, including pricing, see MapR on Amazon EMR Detail Page.
· M7 Edition (versions 3.1.1, 3.0.3, 3.0.2) - The complete distribution for Apache Hadoop that delivers ease of use, dependability, and performance advantages for NoSQL and Hadoop applications. M7 provides scale, strong consistency, reliability, and continuous low latency with an architecture that does not require compactions or background consistency checks. For more information, including pricing, see MapR on Amazon EMR Detail Page.

Note For enterprise-grade reliability and consistent performance for Apache HBase applications, use the MapR M7 edition. In addition, MapR does not support Ganglia or debugging, and the MapR M3 and M5 editions on Amazon EMR do not support Apache HBase. The currently supported versions of MapR are not available on Hadoop 2.x.

Launch an Amazon EMR cluster with MapR using the console

You can launch any standard cluster on a MapR distribution by specifying MapR when you set the Hadoop version.

To launch an Amazon EMR cluster with MapR using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.


3. In the Software Configuration section, verify the fields according to the following table. Important If you launch a MapR job flow in a VPC, you must set enableDnsHostnames to true and all hosts in the cluster must have a fully qualified domain name. For more information, see Using DNS with Your VPC and DHCP Options Sets.

Field Action

Hadoop distribution Choose MapR. This determines which version of Hadoop to run on your cluster. For more information about the MapR distribution for Hadoop, see Using the MapR Distribution for Hadoop (p. 181).

Edition Choose the MapR edition that meets your requirements. For more information, see Using the MapR Distribution for Hadoop (p. 181).

Version Choose the MapR version that meets your requirements. For more information, see Versions (p. 184).

Arguments (Optional) Specify any additional arguments to pass to MapR to meet your requirements. For more information, see Arguments (p. 184).

4. Click Done and proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).


Launch a Cluster with MapR Using the AWS CLI or the Amazon EMR CLI

To launch an Amazon EMR cluster with MapR using the AWS CLI

· To launch a cluster with MapR using the AWS CLI, type the create-cluster command with the --applications parameter:

aws emr create-cluster --applications Name=string,Args=--edition,string,--version,string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=MapR,Args=--edition,m7,--version,3.1.1 --ami-version 2.4.8 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

Note MapR does not currently support Hadoop 2.x. When specifying the --ami-version, use a Hadoop 1.x AMI.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch an Amazon EMR cluster with MapR using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, specify the MapR edition and version by passing arguments with the --args option. For more information, see Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive \
--instance-type m1.large --num-instances 3 \
--supported-product mapr --name m5 --args "--edition,m5,--version,3.1.1"

· Windows users:

ruby elastic-mapreduce --create --alive --instance-type m1.large --num-instances 3 --supported-product mapr --name m5 --args "--edition,m5,--version,3.1.1"


MapR Editions, Versions, and Arguments

MapR supports several editions, versions, and arguments that you can pass to it using the Amazon EMR console or CLI. The following examples use the Amazon EMR CLI; however, Amazon EMR provides an equivalent for each that you can use with the console.

For more information about launching clusters using the CLI, see the instructions for each cluster type in Plan an Amazon EMR Cluster (p. 30). For more information about launching clusters using the API, see Use SDKs to Call Amazon EMR APIs (p. 561).

Editions

Using the CLI, you can pass the --edition parameter to specify the following editions:

· m3
· m5
· m7

The default edition is m3 if not specified.

Versions

You can use --version to specify the following versions:

· M3 Edition - 3.1.1, 3.0.3, 3.0.2, 2.1.3
· M5 Edition - 3.1.1, 3.0.3, 3.0.2, 2.1.3
· M7 Edition - 3.1.1, 3.0.3, 3.0.2

Arguments

The --supported-product option supports optional arguments. Amazon EMR accepts and forwards the argument list to the corresponding installation script as bootstrap action arguments. The following example shows the syntax for using the --supported-product option with or without optional arguments. The fields key1 and value1, etc. represent custom key/value pairs that MapR recognizes and the key named edition is shown among them as an example:

--supported-product mapr-m3
--supported-product mapr-m3 --args "--key1,value1,--key2,value2"
--supported-product mapr --args "--edition,m3,--key1,value1,--key2,value2"
--supported-product mapr-m3 --args "--edition,m3,--key1,value1,--key2,value2"

The following table lists the parameters that MapR reads when users specify them with the --args option.

Parameter Description

--edition The MapR edition.

--version The MapR version.

--owner-password The password for the Hadoop owner. The default is hadoop.

--num-mapr-masters Any additional master nodes for m5/m7 clusters.


Parameter Description

--mfs-percentage The proportion of space from the attached storage volumes to be used for MapR file system; the default is 100 (all available space); the minimum is 50. Users who are doing large copies in and out of Amazon S3 storage using s3distcp or another such mechanism may see faster performance by allowing some of the storage space to be used as regular Linux storage. For example: --mfs-percentage,90. The remainder is mounted to S3_TMP_DIR as RAID0 file system.

--hive-version The version of the Amazon Hive package. The default is latest. To disable the installation of Amazon Hive, use --hive-version,none.

--pig-version The version of the Amazon Pig package. The default is latest. To disable the installation of Amazon Pig, use --pig-version,none.
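For example, to pass several of these parameters at cluster launch with the Amazon EMR CLI, you can chain them in the --args list. This is a sketch; the password, percentage, and version values are illustrative.

./elastic-mapreduce --create --alive --instance-type m1.large --num-instances 3 \
--supported-product mapr \
--args "--edition,m5,--version,3.1.1,--owner-password,MyPassword,--mfs-percentage,90,--hive-version,none"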

The following table shows examples of commands, including arguments with the --args parameter, and which command MapR executes after interpreting the input.

Options Commands Processed by MapR

--supported-product mapr --edition m3

--supported-product mapr-m5 --edition m5

--supported-product mapr-m3 --edition m3

--with-supported-products mapr-m3 (deprecated) --edition m3

--with-supported-products mapr-m5 (deprecated) --edition m5

--supported-product mapr-m5 --args "--version,1.1" --edition m5 --version 1.1

--supported-product mapr-m5 --args "--edition,m3" N/A - returns an error

--supported-product mapr --args "--edition,m5" --edition m5

--supported-product mapr --args "--version,1.1" --edition m3 --version 1.1

--supported-product mapr --args "--edition,m5,--key1,value1" --edition m5 --key1 value1

Note The --with-supported-products option is deprecated and replaced by the --supported-product option, which provides the same Hadoop and MapR versions with added support for parameters.

Enabling MCS access for your Amazon EMR Cluster

After your MapR cluster is running, you need to open port 8453 to enable access to the MapR Control System (MCS) from hosts other than the host that launched the cluster. Follow these steps to open the port.


To open a port for MCS access

1. Select your job from the list of jobs displayed in Your Elastic MapReduce Job Flows in the Elastic MapReduce tab of the AWS Management Console, then select the Description tab in the lower pane. Make a note of the Master Public DNS Name value.
2. Click the Amazon EC2 tab in the AWS Management Console to open the Amazon EC2 console.
3. In the navigation pane, select Security Groups under the Network and Security group.
4. In the Security Groups list, select Elastic MapReduce-master.
5. In the lower pane, click the Inbound tab.
6. In Port Range:, type 8453. Leave the default value in the Source: field. Note The standard MapR port is 8443. Use port number 8453 instead of 8443 when you make MapR REST API calls to MapR on an Amazon EMR cluster.
7. Click Add Rule, then click Apply Rule Changes.
8. You can now navigate to the master node's DNS address. Connect to port 8453 to log in to the MapR Control System. Use the string hadoop for both login and password at the MCS login screen.
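As an alternative to the console steps above, you can open the MCS port from the AWS CLI. This is a sketch; the security group ID of the ElasticMapReduce-master group and the CIDR range are placeholders for your own values.

aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 8453 --cidr 203.0.113.0/24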

Note For M5 MapR clusters on Amazon EMR, the MCS web server runs on the primary and secondary CLDB nodes, giving you another entry point to the MCS if the primary fails.

Testing Your Cluster

This section explains how to test your cluster by performing a word count on a sample input file.

To create a file and run a test job

1. Connect to the master node with SSH as user hadoop. Pass your .pem credentials file to ssh with the -i flag, as in this example:

ssh -i /path_to_pemfile/credentials.pem [email protected]

2. Create a simple text file:

cd /mapr/MapR_EMR.amazonaws.com
mkdir in
echo "the quick brown fox jumps over the lazy dog" > in/data.txt

3. Run the following command to perform a word count on the text file:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /mapr/MapR_EMR.amazonaws.com/in /mapr/MapR_EMR.amazonaws.com/out

As the job runs, you should see terminal output similar to the following:

12/06/09 00:00:37 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-118-194-139.ec2.internal/10.118.194.139:9001
12/06/09 00:00:37 INFO input.FileInputFormat: Total input paths to process : 1
12/06/09 00:00:37 INFO mapred.JobClient: Running job: job_201206082332_0004
12/06/09 00:00:38 INFO mapred.JobClient: map 0% reduce 0%


12/06/09 00:00:50 INFO mapred.JobClient: map 100% reduce 0%
12/06/09 00:00:57 INFO mapred.JobClient: map 100% reduce 100%
12/06/09 00:00:58 INFO mapred.JobClient: Job complete: job_201206082332_0004
12/06/09 00:00:58 INFO mapred.JobClient: Counters: 25
12/06/09 00:00:58 INFO mapred.JobClient: Job Counters
12/06/09 00:00:58 INFO mapred.JobClient: Launched reduce tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=6193
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Launched map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Data-local map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4875
12/06/09 00:00:58 INFO mapred.JobClient: FileSystemCounters
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_READ=385
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=276
12/06/09 00:00:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=94449
12/06/09 00:00:58 INFO mapred.JobClient: MapReduce Framework
12/06/09 00:00:58 INFO mapred.JobClient: Map input records=1
12/06/09 00:00:58 INFO mapred.JobClient: Reduce shuffle bytes=94
12/06/09 00:00:58 INFO mapred.JobClient: Spilled Records=16
12/06/09 00:00:58 INFO mapred.JobClient: Map output bytes=80
12/06/09 00:00:58 INFO mapred.JobClient: CPU_MILLISECONDS=1530
12/06/09 00:00:58 INFO mapred.JobClient: Combine input records=9
12/06/09 00:00:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input records=8
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input groups=8
12/06/09 00:00:58 INFO mapred.JobClient: Combine output records=8
12/06/09 00:00:58 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=329244672
12/06/09 00:00:58 INFO mapred.JobClient: Reduce output records=8
12/06/09 00:00:58 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=3252969472
12/06/09 00:00:58 INFO mapred.JobClient: Map output records=9
12/06/09 00:00:58 INFO mapred.JobClient: GC time elapsed (ms)=1

4. Check the /mapr/MapR_EMR.amazonaws.com/out directory for a file named part-r-00000 with the results of the job.

cat out/part-r-00000
brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
the 2

Note The ability to use standard Linux tools such as echo and cat in this example is made possible by MapR's ability to mount the cluster on NFS at /mapr/MapR_EMR.amazonaws.com.


Tutorial: Launch an Amazon EMR Cluster with MapR M7

This tutorial guides you through launching an Amazon EMR cluster featuring the M7 edition of the MapR distribution for Hadoop. The MapR distribution for Hadoop is a complete Hadoop distribution that provides many unique capabilities, such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability, and automatic compression. You can manage a MapR cluster through the web-based MapR Control System, the command line, or a REST API. M7 provides enterprise-grade capabilities such as high availability, snapshot and mirror volumes, and native MapR table functionality on MapR-FS, enabling responsive HBase-style flat table databases compatible with snapshots and mirroring. It provides a single platform for storing and processing unstructured and structured data, integrated with existing infrastructure, applications, and tools.

Note To use the commands in this tutorial, download and install the AWS CLI. For more information see Installing the AWS CLI in the AWS Command Line Interface User Guide.

1. Use the AWS CLI to launch a cluster using the following command:

aws emr create-cluster --applications Name=MapR,Args=--edition,m7,--version,3.1.1 --ami-version 2.4.8 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
--no-auto-terminate

After you run the command, the cluster takes between five and ten minutes to start. The aws emr list-clusters command shows your cluster in the STARTING and BOOTSTRAPPING states before entering the WAITING state.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
2. Retrieve the cluster ID and then use SSH to connect to the cluster. In the following commands, replace j-2AL4XXXXXX5T9 with your cluster ID, and replace ~/mykeypair.key with the path to and filename of your key pair private key file.

aws emr list-clusters

aws emr ssh --cluster-id j-2AL4XXXXXX5T9 --key-pair-file ~/mykeypair.key

Note For more information about accessing a cluster with SSH, see Connect to the Master Node Using SSH (p. 481).
3. MapR provides volumes as a way to organize data and manage cluster performance. A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. You can use volumes to enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. Create a volume for each user, department, or project. You can mount volumes under other volumes to build a structure that reflects the needs of your organization. The volume structure defines how data is distributed across the nodes in your cluster.

Run the following command after connecting to your cluster over SSH to create a volume:

$ maprcli volume create -name tables -replicationtype low_latency -path /tables


4. The M7 Edition of the MapR distribution for Hadoop enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system. In the M7 Edition, tables share the same namespace as files and are specified with full path names. You can create mappings between the table names used in Apache HBase and M7 Edition's native tables.

a. Create a table with the following command:

$ echo "create ©/tables/user_table©, ©family© " | hbase shell

b. List tables with the following command:

$ hadoop fs -ls /tables
Found 1 items
trwxr-xr-x 3 root root 2 2013-04-16 22:49 /tables/user_table

$ ls /mapr/MapR_EMR.amazonaws.com/tables
user_table

c. Move or rename tables using the following command:

hadoop fs -mv /tables/user_table /tables/usertable

5. A snapshot is a read-only image of a volume at a specific point in time. Snapshots are useful any time you need to be able to roll back to a known good data set at a specific point in time.

a. From the HBase shell, add a row to the newly-created table:

$ hbase shell
hbase(main):001:0> put '/tables/usertable', 'row_1', 'family:child', 'username'
output: 0 row(s) in 0.0920 seconds

b. Create the snapshot:

$ maprcli volume snapshot create -volume tables -snapshotname mysnap

c. Change the table:

hbase(main):002:0> put '/tables/usertable', 'row_1', 'family:location', 'San Jose'

d. Snapshots are stored in a .snapshot directory. Scan the table from the snapshot and the current table to see the difference:

hbase shell > scan '/tables/usertable'
ROW COLUMN+CELL
row_1 column=family:child, timestamp=1366154016042, value=username
row_1 column=family:home, timestamp=1366154055043, value=San Jose
1 row(s) in 0.2910 seconds

scan '/tables/.snapshot/mysnap/usertable'
ROW COLUMN+CELL
row_1 column=family:child, timestamp=1366154016042, value=username
1 row(s) in 0.2320 seconds

6. To load data into your cluster, you can use the Yahoo! Cloud Serving Benchmark (YCSB) tool, which you must download and install. For more information, see Getting Started with YCSB. The YCSB tool can run without modification because MapR tables are binary-compatible with Apache HBase tables. This procedure uses YCSB only for data-loading purposes, not as a proper benchmark, and assumes that the usertable table and the family column family created in a previous procedure exist on the cluster.

a. Add or update the following entry to the core-site.xml file to map the table requests to the /tables directory:

<property>
  <name>hbase.table.namespace.mappings</name>
  <value>*:/tables</value>
</property>

b. Download and open YCSB:

$ curl -O http://cloud.github.com/downloads/brianfrankcooper/YCSB/ycsb-0.1.4.tar.gz
$ tar xzvf ycsb-0.1.4.tar.gz
$ cd ycsb-0.1.4

c. Use the following command to load data:
Note
This command performs ten million inserts, which can take up to ten minutes to complete. Using a smaller value for the recordcount parameter shortens the run time for this command.

$ HBASE_CLASSPATH=core/lib/core-0.1.4.jar:hbase-binding/lib/hbase-binding-0.1.4.jar \
hbase com.yahoo.ycsb.Client \
-db com.yahoo.ycsb.db.HBaseClient \
-P workloads/workloada \
-p columnfamily=family \
-p recordcount=10000000 \
-load -s > ycsb_output.txt &

The -s option displays status messages to standard output every 10 seconds, allowing you to track the process. The ycsb_output.txt file contains only the final output for the command.

7. Test high availability:

a. List the current regions on your system.


$ maprcli table region list -path /tables/usertable
secondarynodes                                             scans  primarynode                  puts  startkey   gets  lastheartbeat  endkey
ip-10-191-5-21.ec2.internal, ip-10-68-37-140.ec2.internal  ...    ip-10-4-74-175.ec2.internal  ...   -INFINITY  ...   0              INFINITY

b. Restart the primary node for one of the regions. Make sure that the primary node is not the access point to the cluster. Restarting your access point will result in loss of cluster access and terminate your YCSB client.

Connect to the cluster with SSH and restart the node with the following command:

$ ssh -i /Users/username/downloads/MyKey_Context.pem hadoop@ec2-23-20-100-174.compute-1.amazonaws.com
$ sudo /sbin/reboot

Note
The restart will take 15 to 30 seconds to complete.

c. After the restart is complete, list your new regions to see that the former primary node is now listed as secondary.

$ maprcli table region list -path /tables/usertable
secondarynodes                                             scans  primarynode                  puts  startkey   gets  lastheartbeat  endkey
ip-10-191-5-21.ec2.internal, ip-10-68-37-140.ec2.internal  ...    ip-10-4-74-175.ec2.internal  ...   -INFINITY  ...   0              INFINITY

8. To open the MapR Control System page, navigate to the address https://hostname.compute-1.amazonaws.com:8453. The username and password for the default installation are both hadoop. The URL for your node's hostname appears in the message-of-the-day that displays when you first log in to the node over SSH.

The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane. The Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting Cluster displays all the nodes in the cluster. Clicking any column name sorts data in ascending or descending order by that column. If your YCSB job is still running, you can see the put streams continuing from the Nodes view.


You can also view the status of the cluster's volumes, such as the tables volume created earlier, from the Volumes view.


Run a Hadoop Application to Process Data

Amazon EMR provides several models for creating custom Hadoop applications to process data:

· Custom JAR or Cascading: write a Java application, which may or may not make use of the Cascading Java libraries, generate a JAR file, and upload the JAR file to Amazon S3 where it will be imported into the cluster and used to process data. When you do this, your JAR file must contain an implementation for both the Map and Reduce functionality.
· Streaming: write separate Map and Reduce scripts using one of several scripting languages, and upload the scripts to Amazon S3, where the scripts are imported into the cluster and used to process data. You can also use built-in Hadoop classes, such as aggregate, instead of providing a script.

Regardless of which type of custom application you create, the application must provide both Map and Reduce functionality, and should adhere to Hadoop programming best practices.

Topics · Build Binaries Using Amazon EMR (p. 193) · JAR Requirements (p. 197) · Run a Script in a Cluster (p. 197) · Process Data with Streaming (p. 198) · Process Data Using Cascading (p. 210) · Process Data with a Custom JAR (p. 226)

Build Binaries Using Amazon EMR

You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your cluster. Programs that you use with Amazon EMR must be compiled on a system running the same version of Linux used by Amazon EMR. For a 32-bit version (m1.small and m1.medium), compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information about EC2 instance versions, go to Virtual Server Configurations (p. 35). Supported programming languages include C++, Cython, and C#.

The following table outlines the steps involved to build and test your application using Amazon EMR.


Process for Building a Module

1 Create a long-running cluster.

2 Connect to the master node of your cluster.

3 Copy source files to the master node.

4 Build binaries with any necessary optimizations.

5 Copy binaries from the master node to Amazon S3.

6 Terminate the cluster.

The details for each of these steps are covered in the sections that follow.

To create a long-running cluster using the AWS CLI

· Type the following command to launch a long-running cluster:

aws emr create-cluster --no-auto-terminate --name string --applications Name=string \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example, this command launches a cluster with 1 core node and with Hive installed:

aws emr create-cluster --no-auto-terminate --name "Long Running Cluster" --applications Name=Hive \
--instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=1

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
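While the cluster is starting, you can check its state from the AWS CLI before connecting to the master node. This is a sketch; the cluster ID shown is a placeholder for the ID returned by the create-cluster command:

# Poll the cluster state (for example, STARTING, WAITING, or TERMINATED)
aws emr describe-cluster --cluster-id j-2AXXXXXXXXXXX --query Cluster.Status.State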

To create a long-running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Create a long-running, single-node cluster.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Interactive Cluster" \
--num-instances=1 --master-instance-type=m1.large --hive-interactive

· Windows users:


ruby elastic-mapreduce --create --alive --name "Interactive Cluster" --num-instances=1 --master-instance-type=m1.large --hive-interactive

The output looks similar to:

Created jobflow JobFlowID

To connect to the master node of the cluster

· Follow these instructions to connect to the master node: Connect to the Master Node Using SSH (p. 481).

To copy source files to the master node

1. Put your source files in an Amazon S3 bucket. To learn how to create buckets and how to move data into Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.
2. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

mkdir SourceFiles

3. Copy your source files from Amazon S3 to the master node by typing a command similar to the following:

hadoop fs -get s3://mybucket/SourceFiles SourceFiles
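Before copying, you can confirm that the source files are visible from the cluster by listing the bucket path. The bucket and folder names below are the placeholders used in the example above:

hadoop fs -ls s3://mybucket/SourceFiles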

Build binaries with any necessary optimizations

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information and determine how to install your build environment.

To identify system specifications

· Use the following commands to verify the architecture you are using to build your binaries.

a. To view the version of Debian, enter the following command:

master$ cat /etc/issue

The output looks similar to the following.

Debian GNU/Linux 5.0

b. To view the public DNS name and processor size, enter the following command:


master$ uname -a

The output looks similar to the following.

Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux

c. To view the processor speed, enter the following command:

master$ cat /proc/cpuinfo

The output looks similar to the following.

processor       : 0
vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
flags           : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
...

Once your binaries are built, you can copy the files to Amazon S3.

To copy binaries from the master node to Amazon S3

· Type the following command to copy the binaries to your Amazon S3 bucket:

hadoop fs -put BinaryFiles s3://mybucket/BinaryDestination
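If you also have the AWS CLI installed on your local machine, you can confirm the upload from there. This is a sketch; the bucket path is the placeholder used in the command above:

aws s3 ls s3://mybucket/BinaryDestination --recursive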

To terminate the cluster using the AWS CLI

· Type the following command to terminate the cluster:

aws emr terminate-clusters --cluster-ids clusterId
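For example, the following sketch lists the active clusters to find the cluster ID and then terminates the cluster; the cluster ID shown is a placeholder:

aws emr list-clusters --active
aws emr terminate-clusters --cluster-ids j-2AXXXXXXXXXXX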

Important Terminating a cluster deletes all files and executables saved to the cluster. Remember to save all required files before terminating a cluster.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To terminate the cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --terminate JobFlowID

· Windows users:

ruby elastic-mapreduce --terminate JobFlowID

Important Terminating a cluster deletes all files and executables saved to the cluster. Remember to save all required files before terminating a cluster.

JAR Requirements

When you launch an Amazon Elastic MapReduce (Amazon EMR) cluster and add steps that specify a JAR, Amazon EMR starts Hadoop and then executes your JAR using the bin/hadoop jar command.

Your JAR should not exit until your step is complete, and it should then exit with a code of zero (0) to indicate successful completion. If the JAR does not exit with a success code, Amazon EMR assumes the step failed. Likewise, when your JAR completes, Amazon EMR assumes that all Hadoop activity related to that step is complete.
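One way to check this behavior before submitting the JAR as a step is to run it on the master node with the same command Amazon EMR uses and inspect the exit code. This is a sketch; the JAR name, class name, and arguments are placeholders:

hadoop jar my-application.jar com.example.MyMainClass arg1 arg2
echo $?    # 0 indicates success; any other value is treated as a failed step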

Run a Script in a Cluster

Amazon Elastic MapReduce (Amazon EMR) enables you to run a script at any time during step processing in your cluster. You can specify a step that runs a script when you create your cluster, or you can add a step while your cluster is in the WAITING state. For more information about adding steps, go to Submit Work to a Cluster (p. 517). For more information about running an interactive cluster, go to Using Hive Interactively or in Batch Mode (p. 259).

If you want to run a script before step processing begins, use a bootstrap action. For more information on bootstrap actions, go to Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

If you want to run a script immediately before cluster shutdown, use a shutdown action. For more information about shutdown actions, go to Shutdown Actions (p. 116).

Submitting a Custom JAR Step Using the AWS CLI or the Amazon EMR CLI

This section describes how to add a step to run a script. The script-runner.jar takes as arguments the path to a script and any additional arguments for the script. The JAR file runs the script with the passed arguments. The script-runner.jar is located at s3://elasticmapreduce/libs/script-runner/script-runner.jar.

The cluster containing a step that runs a script looks similar to the following examples.


To add a step to run a script using the AWS CLI

· Type the following command to add a step to run a script using the AWS CLI:

aws emr create-cluster --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name=string,ActionOnFailure=string,Jar=string,MainClass=string,Args=["arg1","arg2"] \
--no-auto-terminate

For example:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,MainClass=mymainclass,Args=["s3://mybucket/script-path/my_script.sh"] \
--no-auto-terminate

This cluster runs the script my_script.sh on the master node when the step is processed.
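If the cluster is already running (for example, in the WAITING state), you can add a similar script step with the add-steps subcommand instead of creating a new cluster. This is a sketch; the cluster ID is a placeholder:

aws emr add-steps --cluster-id j-2AXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]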

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add a step to run a script using the Amazon EMR CLI

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "My Development Jobflow" \
--jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
--args "s3://mybucket/script-path/my_script.sh"

· Windows users:

ruby elastic-mapreduce --create --alive --name "My Development Jobflow" --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "s3://mybucket/script-path/my_script.sh"

This cluster runs the script my_script.sh on the master node when the step is processed.

Process Data with Streaming

Hadoop Streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon Elastic MapReduce (Amazon EMR) API or command line just like a standard JAR file.


This section describes how to use Streaming with Amazon EMR.
Note
Apache Hadoop Streaming is an independent tool, so not all of its functions and parameters are described here. For a complete description of Hadoop Streaming, go to http://hadoop.apache.org/docs/r1.2.1/streaming.html.

Using the Hadoop Streaming Utility

This section describes how to use Hadoop's Streaming utility.

Hadoop Process

1 Write your mapper and reducer executables in the programming language of your choice. Follow the directions in Hadoop's documentation to write your streaming executables. The programs should read their input from standard input and output data through standard output. By default, each line of input/output represents a record and the first tab on each line is used as a separator between the key and value.

2 Test your executables locally and upload them to Amazon S3.

3 Use the Amazon EMR command line interface or Amazon EMR console to run your application.

Each mapper script launches as a separate process in the cluster. Each reducer executable turns the output of the mapper executable into the data output by the job flow.

The input, output, mapper, and reducer parameters are required by most Streaming applications. The following table describes these and other, optional parameters.

Parameter: -input (Required)
Location on Amazon S3 of the input data.
Type: String
Default: None
Constraint: URI. If no protocol is specified, the cluster's default file system is used.

Parameter: -output (Required)
Location on Amazon S3 where Amazon EMR uploads the processed data.
Type: String
Default: None
Constraint: URI
If a location is not specified, Amazon EMR uploads the data to the location specified by input.

Parameter: -mapper (Required)
Name of the mapper executable.
Type: String
Default: None

Parameter: -reducer (Required)
Name of the reducer executable.
Type: String
Default: None

Parameter: -cacheFile (Optional)
An Amazon S3 location containing files for Hadoop to copy into your local working directory (primarily to improve performance).
Type: String
Default: None
Constraints: [URI]#[symlink name to create in working directory]

Parameter: -cacheArchive (Optional)
JAR file to extract into the working directory.
Type: String
Default: None
Constraints: [URI]#[symlink directory name to create in working directory]

Parameter: -combiner (Optional)
Combines results.
Type: String
Default: None
Constraints: Java class name

The following code sample is a mapper executable written in Python. This script is part of the WordCount sample application. For a complete description of this sample application, see Get Started: Count Words with Amazon EMR (p. 14).

#!/usr/bin/python
import sys

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line.rstrip()
            words = line.split()
            for word in words:
                print "LongValueSum:" + word + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
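Because a streaming mapper simply reads lines from standard input and writes key/value lines to standard output, you can smoke-test a script like this locally with a shell pipe before uploading it to Amazon S3. The file name mapper.py below is a placeholder for wherever you saved the script:

# Feed a sample line through the mapper and sort the intermediate output,
# roughly what Hadoop does between the map and reduce phases
echo "to be or not to be" | python mapper.py | sort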

Launch a Cluster and Submit a Streaming Step

This section covers the basics of launching an Amazon Elastic MapReduce (Amazon EMR) cluster and submitting a Streaming step. You can create a cluster using the Amazon EMR console, the CLI, or the API. Before you create your cluster, you'll need to prepare your input data. For more information, see Prepare Input Data (Optional) (p. 125).

A Streaming application reads input from standard input and then runs a script or executable (called a mapper) against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable (called a reducer) processes the mapper results. The results from the reducer are sent to standard output. You can chain together a series of Streaming steps, where the output of one step becomes the input of another step.


The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash.

The following example is based on the Amazon EMR Word Count Example. This example shows how to use Hadoop Streaming to count the number of times each word occurs within a text file. In this example, the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input. The mapper is a Python script that counts the number of times a word occurs in each input string and is located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py. The reducer references a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values. The output is saved to an Amazon S3 bucket you created in Prepare an Output Location (Optional) (p. 137).

Create a Cluster and Submit a Streaming Step Using the Console

This example describes how to use the Amazon EMR console to create a cluster and run a Streaming step.

To create a cluster and run a Streaming step

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.


Field Action

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console.You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).


Field Action

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Field Action

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.

Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).


Field Action

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).
8. In the Steps section, choose Streaming Program from the list and click Configure and add.

In the Add Step dialog, specify the cluster parameters using the following table as a guide, and then click Add.


Field Action

Mapper* Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where the mapper executable, such as a Python program, resides. The path value must be in the form BucketName/path/MapperExecutable.

Reducer* Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where the reducer executable, such as a Python program, resides. The path value must be in the form BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For more information, go to the Aggregate library supplied by Hadoop.

Input S3 location* Specify the URI where the input data resides in Amazon S3. The value must be in the form BucketName/path.

Output S3 Specify the URI where you want the output stored in Amazon S3. The value location* must be in the form BucketName/path.

Arguments Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility. For example, you can specify additional files to load into the distributed cache.


Field Action

Action on Failure This determines what the cluster does in response to any errors. The possible values for this setting are:

· Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.
· Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.
· Continue: If the step fails, continue to the next step.

* Required parameter

9. Review your configuration and if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.

AWS CLI or Amazon EMR CLI

These examples demonstrate how to use the AWS CLI or the Amazon EMR CLI to create a cluster and submit a Streaming step.

To create a cluster and submit a Streaming step using the AWS CLI

· To create a cluster and submit a Streaming step using the AWS CLI type the following command:

aws emr create-cluster --ami-version string --no-auto-terminate --steps Type=string,Name=string,ActionOnFailure=string, \
Args=[--files,string,-mapper,string,-reducer,string,-input,string,-output,string] \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 3.1.1 --no-auto-terminate --steps Type=STREAMING,Name="Streaming Program",ActionOnFailure=CONTINUE, \
Args=[--files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/streaming/output/2014-05-26] \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2
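After the step completes, you can list the output files that the aggregate reducer wrote to your bucket. The output path below is the one used in the example above:

aws s3 ls s3://mybucket/streaming/output/2014-05-26/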

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a cluster and submit a Streaming step using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· In the directory where you installed the Amazon EMR CLI, type one of the following commands. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626). Note The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x when using the Amazon EMR CLI.

For Hadoop 2.x, type the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream --ami-version 3.0.3 \
--instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
--input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate \
--output s3://mybucket/output/2014-01-16

· Windows users:

ruby elastic-mapreduce --create --stream --ami-version 3.0.3 --instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

For Hadoop 1.x, type the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --stream \
--input s3://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--reducer aggregate \
--output s3://mybucket/output/2014-01-16

· Windows users:

ruby elastic-mapreduce --create --stream --input s3://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.


Process Data Using Cascading

Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications. A Cascading step is submitted as a custom JAR in the Amazon EMR console. For more information about Cascading, go to http://www.cascading.org.

Topics · Launch a Cluster and Submit a Cascading Step (p. 210) · Multitool Cascading Application (p. 219)

Launch a Cluster and Submit a Cascading Step

This section covers the basics of launching an Amazon Elastic MapReduce (Amazon EMR) cluster and submitting a Cascading step. You can create a cluster using the Amazon EMR console, the CLI, or the API. Before you create your cluster, you'll need to prepare your input data. For more information, see Prepare Input Data (Optional) (p. 125).

The following examples are based on the Amazon EMR sample: LogAnalyzer for Amazon CloudFront. LogAnalyzer is implemented using Cascading. This sample generates usage reports containing total traffic volume, object popularity, a breakdown of traffic by client IP address, and edge location. Reports are formatted as tab-delimited text files, and saved to the Amazon S3 bucket that you specify.

In this example, the Java JAR is located in an Amazon S3 bucket at elasticmapreduce/samples/cloudfront/logprocessor.jar. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/cloudfront/input. The output is saved to an Amazon S3 bucket you created as part of Prepare an Output Location (Optional) (p. 137).

Launch a Cluster and Submit a Cascading Step Using the Console

This example describes how to use the Amazon EMR console to create a cluster and submit a Cascading step as a custom JAR file.

To create a cluster and submit a Cascading step using the console

This procedure uses a bootstrap action to pre-install the Cascading Software Development Kit on Amazon EMR. The Cascading SDK includes Cascading and Cascading-based tools such as Multitool and Load. The bootstrap action extracts the SDK and adds the available tools to the default PATH. For more information, go to http://www.cascading.org/sdk/.

When you configure the bootstrap action in step 7 below, enter the following information:

a. Enter the following text in the S3 location field:
s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh
b. Click Add.

For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.


Field Action

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console.You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).


Field Action

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Field Action

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.

Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).


Field Action

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, in the Add bootstrap action field, select Custom Action and click Configure and add, and then specify the Cascading SDK install script described at the beginning of this procedure.
8. In the Steps section, in the Add step field, choose Custom JAR from the list and click Configure and add.

In the Add Step dialog, enter values in the boxes using the following table as a guide, and then click Add.


Field Action

JAR S3 location Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.

Arguments Enter a list of arguments (space-separated strings) to pass to the JAR file.

Action on Failure This determines what the cluster does in response to any errors. The possible values for this setting are:

· Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.
· Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.
· Continue: If the step fails, continue to the next step.

9. Review your configuration and if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.

Launching a Cluster and Submitting a Cascading Step Using the AWS CLI or the Amazon EMR CLI

This example describes how to use the AWS CLI or the Amazon EMR CLI to create a cluster and submit a Cascading step.


To create a cluster and submit a Cascading step using the AWS CLI

· Type the following command to launch your cluster and submit a Cascading step (including a bootstrap action for the Cascading SDK):

aws emr create-cluster --ami-version string --no-auto-terminate --name string \
--bootstrap-actions Path=string,Name=string,Args=[arg1,arg2] \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 2.4.8 --no-auto-terminate --name "Test Cascading" \
--bootstrap-actions Path=s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh,Name="CascadingBA",Args=["--JAR elasticmapreduce/samples/cloudfront/logprocessor.jar"] \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2

Note The bootstrap action pre-installs the Cascading Software Development Kit on Amazon EMR. The Cascading SDK includes Cascading and Cascading-based tools such as Multitool and Load. The bootstrap action extracts the SDK and adds the available tools to the default PATH. For more information, go to http://www.cascading.org/sdk/.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
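To watch the Cascading step run, you can poll the step list for the cluster. This is a sketch; the cluster ID shown is a placeholder for the ID returned by create-cluster:

aws emr list-steps --cluster-id j-2AXXXXXXXXXXX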

To create a cluster and submit a Cascading step using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --name "Test Cascading" \
--bootstrap-action s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh \
--JAR elasticmapreduce/samples/cloudfront/logprocessor.jar \
--args "-input,s3://elasticmapreduce/samples/cloudfront/input,-start,any,-end,2010-12-27-02 300,-output,s3://mybucket/cloudfront/output/2010-12-27-02,-overallVolumeReport,-objectPopularityReport,-clientIPReport,-edgeLocationReport"

· Windows users:


ruby elastic-mapreduce --create --name "Test Cascading" --bootstrap-action s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh --JAR elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3://elasticmapreduce/samples/cloudfront/input,-start,any,-end,2010-12-27-02 300,-output,s3://mybucket/cloudfront/output/2010-12-27-02,-overallVolumeReport,-objectPopularityReport,-clientIPReport,-edgeLocationReport"

Note The bootstrap action pre-installs the Cascading Software Development Kit on Amazon EMR. The Cascading SDK includes Cascading and Cascading-based tools such as Multitool and Load. The bootstrap action extracts the SDK and adds the available tools to the default PATH. For more information, go to http://www.cascading.org/sdk/.

By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

Multitool Cascading Application

Multitool is a Cascading application that provides a simple command line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system.

You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using the AWS CLI, the Amazon EMR CLI, or the Amazon EMR console. Amazon EMR supports all Multitool arguments.

The Multitool JAR file is at s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar. The Multitool source code, along with a number of other tools, is available for download from the project website at http://www.cascading.org/modules.html. For additional samples and tips for using Multitool, go to Cascading.Multitool - Tips on using the Multitool and Generate usage reports.

To create a cluster with the Cascading Multitool using the AWS CLI

· Create a cluster referencing the Cascading Multitool JAR file and supply the appropriate Multitool arguments by typing the following command:

aws emr create-cluster --ami-version string --no-auto-terminate --steps Type=string,Name=string,ActionOnFailure=string,Jar=string,Args=[arg1,arg2] \
--instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

For example:

aws emr create-cluster --ami-version 2.4.8 --no-auto-terminate --steps Type=CUSTOM_JAR,Name="CascadingMTJar",ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar,Args=[arg1,arg2] \
--instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=2


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a cluster with the Cascading Multitool using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Create a cluster referencing the Cascading Multitool JAR file and supply the appropriate Multitool arguments as follows.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create \
--jar s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar \
--args [args]

· Windows users:

ruby elastic-mapreduce --create --jar s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar --args [args]

To create a cluster with the Cascading Multitool using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console.You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference.

Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked.

This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).



Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked.

This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked.

This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


8. In the Steps section, in the Add step field, choose Custom JAR from the list and click Configure and add.

In the Add Step dialog, specify the cluster parameters:

a. In the JAR S3 location field, specify the path and file name for the Multitool JAR file, for example:

s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar

b. Specify any arguments for the cluster. All Multitool arguments are supported, including those listed in the following table; an example set of arguments follows the table.

Parameter Description

-input Location of input file.

-output Location of output files.

-start Use data created after the start time.

-end Use data created by the end time.
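For example, to read log data from one Amazon S3 location and write the Multitool output to another, you might enter something like the following in the Arguments field (the bucket and paths here are hypothetical placeholders, shown only to illustrate the -input and -output parameters from the table above):

-input s3://mybucket/logs/ -output s3://mybucket/multitool/output/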

9. Review your configuration and, if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.

Process Data with a Custom JAR

A custom JAR runs a compiled Java program that you uploaded to Amazon S3. Compile the program against the version of Hadoop you want to launch and submit Hadoop jobs using the Hadoop JobClient interface.
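As a rough sketch of that workflow (not part of the sample that follows), you might compile your program against the Hadoop library that matches the cluster and copy the resulting JAR to an Amazon S3 bucket you own. The class, JAR, and bucket names below are placeholders:

# Compile against the Hadoop version the cluster will run (Hadoop 1.0.3 here)
javac -classpath hadoop-core-1.0.3.jar -d classes src/org/myorg/MyJob.java
jar -cvf myjob.jar -C classes .
# Upload the JAR to Amazon S3 so the cluster can run it as a step
aws s3 cp myjob.jar s3://mybucket/myjob.jar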


For more information about building a Hadoop MapReduce application, go to http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.

Launch a Cluster and Submit a Custom JAR Step

This section covers the basics of creating a cluster and submitting a custom JAR step in Amazon Elastic MapReduce (Amazon EMR). You can launch a cluster using the Amazon EMR console, the CLI, or the API. Before you create your cluster, you'll need to prepare your input data. For more information, see Prepare Input Data (Optional) (p. 125).

Submitting a custom JAR step enables you to write a script to process your data using the Java programming language. The example that follows is based on the Amazon EMR sample: CloudBurst.

In this example, the JAR file is located in an Amazon S3 bucket at s3://elasticmapreduce/samples/cloudburst/cloudburst.jar. All of the data processing instructions are located in the JAR file and the script is referenced by the main class org.myorg.WordCount. The input data is located in the Amazon S3 bucket s3://elasticmapreduce/samples/cloudburst/input. The output is saved to an Amazon S3 bucket you create as part of Prepare an Output Location (Optional) (p. 137).

Launch a Cluster and Submit a Custom JAR Step Using the Console

This example describes how to use the Amazon EMR console to create a cluster and submit a custom JAR file step.

To create a cluster and submit a custom JAR step using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon.

This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).



EC2 Availability Zone Choose No preference.

Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked.

This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked.

This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).



Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked.

This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.

Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).



IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

8. In the Steps section, choose Custom JAR from the list and click Configure and add.

In the Add Step dialog, enter values in the boxes using the following table as a guide, and then click Add.


Field Action

JAR S3 location* Specify the URI where your script resides in Amazon S3. The value must be in the form s3://BucketName/path/ScriptName.

Arguments* Enter a list of arguments (space-separated strings) to pass to the JAR file.

Action on Failure This determines what the cluster does in response to any errors. The possible values for this setting are:

· Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.
· Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.
· Continue: If the step fails, continue to the next step.

* Required parameter

9. Review your configuration and, if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.


Launching a cluster and submitting a custom JAR step using the AWS CLI or the Amazon EMR CLI

To launch a cluster and submit a custom JAR step using the AWS CLI

To launch a cluster and submit a custom JAR step using the AWS CLI, type the create-cluster subcommand with the --steps parameter.

· To launch a cluster and submit a custom JAR step, type the following command:

aws emr create-cluster --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name="string",ActionOnFailure=string,Jar=string,Args=[arg1,arg2,arg3] \
--no-auto-terminate

For example:

aws emr create-cluster --ami-version 2.4.8 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=CUSTOM_JAR,Name="Custom JAR Step",ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar,Args=["s3://elasticmapreduce/samples/cloudburst/input/s_suis.br","s3://elasticmapreduce/samples/cloudburst/input/100k.br","s3://mybucket/cloudburst/output","36 3 0 1 240 48 24 24 128 16"] \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
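After the cluster is created, you can also check its progress from the AWS CLI. The following is a sketch; replace the cluster ID placeholder with the value returned by create-cluster:

# List the steps and their states for the cluster created above
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX
# Show the overall cluster status
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX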

To create a cluster and submit a custom JAR step using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --name "Test custom JAR" \ --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar \ --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br \ --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br \ --arg s3://mybucket/cloudburst/output \ --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \ --arg 24 --arg 128 --arg 16

· Windows users:


ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3://mybucket/cloudburst/output --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

Note
The individual --arg values above could also be represented as --args followed by a comma-separated list.

The output looks similar to the following.

Created cluster JobFlowID

By default, this command launches a single-node cluster on an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
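For example, the following variation of the previous command is a sketch that launches the same custom JAR step on a five-instance cluster of m1.large instances (the output bucket name is a placeholder):

./elastic-mapreduce --create --name "Test custom JAR" \
--num-instances 5 --instance-type m1.large \
--jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--args s3://elasticmapreduce/samples/cloudburst/input/s_suis.br,s3://elasticmapreduce/samples/cloudburst/input/100k.br,s3://mybucket/cloudburst/output,36,3,0,1,240,48,24,24,128,16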


Analyze Data with Hive

Hive is an open-source data warehouse and analytics package that runs on top of Hadoop. Hive scripts use an SQL-like language called Hive QL (query language) that abstracts the MapReduce programming model and supports typical data warehouse interactions. Hive enables you to avoid the complexities of writing MapReduce programs in a lower level computer language, such as Java.

Hive extends the SQL paradigm by including serialization formats and the ability to invoke mapper and reducer scripts. In contrast to SQL, which only supports primitive value types (such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

For more information about Hive, go to http://hive.apache.org/.

Amazon Elastic MapReduce (Amazon EMR) provides support for Apache Hive. Amazon EMR supports several versions of Hive, which you can install on any running cluster. Amazon EMR also allows you to run multiple versions concurrently, allowing you to control your Hive version upgrades. The following sections describe the Hive configurations using Amazon EMR.

Topics
· How Amazon EMR Hive Differs from Apache Hive (p. 236)
· Supported Hive Versions (p. 247)
· Using Hive Interactively or in Batch Mode (p. 259)
· Launch a Cluster and Submit Hive Work (p. 260)
· Create a Hive Metastore Outside the Cluster (p. 268)
· Use the Hive JDBC Driver (p. 271)

How Amazon EMR Hive Differs from Apache Hive

Topics
· Input Format (p. 237)
· Combine Splits Input Format (p. 237)
· Log files (p. 237)
· Thrift Service Ports (p. 238)


· Hive Authorization (p. 239)
· Hive File Merge Behavior with Amazon S3 (p. 239)
· Additional Features of Hive in Amazon EMR (p. 239)

This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.

Note
With Hive 0.13.1 on Amazon EMR, certain options introduced in previous versions of Hive on EMR have been removed in favor of greater parity with Apache Hive. For example, the -x option was removed.

Input Format

The Apache Hive default input format is text. The Amazon EMR default input format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You can specify the hive.base.inputformat option in Hive to select a different input format, for example:

hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;

To switch back to the default Amazon EMR input format, you would enter the following:

hive> set hive.base.inputformat=default;

Combine Splits Input Format

If you have many GZip files in your Hive cluster, you can optimize performance by passing multiple files to each mapper. This reduces the number of mappers needed in your cluster and can help your clusters complete faster. You do this by specifying that Hive use the HiveCombineSplitsInputFormat input format and setting the split size, in bytes. This is shown in the following example.

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;

Note
This functionality was deprecated with Hive 0.13.1. To get the same split input format functionality, you would use the following:

set hive.hadoop.supports.splittable.combineinputformat=true;

Log files

Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named hive.log. Amazon EMR saves Hive logs to /mnt/var/log/apps/. In order to support concurrent versions of Hive, the version of Hive you run determines the log file name, as shown in the following table.


Hive Version Log File Name

0.13.1 hive.log
Note
Amazon EMR now uses an unversioned hive.log. Minor versions of Hive will all share the same log location as the major version.

0.11.0 hive_0110.log Note Minor versions of Hive 0.11.0, such as 0.11.0.1, share the same log file location as Hive 0.11.0.

0.8.1 hive_081.log Note Minor versions of Hive 0.8.1, such as Hive 0.8.1.1, share the same log file location as Hive 0.8.1.

0.7.1 hive_07_1.log Note Minor versions of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1.

0.7 hive_07.log

0.5 hive_05.log

0.4 hive.log
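For example, on a cluster running Hive 0.11.0, you could connect to the master node and follow the Hive log with a command like the following (a sketch; the path and file name come from the table above):

# Follow the Hive 0.11.0 log on the master node
tail -f /mnt/var/log/apps/hive_0110.log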

Thrift Service Ports

Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on the following ports:

Hive Version Port Number

Hive 0.13.1 10000

Hive 0.11.0 10004

Hive 0.8.1 10003

Hive 0.7.1 10002

Hive 0.7 10001

Hive 0.5 10000

For more information about thrift services, go to http://wiki.apache.org/thrift/.
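If you want to reach the Thrift service from your local machine, one common approach is an SSH tunnel through the master node. The following is a sketch, assuming a Hive 0.13.1 cluster (port 10000 from the table above), your EC2 key pair file, and a placeholder master public DNS name:

# Forward local port 10000 to the Hive Thrift service on the master node
ssh -i mykeypair.pem -N -L 10000:localhost:10000 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com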


Hive Authorization

Amazon EMR does not support Hive Authorization. Amazon EMR clusters run with authorization disabled. You cannot use Hive authorization in your Amazon EMR cluster.

Hive File Merge Behavior with Amazon S3

Apache Hive merges small files at the end of a map-only job if hive.merge.mapfiles is true, and the merge is triggered only if the average output size of the job is less than the hive.merge.smallfiles.avgsize setting. Amazon EMR Hive has exactly the same behavior if the final output path is in HDFS; however, if the output path is in Amazon S3, the hive.merge.smallfiles.avgsize parameter is ignored. In that situation, the merge task is always triggered if hive.merge.mapfiles is set to true.
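For example, you could control this behavior from the Hive prompt by setting the merge options yourself; the average-size threshold below is an arbitrary illustrative value, not a recommended setting:

hive> set hive.merge.mapfiles=true;
hive> set hive.merge.smallfiles.avgsize=134217728;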

Additional Features of Hive in Amazon EMR

Amazon EMR extends Hive with new features that support Hive integration with other AWS services, such as the ability to read from and write to Amazon Simple Storage Service (Amazon S3) and DynamoDB. For information about which versions of Hive support these additional features, see Hive Patches (p. 245).

Topics
· Write Data Directly to Amazon S3 (p. 239)
· Use Hive to Access Resources in Amazon S3 (p. 240)
· Use Hive to Recover Partitions (p. 240)
· Variables in Hive (p. 240)
· Amazon EMR Hive queries to accommodate partial DynamoDB schemas (p. 243)
· Copy data between DynamoDB tables in different AWS regions (p. 244)
· Set DynamoDB throughput values per table (p. 244)
· Hive Patches (p. 245)

Write Data Directly to Amazon S3

The Hadoop Distributed File System (HDFS) and Amazon S3 are handled differently within Amazon EMR and Hive. The version of Hive installed with Amazon EMR is extended with the ability to write directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement but it means that HDFS and Amazon S3 behave differently within Hive.

A consequence of Hive writing directly to Amazon S3 is that you cannot read and write to the same table within the same Hive statement if that table is located in Amazon S3. The following example shows how to use multiple Hive statements to update a table in Amazon S3.

To update a table in Amazon S3 using Hive

1. From a Hive prompt or script, create a temporary table in the cluster's local HDFS filesystem.
2. Write the results of a Hive query to the temporary table.
3. Copy the contents of the temporary table to Amazon S3.

This is shown in the following example.

create temporary table tmp like my_s3_table ;
insert overwrite tmp select .... ;
insert overwrite my_s3_table select * from tmp ;


Use Hive to Access Resources in Amazon S3

The version of Hive installed in Amazon EMR enables you to reference resources, such as JAR files, located in Amazon S3.

add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar

You can also reference scripts located in Amazon S3 to execute custom map and reduce operations. This is shown in the following example.

from logs select transform (line) using 's3://mybucket/scripts/parse-logs.pl' as (time string, exception_type string, exception_details string)

The ability to initialize Hive from a file stored in Amazon S3 was introduced with Hive 0.8.1. Versions of Hive prior to 0.8.1 do not support initializing Hive from Amazon S3. For example, in the following Hive command, -i s3n://myawsbucket/hive_init.sql succeeds if run with Hive 0.8.1 or later, and fails if run with an earlier version of Hive.

hive -i s3n://myawsbucket/hive_init.sql -f s3n://myawsbucket/hive_example.sql

Use Hive to Recover Partitions

We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this.

create external table raw_impression (json string)
partitioned by (dt string)
location 's3://elastic-mapreduce/samples/hive-ads/tables/impressions' ;
alter table logs recover partitions ;

The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention: for example, dt=2009-01-01.

Note
After Hive 0.13.1 this capability is supported natively using msck repair table. For more information, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
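For example, with Hive 0.13.1 or later you could repair the partitions of the table used in the preceding example from the Hive prompt; this is a sketch of the native equivalent:

hive> msck repair table logs;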

Variables in Hive

You can include variables in your scripts by using the dollar sign and curly braces.


add jar ${LIB}/jsonserde.jar

You pass the values of these variables to Hive on the command line using the -d parameter, as in the following example:

-d LIB=s3://elasticmapreduce/samples/hive-ads/lib
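Putting the two together, a sketch of a complete command line that runs a script stored in Amazon S3 and supplies the variable (the script location is a placeholder) might look like this:

hive -d LIB=s3://elasticmapreduce/samples/hive-ads/lib \
     -f s3://mybucket/script.q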

You can also pass the values into steps that execute Hive scripts.

To pass variable values into Hive steps using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Steps section, for Add Step, choose Hive Program from the list and click Configure and add.
4. In the Add Step dialog, specify the parameters using the following table as a guide, and then click Add.

Field Action

Script S3 location* Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName. For example: s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q.

Input S3 location Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT. For example: s3://elasticmapreduce/samples/hive-ads/tables/.

Output S3 location Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT. For example: s3://mybucket/hive-ads/output/.

Arguments Optionally, enter a list of arguments (space-separated strings) to pass to Hive. If you defined a path variable in your Hive script named ${SAMPLE}, for example:

create external table logs (requestBeginTime string, requestEndTime string, hostname string)
partitioned by (dt string)
row format serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime, requestEndTime, hostname' )
location '${SAMPLE}/tables/impressions';

To pass a value for the variable, type the following in the Arguments window:

-d SAMPLE=s3://elasticmapreduce/samples/hive-ads/.



Action on Failure This determines what the cluster does in response to any errors. The possible values for this setting are:

· Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.
· Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.
· Continue: If the step fails, continue to the next step.

5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To pass variable values into Hive steps using the AWS CLI

· Type the following command to pass a variable into a Hive step using the AWS CLI:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name="string",ActionOnFailure=string,Args=[-f,pathtoscript,-d,INPUT=pathtoinputdata,-d,OUTPUT=pathtooutputdata,-d,variable=string] \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=Hive --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/,-d,SAMPLE=s3://elasticmapreduce/samples/hive-ads/] \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To pass variable values into Hive steps using the Amazon EMR CLI

To pass a Hive variable value into a step using the Amazon EMR CLI, type the --args parameter with the -d flag. In this example, you pass a value for the ${LIB} variable into the step.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).


· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --hive-script --arg s3://mybucket/script.q \ --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

· Windows users:

ruby elastic-mapreduce --hive-script --arg s3://mybucket/script.q --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

To pass variable values into Hive steps using the SDK

· The following example demonstrates how to pass variables into steps using the SDK. For more information, see Class StepFactory in the AWS SDK for Java API Reference.

StepFactory stepFactory = new StepFactory();

StepConfig runHive = new StepConfig()
    .withName("Run Hive Script")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newRunHiveScriptStep("s3://mybucket/script.q",
        Lists.newArrayList("-d", "LIB=s3://elasticmapreduce/samples/hive-ads/lib")));

Amazon EMR Hive queries to accommodate partial DynamoDB schemas

Amazon EMR Hive provides maximum flexibility when querying DynamoDB tables by allowing you to specify a subset of columns on which you can filter data, rather than requiring your query to include all columns. This partial schema query technique is effective when you have a sparse database schema and want to filter records based on a few columns, such as filtering on time stamps.

The following example shows how to use a Hive query to:

· Create a DynamoDB table.
· Select a subset of items (rows) in DynamoDB and further narrow the data to certain columns.
· Copy the resulting data to Amazon S3.

DROP TABLE dynamodb; DROP TABLE s3;

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, map fullColumn) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ( "dynamodb.table.name" = "myTable",


"dynamodb.throughput.read.percent" = ".1000", "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

CREATE EXTERNAL TABLE s3(map) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE s3 SELECT item fullColumn FROM dynamodb WHERE recordTimeStamp < "2012-01-01";

The following table shows the query syntax for selecting any combination of items from DynamoDB.

Query Example Result Description

SELECT * FROM table_name; Selects all items (rows) from a given table and includes data from all columns available for those items.

SELECT * FROM table_name WHERE field_name = value; Selects some items (rows) from a given table and includes data from all columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name; Selects all items (rows) from a given table and includes data from some columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name WHERE field_name = value; Selects some items (rows) from a given table and includes data from some columns available for those items.

Copy data between DynamoDB tables in different AWS regions

Amazon EMR Hive provides a dynamodb.region property you can set per DynamoDB table. When dynamodb.region is set differently on two tables, any data you copy between the tables automatically occurs between the specified regions.

The following example shows you how to create a DynamoDB table with a Hive script that sets the dynamodb.region property:

Note
Per-table region properties override the global Hive properties. For more information, see Hive Options (p. 414).

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, map fullColumn) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ( "dynamodb.table.name" = "myTable", "dynamodb.region" = "eu-west-1", "dynamodb.throughput.read.percent" = ".1000", "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Set DynamoDB throughput values per table

Amazon EMR Hive enables you to set the DynamoDB readThroughputPercent and writeThroughputPercent settings on a per table basis in the table definition. The following Amazon EMR Hive script shows how to set the throughput values. For more information about DynamoDB throughput values, see Specifying Read and Write Requirements for Tables.

Note
These per-table throughput properties will override the global Hive properties as mentioned in this section: Hive Options (p. 414).

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, map fullColumn) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ( "dynamodb.table.name" = "myTable", "dynamodb.throughput.read.percent" = ".4", "dynamodb.throughput.write.percent" = "1.0", "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Hive Patches

The Amazon EMR team has created the following patches for Hive.

Patch Description

Write to Amazon S3 Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a "move" operation. Removes redundant operations like moving data to and from the same location.

Status: Submitted Fixed in AWS Hive Version: 0.4 Fixed in Apache Hive Version: n/a (HIVE-2318)

Scripts in Amazon S3 Enables Hive to download the Hive scripts in Amazon S3 buckets and run them. Saves you the step of copying scripts to HDFS before running them.

Status: Committed Fixed in AWS Hive Version: 0.4 Fixed in Apache Hive Version: 0.7.0 (HIVE-1624)

Recover partitions Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.

Status: Not Submitted Fixed in AWS Hive Version: 0.4 Fixed in Apache Hive Version: n/a

Variables in Hive Create a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'.

Status: Committed Fixed in AWS Hive Version: 0.4 Fixed in Apache Hive Version: 0.8.0 (HIVE-2020)



Report progress while writing to Amazon S3 FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed.

Status: Not Submitted Fixed in AWS Hive Version: 0.5 Fixed in Apache Hive Version: n/a

Fix compression arguments Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files.

Status: Submitted Fixed in AWS Hive Version: 0.5 Fixed in Apache Hive Version: n/a (HIVE-2266)

Fix UDAFPercentile to tolerate null percentiles Fixes an issue where UDAFPercentile would throw a null pointer exception when passed a null percentile list.

Status: Committed Fixed in AWS Hive Version: 0.5 Fixed in Apache Hive Version: 0.8.0 (HIVE-2298)

Fix hashCode method in DoubleWritable class Fixes the hashCode() method of the DoubleWritable class of Hive and prevents the HashMap (of type DoubleWritable) from behaving as a LinkedList.

Status: Committed Fixed in AWS Hive Version: 0.7 Fixed in Apache Hive Version: 0.7.0 (HIVE-1629)

Recover partitions, version 2 Improved version of Recover Partitions that uses less memory.

Status: Not Submitted Fixed in AWS Hive Version: 0.7 Fixed in Apache Hive Version: n/a

HAVING clause Use the HAVING clause to directly filter on groups by expressions (instead of using nested queries). Integrates Hive with other data analysis tools that rely on the HAVING expression.

Status: Committed Fixed in AWS Hive Version: 0.7 Fixed in Apache Hive Version: 0.7.0 (HIVE-1790)

Improve Hive query performance Reduces startup time for queries spanning a large number of partitions.

Status: Committed Fixed in AWS Hive Version: 0.7.1 Fixed in Apache Hive Version: 0.8.0 (HIVE-2299)



Improve Hive query performance for Amazon S3 queries Reduces startup time for Amazon S3 queries. Set Hive.optimize.s3.query=true to enable optimization.

The optimization flag assumes that the partitions are stored in standard Hive partitioning format: "HIVE_TABLE_ROOT/partition1=value1/partition2=value2". This is the format used by Hive to create partitions when you do not specify a custom location.

The partitions in an optimized query should have the same prefix, with HIVE_TABLE_ROOT as the common prefix.

Status: Not Submitted Fixed in AWS Hive Version: 0.7.1 Fixed in Apache Hive Version: n/a

Skip comments in Hive scripts Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.

Status: Committed Fixed in AWS Hive Version: 0.7.1 Fixed in Apache Hive Version: 0.8.0 (HIVE-2259)

Limit Recover Partitions Improves performance recovering partitions from Amazon S3 when there are many partitions to recover.

Status: Not Submitted Fixed in AWS Hive Version: 0.8.1 Fixed in Apache Hive Version: n/a

Supported Hive Versions

The default configuration for Amazon EMR is the latest version of Hive running on the latest AMI version. The following versions of Hive are available:


Hive Version Compatible Hadoop Versions Hive Version Notes

0.13.1 2.4.0



Introduces the following features, improvements, and backwards incompatibilities. For more information, see Apache Hive 0.13.1 Release Notes and Apache Hive 0.13.0 Release Notes.

· Vectorized query: processes thousand-row blocks instead of processing by row.
· In-memory cache: hot data kept in-memory for quick reads.
· Faster plan serialization
· Support for DECIMAL and CHAR datatypes
· Sub-query for IN, NOT IN, EXISTS and NOT EXISTS (correlated and uncorrelated)
· JOIN conditions in the WHERE clause

Another feature contributed by Amazon EMR:

· Includes an optimization to Hive windowing functions that allows them to scale to large data sets.

Notable backward incompatibilities:

· Does not support the -el flag for pushing error logs to an Amazon S3 bucket if a query fails.
· Does not support RECOVER PARTITION syntax. Instead, use the native capability, MSCK REPAIR.
· Round(sum( c ),2) over w1 -> round(sum( c ) over w1,2) (in several places). This syntax was changed in Hive 0.12. See HIVE-4214.
· Default precision and scale were changed for DECIMAL. Compared to previous Hive versions, DECIMAL in Hive 13 is DECIMAL(10,0).
· The default SerDe for RCFile-backed tables is LazyBinaryColumnarSerDe in Apache Hive 0.12 and above. This means tables that were created with Hive versions 0.12 or greater will not be able to read data files generated with Hive 0.11 correctly unless hive.default.rcfile.serde is set to org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe. See HIVE-4475.

Other notes and known issues:

· When a Hive database is created with a custom location, the CREATE TABLE AS SELECT (CTAS) operation ignores it. It takes the location from the parameter hive.metastore.warehouse.dir instead of the database's properties. See HIVE-3486.
· When a user loads data into a table using OVERWRITE with a different file, the existing data is not overwritten. See HIVE-6209.
· Since Amazon EMR uses HiveServer2, the username must be hadoop with no password.



The following Hive 0.14.0 patches were backported to this release:

· HIVE-6367
· HIVE-6394
· HIVE-6783
· HIVE-6785
· HIVE-6938

0.11.0.2 1.0.3, 2.2.0 Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

· Adds the Parquet library.
· Fixes a problem related to the Avro serializer/deserializer accepting a schema URL in Amazon S3.
· Fixes a problem with Hive returning incorrect results with indexing turned on.
· Change Hive's log level from DEBUG to INFO.
· Fixes a problem when tasks do not report progress while deleting files in Amazon S3 dynamic partitions.
· This Hive version fixes the following issues: HIVE-4618, HIVE-4689, HIVE-4757, HIVE-4781, HIVE-4932, HIVE-4935, HIVE-4990, HIVE-5056, HIVE-5149, HIVE-5418.

0.11.0.1 1.0.3, 2.2.0
· Creates symlink /home/hadoop/hive/lib/hive_contrib.jar for backward compatibility.
· Fixes a problem that prevents installation of Hive 0.11.0 with IAM roles.



0.11.0 1.0.3, 2.2.0 Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

· Simplifies hive.metastore.uris and the hive.metastore.local configuration settings. (HIVE-2585)
· Changes the internal representation of binary type to byte[]. (HIVE-3246)
· Allows HiveStorageHandler.configureTableJobProperties() to signal to its handler whether the configuration is input or output. (HIVE-2773)
· Add environment context to metastore Thrift calls. (HIVE-3252)
· Adds a new, optimized row columnar file format. (HIVE-3874)
· Implements TRUNCATE. (HIVE-446)
· Adds LEAD/LAG/FIRST/LAST analytical windowing functions. (HIVE-896)
· Adds DECIMAL data type. (HIVE-2693)
· Supports Hive list bucketing/DML. (HIVE-3073)
· Supports custom separator for file output. (HIVE-3682)
· Supports ALTER VIEW AS SELECT. (HIVE-3834)
· Adds method to retrieve uncompressed/compressed sizes of columns from RC files. (HIVE-3897)
· Allows updating bucketing/sorting metadata of a partition through the CLI. (HIVE-3903)
· Allows PARTITION BY/ORDER BY in OVER clause and partition function. (HIVE-4048)
· Improves GROUP BY syntax. (HIVE-581)
· Adds more query plan optimization rules. (HIVE-948)
· Allows CREATE TABLE LIKE command to accept TBLPROPERTIES. (HIVE-3527)
· Fixes sort-merge join with sub-queries. (HIVE-3633)
· Supports altering partition column type. (HIVE-3672)
· De-emphasizes mapjoin hint. (HIVE-3784)
· Changes object inspectors to initialize based on partition metadata. (HIVE-3833)
· Adds merge map-job followed by map-reduce job. (HIVE-3952)
· Optimizes hive.enforce.bucketing and hive.enforce.sorting insert. (HIVE-4240)



0.8.1.8 1.0.3 · Adds support for copying data between DynamoDB tables in different regions. For more information, see Copy data between DynamoDB tables in different AWS regions (p. 244). · Adds support in Amazon EMR Hive for queries that can specify a subset of columns on which to filter data, rather than requiring queries to include all columns. For more information, see Amazon EMR Hive queries to accommodate partial DynamoDB schemas (p. 243). · Adds the ability to set the DynamoDB readThroughputPercent and writeThroughputPercent values per table at creation time. For more information, see Set DynamoDB throughput values per table (p. 244). · Fixes an issue where a query on an Amazon S3 input path gets stuck if there are a large number of paths to list before the input path.



0.8.1.7 1.0.3
· Fixes ColumnPruner so that it works on LateralView. (HIVE-3226)
· Fixes utc_from_timestamp and utc_to_timestamp to return correct results. (HIVE-2803)
· Fixes a NullPointerException error on a join query with authorization enabled. (HIVE-3225)
· Improves mapjoin filtering in the ON condition. (HIVE-2101)
· Preserves the filter on an OUTER JOIN condition while merging the join tree. (HIVE-3070)
· Fixes ConcurrentModificationException on a lateral view used with explode. (HIVE-2540)
· Fixes an issue where an insert into a table overwrites the existing table, if the table name contains an uppercase character. (HIVE-3062)
· Fixes an issue where jobs fail when there are multiple aggregates in a query. (HIVE-3732)
· Fixes a NullPointerException error in nested user-defined aggregation functions (UDAFs). (HIVE-1399)
· Provides an error message when using a user-defined aggregation function (UDAF) in the place of a user-defined function (UDF). (HIVE-2956)
· Fixes an issue where Timestamp values without a nano-second part break the following columns in a row. (HIVE-3090)
· Fixes an issue where the move task is not picking up changes to hive.exec.max.dynamic.partitions set in the Hive CLI. (HIVE-2918)
· Adds the ability to atomically add and drop partitions from the metastore. (HIVE-2777)
· Adds partition pruning pushdown to the database for non-string partitions. (HIVE-2702)
· Adds support for merging small files in Amazon S3 at the end of a map-only job using the hive.merge.mapfiles parameter. If the output path is in Amazon S3, the hive.merge.smallfiles.avgsize setting is ignored. For more information, see Hive File Merge Behavior with Amazon S3 (p. 239) and Hive Configuration Variables.
· Improves clean-up of junk files after an INSERT OVERWRITE command.

0.8.1.6 1.0.3 · Adds support for IAM roles. For more information, see Configure IAM Roles for Amazon EMR (p. 150)



0.8.1.5 1.0.3 · Adds support for the new DynamoDB binary data type. · Adds the patch Hive-2955, which fixes an issue where queries consisting only of metadata always return an empty value. · Adds the patch Hive-1376, which fixes an issue where Hive would crash on an empty result set generated by "where false" clause queries. · Fixes the RCFile interaction with Amazon Simple Storage Service (Amazon S3). · Replaces JetS3t with the AWS SDK for Java. · Uses BatchWriteItem for puts to DynamoDB. · Adds schemaless mapping of DynamoDB tables into a Hive table using a Hive map column.

0.8.1.4 1.0.3 Updates the HBase client on Hive clusters to version 0.92.0 to match the version of HBase used on HBase clusters. This fixes issues that occurred when connecting to an HBase cluster from a Hive cluster.

0.8.1.3 1.0.3 Adds support for Hadoop 1.0.3.

0.8.1.2 1.0.3, 0.20.205 Fixes an issue with duplicate data in large clusters.

0.8.1.1 1.0.3, 0.20.205 Adds support for MapR and HBase.

0.8.1 1.0.3, 0.20.205 Introduces new features and improvements. The most significant of these are as follows. For more information about the changes in Hive 0.8.1, go to Apache Hive 0.8.1 Release Notes.

· Support Binary DataType (HIVE-2380) · Support Timestamp DataType (HIVE-2272) · Provide a Plugin Developer Kit (HIVE-2244) · Support INSERT INTO append semantics (HIVE-306) · Support Per-Partition SerDe (HIVE-2484) · Support Import/Export facilities (HIVE-1918) · Support Bitmap Indexes (HIVE-1803) · Support RCFile Block Merge (HIVE-1950) · Incorporate Group By Optimization (HIVE-1694) · Enable HiveServer to accept -hiveconf option (HIVE-2139) · Support --auxpath option (HIVE-2355) · Add a new builtins subproject (HIVE-2523) · Insert overwrite table db.tname fails if partition already exists (HIVE-2617) · Add a new input format that passes multiple GZip files to each mapper, so fewer mappers are needed. (HIVE-2089) · Incorporate JDBC Driver improvements (HIVE-559, HIVE-1631, HIVE-2000, HIVE-2054, HIVE-2144, HIVE-2153, HIVE-2358, HIVE-2369, HIVE-2456)



0.7.1.4 0.20.205 Prevents the "SET" command in Hive from changing the current database of the current session.

0.7.1.3 0.20.205 Adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options (p. 414).

0.7.1.2 0.20.205 Modifies the way files are named in Amazon S3 for dynamic partitions. It prepends file names in Amazon S3 for dynamic partitions with a unique identifier. Using Hive 0.7.1.2 you can run queries in parallel with set hive.exec.parallel=true. It also fixes an issue with filter pushdown when accessing DynamoDB with sparse data sets.

0.7.1.1 0.20.205 Introduces support for accessing DynamoDB, as detailed in Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR (p. 399). It is a minor version of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database, and log location of 0.7.1 on the cluster.

0.7.1 0.20.205, 0.20, 0.18 Improves Hive query performance for a large number of partitions and for Amazon S3 queries. Changes Hive to skip commented lines.

0.7 0.20, 0.18 Improves Recover Partitions to use less memory, fixes the hashCode method, and introduces the ability to use the HAVING clause to filter on groups by expressions.

0.5 0.20, 0.18 Fixes issues with FileSinkOperator and modifies UDAFPercentile to tolerate null percentiles.

0.4 0.20, 0.18 Introduces the ability to write to Amazon S3, run Hive scripts from Amazon S3, and recover partitions from table data stored in Amazon S3. Also creates a separate namespace for managing Hive variables.

For more information about the changes in a version of Hive, see Supported Hive Versions (p. 247). For information about Hive patches and functionality developed by the Amazon EMR team, see Additional Features of Hive in Amazon EMR (p. 239).

With the Amazon EMR CLI, you can specify a specific version of Hive to install using the --hive-versions option, or you can choose to install the latest version.

The AWS CLI does not support installing specific Hive versions. When using the AWS CLI, the latest version of Hive included on the AMI is installed by default.
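For example, the following AWS CLI command is a minimal sketch that mirrors the create-cluster examples later in this section (the AMI version, instance types, and counts are placeholders); it installs whichever Hive version the chosen AMI includes:

aws emr create-cluster --applications Name=Hive --ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate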

To specify the Hive version when creating the cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· Use the --hive-versions option. The --hive-versions option must come after any reference to the options --hive-interactive, --hive-script, or --hive-site.

The following command-line example creates an interactive Hive cluster running Hadoop 0.20 and Hive 0.7.1. In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Hive" \ --num-instances 5 --instance-type m1.large \ --hive-interactive \ --hive-versions 0.7.1

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1

To specify the latest Hive version when creating the cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --hive-versions option with the latest keyword. The following command-line example creates an interactive Hive cluster running the latest version of Hive.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Hive" \ --num-instances 5 --instance-type m1.large \ --hive-interactive \ --hive-versions latest

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions latest

To specify the Hive version using the Amazon EMR CLI for a cluster that is interactive and uses a Hive script

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· If you have a cluster that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example illustrates setting both the interactive and the script version of Hive to use 0.7.1.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ \ --name "Testing m1.large AMI 1" \ --ami-version latest \ --instance-type m1.large --num-instances 5 \ --hive-interactive --hive-versions 0.7.1.2 \ --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2

· Windows users:

ruby elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ --name "Testing m1.large AMI" --ami-version latest --instance-type m1.large --num-instances 5 --hive-interactive --hive-versions 0.7.1.2 --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2

To load multiple versions of Hive for a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --hive-versions option and separate the version numbers by comma. The following command-line example creates an interactive cluster running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the cluster.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Hive" \ --num-instances 5 --instance-type m1.large \ --hive-interactive \ --hive-versions 0.5,0.7.1

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.5,0.7.1

To call a specific version of Hive using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
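For example, the following is a minimal sketch run on the master node, assuming both versions were loaded at cluster creation; the script name is hypothetical:

hive-0.7.1 -f my-script.q
hive-0.5 -e "SHOW TABLES;"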

Note If you have multiple versions of Hive loaded on a cluster, calling hive will access the default version of Hive or the version loaded last if there are multiple --hive-versions options specified in the cluster creation call. When the comma-separated syntax is used with --hive-versions to load multiple versions, hive will access the default version of Hive.
Note When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

Display the Hive Version

You can view the version of Hive installed on your cluster using the console or the Amazon EMR CLI. In the console, the Hive version is displayed on the Cluster Details page. In the Configuration Details column, the Applications field displays the Hive version.

Using the Amazon EMR CLI, type the --print-hive-version command to display the version of Hive currently in use for a given cluster. This is a useful command to call after you have upgraded to a new version of Hive to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running. The syntax for this is as follows, where JobFlowID is the identifier of the cluster.

To display the Hive version using the CLI, in the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --print-hive-version

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --print-hive-version

Share Data Between Hive Versions

You can take advantage of Hive bug fixes and performance improvements on your existing Hive clusters by upgrading your version of Hive. Different versions of Hive, however, have different schemas. To share data between two versions of Hive, you can create an external table in each version of Hive with the same LOCATION parameter.
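For example, a minimal sketch of an external table definition you could run from the Hive CLI on each cluster; the table name, columns, and Amazon S3 path are hypothetical:

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS shared_data ( key INT, value STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://mybucket/shared-data/';"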

To share data between Hive versions

1 Start a cluster with the new version of Hive. This procedure assumes that you already have a cluster with the old version of Hive running.


2 Configure the two clusters to allow communication: On the cluster with the old version of Hive, configure the insert overwrite directory to point to an HDFS location on the cluster with the new version of Hive.

3 Export and reimport the data.

Using Hive Interactively or in Batch Mode

Amazon EMR enables you to run Hive scripts or queries in two modes:

· Interactive · Batch

When you launch a long-running cluster using the console or the AWS CLI, you can ssh into the master node as the Hadoop user and use the Hive Command Line Interface to develop and run your Hive scripts interactively. Using Hive interactively enables you to revise the Hive script more easily than batch mode. After you successfully revise the Hive script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run the script in production.
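For example, a minimal sketch, assuming your key pair file is mykey.pem and master-public-dns-name is the public DNS name of your cluster's master node:

ssh -i mykey.pem hadoop@master-public-dns-name
hive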

To use Hive interactively in the Amazon EMR CLI, use the --hive-interactive parameter when you launch the cluster.

In batch mode, you upload your Hive script to Amazon S3, and then submit the work to the cluster as a step. Hive steps can be submitted to a long-running cluster or a transient cluster. For more information on submitting work to a cluster, see Submit Work to a Cluster (p. 517). An example of launching a long-running cluster and submitting a Hive step can be found here: Launch a Cluster and Submit Hive Work (p. 260).
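For example, the following is a minimal sketch that submits a Hive step to a running cluster with the AWS CLI; the cluster ID, script path, and bucket names are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=Hive,Name="Hive batch step",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/my-script.q,-d,INPUT=s3://mybucket/input,-d,OUTPUT=s3://mybucket/output]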

To launch a cluster in interactive mode using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Hive cluster" \ --num-instances 5 --instance-type m1.large \ --hive-interactive

· Windows users:

ruby elastic-mapreduce --create --alive --name "Hive cluster" --num-in stances 5 --instance-type m1.large --hive-interactive


Launch a Cluster and Submit Hive Work

This section demonstrates creating an Amazon Elastic MapReduce (Amazon EMR) cluster with Hive installed and submitting Hive work. You can install Hive on a cluster using the Amazon EMR console, the CLI, or the API. Before you create a cluster, you need to prepare your input data. For more information on preparing input data, see Prepare Input Data (Optional) (p. 125).

For advanced information on Hive configuration options, see Analyze Data with Hive (p. 236).

A cluster using Hive enables you to create a data analysis application using a SQL-like language. The example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances. This sample describes how to correlate customer click data to specific advertisements.

In this example, the Hive script is located in an Amazon S3 bucket at s3://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing instructions are located in the Hive script. The script requires additional libraries that are located in an Amazon S3 bucket at s3://elasticmapreduce/samples/hive-ads/libs. The input data is located in the Amazon S3 bucket s3://elasticmapreduce/samples/hive-ads/tables. The output is saved to an Amazon S3 bucket you create as part of Prepare an Output Location (Optional) (p. 137).

Launch a Cluster and Submit Hive Work Using the Amazon EMR Console

This example describes how to use the Amazon EMR console to create a cluster with Hive installed and how to submit a Hive step to the cluster.

To launch a cluster with Hive installed and to submit a Hive step
Note Hive and Pig are selected for installation by default. You may remove either application by clicking the Remove icon (X).

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Field Action

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


8. In the Steps section, choose Hive Program from the list and click Configure and add.

In the Add Step dialog, specify the Hive program parameters using the following table as a guide, and then click Add.

Field Action

Script S3 location* Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.

Input S3 location Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT.

Output S3 location Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT.

Arguments Optionally, enter a list of arguments (space-separated strings) to pass to Hive.

Action on Failure This determines what the cluster does in response to any errors. The possible values for this setting are:

· Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate. · Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate. · Continue: If the step fails, continue to the next step.


* Required parameter
9. Review your configuration and if you are satisfied with the settings, click Create Cluster.
10. When the cluster starts, the console displays the Cluster Details page.

Launch a Cluster and Submit Hive Work Using the AWS CLI or the Amazon EMR CLI

These examples describe how to use the AWS CLI or the Amazon EMR CLI to create a cluster with Hive installed and how to submit Hive work to the cluster.

To launch a cluster with Hive installed and to submit a Hive step using the AWS CLI

· You can install Hive on a cluster using the AWS CLI by typing the create-cluster subcommand with the --applications parameter. When using the --applications parameter, you identify the application you want to install via the Name argument. To submit a Hive step when the cluster is launched, type the --steps parameter, indicate the step type using the Type argument, and provide the necessary argument string:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name="string",ActionOnFailure=string,Args=[-f,pathtoscript,-d,INPUT=pathtoinputdata,-d,OUTPUT=pathtooutputdata,-d,variable=string] \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=Hive --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch a cluster with Hive installed and to submit Hive work using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).


· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --name "Test Hive" --ami-version 3.1 --hive- script \ s3://elasticmapreduce/samples/hive-ads/libs/model-build.q \ --args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,\ -d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,\ -d,OUTPUT=s3://mybucket/hive-ads/output/

· Windows users:

ruby elastic-mapreduce --create --name "Test Hive" --ami-version 3.1 - -hive-script s3://elasticmapreduce/samples/hive-ads/libs/model-build.q - -args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,-d,IN PUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybuck et/hive-ads/output/

The output looks similar to the following.

Created cluster JobFlowID

By default, this command launches a two-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

Create a Hive Metastore Outside the Cluster

Hive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.

When a cluster terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.

If you have multiple clusters that share common data and update the metastore, you should locate the shared metastore on persistent storage.

To share the metastore between clusters, override the default location of the MySQL database to an external persistent storage location. Note Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.


To create a metastore located outside of the cluster

1. Create a MySQL database.

Amazon Relational Database Service (Amazon RDS) provides a cloud-based MySQL database. Instructions on how to create an Amazon RDS database are at http://aws.amazon.com/rds/.
2. Modify your security groups to allow JDBC connections between your MySQL database and the ElasticMapReduce-Master security group.

Instructions on how to modify your security groups for access are at http://aws.amazon.com/rds/faqs/#31.
3. Set the JDBC configuration values in hive-site.xml:

a. Create a hive-site.xml configuration file containing the following information:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>Password to use against metastore database</description>
  </property>
</configuration>

hostname is the DNS address of the Amazon RDS instance running MySQL. username and password are the credentials for your MySQL database.

The MySQL JDBC drivers are installed by Amazon EMR.
Note The <value> property should not contain any spaces or carriage returns. It should appear all on one line.
b. Save your hive-site.xml file to a location in Amazon S3, such as s3://mybucket/conf/hive-site.xml.

4. Create a cluster and specify the Amazon S3 location of the new Hive configuration file.

For example, to specify the location of the configuration file using the AWS CLI, type the following command:


aws emr create-cluster --applications Name=Hive,Args=[--hive-site=s3://mybucket/conf/hive-site.xml] --ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

For example, to specify the location of the configuration file using the Amazon EMR CLI:

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive \ --name "Hive cluster" \ --hive-interactive \ --hive-site=s3://mybucket/conf/hive-site.xml

· Windows users:

ruby elastic-mapreduce --create --alive --name "Hive cluster" --hive-inter active --hive-site=s3://mybucket/conf/hive-site.xml

The --hive-site parameter installs the configuration values in hive-site.xml in the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.
5. Connect to the master node of your cluster.

Instructions on how to connect to the master node are available at Connect to the Master Node Using SSH (p. 481).
6. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name ( key int, value int ) LOCATION 's3://mybucket/hdfs/';

7. Add your Hive script to the running cluster.

Your Hive cluster runs using the metastore located in Amazon RDS. Launch all additional Hive clusters that share this metastore by specifying the metastore location.


Use the Hive JDBC Driver

The Hive JDBC driver provides a mechanism to connect to Hive and query data. Hive JDBC driver support also allows you to integrate Hive with business intelligence tools for reporting and analysis. Connecting to Hive via JDBC requires you to download the JDBC driver and install a SQL client. The following example demonstrates using SQL Workbench/J to connect to Hive using JDBC.

To download JDBC drivers

Download and extract the drivers appropriate to the versions of Hive that you want to access. The Hive version differs depending on the AMI that you choose when you create an Amazon EMR cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

· Hive 0.13.1 JDBC drivers: http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/HiveJDBC.zip · Hive 0.11.0 JDBC drivers: http://aws.amazon.com/developertools/1982901737448217 · Hive 0.8.1 JDBC drivers: http://aws.amazon.com/developertools/4897392426085727

To install and configure SQL Workbench

1. Download the SQL Workbench/J client for your operating system from http://www.sql-workbench.net/downloads.html.
2. Go to the Installing and starting SQL Workbench/J page and follow the instructions for installing SQL Workbench/J on your system.
3. · Linux, Unix, Mac OS X users: In a terminal session, create an SSH tunnel to the master node of your cluster using one of the following commands. Replace master-public-dns-name with the public DNS name of the master node and path-to-key-file with the location and file name of your Amazon EC2 private key (.pem) file. For more information about retrieving the master public DNS name, see Retrieve the Public DNS Name of the Master Node (p. 482).

Hive version | Command

0.13.1 ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name

0.11.0 ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10004:localhost:10004 hadoop@master-public-dns-name

0.8.1 ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10003:localhost:10003 hadoop@master-public-dns-name

· Windows users: In a PuTTY session, create an SSH tunnel to the master node of your cluster (using local port forwarding) with the following settings. Replace master-public-dns-name with the public DNS name of the master node. For more information about creating an SSH tunnel to the master node, see Option 1: Set Up an SSH Tunnel to the Master Node Using Local Port Forwarding (p. 489).

Hive version | Tunnel settings

0.13.1 Source port: 10000 Destination: master-public-dns-name:10000

0.11.0 Source port: 10004 Destination: master-public-dns-name:10004

0.8.1 Source port: 10003 Destination: master-public-dns-name:10003


4. Add the JDBC driver to SQL Workbench/J.

a. In the Select Connection Profile dialog box, click Manage Drivers.
b. Click the Create a new entry (blank page) icon.
c. In the Name field, type Hive JDBC.
d. For Library, click the Select the JAR file(s) icon.
e. Browse to the location containing the extracted drivers, select the following JAR files and click Open.

Hive driver version | JAR files to add

0.13.1 hive_metastore.jar hive_service.jar HiveJDBC3.jar libfb303-0.9.0.jar libthrift-0.9.0.jar log4j-1.2.14.jar ql.jar slf4j-api-1.5.8.jar slf4j-log4j12-1.5.8.jar TCLIServiceClient.jar

0.11.0 hadoop-core-1.0.3.jar hive-exec-0.11.0.jar hive-jdbc-0.11.0.jar hive-metastore-0.11.0.jar hive-service-0.11.0.jar libfb303-0.9.0.jar commons-logging-1.0.4.jar slf4j-api-1.6.1.jar

0.8.1 hadoop-core-0.20.205.jar hive-exec-0.8.1.jar hive-jdbc-0.8.1.jar hive-metastore-0.8.1.jar hive-service-0.8.1.jar libfb303-0.7.0.jar libthrift-0.7.0.jar log4j-1.2.15.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar

f. In the Please select one driver dialog box, select one of the following and click OK.


Hive version | Driver classname

0.13.1 com.amazon.hive.jdbc3.HS2Driver

0.11.0 org.apache.hadoop.hive.jdbc.HiveDriver

0.8.1 org.apache.hadoop.hive.jdbc.HiveDriver

5. When you return to the Manage Drivers dialog box, verify that the Classname field is populated and click OK.
6. When you return to the Select Connection Profile dialog box, verify that the Driver field is set to Hive JDBC, enter one of the following in the URL field, and then click OK.

Hive version | URL

0.13.1 jdbc:hive2://localhost:10000/default

0.11.0 jdbc:hive://localhost:10004/default

0.8.1 jdbc:hive://localhost:10003/default

After the connection is complete, connection details appear at the top of the SQL Workbench/J window.

For more information about using Hive and the JDBC interface, go to http://wiki.apache.org/hadoop/Hive/HiveClient and http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface.


Analyze Data with Impala

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS). With this architecture, you can query your data in HDFS or HBase tables very quickly, and leverage Hadoop's ability to process diverse data types and provide schema at runtime. This lends Impala to interactive, low-latency analytics. In addition, Impala uses the Hive metastore to hold information about the input data, including the partition names and data types. Note Impala on Amazon EMR requires AMIs running Hadoop 2.x or greater.

Impala on Amazon EMR supports the following:

· Large subset of SQL and HiveQL commands · Querying data in HDFS and HBase · Use of ODBC and JDBC drivers · Concurrent client requests for each Impala daemon · Kerberos authentication · Partitioned tables · Appending and inserting data into tables using the INSERT statement · Multiple HDFS file formats and compression codecs. For more information, see Impala-supported File and Compression Formats (p. 290).

For more information about Impala, go to http://en.wikipedia.org/wiki/Cloudera_Impala.

What Can I Do With Impala?

Similar to using Hive with Amazon EMR, you can leverage Impala with Amazon EMR to implement sophisticated data-processing applications with SQL syntax. However, Impala is built to perform faster in certain use cases (see below). With Amazon EMR, you can use Impala as a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence. Here are three use cases:

· Use Impala instead of Hive on long-running clusters to perform ad hoc queries. Impala reduces interactive queries to seconds, making it an excellent tool for fast investigation. You could run Impala


on the same cluster as your batch MapReduce workflows, use Impala on a long-running analytics cluster with Hive and Pig, or create a cluster specifically tuned for Impala queries. · Use Impala instead of Hive for batch ETL jobs on transient Amazon EMR clusters. Impala is faster than Hive for many queries, which provides better performance for these workloads. Like Hive, Impala uses SQL, so queries can easily be modified from Hive to Impala. · Use Impala in conjunction with a third-party business intelligence tool. Connect a client ODBC or JDBC driver with your cluster to use Impala as an engine for powerful visualization tools and dashboards.

Both batch and interactive Impala clusters can be created in Amazon EMR. For instance, you can have a long-running Amazon EMR cluster running Impala for ad hoc, interactive querying or use transient Impala clusters for quick ETL workflows.

Differences from Traditional Relational Databases

Traditional relational database systems provide transaction semantics and database atomicity, consistency, isolation, and durability (ACID) properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly and provide for fast update of small amounts of data and for enforcement of referential integrity constraints. Typically, they run on a single large machine and do not provide support for acting over complex user defined data types.

Impala uses a similar distributed query system to that found in RDBMSs, but queries data stored in HDFS and uses the Hive metastore to hold information about the input data. As with Hive, the schema for a query is provided at runtime, allowing for easier schema changes. Also, Impala can query a variety of complex data types and execute user defined functions. However, because Impala processes data in-memory, it is important to understand the hardware limitations of your cluster and optimize your queries for the best performance.

Differences from Hive

Impala executes SQL queries using a massively parallel processing (MPP) engine, while Hive executes SQL queries using MapReduce. Impala avoids Hive's overhead from creating MapReduce jobs, giving it faster query times than Hive. However, Impala uses significant memory resources and the cluster's available memory places a constraint on how much memory any query can consume. Hive is not limited in the same way, and can successfully process larger data sets with the same hardware.

Generally, you should use Impala for fast, interactive queries, while Hive is better for ETL workloads on large datasets. Impala is built for speed and is great for ad hoc investigation, but requires a significant amount of memory to execute expensive queries or process very large datasets. Because of these limitations, Hive is recommended for workloads where speed is not as crucial as completion. Note With Impala, you may experience performance gains over Hive, even when using standard instance types. For more information, see Impala Performance Testing and Query Optimization (p. 292).

Tutorial: Launching and Querying Impala Clusters on Amazon EMR

This tutorial demonstrates how you can perform interactive queries with Impala on Amazon EMR. The instructions in this tutorial include how to:

· Sign up for Amazon EMR · Launch a long-running cluster with Impala installed · Connect to the cluster using SSH · Generate a test data set · Create Impala tables and populate them with data · Perform interactive queries on Impala tables

Amazon EMR provides several tools you can use to launch and manage clusters: the console, a CLI, an API, and several SDKs. For more information about these tools, see What Tools are Available for Amazon EMR? (p. 10).

Sign up for the Service

If you don't already have an AWS account, you'll need to get one. Your AWS account gives you access to all services, but you are charged only for the resources that you use. For this example walk-through, the charges will be minimal.

To sign up for AWS

1. Go to http://aws.amazon.com and click Sign Up Now.
2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use.

Launch the Cluster

The next step is to launch the cluster. This tutorial provides the steps to launch a long-running cluster using both the Amazon EMR console and the CLI. Choose the method that best meets your needs. When you launch the cluster, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These EC2 instances are preloaded with an Amazon Machine Image (AMI) that has been customized for Amazon EMR and which has Hadoop and other big data applications preloaded.

To add Impala to a cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose 3.0.2 (Hadoop 2.2.0).

This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. Impala requires an Amazon EMR AMI that has Hadoop 2.x or newer. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. Under the Additional Applications list, choose Impala and click Configure and add.
Note If you do not see Impala in the list, ensure that you have chosen AMI 3.0.1 (Hadoop 2.2.0) or newer in the AMI version list mentioned in the previous steps.
6. In the Add Application section, use the following table for guidance on making your selections.

Field Action

Version Choose the Impala version from the list, such as 1.2.1 - latest.

Arguments Optionally, specify command line arguments for Impala to execute. For examples of Impala command line arguments, see the --impala-conf section at Command Line Interface Options (p. 632).

7. Click Add.
8. In the Hardware Configuration section, verify the fields according to the following table.


Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose Launch into EC2-Classic.

Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific Amazon EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

Master Choose m1.large.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance types to use as master nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).


Field Action

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core Choose m1.large.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Choose m1.small.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

9. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

10. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


11. In the Steps section, you do not need to change any of these settings.
12. Review your configuration and if you are satisfied with the settings, click Create Cluster.
13. When the cluster starts, the console displays the Cluster Details page.

To add Impala to a cluster using the AWS CLI

· Type the following command to add Impala to a long-running cluster using the create-cluster subcommand with the --applications parameter:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=Hive Name=Pig Name=Impala --ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add Impala to a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --key-pair keypair-name

· Windows users:

ruby elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --key-pair keypair-name


Generate Test Data

To generate the test data on the master node

1. Connect to the master node of the cluster using SSH and run the commands shown in the following steps. Your client operating system determines which steps to use to connect to the cluster. For more information, see Connect to the Cluster (p. 481).
2. In the SSH window, from the home directory, create and navigate to the directory that will contain the test data using the following commands:

mkdir test
cd test

3. Download the JAR containing a program that automatically creates the test data using the following command:

wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar

4. Launch the program to create the test data using the following command. In this example, the command-line parameters specify an output path of /tmp/dbgen, and the size for the books, customers, and transactions tables to be 1 GB each.

java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /tmp/dbgen -b 1 -c 1 -t 1

5. Create a new folder in the cluster's HDFS file system and copy the test data from the master node's local file system to HDFS using the following commands:

hadoop fs -mkdir /data/
hadoop fs -put /tmp/dbgen/* /data/
hadoop fs -ls -h -R /data/

Create and Populate Impala Tables

In this section, you create the Impala tables and fill them with test data.

To create and populate the Impala tables with the test data

1. In the SSH window, launch the Impala shell prompt using the following command:

impala-shell

2. Create and populate the books table with the test data by running the following command at the Impala shell prompt:

create EXTERNAL TABLE books( id BIGINT, isbn STRING, category STRING, publish_date TIMESTAMP, publisher STRING, price FLOAT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/books/';

3. Create and populate the customers table with the test data by running the following command at the Impala shell prompt:

create EXTERNAL TABLE customers( id BIGINT, name STRING, date_of_birth TIMESTAMP, gender STRING, state STRING, email STRING, phone STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';

4. Create and populate the transactions table with the test data by running the following command at the Impala shell prompt:

create EXTERNAL TABLE transactions( id BIGINT, customer_id BIGINT, book_id BIGINT, quantity INT, transaction_date TIMESTAMP )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/transactions/';

Query Data in Impala

In this section, you perform queries on the data that you loaded in the Impala tables in the previous steps.

To perform various queries on the test data in the Impala tables

1. Perform a table scan through the entire customers table by running the following query at the Impala shell prompt:

SELECT COUNT(*) FROM customers WHERE name = 'Harrison SMITH';

2. Perform an aggregation query that scans a single table, groups the rows, and calculates the size of each group by running the following query at the Impala shell prompt:

SELECT category, count(*) cnt FROM books GROUP BY category ORDER BY cnt DESC LIMIT 10;

3. Perform a query that joins the books table with the transactions table to determine the top ten book categories that have the maximum total revenue during a certain period of time by running the following query at the Impala shell prompt:

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

4. Perform a memory-intensive query that joins three tables by running the following query at the Impala shell prompt:

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
  )
  JOIN [SHUFFLE] customers ON (
    transactions.customer_id = customers.id
    AND customers.state IN ('WA', 'CA', 'NY')
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

Important
Now that you've completed the tutorial, you should terminate the cluster to ensure that your account does not accrue additional charges. For more information, see Terminate a Cluster (p. 499).

Impala Examples Included on the Amazon EMR AMI

There are example data sets and queries included with the Impala installation on the Amazon EMR AMI.

Topics · TPCDS (p. 285) · Wikipedia (p. 286)

TPCDS

The TPCDS example is derived from Cloudera's Impala demo virtual machine.

To run the TPCDS example

1. On the master node of the cluster, navigate to the examples directory and run the following scripts:

cd ~/impala/examples/tpcds/
./tpcds-setup.sh
./tpcds-samplequery.sh


The tpcds-setup.sh script loads data into HDFS and creates Hive tables. The tpcds-samplequery.sh script uses the following query to demonstrate how to use Impala to query data:

select i_item_id, s_state, avg(ss_quantity) agg1, avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where cd_gender = 'M' and cd_marital_status = 'S' and cd_education_status = 'College'
  and d_year = 2002 and s_state in ('TN', 'SD', 'SD', 'SD', 'SD', 'SD')
group by i_item_id, s_state
order by i_item_id, s_state
limit 100;

2. Impala can create and manage Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for scanning particular columns within a table. For more information, go to http://parquet.io/. After running the query, test the Parquet format by running the following script:

./tpcds-samplequery-parquet.sh

Wikipedia

The Wikipedia example uses the data and sample queries from the Shark example in GitHub. For more information, go to https://github.com/amplab/shark/wiki/Running-Shark-on-EC2.

To run the Wikipedia example

· On the master node of the cluster, navigate to the examples directory and run the following script:

cd ~/impala/examples/wikipedia/ ./wikipedia.sh

Alternatively, you can use this script instead:

./wikipedia-with-s3distcp.sh


The wikipedia.sh and wikipedia-with-s3distcp.sh scripts copy 42 GB of data from Amazon S3 to HDFS, create Hive tables, and use Impala to select data from the Hive tables. The difference between wikipedia.sh and wikipedia-with-s3distcp.sh is that wikipedia.sh uses Hadoop distcp to copy data from Amazon S3 to HDFS, but wikipedia-with-s3distcp.sh uses Amazon EMR S3DistCp for the same purpose.

The wikipedia-with-s3distcp.sh script contains the following code:

#!/bin/bash

. /home/hadoop/impala/conf/impala.conf

# Copy wikipedia data from s3 to hdfs
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 \
  --src s3://spark-data/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/spark-data/ \
  --outputCodec 'none'

# Create hive tables
hive -e "CREATE EXTERNAL TABLE wiki_small (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/spark-data/wikipedia-sample/'"
hive -e "CREATE EXTERNAL TABLE wiki_full (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/spark-data/wikipedia-2010-09-12/'"
hive -e "show tables"

# Check client hostname
client="127.0.0.1"
echo "Checking client list..."
nodelist=`curl -s http://$HADOOP_NAMENODE_HOST:9026/ws/v1/cluster/hostStatus`
echo "Found client list: $nodelist"
arr=$(echo $nodelist | tr "\"" "\n")
for a in $arr
do
  if [[ $a == ip-* || $a == domU-* || $a =~ ^[0-9] ]]; then
    client=$a
  fi
done
echo "Choose client $client"

# Show tables
impala-shell -r -i $client:21000 --query="show tables"

# Query wiki_small table
impala-shell -r -i $client:21000 --query="SELECT COUNT(1) FROM wiki_small WHERE TEXT LIKE '%Berkeley%'"
impala-shell -r -i $client:21000 --query="SELECT title FROM wiki_small WHERE TEXT LIKE '%Berkeley%'"

# Query wiki_full table
impala-shell -r -i $client:21000 --query="SELECT COUNT(1) FROM wiki_full WHERE TEXT LIKE '%Berkeley%'"
impala-shell -r -i $client:21000 --query="SELECT title FROM wiki_full WHERE TEXT LIKE '%Berkeley%'"


Supported Impala Versions

The Impala version you can run depends on the version of the Amazon EMR AMI and the version of Hadoop that you are using. The table below shows the AMI versions that are compatible with different versions of Impala. We recommend using the latest available version of Impala to take advantage of performance enhancements and new functionality. For more information about the Amazon EMR AMIs and AMI versioning, see Choose an Amazon Machine Image (AMI) (p. 53). The Amazon EMR console does not support Impala versioning and always launches the latest version of Impala.

Impala Version AMI Version Impala Version Details

1.2.4 3.1.0 and later Adds support for Impala 1.2.4.

1.2.1 3.0.2 | 3.0.3 | 3.0.4 Amazon EMR introduces support for Impala with this version.

Topics · Updates for Impala 1.2.4 (p. 288)

Updates for Impala 1.2.4

The following updates are relevant for Impala 1.2.4:

· Performance improvements on metadata loading and synchronization on Impala startup. You can run queries before the loading is finished (the query will wait until the metadata for that table is loaded, if necessary).
· INVALIDATE METADATA was modified to accept an argument, table_name, to load metadata for a specific table created by Hive (a brief sketch follows this list). Conversely, if the table has been dropped by Hive, Impala will know that the table is gone.
· You can set the parallelism of catalogd's metadata loading using --load_catalog_in_background or --num_metadata_loading_threads.
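
As a minimal sketch of the per-table form of that command, you can run it from the master node with impala-shell; the books table name is reused from the tutorial above and is only illustrative:

impala-shell --query="INVALIDATE METADATA books;"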

Note The following features were added with Impala 1.2.3 and 1.2.2 and are available with this release.

Update for Impala 1.2.3

· Notable bug fix: compatibility with Parquet files generated outside of Impala.

Updates for Impala 1.2.2

· Changes to join order. Impala will automatically optimize a join query to minimize disk I/O and network traffic, instead of making the user order the join in a specific fashion.
· Use the STRAIGHT_JOIN operator to bypass Impala's automatic optimization of table order in a join query.
· Find table and column information with COMPUTE STATS.
· Use the CROSS JOIN operator in a SELECT statement for queries needing Cartesian products.
· New clauses for ALTER TABLE.
· LDAP authentication for JDBC and ODBC drivers.
· Numeric and conditional functions can return SMALLINT, FLOAT, and other smaller numerical types.


For more information, see: http://en.wikipedia.org/wiki/Cloudera_Impala.

Impala Memory Considerations

Impala's memory requirements are determined by the type of query. There are no simple rules to determine the correlation between the maximum data size that a cluster can process and the aggregated memory size. The compression type, partitions, and the actual query (number of joins, result size, and so on) all play a role in the memory required. For example, your cluster may have only 60 GB of memory, but you can perform a single table scan and process tables of 128 GB and larger. In contrast, when performing join operations, Impala may quickly use up the memory even though the aggregated table size is smaller than what's available. Therefore, to make full use of the available resources, it is extremely important to optimize your queries. You can optimize an Impala query for performance and to minimize resource consumption, and you can use the EXPLAIN statement to estimate the memory and other resources needed for your query. In addition, for the best experience with Impala, we recommend using memory-optimized instances for your cluster. For more information, see Impala Performance Testing and Query Optimization (p. 292).
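
As a sketch of that approach, you can preview a query plan with EXPLAIN from the command line before submitting an expensive query; the query below reuses the books table from the tutorial and is only illustrative:

impala-shell --query="EXPLAIN SELECT category, COUNT(*) FROM books GROUP BY category;"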

You can run multiple queries at one time on an Impala cluster on Amazon EMR. However, because each query is completed in memory, ensure that you have the proper resources on your cluster to handle the number of concurrent queries you anticipate. In addition, you can set up a multi-tenant cluster with both Impala and MapReduce installed. You should be sure to allocate resources (memory, disk, and CPU) to each application using YARN on Hadoop 2.x. The resources allocated should depend on the needs of the jobs you plan to run on each application. For more information, go to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

If you run out of memory, queries fail and the Impala daemon installed on the affected node shuts down. Amazon EMR then restarts the daemon on that node so that Impala will be ready to run another query. Your data in HDFS on the node remains available, because only the daemon running on the node shuts down, rather than the entire node itself. For ad hoc analysis with Impala, the query time can often be measured in seconds; therefore, if a query fails, you can discover the problem quickly and submit a new query in quick succession.

Using Impala with JDBC

While you can use ODBC drivers, Impala is also a great engine for third-party tools connected through JDBC. You can download and install the Impala client JDBC driver from http://elasticmapreduce.s3.amazonaws.com/libs/impala/1.2.1/impala-jdbc-1.2.1.zip.

From the client computer where you have your business intelligence tool installed, connect the JDBC driver to the master node of an Impala cluster using SSH or a VPN on port 21050. For more information, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).
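
For example, the following is a minimal sketch of forwarding the Impala JDBC port over SSH, following the same pattern as the Web UI tunnels shown in the next section; replace PERM_FILE and master-node-public-dns-name with your own values:

$ ssh -i PERM_FILE -nTxNf -L 127.0.0.1:21050:master-node-public-dns-name:21050 \
hadoop@master-node-public-dns-name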

Accessing Impala Web User Interfaces

Impala 1.2.x and newer versions provide Web user interfaces for the statestore, impalad, and catalog daemons. These Web UIs are accessible through the following URLs and ports, respectively:

http://master-node-public-dns-name:25000
http://master-node-public-dns-name:25010
http://master-node-public-dns-name:25020


You can set up an SSH tunnel to the master node in Amazon EC2 to view the Web UIs on your local machine using the following example commands:

$ ssh -i PERM_FILE -nTxNf -L 127.0.0.1:25000:master-node-public-dns-name:25000 \
hadoop@master-node-public-dns-name
$ ssh -i PERM_FILE -nTxNf -L 127.0.0.1:25010:master-node-public-dns-name:25010 \
hadoop@master-node-public-dns-name
$ ssh -i PERM_FILE -nTxNf -L 127.0.0.1:25020:master-node-public-dns-name:25020 \
hadoop@master-node-public-dns-name

For more information, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491) and Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).

Impala-supported File and Compression Formats

Choosing the correct file type and compression is key for optimizing the performance of your Impala cluster. With Impala, you can query the following data types:

· Parquet
· Avro
· RCFile
· SequenceFile
· Unstructured text

In addition, Impala supports the following compression types:

· Snappy
· GZIP
· LZO (for text files only)
· Deflate (except Parquet and text)
· BZIP2 (except Parquet and text)

Depending on the file type and compression, you may need to use Hive to load data or create a table.
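
For example, the following minimal sketch uses Hive to define an external table over compressed text files at a hypothetical HDFS location, and then refreshes Impala's metadata so the table can be queried; the path and table name are placeholders:

hive -e "CREATE EXTERNAL TABLE logs_gz (line STRING) LOCATION '/data/logs-gz/';"
impala-shell --query="INVALIDATE METADATA;"
impala-shell --query="SELECT COUNT(*) FROM logs_gz;"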

Impala SQL Dialect

Impala supports a subset of the standard SQL syntax, similar to HiveQL.

Statements

Impala supports the following SQL statements:

· ALTER TABLE
· ALTER VIEW
· [BROADCAST]
· CREATE DATABASE
· CREATE EXTERNAL TABLE


· CREATE FUNCTION
· CREATE TABLE
· CREATE VIEW
· DESCRIBE
· DROP DATABASE
· DROP FUNCTION
· DROP TABLE
· DROP VIEW
· EXPLAIN
· INSERT
· INVALIDATE METADATA
· JOIN
· LOAD DATA
· NULL
· REFRESH
· SELECT
· SHOW
· [SHUFFLE]
· USE

Functions

Impala supports the following SQL functions:

· AVG
· COUNT
· MAX
· MIN
· NDV
· SUM

Data Types

Impala supports the following SQL data types:

· BIGINT
· BOOLEAN
· DOUBLE
· FLOAT
· INT
· SMALLINT
· STRING
· TIMESTAMP
· TINYINT


Operators

Impala supports the following SQL operators:

· =, !=, <, <=, >, >=, <>
· /*, */ (Comments)
· BETWEEN
· DISTINCT
· IS NULL
· IS NOT NULL
· LIKE
· REGEXP
· RLIKE

Clauses

Impala supports the following SQL clauses:

· GROUP BY
· HAVING
· LIMIT
· OFFSET
· ORDER BY
· UNION
· VALUES
· WITH

Impala User-Defined Functions

Impala supports user-defined functions (UDFs) written in Java or C++. In addition, you can modify UDFs or user-defined aggregate functions created for Hive for use with Impala. For more information about Hive UDFs, go to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.
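
For example, the following is a minimal sketch of registering an existing Hive Java UDF with Impala from the command line; the JAR path, class name, and function signature are placeholders for your own UDF:

impala-shell --query="CREATE FUNCTION my_lower(STRING) RETURNS STRING LOCATION '/user/hadoop/udfs/my-hive-udfs.jar' SYMBOL='com.example.hive.udf.MyLower';"
impala-shell --query="SELECT my_lower(name) FROM customers LIMIT 10;"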

Impala Performance Testing and Query Optimization

When using Impala, it is important to understand how your cluster's memory resources limit the query types and dataset sizes you can process with them. Inspired by TPCDS and Berkeley's Big Data Benchmark, we implemented a workload generator which generates table files of specified sizes in text file format. We designed a spectrum of relational queries to test Impala's performance on whole table scans, aggregations, and joins across different numbers of tables. We executed these queries against different input classes on clusters of different instance types. The performance data is compared against that of Hive to help assess Impala's strengths and limitations. Also, the methods used in these tests are the basis for the Launching and Querying Impala Clusters on Amazon EMR tutorial. For more information, see Tutorial: Launching and Querying Impala Clusters on Amazon EMR (p. 276).


Database Schema

The input data set consists of three tables as shown with the following table creation statements in Impala SQL dialect.

CREATE EXTERNAL TABLE books(
  id BIGINT,
  isbn STRING,
  category STRING,
  publish_date TIMESTAMP,
  publisher STRING,
  price FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/books/';

CREATE EXTERNAL TABLE customers(
  id BIGINT,
  name STRING,
  date_of_birth TIMESTAMP,
  gender STRING,
  state STRING,
  email STRING,
  phone STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/customers/';

CREATE EXTERNAL TABLE transactions(
  id BIGINT,
  customer_id BIGINT,
  book_id BIGINT,
  quantity INT,
  transaction_date TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/transactions/';

Sample Data

All tables are populated with randomly generated data that resembles real-world values. You can generate data using the same method outlined in the Generate Test Data section of the tutorial. For more information, see Generate Test Data (p. 283).

$ head books/books
0|1-45812-668-3|EDUCATION|1986-06-14|Shinchosha|50.99
1|9-69091-140-1|BODY-MIND-SPIRIT|1983-07-29|Lefebvre-Sarrut|91.99
2|3-73425-809-9|TRANSPORTATION|1996-07-08|Mondadori|54.99
3|8-23483-356-2|FAMILY-RELATIONSHIPS|2002-08-20|Lefebvre-Sarrut|172.99
4|3-58984-308-3|POETRY|1974-06-13|EKSMO|155.99
5|2-34120-729-8|TRAVEL|2004-06-30|Cengage|190.99
6|0-38870-277-1|TRAVEL|2013-05-26|Education and Media Group |73.99
7|8-74275-772-8|LAW|2012-05-01|Holtzbrinck|112.99
8|4-41109-927-4|LITERARY-CRITICISM|1986-04-06|OLMA Media Group|82.99
9|8-45276-479-4|TRAVEL|1998-07-04|Lefebvre-Sarrut|180.99


$ head customers/customers
0|Bailey RUIZ|1947-12-19|M|CT|[email protected]|114-925-4866
1|Taylor BUTLER|1938-07-30|M|IL|[email protected]|517-158-1597
2|Henry BROOKS|1956-12-27|M|IN|[email protected]|221-653-3887
3|Kaitlyn WRIGHT|1988-11-20|F|NE|[email protected]|645-726-8901
4|Miles LOPEZ|1956-03-15|F|ND|[email protected]|222-770-7004
5|Mackenzie PETERSON|1970-09-05|F|NJ|[email protected]|114-521-5716
6|Maria SPENCER|2002-12-20|F|TX|[email protected]|377-612-4105
7|Sienna HENDERSON|1982-11-04|F|MO|[email protected]|199-288-5062
8|Samantha WALLACE|1998-03-06|F|TN|[email protected]|711-348-7410
9|Nevaeh PETERSON|1991-06-26|F|AL|[email protected]|651-686-3436

$ head transactions/transactions
0|360677155|84060207|4|2010-03-24 10:24:22
1|228662770|136084430|5|2009-07-03 14:53:09
2|355529188|26348618|9|2009-09-13 11:53:26
3|1729168|20837134|5|2006-01-05 19:31:19
4|196166644|99142444|19|2007-01-02 15:07:38
5|43026573|479157832|17|2010-04-14 16:42:29
6|306402023|356688712|12|2010-05-24 22:15:54
7|359871959|312932516|31|2000-04-03 11:06:38
8|379787207|265709742|45|2013-09-09 06:01:06
9|144155611|137684093|11|2010-06-06 17:07:07

Table Size

The following table shows the row count for each table (in millions of rows). The GB value indicates the size of the text file of each table. Within an input class, the books, customers, and transactions tables always have the same size.

Input Class (size of each table) | Books table (Million Rows) | Customers table (Million Rows) | Transactions table (Million Rows)

4GB | 63 | 53 | 87

8GB | 125 | 106 | 171

16GB | 249 | 210 | 334

32GB | 497 | 419 | 659

64GB | 991 | 837 | 1304

128GB | 1967 | 1664 | 2538

256GB | 3919 | 3316 | 5000

Queries

We used four different query types in our performance testing:

Q1: Scan Query


SELECT COUNT(*) FROM customers WHERE name = 'Harrison SMITH';

This query performs a table scan through the entire table. With this query, we mainly test:

· Impala's read throughput compared to that of Hive.
· With a given aggregated memory size, is there a limit on input size when performing a table scan, and if yes, what is the maximum input size that Impala can handle?

Q2: Aggregation Query

SELECT category, count(*) cnt FROM books GROUP BY category ORDER BY cnt DESC LIMIT 10;

The aggregation query scans a single table, groups the rows, and calculates the size of each group.

Q3: Join Query between Two Tables

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

This query joins the books table with the transactions table to find the top 10 book categories with the maximum total revenue during a certain period of time. In this experiment, we test Impala's performance on a join operation and compare these results with Hive.

Q4: Join Query between Three Tables

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN [SHUFFLE] transactions ON (
    transactions.book_id = books.id
  )
  JOIN [SHUFFLE] customers ON (
    transactions.customer_id = customers.id
    AND customers.state IN ('WA', 'CA', 'NY')
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;


This fourth query joins three tables, and is the most memory-intensive query in our performance testing.

Performance Test Results

The first set of experimental results were obtained on a cluster of four m1.xlarge instances with Amazon EMR Hadoop 2.2.0 and Impala 1.1.1 installed. According to the hardware specification shown below, an aggregated memory of 60 GB is available on the cluster.

Instance Type | Processor Architecture | vCPUs | ECU | Memory (GiB) | Instance Storage (GB)
m1.xlarge | 64-bit | 4 | 8 | 15 | 4 x 420

We compared the query performance of Impala and Hive in terms of query execution time and show the results in the charts below. In these charts, the y axis shows the average execution time measured using the time command from four trials. Missing data indicates that Impala failed due to an out-of-memory issue, and we did not test Hive against these failed Impala queries.
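
As a sketch of how such measurements can be taken, you can wrap a query in the time command on the master node; the query shown is Q1 from the Queries section:

$ time impala-shell --query="SELECT COUNT(*) FROM customers WHERE name = 'Harrison SMITH';"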

From these figures, we observed that at smaller scales (in this experiment, 16 GB and lower), Impala is much faster than Hive due to the absence of the MapReduce framework overhead. Nonetheless, when the input data set is large enough such that the framework overhead is negligible compared to overall query time, Impala is only about 3 to 10 times faster.


The second experimental environment was a cluster of 4 m2.4xlarge instances with an AMI with Hadoop 2.2.0 and Impala 1.1.1 installed. The aggregated memory on this cluster is 274 GB. Detailed hardware specifications and experimental results are shown below. Like our first set of tests, missing data indicates Impala failures caused by out-of-memory issues, and we declined to run Hive tests for these queries.

Instance Type | Processor Architecture | vCPUs | ECU | Memory (GiB) | Instance Storage (GB)
m2.4xlarge | 64-bit | 8 | 26 | 68.4 | 2 x 840


Optimizing Queries

Impala's memory requirement is determined by query type. There are no simple and generic rules to determine the correlation between the maximum data size that a cluster can process and its aggregated memory size.

Impala does not load entire tables into memory, so the amount of available memory doesn't limit the table size that it can handle. Impala builds hash tables in memory, such as the right-hand side table of a join or the result set of an aggregation. In addition, Impala uses memory as I/O buffers, where the number of processor cores on the cluster and the speed of the scanners determine the amount of buffering that is necessary in order to keep all cores busy. For example, a simple SELECT count(*) FROM table statement only uses I/O buffer memory.

For example, our m1.xlarge cluster in part 1 of our experiment only had 60 GB of memory, but when we performed a single table scan, we were able to process tables of 128 GB and above. Because Impala didn't need to cache the entire result set of the query, it streamed the result set back to the client. In contrast, when performing a join operation, Impala may quickly use up a cluster's memory even if the aggregated table size is smaller than the aggregated amount of memory. To make full use of the available resources, it is extremely important to optimize your queries. In this section, we take Q3 as an example to illustrate some of the optimization techniques you can try when an out-of-memory error happens.

Shown below is the typical error message you receive when the impalad process on a particular data node crashes due to a memory issue. To confirm the out-of-memory issue, you can simply log on to the data nodes and use the top command to monitor the memory usage (%MEM). Note that even for the same query, the out-of-memory error may not always happen on the same node. Also, no action is needed to recover from an out-of-memory error, because impalad is restarted automatically.

Backend 6:Couldn't open transport for ip-10-139-0-87.ec2.internal:22000(connect() failed: Connection refused)
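
As a sketch of the monitoring approach described above, after logging on to the affected node you could check the memory use of the impalad process with a single batch iteration of top:

$ top -b -n 1 | grep impalad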


Simple query optimization techniques can reduce the amount of memory your queries use, allowing you to sidestep an out-of-memory error. For example, the first version of Q3 (pre-optimization) is shown below, where the transactions table is on the left side of the JOIN while the books table is on the right:

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM transactions
  JOIN books ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

This query only worked for the 4 GB input class and failed for 8 GB and above due to the out-of-memory error. To understand why, you must consider how Impala executes queries. In preparation for the join, Impala builds a hash table from the books table that contains only the columns category, price, and id. Nothing of the transactions table is cached in memory. However, because Impala broadcasts the right-side table in this example, the books table is replicated to all the nodes that require the books table for joining. In versions of Impala newer than 1.2.1, Impala makes a cost-based decision between broadcast and partitioned join based on table statistics. We simply swapped these two tables in the JOIN statement to get the second version of Q3 shown below:

SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
  FROM books
  JOIN transactions ON (
    transactions.book_id = books.id
    AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

The second version of Q3 is more efficient, because only a part of the transactions table is broadcast instead of the entire books table. Nonetheless, when we scaled up to the 32 GB input class, even the second version of Q3 started to fail due to memory constraints. To further optimize the query, we added a hint to force Impala to use the "partitioned join," which creates the final version of Q3 as shown above in the Queries section. With all the optimizations, we eventually managed to execute Q3 successfully for input classes up to 64 GB, giving us a query that is 16x more memory-efficient than the first version. There are many other ways to optimize queries for Impala, and we see these methods as a good way to get the best performance from your hardware and avoid out-of-memory errors.


Process Data with Pig

Amazon Elastic MapReduce (Amazon EMR) supports Apache Pig, a programming framework you can use to analyze and transform large data sets. For more information about Pig, go to http://pig.apache.org/. Amazon EMR supports several versions of Pig. The following sections describe how to configure Pig on Amazon EMR.

Pig is an open-source Apache library that runs on top of Hadoop. The library takes SQL-like commands written in a language called Pig Latin and converts those commands into MapReduce jobs. You do not have to write complex MapReduce code using a lower-level computer language, such as Java.

You can execute Pig commands interactively or in batch mode. To use Pig interactively, create an SSH connection to the master node and submit commands using the Grunt shell. To use Pig in batch mode, write your Pig scripts, upload them to Amazon S3, and submit them as cluster steps. For more information on submitting work to a cluster, see Submit Work to a Cluster (p. 517).
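
For example, the following is a minimal sketch of the two modes as run from the master node; the script path is a placeholder, and running scripts directly from Amazon S3 relies on the Amazon EMR Pig patches described later in this section:

# Interactive: open the Grunt shell on the master node
pig

# Batch: run a script stored locally, in HDFS, or in Amazon S3
pig -f s3://mybucket/scripts/myscript.pig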

Topics · Supported Pig Versions (p. 300) · Interactive and Batch Pig Clusters (p. 306) · Launch a Cluster and Submit Pig Work (p. 307) · Call User Defined Functions from Pig (p. 314)

Supported Pig Versions

The Pig version you can add to your cluster depends on the version of the Amazon Elastic MapReduce (Amazon EMR) AMI and the version of Hadoop you are using. The table below shows which AMI versions and versions of Hadoop are compatible with the different versions of Pig. We recommend using the latest available version of Pig to take advantage of performance enhancements and new functionality. For more information about the Amazon EMR AMIs and AMI versioning, see Choose an Amazon Machine Image (AMI) (p. 53).

If you choose to install Pig on your cluster using the console or the AWS CLI, the AMI you specify determines the version of Pig installed. By default, Pig is installed on your cluster when you use the console, but you can remove it during cluster creation. Pig is also installed by default when you use the AWS CLI unless you use the --applications parameter to identify which applications you want on your cluster.

If you install Pig on your cluster using the Amazon EMR CLI, you can use the --pig-versions parameter to install a particular version of Pig, or you can use the --pig-versions parameter with the latest keyword to install the latest version of Pig. The AWS CLI does not support the --pig-versions parameter.

When you use the API to install Pig, the default version is used unless you specify --pig-versions as an argument to the step that loads Pig onto the cluster during the call to RunJobFlow.

Pig Version | AMI Version | Configuration Parameters | Pig Version Details

0.12 (Release Notes, Documentation) | 3.1.0 and later | --ami-version 3.1.0 | Adds support for the following: Streaming UDFs without JVM implementations; ASSERT and IN operators; CASE expression; AvroStorage as a Pig built-in function; ParquetLoader and ParquetStorer as built-in functions; BigInteger and BigDecimal types.

0.11.1.1 (Release Notes, Documentation) | 2.2 and later | --pig-versions 0.11.1.1 --ami-version 2.2 | Improves performance of the LOAD command with PigStorage if input resides in Amazon S3.

0.11.1 (Release Notes, Documentation) | 2.2 and later | --pig-versions 0.11.1 --ami-version 2.2 | Adds support for JDK 7, Hadoop 2, Groovy User Defined Functions, SchemaTuple optimization, new operators, and more. For more information, see the Pig 0.11.1 Change Log.

0.9.2.2 (Release Notes, Documentation) | 2.2 and later | --pig-versions 0.9.2.2 --ami-version 2.2 | Adds support for Hadoop 1.0.3.

0.9.2.1 (Release Notes, Documentation) | 2.2 and later | --pig-versions 0.9.2.1 --ami-version 2.2 | Adds support for MapR. For more information, see Using the MapR Distribution for Hadoop (p. 181).

0.9.2 (Release Notes, Documentation) | 2.2 and later | --pig-versions 0.9.2 --ami-version 2.2 | Includes several performance improvements and bug fixes. For complete information about the changes for Pig 0.9.2, go to the Pig 0.9.2 Change Log.

0.9.1 (Release Notes, Documentation) | 2.0 | --pig-versions 0.9.1 --ami-version 2.0 |

0.6 (Release Notes) | 1.0 | --pig-versions 0.6 --ami-version 1.0 |

0.3 (Release Notes) | 1.0 | --pig-versions 0.3 --ami-version 1.0 |

The following examples demonstrate adding specific versions of Pig to a cluster using the Amazon EMR CLI. The AWS CLI does not support Pig versioning.

To add a specific Pig version to a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --pig-versions parameter. The following command-line example creates an interactive Pig cluster running Hadoop 1.0.3 and Pig 0.11.1.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Pig" \ --ami-version 2.3.6 \ --num-instances 5 --instance-type m1.large \ --pig-interactive \ --pig-versions 0.11.1

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Pig" --ami-version 2.3.6 --num-instances 5 --instance-type m1.large --pig-interactive --pig- versions 0.11.1


To add the latest version of Pig to a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --pig-versions parameter with the latest keyword. The following command-line example creates an interactive Pig cluster running the latest version of Pig.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Latest Pig" \ --ami-version 2.2 \ --num-instances 5 --instance-type m1.large \ --pig-interactive \ --pig-versions latest

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Latest Pig" --ami- version 2.2 --num-instances 5 --instance-type m1.large --pig-interactive --pig-versions latest

To add multiple versions of Pig on a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --pig-versions parameter and separate the version numbers by commas. The following command-line example creates an interactive Pig job flow running Hadoop 0.20.205 and Pig 0.9.1 and Pig 0.9.2. With this configuration, you can use either version of Pig on the cluster.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Test Pig" \ --ami-version 2.0 \ --num-instances 5 --instance-type m1.large \ --pig-interactive \ --pig-versions 0.9.1,0.9.2

· Windows users:

ruby elastic-mapreduce --create --alive --name "Test Pig" --ami-version 2.0 --num-instances 5 --instance-type m1.large --pig-interactive --pig- versions 0.9.1,0.9.2


If you have multiple versions of Pig loaded on a cluster, calling Pig accesses the default version of Pig, or the version loaded last if there are multiple --pig-versions parameters specified in the cluster creation call. When the comma-separated syntax is used with --pig-versions to load multiple versions, Pig accesses the default version.

To run a specific version of Pig on a cluster

· Add the version number to the call. For example, pig-0.11.1 or pig-0.9.2. You would do this, for example, in an interactive Pig cluster by using SSH to connect to the master node and then running a command like the following from the terminal.

pig-0.9.2

Pig Version Details

Amazon EMR supports certain Pig releases that might have additional Amazon EMR patches applied. You can configure which version of Pig to run on Amazon Elastic MapReduce (Amazon EMR) clusters. For more information about how to do this, see Process Data with Pig (p. 300). The following sections describe different Pig versions and the patches applied to the versions loaded on Amazon EMR.

Pig Patches

This section describes the custom patches applied to Pig versions available with Amazon EMR.

Pig 0.11.1.1 Patches

The Amazon EMR version of Pig 0.11.1.1 is a maintenance release that improves performance of the LOAD command with PigStorage if the input resides in Amazon S3.

Pig 0.11.1 Patches

The Amazon EMR version of Pig 0.11.1 contains all the updates provided by the Apache Software Foundation and the cumulative Amazon EMR patches from Pig version 0.9.2.2. However, there are no new Amazon EMR-specific patches in Pig 0.11.1.

Pig 0.9.2 Patches

Apache Pig 0.9.2 is a maintenance release of Pig. The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.2.

Patch Description

PIG-1429 Add the Boolean data type to Pig as a first class data type. For more information, go to https://issues.apache.org/jira/browse/PIG-1429.

Status: Committed Fixed in Apache Pig Version: 0.10


Patch Description

PIG-1824 Support import modules in Jython UDF. For more information, go to https://issues.apache.org/jira/browse/PIG-1824.

Status: Committed Fixed in Apache Pig Version: 0.10

PIG-2010 Bundle registered JARs on the distributed cache. For more information, go to https://issues.apache.org/jira/browse/PIG-2010.

Status: Committed Fixed in Apache Pig Version: 0.11

PIG-2456 Add a ~/.pigbootup file where the user can specify default Pig statements. For more information, go to https://issues.apache.org/jira/browse/PIG-2456.

Status: Committed Fixed in Apache Pig Version: 0.11

PIG-2623 Support using Amazon S3 paths to register UDFs. For more information, go to https://issues.apache.org/jira/browse/PIG-2623.

Status: Committed Fixed in Apache Pig Version: 0.10, 0.11

Pig 0.9.1 Patches

The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.1.

Patch Description

Support JAR files and Pig scripts in dfs: Add support for running scripts and registering JAR files stored in HDFS, Amazon S3, or other distributed file systems. For more information, go to https://issues.apache.org/jira/browse/PIG-1505.

Status: Committed Fixed in Apache Pig Version: 0.8.0

Support multiple file systems in Pig: Add support for Pig scripts to read data from one file system and write it to another. For more information, go to https://issues.apache.org/jira/browse/PIG-1564.

Status: Not Committed Fixed in Apache Pig Version: n/a

Add Piggybank datetime and string UDFs: Add datetime and string UDFs to support custom Pig scripts. For more information, go to https://issues.apache.org/jira/browse/PIG-1565.

Status: Not Committed Fixed in Apache Pig Version: n/a


Additional Pig Functions

The Amazon EMR development team has created additional Pig functions that simplify string manipulation and make it easier to format date-time information. These are available at http://aws.amazon.com/code/2730.

Interactive and Batch Pig Clusters

Amazon Elastic MapReduce (Amazon EMR) enables you to run Pig scripts in two modes:

· Interactive
· Batch

When you launch a long-running cluster using the console or the AWS CLI, you can ssh into the master node as the Hadoop user and use the Grunt shell to develop and run your Pig scripts interactively. Using Pig interactively enables you to revise the Pig script more easily than batch mode. After you successfully revise the Pig script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run the script in production. You can also submit Pig commands interactively on a running cluster to analyze and transform data as needed.

To use Pig interactively in the Amazon EMR CLI, use the --pig-interactive parameter when you launch the cluster.

In batch mode, you upload your Pig script to Amazon S3, and then submit the work to the cluster as a step. Pig steps can be submitted to a long-running cluster or a transient cluster. For more information on submitting work to a cluster, see Submit Work to a Cluster (p. 517).

Run Pig in Interactive Mode Using the Amazon EMR CLI

To run Pig in interactive mode, use the --alive parameter to create a long-running cluster along with the --pig-interactive parameter.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Testing Pig" \ --num-instances 5 --instance-type m1.large \ --pig-interactive

· Windows users:

ruby elastic-mapreduce --create --alive --name "Testing Pig" --num-instances 5 --instance-type m1.large --pig-interactive


Launch a Cluster and Submit Pig Work

This section demonstrates creating an Amazon Elastic MapReduce (Amazon EMR) cluster with Pig installed and submitting Pig work. You can add Pig to a cluster using the console, the AWS CLI, the Amazon EMR CLI, or the API. Before you create a cluster, you need to prepare your input data. For more information on preparing input data, see Prepare Input Data (Optional) (p. 125).

The examples that follow are based on the Amazon EMR sample: Apache Log Analysis using Pig. The sample evaluates Apache log files and then generates a report containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers, and the top 50 search terms using Bing and Google. The Pig script is located in the Amazon S3 bucket s3://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in the Amazon S3 bucket s3://elasticmapreduce/samples/pig-apache/input. The output is saved to an Amazon S3 bucket you create as part of Prepare an Output Location (Optional) (p. 137).

Launch a Cluster and Submit Pig Work Using the Amazon EMR Console

This example describes how to use the Amazon EMR console to create a cluster with Pig installed and how to submit a Pig step.

To launch a cluster with Pig installed and to submit a Pig step

Next, Amazon EMR runs the cluster to perform the work you specified.

When the cluster is finished processing the data, Amazon EMR copies the results into the output Amazon S3 bucket that you chose in the previous steps. Verify that your Amazon S3 output bucket contains a set of folders that contain aggregates derived from the input data, similar to the following:

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click Create cluster.

3. In the Create Cluster page, click Configure sample application.

4. In the Configure Sample Application page, choose the Apache log reports (Pig script) sample application from the list.

5. In the Output location field, type the path of an Amazon S3 bucket to store your output and click Ok.

6. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

7. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

8. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Field Action

Network Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master Accept the default instance type.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance type to use for the master node.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Field Action

Core Accept the default instance type.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes.

The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Accept the default instance type.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes.

For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

9. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

10. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


11. In the Steps section, note the step that Amazon EMR configured for you by choosing the sample application.

You do not need to change any of the settings in this section.

12. Review your configuration and if you are satisfied with the settings, click Create Cluster.

13. When the cluster starts, the console displays the Cluster Details page.

Launch a Cluster and Submit Pig Work Using the AWS CLI or the Amazon EMR CLI

To add Pig to a cluster and submit a step using the AWS CLI

When you launch a cluster using the AWS CLI, the versions of Pig and Hive on the AMI you specify are installed by default. You can also specify which applications you want on your cluster (and any optional arguments) using the --applications parameter. For example, you can choose to install only Pig on your cluster.

· To launch a long-running cluster with only Pig installed and submit a step, type the following command:

aws emr create-cluster --applications Name=string,Args=[string] --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name="string",ActionOnFailure=string,Args=[-f,pathtoscript,-p,INPUT=pathtoinputdata,-p,OUTPUT=pathtooutputdata,argument1,argument2] \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=Pig --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output] \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To add Pig to a cluster and submit a step using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:


./elastic-mapreduce --create --name "Test Pig" \ --pig-script s3://elasticmapreduce/samples/pig-apache/do-reports2.pig \ --ami-version 2.0 \ --args "-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input, \ -p,OUTPUT=s3://mybucket/pig-apache/output"

· Windows users:

ruby elastic-mapreduce --create --name "Test Pig" --pig-script s3://elasticmapreduce/samples/pig-apache/do-reports2.pig --ami-version 2.0 --args "-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input, -p,OUT PUT=s3://mybucket/pig-apache/output"

The output looks similar to the following.

Created cluster JobFlowID

By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

Call User Defined Functions from Pig

Pig provides the ability to call user-defined functions (UDFs) from within Pig scripts. You can do this to implement custom processing to use in your Pig scripts. The languages currently supported are Java, Python/Jython, and JavaScript (though JavaScript support is still experimental).

The following sections describe how to register your functions with Pig so you can call them either from the Pig shell or from within Pig scripts. For more information about using UDFs with Pig, go to http://pig.apache.org/docs/r0.9.2/udf.html.

Call JAR files from Pig

You can use custom JAR files with Pig using the REGISTER command in your Pig script. The JAR file can reside on the local file system or on a remote file system such as Amazon S3. When the Pig script runs, Amazon EMR downloads the JAR file automatically to the master node and then uploads the JAR file to the Hadoop distributed cache. In this way, the JAR file is automatically used as necessary by all instances in the cluster.

To use JAR files with Pig

1. Upload your custom JAR file into Amazon S3 (an example upload command follows the REGISTER example below).
2. Use the REGISTER command in your Pig script to specify the Amazon S3 location of the custom JAR file.

REGISTER s3://mybucket/path/mycustomjar.jar;
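For step 1, one way to upload the JAR is with the AWS CLI (a sketch; the bucket, path, and file name are placeholders you would replace with your own):

aws s3 cp mycustomjar.jar s3://mybucket/path/mycustomjar.jar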


Call Python/Jython Scripts from Pig

You can register Python scripts with Pig and then call functions in those scripts from the Pig shell or in a Pig script. You do this by specifying the location of the script with the register keyword.

Because Pig is written in Java, it uses the Jython script engine to parse Python scripts. For more information about Jython, go to http://www.jython.org/.

To call a Python/Jython script from Pig

1. Write a Python script and upload the script to a location in Amazon S3. This should be a bucket owned by the same account that creates the Pig cluster, or one that has permissions set so the account that created the cluster can access it. In this example, the script is uploaded to s3://mybucket/pig/python.
2. Start a Pig cluster. If you'll be accessing Pig from the Grunt shell, run an interactive cluster. If you're running Pig commands from a script, start a scripted Pig cluster. In this example, we'll start an interactive cluster. For more information about how to create a Pig cluster, see Launch a Cluster and Submit Pig Work (p. 307).
3. Because we've launched an interactive cluster, we'll now SSH into the master node where we can run the Grunt shell. For more information about how to SSH into the master node, see SSH into the Master Node.
4. Run the Grunt shell for Pig by typing pig at the command line.

pig

5. Register the Jython library and your Python script with Pig using the register keyword at the Grunt command prompt, as shown in the following, where you would specify the location of your script in Amazon S3.

grunt> register 'lib/jython.jar';
grunt> register 's3://mybucket/pig/python/myscript.py' using jython as myfunctions;

6. Load the input data. The following example loads input from an Amazon S3 location.

grunt> input = load 's3://mybucket/input/data.txt' using TextLoader as (line:chararray);

7. You can now call functions in your script from within Pig by referencing them using myfunctions.

grunt> output=foreach input generate myfunctions.myfunction($1);


Store Data with HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of the Apache Software Foundation's Hadoop project and runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and is highly efficient for batch inserts, updates, and deletes.

HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).

Additionally, HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). You can also restore from a previously created backup when launching an HBase cluster.

What Can I Do with HBase?

You can use HBase for random, repeated access to and modification of large volumes of data. HBase provides low-latency lookups and range scans, along with efficient updates and deletions of individual records.

Here are several HBase use cases for you to consider:

· Reference data for Hadoop analytics. With its direct integration with Hadoop and Hive and rapid access to stored data, HBase can be used to store reference data used by multiple Hadoop tasks or across multiple Hadoop clusters. This data can be stored directly on the cluster running Hadoop tasks or on a separate cluster. Types of analytics include analytics requiring fast access to demographic data, IP address geolocation lookup tables, and product dimensional data.
· Real-time log ingestion and batch log analytics. HBase's high write throughput, optimization for sequential data, and efficient storage of sparse data make it a great solution for real-time ingestion of log data. At the same time, its integration with Hadoop and optimization for sequential reads and scans makes it equally suited for batch analysis of that log data after ingestion. Common use cases include ingestion and analysis of application logs, clickstream data, and in-game usage data.
· Store for high-frequency counters and summary data. Counter increments aren't just database writes, they're read-modify-writes, so they're a very expensive operation for a relational database. However, because HBase is a nonrelational, distributed database, it supports very high update rates and, given its consistent reads and writes, provides immediate access to that updated data. In addition, if you want to run more complex aggregations on the data (such as max-mins, averages, and group-bys), you can run Hadoop jobs directly and feed the aggregated results back into HBase.

Topics
· Supported HBase Versions (p. 317)
· HBase Cluster Prerequisites (p. 317)
· Install HBase on an Amazon EMR Cluster (p. 318)
· Connect to HBase Using the Command Line (p. 325)
· Back Up and Restore HBase (p. 326)
· Terminate an HBase Cluster (p. 340)
· Configure HBase (p. 340)
· Access HBase Data with Hive (p. 346)
· View the HBase User Interface (p. 347)
· View HBase Log Files (p. 348)
· Monitor HBase with CloudWatch (p. 348)
· Monitor HBase with Ganglia (p. 349)

Supported HBase Versions

HBase Version: 0.94.18
AMI Version: 3.1.0 and later
Configuration Parameters: --ami-version 3.1.0, --hbase
HBase Version Details: Bug fixes and enhancements.

HBase Version: 0.94.7
AMI Version: 3.0-3.0.4
Configuration Parameters: --ami-version 3.0 | 3.0.1 | 3.0.2 | 3.0.3 | 3.0.4, --hbase

HBase Version: 0.92
AMI Version: 2.2 and later
Configuration Parameters: --ami-version 2.2 or later, --hbase

HBase Cluster Prerequisites

An Amazon EMR cluster should meet the following requirements in order to run HBase.

· The AWS CLI (Optional): To interact with HBase using the command line, download and install the latest version of the AWS CLI. For more information, see Installing the AWS Command Line Interface in the AWS Command Line Interface User Guide.
· A version of the Amazon EMR CLI that supports HBase (Optional): CLI version 2012-06-12 and later. To find out what version of the CLI you have, run elastic-mapreduce --version at the command line. For more information about the Amazon EMR CLI and how to install it, see the Command Line Interface Reference for Amazon EMR (p. 626). If you do not have the latest version of the CLI installed, you can use the Amazon EMR console to launch HBase clusters.
Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.
· At least two instances (Optional): The cluster's master node runs the HBase master server and Zookeeper, and slave nodes run the HBase region servers. For best performance, HBase clusters should run on at least two EC2 instances, but you can run HBase on a single node for evaluation purposes.
· Long-running cluster: HBase only runs on long-running clusters. By default, the CLI and the Amazon EMR console create long-running clusters.
· An Amazon EC2 key pair set (Recommended): To use the Secure Shell (SSH) network protocol to connect with the master node and run HBase shell commands, you must use an Amazon EC2 key pair when you create the cluster.
· The correct instance type: HBase is only supported on the following instance types: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.
· The correct AMI and Hadoop versions: HBase clusters are currently supported only on Hadoop 0.20.205 or later.
· Ganglia (Optional): If you want to monitor HBase performance metrics, you can install Ganglia when you create the cluster.
· An Amazon S3 bucket for logs (Optional): The logs for HBase are available on the master node. If you'd like these logs copied to Amazon S3, specify an Amazon S3 bucket to receive log files when you create the cluster.
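As a rough sketch of how these prerequisites map to an AWS CLI launch, the following command requests two c1.xlarge instances, names an EC2 key pair, and copies logs to Amazon S3 (the key pair name and bucket are placeholders you would replace with your own):

aws emr create-cluster --applications Name=HBase --ami-version 3.1.1 \
--ec2-attributes KeyName=mykeypair \
--log-uri s3://mybucket/logs/ \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c1.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=c1.xlarge \
--no-auto-terminate --termination-protected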

Install HBase on an Amazon EMR Cluster

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud: easy scaling, low cost, pay only for what you use, and ease of use. The Amazon EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase (p. 316).

The following procedure shows how to launch an HBase cluster with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase (p. 340). Note HBase configuration can only be done at launch time.

To launch a cluster and install HBase using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Cluster name: Enter a descriptive name for your cluster. The name is optional, and does not need to be unique.

Termination protection: Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging: This determines whether Amazon EMR captures detailed log data to Amazon S3. For more information, see View Log Files (p. 449).

Log folder S3 location: Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you. When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes. For more information, see View Log Files (p. 449).

Debugging: This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Hadoop distribution: Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version: Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list. The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. Under the Additional Applications list, choose HBase and click Configure and add.
6. In the Add Application section, indicate whether you want to pre-load the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase (p. 326).

Restore from backup: Specify whether to pre-load the HBase cluster with data stored in Amazon S3.

Backup location: Specify the URI where the backup to restore from resides in Amazon S3.

Backup version: Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.

Schedule Regular Backups: Specify whether to schedule automatic incremental backups. The first backup will be a full backup to create a baseline for future incremental backups.

Consistent backup: Specify whether the backups should be consistent. A consistent backup is one which pauses write operations during the initial backup stage (synchronization across nodes). Any write operations thus paused are placed in a queue and resume when synchronization completes.

Backup frequency: The number of Days/Hours/Minutes between scheduled backups.

Backup location: The Amazon S3 URI where backups will be stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.

Backup start time: Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8 PM UTC.

7. In the Hardware Configuration section, verify the fields according to the following table. Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Network: Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide. Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone: Choose No preference. Optionally, you can launch the cluster in a specific EC2 Availability Zone. For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

Master: Accept the default instance type. The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster. This specifies the EC2 instance type to use for the master node. The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances: Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Core: Accept the default instance type. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. This specifies the EC2 instance types to use as core nodes. The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count: Choose 2.

Request Spot Instances: Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task: Accept the default instance type. Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon. This specifies the EC2 instance types to use as task nodes. For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count: Choose 0.

Request Spot Instances: Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

8. In the Security and Access section, complete the fields according to the following table.

EC2 key pair: Choose your Amazon EC2 key pair from the list. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142). Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access: Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144). Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role: Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role. This role allows Amazon EMR to access other AWS services on your behalf. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile: You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role. This controls application access to the Amazon EC2 instances in the cluster. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

9. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration. Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).
10. In the Steps section, you do not need to change any of these settings.
11. Review your configuration and if you are satisfied with the settings, click Create Cluster.
12. When the cluster starts, the console displays the Cluster Details page.

To launch a cluster and install HBase using the AWS CLI

You can install HBase on a cluster using the AWS CLI by typing the create-cluster subcommand with the --applications parameter. When using the --applications parameter, you identify the application you want to install via the Name argument.

· To install HBase when a cluster is launched, type the following command:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate --termination-protected

For example, to install only HBase on your cluster, type:

aws emr create-cluster --applications Name=HBase --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c1.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=c1.xlarge \
--no-auto-terminate --termination-protected
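After the cluster is created, you can poll its state until it reaches WAITING before you connect to it. The following is a sketch; the cluster ID is a placeholder for the value returned by create-cluster:

aws emr describe-cluster --cluster-id j-3AEXXXXXX16F2 --query 'Cluster.Status.State' --output text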


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To launch a cluster and install HBase using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Specify the --hbase parameter when you launch a cluster using the CLI.

The following example shows how to launch a cluster running HBase from the CLI. We recommend that you run at least two instances in the HBase cluster. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.

The CLI implicitly launches the HBase cluster with keep alive and termination protection set.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --hbase --name "HBase Cluster" \ --num-instances 2 \ --instance-type c1.xlarge

· Windows users:

ruby elastic-mapreduce --create --hbase --name "HBase Cluster" --num-instances 2 --instance-type c1.xlarge

Connect to HBase Using the Command Line

After you create an HBase cluster, the next step is to connect to HBase so you can begin reading and writing data.

To open the HBase shell

1. Use SSH to connect to the master server in the HBase cluster. For information about how to connect to the master node using SSH, see Connect to the Master Node Using SSH (p. 481).
2. Run hbase shell. The HBase shell will open with a prompt similar to the following example.

hbase(main):001:0>
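Putting the two steps together from your workstation, a sketch using the AWS CLI might look like the following, where the cluster ID and key file path are placeholders and the key pair must be the one used when the cluster was launched:

aws emr ssh --cluster-id j-3AEXXXXXX16F2 --key-pair-file ~/mykeypair.pem

Then, on the master node:

hbase shell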


You can issue HBase shell commands from the prompt. For a description of the shell commands and information on how to call them, type help at the HBase prompt and press Enter.

Create a Table

The following command will create a table named 't1' that has a single column family named 'f1'.

hbase(main):001:0> create 't1', 'f1'

Put a Value

The following command will put value 'v1' for row 'r1' in table 't1' and column 'f1'.

hbase(main):001:0> put 't1', 'r1', 'f1', 'v1'

Get a Value

The following command will get the values for row 'r1' in table 't1'.

hbase(main):001:0> get 't1', 'r1'
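You can also run the same shell commands non-interactively on the master node by piping them into hbase shell, which is convenient for quick checks from a script. This is a sketch that reuses the table and column family created above:

echo -e "put 't1', 'r2', 'f1', 'v2'\nget 't1', 'r2'" | hbase shell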

Back Up and Restore HBase

Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. Once you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster prepopulated with backed-up data.

During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data will not be included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included.

If a consistent backup is required, you must pause writes to HBase during the initial portion of the backup process (synchronization across nodes). You can do this by specifying the --consistent parameter when requesting a backup. With this parameter, writes during this period are queued and executed as soon as the synchronization completes. You can also schedule recurring backups, which will resolve any inconsistencies over time, as data that is missed on one backup pass will be backed up on the following pass.


When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory. For example, s3://mybucket/backups/j-3AEXXXXXX16F2. This ensures that any future incremental backups reference the correct HBase cluster.
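As a sketch of this convention with the AWS CLI, you might capture the cluster identifier in a shell variable and reuse it in the backup path (the bucket name is a placeholder):

CLUSTER_ID=j-3AEXXXXXX16F2
aws emr create-hbase-backup --cluster-id $CLUSTER_ID --dir s3://mybucket/backups/$CLUSTER_ID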

When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files.

The HBase backup process uses S3DistCp for the copy operation, which has certain limitations regarding temporary file storage space. For more information, see Distributed Copy Using S3DistCp (p. 389).

Back Up and Restore HBase Using the Console

The console provides the ability to launch a new cluster and populate it with data from a previous HBase backup. It also gives you the ability to schedule periodic incremental backups of HBase data. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, perform manual backups, and schedule automated full backups, is available using the CLI. For more information, see Back Up and Restore HBase Using the AWS CLI or the Amazon EMR CLI (p. 329).

To populate a new cluster with archived HBase data using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Software Configuration section, in the Additional Applications field, choose HBase and click Configure and add.
4. On the Add Application dialog box, check Restore From Backup. For more information, see Install HBase on an Amazon EMR Cluster (p. 318).
5. In the Backup Location field, specify the location of the backup you wish to load into the new HBase cluster. This should be an Amazon S3 URL of the form s3://myawsbucket/backups/.
6. In the Backup Version field, you have the option to specify the name of a backup version to load by setting a value. If you do not set a value for Backup Version, Amazon EMR loads the latest backup in the specified location.

7. Click Add and proceed to create the cluster as described in Plan an Amazon EMR Cluster (p. 30).


To schedule automated backups of HBase data using the console

1. In the Software Configuration section, in the Additional Applications field, choose HBase and click Configure and add.
2. Click Schedule Regular Backups.
3. Specify whether the backups should be consistent. A consistent backup is one which pauses write operations during the initial backup stage (synchronization across nodes). Any write operations thus paused are placed in a queue and resume when synchronization completes.
4. Set how often backups should occur by entering a number for Backup Frequency and selecting Days, Hours, or Minutes from the drop-down box. The first automated backup that runs will be a full backup; after that, Amazon EMR saves incremental backups based on the schedule you specify.
5. Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.
6. Specify when the first backup should occur by setting a value for Backup Start Time. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2013-09-26T20:00Z would set the start time to September 26, 2013 at 8 PM UTC.

7. Click Add.
8. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

Back Up and Restore HBase Using the AWS CLI or the Amazon EMR CLI

Running HBase on Amazon EMR provides many ways to back up your data: you can create full or incremental backups, run backups manually, and schedule automatic backups.

Back Up and Restore HBase Using the AWS CLI

Using the AWS CLI, you can create HBase backups, restore HBase data from backup when creating an Amazon EMR cluster, schedule HBase backups, restore HBase from backup data in Amazon S3, and disable HBase backups.

To manually create an HBase backup using the AWS CLI

To create an HBase backup, type the create-hbase-backup subcommand with the --dir parameter to identify the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

· Type the following command to back up HBase data:

aws emr create-hbase-backup --cluster-id string --dir string

For example, type the following command to back up HBase data to s3://mybucket/backups, with the timestamp as the version name. This backup does not pause writes to HBase and as such, may be inconsistent:

aws emr create-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2

Type the following command to back up data and use the --consistent parameter to enforce backup consistency. This flag pauses all writes to HBase during the backup:

aws emr create-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --consistent
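To confirm that the backup files were written, you can list the backup location afterward. This is a sketch that reuses the placeholder bucket and cluster ID from the examples above:

aws s3 ls --recursive s3://mybucket/backups/j-3AEXXXXXX16F2/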

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To schedule automated backups of HBase data using the AWS CLI

To schedule HBase backups, type the schedule-hbase-backup subcommand with the --interval and --unit parameters. If you do not specify a start time, the first backup starts immediately. Use the --consistent parameter to pause all write operations to HBase during the backup process.

· Type the following command to schedule consistent HBase backups:

aws emr schedule-hbase-backup --cluster-id string --type string --dir string --interval string --unit string --start-time string --consistent


For example, to create a consistent weekly full backup, with the first backup starting immediately, type:

aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type full --dir s3://myawsbucket/backups/j-3AEXXXXXX16F2 \ --interval 7 --unit days --consistent

To create a consistent weekly full backup, with the first backup starting on 15 June 2014, 8 p.m. UTC time, type:

aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type full --dir s3://myawsbucket/backups/j-3AEXXXXXX16F2 \ --interval 7 --unit days --start-time 2014-06-15T20:00Z --consistent

To create a consistent daily incremental backup with the first backup beginning immediately, type:

aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type incremental --dir s3://myawsbucket/backups/j-3AEXXXXXX16F2 \ --interval 24 --unit hours --consistent

To create a consistent daily incremental backup, with the first backup starting on 15 June 2014, 8 p.m. UTC time, type:

aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type incremental --dir s3://myawsbucket/backups/j-3AEXXXXXX16F2 \ --interval 24 --unit hours --start-time 2014-06-15T20:00Z --consistent

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To disable HBase backups using the AWS CLI

To disable HBase backups, type the disable-hbase-backups subcommand with the --cluster-id parameter. The cluster ID can be retrieved using the console or the list-clusters subcommand. When disabling backups, identify the backup type: --full or --incremental.

· Type the following command to disable HBase backups:

aws emr disable-hbase-backups --cluster-id string --full or --incremental

For example, to disable full backups, type:

aws emr disable-hbase-backups --cluster-id j-3AEXXXXXX16F2 --full

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To restore HBase backup data to a running cluster using the AWS CLI

To restore HBase backup data to a running cluster, type the restore-from-hbase-backup subcommand with the --cluster-id parameter. To restore from backup, you must provide the backup directory and (optionally) the backup version. The backup version specifies the version number of an existing backup to restore. If the backup version is not specified, Amazon EMR uses the latest backup, as determined by lexicographical order. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z.

· Type the following command to restore HBase backup data to a running cluster:

aws emr restore-from-hbase-backup --cluster-id string --dir string --backup-version string

For example:

aws emr restore-from-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --backup-version 20120809T031314Z

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To populate a new cluster with HBase backup data using the AWS CLI

To populate a new cluster with HBase backup data, type the create-cluster subcommand with the --restore-from-hbase-backup parameter. To restore from backup, you must provide the backup directory and (optionally) the backup version. The backup version specifies the version number of an existing backup to restore. If the backup version is not specified, Amazon EMR uses the latest backup, as determined by lexicographical order. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z.

· Type the following command to create a cluster with HBase installed and to load HBase with the backup data in s3://myawsbucket/backups/j-3AEXXXXXX16F2.

aws emr create-cluster --applications Name=string --restore-from-hbase-backup Dir=string,BackupVersion=string \
--ami-version string --instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example:

aws emr create-cluster --applications Name=HBase --restore-from-hbase-backup Dir=s3://mybucket/backups/j-3AEXXXXXX16F2,BackupVersion=20120809T031314Z \
--ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Back Up and Restore HBase Using the Amazon EMR CLI

The following table lists all the flags and parameters you can set in order to back up HBase data using the Amazon EMR CLI. Following the table are examples of commands that use these flags and parameters to back up data in various ways.


Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

Parameter Description

--backup-dir The directory where a backup exists or should be created.

--backup-version (Optional) Specifies the version number of an existing backup to restore. If the backup version is not specified in a restore operation, Amazon EMR uses the latest backup, as determined by lexicographical order. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z.

--consistent (Optional) Pauses all write operations to the HBase cluster during the backup process, to ensure a consistent backup.

--disable-full-backups Turn off scheduled full backups by passing this flag into a call with --hbase-schedule-backup

--disable-incremental-backups Turn off scheduled incremental backups by passing this flag into a call with --hbase-schedule-backup

--full-backup-time-interval An integer that specifies the period of time units to elapse between automated full backups of the HBase cluster. Used with --hbase-schedule-backup this parameter creates regularly scheduled full backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --full-backup-time-unit.

--full-backup-time-unit The unit of time to use with --full-backup-time-interval to specify how often automatically scheduled backups should run. This can take any one of the following values: minutes, hours, days.

--hbase-backup Create a one-time backup of HBase data to the location specified by --backup-dir.

--hbase-restore Restore a backup from the location specified by --backup-dir and (optionally) the version specified by --backup-version.

--hbase-schedule-backup Schedule an automated backup of HBase data. This can set an incremental backup, a full backup, or both, depending on the flags used to set the intervals and time units. The first backup in the schedule begins immediately unless a value is specified by --start-time.

--incremental-backup-time-interval An integer that specifies the period of time units to elapse between automated incremental backups of the HBase cluster. Used with --hbase-schedule-backup this parameter creates regularly scheduled incremental backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --incremental-backup-time-unit.


Parameter Description

--incremental-backup-time-unit The unit of time to use with --incremental-backup-time-interval to specify how often automatically scheduled incremental backups should run. This can take any one of the following values: minutes, hours, days.

--start-time (Optional) Specifies the time that a backup schedule should start. If this is not set, the first backup begins immediately. This should be in ISO date-time format. You can use this to ensure your first data load process has completed before performing the initial backup or to have the backup occur at a specific time each day.

To manually back up HBase data using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Run --hbase-backup in the CLI and specify the cluster and the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

The following example backs up the HBase data to s3://myawsbucket/backups, with the timestamp as the version name. This backup does not pause writes to the HBase cluster and as such, may be inconsistent.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example backs up data, and uses the --consistent flag to enforce backup consistency. This flag causes all writes to the HBase cluster to pause during the backup.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA \ --consistent


· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA --consistent

To schedule automated backups of HBase data using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

1. Call --hbase-schedule-backup on the HBase cluster and specify the backup time interval and units. If you do not specify a start time, the first backup starts immediately. The following example creates a weekly full backup, with the first backup starting immediately.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --full-backup-time-interval 7 --full-backup-time-unit days \ --backup-dir s3://mybucket/backups/j-ABABABABABA

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a weekly full backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --full-backup-time-interval 7 --full-backup-time-unit days \ --backup-dir s3://mybucket/backups/j-ABABABABABA \ --start-time 2012-06-15T20:00Z

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z


The following example creates a daily incremental backup. The first incremental backup will begin immediately.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --incremental-backup-time-interval 24 \ --incremental-backup-time-unit hours \ --backup-dir s3://mybucket/backups/j-ABABABABABA

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a daily incremental backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --incremental-backup-time-interval 24 \ --incremental-backup-time-unit hours \ --backup-dir s3://mybucket/backups/j-ABABABABABA \ --start-time 2012-06-15T20:00Z

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting immediately. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --full-backup-time-interval 7 \ --full-backup-time-unit days \ --incremental-backup-time-interval 24 \ --incremental-backup-time-unit hours \ --backup-dir s3://mybucket/backups/j-ABABABABABA


· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --full-backup-time-interval 7 \ --full-backup-time-unit days \ --incremental-backup-time-interval 24 \ --incremental-backup-time-unit hours \ --backup-dir s3://mybucket/backups/j-ABABABABABA \ --start-time 2012-06-15T20:00Z

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

2. The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run. The --consistent flag is set, so both the incremental and full backups will pause write operations during the initial portion of the backup process to ensure data consistency.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup \ --full-backup-time-interval 7 \ --full-backup-time-unit days \ --incremental-backup-time-interval 24 \ --incremental-backup-time-unit hours \ --backup-dir s3://mybucket/backups/j-ABABABABABA \ --start-time 2012-06-15T20:00Z \ --consistent

· Windows users:


ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z --consistent

To turn off automated HBase backups using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Call the cluster with the --hbase-schedule-backup parameter and set the --disable-full-backups or --disable-incremental-backups flag, or both flags. The following example turns off full backups.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup --disable-full-backups

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups

The following example turns off incremental backups.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup --disable-incremental-backups

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-incremental-backups

The following example turns off both full and incremental backups.

· Linux, UNIX, and Mac OS X users:


./elastic-mapreduce --jobflow j-ABABABABABA \ --hbase-schedule-backup --disable-full-backups \ --disable-incremental-backups

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups --disable-incremental-backups

To restore HBase backup data to a running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Run an --hbase-restore step and specify the jobflow, the backup location in Amazon S3, and (optionally) the name of the backup version. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This is the version with the name that is lexicographically greatest.

The following example restores the HBase cluster to the latest version of backup data stored in s3://myawsbucket/backups, overwriting any data stored in the HBase cluster.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example restores the HBase cluster to the specified version of backup data stored in s3://myawsbucket/backups, overwriting any data stored in the HBase cluster.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA \ --backup-version 20120809T031314Z

· Windows users:


ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version 20120809T031314Z

To populate a new cluster with HBase backup data using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add --hbase-restore and --backup-dir to the --create step in the CLI.

You can optionally specify --backup-version to indicate which version in the backup directory to load. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This will either be the version with the name that is lexicographically last or, if the version names are based on timestamps, the latest version.

The following example creates a new HBase cluster and loads it with the latest version of data in s3://myawsbucket/backups/j-ABABABABABA.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --name "My HBase Restored" \ --hbase --hbase-restore \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA

· Windows users:

ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase- restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example creates a new HBase cluster and loads it with the specified version of data in s3://myawsbucket/backups/j-ABABABABABA.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --name "My HBase Restored" \ --hbase --hbase-restore \ --backup-dir s3://myawsbucket/backups/j-ABABABABABA \ --backup-version 20120809T031314Z

· Windows users:

ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase- restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version 20120809T031314Z


Terminate an HBase Cluster

Amazon EMR launches HBase clusters with termination protection turned on. This prevents the cluster from being terminated inadvertently or in the case of an error. Before you terminate the cluster, you must first disable termination protection. For more information on terminating a cluster, see Terminate a Cluster (p. ?).
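With the AWS CLI, disabling termination protection and then terminating the cluster might look like the following sketch, where the cluster ID is a placeholder:

aws emr modify-cluster-attributes --cluster-id j-3AEXXXXXX16F2 --no-termination-protected
aws emr terminate-clusters --cluster-ids j-3AEXXXXXX16F2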

Configure HBase

Although the default settings should work for most applications, you have the flexibility to modify your HBase configuration settings. To do this, you run one of two bootstrap action scripts:

· configure-hbase-daemons: Configures properties of the master, regionserver, and zookeeper daemons. These properties include heap size and options to pass to the Java Virtual Machine (JVM) when the HBase daemon starts. You set these properties as arguments in the bootstrap action. This bootstrap action modifies the /home/hadoop/conf/hbase-user-env.sh configuration file on the HBase cluster.
· configure-hbase: Configures HBase site-specific settings such as the port the HBase master should bind to and the maximum number of times the CLI client should retry an action. You can set these one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster.

Note These scripts, like other bootstrap actions, can only be run when the cluster is created; you cannot use them to change the configuration of an HBase cluster that is currently running.

When you run the configure-hbase or configure-hbase-daemons bootstrap actions, the values you specify override the default values. Any values you don't explicitly set receive the default values.

Configuring HBase with these bootstrap actions is analogous to using bootstrap actions in Amazon EMR to configure Hadoop settings and Hadoop daemon properties. The difference is that HBase does not have per-process memory options. Instead, memory options are set using the --daemon-opts argument, where daemon is replaced by the name of the daemon to configure.

Configure HBase Daemons

Amazon EMR provides a bootstrap action, s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-daemons, that you can use to change the configuration of HBase daemons, where region is the region into which you're launching your HBase cluster.

For a list of regions supported by Amazon EMR, see Choose an AWS Region (p. 30). The bootstrap action can only be run when the HBase cluster is launched.

You can configure a bootstrap action using the console, the AWS CLI, the Amazon EMR CLI, or the API. For more information on configuring bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

To configure HBase daemons using the AWS CLI

Add the bootstrap action, configure-hbase-daemons, when you launch the cluster to configure one or more HBase daemons.

You can set the following properties with the configure-hbase-daemons bootstrap action.


hbase-master-opts
    Options that control how the JVM runs the master daemon. If set, these settings override the default HBASE_MASTER_OPTS variables.
regionserver-opts
    Options that control how the JVM runs the region server daemon. If set, these settings override the default HBASE_REGIONSERVER_OPTS variables.
zookeeper-opts
    Options that control how the JVM runs the zookeeper daemon. If set, these settings override the default HBASE_ZOOKEEPER_OPTS variables.

For more information about these options, go to http://hbase.apache.org/configuration.html#hbase.env.sh.

· To create a new cluster with HBase installed and to use a bootstrap action to configure HBase daemons, type the following command:

aws emr create-cluster --ami-version string --applications Name=string --name string --instance-count integer --instance-type string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example, to create a new cluster with HBase installed and to use a bootstrap action to set values for zookeeper-opts and hbase-master-opts (to configure the options used by the zookeeper and master node components of the cluster), type:

aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons,Args=["--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19","--hbase-master-opts=-Xmx2048m","--hbase-regionserver-opts=-Xmx4096m"]

Note When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To configure HBase daemons using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add a bootstrap action, configure-hbase-daemons, when you launch the HBase cluster. You can use this bootstrap action to configure one or more daemons.

The following example creates a new HBase cluster and uses the configure-hbase-daemons bootstrap action to set values for zookeeper-opts and hbase-master-opts, which configure the options used by the zookeeper and master node components of the HBase cluster.


In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"

· Windows users:

ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"

Note
When you specify the arguments for this bootstrap action, you must put quotes around the --args parameter value to keep the shell from breaking the arguments up. You must also include a space character between JVM arguments; in the example above there is a space between -Xmx1024m and -XX:GCTimeRatio=19.

Configure HBase Site Settings

Amazon EMR provides a bootstrap action, s3://elasticmapreduce/bootstrap-actions/configure-hbase, that you can use to change the configuration of HBase. You can set configuration values one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. Setting configuration values one-by-one is useful if you only need to set a few configuration settings. Setting them using an XML file is useful if you have many changes to make, or if you want to save your configuration settings for reuse.
Note
You can prefix the Amazon S3 bucket name with a region prefix, such as s3://region.elasticmapreduce/bootstrap-actions/configure-hbase, where region is the region into which you're launching your HBase cluster. For a list of all the regions supported by Amazon EMR, see Choose an AWS Region (p. 30).

This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster. The bootstrap action can only be run when the HBase cluster is launched. For more information on configuring bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

For a complete list of the HBase site settings that you can configure, go to http://hbase.apache.org/configuration.html#hbase.site.

To specify individual HBase site settings using the AWS CLI

Set the configure-hbase bootstrap action when you launch the HBase cluster, and specify the values within hbase-site.xml to change.

· Type the following command to configure HBase using a bootstrap action:


aws emr create-cluster --ami-version string --applications Name=string --name string --instance-count integer --instance-type string \
--bootstrap-action Path=string,Args=[arg1,arg2]

For example, to change the hbase.hregion.max.filesize settings type:

aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=52428800"]

Note When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To specify individual HBase site settings using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Set the configure-hbase bootstrap action when you launch the HBase cluster, and specify the values within hbase-site.xml to change. The following example illustrates how to change the hbase.hregion.max.filesize settings.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
--args -s,hbase.hregion.max.filesize=52428800

· Windows users:

ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args -s,hbase.hregion.max.filesize=52428800

To specify HBase site settings with an XML file using the AWS CLI

1. Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.
2. Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it will have access.
3. Set the configure-hbase bootstrap action when you launch the HBase cluster, and pass in the location of your custom hbase-site.xml file.

The following example sets the HBase site configuration values to those specified in the file s3://mybucket/my-hbase-site.xml:

aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["--site-config-file","s3://mybucket/my-hbase-site.xml"]

Note When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
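For reference, the custom XML file passed with --site-config-file follows the standard Hadoop-style configuration format. The following minimal sketch assumes you only want to override hbase.hregion.max.filesize; any property you do not include keeps its default value:

<?xml version="1.0"?>
<configuration>
  <!-- Override only the properties you need; omitted properties keep their defaults. -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>52428800</value>
  </property>
</configuration>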

To specify HBase site settings with an XML file using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

1. Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.
2. Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it will have access.
3. Set the configure-hbase bootstrap action when you launch the HBase cluster, and pass in the location of your custom hbase-site.xml file.

The following example sets the HBase site configuration values to those specified in the file s3://myawsbucket/my-hbase-site.xml.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
--args --site-config-file s3://myawsbucket/my-hbase-site.xml

· Windows users:


ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args --site-config-file s3://myawsbucket/my-hbase-site.xml

HBase Site Settings to Optimize

You can set any or all of the HBase site settings to optimize the HBase cluster for your application's workload. We recommend the following settings as a starting point in your investigation. If you specify more than one option, you must prepend each key-value pair with a -s option switch. All options below are for the AWS CLI. For more information on using the parameters in the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR (p. 626).

Note
The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

zookeeper.session.timeout

The default timeout is three minutes (180000 ms). If a region server crashes, this is how long it takes the master server to notice the absence of the region server and start recovery. If you want the master server to recover faster, you can reduce this value to a shorter time period. The following example uses one minute, or 60000 ms.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","zookeeper.session.timeout=60000"]

hbase.regionserver.handler.count

This defines the number of threads the region server keeps open to serve requests to tables. The default of 10 is low, in order to prevent users from killing their region servers when using large write buffers with a high number of concurrent clients. The rule of thumb is to keep this number low when the payload per request approaches the MB range (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The following example raises the number of open threads to 30.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.regionserver.handler.count=30"]

hbase.hregion.max.filesize

This parameter governs the size, in bytes, of the individual regions. By default, it is set to 256 MB. If you are writing a lot of data into your HBase cluster and it's causing frequent splitting, you can increase this size to make individual regions bigger. It will reduce splitting, but it will take more time to load balance regions from one server to another.


--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=1073741824"]

hbase.hregion.memstore.flush.size

This parameter governs the maximum size of memstore, in bytes, before it is flushed to disk. By default it is 64 MB. If your workload consists of short bursts of write operations, you might want to increase this limit so all writes stay in memory during the burst and get flushed to disk later. This can boost performance during bursts.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.memstore.flush.size=134217728"]

Access HBase Data with Hive

To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive. Running HBase and Hive separately can improve performance because this allows HBase to fully utilize the cluster resources.

Although it is not recommended for most use cases, you can also run Hive and HBase on the same cluster.

A copy of HBase is installed on the AMI with Hive to provide connection infrastructure to access your HBase cluster. The following sections show how to use the client portion of the copy of HBase on your Hive cluster to connect to HBase on another cluster.

You can use Hive to connect to HBase and manipulate data, performing such actions as exporting data to Amazon S3, importing data from Amazon S3, and querying HBase data.
Note
You can only connect your Hive cluster to a single HBase cluster.

To connect Hive to HBase

1. Create a cluster with Hive installed using the steps in Launch a Cluster and Submit Hive Work (p. 260) and create a cluster with HBase installed using the steps in Install HBase on an Amazon EMR Cluster (p. 318).
2. Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 481).
3. Launch the Hive shell with the following command.

hive

4. Connect the HBase client on your Hive cluster to the HBase cluster that contains your data. In the following example, public-DNS-name is replaced by the public DNS name of the master node of the HBase cluster, for example: ec2-50-19-76-67.compute-1.amazonaws.com. For more information, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482).

set hbase.zookeeper.quorum=public-DNS-name;

To access HBase data from Hive

· After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.

The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called inputTable. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.
Note
The following example uses protobuf-java-2.4.0a.jar in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the following command at the Hive command prompt: ! ls /home/hadoop/lib;

add jar lib/emr-metrics-1.0.jar;
add jar lib/protobuf-java-2.4.0a.jar;

set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com;

create external table inputTable (key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
tblproperties ("hbase.table.name" = "inputTable");

select count(*) from inputTable;
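Once the external table exists, you can use it like any other Hive table. For example, the following sketch exports the HBase data to Amazon S3; the bucket name mybucket and the output prefix are placeholders:

INSERT OVERWRITE DIRECTORY 's3://mybucket/hbase-export/'
SELECT * FROM inputTable;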

View the HBase User Interface

HBase provides a web-based user interface that you can use to monitor your HBase cluster. When you run HBase on Amazon EMR, the web interface runs on the master node and can be viewed using port forwarding, also known as creating an SSH tunnel.

To view the HBase User Interface

1. Use SSH to tunnel into the master node and create a secure connection. For information on how to create an SSH tunnel to the master node, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491). A sketch of such a tunnel command follows these steps.
2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for AWS domains. For a tutorial on how to do this, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
3. With the proxy set and the SSH connection open, you can view the HBase UI by opening a browser window with http://master-public-dns-name:60010/master-status, where master-public-dns-name is the public DNS address of the master server in the HBase cluster. For information on how to locate the public DNS name of a master node, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482).
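For example, the tunnel in step 1 might be created with a command like the following sketch, run from your local machine; the key pair file mykeypair.pem and local port 8157 are placeholders, and your proxy tool would then point at localhost:8157:

ssh -i mykeypair.pem -N -D 8157 hadoop@master-public-dns-name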

View HBase Log Files

As part of its operation, HBase writes log files with details about configuration settings, daemon actions, and exceptions. These log files can be useful for debugging issues with HBase as well as for tracking performance.

If you configure your cluster to persist log files to Amazon S3, you should know that logs are written to Amazon S3 every five minutes, so there may be a slight delay for the latest log files to be available.

To view HBase logs on the master node

· You can view the current HBase logs by using SSH to connect to the master node, and navigating to the /mnt/var/log/hbase directory. These logs will not be available after the cluster ends unless you enable logging to Amazon S3 when the cluster is launched. For information about how to connect to the master node using SSH, see Connect to the Master Node Using SSH (p. 481). After you have connected to the master node using SSH, you can navigate to the log directory using a command like the following.

cd /mnt/var/log/hbase

To view HBase logs on Amazon S3

· To access HBase logs and other cluster logs on Amazon S3, and to have them available after the cluster ends, you must specify an Amazon S3 bucket to receive these logs when you create the cluster. This is done using the --log-uri option. For more information on enabling logging for your cluster, see Configure Logging and Debugging (Optional) (p. 160).
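For example, a hedged sketch of creating an HBase cluster with logging enabled using the AWS CLI (the bucket path is a placeholder):

aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" \
--instance-count 5 --instance-type c1.xlarge --log-uri s3://mybucket/hbase-logs/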

Monitor HBase with CloudWatch

Amazon EMR reports three metrics to CloudWatch that you can use to monitor your HBase backups. These metrics are pushed to CloudWatch at five-minute intervals, and are provided without charge. For more information about using CloudWatch to monitor Amazon EMR metrics, see Monitor Metrics with CloudWatch (p. 455).

HBaseBackupFailed
    Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.
    Use Case: Monitor HBase backups
    Units: Count

HBaseMostRecentBackupDuration
    The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes since the backup started. This metric is only reported for HBase clusters.
    Use Case: Monitor HBase backups
    Units: Minutes

HBaseTimeSinceLastSuccessfulBackup
    The number of elapsed minutes since the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.
    Use Case: Monitor HBase backups
    Units: Minutes
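As a sketch, you could also retrieve one of these metrics from the command line with the AWS CLI; the AWS/ElasticMapReduce namespace, the JobFlowId dimension name, the cluster ID, and the time range shown here are assumptions to verify against your own account:

aws cloudwatch get-metric-statistics --namespace AWS/ElasticMapReduce \
--metric-name HBaseBackupFailed --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
--start-time 2014-01-01T00:00:00 --end-time 2014-01-02T00:00:00 \
--period 300 --statistics Maximum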

Monitor HBase with Ganglia

The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your cluster, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. For more information about the Ganglia open-source project, go to http://ganglia.info/. For more information about using Ganglia with Amazon EMR clusters, see Monitor Performance with Ganglia (p. 472).

You configure Ganglia for HBase using the configure-hbase-for-ganglia bootstrap action. This bootstrap action configures HBase to publish metrics to Ganglia.
Note
You must configure HBase and Ganglia when you launch the cluster; Ganglia reporting cannot be added to a running cluster.

Once the cluster is launched with Ganglia reporting configured, you can access the Ganglia graphs and reports using the graphical interface running on the master node.

Ganglia also stores log files on the server at /mnt/var/log/ganglia/rrds. If you configured your cluster to persist log files to an Amazon S3 bucket, the Ganglia log files will be persisted there as well.

To configure a cluster for Ganglia and HBase using the AWS CLI

· Launch the cluster and specify the configure-hbase-for-ganglia bootstrap action. This is shown in the following example.
Note
You can prefix the Amazon S3 bucket path with the region where your HBase cluster was launched, for example s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia. For a list of regions supported by Amazon EMR, see Choose an AWS Region (p. 30).

Type the following command to launch a cluster with Ganglia and HBase installed and to configure HBase for Ganglia using a bootstrap action:


aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase,Name=Ganglia --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
--bootstrap-action Path=s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

Note When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To configure an HBase cluster for Ganglia using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Launch the cluster and specify both the install-ganglia and configure-hbase-for-ganglia bootstrap actions. This is shown in the following example.
Note
You can prefix the Amazon S3 bucket path with the region where your HBase cluster was launched, for example s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia. For a list of regions supported by Amazon EMR, see Choose an AWS Region (p. 30).

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--bootstrap-action s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

· Windows users:

ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-action s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

To view HBase metrics in the Ganglia web interface

1. Use SSH to tunnel into the master node and create a secure connection. For information on how to create an SSH tunnel to the master node, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).


2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for AWS domains. For a tutorial on how to do this, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
3. With the proxy set and the SSH connection open, you can view the Ganglia metrics by opening a browser window with http://master-public-dns-name/ganglia/, where master-public-dns-name is the public DNS address of the master server in the HBase cluster. For information on how to locate the public DNS name of a master node, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482).

To view Ganglia log files on the master node

· If the cluster is still running, you can access the log files by using SSH to connect to the master node and navigating to the /mnt/var/log/ganglia/rrds directory. For information about how to use SSH to connect to the master node, see Connect to the Master Node Using SSH (p. 481).

To view Ganglia log files on Amazon S3

· If you configured the cluster to persist log files to Amazon S3 when you launched it, the Ganglia log files will be written there as well. Logs are written to Amazon S3 every five minutes, so there may be a slight delay for the latest log files to be available. For more information, see View HBase Log Files (p. 348).


Analyze Amazon Kinesis Data

Starting with AMI 3.0.4, Amazon EMR clusters can read and process Amazon Kinesis streams directly, using familiar tools in the Hadoop ecosystem such as Hive, Pig, MapReduce, the Hadoop Streaming API, and Cascading. You can also join real-time data from Amazon Kinesis with existing data on Amazon S3, Amazon DynamoDB, and HDFS in a running cluster. You can directly load the data from Amazon EMR to Amazon S3 or DynamoDB for post-processing activities. For information about Amazon Kinesis service highlights and pricing, see Amazon Kinesis.

What Can I Do With Amazon EMR and Amazon Kinesis Integration?

Integration between Amazon EMR and Amazon Kinesis makes certain scenarios much easier; for example:

· Streaming log analysis: You can analyze streaming web logs to generate a list of top 10 error types every few minutes by region, browser, and access domain.
· Customer engagement: You can write queries that join clickstream data from Amazon Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads that are displayed on particular websites.
· Ad-hoc interactive queries: You can periodically load data from Amazon Kinesis streams into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.

Checkpointed Analysis of Amazon Kinesis Streams

Users can run periodic, batched analysis of Amazon Kinesis streams in what are called iterations. Because Amazon Kinesis stream data records are retrieved by using a sequence number, iteration boundaries are defined by starting and ending sequence numbers that Amazon EMR stores in a DynamoDB table. For example, when iteration0 ends, it stores the ending sequence number in DynamoDB so that when the iteration1 job begins, it can retrieve subsequent data from the stream. This mapping of iterations in stream data is called checkpointing. For more details, see Kinesis Connector.


If an iteration was checkpointed and the job failed processing an iteration, Amazon EMR attempts to reprocess the records in that iteration, provided that the data records have not reached the 24-hour limit for Amazon Kinesis streams.

Checkpointing is a feature that allows you to:

· Start data processing after a sequence number processed by a previous query that ran on the same stream and logical name
· Re-process the same batch of data from Amazon Kinesis that was processed by an earlier query

To enable checkpointing, set the kinesis.checkpoint.enabled parameter to true in your scripts. Also, configure the following parameters:

kinesis.checkpoint.metastore.table.name
    DynamoDB table name where checkpoint information will be stored
kinesis.checkpoint.metastore.hash.key.name
    Hash key name for the DynamoDB table
kinesis.checkpoint.metastore.range.key.name
    Range key name for the DynamoDB table
kinesis.checkpoint.logical.name
    A logical name for current processing
kinesis.checkpoint.iteration.no
    Iteration number for processing associated with the logical name
kinesis.rerun.iteration.without.wait
    Boolean value that indicates if a failed iteration can be rerun without waiting for timeout; the default is false
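For example, a Hive session that enables checkpointing might set these values as follows; this sketch reuses the table, key names, and logical name that appear later in the Hive tutorial in this chapter:

set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=TestLogicalName;
set kinesis.checkpoint.iteration.no=0;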

Provisioned IOPS Recommendations for Amazon DynamoDB Tables

The Amazon EMR connector for Amazon Kinesis uses the DynamoDB database as its backing store for checkpointing metadata. You must create a table in DynamoDB before consuming data in an Amazon Kinesis stream with an Amazon EMR cluster in checkpointed intervals. The following are general recommendations for the number of IOPS you should provision for your DynamoDB tables; let j be the maximum number of Hadoop jobs (with different logical name + iteration number combinations) that can run concurrently and s be the maximum number of shards that any job will process:

For Read Capacity Units: j*s/5

For Write Capacity Units: j*s
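For example, if at most j = 2 jobs run concurrently and each job processes at most s = 10 shards, you would provision 2*10/5 = 4 read capacity units and 2*10 = 20 write capacity units.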


Performance Considerations

Amazon Kinesis shard throughput is directly proportional to the instance size of nodes in Amazon EMR clusters and record size in the stream. We recommend that you use m1.xlarge or larger instances on master and core nodes for production workloads.

Tutorial: Analyzing Amazon Kinesis Streams with Amazon EMR and Hive

This tutorial demonstrates how to use Amazon EMR to query and analyze incoming data from an Amazon Kinesis stream using Hive. The instructions in this tutorial include how to:

· Sign up for an AWS account
· Create an Amazon Kinesis stream
· Use the Amazon Kinesis publisher sample application to populate the stream with sample Apache web log data
· Create an interactive Amazon EMR cluster for use with Hive
· Connect to the cluster and perform operations on stream data using Hive

In addition to the console used in this tutorial, Amazon EMR provides a command-line client, a REST-like API set, and several SDKs that you can use to launch and manage clusters. For more information about these interfaces, see What Tools are Available for Amazon EMR? (p. 10).
Note
The Log4J Appender for Amazon Kinesis currently only works with streams created in the US East (N. Virginia) region.

Topics
· Sign Up for the Service (p. 354)
· Create an Amazon Kinesis Stream (p. 355)
· Create a DynamoDB Table (p. 357)
· Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File (p. 358)
· Start Amazon Kinesis Publisher Sample Application (p. 359)
· Launch the Cluster (p. 360)
· Run the Ad-hoc Hive Query (p. 366)
· Running Queries with Checkpoints (p. 369)
· Scheduling Scripted Queries (p. 370)

Sign Up for the Service

If you do not have an AWS account, use the following procedure to create one.

To sign up for AWS

1. Open http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.


AWS notifies you by email when your account is active and available for you to use. Your AWS account gives you access to all services, but you are charged only for the resources that you use. For this example walk-through, the charges will be minimal.

Create an Amazon Kinesis Stream

Before you create an Amazon Kinesis stream, you must determine the size that you need the stream to be. For information about determining stream size, see How Do I Size an Amazon Kinesis Stream? in the Amazon Kinesis Developer Guide.

For more information about the endpoints available for Amazon Kinesis, see Regions and Endpoints in the Amazon Web Services General Reference.

To create a stream

1. Sign in to the AWS Management Console and go to the Amazon Kinesis console at https://console.aws.amazon.com/kinesis/.

If you haven't yet signed up for the Amazon Kinesis service, you'll be prompted to sign up when you go to the console.
2. Select the US East (N. Virginia) region in the region selector.

3. Click Create Stream.
4. On the Create Stream page, provide a name for your stream (for example, AccessLogStream), specify 2 shards, and then click Create.


On the Stream List page, your stream's Status value is CREATING while the stream is being created. When the stream is ready to use, the Status value changes to ACTIVE.

5. Click the name of your stream. The Stream Details page displays a summary of your stream configuration, along with monitoring information.


For information about how to create an Amazon Kinesis stream programmatically, see Using the Amazon Kinesis Service API in the Amazon Kinesis Developer Guide.

Create a DynamoDB Table

The Amazon EMR connector for Amazon Kinesis uses the DynamoDB database as its backing database for checkpointing. You must create a table in DynamoDB before consuming data in an Amazon Kinesis stream with an Amazon EMR cluster in checkpointed intervals.
Note
If you have completed any other tutorials that use the same DynamoDB table, you do not need to create the table again. However, you must clear that table's data before you use it for the checkpointing scripts in this tutorial.

To Create an Amazon DynamoDB Database for Use By the Amazon EMR Connector for Amazon Kinesis

1. Using the DynamoDB console in us-east-1, create a table with the name MyEMRKinesisTable.
2. For the Primary Key Type, choose Hash and Range.
3. For the Hash Attribute Name, use HashKey.
4. For the Range Attribute Name, use RangeKey.
5. Click Continue.
6. You do not need to add any indexes for this tutorial. Click Continue.
7. For Read Capacity Units and Write Capacity Units, use 10 IOPS for each. For more advice on provisioned IOPS, see Provisioned IOPS Recommendations for Amazon DynamoDB Tables (p. 353).
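If you prefer the command line, a roughly equivalent sketch with the AWS CLI is shown below; the String (S) attribute types are an assumption, and the console steps above remain the documented path:

aws dynamodb create-table --table-name MyEMRKinesisTable \
--attribute-definitions AttributeName=HashKey,AttributeType=S AttributeName=RangeKey,AttributeType=S \
--key-schema AttributeName=HashKey,KeyType=HASH AttributeName=RangeKey,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10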

Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File

The Amazon Kinesis Log4j Appender is an implementation of the Apache Log4J Appender Interface that pushes Log4J output directly to a user-specified Amazon Kinesis stream without requiring any custom code. The implementation uses the AWS SDK for Java APIs for Amazon Kinesis and is configurable using the log4j.properties file. Users who would like to utilize the Amazon Kinesis Log4j Appender independent of this sample can download the jar file here. To simplify the steps in this tutorial, the sample app referenced below incorporates a JAR file and provides a default configuration for the appender. Users who would like to experiment with the full functionality of the publisher sample application can modify the log4j.properties. The configurable options are:

log4j.properties Config Options

log4j.appender.KINESIS.streamName
    Default: AccessLogStream
    Description: Stream name to which data is to be published.
log4j.appender.KINESIS.encoding
    Default: UTF-8
    Description: Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
log4j.appender.KINESIS.maxRetries
    Default: 3
    Description: Maximum number of retries when calling Kinesis APIs to publish a log message.
log4j.appender.KINESIS.backoffInterval
    Default: 100ms
    Description: Milliseconds to wait before a retry attempt.
log4j.appender.KINESIS.threadCount
    Default: 20
    Description: Number of parallel threads for publishing logs to the configured Kinesis stream.
log4j.appender.KINESIS.bufferSize
    Default: 2000
    Description: Maximum number of outstanding log messages to keep in memory.
log4j.appender.KINESIS.shutdownTimeout
    Default: 30
    Description: Seconds to send buffered messages before the application JVM quits normally.
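Put together, the appender portion of a log4j.properties file might look like the following sketch, which simply restates the defaults from the table above; the exact value formats (for example, whether backoffInterval is expressed in bare milliseconds) are assumptions to verify against the downloaded sample file:

log4j.appender.KINESIS.streamName=AccessLogStream
log4j.appender.KINESIS.encoding=UTF-8
log4j.appender.KINESIS.maxRetries=3
log4j.appender.KINESIS.backoffInterval=100
log4j.appender.KINESIS.threadCount=20
log4j.appender.KINESIS.bufferSize=2000
log4j.appender.KINESIS.shutdownTimeout=30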

The Amazon Kinesis publisher sample application is kinesis-log4j-appender-1.0.0.jar and requires Java 1.7 or later.

To download and configure the Amazon Kinesis Log4j Appender tool:

1. Download the Amazon Kinesis Log4j Appender JAR from http://emr-kinesis.s3.amazonaws.com/publisher/kinesis-log4j-appender-1.0.0.jar.
2. Create a file in the same folder where you downloaded kinesis-log4j-appender-1.0.0.jar called AwsCredentials.properties, and edit it with your credentials:


accessKey=
secretKey=

Replace the accessKey and secretKey values with the access key and secret key from your AWS account. For more information about access keys, see How Do I Get Security Credentials? in the AWS General Reference.
3. Download and save the sample access log file, access_log_1, from http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1 in the same directory where you saved the credentials and JAR file.
4. (Optional) In the same directory, download log4j.properties from http://emr-kinesis.s3.amazonaws.com/publisher/log4j.properties and modify the settings according to your own application's needs.

Start Amazon Kinesis Publisher Sample Application

The next step is to start the Amazon Kinesis publisher tool.

To Start Amazon Kinesis Publisher for One-Time Publishing

1. In the same directory path where you have the JAR file, credentials, and log file, run the following from the command line:

· Linux, UNIX, and Mac OS X users:

${JAVA_HOME}/bin/java -cp .:kinesis-log4j-appender-1.0.0.jar com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1

· Windows users:

%JAVA_HOME%/bin/java -cp .;kinesis-log4j-appender-1.0.0.jar com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1

2. Amazon Kinesis Publisher will upload each row of the log file to Amazon Kinesis until there are no rows remaining.

[...]
DEBUG [main] (FilePublisher.java:62) - 39100 records written
DEBUG [main] (FilePublisher.java:62) - 39200 records written
DEBUG [main] (FilePublisher.java:62) - 39300 records written
INFO [main] (FilePublisher.java:66) - Finished publishing 39344 log events from access_log_1, took 229 secs to publish
INFO [main] (FilePublisher.java:68) - DO NOT kill this process, publisher threads will keep on sending buffered logs to Kinesis

Note The message "INFO [main] (FilePublisher.java:68) - DO NOT kill this process, publisher threads will keep on sending buffered logs to

API Version 2009-03-31 359 Amazon Elastic MapReduce Developer Guide Launch the Cluster

Kinesis" may appear after have published your records to the Amazon Kinesis stream. For the purposes of this tutorial, it is alright to kill this process once you have reached this message.

To Start Amazon Kinesis Publisher for Continuous Publishing on Linux

If you want to simulate continuous publishing to your Amazon Kinesis stream for use with checkpointing scripts, follow these steps on Linux systems to run this shell script that loads the sample logs to your Amazon Kinesis stream every 400 seconds:

1. Download the file publisher.sh from http://emr-kinesis.s3.amazonaws.com/publisher/publisher.sh.
2. Make publisher.sh executable:

% chmod +x publisher.sh

3. Create a log directory to which publisher.sh output redirects, /tmp/cronlogs:

% mkdir /tmp/cronlogs

4. Run publisher.sh with the following nohup command:

% nohup ./publisher.sh 1>>/tmp/cronlogs/publisher.log 2>>/tmp/cronlogs/publisher.log &

5. Important
This script will run indefinitely. Terminate the script to avoid further charges upon completion of the tutorial.

To Start Amazon Kinesis Publisher for Continuous Publishing on Windows

If you want to simulate continuous publishing to your Amazon Kinesis stream for use with checkpointing scripts, follow these steps on Windows systems to run this batch script that loads the sample logs to your Amazon Kinesis stream every 400 seconds:

1. Download the file publisher.bat from http://emr-kinesis.s3.amazonaws.com/publisher/publisher.bat.
2. Run publisher.bat at the command prompt by typing it and pressing return. You can optionally double-click the file in Windows Explorer.
3. Important
This script will run indefinitely. Terminate the script to avoid further charges upon completion of the tutorial.

Launch the Cluster

The next step is to launch the cluster. This tutorial provides the steps to launch the cluster using both the Amazon EMR console and the Amazon EMR CLI. Choose the method that best meets your needs. When you launch the cluster, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These EC2 instances are preloaded with an Amazon Machine Image (AMI) that has been customized for Amazon EMR and which has Hadoop and other big data applications preloaded.


To launch a cluster for use with Amazon Kinesis using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Cluster name
    Enter a descriptive name for your cluster. The name is optional, and does not need to be unique.

Termination protection
    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging
    This determines whether Amazon EMR captures detailed log data to Amazon S3. For more information, see View Log Files (p. 449).

Log folder S3 location
    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you. When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes. For more information, see View Log Files (p. 449).

Debugging
    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.

Hadoop distribution
    Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version
    Choose 3.0.4 (Hadoop 2.2.0). The connector for Amazon Kinesis comes with this Amazon EMR AMI version or newer. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table.
Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Network
    Choose Launch into EC2-Classic. Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone
    Choose No preference. Optionally, you can launch the cluster in a specific Amazon EC2 Availability Zone. For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

Master
    Choose m1.large. The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster. This specifies the EC2 instance types to use as master nodes. Valid types are m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge. This tutorial uses m1.large instances for all nodes. For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances
    Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).


Core
    Choose m1.large. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. This specifies the EC2 instance types to use as core nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge. This tutorial uses m1.large instances for all nodes. For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count
    Choose 2.

Request Spot Instances
    Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task
    Choose m1.large. Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity that your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon. This specifies the EC2 instance types to use as task nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge. For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count
    Choose 0.

Request Spot Instances
    Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Note
To save costs, we recommend using m1.large instance types for this tutorial. For production workloads, we recommend at least m1.xlarge instance types.
6. In the Security and Access section, complete the fields according to the following table.


EC2 key pair
    Choose your Amazon EC2 key pair from the list. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142). Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access
    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144). Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role
    Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role. This role allows Amazon EMR to access other AWS services on your behalf. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile
    You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role. This controls application access to the Amazon EC2 instances in the cluster. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions and Steps sections, you do not need to change any of these settings.
8. Review your configuration and if you are satisfied with the settings, click Create Cluster.
9. When the cluster starts, the console displays the Cluster Details page.


To create a cluster using the Amazon EMR CLI

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name EmrKinesisTutorial --ami-version 3.0.4 --instance-type m1.xlarge \
--hive-interactive --pig-interactive --num-instances 3 --ssh --key-pair example-keypair

· Windows users:

ruby elastic-mapreduce --create --alive --name EmrKinesisTutorial --ami-version 3.0.4 --instance-type m1.xlarge --hive-interactive --pig-interactive --num-instances 3 --ssh --key-pair example-keypair
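Because the Amazon EMR CLI is deprecated in favor of the AWS CLI, a roughly equivalent AWS CLI sketch is shown below; the exact --applications and --ec2-attributes syntax for your CLI version is an assumption to verify:

aws emr create-cluster --ami-version 3.0.4 --name EmrKinesisTutorial \
--applications Name=Hive Name=Pig --instance-count 3 --instance-type m1.xlarge \
--ec2-attributes KeyName=example-keypair --no-auto-terminate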

Run the Ad-hoc Hive Query

To run an ad-hoc Hive query

1. Connect to the master node of the cluster using SSH and run the commands shown in the following steps. Your client operating system determines which steps to use to connect to the cluster. For more information, see Connect to the Cluster (p. 481).
2. In the SSH window, from the home directory, start the Hive shell by running the following command:

~/bin/hive

3. Run the following query to create a table apachelog by parsing the records in the Amazon Kinesis stream AccessLogStream:

DROP TABLE apachelog;

CREATE TABLE apachelog (
  host STRING,
  IDENTITY STRING,
  USER STRING,
  TIME STRING,
  request STRING,
  STATUS STRING,
  SIZE STRING,
  referrer STRING,
  agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") ([0-9]*) ([0-9]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");

This query uses RegexSerde to parse the Apache web log format into individual columns. Note how this query specifies the Amazon Kinesis stream name.
4. Optional additional configurations can be specified as part of the table definition using the following additional lines; for example:

...
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES(
  "kinesis.stream.name"="AccessLogStream",
  "kinesis.accessKey"="AwsAccessKey",
  "kinesis.secretKey"="AwsSecretKey",
  "kinesis.nodata.timeout"="1",
  "kinesis.iteration.timeout"="5",
  "kinesis.records.batchsize"="1000",
  "kinesis.endpoint.region"="us-east-1",
  "kinesis.retry.interval"="1000",
  "kinesis.retry.maxattempts"="3"
);

In addition, these optional properties can alternatively be set using global variables before running the actual query:

...
hive> SET kinesis.stream.name=AccessLogStream;
hive> SET kinesis.nodata.timeout=1;
hive> ...

Note Values in these table properties always override the global configuration values.

The following table provides information about other configuration properties that you can set in the table definition, as well as global variables:

kinesis.stream.name
    Default Value: (none)
    Description: Amazon Kinesis stream name as the source of data.

kinesis.accessKey
    Default Value: Amazon S3 credentials set in the cluster, or IAM-based role credentials if Amazon S3 credentials are absent.
    Description: AWS access key.

kinesis.secretKey
    Default Value: Amazon S3 credentials set in the cluster, or IAM-based role credentials if Amazon S3 credentials are absent.
    Description: AWS secret key.

kinesis.nodata.timeout
    Default Value: 5
    Description: Timeout, in minutes (integer), to finish this iteration if no data is received continuously for this duration.

kinesis.iteration.timeout
    Default Value: 15
    Description: Maximum duration in minutes (integer) to run this iteration. The cluster creates a checkpoint at the end.

kinesis.records.batchsize
    Default Value: 1000
    Description: Number of records to get from the Amazon Kinesis stream in a single GetRecords API call. Cannot be more than 10000 (a limit enforced by the Amazon Kinesis API).

kinesis.endpoint.region
    Default Value: us-east-1
    Description: Amazon Kinesis endpoint region. You may experience better performance by choosing an Amazon Kinesis region closest to the Amazon EMR cluster region.

kinesis.retry.interval
    Default Value: 500
    Description: Retry interval, in msec (integer), for a failure when calling the Amazon Kinesis API.

kinesis.retry.maxattempts
    Default Value: 5
    Description: The maximum number of retries in case of a failure before giving up.

5. Run the following query to analyze the Hive table created in Step 3 (p. 366). This query counts the number of visitors coming from Windows or Linux operating systems who got a 404 error:

SELECT OS, COUNT(*) AS COUNT
FROM (
  SELECT regexp_extract(agent,'.*(Windows|Linux).*',1) AS OS
  FROM apachelog
  WHERE STATUS=404
) X
WHERE OS IN ('Windows','Linux')
GROUP BY OS;

6. Check the output of the analysis query, which shows the 404 error count for the Windows and Linux operating systems.


7. Important
Remember to terminate your cluster to avoid additional charges.

Running Queries with Checkpoints

Note
If you have completed any other tutorials that use the same DynamoDB table, you must clear that table data before you execute these commands.

You can process data in a running Amazon Kinesis stream and store the results in Amazon S3 using Hive's dynamic partitions and the previously-created table apachelog, as shown in the following example:

CREATE TABLE apachelog_s3 (os string, error_count int) PARTITIONED BY(iteration_no int) LOCATION ©my s3 location©; set kinesis.checkpoint.enabled=true; set kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable; set kinesis.checkpoint.metastore.hash.key.name=HashKey; set kinesis.checkpoint.metastore.range.key.name=RangeKey; set kinesis.checkpoint.logical.name=TestLogicalName; set kinesis.checkpoint.iteration.no=0;

--The following query will create OS-ERROR_COUNT result under dynamic partition for iteration no 0
INSERT OVERWRITE TABLE apachelog_s3 partition (iteration_no=${hiveconf:kinesis.checkpoint.iteration.no})
SELECT OS, COUNT(*) AS COUNT
FROM (
  SELECT regexp_extract(agent,'.*(Windows|Linux).*',1) AS OS
  FROM apachelog
  WHERE STATUS=404
) X
WHERE OS IN ('Windows','Linux')
GROUP BY OS;
set kinesis.rerun.iteration.without.wait=true;
set kinesis.checkpoint.iteration.no=1;


--The following query will create OS-ERROR_COUNT result under dynamic partition for iteration no 1
INSERT OVERWRITE TABLE apachelog_s3 partition (iteration_no=${hiveconf:kinesis.checkpoint.iteration.no})
SELECT OS, COUNT(*) AS COUNT
FROM (
  SELECT regexp_extract(agent,'.*(Windows|Linux).*',1) AS OS
  FROM apachelog
  WHERE STATUS=404
) X
WHERE OS IN ('Windows','Linux')
GROUP BY OS;

Scheduling Scripted Queries

You can schedule scripts to run on your Hadoop cluster using the Linux cron system daemon on the master node. This is especially useful when processing Amazon Kinesis stream data at regular intervals.

To set up a cronjob for scheduled runs

1. Connect to your cluster's master node using SSH. For more information about connecting to your cluster, see the section called "Connect to the Master Node Using SSH" (p. 481).
2. Create a directory for all your scheduling-related resources called /home/hadoop/crontab:

% mkdir crontab

3. Download executor.sh, hive.config, create_table.q, and user_agents_count.q in /home/hadoop/crontab using the wget command:

wget http://emr-kinesis.s3.amazonaws.com/crontab/executor.sh http://emr-kinesis.s3.amazonaws.com/crontab/hive.config http://emr-kinesis.s3.amazonaws.com/crontab/create_table.q http://emr-kinesis.s3.amazonaws.com/crontab/user_agents_count.q

4. Edit hive.config to replace the value of the SCRIPTS variable with the full path of the script. If the directory you created in Step 2 is /home/hadoop/crontab, you do not need to do anything. Note If you have more than one script, you can specify a space-delimited list of pathnames. For example:

SCRIPTS="/home/hadoop/crontab/hivescript1 /home/hadoop/crontab/hivescript2"

5. Edit create_table.q. At the end of the script, edit the LOCATION:

LOCATION 's3://<your-s3-bucket>/hive';

Replace <your-s3-bucket> with the name of your Amazon S3 bucket. Save and exit the editor.
6. Create a temporary directory, /tmp/cronlogs, for storing the log output generated by the cronjobs:


mkdir /tmp/cronlogs

7. Make executor.sh executable:

% chmod +x executor.sh

8. Edit your crontab by executing crontab -e and inserting the following line in the editor:

*/15 * * * * /home/hadoop/crontab/executor.sh /home/hadoop/crontab/hive.config 1>>/tmp/cronlogs/hive.log 2>>/tmp/cronlogs/hive.log

Save and exit the editor; the crontab is updated upon exit. You can verify the crontab entries by executing crontab -l.

Tutorial: Analyzing Amazon Kinesis Streams with Amazon EMR and Pig

This tutorial demonstrates how to use Amazon EMR to query and analyze incoming data from an Amazon Kinesis stream using Pig. The instructions in this tutorial include how to:

· Sign up for an AWS account
· Create an Amazon Kinesis stream
· Use the Amazon Kinesis publisher sample application to populate the stream with sample Apache web log data
· Create an interactive Amazon EMR cluster for use with Pig
· Connect to the cluster and perform operations on Amazon Kinesis stream data using Pig

In addition to the console used in this tutorial, Amazon EMR provides a command-line client, a REST-like API set, and several SDKs that you can use to launch and manage clusters. For more information about these interfaces, see What Tools are Available for Amazon EMR? (p. 10).
Note The Log4J Appender for Amazon Kinesis currently works only with streams created in the US East (N. Virginia) region.

Topics
· Sign Up for the Service (p. 372)
· Create an Amazon Kinesis Stream (p. 372)
· Create a DynamoDB Table (p. 374)
· Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File (p. 375)
· Start Amazon Kinesis Publisher Sample Application (p. 376)
· Launch the Cluster (p. 377)
· Run the Pig Script (p. 383)
· Scheduling Scripted Queries (p. 386)


Sign Up for the Service

If you do not have an AWS account, use the following procedure to create one.

To sign up for AWS

1. Open http://aws.amazon.com and click Sign Up. 2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use. Your AWS account gives you access to all services, but you are charged only for the resources that you use. For this example walk-through, the charges will be minimal.

Create an Amazon Kinesis Stream

Before you create an Amazon Kinesis stream, you must determine the size that you need the stream to be. For information about determining stream size, see How Do I Size an Amazon Kinesis Stream? in the Amazon Kinesis Developer Guide.

For more information about the endpoints available for Amazon Kinesis, see Regions and Endpoints in the Amazon Web Services General Reference.

To create a stream

1. Sign in to the AWS Management Console and go to the Amazon Kinesis console at https://console.aws.amazon.com/kinesis/.

If you haven't yet signed up for the Amazon Kinesis service, you'll be prompted to sign up when you go to the console.
2. Select the US East (N. Virginia) region in the region selector.

3. Click Create Stream. 4. On the Create Stream page, provide a name for your stream (for example, AccessLogStream), specify 2 shards, and then click Create.


On the Stream List page, your stream's Status value is CREATING while the stream is being created. When the stream is ready to use, the Status value changes to ACTIVE.

5. Click the name of your stream. The Stream Details page displays a summary of your stream configuration, along with monitoring information.
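For reference, a stream with the same name and shard count can also be created from the command line. The following is a sketch that assumes the AWS CLI is installed and configured for the us-east-1 region:

aws kinesis create-stream --stream-name AccessLogStream --shard-count 2
# Optionally poll until the stream status changes from CREATING to ACTIVE
aws kinesis describe-stream --stream-name AccessLogStream --query 'StreamDescription.StreamStatus'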


For information about how to create an Amazon Kinesis stream programmatically, see Using the Amazon Kinesis Service API in the Amazon Kinesis Developer Guide.

Create a DynamoDB Table

The Amazon EMR connector for Amazon Kinesis uses DynamoDB as its backing database for checkpointing. You must create a table in DynamoDB before consuming data in an Amazon Kinesis stream with an Amazon EMR cluster in checkpointed intervals.
Note If you have completed any other tutorials that use the same DynamoDB table, you do not need to create the table again. However, you must clear that table's data before you use it for the checkpointing scripts in this tutorial.

To Create an Amazon DynamoDB Database for Use By the Amazon EMR Connector for Amazon Kinesis

1. Using the DynamoDB console in us-east-1, create a table with the name MyEMRKinesisTable.
2. For the Primary Key Type, choose Hash and Range.
3. For the Hash Attribute Name, use HashKey.
4. For the Range Attribute Name, use RangeKey.
5. Click Continue.
6. You do not need to add any indexes for this tutorial. Click Continue.
7. For Read Capacity Units and Write Capacity Units, use 10 IOPS for each. For more advice on provisioned IOPS, see Provisioned IOPS Recommendations for Amazon DynamoDB Tables (p. 353).
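If you would rather create the checkpoint table from the command line, the following AWS CLI sketch creates an equivalent table; it assumes string (S) key attributes and uses the same 10 read and 10 write capacity units described above:

aws dynamodb create-table --table-name MyEMRKinesisTable \
  --attribute-definitions AttributeName=HashKey,AttributeType=S AttributeName=RangeKey,AttributeType=S \
  --key-schema AttributeName=HashKey,KeyType=HASH AttributeName=RangeKey,KeyType=RANGE \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10 \
  --region us-east-1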

Download Log4J Appender for Amazon Kinesis Sample Application, Sample Credentials File, and Sample Log File

The Amazon Kinesis Log4j Appender is an implementation of the Apache Log4J Appender Interface that will push Log4J output directly to a user-specified Amazon Kinesis stream without requiring any custom code. The implementation uses the AWS SDK for Java APIs for Amazon Kinesis and is configurable using the log4j.properties file. Users who would like to utilize the Amazon Kinesis Log4j Appender independent of this sample can download the jar file here. To simplify the steps in this tutorial, the sample app referenced below incorporates a JAR file and provides a default configuration for the appender. Users who would like to experiment with the full functionality of the publisher sample application can modify the log4j.properties. The configurable options are:

log4j.properties Config Options

log4j.appender.KINESIS.streamName
Default: AccessLogStream
Stream name to which data is to be published.

log4j.appender.KINESIS.encoding
Default: UTF-8
Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.

log4j.appender.KINESIS.maxRetries
Default: 3
Maximum number of retries when calling Kinesis APIs to publish a log message.

log4j.appender.KINESIS.backoffInterval
Default: 100ms
Milliseconds to wait before a retry attempt.

log4j.appender.KINESIS.threadCount
Default: 20
Number of parallel threads for publishing logs to the configured Kinesis stream.

log4j.appender.KINESIS.bufferSize
Default: 2000
Maximum number of outstanding log messages to keep in memory.

log4j.appender.KINESIS.shutdownTimeout
Default: 30
Seconds to send buffered messages before the application JVM quits normally.

The Amazon Kinesis publisher sample application is kinesis-log4j-appender-1.0.0.jar and requires Java 1.7 or later.
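You can verify the Java version on the machine that will run the publisher before you begin; this sketch assumes JAVA_HOME is set, as in the run commands later in this section:

${JAVA_HOME}/bin/java -version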

To download and configure the Amazon Kinesis Log4j Appender tool:

1. Download the Amazon Kinesis Log4j Appender JAR from http://emr-kinesis.s3.amazonaws.com/publisher/kinesis-log4j-appender-1.0.0.jar.
2. Create a file in the same folder where you downloaded kinesis-log4j-appender-1.0.0.jar called AwsCredentials.properties, and edit it with your credentials:


accessKey=<your-access-key>
secretKey=<your-secret-key>

Replace <your-access-key> and <your-secret-key> with the accessKey and secretKey from your AWS account. For more information about access keys, see How Do I Get Security Credentials? in the AWS General Reference.
3. Download and save the sample access log file, access_log_1, from http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1 in the same directory where you saved the credentials and JAR file.
4. (Optional) In the same directory, download log4j.properties from http://emr-kinesis.s3.amazonaws.com/publisher/log4j.properties and modify the settings according to your own application's needs.
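The downloads in steps 1, 3, and 4 can also be scripted. The following sketch runs wget from the working directory, using the URLs given above:

wget http://emr-kinesis.s3.amazonaws.com/publisher/kinesis-log4j-appender-1.0.0.jar
wget http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1
wget http://emr-kinesis.s3.amazonaws.com/publisher/log4j.properties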

Start Amazon Kinesis Publisher Sample Application

The next step is to start the Amazon Kinesis publisher tool.

To Start Amazon Kinesis Publisher for One-Time Publishing

1. In the same directory path where you have the JAR file, credentials, and log file, run the following from the command line:

· Linux, UNIX, and Mac OS X users:

${JAVA_HOME}/bin/java -cp .:kinesis-log4j-appender-1.0.0.jar com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1

· Windows users:

%JAVA_HOME%/bin/java -cp .;kinesis-log4j-appender-1.0.0.jar com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1

2. Amazon Kinesis Publisher will upload each row of the log file to Amazon Kinesis until there are no rows remaining.

[...]
DEBUG [main] (FilePublisher.java:62) - 39100 records written
DEBUG [main] (FilePublisher.java:62) - 39200 records written
DEBUG [main] (FilePublisher.java:62) - 39300 records written
INFO [main] (FilePublisher.java:66) - Finished publishing 39344 log events from access_log_1, took 229 secs to publish
INFO [main] (FilePublisher.java:68) - DO NOT kill this process, publisher threads will keep on sending buffered logs to Amazon Kinesis

Note The message "INFO [main] (FilePublisher.java:68) - DO NOT kill this process, publisher threads will keep on sending buffered logs to Kinesis" may appear after you have published your records to the Amazon Kinesis stream. For the purposes of this tutorial, it is all right to kill this process once you have reached this message.

To Start Amazon Kinesis Publisher for Continuous Publishing on Linux

If you want to simulate continuous publishing to your Amazon Kinesis stream for use with checkpointing scripts, follow these steps on Linux systems to run this shell script that loads the sample logs to your Amazon Kinesis stream every 400 seconds:

1. Download the file, publisher.sh from http://emr-kinesis.s3.amazonaws.com/publisher/publisher.sh 2. Make publisher.sh executable:

% chmod +x publisher.sh

3. Create a log directory to which publisher.sh output redirects, /tmp/cronlogs:

% mkdir /tmp/cronlogs

4. Run publisher.sh with the following nohup command:

% nohup ./publisher.sh 1>>/tmp/cronlogs/publisher.log 2>>/tmp/cronlogs/publisher.log &

5. Important This script will run indefinitely. Terminate the script to avoid further charges upon completion of the tutorial.
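One way to stop the publisher when you have finished the tutorial is shown below; this is a sketch that assumes no other processes on the machine match the script name:

pkill -f publisher.sh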

To Start Amazon Kinesis Publisher for Continuous Publishing on Windows

If you want to simulate continuous publishing to your Amazon Kinesis stream for use with checkpointing scripts, follow these steps on Windows systems to run this batch script that loads the sample logs to your Amazon Kinesis stream every 400 seconds:

1. Download the file, publisher.bat, from http://emr-kinesis.s3.amazonaws.com/publisher/publisher.bat.
2. Run publisher.bat at the command prompt by typing it and pressing return. You can optionally double-click the file in Windows Explorer.
3. Important This script will run indefinitely. Terminate the script to avoid further charges upon completion of the tutorial.

Launch the Cluster

The next step is to launch the cluster. This tutorial provides the steps to launch the cluster using both the Amazon EMR console and the Amazon EMR CLI. Choose the method that best meets your needs. When you launch the cluster, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These EC2 instances are preloaded with an Amazon Machine Image (AMI) that has been customized for Amazon EMR and which has Hadoop and other big data applications preloaded.


To launch a cluster for use with Amazon Kinesis using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/. 2. Click Create cluster. 3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).



Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.

Field Action

Hadoop distribution Choose Amazon.

This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose 3.0.4 (Hadoop 2.2.0).

The connector for Amazon Kinesis comes with this Amazon EMR AMI version or newer. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table.
Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


Field Action

Network Choose Launch into EC2-Classic.

Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference.

Optionally, you can launch the cluster in a specific Amazon EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

Master Choose m1.large.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance types to use as master nodes. Valid types are m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, and m2.4xlarge, cc1.4xlarge, cg1.4xlarge.

This tutorial uses m1.large instances for all nodes.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked.

This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).



Core Choose m1.large.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, and m2.4xlarge, cc1.4xlarge, cg1.4xlarge.

This tutorial uses m1.large instances for all nodes.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked.

This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Choose m1.large.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity that your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, and m2.4xlarge, cc1.4xlarge, cg1.4xlarge.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked.

This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Note To save costs, we recommend using m1.large instance types for this tutorial. For production workloads, we recommend at least m1.xlarge instance types. 6. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions and Steps sections, you do not need to change any of these settings. 8. Review your configuration and if you are satisfied with the settings, click Create Cluster. 9. When the cluster starts, the console displays the Cluster Details page.


To create a cluster using the Amazon EMR CLI

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name EmrKinesisTutorial --ami-version 3.0.4 --instance-type m1.xlarge \
--hive-interactive --pig-interactive --num-instances 3 --ssh --key-pair example-keypair

· Windows users:

ruby elastic-mapreduce --create --alive --name EmrKinesisTutorial --ami-version 3.0.4 --instance-type m1.xlarge --hive-interactive --pig-interactive --num-instances 3 --ssh --key-pair example-keypair
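Because the Amazon EMR CLI is no longer under feature development, you may prefer the Amazon EMR commands in the AWS CLI. The following is a roughly equivalent sketch (the key pair name is a placeholder, and the exact options you need may differ):

aws emr create-cluster --name EmrKinesisTutorial --ami-version 3.0.4 \
  --applications Name=Hive Name=Pig \
  --ec2-attributes KeyName=example-keypair \
  --instance-type m1.xlarge --instance-count 3 \
  --no-auto-terminate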

Run the Pig Script

To run a Pig script over an Amazon Kinesis stream in the Grunt shell

1. Connect to the master node of the cluster using SSH and run the commands shown in the following steps. Your client operating system determines which steps to use to connect to the cluster. For more information, see Connect to the Cluster (p. 481).
2. In the SSH window, from the home directory, start the Grunt shell by running the following command:

~/bin/pig

3. Run the following script to process the data in the Amazon Kinesis stream, AccessLogStream, and print the agent count by operating system:

REGISTER file:/home/hadoop/pig/lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();

raw_logs = load 'AccessLogStream' using com.amazon.emr.kinesis.pig.KinesisStreamLoader('kinesis.iteration.timeout=1') as (line:chararray);
logs_base =
  -- for each weblog string convert the weblog string into a
  -- structure with named fields
  FOREACH raw_logs
  GENERATE
    FLATTEN (
      EXTRACT(
        line,
        '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
      )
    )
    AS (
      host: chararray, identity: chararray, user: chararray, time: chararray,
      request: chararray, status: int, size: chararray, referrer: chararray,
      agent: chararray
    )
  ;
by_agent_count_raw =
  -- group by the referrer URL and count the number of requests
  FOREACH
    (GROUP logs_base BY REGEX_EXTRACT(agent,'.*(Windows|Linux).*',1))
  GENERATE
    FLATTEN($0),
    COUNT($1) AS agent_count
  ;

by_agent_count = FILTER by_agent_count_raw by $0 IS NOT null OR ($0!='');

dump by_agent_count;

A list of agent operating systems and associated counts is printed. Results should contain values similar to the following:

(Linux,707) (Windows,8328)

4. The following table provides information about other configuration properties that you can set in the table definition, as well as global variables:

kinesis.stream.name
Amazon Kinesis stream name as the source of data.

kinesis.accessKey
Default: Amazon S3 credentials set in the cluster, or IAM-based role credentials if Amazon S3 credentials are absent.
AWS access key.

kinesis.secretKey
Default: Amazon S3 credentials set in the cluster, or IAM-based role credentials if Amazon S3 credentials are absent.
AWS secret key.

kinesis.nodata.timeout
Default: 5
Timeout, in minutes (integer), to finish this iteration if no data is received continuously for this duration.

kinesis.iteration.timeout
Default: 15
Maximum duration, in minutes (integer), to run this iteration. The cluster creates a checkpoint at the end.

kinesis.records.batchsize
Default: 1000
Number of records to get from the Amazon Kinesis stream in a single GetRecords API call. Cannot be more than 10000 (a limit enforced by the Amazon Kinesis API).

kinesis.endpoint.region
Default: us-east-1
Amazon Kinesis endpoint region. You may experience better performance by choosing an Amazon Kinesis region closest to the Amazon EMR cluster region.

kinesis.retry.interval
Default: 500
Retry interval, in milliseconds (integer), for a failure when calling the Amazon Kinesis API.

kinesis.retry.maxattempts
Default: 5
The maximum number of retries in case of a failure before giving up.

You can set these values in the script using the set command; for example, set kinesis.nodata.timeout 5;.

To run queries with checkpoints

1. Note If you have completed any other tutorials that use the same DynamoDB metastore, you must clear that table's data before you execute these commands.

You can process data in a running Amazon Kinesis stream and store the results in Amazon S3. Run the script as shown in the Grunt shell:

REGISTER file:/home/hadoop/pig/lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();

raw_logs = LOAD 'AccessLogStream' USING com.amazon.emr.kinesis.pig.KinesisStreamLoader() AS (line:chararray);
logs_base =
  -- for each weblog string convert the weblog string into a
  -- structure with named fields
  FOREACH raw_logs
  GENERATE
    FLATTEN (
      EXTRACT(
        line,
        '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
      )
    )
    AS (
      host: chararray, IDENTITY: chararray, USER: chararray, TIME: chararray,
      request: chararray, STATUS: INT, SIZE: chararray, referrer: chararray,
      agent: chararray
    )
  ;
by_agent_count_raw =
  -- group by the referrer URL and count the number of requests
  FOREACH
    (GROUP logs_base BY REGEX_EXTRACT(agent,'.*(Windows|Linux).*',1))
  GENERATE
    FLATTEN($0),
    COUNT($1) AS agent_count
  ;

by_agent_count = FILTER by_agent_count_raw BY $0 IS NOT null OR ($0 != '');

-- Set checkpointing related parameters
set kinesis.checkpoint.enabled true;
set kinesis.checkpoint.metastore.table.name MyEMRKinesisTable;
set kinesis.checkpoint.metastore.hash.key.name HashKey;
set kinesis.checkpoint.metastore.range.key.name RangeKey;
set kinesis.checkpoint.logical.name TestLogicalName;
set kinesis.checkpoint.iteration.no 0;

STORE by_agent_count into 's3://my_s3_path/iteration_0';

2. Wait until the first iteration completes, then enter the following command:

set kinesis.rerun.iteration.without.wait true;
set kinesis.checkpoint.iteration.no 1;
STORE by_agent_count into 's3://my_s3_path/iteration_1';

3. Check the files in Amazon S3. The file in iteration_0 should contain values similar to the following:

Linux 137
Windows 2194

The file in iteration_1 should contain values similar to the following:

Linux 73
Windows 506
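To list the checkpointed output from the command line, something like the following can be used. This is a sketch that assumes the AWS CLI is configured and that my_s3_path matches the path used in the STORE statements above:

aws s3 ls s3://my_s3_path/iteration_0/
aws s3 ls s3://my_s3_path/iteration_1/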

Important Remember to terminate your cluster to avoid additional charges.

Scheduling Scripted Queries

You can schedule scripts to run on your Hadoop cluster using the Linux cron system daemon on the master node. This is especially useful when processing Amazon Kinesis stream data at regular intervals.


To set up a cronjob for scheduled runs

1. Connect to your cluster's master node using SSH. For more information about connecting to your cluster, see Connect to the Master Node Using SSH (p. 481).
2. Create a directory for all your scheduling-related resources called /home/hadoop/crontab:

% mkdir crontab

3. Download executor.sh, pig.config, and user_agents_count.pig in /home/hadoop/crontab using the wget command:

wget http://emr-kinesis.s3.amazonaws.com/crontab/executor.sh http://emr-kinesis.s3.amazonaws.com/crontab/pig.config http://emr-kinesis.s3.amazonaws.com/crontab/user_agents_count.pig

4. Edit pig.config to replace the value of the SCRIPTS variable with the full path of the script. If the directory you created in Step 2 is /home/hadoop/crontab, you do not need to do anything. Note If you have more than one script, you can specify a space-delimited list of pathnames. For example:

SCRIPTS="/home/hadoop/crontab/pigscript1 /home/hadoop/crontab/pigscript2"

5. Edit user_agents_count.pig. At the end of the script, there is a STORE operator:

STORE by_agent_count into 's3://<your-s3-bucket>/pig/iteration_$iterationNumber';

Replace <your-s3-bucket> with the name of your Amazon S3 bucket. Save and exit the editor.
6. Create a temporary directory, /tmp/cronlogs, for storing the log output generated by the cronjobs:

mkdir /tmp/cronlogs

7. Make executor.sh executable:

% chmod +x executor.sh

8. Edit your crontab by executing crontab -e and inserting the following line in the editor:

*/15 * * * * /home/hadoop/crontab/executor.sh /home/hadoop/crontab/pig.config 1>>/tmp/cronlogs/pig.log 2>>/tmp/cronlogs/pig.log

Save and exit the editor; the crontab is updated upon exit. You can verify the crontab entries by executing crontab -l.

Schedule Amazon Kinesis Analysis with Amazon EMR Clusters

When you analyze data on an active Amazon Kinesis stream, each iteration is limited by timeouts and a maximum duration, so it is important that you run the analysis frequently to gather periodic details from the stream. There are multiple ways to execute such scripts and queries at periodic intervals; we recommend using AWS Data Pipeline for recurrent tasks like these. For more information, see AWS Data Pipeline PigActivity and AWS Data Pipeline HiveActivity in the AWS Data Pipeline Developer Guide.


Extract, Transform, and Load (ETL) Data with Amazon EMR

Amazon EMR provides tools you can use to move data and to transform the data from one format to another. S3DistCp is a custom implementation of Apache DistCp that is optimized to work with Amazon S3. Using S3DistCp, you can efficiently copy a large amount of data from Amazon S3 into the HDFS datastore of your cluster. The implementation of Hive provided by Amazon EMR (version 0.7.1.1 and later) includes libraries you can use to import data from DynamoDB or move data from Amazon S3 to DynamoDB.

Topics
· Distributed Copy Using S3DistCp (p. 389)
· Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR (p. 399)
· Store Avro Data in Amazon S3 Using Amazon EMR (p. 425)

Distributed Copy Using S3DistCp

Topics
· S3DistCp Options (p. 390)
· Adding S3DistCp as a Step in a Cluster (p. 394)
· S3DistCp Versions Supported in Amazon EMR (p. 399)

Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner, sharing the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, go to http://hadoop.apache.org/docs/r1.2.1/distcp.html.

S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. You use S3DistCp by adding it as a step in a cluster. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for parallel copying large numbers of objects across buckets and across AWS accounts.


If you use AMI version 2.3.5 or newer, you should use the version of S3DistCp located on the cluster node at /home/hadoop/lib/emr-s3distcp-1.0.jar. Older AMI versions can use S3DistCp directly from Amazon S3 without copying it to the cluster by replacing the /home/hadoop/lib/emr-s3distcp-1.0.jar path with s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar. If you are using an AMI version older than 2.3.5 and IAM roles in your cluster, use the version of S3DistCp at s3://elasticmapreduce/libs/s3distcp/role/s3distcp.jar instead of the locations mentioned above. For more information, see Configure IAM Roles for Amazon EMR (p. 150).
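As a point of reference, on an AMI 2.3.5 or newer cluster you can also invoke S3DistCp directly from the master node with the hadoop jar command. The following is a sketch with placeholder bucket and path names:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://mybucket/logs/ --dest hdfs:///output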

During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster. There must be sufficient free space in HDFS to stage the data, otherwise the copy operation fails. In addition, if S3DistCp fails, it does not clean the temporary HDFS directory, therefore you must manually purge the temporary files. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from the temporary directory. When the copy is complete, S3DistCp removes the files from the temporary directory. If you only have 250 GB of space remaining in HDFS prior to the copy, the copy operation fails.
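Before a large copy, you may want to confirm how much HDFS capacity is free on the cluster. One way to do this, run from the master node, is the Hadoop report command (a sketch; the exact command name can vary by Hadoop version):

hadoop dfsadmin -report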

If a file already exists in the destination location, S3DistCp overwrites it. This is true for destinations in both Amazon S3 and HDFS.

If S3DistCp is unable to copy some or all of the specified files, the cluster step fails and returns a non-zero error code. If this occurs, S3DistCp does not clean up partially copied files.
Important S3DistCp does not support Amazon S3 bucket names that contain the underscore character.

S3DistCp Options

When you call S3DistCp, you can specify options that change how it copies and compresses data. These are described in the following table. The options are added to the step using the arguments list. Examples of the S3DistCp arguments are shown in the following table.

--src,LOCATION
Required: Yes
Location of the data to copy. This can be either an HDFS or Amazon S3 location.
Example: --src,s3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node
Important: S3DistCp does not support Amazon S3 bucket names that contain the underscore character.

--dest,LOCATION
Required: Yes
Destination for the data. This can be either an HDFS or Amazon S3 location.
Example: --dest,hdfs:///output
Important: S3DistCp does not support Amazon S3 bucket names that contain the underscore character.

--srcPattern,PATTERN
Required: No
A regular expression that filters the copy operation to a subset of the data at --src. If neither --srcPattern nor --groupBy is specified, all data at --src is copied to --dest.
If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').
Example: --srcPattern,.*daemons.*-hadoop-.*

--groupBy,PATTERN
Required: No
A regular expression that causes S3DistCp to concatenate files that match the expression. For example, you could use this option to combine all of the log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping.
Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the S3DistCp step and returns an error.
If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').
When --groupBy is specified, only files that match the specified pattern are copied. You do not need to specify --groupBy and --srcPattern at the same time.
Example: --groupBy,.*subnetid.*([0-9]+-[0-9]+-[0-9]+-[0-9]+).*

--targetSize,SIZE
Required: No
The size, in mebibytes (MiB), of the files to create based on the --groupBy option. This value must be an integer. When --targetSize is set, S3DistCp attempts to match this size; the actual size of the copied files may be larger or smaller than this value.
If the files concatenated by --groupBy are larger than the value of --targetSize, they are broken up into part files, and named sequentially with a numeric value appended to the end. For example, a file concatenated into myfile.gz would be broken into parts as: myfile0.gz, myfile1.gz, etc.
Example: --targetSize,2

--appendToLastFile
Required: No
Specifies the behavior of S3DistCp when copying to files already present. It appends new file data to existing files. If you use --appendToLastFile with --groupBy, new data is appended to files which match the same groups. This option also respects the --targetSize behavior when used with --groupBy.

--outputCodec,CODEC
Required: No
Specifies the compression codec to use for the copied files. This can take the values: gzip, lzo, snappy, or none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you do not specify a value for --outputCodec, the files are copied over with no change in their compression.
Example: --outputCodec,lzo

--s3ServerSideEncryption
Required: No
Ensures that the target data is transferred using SSL and automatically encrypted in Amazon S3 using an AWS service-side key. When retrieving data using S3DistCp, the objects are automatically unencrypted. If you attempt to copy an unencrypted object to an encryption-required Amazon S3 bucket, the operation fails. For more information, see Using Data Encryption.
Example: --s3ServerSideEncryption

--deleteOnSuccess
Required: No
If the copy operation is successful, this option causes S3DistCp to delete the copied files from the source location. This is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you don't want to copy the same files twice.
Example: --deleteOnSuccess

--disableMultipartUpload
Required: No
Disables the use of multipart upload. For more information about multipart upload, see Configure Multipart Upload for Amazon S3 (p. 128).
Example: --disableMultipartUpload

--multipartUploadChunkSize,SIZE
Required: No
The size, in MiB, of the multipart upload part size. By default, it uses multipart upload when writing to Amazon S3. The default chunk size is 16 MiB.
Example: --multipartUploadChunkSize,32

--numberFiles
Required: No
Prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by --startingIndex.
Example: --numberFiles

--startingIndex,INDEX
Required: No
Used with --numberFiles to specify the first number in the sequence.
Example: --startingIndex,1

--outputManifest,FILENAME
Required: No
Creates a text file, compressed with Gzip, that contains a list of all the files copied by S3DistCp.
Example: --outputManifest,manifest-1.gz

--previousManifest,PATH
Required: No
Reads a manifest file that was created during a previous call to S3DistCp using the --outputManifest flag. When the --previousManifest flag is set, S3DistCp excludes the files listed in the manifest from the copy operation. If --outputManifest is specified along with --previousManifest, files listed in the previous manifest also appear in the new manifest file, although the files are not copied.
Example: --previousManifest,/usr/bin/manifest-1.gz

--requirePreviousManifest
Required: No
Requires a previous manifest created during a previous call to S3DistCp. If this is set to false, no error is generated when a previous manifest is not specified. The default is true.

--copyFromManifest
Required: No
Reverses the behavior of --previousManifest to cause S3DistCp to use the specified manifest file as a list of files to copy, instead of a list of files to exclude from copying.
Example: --copyFromManifest --previousManifest,/usr/bin/manifest-1.gz

--s3Endpoint ENDPOINT
Required: No
Specifies the Amazon S3 endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. If not set, the default endpoint is s3.amazonaws.com. For a list of the Amazon S3 endpoints, see Regions and Endpoints.
Example: --s3Endpoint s3-eu-west-1.amazonaws.com

--storageClass CLASS
Required: No
The storage class to use when the destination is Amazon S3. Valid values are STANDARD and REDUCED_REDUNDANCY. If this option is not specified, S3DistCp tries to preserve the storage class.
Example: --storageClass STANDARD

In addition to the options above, S3DistCp implements the Tool interface which means that it supports the generic options.

API Version 2009-03-31 393 Amazon Elastic MapReduce Developer Guide Adding S3DistCp as a Step in a Cluster

Adding S3DistCp as a Step in a Cluster

You can call S3DistCp by adding it as a step in your cluster. Steps can be added to a cluster at launch or to a running cluster using the console, CLI, or API. The following examples demonstrate adding an S3DistCp step to a running cluster. For more information on adding steps to a cluster, see Submit Work to a Cluster (p. 517).

To add an S3DistCp step to a running cluster using the AWS CLI

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

· To add a step to a cluster that calls S3DistCp, pass the parameters that specify how S3DistCp should perform the copy operation as arguments.

The following example copies daemon logs from Amazon S3 to hdfs:///output. In the following command:

· --cluster-id specifies the cluster
· Jar is the location of the S3DistCp JAR file
· Args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp Options (p. 390).

To add an S3DistCp step to a running cluster, type the following command:

aws emr add-steps --cluster-id string --steps Type=string,Name=string,Jar=string,\
Args=["S3DistCp-OptionName1 S3DistCp-OptionValue1","S3DistCp-OptionName2 S3DistCp-OptionValue2","S3DistCp-OptionName3 S3DistCp-OptionValue3"]

For example, to copy log files from Amazon S3 to HDFS, type:

aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint s3-eu-west-1.amazonaws.com","--src s3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node/","--dest hdfs:///output","--srcPattern .*[a-zA-Z,]+"]

Example Copy log files from Amazon S3 to HDFS

This example also illustrates how to copy log files stored in an Amazon S3 bucket into HDFS by adding a step to a running cluster. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

Type the following command to copy log files from Amazon S3 to HDFS using the --srcPattern option:

aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/","--dest hdfs:///output","--srcPattern .*daemons.*-hadoop-.*"]


Example Load Amazon CloudFront logs into HDFS

This example loads Amazon CloudFront logs into HDFS by adding a step to a running cluster. In the process it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until the compression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few, large, LZO-compressed files than when processing many, small, Gzip-compressed files. To split LZO files, you must index them and use the hadoop-lzo third party library. For more information, see How to Process Compressed Files (p. 136).

Type the following command to load Amazon CloudFront logs into HDFS:

aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src s3://mybucket/cf","--dest hdfs:///local","--groupBy .*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*","--targetSize 128","--outputCodec lzo","--deleteOnSuccess"]

Consider the case in which the preceding example is run over the following CloudFront log files.

s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz

S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.

hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo

To add a S3DistCp step to a cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Add a step to the cluster that calls S3DistCp, passing in the parameters that specify how S3DistCp should perform the copy operation. For more information about adding steps to a cluster, see Submit Work to a Cluster (p. 517).

The following example copies daemon logs from Amazon S3 to hdfs:///output. In this CLI command:

· --jobflow specifies the cluster to add the copy step to.
· --jar is the location of the S3DistCp JAR file.


· --args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp Options (p. 390). You can also specify the options singly, using multiple --arg parameters. Both forms are shown in examples below.

You can use either the --args or --arg syntax to pass options into the cluster step. The --args parameter is a convenient way to pass in several --arg parameters at one time. It splits the string passed in on comma (,) characters to parse them into arguments. This syntax is shown in the following example. Note that the value passed in by --args is enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args 'S3DistCp-OptionName1,S3DistCp-OptionValue1, \
S3DistCp-OptionName2,S3DistCp-OptionValue2,\
S3DistCp-OptionName3,S3DistCp-OptionValue3'

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "S3DistCp-OptionName1,S3DistCp-OptionValue1,S3DistCp-OptionName2,S3DistCp-OptionValue2,S3DistCp-OptionName3,S3DistCp-OptionValue3"

If the value of an S3DistCp option contains a comma, you cannot use --args, and must instead use individual --arg parameters to pass in the S3DistCp option names and values. Only the --src and --dest arguments are required. Note that the option values are enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--arg S3DistCp-OptionName1 --arg 'S3DistCp-OptionValue1' \
--arg S3DistCp-OptionName2 --arg 'S3DistCp-OptionValue2' \
--arg S3DistCp-OptionName3 --arg 'S3DistCp-OptionValue3'

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg "S3DistCp-OptionName1" --arg "S3DistCp-OptionValue1" --arg "S3DistCp-OptionName2" --arg "S3DistCp-OptionValue2" --arg "S3DistCp-OptionName3" --arg "S3DistCp-OptionValue3"


Example Specify an option value that contains a comma

In this example, --srcPattern is set to '.*[a-zA-Z,]+'. The inclusion of a comma in the --srcPattern regular expression requires the use of individual --arg parameters.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
--arg --src --arg 's3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node/' \
--arg --dest --arg 'hdfs:///output' \
--arg --srcPattern --arg '.*[a-zA-Z,]+'

· Windows users:

ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg --s3Endpoint --arg "s3-eu-west-1.amazonaws.com" --arg --src --arg "s3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node/" --arg --dest --arg "hdfs:///output" --arg --srcPattern --arg ".*[a-zA-Z,]+"

Example Copy log files from Amazon S3 to HDFS

This example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'

· Windows users:

ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*"


Example Load Amazon CloudFront logs into HDFS

This example loads Amazon CloudFront logs into HDFS. In the process it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until the compression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few, large, LZO-compressed files than when processing many, small, Gzip-compressed files. To split LZO files, you must index them and use the hadoop-lzo third party library. For more information, see How to Process Compressed Files (p. 136).

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-3GYXXXXXX9IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://mybucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'

· Windows users:

ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess"

Consider the case in which the preceding example is run over the following CloudFront log files.

s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz

S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.

hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo


S3DistCp Versions Supported in Amazon EMR

Amazon EMR supports the following versions of S3DistCp.

1.0.8 (released 10 April 2013)
Adds the --appendToLastFile, --requirePreviousManifest, and --storageClass options.

1.0.7 (released 2 May 2013)
Adds the --s3ServerSideEncryption option.

1.0.6 (released 6 August 2012)
Adds the --s3Endpoint option.

1.0.5 (released 27 June 2012)
Improves the ability to specify which version of S3DistCp to run.

1.0.4 (released 19 June 2012)
Improves the --deleteOnSuccess option.

1.0.3 (released 12 June 2012)
Adds support for the --numberFiles and --startingIndex options.

1.0.2 (released 6 June 2012)
Improves file naming when using groups.

1.0.1 (released 19 January 2012)
Initial release of S3DistCp.

Note S3DistCp versions after 1.0.7 are found directly on the clusters. Users should use the JAR in /home/hadoop/lib for the latest features.

Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR

Topics
· Prerequisites for Integrating Amazon EMR with DynamoDB (p. 400)
· Step 1: Create a Key Pair (p. 401)
· Step 2: Create a Cluster (p. 401)
· Step 3: SSH into the Master Node (p. 409)
· Step 4: Set Up a Hive Table to Run Hive Commands (p. 411)
· Hive Command Examples for Exporting, Importing, and Querying Data in DynamoDB (p. 416)
· Optimizing Performance for Amazon EMR Operations in DynamoDB (p. 423)

In the following sections, you will learn how to use Amazon Elastic MapReduce (Amazon EMR) with a customized version of Hive that includes connectivity to DynamoDB to perform operations on data stored in DynamoDB, such as:

· Loading DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input into an Amazon EMR cluster.
· Querying live DynamoDB data using SQL-like statements (HiveQL).
· Joining data stored in DynamoDB and exporting it or querying against the joined data.
· Exporting data stored in DynamoDB to Amazon S3.
· Importing data stored in Amazon S3 to DynamoDB.

To perform each of the tasks above, you'll launch an Amazon EMR cluster, specify the location of the data in DynamoDB, and issue Hive commands to manipulate the data in DynamoDB.

DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Developers can create a database table and grow its request traffic or storage without limit. DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent, fast performance. Using Amazon EMR and Hive you can quickly and efficiently process large amounts of data, such as data stored in DynamoDB. For more information about DynamoDB go to the DynamoDB Developer Guide.

Apache Hive is a software layer that you can use to query MapReduce clusters using a simplified, SQL-like query language called HiveQL. It runs on top of the Hadoop architecture. For more information about Hive and HiveQL, go to the HiveQL Language Manual.

There are several ways to launch an Amazon EMR cluster: you can use the Amazon EMR console, the Amazon EMR command line interface (CLI), or you can program your cluster using the AWS SDK or the API. You can also choose whether to run a Hive cluster interactively or from a script. In this section, we will show you how to launch an interactive Hive cluster from the Amazon EMR console and the CLI.

Using Hive interactively is a great way to test query performance and tune your application. Once you have established a set of Hive commands that will run on a regular basis, consider creating a Hive script that Amazon EMR can run for you. For more information about how to run Hive from a script, go to Using Hive Interactively or in Batch Mode (p. 259).
Warning Amazon EMR read or write operations on a DynamoDB table count against your established provisioned throughput, potentially increasing the frequency of provisioned throughput exceptions. For large requests, Amazon EMR implements retries with exponential backoff to manage the request load on the DynamoDB table. Running Amazon EMR jobs concurrently with other traffic may cause you to exceed the allocated provisioned throughput level. You can monitor this by checking the ThrottleRequests metric in Amazon CloudWatch. If the request load is too high, you can relaunch the cluster and set the Read Percent Setting (p. 423) or Write Percent Setting (p. 424) to a lower value to throttle the Amazon EMR operations. For information about DynamoDB throughput settings, see Provisioned Throughput.

Prerequisites for Integrating Amazon EMR with DynamoDB

To use Amazon EMR and Hive to manipulate data in DynamoDB, you need the following:

· An Amazon Web Services account. If you do not have one, you can get an account by going to http://aws.amazon.com, and clicking Create an AWS Account.
· A DynamoDB table that contains data on the same account used with Amazon EMR.
· A customized version of Hive that includes connectivity to DynamoDB (Hive 0.7.1.3 or later, or, if you are using the binary data type, Hive 0.8.1.5 or later). These versions of Hive require the Amazon EMR AMI version 2.0 or later and Hadoop 0.20.205. The latest version of Hive provided by Amazon EMR is available by default when you launch an Amazon EMR cluster from the AWS Management Console or from a version of the Command Line Interface for Amazon EMR downloaded after 11 December 2011. If you launch a cluster using the AWS SDK or the API, you must explicitly set the AMI version to latest and the Hive version to 0.7.1.3 or later. For more information about Amazon EMR AMIs and Hive versioning, go to Specifying the Amazon EMR AMI Version and to Configuring Hive in the Amazon EMR Developer Guide.
· Support for DynamoDB connectivity. This is loaded on the Amazon EMR AMI version 2.0.2 or later.
· (Optional) An Amazon S3 bucket. For instructions about how to create a bucket, see Get Started With Amazon Simple Storage Service. This bucket is used as a destination when exporting DynamoDB data to Amazon S3 or as a location to store a Hive script.
· (Optional) A Secure Shell (SSH) client application to connect to the master node of the Amazon EMR cluster and run HiveQL queries against the DynamoDB data. SSH is used to run Hive interactively. You can also save Hive commands in a text file and have Amazon EMR run the Hive commands from the script. In this case an SSH client is not necessary, though the ability to SSH into the master node is useful even in non-interactive clusters, for debugging purposes. An SSH client is available by default on most Linux, Unix, and Mac OS X installations. Windows users can install and use an SSH client called PuTTY.
· (Optional) An Amazon EC2 key pair. This is only required for interactive clusters. The key pair provides the credentials the SSH client uses to connect to the master node. If you are running the Hive commands from a script in an Amazon S3 bucket, an EC2 key pair is optional.

Step 1: Create a Key Pair

To run Hive interactively to manage data in DynamoDB, you will need a key pair to connect to the Amazon EC2 instances launched by Amazon EMR. You will use this key pair to connect to the master node of the Amazon EMR job flow and run HiveQL commands (HiveQL is a language similar to SQL).

To generate a key pair

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. In the Navigation pane, select a Region from the Region drop-down menu. This should be the same region that your DynamoDB database is in.
3. Click Key Pairs in the Navigation pane.

   The console displays a list of key pairs associated with your account.
4. Click Create Key Pair.
5. Enter a name for the key pair, such as mykeypair, in the Key Pair Name field and click Create.

   You are prompted to download the key file.
6. Download the private key file and keep it in a safe place. You will need it to access any instances that you launch with this key pair.
   Important If you lose the key pair, you cannot connect to your Amazon EC2 instances.

For more information about key pairs, see Getting an SSH Key Pair in the Amazon EC2 User Guide for Linux Instances.
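If you prefer the command line, the AWS CLI can create the key pair as well. The following is a minimal sketch; the key name and output path are placeholders, and the --query/--output options simply write the returned private key material to a file.

aws ec2 create-key-pair --key-name mykeypair \
    --query 'KeyMaterial' --output text > ~/mykeypair.pem
chmod og-rwx ~/mykeypair.pem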

Step 2: Create a Cluster

For Hive to run on Amazon EMR, you must create a cluster with Hive enabled. This sets up the necessary applications and infrastructure for Hive to connect to DynamoDB. The following procedures explain how to create an interactive Hive cluster from the AWS Management Console and the CLI.

Topics


· To start a cluster using the AWS Management Console (p. 402)
· To start a cluster using a command line client (p. 408)

To start a cluster using the AWS Management Console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create Cluster.

3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.


Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Choose Yes.

Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Protect a Cluster from Termination in the Amazon EMR Developer Guide. Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging Choose Enabled.

This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files in the Amazon EMR Developer Guide.

Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files in the Amazon EMR Developer Guide.

Debugging Choose Enabled.

This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.


Field Action

Hadoop distribution Choose Amazon.

This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop in the Amazon EMR Developer Guide.

AMI version Choose 2.4.2 (Hadoop 1.0.3).

This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image in the Amazon EMR Developer Guide.

Applications to be installed - Hive A default Hive version should already be selected and displayed in the list. If it does not appear, choose it from the Additional applications list.

For more information, see Analyze Data with Hive in the Amazon EMR Developer Guide.

Applications to be installed - Pig A default Pig version should already be selected and displayed in the list. If it does not appear, choose it from the Additional applications list.

For more information, see Process Data with Pig in the Amazon EMR Developer Guide.

5. In the Hardware Configuration section, verify the fields according to the following table.


Note Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit will result in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose Launch into EC2-Classic.

Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) in the Amazon EMR Developer Guide.

EC2 Availability Zone Choose No preference.

Optionally, you can launch the cluster in a specific EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.


Field Action

Master Choose m1.small.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance types to use as master nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

For more information, see Instance Groups in the Amazon EMR Developer Guide.

Request Spot Instances Leave this box unchecked.

This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

Core Choose m1.small.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

For more information, see Instance Groups in the Amazon EMR Developer Guide.

Count Choose 2.

Request Spot Instances Leave this box unchecked.

This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

Task Choose m1.small.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

For more information, see Instance Groups in the Amazon EMR Developer Guide.


Field Action

Count Choose 0.

Request Spot Instances Leave this box unchecked.

This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

6. In the Security and Access section, complete the fields according to the following table.

Field Action

EC2 key pair Choose Proceed without an EC2 key pair.

Optionally, specify a key pair that you created previously. For more information, see Create SSH Credentials for the Master Node in the Amazon EMR Developer Guide.

If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster in the Amazon EMR Developer Guide.

IAM user access Choose No other IAM users.

Optionally, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions in the Amazon EMR Developer Guide.


Field Action

IAM role Choose Proceed without role.

This controls application access to the EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR in the Amazon EMR Developer Guide.

7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) in the Amazon EMR Developer Guide.
8. Review your configuration and if you are satisfied with the settings, click Create Cluster.
9. When the cluster starts, you see the Summary pane.

To start a cluster using a command line client

1. Download the Amazon EMR Ruby command line client (CLI). If you downloaded the Amazon EMR CLI before 11 December 2011, you will need to download and install the latest version to get support for AMI versioning, Amazon EMR AMI version 2.0, and Hadoop 0.20.205.
2. Install the command line client and set up your credentials. For information about how to do this, go to Command Line Interface Reference for Amazon EMR in the Amazon EMR Developer Guide.


3. Use the following syntax to start a new cluster, specifying your own values for the instance size and your own cluster name for "myJobFlowName":

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information about the Amazon EMR CLI, see Command Line Interface for Amazon EMR in the Amazon EMR Developer Guide.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --num-instances 3 \
--instance-type m1.small \
--name "myJobFlowName" \
--hive-interactive --hive-versions 0.7.1.1 \
--ami-version latest \
--hadoop-version 0.20.205

· Windows users:

ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type m1.small --name "myJobFlowName" --hive-interactive --hive-versions 0.7.1.1 --ami-version latest --hadoop-version 0.20.205
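If you are using the newer AWS CLI instead of the Ruby client, a roughly equivalent launch is sketched below; the AMI version, instance settings, and key pair name are illustrative placeholders, so adjust them to match your setup.

aws emr create-cluster --name "myJobFlowName" \
    --ami-version 2.4.2 \
    --applications Name=Hive \
    --instance-type m1.small --instance-count 3 \
    --ec2-attributes KeyName=mykeypair \
    --no-auto-terminate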

You must use the same account to create the Amazon EMR cluster that you used to store data in DynamoDB. This ensures that the credentials passed in by the CLI will match those required by DynamoDB.

Note After you create the cluster, you should wait until its status is WAITING before continuing to the next step.

Step 3: SSH into the Master Node

When the cluster's status is WAITING, the master node is ready for you to connect to it. With an active SSH session into the master node, you can execute command line operations.

To locate the public DNS name of the master node

· In the Amazon EMR console, select the cluster from the list of running clusters in the WAITING state. Details about the cluster appear in the lower pane.

The DNS name you use to connect to the instance is listed on the Description tab as Master Public DNS Name.


To connect to the master node using Linux/Unix/Mac OS X

1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.
2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in the user's home directory, the command is:

chmod og-rwx ~/mykeypair.pem

If you do not perform this step, SSH returns an error saying that your private key file is unprotected and rejects the key. You only need to perform this step the first time you use the private key to connect.
3. To establish the connection to the master node, enter the following command line, which assumes the PEM file is in the user's home directory. Replace master-public-dns-name with the Master Public DNS Name of your cluster and replace ~/mykeypair.pem with the location and filename of your PEM file.

ssh hadoop@master-public-dns-name -i ~/mykeypair.pem

A warning states that the authenticity of the host you are connecting to can't be verified.
4. Type yes to continue.
   Note If you are asked to log in, enter hadoop.

To install and configure PuTTY on Windows

1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
2. Launch PuTTYgen.
3. Click Load.
4. Select the PEM file you created earlier. Note that you may have to change the search parameters from file of type "PuTTY Private Key Files (*.ppk)" to "All Files (*.*)".
5. Click Open.
6. Click OK on the PuTTYgen notice telling you the key was successfully imported.
7. Click Save private key to save the key in the PPK format.
8. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
9. Enter a name for your PuTTY private key, such as mykeypair.ppk.
10. Click Save.
11. Close PuTTYgen.

To connect to the master node using PuTTY on Windows

1. Start PuTTY.
2. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. The input looks similar to [email protected].
3. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling the SSH authentication pane appears.


4. For Private key file for authentication, click Browse and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.
5. Click Open.

A PuTTY Security Alert pops up.
6. Click Yes for the PuTTY Security Alert.
   Note If you are asked to log in, enter hadoop.

After you connect to the master node using either SSH or PuTTY, you should see a Hadoop command prompt and you are ready to start a Hive interactive session.

Step 4: Set Up a Hive Table to Run Hive Commands

Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR clusters using a SQL-like language. Because we launched the cluster as a Hive application, Amazon EMR installs Hive on the EC2 instances it launches to process the cluster. For more information about Hive, go to http://hive.apache.org/.

If you've followed the previous instructions to set up a cluster and use SSH to connect to the master node, you are ready to use Hive interactively.

To run Hive commands interactively

1. At the hadoop command prompt for the current master node, type hive.

You should see a hive prompt: hive>
2. Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table's read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.

The following shows the syntax for mapping a Hive table to a DynamoDB table.

CREATE EXTERNAL TABLE hive_tablename (hive_column1_name column1_datatype, hive_column2_name column2_datatype...)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodb_tablename",
"dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name...");

When you create a table in Hive from DynamoDB, you must create it as an external table using the keyword EXTERNAL. The difference between external and internal tables is that the data in internal tables is deleted when an internal table is dropped. This is not the desired behavior when connected to Amazon DynamoDB, and thus only external tables are supported.

For example, the following Hive command creates a table named hivetable1 in Hive that references the DynamoDB table named dynamodbtable1. The DynamoDB table dynamodbtable1 has a hash-and-range primary key schema. The hash key element is name (string type), the range key element is year (numeric type), and each item has an attribute value for holidays (string set type).

CREATE EXTERNAL TABLE hivetable1 (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

Line 1 uses the HiveQL CREATE EXTERNAL TABLE statement. For hivetable1, you need to establish a column for each attribute name-value pair in the DynamoDB table, and provide the data type. These values are not case-sensitive, and you can give the columns any name (except reserved words).

Line 2 uses the STORED BY statement. The value of STORED BY is the name of the class that handles the connection between Hive and DynamoDB. It should be set to 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'.

Line 3 uses the TBLPROPERTIES statement to associate "hivetable1" with the correct table and schema in DynamoDB. Provide TBLPROPERTIES with values for the dynamodb.table.name parameter and dynamodb.column.mapping parameter. These values are case-sensitive.
Note All DynamoDB attribute names for the table must have corresponding columns in the Hive table. Otherwise, the Hive table won't contain the name-value pair from DynamoDB. If you do not map the DynamoDB primary key attributes, Hive generates an error. If you do not map a non-primary key attribute, no error is generated, but you won't see the data in the Hive table. If the data types do not match, the value is null.

Then you can start running Hive operations on hivetable1. Queries run against hivetable1 are internally run against the DynamoDB table dynamodbtable1 of your DynamoDB account, consuming read or write units with each execution.

When you run Hive queries against a DynamoDB table, you need to ensure that you have provisioned a sufficient amount of read capacity units.


For example, suppose that you have provisioned 100 units of read capacity for your DynamoDB table. This will let you perform 100 reads, or 409,600 bytes, per second (each read capacity unit covers an item of up to 4 KB, and 100 × 4,096 bytes = 409,600 bytes). If that table contains 20GB of data (21,474,836,480 bytes), and your Hive query performs a full table scan, you can estimate how long the query will take to run:

21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours

The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more Amazon EMR nodes will not help.

In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned read capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Amazon EMR console; you will be able to view the individual mapper task status, and statistics for data reads.

You can also log on to the Hadoop interface on the master node and see the Hadoop statistics. This will show you the individual map task status and some data read statistics. For more information, see the following topics:

· Web Interfaces Hosted on the Master Node
· View the Hadoop Web Interfaces

For more information about sample HiveQL statements to perform tasks such as exporting or importing data from DynamoDB and joining tables, see Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB in the Amazon EMR Developer Guide.

You can also create a file that contains a series of commands, launch a cluster, and reference that file to perform the operations. For more information, see Interactive and Batch Modes in the Amazon EMR Developer Guide.

To cancel a Hive request

When you execute a Hive query, the initial response from the server includes the command to cancel the request. To cancel the request at any time in the process, use the Kill Command from the server response.

1. Enter Ctrl+C to exit the command line client.
2. At the shell prompt, enter the Kill Command from the initial server response to your request.

Alternatively, you can run the following command from the command line of the master node to kill the Hadoop job, where job-id is the identifier of the Hadoop job and can be retrieved from the Hadoop user interface. For more information about the Hadoop user interface, see How to Use the Hadoop User Interface in the Amazon EMR Developer Guide.

hadoop job -kill job-id
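If you no longer have the Kill Command from the server response, you can list the running jobs from the master node to find the identifier first. The following is a minimal sketch using the standard Hadoop command line; the job ID shown is hypothetical.

hadoop job -list
hadoop job -kill job_201403211200_0001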

Data Types for Hive and DynamoDB

The following table shows the available Hive data types and how they map to the corresponding DynamoDB data types.


Hive type            DynamoDB type
string               string (S)
bigint or double     number (N)
binary               binary (B)
array                number set (NS), string set (SS), or binary set (BS)

The bigint type in Hive is the same as the Java long type, and the Hive double type is the same as the Java double type in terms of precision. This means that if you have numeric data stored in DynamoDB that has precision higher than is available in the Hive datatypes, using Hive to export, import, or reference the DynamoDB data could lead to a loss in precision or a failure of the Hive query.
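The binary mapping in the table above works the same way as the string and number mappings shown elsewhere in this section. The sketch below is illustrative only; the table name, column names, and DynamoDB attribute names are hypothetical.

CREATE EXTERNAL TABLE hive_binary_example (id string, score bigint, payload binary)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "binaryexampletable",
"dynamodb.column.mapping" = "id:Id,score:Score,payload:Payload");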

Exports of the binary type from DynamoDB to Amazon Simple Storage Service (Amazon S3) or HDFS are stored as a Base64-encoded string. If you are importing data from Amazon S3 or HDFS into the DynamoDB binary type, it should be encoded as a Base64 string.

Hive Options

You can set the following Hive options to manage the transfer of data out of Amazon DynamoDB. These options only persist for the current Hive session. If you close the Hive command prompt and reopen it later on the cluster, these settings will have returned to the default values.

Hive Options    Description

dynamodb.throughput.read.percent    Set the rate of read operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default read rate, which means that Hive will attempt to consume half of the read provisioned throughput resources in the table. Increasing this value above 0.5 increases the read request rate. Decreasing it below 0.5 decreases the read request rate. This read rate is approximate. The actual read rate will depend on factors such as whether there is a uniform distribution of keys in DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live read traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available.


dynamodb.throughput.write.percent    Set the rate of write operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default write rate, which means that Hive will attempt to consume half of the write provisioned throughput resources in the table. Increasing this value above 0.5 increases the write request rate. Decreasing it below 0.5 decreases the write request rate. This write rate is approximate. The actual write rate will depend on factors such as whether there is a uniform distribution of keys in DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live write traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available or this is the initial data upload to the table and there is no live traffic yet.

dynamodb.endpoint    Specify the endpoint in case you have tables in different regions. For more information about the available DynamoDB endpoints, see Regions and Endpoints.

dynamodb.max.map.tasks    Specify the maximum number of map tasks when reading data from DynamoDB. This value must be equal to or greater than 1.

dynamodb.retry.duration    Specify the number of minutes to use as the timeout duration for retrying Hive commands. This value must be an integer equal to or greater than 0. The default timeout duration is two minutes.

These options are set using the SET command as shown in the following example.

SET dynamodb.throughput.read.percent=1.0;

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;
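Hive also echoes the current value of a configuration property when you issue SET with the property name and no value, which can be a quick way to confirm what a session is using (the property must have been set in the session for a value to be reported):

SET dynamodb.throughput.read.percent;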

If you are using the AWS SDK for Java, you can use the -e option of Hive to pass in the command directly, as shown in the last line of the following example.

steps.add(new StepConfig()
    .withName("Run Hive Script")
    .withHadoopJarStep(new HadoopJarStepConfig()
        .withJar("s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar")
        .withArgs("s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
            "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/",
            "--run-hive-script",
            "--args", "-e", "SET dynamodb.throughput.read.percent=1.0;")));

Hive Command Examples for Exporting, Importing, and Querying Data in DynamoDB

The following examples use Hive commands to perform operations such as exporting data to Amazon S3 or HDFS, importing data to DynamoDB, joining tables, querying tables, and more.

Operations on a Hive table reference data stored in DynamoDB. Hive commands are subject to the DynamoDB table's provisioned throughput settings, and the data retrieved includes the data written to the DynamoDB table at the time the Hive operation request is processed by DynamoDB. If the data retrieval process takes a long time, some data returned by the Hive command may have been updated in DynamoDB since the Hive command began.

Hive commands DROP TABLE and CREATE TABLE only act on the local tables in Hive and do not create or drop tables in DynamoDB. If your Hive query references a table in DynamoDB, that table must already exist before you run the query. For more information on creating and deleting tables in DynamoDB, go to Working with Tables in DynamoDB.
Note When you map a Hive table to a location in Amazon S3, do not map it to the root path of the bucket, s3://mybucket, as this may cause errors when Hive writes the data to Amazon S3. Instead map the table to a subpath of the bucket, s3://mybucket/mypath.

Exporting Data from DynamoDB

You can use Hive to export data from DynamoDB.

To export a DynamoDB table to an Amazon S3 bucket

· Create a Hive table that references data stored in DynamoDB. Then you can call the INSERT OVERWRITE command to write the data to an external directory. In the following example, s3://bucketname/path/subpath/ is a valid path in Amazon S3. Adjust the columns and datatypes in the CREATE command to match the values in your DynamoDB table. You can use this to create an archive of your DynamoDB data in Amazon S3.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE DIRECTORY 's3://bucketname/path/subpath/' SELECT * FROM hiveTableName;


To export a DynamoDB table to an Amazon S3 bucket using formatting

· Create an external table that references a location in Amazon S3. This is shown below as s3_export. During the CREATE call, specify row formatting for the table. Then, when you use INSERT OVERWRITE to export data from DynamoDB to s3_export, the data is written out in the specified format. In the following example, the data is written out as comma-separated values (CSV).

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;

To export a DynamoDB table to an Amazon S3 bucket without specifying a column mapping

· Create a Hive table that references data stored in DynamoDB. This is similar to the preceding example, except that you are not specifying a column mapping. The table must have exactly one column of type map<string, string>. If you then create an EXTERNAL table in Amazon S3 you can call the INSERT OVERWRITE command to write the data from DynamoDB to Amazon S3. You can use this to create an archive of your DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables that are exported this way. Exporting data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.

CREATE EXTERNAL TABLE hiveTableName (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1");

CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3TableName SELECT * FROM hiveTableName;


To export a DynamoDB table to an Amazon S3 bucket using data compression

· Hive provides several compression codecs you can set during your Hive session. Doing so causes the exported data to be compressed in the specified format. The following example compresses the exported files using the Lempel-Ziv-Oberhumer (LZO) algorithm.

SET hive.exec.compress.output=true; SET io.seqfile.compression.type=BLOCK; SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE lzo_compression_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE lzo_compression_table SELECT * FROM hiveTableName;

The available compression codecs are:

· org.apache.hadoop.io.compress.GzipCodec
· org.apache.hadoop.io.compress.DefaultCodec
· com.hadoop.compression.lzo.LzoCodec
· com.hadoop.compression.lzo.LzopCodec
· org.apache.hadoop.io.compress.BZip2Codec
· org.apache.hadoop.io.compress.SnappyCodec

To export a DynamoDB table to HDFS

· Use the following Hive command, where hdfs:///directoryName is a valid HDFS path and hiveTableName is a table in Hive that references DynamoDB. This export operation is faster than exporting a DynamoDB table to Amazon S3 because Hive 0.7.1.1 uses HDFS as an intermediate step when exporting data to Amazon S3. The following example also shows how to set dynamodb.throughput.read.percent to 1.0 in order to increase the read request rate.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

SET dynamodb.throughput.read.percent=1.0;


INSERT OVERWRITE DIRECTORY 'hdfs:///directoryName' SELECT * FROM hiveTableName;

You can also export data to HDFS using formatting and compression as shown above for the export to Amazon S3. To do so, simply replace the Amazon S3 directory in the examples above with an HDFS directory.

To read non-printable UTF-8 character data in Hive

· You can read and write non-printable UTF-8 character data with Hive by using the STORED AS SEQUENCEFILE clause when you create the table. A SequenceFile is a Hadoop binary file format; you need to use Hadoop to read this file. The following example shows how to export data from DynamoDB into Amazon S3. You can use this functionality to handle non-printable UTF-8 encoded characters.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
STORED AS SEQUENCEFILE
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;

Importing Data to DynamoDB

When you write data to DynamoDB using Hive you should ensure that the number of write capacity units is greater than the number of mappers in the cluster. For example, clusters that run on m1.xlarge EC2 instances produce 8 mappers per instance. In the case of a cluster that has 10 instances, that would mean a total of 80 mappers. If your write capacity units are not greater than the number of mappers in the cluster, the Hive write operation may consume all of the write throughput, or attempt to consume more throughput than is provisioned. For more information about the number of mappers produced by each EC2 instance type, go to Hadoop Configuration Reference in the Amazon EMR Developer Guide. There, you will find a "Task Configuration" section for each of the supported configurations.

The number of mappers in Hadoop is controlled by the input splits. If there are too few splits, your write command might not be able to consume all the write throughput available.

If an item with the same key exists in the target DynamoDB table, it will be overwritten. If no item with the key exists in the target DynamoDB table, the item is inserted.
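Because an import consumes write capacity, it can help to set the session write rate explicitly before running the INSERT statement in the procedures that follow. This is only a sketch that reuses the dynamodb.throughput.write.percent option described earlier; choose a value that matches your table's provisioned throughput.

SET dynamodb.throughput.write.percent=1.0;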

To import a table from Amazon S3 to DynamoDB

· You can use Amazon EMR and Hive to write data from Amazon S3 to DynamoDB.


CREATE EXTERNAL TABLE s3_import(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM s3_import;

To import a table from an Amazon S3 bucket to DynamoDB without specifying a column mapping

· Create an EXTERNAL table that references data stored in Amazon S3 that was previously exported from DynamoDB. Before importing, ensure that the table exists in DynamoDB and that it has the same key schema as the previously exported DynamoDB table. In addition, the table must have exactly one column of type map<string, string>. If you then create a Hive table that is linked to DynamoDB, you can call the INSERT OVERWRITE command to write the data from Amazon S3 to DynamoDB. Because there is no column mapping, you cannot query tables that are imported this way. Importing data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.

CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

CREATE EXTERNAL TABLE hiveTableName (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM s3TableName;

To import a table from HDFS to DynamoDB

· You can use Amazon EMR and Hive to write data from HDFS to DynamoDB.

CREATE EXTERNAL TABLE hdfs_import(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///directoryName';

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM hdfs_import;

Querying Data in DynamoDB

The following examples show the various ways you can use Amazon EMR to query data stored in DynamoDB.

To find the largest value for a mapped column (max)

· Use Hive commands like the following. In the first command, the CREATE statement creates a Hive table that references data stored in DynamoDB. The SELECT statement then uses that table to query data stored in DynamoDB. The following example finds the largest order placed by a given customer.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

SELECT max(total_cost) from hive_purchases where customerId = 717;

To aggregate data using the GROUP BY clause

· You can use the GROUP BY clause to collect data across multiple records. This is often used with an aggregate function such as sum, count, min, or max. The following example returns a list of the largest orders from customers who have placed more than three orders.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

SELECT customerId, max(total_cost) from hive_purchases GROUP BY customerId HAVING count(*) > 3;


To join two DynamoDB tables

· The following example maps two Hive tables to data stored in DynamoDB. It then calls a join across those two tables. The join is computed on the cluster and returned. The join does not take place in DynamoDB. This example returns a list of customers and their purchases for customers that have placed more than two orders.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

CREATE EXTERNAL TABLE hive_customers(customerId bigint, customerName string, customerAddress array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Customers",
"dynamodb.column.mapping" = "customerId:CustomerId,customerName:Name,customerAddress:Address");

Select c.customerId, c.customerName, count(*) as count from hive_customers c JOIN hive_purchases p ON c.customerId=p.customerId GROUP BY c.customerId, c.customerName HAVING count > 2;

To join two tables from different sources

· In the following example, Customer_S3 is a Hive table that loads a CSV file stored in Amazon S3 and hive_purchases is a table that references data in DynamoDB. The following example joins together customer data stored as a CSV file in Amazon S3 with order data stored in DynamoDB to return a set of data that represents orders placed by customers who have "Miller" in their name.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

CREATE EXTERNAL TABLE Customer_S3(customerId bigint, customerName string, customerAddress array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

Select c.customerId, c.customerName, c.customerAddress from Customer_S3 c
JOIN hive_purchases p ON c.customerid=p.customerid
where c.customerName like '%Miller%';


Note In the preceding examples, the CREATE TABLE statements were included in each example for clarity and completeness. When running multiple queries or export operations against a given Hive table, you only need to create the table one time, at the beginning of the Hive session.

Optimizing Performance for Amazon EMR Operations in DynamoDB

Amazon EMR operations on a DynamoDB table count as read operations, and are subject to the table's provisioned throughput settings. Amazon EMR implements its own logic to try to balance the load on your DynamoDB table to minimize the possibility of exceeding your provisioned throughput. At the end of each Hive query, Amazon EMR returns information about the cluster used to process the query, including how many times your provisioned throughput was exceeded. You can use this information, as well as CloudWatch metrics about your DynamoDB throughput, to better manage the load on your DynamoDB table in subsequent requests.
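As a rough illustration of checking those CloudWatch metrics from the command line, the AWS CLI can retrieve consumed capacity for a table; the table name and time range below are placeholders.

aws cloudwatch get-metric-statistics --namespace AWS/DynamoDB \
    --metric-name ConsumedReadCapacityUnits \
    --dimensions Name=TableName,Value=dynamodbtable1 \
    --start-time 2014-03-21T00:00:00Z --end-time 2014-03-21T01:00:00Z \
    --period 300 --statistics Sum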

The following factors influence Hive query performance when working with DynamoDB tables.

Provisioned Read Capacity Units

When you run Hive queries against a DynamoDB table, you need to ensure that you have provisioned a sufficient amount of read capacity units.

For example, suppose that you have provisioned 100 units of Read Capacity for your DynamoDB table. This will let you perform 100 reads, or 409,600 bytes, per second. If that table contains 20GB of data (21,474,836,480 bytes), and your Hive query performs a full table scan, you can estimate how long the query will take to run:

21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours

The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more nodes to the Amazon EMR cluster will not help.

In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned Read Capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Amazon EMR console; you will be able to view the individual mapper task status, and statistics for data reads.

You can also log on to the Hadoop interface on the master node and see the Hadoop statistics. This will show you the individual map task status and some data read statistics. For more information, see the following topics:

· Web Interfaces Hosted on the Master Node
· View the Hadoop Web Interfaces

Read Percent Setting

By default, Amazon EMR manages the request load against your DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default read rate using the dynamodb.throughput.read.percent parameter when you set up the Hive table. For more information about setting the read percent parameter, see Hive Options in the Amazon EMR Developer Guide.


Write Percent Setting

By default, Amazon EMR manages the request load against your DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default write rate using the dynamodb.throughput.write.percent parameter when you set up the Hive table. For more information about setting the write percent parameter, see Hive Options in the Amazon EMR Developer Guide.

Retry Duration Setting

By default, Amazon EMR re-runs a Hive query if it has not returned a result within two minutes, the default retry interval. You can adjust this interval by setting the dynamodb.retry.duration parameter when you run a Hive query. For more information about setting this parameter, see Hive Options in the Amazon EMR Developer Guide.

Number of Map Tasks

The mapper daemons that Hadoop launches to process your requests to export and query data stored in DynamoDB are capped at a maximum read rate of 1 MiB per second to limit the read capacity used. If you have additional provisioned throughput available on DynamoDB, you can improve the performance of Hive export and query operations by increasing the number of mapper daemons. To do this, you can either increase the number of EC2 instances in your cluster or increase the number of mapper daemons running on each EC2 instance.

You can increase the number of EC2 instances in a cluster by stopping the current cluster and relaunching it with a larger number of EC2 instances. You specify the number of EC2 instances in the Configure EC2 Instances dialog box if you're launching the cluster from the Amazon EMR console, or with the --num-instances option if you're launching the cluster from the CLI.

The number of map tasks run on an instance depends on the EC2 instance type. For more information about the supported EC2 instance types and the number of mappers each one provides, go to Hadoop Configuration Reference in the Amazon EMR Developer Guide. There, you will find a "Task Configuration" section for each of the supported configurations.

Another way to increase the number of mapper daemons is to change the mapred.tasktracker.map.tasks.maximum configuration parameter of Hadoop to a higher value. This has the advantage of giving you more mappers without increasing either the number or the size of EC2 instances, which saves you money. A disadvantage is that setting this value too high can cause the EC2 instances in your cluster to run out of memory. To set mapred.tasktracker.map.tasks.maximum, launch the cluster and specify the Configure Hadoop bootstrap action, passing in a value for mapred.tasktracker.map.tasks.maximum as one of the arguments of the bootstrap action. This is shown in the following example.

--bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \

--args -m,mapred.tasktracker.map.tasks.maximum=10

For more information about bootstrap actions, see Using Custom Bootstrap Actions in the Amazon EMR Developer Guide.
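Putting the pieces together, a full launch command that applies this bootstrap action might look like the following Amazon EMR CLI sketch; the instance settings and cluster name are illustrative and mirror the earlier examples in this section.

./elastic-mapreduce --create --alive --num-instances 3 \
  --instance-type m1.xlarge \
  --name "myJobFlowName" \
  --hive-interactive \
  --bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -m,mapred.tasktracker.map.tasks.maximum=10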


Parallel Data Requests

Multiple data requests, either from more than one user or more than one application to a single table, may drain read provisioned throughput and slow performance.

Process Duration

Data consistency in DynamoDB depends on the order of read and write operations on each node. While a Hive query is in progress, another application might load new data into the DynamoDB table or modify or delete existing data. In this case, the results of the Hive query might not reflect changes made to the data while the query was running.

Avoid Exceeding Throughput

When running Hive queries against DynamoDB, take care not to exceed your provisioned throughput, because this will deplete capacity needed for your application's calls to DynamoDB::Get. To ensure that this is not occurring, you should regularly monitor the read volume and throttling on application calls to DynamoDB::Get by checking logs and monitoring metrics in Amazon CloudWatch.

Request Time

Scheduling Hive queries that access a DynamoDB table when there is lower demand on the DynamoDB table improves performance. For example, if most of your application's users live in San Francisco, you might choose to export daily data at 4 a.m. PST, when the majority of users are asleep, and not updating records in your DynamoDB database.

Time-Based Tables

Time-Based Tables

If the data is organized as a series of time-based DynamoDB tables, such as one table per day, you can export the data when the table is no longer active. You can use this technique to back up data to Amazon S3 on an ongoing basis.

Archived Data

If you plan to run many Hive queries against the data stored in DynamoDB and your application can tolerate archived data, you may want to export the data to HDFS or Amazon S3 and run the Hive queries against a copy of the data instead of DynamoDB. This conserves your read operations and provisioned throughput.

Viewing Hadoop Logs

If you run into an error, you can investigate what went wrong by viewing the Hadoop logs and user interface. For more information, see How to Monitor Hadoop on a Master Node and How to Use the Hadoop User Interface in the Amazon EMR Developer Guide.

Store Avro Data in Amazon S3 Using Amazon EMR

Avro is a data serialization system that relies on schemas stored in JSON format to store and load data. The following procedure shows you how to take data from a flat file and store that data in Amazon S3 using Avro. The procedure assumes that you have already launched a cluster with Pig installed. For more information about how to launch a cluster with Pig installed, see Launch a Cluster and Submit Pig Work (p. 307). For more information about Avro, go to https://cwiki.apache.org/confluence/display/PIG/AvroStorage.

To store and load data using Avro

1. Create a text file, top_nhl_scorers.txt, with the following information, taken from the Wikipedia article http://en.wikipedia.org/wiki/List_of_NHL_players_with_1000_points#1000-point_scorers:

Gordie Howe	Detroit Red Wings	1767	1850
Jean Beliveau	Montreal Canadiens	1125	1219
Alex Delvecchio	Detroit Red Wings	1969	1281
Bobby Hull	Chicago Black Hawks	1063	1170
Norm Ullman	Toronto Maple Leafs	1410	1229
Stan Mikita	Chicago Black Hawks	1394	1467
Johnny Bucyk	Boston Bruins	556	1369
Frank Mahovlich	Montreal Canadiens	1973	1103
Henri Richard	Montreal Canadiens	1256	1046
Phil Esposito	Boston Bruins	717	1590

Upload the file to a bucket in Amazon S3.
2. Create an Avro schema file, top_nhl_scorers.avro, with the following structure:

{"namespace": "top_nhl_scorers.avro", "type": "record", "name": "Names", "fields": [ {"name": "name", "type": "string"}, {"name": "team", "type": "string"}, {"name": "games_played", "type": "int"}, {"name": "points", "type": "int"} ] }

Upload this file to the same bucket in Amazon S3.
3. Connect to the master node of your cluster. For more information, see Connect to the Master Node Using SSH (p. 481).
4. Launch the grunt shell:

$ pig
2014-03-21 16:50:29,565 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1-amzn (rexported) compiled Aug 03 2013, 22:52:20
2014-03-21 16:50:29,565 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1395420629558.log
2014-03-21 16:50:29,662 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2014-03-21 16:50:29,933 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://172.31.17.132:9000
2014-03-21 16:50:30,696 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 172.31.17.132:9001
grunt>

5. Register the JARs required to invoke the necessary storage handlers:


REGISTER /home/hadoop/lib/avro-1.7.4.jar;
REGISTER /home/hadoop/lib/pig/piggybank.jar;
REGISTER /home/hadoop/lib/jackson-mapper-asl-1.9.9.jar;
REGISTER /home/hadoop/lib/jackson-core-asl-1.9.9.jar;
REGISTER /home/hadoop/lib/json-simple-1.1.1.jar;

6. Load the source data you previously stored in your bucket:

data = LOAD 's3://your-bucket/hockey_stats/input/*.txt' USING PigStorage('\t') AS (name:chararray,team:chararray,games_played:int,points:int);

7. Store the data into your bucket using the AvroStorage handler:

STORE data INTO 's3://your-bucket/avro/output/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema_file','s3://your-bucket/hockey_stats/schemas/top_nhl_scorers.avro');

8. To read Avro data, you can use the same AvroStorage handler:

avro_data = LOAD 's3://your-bucket/avro/output/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
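If you want to confirm from your local machine that the Avro files were written to the output location, a quick listing with the AWS CLI is enough. This check is not part of the original procedure; the bucket and prefix are the placeholder values used above.

aws s3 ls s3://your-bucket/avro/output/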


Analyze Elastic Load Balancing Log Data

Amazon EMR is ideal for processing large volumes of log files. It is especially useful for services such as Elastic Load Balancing, which can produce large amounts of data in log files stored in Amazon S3. Because you can leverage the built-in connectivity between Amazon EMR and Amazon S3, you can load log data directly into Hadoop jobs running in an Amazon EMR cluster. Furthermore, your cluster can scale with your data; the code and queries used to process 100 GB of log files continue to work as your data grows to 100 TB, simply by adding more capacity to your cluster. To learn more about processing Elastic Load Balancing logs in Amazon EMR, proceed to Tutorial: Query Elastic Load Balancing Access Logs with Amazon Elastic MapReduce (p. 428).

Tutorial: Query Elastic Load Balancing Access Logs with Amazon Elastic MapReduce

This tutorial demonstrates how to use Amazon EMR to query and analyze access logs generated by Elastic Load Balancing using Hive. These logs deliver detailed information about all requests made through the Elastic Load Balancing service. For example, access logs contain information such as request IP address, name of the load balancer, time the request was received, and so on. For more information, see Access Logs in the Elastic Load Balancing Developer Guide. Sample log files are provided for HTTP requests. Both log file formats look like the following examples.

HTTP:

2014-03-04T02:20:21.212932Z my-elb 192.22.123.169:21029 172.16.93.0:80 0.000066 0.000651 0.000044 404 404 0 315 "GET http://example.com:80/index2.html HTTP/1.1"

2014-03-04T02:20:21.273241Z my-elb 192.22.127.225:28376 172.16.93.1:80 0.000047 0.000701 0.000032 200 200 0 1085 "GET http://example.com:80/index.html HTTP/1.1"

2014-03-04T02:20:21.444392Z my-elb 192.22.165.43:13503 172.16.93.1:80 0.000067 0.00067 0.000053 200 200 0 1085 "GET http://example.com:80/index.html HTTP/1.1"

2014-03-04T02:20:21.977025Z my-elb 192.22.121.162:30815 172.16.93.1:80 0.00007 0.000867 0.000069 200 200 0 1097 "GET http://example.com:80/sample.html HTTP/1.1"

TCP:

2014-03-04T02:15:43.959433Z my-elb 192.22.81.240:24599 172.16.93.1:80 0.000497 0.000015 0.000017 - - 200 1362 "- - - "
2014-03-04T02:15:44.001637Z my-elb 192.22.152.221:27889 172.16.93.0:80 0.000574 0.000015 0.000017 - - 200 1362 "- - - "
2014-03-04T02:15:44.196433Z my-elb 192.22.165.251:49539 172.16.93.1:80 0.00053 0.000012 0.000017 - - 200 1362 "- - - "
2014-03-04T02:15:44.634838Z my-elb 192.22.32.224:57443 172.16.93.0:80 0.000587 0.000015 0.000017 - - 200 1362 "- - - "
2014-03-04T02:15:44.667070Z my-elb 192.22.68.165:33606 172.16.93.0:80 0.000506 0.000015 0.000017 - - 200 1350 "- - - "
2014-03-04T02:15:44.764904Z my-elb 192.22.250.2:32954 172.16.93.0:80 0.000483 0.000011 0.000018 - - 200 1350 "- - - "

A sample HiveQL (Hive Query Language) script is provided for you to use with this tutorial, or to download and modify for your own experimentation. For more information about Hive, go to https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

The instructions in this tutorial include how to:

· Sign up for an AWS account
· Create an Amazon EMR cluster, submit the Hive script, and process sample log data
· Perform queries on the table created from the sample data

In addition to the console used in this tutorial, Amazon EMR provides a command-line client, a REST-like API set, and several SDKs that you can use to launch and manage clusters. For more information about these interfaces, see What Tools are Available for Amazon EMR? (p. 10). For more information about the Amazon EMR service, see What is Amazon EMR? (p. 1).

Topics
· Sign Up for the Service (p. 429)
· Launch the Cluster and Run the Script (p. 430)
· Interactively Query the Data in Hive (p. 438)
· Using AWS Data Pipeline to Schedule Access Log Processing (p. 440)

Sign Up for the Service

If you do not have an AWS account, use the following procedure to create one.

To sign up for AWS

1. Open http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use. Your AWS account gives you access to all services, but you are charged only for the resources that you use. For this example walk-through, the charges will be minimal.


Launch the Cluster and Run the Script

The next step is to launch the cluster. This tutorial provides the steps to launch the cluster using the Amazon EMR console. When you launch the cluster, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These EC2 instances are preloaded with an Amazon Machine Image (AMI) that has been customized for Amazon EMR and that includes Hadoop and other big data applications. You can also launch an Amazon EMR cluster using the CLI. For more information about launching clusters with the CLI, see How to Call the Command Line Interface (p. 632).
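Although this tutorial uses the console, roughly the same cluster and Hive step can be created in one command with the AWS CLI. The following is a sketch only: the key pair name, log bucket, and output bucket are placeholders, and the option syntax (particularly the Hive step arguments) should be verified against the AWS CLI reference for Amazon EMR.

aws emr create-cluster --name "ELB access log analysis" \
  --ami-version 3.0.4 \
  --applications Name=Hive \
  --ec2-attributes KeyName=your-key-pair \
  --instance-type m1.large \
  --instance-count 3 \
  --log-uri s3://yourbucket/logs/ \
  --no-auto-terminate \
  --steps Type=HIVE,Name=HiveScript,ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/elb-access-logs/elb_access_log.q,-d,INPUT=s3://elasticmapreduce/samples/elb-access-logs/data,-d,OUTPUT=s3://yourbucket/folder]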

To launch a cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

Field Action

Cluster name Enter a descriptive name for your cluster.

The name is optional, and does not need to be unique.

Termination protection Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination (p. 503). Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

Logging This determines whether Amazon EMR captures detailed log data to Amazon S3.

For more information, see View Log Files (p. 449).



Log folder S3 location Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

For more information, see View Log Files (p. 449).

Debugging This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

4. In the Software Configuration section, verify the fields according to the following table.

Field Action

Hadoop distribution Choose Amazon. This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI version Choose 3.0.4 (Hadoop 2.2.0).

For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

5. In the Hardware Configuration section, verify the fields according to the following table.
Note
Twenty is the default maximum number of nodes per AWS region. For example, if you have two clusters running in the same region, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Field Action

Network Choose Launch into EC2-Classic.

Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

EC2 Availability Zone Choose No preference. Optionally, you can launch the cluster in a specific Amazon EC2 Availability Zone.

For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

Master Choose m1.large.

The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

This specifies the EC2 instance types to use as master nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses m1.large instances for all nodes.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Request Spot Instances Leave this box unchecked. This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).



Core Choose m1.large.

A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

This specifies the EC2 instance types to use as core nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

This tutorial uses m1.large instances for all nodes.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 2.

Request Spot Instances Leave this box unchecked. This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Task Choose m1.large.

Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity that your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

This specifies the EC2 instance types to use as task nodes. Valid types: m1.large (default), m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

For more information, see Instance Groups (p. 34). For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups (p. 515).

Count Choose 0.

Request Spot Instances Leave this box unchecked. This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

6. In the Security and Access section, complete the fields according to the following table.


Field Action

EC2 key pair Choose your Amazon EC2 key pair from the list.

For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 142).

Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster (p. 481).

IAM user access Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions (p. 144).

Alternatively, choose No other IAM users to restrict access to the current IAM user.

EMR role Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

Allows Amazon EMR to access other AWS services on your behalf.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

EC2 instance profile You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

This controls application access to the Amazon EC2 instances in the cluster.

For more information, see Configure IAM Roles for Amazon EMR (p. 150).

7. In the Bootstrap Actions section, you do not need to change any of these settings. In the Steps section, add the step for executing the Hive script. Under Add Step, select Hive program and click Configure and add.
8. In the Add Step window, complete the fields according to the following table and click Add.


Field Action

S3 Script location Put in the location for the sample script: s3://elasticmapreduce/samples/elb-access-logs/elb_access_log.q.

Input S3 location Choose the input location. In this case, you can use s3://elasticmapreduce/samples/elb-access-logs/data.

Output S3 location Choose the output location to be an Amazon S3 bucket under your account. For example, s3://yourbucket/folder. If the path that you specified does not exist, the bucket and path are created for you.

Arguments Do not enter anything.

Action on failure Choose Continue.

You can also download the script from https://elasticmapreduce.s3.amazonaws.com/samples/elb-access-logs/elb_access_log.q. A sample log file is also available for download.

The elb_access_log.q script looks like the following:

DROP TABLE elb_raw_access_logs;
CREATE EXTERNAL TABLE elb_raw_access_logs (
  Timestamp STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\"$"
)
LOCATION '${INPUT}';

select * from elb_raw_access_logs;

-- list of requests that resulted into error from backend
select RequestIP, count(RequestIP) from elb_raw_access_logs where BackendResponseCode<>200 group by RequestIP;

-- list of requests that were made on TCP instead of HTTP
select RequestIP, count(RequestIP) from elb_raw_access_logs where RequestVerb='-' and URL='-' and Protocol='- ' group by RequestIP;

-- list of ports on the backend that are being accessed by requests along with number of times they were requested
select BackendPort, count(BackendPort) from elb_raw_access_logs group by BackendPort;

-- list of ports other than 80 on the backend that were accessed by requests over tcp along with number of times they were requested
select BackendPort, count(BackendPort) from elb_raw_access_logs where RequestVerb='-' and URL='-' and Protocol='- ' and BackendPort<>80 group by BackendPort;

DROP TABLE elb_raw_access_logs_s3;
CREATE TABLE elb_raw_access_logs_s3 (
  Timestamp STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING
)
PARTITIONED BY(yyyy STRING, mm STRING, dd STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${OUTPUT}/hive-partitioned-logs';

SET YYYY=2014;
SET MM=03;
SET DD=04;

INSERT OVERWRITE TABLE elb_raw_access_logs_s3 partition(yyyy=${hiveconf:YYYY},mm=${hiveconf:MM},dd=${hiveconf:DD}) SELECT * FROM elb_raw_access_logs;


The script creates a table by executing the following CREATE EXTERNAL TABLE Hive statement:

CREATE EXTERNAL TABLE elb_raw_access_logs (
  Timestamp STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING
)

This statement creates a table using log data stored in Amazon S3. It assigns the schema by explicitly listing the column names and their corresponding data types (for example, Timestamp STRING).

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\"$"
)

With this statement, when Hive parses the data, it can read either the HTTP or the TCP access logs produced by Elastic Load Balancing, using a regular expression serializer-deserializer (SerDe) built into Hive. Each capture group in the regular expression corresponds to one of the columns defined above.

LOCATION '${INPUT}';

This statement specifies the location from which Hive reads and parses the table data. Hive and other tools within Amazon EMR provide full interoperability with Amazon S3, and the ${INPUT} parameter is replaced with the input path you specified in the Input S3 location field above.

The following statement demonstrates how you would create a table whose partitions are defined by an Amazon S3 path by selecting data in an existing table (for example, elb_raw_access_logs):

CREATE TABLE elb_raw_access_logs_s3 (
  Timestamp STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING
)
PARTITIONED BY(yyyy STRING, mm STRING, dd STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${OUTPUT}/hive-partitioned-logs';

SET YYYY=2014;
SET MM=03;
SET DD=04;

INSERT OVERWRITE TABLE elb_raw_access_logs_s3 partition(yyyy=${hiveconf:YYYY},mm=${hiveconf:MM},dd=${hiveconf:DD}) SELECT * FROM elb_raw_access_logs;

For more information on the queries included in the script, see Interactively Query the Data in Hive (p. 438).
9. In the Auto-terminate section, choose No to keep the cluster running when the script step has successfully completed.
10. Review your configuration and, if you are satisfied with the settings, click Create Cluster.
11. When the cluster starts, the console displays the Cluster Details page.
12. When the Hive program step completes successfully, you can check the output of the script by examining the contents of the Amazon S3 output bucket that you specified in Step 8 (p. 434). The contents of the elb_raw_access_logs_s3 table should be stored in that bucket following the schema provided (an example listing command follows these steps).
13. Terminate your cluster, or optionally proceed to Interactively Query the Data in Hive (p. 438).
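The check in step 12 can also be performed from the command line. The following AWS CLI command is a sketch that lists the partitioned output written by the script; the bucket and folder are the placeholder values used earlier in this tutorial.

aws s3 ls s3://yourbucket/folder/hive-partitioned-logs/ --recursive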

Interactively Query the Data in Hive

You may want to read in data and prototype Hive queries interactively before committing them to a script. The following tutorial shows you how to query the table that you already created by running the elb_access_log.q script.

To use the Hive shell for interactive queries

1. Log in to your cluster by using SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 481).
Note
To log in to the master node, you must have selected an Amazon EC2 key pair when creating the cluster.
2. Type hive at the master node shell prompt to launch the Hive shell.
3. Use the following command to verify that the elb_raw_access_logs_s3 and elb_raw_access_logs tables exist:

hive> SHOW TABLES;
OK
elb_raw_access_logs
elb_raw_access_logs_s3
Time taken: 4.151 seconds, Fetched: 2 row(s)

4. Try some interactive sample queries.

This statement selects the first ten elements from the table elb_raw_access_logs.

hive> SELECT * FROM elb_raw_access_logs LIMIT 10;
OK
2014-03-04T02:20:28.947447Z my-elb 192.22.198.254 56657 172.16.93.1 80 5.0E-5 7.43E-4 3.3E-5 200 200 0 1080 GET http://example.com:80/index.html HTTP/1.1
2014-03-04T02:20:29.090278Z my-elb 192.22.70.68 40378 172.16.93.0 80 3.4E-5 6.85E-4 3.4E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:29.390016Z my-elb 192.22.208.34 6430 172.16.93.1 80 5.1E-5 7.02E-4 3.3E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:29.866223Z my-elb 192.22.143.211 42402 172.16.93.1 80 6.9E-5 9.9E-4 5.1E-5 200 200 0 1080 GET http://example.com:80/index.html HTTP/1.1
2014-03-04T02:20:30.192116Z my-elb 192.22.126.198 57548 172.16.93.0 80 5.5E-5 8.12E-4 3.5E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:30.202754Z my-elb 192.22.77.50 11990 172.16.93.0 80 3.8E-5 6.58E-4 3.3E-5 200 200 0 1080 GET http://example.com:80/index.html HTTP/1.1
2014-03-04T02:20:30.260543Z my-elb 192.22.42.185 24558 172.16.93.1 80 7.9E-5 7.39E-4 8.3E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:30.356423Z my-elb 192.22.200.216 26638 172.16.93.1 80 5.9E-5 6.75E-4 3.5E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:30.748524Z my-elb 192.22.214.99 49139 172.16.93.1 80 6.1E-5 0.001032 3.6E-5 200 200 0 1092 GET http://example.com:80/sample.html HTTP/1.1
2014-03-04T02:20:30.839575Z my-elb 192.22.167.125 49832 172.16.93.0 80 5.0E-5 9.27E-4 3.1E-5 200 200 0 1080 GET http://example.com:80/index.html HTTP/1.1
Time taken: 0.855 seconds, Fetched: 10 row(s)

This statement lists the first ten IP addresses whose requests returned an error along with an error count.

hive> SELECT RequestIP, COUNT(RequestIP) FROM elb_raw_access_logs WHERE BackendResponseCode<>200 GROUP BY RequestIP LIMIT 10;
OK
192.22.0.171    1
192.22.0.251    1
192.22.100.114  1
192.22.100.255  1
192.22.102.26   1
192.22.103.7    1
192.22.104.196  1
192.22.106.163  1
192.22.108.18   1
192.22.109.199  1
Time taken: 32.991 seconds, Fetched: 10 row(s)

This statement lists the backend ports that received requests and counts for each. Because the sample data is only HTTP, the only port is 80.


hive> SELECT BackendPort, COUNT(BackendPort) FROM elb_raw_access_logs GROUP BY BackendPort LIMIT 10;
OK
80      5122
Time taken: 32.1 seconds, Fetched: 1 row(s)

This statement lists the URLs that gave a response status other than 200, the response code, the backend IP, and the count for each.

hive> SELECT COUNT(*) AS CountOfNon200, BackendIP, BackendResponseCode, URL FROM elb_raw_access_logs WHERE BackendResponseCode <> 200 GROUP BY BackendIP, BackendResponseCode, URL ORDER BY CountOfNon200 DESC;
OK
58      172.16.93.0     404     http://example.com:80/index2.html
52      172.16.93.1     404     http://example.com:80/index2.html
10      172.16.93.0     304     http://example.com:80/sample.html
10      172.16.93.0     304     http://example.com:80/index.html
5       172.16.93.1     304     http://example.com:80/sample.html
2       172.16.93.1     304     http://example.com:80/index.html
Time taken: 93.57 seconds, Fetched: 6 row(s)

5. Exit the Hive shell and the master node shell by pressing Control+D twice.
6. Important
Remember to terminate your cluster to avoid additional charges.

Using AWS Data Pipeline to Schedule Access Log Processing

If you are routinely running a Hive script or performing other processing with Amazon EMR, it is advantageous to schedule this process. This type of ETL (extract, transform, and load) work is ideally handled by AWS Data Pipeline, which can orchestrate the processing and movement of data between many different AWS services.

The tutorial Getting Started: Processing Access Logs with Amazon EMR and Hive demonstrates how to process web server access logs using AWS Data Pipeline. You can adapt the pipeline in that tutorial for use with Elastic Load Balancing access logs by making a few modifications; use the Hive script, elb_access_log.q, and the input data provided in this tutorial.

For more information about how to use AWS Data Pipeline, see the Getting Started: Processing Access Logs with Amazon EMR and Hive tutorial and the AWS Data Pipeline Developer Guide.


Manage Clusters

After you've launched your cluster, you can monitor and manage it. Amazon EMR provides several tools you can use to connect to and control your cluster.

Topics
· View and Monitor a Cluster (p. 441)
· Connect to the Cluster (p. 481)
· Control Cluster Termination (p. 499)
· Resize a Running Cluster (p. 508)
· Cloning a Cluster Using the Console (p. 516)
· Submit Work to a Cluster (p. 517)
· Associate an Elastic IP Address with a Cluster (p. 526)
· Automate Recurring Clusters with AWS Data Pipeline (p. 529)

View and Monitor a Cluster

Amazon EMR provides several tools you can use to gather information about your cluster. You can access information about the cluster from the console, the CLI, or programmatically. The standard Hadoop web interfaces and log files are available on the master node. You can also use monitoring services such as CloudWatch and Ganglia to track the performance of your cluster.

Topics
· View Cluster Details (p. 442)
· View Log Files (p. 449)
· View Cluster Instances in Amazon EC2 (p. 454)
· Monitor Metrics with CloudWatch (p. 455)
· Logging Amazon Elastic MapReduce API Calls in AWS CloudTrail (p. 471)
· Monitor Performance with Ganglia (p. 472)


View Cluster Details

After you start a cluster, you can monitor its status and retrieve extended information about its execution. This section describes the methods used to view the details of Amazon EMR clusters. You can view clusters in any state.

This procedure explains how to view the details of a cluster using the Amazon EMR console.

To view cluster details using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. You have the option to view details in the Cluster List page. Select the arrow icon next to a cluster to expand the row view and display details and actions for the cluster you chose:

3. For more details, click the cluster link to open the Cluster Details page. The Summary section displays detailed information about the selected cluster.


The following examples demonstrate how to retrieve cluster details using the AWS CLI and the Amazon EMR CLI.

To view cluster details using the AWS CLI

Use the describe-cluster command to view cluster-level details including status, hardware and software configuration, VPC settings, bootstrap actions, instance groups and so on.

· To use the describe-cluster command, you need the cluster ID. The cluster ID can be retrieved using the list-clusters command. Type the following command to view cluster details:

aws emr describe-cluster --cluster-id string

For example:

aws emr describe-cluster --cluster-id j-1K48XXXXXXHCB

The output is similar to the following:

{ "Cluster": { "Status": { "Timeline": { "ReadyDateTime": 1413102659.072, "EndDateTime": 1413103872.89, "CreationDateTime": 1413102441.707 }, "State": "TERMINATED", "StateChangeReason": { "Message": "Terminated by user request",


"Code": "USER_REQUEST" } }, "Ec2InstanceAttributes": { "Ec2AvailabilityZone": "us-west-2a" }, "Name": "Development Cluster", "Tags": [], "TerminationProtected": false, "RunningAmiVersion": "3.1.0", "NormalizedInstanceHours": 24, "InstanceGroups": [ { "RequestedInstanceCount": 2, "Status": { "Timeline": { "ReadyDateTime": 1413102659.09, "EndDateTime": 1413103872.779, "CreationDateTime": 1413102441.708 }, "State": "TERMINATED", "StateChangeReason": { "Message": "Job flow terminated", "Code": "CLUSTER_TERMINATED" } }, "Name": "CORE", "InstanceGroupType": "CORE", "InstanceType": "m3.xlarge", "Id": "ig-115XXXXXX52SX", "Market": "ON_DEMAND", "RunningInstanceCount": 0 }, { "RequestedInstanceCount": 1, "Status": { "Timeline": { "ReadyDateTime": 1413102655.968, "EndDateTime": 1413103872.779, "CreationDateTime": 1413102441.708 }, "State": "TERMINATED", "StateChangeReason": { "Message": "Job flow terminated", "Code": "CLUSTER_TERMINATED" } }, "Name": "MASTER", "InstanceGroupType": "MASTER", "InstanceType": "m3.xlarge", "Id": "ig-26LXXXXXXFCXQ", "Market": "ON_DEMAND", "RunningInstanceCount": 0 } ], "Applications": [ { "Version": "2.4.0",


"Name": "hadoop" } ], "MasterPublicDnsName": "ec2-XX-XX-XXX-XX.us-west-2.compute.amazon aws.com ", "VisibleToAllUsers": true, "BootstrapActions": [ { "Args": [], "Name": "Install Ganglia", "ScriptPath": "s3://us-west-2.elasticmapreduce/bootstrap- actions /install-ganglia" } ], "RequestedAmiVersion": "3.1.0", "AutoTerminate": false, "Id": "j-Z2OXXXXXXI45" } }

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
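If you only need one or two fields from this output, the AWS CLI's global --query option (a JMESPath expression) can trim it down. The following sketch returns just the cluster state for the hypothetical cluster ID used above:

aws emr describe-cluster --cluster-id j-1K48XXXXXXHCB --query 'Cluster.Status.State' --output text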

To list clusters by creation date using the AWS CLI

To list clusters by creation date, type the list-clusters command with the --created-after and --created-before parameters.

· To list clusters created between 10-09-2014 and 10-12-2014, type the following command:

aws emr list-clusters --created-after timestamp --created-before timestamp

For example:

aws emr list-clusters --created-after 2014-10-09T00:12:00 --created-before 2014-10-12T00:12:00

The output lists all clusters created between the specified timestamps.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To list clusters by state using the AWS CLI

To list clusters by state, type the list-clusters command with the --cluster-states parameter. Valid cluster states include: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, TERMINATED, and TERMINATED_WITH_ERRORS.

You can also use several shortcut parameters to view clusters. The --active parameter filters clusters in the STARTING, BOOTSTRAPPING, RUNNING, WAITING, or TERMINATING states. The --terminated parameter filters clusters in the TERMINATED state. The --failed parameter filters clusters in the TERMINATED_WITH_ERRORS state.


· Type the following command to list all TERMINATED clusters:

aws emr list-clusters --cluster-states string

For example:

aws emr list-clusters --cluster-states TERMINATED

Or:

aws emr list-clusters --terminated

The output lists all clusters in the state specified.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To list clusters created in the last two days using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --list parameter with no additional arguments to display clusters created during the last two days as follows:

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --list

· Windows users:

ruby elastic-mapreduce --list

The response is similar to the following:

j-1YE2DN7RXJBWU     FAILED                                                    Example Job Flow
   CANCELLED        Custom Jar
j-3GJ4FRRNKGY97     COMPLETED   ec2-67-202-3-73.compute-1.amazonaws.com       Example cluster
j-5XXFIQS8PFNW      COMPLETED   ec2-67-202-51-30.compute-1.amazonaws.com      demo 3/24 s1
   COMPLETED        Custom Jar

The example response shows that three clusters were created in the last two days. The indented lines are the steps of the cluster. The information for a cluster is in the following order: the cluster ID, the cluster state, the DNS name of the master node, and the cluster name. The information for a cluster step is in the following order: step state and step name.

If no clusters were created in the previous two days, this command produces no output.

To list active clusters using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --list and --active parameters as follows:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --list --active

· Windows users:

ruby elastic-mapreduce --list --active

The response lists clusters that are in the state of STARTING, RUNNING, or SHUTTING_DOWN.

To list only running or terminated clusters using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --state parameter as follows:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --list --state RUNNING --state TERMINATED

· Windows users:

ruby elastic-mapreduce --list --state RUNNING --state TERMINATED

The response lists clusters that are running or terminated.

To view information about a cluster using the Amazon EMR CLI

You can view information about a cluster using the --describe parameter with the cluster ID.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· Use the --describe parameter with a valid cluster ID.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --describe --jobflow JobFlowID

· Windows users:

ruby elastic-mapreduce --describe --jobflow JobFlowID

The response looks similar to the following:

{ "JobFlows": [ { "Name": "Development Job Flow (requires manual termination)", "LogUri": "s3n:\/\/AKIAIOSFODNN7EXAMPLE\/FileName\/", "ExecutionStatusDetail": { "StartDateTime": null, "EndDateTime": null, "LastStateChangeReason": "Starting instances", "CreationDateTime": DateTimeStamp, "State": "STARTING", "ReadyDateTime": null }, "Steps": [], "Instances": { "MasterInstanceId": null, "Ec2KeyName": "KeyName", "NormalizedInstanceHours": 0, "InstanceCount": 5, "Placement": { "AvailabilityZone": "us-east-1a" }, "SlaveInstanceType": "m1.small", "HadoopVersion": "0.20", "MasterPublicDnsName": null, "KeepJobFlowAliveWhenNoSteps": true, "InstanceGroups": [ { "StartDateTime": null, "SpotPrice": null, "Name": "Master Instance Group", "InstanceRole": "MASTER", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING", "Market": "ON_DEMAND", "ReadyDateTime": null,


"InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 1 }, { "StartDateTime": null, "SpotPrice": null, "Name": "Task Instance Group", "InstanceRole": "TASK", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING", "Market": "ON_DEMAND", "ReadyDateTime": null, "InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 2 }, { "StartDateTime": null, "SpotPrice": null, "Name": "Core Instance Group", "InstanceRole": "CORE", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING", "Market": "ON_DEMAND", "ReadyDateTime": null, "InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 2 } ], "MasterInstanceType": "m1.small" }, "bootstrapActions": [], "JobFlowId": "JobFlowID" } ] }

For more information about the input parameters unique to DescribeJobFlows, see DescribeJobFlows.

View Log Files

Amazon EMR and Hadoop both produce log files that report status on the cluster. By default, these are written to the master node in the /mnt/var/log/ directory. Depending on how you configured your cluster when you launched it, these logs may also be archived to Amazon S3 and may be viewable through the graphical debugging tool. For more information, see Configure Logging and Debugging (Optional) (p. 160).


There are many types of logs written to the master node. Amazon EMR writes step, bootstrap action, and instance state logs. Apache Hadoop writes logs to report the processing of jobs, tasks, and task attempts. Hadoop also records logs of its daemons. For more information about the logs written by Hadoop, go to http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html.

Topics
· View Log Files on the Master Node (p. 450)
· View Log Files Archived to Amazon S3 (p. 451)
· View Log Files in the Debugging Tool (p. 452)

View Log Files on the Master Node

The following table lists some of the log files you'll find on the master node.

Location Description

/mnt/var/log/bootstrap-actions Logs written during the processing of the bootstrap actions.

/mnt/var/log/hadoop-state-pusher Logs written by the Hadoop state pusher process.

/mnt/var/log/instance-controller Instance controller logs.

/mnt/var/log/instance-state Instance state logs. These contain information about the CPU, memory state, and garbage collector threads of the node.

/mnt/var/log/service-nanny Logs written by the service nanny process.

/mnt/var/log/hadoop Hadoop logs, such as those written by the jobtracker and namenode processes.

/mnt/var/log/hadoop/steps/N Step logs that contain information about the processing of the step. The value of N indicates the stepId assigned by Amazon EMR. For example, a cluster has two steps: s-1234ABCDEFGH and s-5678IJKLMNOP. The first step is located in /mnt/var/log/hadoop/steps/s-1234ABCDEFGH/ and the second step in /mnt/var/log/hadoop/steps/s-5678IJKLMNOP/.

The step logs written by Amazon EMR are as follows.

· controller - Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
· syslog - Describes the execution of Hadoop jobs in the step.
· stderr - The standard error channel of Hadoop while it processes the step.
· stdout - The standard output channel of Hadoop while it processes the step.


To view log files on the master node

1. Use SSH to connect to the master node as described in Connect to the Master Node Using SSH (p. 481).
2. Navigate to the directory that contains the log file information you wish to view. The preceding table gives a list of the types of log files that are available and where you will find them. The following example shows the command for navigating to the step log with an ID, s-1234ABCDEFGH.

cd /mnt/var/log/hadoop/steps/s-1234ABCDEFGH/

3. Use a text editor installed on the master node to view the contents of the log file. There are several you can choose from: vi, nano, and emacs. The following example shows how to open the controller step log using the nano text editor.

nano controller
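Once you are in a step's log directory, ordinary shell tools are often the quickest way to find a failure. The following sketch (using the same example step ID) searches the step's syslog for exceptions; the search term is only an illustration.

cd /mnt/var/log/hadoop/steps/s-1234ABCDEFGH/
grep -i exception syslog | tail -n 20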

View Log Files Archived to Amazon S3

Amazon EMR does not automatically archive log files to Amazon S3. You must configure this when you launch the cluster. For more information, see Configure Logging and Debugging (Optional) (p. 160).

When Amazon EMR is configured to archive log files to Amazon S3, it stores the files in the S3 location you specified, in the /JobFlowId/ folder, where JobFlowId is the cluster identifier.

The following table lists some of the log files you'll find on Amazon S3.

Location Description

/JobFlowId/daemons/ Logs written by Hadoop daemons, such as datanode and tasktracker. The logs for each node are stored in a folder labeled with the identifier of the EC2 instance of that node.

/JobFlowId/jobs/ Job logs and the configuration XML file for each Hadoop job.

/JobFlowId/node/ Node logs, including bootstrap action logs for the node. The logs for each node are stored in a folder labeled with the identifier of the EC2 instance of that node.



/JobFlowId/steps/N/ Step logs that contain information about the processing of the step. The value of N indicates the stepId assigned by Amazon EMR. For example, a cluster has two steps: s-1234ABCDEFGH and s-5678IJKLMNOP. The first step is located in /JobFlowId/steps/s-1234ABCDEFGH/ and the second step in /JobFlowId/steps/s-5678IJKLMNOP/.

The step logs written by Amazon EMR are as follows.

· controller - Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
· syslog - Describes the execution of Hadoop jobs in the step.
· stderr - The standard error channel of Hadoop while it processes the step.
· stdout - The standard output channel of Hadoop while it processes the step.

/JobFlowId/task-attempts/ Task attempt logs. The logs for each task attempt are stored in a folder labeled with the identifier of the corresponding job.

To view log files archived to Amazon S3 using the console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Open the S3 bucket you specified when you configured the cluster to archive log files in Amazon S3.
3. Navigate to the log file containing the information to display. The preceding table gives a list of the types of log files that are available and where you will find them.
4. Double-click on a log file to view it in the browser.

If you don't want to view the log files in the Amazon S3 console, you can download the files from Amazon S3 to your local machine using a tool such as the Amazon S3 Organizer plug-in for the Firefox web browser, or by writing an application to retrieve the objects from Amazon S3. For more information, see Getting Objects in the Amazon Simple Storage Service Developer Guide.
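The AWS CLI can also download the archived logs. The following sketch lists and copies the step logs for a cluster; the bucket name and cluster identifier are placeholders matching the layout described in the table above.

aws s3 ls s3://mybucket/logs/j-XXXXXXXXXXXXX/steps/ --recursive
aws s3 cp s3://mybucket/logs/j-XXXXXXXXXXXXX/steps/ ./step-logs --recursive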

View Log Files in the Debugging Tool

Amazon EMR does not automatically enable the debugging tool. You must configure this when you launch the cluster. For more information, see Configure Logging and Debugging (Optional) (p. 160).

To view cluster logs using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. From the Cluster List page, click the details icon next to the cluster you want to view.


This brings up the Cluster Details page. In the Steps section, the links to the right of each step display the various types of logs available for the step. These logs are generated by Amazon EMR.
3. To view a list of the Hadoop jobs associated with a given step, click the View Jobs link to the right of the step.

4. To view a list of the Hadoop tasks associated with a given job, click the View Tasks link to the right of the job.

5. To view a list of the attempts a given task has run while trying to complete, click the View Attempts link to the right of the task.


6. To view the logs generated by a task attempt, click the stderr, stdout, and syslog links to the right of the task attempt.

The debugging tool displays links to the log files after Amazon EMR uploads the log files to your bucket on Amazon S3. Because log files are uploaded to Amazon S3 every 5 minutes, it can take a few minutes for the log file uploads to complete after the step completes.

Amazon EMR periodically updates the status of Hadoop jobs, tasks, and task attempts in the debugging tool. You can click Refresh List in the debugging panes to get the most up-to-date status of these items.

View Cluster Instances in Amazon EC2

To help you manage your resources, Amazon EC2 allows you to assign metadata to resources in the form of tags. Each Amazon EC2 tag consists of a key and a value. Tags allow you to categorize your Amazon EC2 resources in different ways: for example, by purpose, owner, or environment.

You can search and filter resources based on the tags. The tags assigned using your AWS account are available only to you. Other accounts sharing the resource cannot view your tags.


Amazon EMR automatically tags each EC2 instance it launches with key-value pairs that identify the cluster and the instance group to which the instance belongs. This makes it easy to filter your EC2 instances to show, for example, only those instances belonging to a particular cluster or to show all of the currently running instances in the task-instance group. This is especially useful if you are running several clusters concurrently or managing large numbers of EC2 instances.

These are the predefined key-value pairs that Amazon EMR assigns:

Key: aws:elasticmapreduce:job-flow-id
Value: The ID of the cluster the instance is provisioned for. It appears in the format j-XXXXXXXXXXXXX.

Key: aws:elasticmapreduce:instance-group-role
Value: One of the following values: master, core, or task. These values correspond to the master instance group, core instance group, and task instance group.

You can view and filter on the tags that Amazon EMR adds. For more information, see Using Tags in the Amazon EC2 User Guide for Linux Instances. Because the tags set by Amazon EMR are system tags and cannot be edited or deleted, the sections on displaying and filtering tags are the most relevant.
Note
Amazon EMR adds tags to the EC2 instance when its status is updated to running. If there's a latency period between the time the EC2 instance is provisioned and the time its status is set to running, the tags set by Amazon EMR do not appear until the instance starts. If you don't see the tags, wait for a few minutes and refresh the view.
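For example, you can filter instances by these system tags from the AWS CLI. The following sketch lists the instance IDs for a hypothetical cluster identifier; substitute your own cluster ID.

aws ec2 describe-instances \
  --filters "Name=tag:aws:elasticmapreduce:job-flow-id,Values=j-XXXXXXXXXXXXX" \
  --query 'Reservations[].Instances[].InstanceId' --output text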

Monitor Metrics with CloudWatch

When you're running a cluster, you often want to track its progress and health. Amazon EMR records metrics that can help you monitor your cluster. It makes these metrics available in the Amazon EMR console and in the CloudWatch console, where you can track them with your other AWS metrics. In CloudWatch, you can set alarms to warn you if a metric goes outside parameters you specify.

Metrics are updated every five minutes. This interval is not configurable. Metrics are archived for two weeks; after that period, the data is discarded.

These metrics are automatically collected and pushed to CloudWatch for every Amazon EMR cluster. There is no charge for the Amazon EMR metrics reported in CloudWatch; they are provided as part of the Amazon EMR service.
Note
Viewing Amazon EMR metrics in CloudWatch is supported only for clusters launched with AMI 2.0.3 or later and running Hadoop 0.20.205 or later. For more information about selecting the AMI version for your cluster, see Choose an Amazon Machine Image (AMI) (p. 53).

How Do I Use Amazon EMR Metrics?

The metrics reported by Amazon EMR provide information that you can analyze in different ways. The table below shows some common uses for the metrics. These are suggestions to get you started, not a comprehensive list. For the complete list of metrics reported by Amazon EMR, see Metrics Reported by Amazon EMR in CloudWatch (p. 462).


How do I? Relevant Metrics

Track the progress of my cluster Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.

Detect clusters that are idle The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.

Detect when a node runs out of storage The HDFSUtilization metric is the percentage of disk space currently used. If this rises above an acceptable level for your application, such as 80% of capacity used, you may need to resize your cluster and add more core nodes.

Access CloudWatch Metrics

There are many ways to access the metrics that Amazon EMR pushes to CloudWatch. You can view them through either the Amazon EMR console or the CloudWatch console, or you can retrieve them using the CloudWatch CLI or the CloudWatch API. The following procedures show you how to access the metrics using these various tools.
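For instance, with the AWS CLI you can list a cluster's metrics and pull recent data points without opening a console. The commands below are a sketch; the cluster ID and time range are placeholders.

aws cloudwatch list-metrics --namespace AWS/ElasticMapReduce \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX

aws cloudwatch get-metric-statistics --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --start-time 2014-10-12T00:00:00 --end-time 2014-10-12T01:00:00 \
  --period 300 --statistics Average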

To view metrics in the Amazon EMR console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. To view metrics for a cluster, click a cluster to display the Summary pane.
3. Select the Monitoring tab to view information about that cluster. Click any one of the tabs named Cluster Status, Map/Reduce, Node Status, IO, or HBase to load the reports about the progress and health of the cluster.
4. After you choose a metric to view, click the Time range field to filter the metrics to a specific time frame.


To view metrics in the CloudWatch console

1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
2. In the navigation pane, click EMR.
3. Scroll down to the metric to graph. You can search on the cluster identifier of the cluster to monitor.


4. Click a metric to display the graph.


To access metrics from the CloudWatch CLI

· Call mon-get-stats. For more information, see the Amazon CloudWatch Developer Guide.

To access metrics from the CloudWatch API

· Call GetMetricStatistics. For more information, see Amazon CloudWatch API Reference.

Setting Alarms on Metrics

Amazon EMR pushes metrics to CloudWatch, which means you can use CloudWatch to set alarms on your Amazon EMR metrics. You can, for example, configure an alarm in CloudWatch to send you an email any time the HDFS utilization rises above 80%.
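As a command-line sketch of that example, the following AWS CLI call creates such an alarm on the HDFSUtilization metric. The alarm name, cluster ID, and SNS topic ARN are placeholders, and the notification topic must already exist.

aws cloudwatch put-metric-alarm \
  --alarm-name EMR-HDFS-over-80 \
  --namespace AWS/ElasticMapReduce \
  --metric-name HDFSUtilization \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alarm-topic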


The following topics give you a high-level overview of how to set alarms using CloudWatch. For detailed instructions, see Using CloudWatch in the Amazon CloudWatch Developer Guide.

Set alarms using the CloudWatch console

1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
2. Click the Create Alarm button. This launches the Create Alarm Wizard.

3. Click EMR Metrics and scroll through the Amazon EMR metrics to locate the metric you want to place an alarm on. An easy way to display just the Amazon EMR metrics in this dialog box is to search on the cluster identifier of your cluster. Select the metric to create an alarm on and click Next.

4. Fill in the Name, Description, Threshold, and Time values for the metric.


5. If you want CloudWatch to send you an email when the alarm state is reached, in the Whenever this alarm: field, choose State is ALARM. In the Send notification to: field, choose an existing SNS topic. If you select Create topic, you can set the name and email addresses for a new email subscription list. This list is saved and appears in the field for future alarms.
Note If you use Create topic to create a new Amazon SNS topic, the email addresses must be verified before they receive notifications. Emails are only sent when the alarm enters an alarm state. If this alarm state change happens before the email addresses are verified, they do not receive a notification.

6. At this point, the Define Alarm screen gives you a chance to review the alarm you're about to create. Click Create Alarm.

Note For more information about how to set alarms using the CloudWatch console, see Create an Alarm that Sends Email in the Amazon CloudWatch Developer Guide.


To set an alarm using the CloudWatch CLI

· Call mon-put-metric-alarm. For more information, see the Amazon CloudWatch Developer Guide.

To set an alarm using the CloudWatch API

· Call PutMetricAlarm. For more information, see the Amazon CloudWatch API Reference.
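As an illustration of the earlier example (an email when HDFS utilization rises above 80%), the following AWS CLI command calls PutMetricAlarm to create such an alarm. This is a minimal sketch: the alarm name, the SNS topic ARN, the cluster ID, and the JobFlowId dimension name are placeholders or assumptions, not values from this guide.

aws cloudwatch put-metric-alarm --alarm-name "EMR HDFS over 80 percent" \
--namespace AWS/ElasticMapReduce --metric-name HDFSUtilization \
--dimensions Name=JobFlowId,Value=j-2AL4XXXXXX5T9 \
--statistic Average --period 300 --evaluation-periods 1 \
--threshold 80 --comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-emr-alerts

The SNS topic must already exist and have a confirmed email subscription for the notification to be delivered.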

Metrics Reported by Amazon EMR in CloudWatch

The following table lists all of the metrics that Amazon EMR reports in the console and pushes to CloudWatch.

Amazon EMR Metrics for Hadoop 1 AMIs

Amazon EMR sends data for several metrics to CloudWatch. All Amazon EMR clusters automatically send metrics in five-minute intervals. Metrics are archived for two weeks; after that period, the data is discarded.
Note Amazon EMR pulls metrics from a cluster. If a cluster becomes unreachable, no metrics are reported until the cluster becomes available again.

Metric Description

Cluster Status

Is Idle? Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use case: Monitor cluster performance

Units: Boolean

Jobs Running The number of jobs in the cluster that are currently running.

Use case: Monitor cluster health

Units: Count

Jobs Failed The number of jobs in the cluster that have failed.

Use case: Monitor cluster health

Units: Count

Map/Reduce


Metric Description

Map Tasks Running The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Map Tasks Remaining The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. A remaining map task is one that is not in any of the following states: Running, Killed, or Completed.

Use case: Monitor cluster progress

Units: Count

Map Slots Open The unused map task capacity. This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster.

Use case: Analyze cluster performance

Units: Count

Remaining Map Tasks Per Slot The ratio of the total map tasks remaining to the total map slots available in the cluster.

Use case: Analyze cluster performance

Units: Ratio

Reduce Tasks Running The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Reduce Tasks Remaining The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Reduce Slots Open Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given cluster, less the number of reduce tasks currently running in that cluster.

Use case: Analyze cluster performance

Units: Count

Node Status


Metric Description

Core Nodes Running The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Core Nodes Pending The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Data Nodes The percentage of data nodes that are receiving work from Hadoop.

Use case: Monitor cluster health

Units: Percent

Task Nodes Running The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Task Nodes Pending The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Task Trackers The percentage of task trackers that are functional.

Use case: Monitor cluster health

Units: Percent

IO

S3 Bytes Written The number of bytes written to Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

S3 Bytes Read The number of bytes read from Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes


Metric Description

HDFS Utilization The percentage of HDFS storage currently used.

Use case: Analyze cluster performance

Units: Percent

HDFS Bytes Read The number of bytes read from HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Bytes Written The number of bytes written to HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

Missing Blocks The number of blocks in which HDFS has no replicas. These might be corrupt blocks.

Use case: Monitor cluster health

Units: Count

Total Load The total number of concurrent data transfers.

Use case: Monitor cluster health

Units: Count

HBase

Backup Failed Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Count

Most Recent Backup Duration The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes after the backup started. This metric is only reported for HBase clusters.

Use case: Monitor HBase Backups

Units: Minutes

Time Since Last Successful Backup The number of elapsed minutes after the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes


Amazon EMR Metrics for Hadoop 2 AMIs

The following metrics are available for Hadoop 2 AMIs:

Metric Description

Cluster Status

Is Idle? Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use case: Monitor cluster performance

Units: Boolean

Container Allocated The number of resource containers allocated by the ResourceManager.

Use case: Monitor cluster progress

Units: Count

Container Reserved The number of containers reserved.

Use case: Monitor cluster progress

Units: Count

Container Pending The number of containers in the queue that have not yet been allocated.

Use case: Monitor cluster progress

Units: Count

Apps Completed The number of applications submitted to YARN that have completed.

Use case: Monitor cluster progress

Units: Count

Apps Failed The number of applications submitted to YARN that have failed to complete.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count


Metric Description

Apps Killed The number of applications submitted to YARN that have been killed.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Apps Pending The number of applications submitted to YARN that are in a pending state.

Use case: Monitor cluster progress

Units: Count

Apps Running The number of applications submitted to YARN that are running.

Use case: Monitor cluster progress

Units: Count

Apps Submitted The number of applications submitted to YARN.

Use case: Monitor cluster progress

Units: Count

Node Status

Core Nodes Running The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Core Nodes Pending The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Data Nodes The percentage of data nodes that are receiving work from Hadoop.

Use case: Monitor cluster health

Units: Percent

MR Total Nodes The number of nodes presently available to MapReduce jobs.

Use case: Monitor cluster progress

Units: Count


Metric Description

MR Active Nodes The number of nodes presently running MapReduce tasks or jobs.

Use case: Monitor cluster progress

Units: Count

MR Lost Nodes The number of nodes allocated to MapReduce that have been marked in a LOST state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

MR Unhealthy Nodes The number of nodes available to MapReduce jobs marked in an UNHEALTHY state.

Use case: Monitor cluster progress

Units: Count

MR Decommissioned Nodes The number of nodes allocated to MapReduce applications that have been marked in a DECOMMISSIONED state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

MR Rebooted Nodes The number of nodes available to MapReduce that have been rebooted and marked in a REBOOTED state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

IO

S3 Bytes Written The number of bytes written to Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

S3 Bytes Read The number of bytes read from Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Utilization The percentage of HDFS storage currently used.

Use case: Analyze cluster performance

Units: Percent

HDFS Bytes Read The number of bytes read from HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes


Metric Description

HDFS Bytes Written The number of bytes written to HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

Missing Blocks The number of blocks in which HDFS has no replicas. These might be corrupt blocks.

Use case: Monitor cluster health

Units: Count

Total Load The total number of concurrent data transfers.

Use case: Monitor cluster health

Units: Count

Memory Total MB The total amount of memory in the cluster.

Use case: Monitor cluster progress

Units: Bytes

Memory Reserved MB The amount of memory reserved.

Use case: Monitor cluster progress

Units: Bytes

Memory Available MB The amount of memory available to be allocated.

Use case: Monitor cluster progress

Units: Bytes

Memory Allocated MB The amount of memory allocated to the cluster.

Use case: Monitor cluster progress

Units: Bytes

Pending Deletion Blocks The number of blocks marked for deletion.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Under Replicated Blocks The number of blocks that need to be replicated one or more times.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count


Metric Description

Dfs Pending Replication Blocks The status of block replication: blocks being replicated, age of replication requests, and unsuccessful replication requests.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Capacity Remaining GB The amount of remaining HDFS disk capacity.

Use case: Monitor cluster progress, Monitor cluster health

Units: Bytes

HBase

Backup Failed Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Count

Most Recent Backup Duration The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes after the backup started. This metric is only reported for HBase clusters.

Use case: Monitor HBase Backups

Units: Minutes

Time Since Last Successful Backup The number of elapsed minutes after the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes

Dimensions for Amazon EMR Metrics

Amazon EMR data can be filtered using any of the dimensions in the following table.

Dimension Description

ClusterId The identifier for a cluster. You can find this value by clicking on the cluster in the Amazon EMR console. It takes the form j-XXXXXXXXXXXXX.

JobId The identifier of a job within a cluster. You can use this to filter the metrics returned from a cluster down to those that apply to a single job within the cluster. JobId takes the form job_XXXXXXXXXXXX_XXXX.
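To see which metrics and dimensions a cluster is actually publishing, you can list them with the AWS CLI. The following is a minimal sketch; the JobFlowId dimension name and the cluster ID are assumptions you should replace with the dimension values shown for your cluster in the CloudWatch console.

aws cloudwatch list-metrics --namespace AWS/ElasticMapReduce \
--dimensions Name=JobFlowId,Value=j-2AL4XXXXXX5T9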

Logging Amazon Elastic MapReduce API Calls in AWS CloudTrail

Amazon Elastic MapReduce is integrated with AWS CloudTrail, a service that captures API calls made by or on behalf of your AWS account. This information is collected and written to log files that are stored in an Amazon S3 bucket that you specify. API calls are logged when you use the Amazon EMR API, the Amazon EMR console, a back-end console, or the AWS CLI. Using the information collected by CloudTrail, you can determine what request was made to Amazon EMR, the source IP address the request was made from, who made the request, when it was made, and so on.

To learn more about CloudTrail, including how to configure and enable it, see the AWS CloudTrail User Guide.

Topics · Amazon EMR Information in CloudTrail (p. 471) · Understanding Amazon EMR Log File Entries (p. 471)

Amazon EMR Information in CloudTrail

If CloudTrail logging is turned on, calls made to all Amazon EMR actions are captured in log files. All of the Amazon EMR actions are documented in the Amazon EMR API Reference. For example, calls to the ListClusters, DescribeCluster, and RunJobFlow actions generate entries in CloudTrail log files.

Every log entry contains information about who generated the request. For example, if a request is made to create and run a new job flow (RunJobFlow), CloudTrail logs the user identity of the person or service that made the request. The user identity information helps you determine whether the request was made with root or IAM user credentials, with temporary security credentials for a role or federated user, or by another AWS service. For more information about CloudTrail fields, see CloudTrail Event Reference in the AWS CloudTrail User Guide.

You can store your log files in your bucket for as long as you want, but you can also define Amazon S3 lifecycle rules to archive or delete log files automatically. By default, your log files are encrypted by using Amazon S3 server-side encryption (SSE).

Understanding Amazon EMR Log File Entries

CloudTrail log files can contain one or more log entries composed of multiple JSON-formatted events. A log entry represents a single request from any source and includes information about the requested action, any input parameters, the date and time of the action, and so on. The log entries do not appear in any particular order. That is, they do not represent an ordered stack trace of the public API calls.

The following log file record shows that an IAM user called the RunJobFlow action by using the SDK.

{ "Records": [ { "eventVersion":"1.01", "userIdentity":{ "type":"IAMUser", "principalId":"EX_PRINCIPAL_ID", "arn":"arn:aws:iam::123456789012:user/temporary-user-xx-7M", "accountId":"123456789012", "accessKeyId":"EXAMPLE_KEY_ID",

API Version 2009-03-31 471 Amazon Elastic MapReduce Developer Guide Monitor Performance with Ganglia

"userName":"temporary-user-xx-7M" }, "eventTime":"2014-03-31T17:59:21Z", "eventSource":"elasticmapreduce.amazonaws.com", "eventName":"RunJobFlow", "awsRegion":"us-east-1", "sourceIPAddress":"127.0.0.1", "userAgent":"aws-sdk-java/unknown-version Linux/xx Java_HotSpot(TM)_64- Bit_Server_VM/xx", "requestParameters":{ "tags":[ { "value":"prod", "key":"domain" }, { "value":"us-east-1", "key":"realm" }, { "value":"VERIFICATION", "key":"executionType" } ], "instances":{ "slaveInstanceType":"m1.large", "ec2KeyName":"emr-integtest", "instanceCount":1, "masterInstanceType":"m1.large", "keepJobFlowAliveWhenNoSteps":true, "terminationProtected":false }, "visibleToAllUsers":false, "name":"Integ 1xm1large", "amiVersion":"3.0.4" }, "responseElements":{ "jobFlowId":"j-2WDJCGEG4E6AJ" }, "requestID":"2f482daf-b8fe-11e3-89e7-75a3d0e071c5", "eventID":"b348a38d-f744-4097-8b2a-e68c9b424698" }, ...additional entries ] }

Monitor Performance with Ganglia

The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your cluster, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. For more information about the Ganglia open-source project, go to http://ganglia.info/.

Topics · Add Ganglia to a Cluster (p. 473)


· View Ganglia Metrics (p. 475) · Ganglia Reports (p. 476) · Hadoop Metrics in Ganglia (p. 481)

Add Ganglia to a Cluster

To add Ganglia to a cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. Under the Additional Applications list, choose Ganglia and click Configure and add.
4. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).

To add Ganglia to a cluster using the AWS CLI

In the AWS CLI, you can add Ganglia to a cluster by using the create-cluster subcommand with the --applications parameter. This installs Ganglia using a bootstrap action, making the --bootstrap-action parameter unnecessary.

· Type the following command to add Ganglia to a cluster using the create-cluster subcommand with the --applications parameter:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--no-auto-terminate

For example, in the following command Ganglia is the only application added to the cluster:

aws emr create-cluster --applications Name=Ganglia --ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-auto-terminate

To install additional applications, add them to the --applications parameter. For example, this command installs Ganglia, Hive and Pig:

aws emr create-cluster --applications Name=Hive Name=Pig Name=Ganglia \
--ami-version 3.1.0 --instance-count 3 --instance-type m3.xlarge \
--no-auto-terminate

Note When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To add a Ganglia bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· When you create a new cluster using the Amazon EMR CLI, specify the Ganglia bootstrap action by adding the following parameter to your cluster call:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia

The following command illustrates the use of the bootstrap-action parameter when starting a new cluster. In this example, you start the Word Count sample cluster provided by Amazon EMR and launch three instances.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626). Note The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge \
--num-instances 3 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --input s3://elasticmapreduce/samples/wordcount/input \
--output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate

· Windows users:

ruby elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge --num-instances 3 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate

For Hadoop 1.x, use the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 3 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --stream \
--input s3://elasticmapreduce/samples/wordcount/input \
--output s3://mybucket/output/2014-01-16 \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate

· Windows users:

ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --stream --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate

View Ganglia Metrics

Ganglia provides a web-based user interface that you can use to view the metrics Ganglia collects. When you run Ganglia on Amazon EMR, the web interface runs on the master node and can be viewed using port forwarding, also known as creating an SSH tunnel. For more information about viewing web interfaces on Amazon EMR, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

To view the Ganglia web interface

1. Use SSH to tunnel into the master node and create a secure connection. For information about how to create an SSH tunnel to the master node, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).
2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for domains of the type *ec2*.amazonaws.com*. For more information, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
3. With the proxy set and the SSH connection open, you can view the Ganglia UI by opening a browser window with http://master-public-dns-name/ganglia/, where master-public-dns-name is the public DNS address of the master server in the Amazon EMR cluster. For information about how to locate the public DNS name of a master node, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482).


Ganglia Reports

When you open the Ganglia web reports in a browser, you see an overview of the cluster's performance, with graphs detailing the load, memory usage, CPU utilization, and network traffic of the cluster. Below the cluster statistics are graphs for each individual server in the cluster. In the preceding cluster creation example, we launched three instances, so in the following reports there are three instance charts showing the cluster data.


The default graph for the node instances is Load, but you can use the Metric drop-down list to change the statistic displayed in the node-instance graphs.


You can drill down into the full set of statistics for a given instance by selecting the node from the drop-down list or by clicking the corresponding node-instance chart.


This opens the Host Overview for the node.


If you scroll down, you can view charts of the full range of statistics collected on the instance.


Hadoop Metrics in Ganglia

Ganglia reports Hadoop metrics for each node instance. The various types of metrics are prefixed by category: distributed file system (dfs.*), Java virtual machine (jvm.*), MapReduce (mapred.*), and remote procedure calls (rpc.*). You can view a complete list of these metrics by clicking the Gmetrics link on the Host Overview page.

Connect to the Cluster

When you run an Amazon Elastic MapReduce (Amazon EMR) cluster, often all you need to do is run an application to analyze your data and then collect the output from an Amazon S3 bucket. At other times, you may want to interact with the master node while the cluster is running. For example, you may want to connect to the master node to run interactive queries, check log files, debug a problem with the cluster, monitor performance using an application such as Ganglia that runs on the master node, and so on. The following sections describe techniques that you can use to connect to the master node.

In an Amazon EMR cluster, the master node is an EC2 instance that coordinates the EC2 instances that are running as task and core nodes. The master node exposes a public DNS name that you can use to connect to it. By default, Amazon EMR creates security group rules for master and slave nodes that determine how you access the nodes. For example, the master node security group contains a rule that allows you to connect to the master node using an SSH client over TCP port 22.
Note You can connect to the master node only while the cluster is running. When the cluster terminates, the EC2 instance acting as the master node is terminated and is no longer available.
To connect to the master node, you must also specify an Amazon EC2 key pair private key when you launch the cluster. The key pair private key provides the credentials for the SSH connection to the master node. If you launch a cluster from the console, the Amazon EC2 key pair private key is specified in the Security and Access section on the Create Cluster page.

By default, the ElasticMapReduce-master security group permits inbound SSH access from CIDR range 0.0.0.0/0. This allows SSH connections over TCP port 22 from any IP address using the appropriate credentials. You can limit this rule by identifying a specific IP address or address range suitable for your environment. For more information about modifying security group rules, see Adding Rules to a Security Group in the Amazon EC2 User Guide for Linux Instances.
Important Do not modify the remaining rules in the ElasticMapReduce-master security group. Modifying these rules may interfere with the operation of the cluster.
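As one way to narrow the SSH rule, you could use the AWS CLI. The following is a minimal sketch rather than a procedure from this guide: the CIDR range is a placeholder for your own network, and if your cluster runs in a VPC you must identify the security group by its group ID (--group-id) instead of by name.

aws ec2 revoke-security-group-ingress --group-name ElasticMapReduce-master \
--protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name ElasticMapReduce-master \
--protocol tcp --port 22 --cidr 203.0.113.0/24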

Topics · Connect to the Master Node Using SSH (p. 481) · View Web Interfaces Hosted on Amazon EMR Clusters (p. 487)

Connect to the Master Node Using SSH

Secure Shell (SSH) is a network protocol you can use to create a secure connection to a remote computer. After you make a connection, the terminal on your local computer behaves as if it is running on the remote computer. Commands you issue locally run on the remote computer, and the command output from the remote computer appears in your terminal window.

When you use SSH with AWS, you are connecting to an EC2 instance, which is a virtual server running in the cloud. When working with Amazon EMR, the most common use of SSH is to connect to the EC2 instance that is acting as the master node of the cluster.


Using SSH to connect to the master node gives you the ability to monitor and interact with the cluster. You can issue Linux commands on the master node, run applications such as HBase, Hive, and Pig interactively, browse directories, read log files, and so on. You can also create a tunnel in your SSH connection to view the web interfaces hosted on the master node. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

To connect to the master node using SSH, you need the public DNS name of the master node and your Amazon EC2 key pair private key. The Amazon EC2 key pair private key is specified when you launch the cluster. If you launch a cluster from the console, the Amazon EC2 key pair private key is specified in the Security and Access section on the Create Cluster page. For more information about accessing your key pair, see Amazon EC2 Key Pairs in the Amazon EC2 User Guide for Linux Instances.

Retrieve the Public DNS Name of the Master Node

You can retrieve the master public DNS name using the Amazon EMR console, the AWS CLI, or the Amazon EMR CLI.

To retrieve the public DNS name of the master node using the Amazon EMR console

1. In the Amazon EMR console, on the Cluster List page, click the link for your cluster.
2. Note the Master public DNS value that appears at the top of the Cluster Details page.

Note You may also click the SSH link beside the master public DNS name for instructions on creating an SSH connection with the master node.

To retrieve the public DNS name of the master node using the AWS CLI

1. To retrieve the cluster identifier, type the following command:


aws emr list-clusters

The output lists your clusters including the cluster IDs. Note the cluster ID for the cluster to which you are connecting.

"Status": { "Timeline": { "ReadyDateTime": 1408040782.374, "CreationDateTime": 1408040501.213 }, "State": "WAITING", "StateChangeReason": { "Message": "Waiting after step completed" } }, "NormalizedInstanceHours": 4, "Id": "j-2AL4XXXXXX5T9", "Name": "My cluster"

2. To list the cluster instances including the master public DNS name for the cluster, type one of the following commands. Replace j-2AL4XXXXXX5T9 with the cluster ID returned by the previous command.

aws emr list-instances --cluster-id j-2AL4XXXXXX5T9
or
aws emr describe-cluster --cluster-id j-2AL4XXXXXX5T9

The output lists the cluster instances including DNS names and IP addresses. Note the value for PublicDnsName.

"Status": { "Timeline": { "ReadyDateTime": 1408040779.263, "CreationDateTime": 1408040515.535 }, "State": "RUNNING", "StateChangeReason": {} }, "Ec2InstanceId": "i-e89b45e7", "PublicDnsName": "ec2-###-##-##-###.us-west-2.compute.amazonaws.com"

"PrivateDnsName": "ip-###-##-##-###.us-west-2.compute.internal", "PublicIpAddress": "##.###.###.##", "Id": "ci-12XXXXXXXXFMH", "PrivateIpAddress": "###.##.#.###"

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
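If you only need the master public DNS name, you can also extract it directly from the describe-cluster output with a JMESPath query. This is a minimal sketch that assumes the MasterPublicDnsName field in the describe-cluster output; replace the cluster ID with your own.

aws emr describe-cluster --cluster-id j-2AL4XXXXXX5T9 \
--query Cluster.MasterPublicDnsName --output text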

To retrieve the public DNS name of the master node using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· You can retrieve the Master public DNS using the Amazon EMR CLI. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626). In the directory where you installed the Amazon EMR CLI, type the following command.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --list

· Windows users:

ruby elastic-mapreduce --list

Connect to the Master Node Using SSH on Linux, Unix, and Mac OS X

Your Linux computer most likely includes an SSH client by default. For example, OpenSSH is installed on most Linux, Unix, and Mac OS X operating systems. You can check for an SSH client by typing ssh at the command line. If your computer doesn't recognize the command, you must install an SSH client to connect to the master node. The OpenSSH project provides a free implementation of the full suite of SSH tools. For more information, go to http://www.openssh.org.

The following instructions demonstrate opening an SSH connection to the Amazon EMR master node on Linux, Unix, and Mac OS X.

To configure the key pair private key file permissions

Before you can use your Amazon EC2 key pair private key to create an SSH connection, you must set permissions on the .pem file so that only the key owner has permission to access the file. This is required for creating an SSH connection using terminal or the AWS CLI.

1. Locate your .pem file. These instructions assume that the file is named mykeypair.pem and that it is stored in the current user's home directory.
2. Type the following command to set the permissions. Replace ~/mykeypair.pem with the location and file name of your key pair private key file.

chmod 400 ~/mykeypair.pem

If you do not set permissions on the .pem file, you will receive an error indicating that your key file is unprotected and the key will be rejected. To connect, you only need to set permissions on the key pair private key file the first time you use it.

To connect to the master node using terminal

1. Open a terminal window. On Mac OS X, choose Applications > Utilities > Terminal. On other Linux distributions, terminal is typically found at Applications > Accessories > Terminal.
2. To establish a connection to the master node, type the following command. Replace ec2-###-##-##-###.compute-1.amazonaws.com with the master public DNS name of your cluster and replace ~/mykeypair.pem with the location and file name of your .pem file.


ssh hadoop@ec2-###-##-##-###.compute-1.amazonaws.com -i ~/mykeypair.pem

Important You must use the login name hadoop when you connect to the Amazon EMR master node, otherwise you may see an error similar to Server refused our key.
3. A warning states that the authenticity of the host you are connecting to cannot be verified. Type yes to continue.
4. When you are done working on the master node, type the following command to close the SSH connection.

exit

Connect to the Master Node Using the AWS CLI or Amazon EMR CLI

You can create an SSH connection with the master node using the AWS CLI or Amazon EMR CLI on Windows and on Linux, Unix, and Mac OS X. Regardless of the platform, you need the public DNS name of the master node and your Amazon EC2 key pair private key. If you are using the AWS CLI or Amazon EMR CLI on Linux, Unix, or Mac OS X, you must also set permissions on the private key (.pem or .ppk) file as shown in To configure the key pair private key file permissions (p. 484).

To connect to the master node using the AWS CLI

1. To retrieve the cluster identifier, type:

aws emr list-clusters

The output lists your clusters including the cluster IDs. Note the cluster ID for the cluster to which you are connecting.

"Status": { "Timeline": { "ReadyDateTime": 1408040782.374, "CreationDateTime": 1408040501.213 }, "State": "WAITING", "StateChangeReason": { "Message": "Waiting after step completed" } }, "NormalizedInstanceHours": 4, "Id": "j-2AL4XXXXXX5T9", "Name": "AWS CLI cluster"

2. Type the following command to open an SSH connection to the master node. In the following example, replace j-2AL4XXXXXX5T9 with the cluster ID and replace ~/mykeypair.key with the location and file name of your .pem file (for Linux, Unix, and Mac OS X) or .ppk file (for Windows).

aws emr ssh --cluster-id j-2AL4XXXXXX5T9 --key-pair-file ~/mykeypair.key


3. When you are done working on the master node, close the AWS CLI window.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To connect to the master node using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· To connect to the master node using the Amazon EMR CLI (on Linux, Unix, and Mac OS X), you must configure your credentials.json file so the keypair value is set to the name of the keypair you used to launch the cluster, set the key-pair-file value to the full path to your private key file, set appropriate permissions on the .pem file, and install an SSH client on your machine (such as OpenSSH). You can then open an SSH connection to the master node by issuing the following command. This is a handy shortcut for frequent CLI users. In the following example, replace j-3L7WXXXXXHO4H with your cluster identifier.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce -j j-3L7WXXXXXHO4H --ssh

· Windows users:

ruby elastic-mapreduce -j j-3L7WXXXXXHO4H --ssh

Connect to the Master Node Using SSH on Windows

Windows users can use an SSH client such as PuTTY to connect to the master node. Before connecting to the Amazon EMR master node, you should download and install PuTTY and PuTTYgen. You can download these tools from the PuTTY download page.

PuTTY does not natively support the key pair private key file format (.pem) generated by Amazon EC2. You use PuTTYgen to convert your key file to the required PuTTY format (.ppk). You must convert your key into this format (.ppk) before attempting to connect to the master node using PuTTY.

For more information about converting your key, see Converting Your Private Key Using PuTTYgen in the Amazon EC2 User Guide for Linux Instances.

To connect to the master node using PuTTY

1. Double-click putty.exe to start PuTTY. You can also launch PuTTY from the Windows programs list.
2. If necessary, in the Category list, click Session.
3. In the Host Name (or IP address) field, type hadoop@MasterPublicDNS. For example: hadoop@ec2-###-##-##-###.compute-1.amazonaws.com.
4. In the Category list, expand Connection > SSH, and then click Auth.
5. For Private key file for authentication, click Browse and select the .ppk file that you generated.
6. Click Open.
7. Click Yes to dismiss the PuTTY security alert.


Important When logging into the master node, type hadoop if you are prompted for a user name.
8. When you are done working on the master node, you can close the SSH connection by closing PuTTY.
Note To prevent the SSH connection from timing out, you can click Connection in the Category list and select the option Enable TCP_keepalives. If you have an active SSH session in PuTTY, you can change your settings by right-clicking the PuTTY title bar and choosing Change Settings.

View Web Interfaces Hosted on Amazon EMR Clusters

Hadoop and other applications you install on your Amazon EMR cluster publish user interfaces as web sites hosted on the master node. For security reasons, these web sites are only available on the master node's local web server and are not publicly available over the Internet. Hadoop also publishes user interfaces as web sites hosted on the core and task (slave) nodes. These web sites are also only available on local web servers on the nodes.

The following table lists web interfaces you can view on the master node. The Hadoop interfaces are available on all clusters. Other web interfaces such as Ganglia and HBase are only available if you install additional applications on your cluster. To access the following interfaces, replace master-public-dns-name in the URI with the DNS name of the master node after creating an SSH tunnel. For more information about retrieving the master public DNS name, see Retrieve the Public DNS Name of the Master Node (p. 482). For more information about creating an SSH tunnel, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).

Name of Interface URI

Hadoop version 2.x

Hadoop ResourceManager http://master-public-dns-name:9026/

Hadoop HDFS NameNode http://master-public-dns-name:9101/

Ganglia Metrics Reports http://master-public-dns-name/ganglia/

HBase Interface http://master-public-dns-name:60010/master-status

Impala Statestore http://master-public-dns-name:25000

Impalad http://master-public-dns-name:25010

Impala Catalog http://master-public-dns-name:25020

Hadoop version 1.x

Hadoop MapReduce JobTracker http://master-public-dns-name:9100/

Hadoop HDFS NameNode http://master-public-dns-name:9101/

Ganglia Metrics Reports http://master-public-dns-name/ganglia/

HBase Interface http://master-public-dns-name:60010/master-status


For more information about the Ganglia web interface, see Monitor Performance with Ganglia (p. 472). For more information about Impala web interfaces, see Accessing Impala Web User Interfaces (p. 289).

The following table lists web interfaces you can view on the core and task nodes. These Hadoop interfaces are available on all clusters. To access the following interfaces, replace slave-public-dns-name in the URI with the public DNS name of the node. For more information about retrieving the public DNS name of a core or task node instance, see Connecting to Your Linux/Unix Instances Using SSH in the Amazon EC2 User Guide for Linux Instances. In addition to retrieving the public DNS name of the core or task node, you must also edit the ElasticMapReduce-slave security group to allow SSH access over TCP port 22. For more information about modifying security group rules, see Adding Rules to a Security Group in the Amazon EC2 User Guide for Linux Instances.

Name of Interface URI

Hadoop version 2.x

Hadoop NodeManager http://slave-public-dns-name:9035/

Hadoop HDFS DataNode http://slave-public-dns-name:9102/

Hadoop version 1.x

Hadoop HDFS DataNode (core nodes only) http://slave-public-dns-name:9102/

Hadoop MapReduce TaskTracker http://slave-public-dns-name:9103/

Note You can change the configuration of the Hadoop version 2.x web interfaces by editing the conf/hdfs-site.xml file. You can change the configuration of the Hadoop version 1.x web interfaces by editing the conf/hadoop-default.xml file.

Because there are several application-specific interfaces available on the master node that are not available on the core and task nodes, the instructions in this document are specific to the Amazon EMR master node. Accessing the web interfaces on the core and task nodes can be done in the same manner as you would access the web interfaces on the master node.

There are several ways you can access the web interfaces on the master node. The easiest and quickest method is to use SSH to connect to the master node and use the text-based browser, Lynx, to view the web sites in your SSH client. However, Lynx is a text-based browser with a limited user interface that cannot display graphics. The following example shows how to open the Hadoop ResourceManager interface using Lynx on AMI 3.1.1 and later (Lynx URLs are also provided when you log into the master node using SSH).

lynx http://ip-###-##-##-###.us-west-2.compute.internal:9026/

There are two remaining options for accessing web interfaces on the master node that provide full browser functionality. Choose one of the following:

· Option 1 (recommended for more technical users): Use an SSH client to connect to the master node, configure SSH tunneling with local port forwarding, and use an Internet browser to open web interfaces hosted on the master node. This method allows you to configure web interface access without using a SOCKS proxy.
· Option 2 (recommended for new users): Use an SSH client to connect to the master node, configure SSH tunneling with dynamic port forwarding, and configure your Internet browser to use an add-on such as FoxyProxy or SwitchySharp to manage your SOCKS proxy settings. This method allows you to automatically filter URLs based on text patterns and to limit the proxy settings to domains that match


the form of the master node's DNS name. The browser add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node, and those on the Internet. For more information about how to configure FoxyProxy for Firefox and Google Chrome, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).

Topics · Option 1: Set Up an SSH Tunnel to the Master Node Using Local Port Forwarding (p. 489) · Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491) · Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494) · Access the Web Interfaces on the Master Node Using the Console (p. 497)

Option 1: Set Up an SSH Tunnel to the Master Node Using Local Port Forwarding

To connect to the local web server on the master node, you create an SSH tunnel between your computer and the master node. This is also known as port forwarding. If you do not wish to use a SOCKS proxy, you can set up an SSH tunnel to the master node using local port forwarding. With local port forwarding, you specify unused local ports that are used to forward traffic to specific remote ports on the master node's local web server. For example, you could configure an unused local port (such as 8157) to forward traffic to the Ganglia web interface on the master node (localhost:80). Beginning with AMI 3.1.1, the Hadoop NameNode and ResourceManager web interfaces on the master node are no longer bound to localhost. To set up an SSH tunnel using local port forwarding for these interfaces, you use the master public DNS name instead of localhost.

Setting up an SSH tunnel using local port forwarding requires the public DNS name of the master node and your key pair private key file. For information about how to locate the master public DNS name, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482). For more information about accessing your key pair, see Amazon EC2 Key Pairs in the Amazon EC2 User Guide for Linux Instances. For more information about the sites you might want to view on the master node, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

Set Up an SSH Tunnel to the Master Node Using Local Port Forwarding on Linux, Unix, and Mac OS X

To set up an SSH tunnel using local port forwarding in terminal

1. Open a terminal window. On Mac OS X, choose Applications > Utilities > Terminal. On other Linux distributions, terminal is typically found at Applications > Accessories > Terminal.
2. Type the following command to open an SSH tunnel on your local machine. This command accesses the Ganglia web interface by forwarding traffic on local port 8157 (a randomly chosen, unused local port) to port 80 on the master node's local web server. In the command, replace path-to-keyfile with the location and file name of your .pem file and replace master-public-DNS-name with the master public DNS name of your cluster.

ssh -i path-to-keyfile -N -L port_number:master-public-DNS-name:80 hadoop@master-public-DNS-name

For example:

ssh -i ~/mykeypair.pem -N -L 8157:ec2-###-##-##-###.compute-1.amazonaws.com:80 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com


After you issue this command, the terminal remains open and does not return a response.
Note -L signifies the use of local port forwarding, which allows you to specify a local port used to forward data to the identified remote port on the master node's local web server.
3. To open the Ganglia web interface in your browser, type http://localhost:8157/ganglia in the address bar.
4. To access the ResourceManager web interface, launch a new terminal session, replace port 80 in the previous command with port 9026, and replace port 8157 with port 8158 (a second unused local port).

For example:

ssh -i ~/mykeypair.pem -N -L 8158:ec2-###-##-##-###.compute-1.amazonaws.com:9026 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

Then, type the following address in your browser: http://localhost:8158/.
5. To access the NameNode web interface, launch a new terminal session, replace port 80 in the previous command with port 9101, and replace port 8157 with port 8159 (a third unused local port). For example:

ssh -i ~/mykeypair.pem -N -L 8159:ec2-###-##-##-###.compute-1.amazonaws.com:9101 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

Then, type the following address in your browser: http://localhost:8159/.
6. When you are done working with the web interfaces on the master node, close the terminal windows.

Set Up an SSH Tunnel to the Master Node Using Local Port Forwarding on Windows

To set up an SSH tunnel using local port forwarding in PuTTY

1. Double-click putty.exe to start PuTTY. You can also launch PuTTY from the Windows programs list.
2. If necessary, in the Category list, click Session.
3. In the Host Name (or IP address) field, type hadoop@MasterPublicDNS. For example: hadoop@ec2-###-##-##-###.compute-1.amazonaws.com.
4. In the Category list, expand Connection > SSH, and then click Auth.
5. For Private key file for authentication, click Browse and select the .ppk file that you generated.
6. In the Category list, expand Connection > SSH, and then click Tunnels.
7. In the Source port field, type an unused local port number, for example 8157.
8. To access a web interface, in the Destination field, type host name:port number. For example, to access the Ganglia interface, type localhost:80.
9. Leave the Local and Auto options selected.
10. Click Add. You should see an entry in the Forwarded ports box similar to: L8157 localhost:80.
11. Click Open and then click Yes to dismiss the PuTTY security alert.
Important When logging into the master node, if you are prompted for a user name, type hadoop.
12. To access the Ganglia interface on the master node, type http://localhost:8157/ganglia in your browser's address bar.


13. To access other interfaces on the master node, you must add additional tunnels for each port. Right-click the PuTTY title bar and choose Change Settings.
14. Follow the previous steps to add additional tunnels for the remaining web interfaces using the following table as a guide.

Interface Source port Destination

Hadoop ResourceManager 8158 master-public-dns-name:9026

Hadoop NameNode 8159 master-public-dns-name:9101

Note For AMI 3.1.0 and earlier, you may use localhost for the destination instead of the master public DNS name.

For a complete list of web interfaces on the master node, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).
15. After adding a new tunnel, click Apply.
16. To open the interfaces, type localhost:port number in your browser's address bar. For example, to open the ResourceManager interface, type http://localhost:8158/.
Note Setting up a SOCKS proxy and dynamic port forwarding eliminates the need to create multiple tunnels. Also note that you can save your PuTTY session settings for later reuse.
17. When you are done working with the web interfaces on the master node, close the PuTTY window.

Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding

To connect to the local web server on the master node, you create an SSH tunnel between your computer and the master node. This is also known as port forwarding. If you create your SSH tunnel using dynamic port forwarding, all traffic routed to a specified unused local port is forwarded to the local web server on the master node. This creates a SOCKS proxy. You can then configure your Internet browser to use an add-on such as FoxyProxy or SwitchySharp to manage your SOCKS proxy settings. Using a proxy management add-on allows you to automatically filter URLs based on text patterns and to limit the proxy settings to domains that match the form of the master node's public DNS name. The browser add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node, and those on the Internet.

Before you begin, you need the public DNS name of the master node and your key pair private key file. For information about how to locate the master public DNS name, see To retrieve the public DNS name of the master node using the Amazon EMR console (p. 482). For more information about accessing your key pair, see Amazon EC2 Key Pairs in the Amazon EC2 User Guide for Linux Instances. For more information about the sites you might want to view on the master node, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding on Linux, Unix, and Mac OS X

To set up an SSH tunnel using dynamic port forwarding on Linux, Unix, and Mac OS X

1. Open a terminal window. On Mac OS X, choose Applications > Utilities > Terminal. On other Linux distributions, terminal is typically found at Applications > Accessories > Terminal.


2. Type the following command to open an SSH tunnel on your local machine. Replace path-to-keyfile with the location and file name of your .pem file, replace port_number with an unused local port number, and replace master-public-DNS-name with the master public DNS name of your cluster.

ssh -i path-to-keyfile -N -D port_number hadoop@master-public-DNS-name

For example:

ssh -i ~/mykeypair.pem -N -D 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

After you issue this command, the terminal remains open and does not return a response.
Note
-D signifies the use of dynamic port forwarding, which allows you to specify a local port used to forward data to all remote ports on the master node's local web server. Dynamic port forwarding creates a local SOCKS proxy listening on the port specified in the command.
3. After the tunnel is active, configure a SOCKS proxy for your browser. For more information, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
4. When you are done working with the web interfaces on the master node, close the terminal window.

Set Up an SSH tunnel Using Dynamic Port Forwarding with the AWS CLI or Amazon EMR CLI

You can create an SSH connection with the master node using the AWS CLI or the Amazon EMR CLI on Windows and on Linux, Unix, and Mac OS X. If you are using the AWS CLI on Linux, Unix, or Mac OS X, you must set permissions on the .pem file as shown in To configure the key pair private key file permissions (p. 484). If you are using the AWS CLI on Windows, PuTTY must appear in the path environment variable or you may receive an error such as OpenSSH or PuTTY not available.

To set up an SSH tunnel using dynamic port forwarding with the AWS CLI

1. Create an SSH connection with the master node as shown in Connect to the Master Node Using the AWS CLI or Amazon EMR CLI (p. 485).
2. To retrieve the cluster identifier, type:

aws emr list-clusters

The output lists your clusters including the cluster IDs. Note the cluster ID for the cluster to which you are connecting.

"Status": { "Timeline": { "ReadyDateTime": 1408040782.374, "CreationDateTime": 1408040501.213 }, "State": "WAITING", "StateChangeReason": { "Message": "Waiting after step completed" } }, "NormalizedInstanceHours": 4,

API Version 2009-03-31 492 Amazon Elastic MapReduce Developer Guide View Web Interfaces Hosted on Amazon EMR Clusters

"Id": "j-2AL4XXXXXX5T9", "Name": "AWS CLI cluster"

3. Type the following command to open an SSH tunnel to the master node using dynamic port forwarding. In the following example, replace j-2AL4XXXXXX5T9 with the cluster ID and replace ~/mykeypair.key with the location and file name of your .pem file (for Linux, Unix, and Mac OS X) or .ppk file (for Windows).

aws emr socks --cluster-id j-2AL4XXXXXX5T9 --key-pair-file ~/mykeypair.key

Note
The socks command automatically configures dynamic port forwarding on local port 8157. Currently, this setting cannot be modified.
4. After the tunnel is active, configure a SOCKS proxy for your browser. For more information, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
5. When you are done working with the web interfaces on the master node, close the AWS CLI window.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create an SSH tunnel to the master node using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce -j j-3L7WXXXXXHO4H --socks

· Windows users:

ruby elastic-mapreduce -j j-3L7WXXXXXHO4H --socks

Note
The --socks feature is available only on the CLI version 2012-06-12 and later. To find out what version of the CLI you have, run elastic-mapreduce --version at the command line. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264.

Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding on Windows

Windows users can use an SSH client such as PuTTY to create an SSH tunnel to the master node. Before connecting to the Amazon EMR master node, you should download and install PuTTY and PuTTYgen. You can download these tools from the PuTTY download page.


PuTTY does not natively support the key pair private key file format (.pem) generated by Amazon EC2. You use PuTTYgen to convert your key file to the required PuTTY format (.ppk). You must convert your key into this format (.ppk) before attempting to connect to the master node using PuTTY.

For more information about converting your key, see Converting Your Private Key Using PuTTYgen in the Amazon EC2 User Guide for Linux Instances.

To set up an SSH tunnel using dynamic port forwarding on Windows

1. Double-click putty.exe to start PuTTY. You can also launch PuTTY from the Windows programs list.
Note
If you already have an active SSH session with the master node, you can add a tunnel by right-clicking the PuTTY title bar and choosing Change Settings.
2. If necessary, in the Category list, click Session.
3. In the Host Name field, type hadoop@MasterPublicDNS. For example: hadoop@ec2-###-##-##-###.compute-1.amazonaws.com.
4. In the Category list, expand Connection > SSH, and then click Auth.
5. For Private key file for authentication, click Browse and select the .ppk file that you generated.
6. In the Category list, expand Connection > SSH, and then click Tunnels.
7. In the Source port field, type 8157 (an unused local port).
8. Leave the Destination field blank.
9. Select the Dynamic and Auto options.
10. Click Add and then click Open.
11. Click Yes to dismiss the PuTTY security alert.
Important
When you log into the master node, type hadoop if you are prompted for a user name.
12. After the tunnel is active, configure a SOCKS proxy for your browser. For more information, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494).
13. When you are done working with the web interfaces on the master node, close the PuTTY window.

Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node

If you use an SSH tunnel with dynamic port forwarding, you must use a SOCKS proxy management add-on to control the proxy settings in your browser. Using a SOCKS proxy management tool allows you to automatically filter URLs based on text patterns and to limit the proxy settings to domains that match the form of the master node's public DNS name. The browser add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node and those on the Internet. To manage your proxy settings, configure your browser to use an add-on such as FoxyProxy or SwitchySharp.

For more information about creating an SSH tunnel, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491). For more information about the available web interfaces, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

Configure FoxyProxy for Firefox

You can configure FoxyProxy for Google Chrome, Mozilla Firefox, and Microsoft Internet Explorer. FoxyProxy provides a set of proxy management tools that allow you to use a proxy server for URLs that match patterns corresponding to the domains used by the Amazon EC2 instances in your Amazon EMR cluster. Before configuring FoxyProxy, you must first create an SSH tunnel using dynamic port forwarding.


For more information, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).
Note
The following tutorial uses FoxyProxy Standard version 4.2.4 and Firefox version 24.7.0.

To install and configure FoxyProxy in Firefox

1. Download and install the Standard version of FoxyProxy from http://getfoxyproxy.org/downloads.html.
2. Using a text editor, create a file named foxyproxy-settings.xml containing the following:

This file includes the following settings:

· Port 8157 is the local port number used to establish the SSH tunnel with the master node. This must match the port number you used in PuTTY or terminal.
· The *ec2*.amazonaws.com* pattern matches the public DNS name of clusters in US regions.
· The *ec2*.compute* pattern matches the public DNS name of clusters in all other regions.
· The 10.* pattern provides access to the JobTracker log files in Hadoop 1.x. Alter this filter if it conflicts with your network access plan.

3. Click Firefox > Add-ons.
4. On the Add-ons tab, to the right of FoxyProxy Standard, click Options.
5. In the FoxyProxy Standard dialog, click File > Import Settings.
6. Browse to the location of foxyproxy-settings.xml, select the file, and click Open.
7. Click Yes when prompted to overwrite the existing settings and then click Yes to restart Firefox.
8. When Firefox restarts, on the Add-ons tab, to the right of FoxyProxy Standard, click Options.
9. In the FoxyProxy Standard dialog, for Select Mode, choose Use proxies based on their pre-defined patterns and priorities.


10. Click Close.
11. To open the web interfaces, in your browser's address bar, type master-public-dns followed by the port number or URL. Use the following table as a guide.

Interface URL

Ganglia Metrics Reports master-public-dns/ganglia/

Hadoop ResourceManager master-public-dns-name:9026

Hadoop NameNode master-public-dns-name:9101

For a complete list of web interfaces on the master node, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

Configure FoxyProxy for Google Chrome

You can configure FoxyProxy for Google Chrome, Mozilla Firefox, and Microsoft Internet Explorer. FoxyProxy provides a set of proxy management tools that allow you to use a proxy server for URLs that match patterns corresponding to the domains used by the Amazon EC2 instances in your Amazon EMR cluster. Before configuring FoxyProxy, you must first create an SSH tunnel using dynamic port forwarding. For more information, see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).
Note
The following tutorial uses FoxyProxy Standard version 3.0.3 and Chrome version 24.7.0.

To install and configure FoxyProxy in Google Chrome

1. Download and install the Standard version of FoxyProxy from http://getfoxyproxy.org/downloads.html.
2. When prompted, click FREE to install the FoxyProxy extension and then click Add.
3. Using a text editor, create a file named foxyproxy-settings.xml containing the following:


This file includes the following settings:

· Port 8157 is the local port number used to establish the SSH tunnel with the master node. This must match the port number you used in PuTTY or terminal.
· The *ec2*.amazonaws.com* pattern matches the public DNS name of clusters in US regions.
· The *ec2*.compute* pattern matches the public DNS name of clusters in all other regions.
· The 10.* pattern provides access to the JobTracker log files in Hadoop 1.x. Alter this filter if it conflicts with your network access plan.

4. Click Customize and Control Google Chrome > Tools > Extensions.
5. On the Extensions tab, below FoxyProxy Standard, click Options.
6. On the FoxyProxy Standard page, click Import/Export.
7. On the Import/Export page, click Choose File, browse to the location of foxyproxy-settings.xml, select the file, and click Open.
8. Click Replace when prompted to overwrite the existing settings.
9. At the top of the page, for Proxy mode, choose Use proxies based on their pre-defined patterns and priorities.
10. To open the web interfaces, in your browser's address bar, type master-public-dns followed by the port number or URL. Use the following table as a guide.

Interface URL

Ganglia Metrics Reports master-public-dns/ganglia/

Hadoop ResourceManager master-public-dns-name:9026

Hadoop NameNode master-public-dns-name:9101

For a complete list of web interfaces on the master node, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

Access the Web Interfaces on the Master Node Using the Console

If you already have an SSH tunnel configured with the Amazon EMR master node using dynamic port forwarding, you can open the web interfaces using the console.

To open the web interfaces using the console

1. Verify that you have established an SSH tunnel with the master node and that you have a proxy management add-on configured for your browser.
2. In the Amazon EMR console, on the Cluster List page, click the link for your cluster.
3. In the cluster details, for Connections, click the link for the web interface you wish to open in your browser.


4. Alternatively, click the View All link to display links to all of the available web interfaces on your cluster's master node. Clicking the links opens the interfaces in your browser.

If you do not have an SSH tunnel open with the master node, click Enable Web Connection for instructions on creating a tunnel, or see Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491).


Note If you have an SSH tunnel configured using local port forwarding, the Amazon EMR console does not detect the connection.

Control Cluster Termination

Control over cluster termination is determined by two options: termination protection and auto-termination. By default, when you launch a cluster using the console, termination protection is turned on. This prevents accidental termination of the cluster. When you launch a cluster using the CLI or API, termination protection is turned off.

Auto-termination determines whether the cluster should automatically terminate when all steps are complete. When you launch a cluster using the console, the default behavior is for the cluster to remain active after all steps are complete. In other words, the cluster is long-running. A long-running cluster must be manually terminated. When you launch a cluster using the CLI or API, the default behavior is for the cluster to terminate when data processing is complete; that is, when no more steps are left to run. This creates a transient cluster.
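For example, with the AWS CLI you can make the auto-termination behavior explicit when you launch the cluster. The following is a minimal sketch; the instance group settings, script path, and bucket names are placeholders, and --auto-terminate and --no-auto-terminate are the standard create-cluster options:

aws emr create-cluster --applications Name=Pig --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=PIG,Name="Pig Program",Args=[-f,s3://mybucket/scripts/pigscript.pig] \
--auto-terminate

Replace --auto-terminate with --no-auto-terminate to keep the cluster running (long-running) after the steps complete.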

For more information on choosing whether to create a long-running or a transient cluster, see Choose the Cluster Lifecycle: Long-Running or Transient (p. 123).

Topics
· Terminate a Cluster (p. 499)
· Managing Cluster Termination (p. 503)

Terminate a Cluster

This section describes the methods of terminating a cluster. You can terminate clusters in the STARTING, RUNNING, or WAITING states. A cluster in the WAITING state must be terminated or it runs indefinitely, generating charges to your account. You can terminate a cluster that fails to leave the STARTING state or is unable to complete a step.

If you are terminating a cluster that has termination protection set on it, you must first disable termination protection before you can terminate the cluster. After termination protection is disabled, you can terminate the cluster. Clusters can be terminated using the console, the AWS CLI, the Amazon EMR CLI, or programmatically using the TerminateJobFlows API.

Depending on the configuration of the cluster, it may take 5 to 20 minutes for the cluster to completely terminate and release allocated resources, such as EC2 instances.


Terminate a Cluster Using the Console

You can terminate one or more clusters using the Amazon EMR console. The steps to terminate a cluster in the console vary depending on whether termination protection is on or off. To terminate a protected cluster, you must first disable termination protection.

To terminate a cluster with termination protection off

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
3. Click Terminate.

4. When prompted, click Terminate.

Amazon EMR terminates the instances in the cluster and stops saving log data.

To terminate a cluster with termination protection on

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. On the Cluster List page, select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
3. Click Terminate.

4. When prompted, click Change to turn termination protection off. Note, if you selected multiple clusters, click Turn off all to disable termination protection for all the clusters at once.
5. In the Terminate clusters dialog, for Termination Protection, click Off and then click the check mark to confirm.


6. Click Terminate.

Amazon EMR terminates the instances in the cluster and stops saving log data.

Terminate a Cluster Using the AWS CLI or the Amazon EMR CLI

To terminate an unprotected cluster using the AWS CLI

To terminate an unprotected cluster using the AWS CLI, use the terminate-clusters subcommand with the --cluster-ids parameter.

· Type the following command to terminate one or more clusters:

aws emr terminate-clusters --cluster-ids string

For example, to terminate a single cluster, type the following command:

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG

To terminate multiple clusters, type the following command:

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG j-WJ2XXXXXX8EU

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To terminate a protected cluster using the AWS CLI

To terminate a protected cluster using the AWS CLI, first disable termination protection using the modify-cluster-attributes subcommand with the --no-termination-protected parameter. Then use the terminate-clusters subcommand with the --cluster-ids parameter to terminate it.

1. Type the following command to disable termination protection:


aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-termination-protected

2. Terminate the cluster by typing the following command:

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG

To terminate multiple clusters, type the following command:

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG j-WJ2XXXXXX8EU

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To terminate an unprotected cluster using the Amazon EMR CLI

To terminate an unprotected cluster using the Amazon EMR CLI, type the --terminate parameter and specify the cluster to terminate.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --terminate JobFlowID

· Windows users:

ruby elastic-mapreduce --terminate JobFlowID

To terminate a protected cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

1. Disable termination protection by setting the --set-termination-protection parameter to false. This is shown in the following example, where JobFlowID is the identifier of the cluster on which to disable termination protection.

elastic-mapreduce --set-termination-protection false --jobflow JobFlowID

2. Terminate the cluster using the --terminate parameter and the cluster identifier of the cluster to terminate.


In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --terminate JobFlowID

· Windows users:

ruby elastic-mapreduce --terminate JobFlowID

Terminate a Cluster Using the API

The TerminateJobFlows operation ends step processing, uploads any log data from Amazon EC2 to Amazon S3 (if configured), and terminates the Hadoop cluster. A cluster also terminates automatically if you set KeepJobFlowAliveWhenNoSteps to False in a RunJobFlow request.

You can use this action to terminate either a single cluster or a list of clusters by their cluster IDs.

For more information about the input parameters unique to TerminateJobFlows, see TerminateJobFlows. For more information about the generic parameters in the request, see Common Request Parameters.

Managing Cluster Termination

Termination protection ensures that the EC2 instances in your job flow are not shut down by an accident or error. This protection is especially useful if your cluster contains data in instance storage that you need to recover before those instances are terminated.

When termination protection is not enabled, you can terminate clusters either through calls to the TerminateJobFlows API, through the Amazon EMR console, or by using the command line interface. In addition, the master node may terminate a task node that has become unresponsive or has returned an error.

By default, termination protection is enabled when you launch a cluster using the console. Termination protection is disabled by default when you launch a cluster using the CLI or API. When termination protection is enabled, you must explicitly remove termination protection from the cluster before you can terminate it. With termination protection enabled, TerminateJobFlows cannot terminate the cluster and users cannot terminate the cluster using the CLI. Users terminating the cluster using the Amazon EMR console receive an extra confirmation box asking if they want to remove termination protection before terminating the cluster.

If you attempt to terminate a protected cluster with the API or CLI, the API returns an error, and the CLI exits with a non-zero return code.

When you submit steps to a cluster, the ActionOnFailure setting determines what the cluster does in response to any errors. The possible values for this setting are listed below, followed by an example of setting it:

· TERMINATE_JOB_FLOW: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND auto-terminate disabled, it will not terminate.
· CANCEL_AND_WAIT: If the step fails, cancel the remaining steps. If the cluster has auto-terminate disabled, the cluster will not terminate.
· CONTINUE: If the step fails, continue to the next step.
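For example, the following AWS CLI sketch (the cluster ID, script path, and bucket names are placeholders) adds a Pig step that cancels any remaining steps if it fails:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=PIG,Name="Pig Program",ActionOnFailure=CANCEL_AND_WAIT,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/]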


Termination Protection in Amazon EMR and Amazon EC2

Termination protection of clusters in Amazon EMR is analogous to setting the disableAPITermination flag on an EC2 instance. In the event of a conflict between the termination protection set in Amazon EC2 and that set in Amazon EMR, the Amazon EMR cluster protection status overrides that set by Amazon EC2 on the given instance. For example, if you use the Amazon EC2 console to enable termination protection on an EC2 instance in an Amazon EMR cluster that has termination protection disabled, Amazon EMR turns off termination protection on that EC2 instance and shuts down the instance when the rest of the cluster terminates.

Termination Protection and Spot Instances

Amazon EMR termination protection does not prevent an Amazon EC2 Spot Instance from terminating when the Spot Price rises above the maximum bid price. For more information about the behavior of Spot Instances in Amazon EMR, see Lower Costs with Spot Instances (Optional) (p. 37).

Termination Protection and Auto-terminate

Enabling auto-terminate creates a transient cluster. The cluster automatically terminates when the last step is successfully completed even if termination protection is enabled.

Disabling auto-terminate causes instances in a cluster to persist after steps have successfully completed, but still allows the cluster to be terminated by user action, by errors, and by calls to TerminateJobFlows (if termination protection is disabled).
Note
By default, auto-terminate is disabled for clusters launched using the console and the CLI. Clusters launched using the API have auto-terminate enabled.

Configuring Termination Protection for New Clusters

You can enable or disable termination protection when you launch a cluster using the console, the AWS CLI, the Amazon EMR CLI, or the API.

To configure termination protection for a new cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create cluster.
3. In the Cluster Configuration section, set the Termination protection field to Yes to enable protection, or set the field to No to disable it. By default, termination protection is enabled.

4. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster (p. 30).


To configure termination protection for a new cluster using the AWS CLI

Using the AWS CLI, you can launch a cluster with termination protection enabled by typing the create-cluster command with the --termination-protected parameter. By default, termination protection is disabled when you launch a cluster using the AWS CLI. You can also use the --no-termination-protected parameter to disable termination protection.

· To launch a protected cluster, type the following command:

aws emr create-cluster --applications Name=string Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--termination-protected

For example:

aws emr create-cluster --applications Name=Hive Name=Pig --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--termination-protected

To launch a cluster with termination protection disabled, type:

aws emr create-cluster --applications Name=Hive Name=Pig --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--no-termination-protected

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To configure termination protection for a new cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· To enable termination protection using the Amazon EMR CLI, specify --set-termination-protection true during the cluster creation call. If the parameter is not used, termination protection is disabled. You can also type --set-termination-protection false to disable protection. The following example shows setting termination protection on a cluster running the WordCount sample application.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).
Note
The Hadoop streaming syntax shown in the following examples is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, type the following command:

· Linux, UNIX, and Mac OS X users:


./elastic-mapreduce --create --alive --ami-version 3.0.3 \
--instance-type m1.xlarge --num-instances 2 \
--stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
--input s3://elasticmapreduce/samples/wordcount/input \
--output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate \
--set-termination-protection true

· Windows users:

ruby elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge --num-instances 2 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate --set-termination-protection true

For Hadoop 1.x, type the following command:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 2 --stream \
--input s3://elasticmapreduce/samples/wordcount/input \
--output s3://myawsbucket/wordcount/output/2011-03-25 \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate \
--set-termination-protection true

· Windows users:

ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 2 --stream --input s3://elasticmapreduce/samples/wordcount/input --output s3://myawsbucket/wordcount/output/2011-03-25 --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --set-termination-protection true

Configuring Termination Protection for Running Clusters

You can configure termination protection for a running cluster using the console, the Amazon EMR CLI, or the AWS CLI.

To configure termination protection for a running cluster using the console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. On the Cluster List page, click the link for your cluster.
3. On the Cluster Details page, in the Summary section, for Termination protection, click Change.


4. Click On and then click the check mark icon to enable termination protection. Alternatively, click Off to disable it.

To configure termination protection for a running cluster using the AWS CLI

To enable termination protection on a running cluster using the AWS CLI, type the modify-cluster-attributes subcommand with the --termination-protected parameter. To disable it, type the --no-termination-protected parameter.

· Type the following command to enable termination protection on a running cluster:

aws emr modify-cluster-attributes --cluster-id string --termination-protected

For example:

aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --termination-protected

To disable termination protection, type:

aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-termination-protected

To configure termination protection for a running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Set the --set-termination-protection flag to true. This is shown in the following example, where JobFlowID is the identifier of the cluster on which to enable termination protection.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --set-termination-protection true --jobflow JobFlowID


· Windows users:

ruby elastic-mapreduce --set-termination-protection true --jobflow JobFlowID

Disable termination protection by setting the --set-termination-protection parameter to false. This is shown in the following example, where JobFlowID is the identifier of the cluster on which to disable termination protection.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --set-termination-protection false --jobflow JobFlowID

· Windows users:

ruby elastic-mapreduce --set-termination-protection false --jobflow JobFlowID

Resize a Running Cluster

Topics
· Resize a Cluster Using the Console (p. 509)
· Resize a Cluster Using the AWS CLI or the Amazon EMR CLI (p. 510)
· Arrested State (p. 512)
· Legacy Clusters (p. 515)

A cluster contains a single master node. The master node controls any slave nodes that are present. There are two types of slave nodes: core nodes, which store data in the Hadoop Distributed File System (HDFS), and task nodes, which do not use HDFS.

Nodes within a cluster are managed as instance groups. All clusters require a master instance group containing a single master node. Clusters using slave nodes require a core instance group that contains at least one core node. Additionally, if a cluster has a core instance group, it can also have a task instance group containing one or more task nodes.
Note
You must have at least one core node at cluster creation in order to resize the cluster. In other words, single-node clusters cannot be resized.

You can resize the core instance group in a running cluster by adding nodes using the console, CLI, or API. You cannot shrink the size of the core instance group in a running cluster by reducing the instance count. However, it is possible to terminate an instance in the core instance group using the AWS CLI or the API. This should be done with caution. Terminating an instance in the core instance group risks data loss, and the instance is not automatically replaced.

Task nodes also run your Hadoop jobs. After a cluster is running, you can increase or decrease the number of task nodes using the console, CLI, or API.


When your cluster runs, Hadoop determines the number of mapper and reducer tasks needed to process the data. Larger clusters should have more tasks for better resource use and shorter processing time. Typically, an Amazon EMR cluster remains the same size for the life of the cluster; you set the number of tasks when you create the cluster. When you resize a running cluster, you can vary the processing capacity during cluster execution. Therefore, instead of using a fixed number of tasks, you can vary the number of tasks during the life of the cluster. There are two configuration options to help set the ideal number of tasks:

· mapred.map.tasksperslot
· mapred.reduce.tasksperslot

You can set both options in the mapred-conf.xml file. When you submit a job to the cluster, the job client checks the current total number of map and reduce slots available cluster wide. The job client then uses the following equations to set the number of tasks:

· mapred.map.tasks = mapred.map.tasksperslot * map slots in cluster
· mapred.reduce.tasks = mapred.reduce.tasksperslot * reduce slots in cluster

The job client only reads the tasksperslot parameter if the number of tasks is not configured. You can override the number of tasks at any time, either for all clusters via a bootstrap action or individually per job by adding a step to change the configuration.
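As a sketch of the bootstrap-action approach, the following AWS CLI fragment passes both settings to the configure-hadoop bootstrap action at cluster creation; the value of 2 tasks per slot and the instance group settings are placeholders, and the rest of the create-cluster arguments are omitted for brevity:

aws emr create-cluster --ami-version 2.4.7 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-m","mapred.map.tasksperslot=2","-m","mapred.reduce.tasksperslot=2"]

Here each -m argument passes one mapred-site key/value pair to Hadoop.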

Amazon EMR withstands slave node failures and continues cluster execution even if a slave node becomes unavailable. Amazon EMR automatically provisions additional slave nodes to replace those that fail.

You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.

Resize a Cluster Using the Console

You can use the Amazon EMR console to resize a running cluster.

To resize a running cluster using the console

1. From the Cluster List page, click a cluster to resize.
2. On the Cluster Details page, click Resize. Alternatively, you can expand the Hardware Configuration section, click the Resize button adjacent to the core or task nodes, and increase or decrease the number of instances for the instance group.
3. To add task nodes to a cluster that has none, click Add Task Nodes to choose the task node type, the number of task nodes, and whether the task nodes are spot instances.

Note You can only increase the number of core nodes in the console, but you can both increase and decrease the number of task nodes.

When you make a change to the number of nodes, the Amazon EMR console updates the status of the instance group through the Provisioning and Resizing states until they are ready and indicates in brackets the newly requested number of nodes. When the change to the node count finishes, the instance groups return to the Running state.

Resize a Cluster Using the AWS CLI or the Amazon EMR CLI

You can use either the AWS CLI or the Amazon EMR CLI to resize a running cluster. You can increase or decrease the number of task nodes, and you can increase the number of core nodes in a running cluster. It is also possible to terminate an instance in the core instance group using the AWS CLI or the API. This should be done with caution. Terminating an instance in the core instance group risks data loss, and the instance is not automatically replaced.

In addition to resizing the core and task groups, you can also add a task group to a running cluster using the AWS CLI or the Amazon EMR CLI.

To resize a cluster by changing the instance count using the AWS CLI

· You can add instances to the core group or task group, and you can remove instances from the task group using the AWS CLI modify-instance-groups subcommand with the InstanceCount parameter:

aws emr modify-instance-groups --instance-groups InstanceGroupId=string,InstanceCount=integer

To add instances to the core or task groups, increase the InstanceCount. To reduce the number of instances in the task group, decrease the InstanceCount. Changing the instance count of the task group to 0 removes all instances but not the instance group. For example, to increase the number of instances in the task instance group from 3 to 4, type:

aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4

To retrieve the InstanceGroupId, use the describe-cluster subcommand. The output is a JSON object called Cluster that contains the ID of each instance group. To use this command, you need the cluster ID (which you can retrieve using the aws emr list-clusters command or the console):

aws emr describe-cluster --cluster-id string
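To avoid scanning the full JSON output by eye, you can optionally use the AWS CLI's generic --query option to list only the instance group IDs and types; this is a sketch, and the cluster ID is a placeholder:

aws emr describe-cluster --cluster-id j-3KVXXXXXXY7UG --query "Cluster.InstanceGroups[*].[Id,InstanceGroupType,RunningInstanceCount]" --output text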

Using the AWS CLI, you can also terminate an instance in the core instance group with the modify-instance-groups subcommand. This should be done with caution. Terminating an instance in the core instance group risks data loss, and the instance is not automatically replaced. To terminate a specific instance, you need the instance group ID (returned by the aws emr describe-cluster --cluster-id subcommand) and the instance ID (returned by the aws emr list-instances --cluster-id subcommand):

aws emr modify-instance-groups --instance-groups InstanceGroupId=string,EC2InstanceIdsToTerminate=string

For example:

aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-6RXXXXXX07SA,EC2InstanceIdsToTerminate=i-f9XXXXf2


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To resize a cluster by adding a task instance group using the AWS CLI

· Using the AWS CLI, you can add a task instance group to a cluster with the add-instance-groups subcommand:

aws emr add-instance-groups --cluster-id string --instance-groups InstanceCount=integer,InstanceGroupType=string,InstanceType=string

For example:

aws emr add-instance-groups --cluster-id j-JXBXXXXXX37R --instance-groups InstanceCount=6,InstanceGroupType=task,InstanceType=m1.large

Note You cannot add a task instance group to a cluster if the cluster was created with a task group, even if the instance count for the group is 0.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Parameters to increase or decrease nodes using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· You can increase or decrease the number of task nodes in a running cluster using the Amazon EMR CLI. You can also increase but not decrease the number of core nodes. The parameters are listed in the following table.

Parameter Description

--modify-instance-group INSTANCE_GROUP_ID    Modify an existing instance group.

--instance-count INSTANCE_COUNT    Set the count of nodes for an instance group.
Note
You are only allowed to increase the number of nodes in a core instance group. You can increase or decrease the number of nodes in a task instance group. Master instance groups cannot be modified.

Parameters to add a task instance group to a running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.


· You can add a task instance group to your running cluster using the Amazon EMR CLI. The parameters are listed in the following table.

Parameter Description

--add-instance-group ROLE Add an instance group to an existing cluster. The role may be TASK only. Currently, Amazon Elastic MapReduce (Amazon EMR) does not permit adding core or master instance groups to a running cluster.

--instance-count INSTANCE_COUNT Set the count of nodes for an instance group.

--instance-type INSTANCE_TYPE Set the type of EC2 instance to create nodes for an instance group.

Arrested State

An instance group goes into arrested state if it encounters too many errors while trying to start the new cluster nodes. For example, if new nodes fail while performing bootstrap actions, the instance group goes into an ARRESTED state, rather than continuously provisioning new nodes. After you resolve the underlying issue, reset the desired number of nodes on the cluster's instance group, and then the instance group resumes allocating nodes. Modifying an instance group instructs Amazon EMR to attempt to provision nodes again. No running nodes are restarted or terminated.

In the AWS CLI, the list-instances subcommand returns all instances and their states, as does the describe-cluster subcommand. In the Amazon EMR CLI, the --describe command returns all instance groups and node types, and you can see the state of the instance groups for the cluster. If Amazon EMR detects a fault with an instance group, it changes the group's state to ARRESTED.

To reset a cluster in an ARRESTED state using the AWS CLI

· Type the describe-cluster subcommand with the --cluster-id parameter to view the state of the instances in your cluster.

For example, to view information on all instances and instance groups in a cluster, type:

aws emr describe-cluster --cluster-id j-3KVXXXXXXY7UG

The output will display information about your instance groups and the state of the instances:

{ "Cluster": { "Status": { "Timeline": { "ReadyDateTime": 1413187781.245, "CreationDateTime": 1413187405.356 }, "State": "WAITING", "StateChangeReason": { "Message": "Waiting after step completed" } }, "Ec2InstanceAttributes": { "Ec2AvailabilityZone": "us-west-2b"

API Version 2009-03-31 512 Amazon Elastic MapReduce Developer Guide Arrested State

}, "Name": "Development Cluster", "Tags": [], "TerminationProtected": false, "RunningAmiVersion": "3.1.1", "NormalizedInstanceHours": 16, "InstanceGroups": [ { "RequestedInstanceCount": 1, "Status": { "Timeline": { "ReadyDateTime": 1413187775.749, "CreationDateTime": 1413187405.357 }, "State": "RUNNING", "StateChangeReason": { "Message": "" } }, "Name": "MASTER", "InstanceGroupType": "MASTER", "InstanceType": "m1.large", "Id": "ig-3ETXXXXXXFYV8", "Market": "ON_DEMAND", "RunningInstanceCount": 1 }, { "RequestedInstanceCount": 1, "Status": { "Timeline": { "ReadyDateTime": 1413187781.301, "CreationDateTime": 1413187405.357 }, "State": "RUNNING", "StateChangeReason": { "Message": "" } }, "Name": "CORE", "InstanceGroupType": "CORE", "InstanceType": "m1.large", "Id": "ig-3SUXXXXXXQ9ZM", "Market": "ON_DEMAND", "RunningInstanceCount": 1 } ... }

To view information on a particular instance group, type the list-instances subcommand with the --cluster-id and --instance-group-types parameters. You can view information for the MASTER, CORE, or TASK groups:

aws emr list-instances --cluster-id j-3KVXXXXXXY7UG --instance-group-types "CORE"


Use the modify-instance-groups subcommand with the --instance-groups parameter to reset a cluster in the ARRESTED state. The instance group id is returned by the describe-cluster subcommand:

aws emr modify-instance-groups --instance-groups InstanceGroupId=string,InstanceCount=integer

For example:

aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-3SUXXXXXXQ9ZM,InstanceCount=3

Tip
You do not need to change the number of nodes from the original configuration to free a running cluster. Set --instance-count to the same count as the original setting.

To reset a cluster in an ARRESTED state using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Use the --modify-instance-group command to reset a cluster in the ARRESTED state. Enter the --modify-instance-group command as follows:

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --modify-instance-group InstanceGroupID \
--instance-count COUNT

· Windows users:

ruby elastic-mapreduce --modify-instance-group InstanceGroupID --instance-count COUNT

InstanceGroupID is the ID of the arrested instance group and COUNT is the number of nodes you want in the instance group.

Tip
You do not need to change the number of nodes from the original configuration to free a running cluster. Set --instance-count to the same count as the original setting.


Legacy Clusters

Before October 2010, Amazon EMR did not have the concept of instance groups. Clusters developed for Amazon EMR that were built before the option to resize running clusters was available are considered legacy clusters. Previously, the Amazon EMR architecture did not use instance groups to manage nodes and only one type of slave node existed. Legacy clusters reference slaveInstanceType and other now deprecated fields. Amazon EMR continues to support the legacy clusters; you do not need to modify them to run them correctly.

Cluster Behavior

If you run a legacy cluster and only configure master and slave nodes, you observe a slaveInstanceType and other deprecated fields associated with your clusters.

Mapping Legacy Clusters to Instance Groups

Before October 2010, all cluster nodes were either master nodes or slave nodes. An Amazon EMR configuration could typically be represented like the following diagram.

Old Amazon EMR Model

1 A legacy cluster launches and a request is sent to Amazon EMR to start the cluster.

2 Amazon EMR creates a Hadoop cluster.

3 The legacy cluster runs on a cluster consisting of a single master node and the specified number of slave nodes.


Clusters created using the older model are fully supported and function as originally designed. The Amazon EMR API and commands map directly to the new model. Master nodes remain master nodes and become part of the master instance group. Slave nodes still run HDFS and become core nodes that join the core instance group.
Note
No task instance group or task nodes are created as part of a legacy cluster; however, you can add them to a running cluster at any time.

The following diagram illustrates how a legacy cluster now maps to master and core instance groups.

Old Amazon EMR Model Remapped to Current Architecture

1 A request is sent to Amazon EMR to start a cluster.

2 Amazon EMR creates a Hadoop cluster with a master instance group and core instance group.

3 The master node is added to the master instance group.

4 The slave nodes are added to the core instance group.

Cloning a Cluster Using the Console

You can use the Amazon EMR console to clone a cluster, which makes a copy of the configuration of the original cluster to use as the basis for a new cluster.


To clone a cluster using the console

1. From the Cluster List page, click a cluster to clone.
2. At the top of the Cluster Details page, click Clone.

In the dialog box, choose Yes to include the steps from the original cluster in the cloned cluster. Choose No to clone the original cluster's configuration without including any of the steps.
Note
For clusters created using AMI 3.1.1 and later (Hadoop 2.x) or AMI 2.4.8 and later (Hadoop 1.x), if you clone a cluster and include steps, all system steps (such as configuring Hive) are cloned along with user-submitted steps, up to 1,000 total. Any older steps that no longer appear in the console's step history cannot be cloned. For earlier AMIs, only 256 steps can be cloned (including system steps). For more information, see Submit Work to a Cluster (p. 517).
3. The Create Cluster page appears with a copy of the original cluster's configuration. Review the configuration, make any necessary changes, and then click Create Cluster.

Submit Work to a Cluster

This section describes the methods for submitting work to an Amazon EMR cluster. You can submit work to a cluster by adding steps or by interactively submitting Hadoop jobs to the master node. You can add steps to a cluster using the AWS CLI, the Amazon EMR CLI, the Amazon EMR API, or the console. You can submit Hadoop jobs interactively by establishing an SSH connection to the master node (using an SSH client such as PuTTY or OpenSSH) or by using the ssh subcommand in the AWS CLI or the Amazon EMR CLI.

Beginning with AMI 3.1.1 (Hadoop 2.x) and AMI 2.4.8 (Hadoop 1.x), the maximum number of PENDING and ACTIVE steps allowed in a cluster is 256 (this includes system steps such as install Pig, install Hive, install HBase, and configure debugging). You can submit an unlimited number of steps over the lifetime of a long-running cluster created using these AMIs, but only 256 steps can be ACTIVE or PENDING at any given time.

The total number of step records you can view (regardless of status) is 1,000. This total includes both user-submitted and system steps. As the status of user-submitted steps changes to COMPLETED or FAILED, additional user-submitted steps can be added to the cluster until the 1,000 step limit is reached. After 1,000 steps have been added to a cluster, the submission of additional steps causes the removal of older, user-submitted step records. These records are not removed from the log files. They are removed from the console display, and they do not appear when you use the CLI or API to retrieve cluster information. System step records are never removed.

The step information you can view depends on the mechanism used to retrieve cluster information. The following table indicates the step information returned by each of the available options, and an AWS CLI example follows the table.

Option    DescribeJobFlow or --describe --jobflow    ListSteps or list-steps

SDK 256 steps 1,000 steps

Amazon EMR CLI 256 steps NA

AWS CLI NA 1,000 steps

API 256 steps 1,000 steps
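For example, to retrieve the step records available through the ListSteps path in the table, you can use the list-steps subcommand of the AWS CLI; the cluster ID shown here is a placeholder:

aws emr list-steps --cluster-id j-2AXXXXXXGAPLF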


For clusters created using AMI version 3.1.0 and earlier (Hadoop 2.x) or AMI version 2.4.7 and earlier (Hadoop 1.x), the total number of steps available over the lifetime of a cluster is limited to 256. For more information about how to overcome this limitation, see Add More than 256 Steps to a Cluster (p. 525).

Topics
· Add Steps Using the CLI and Console (p. 518)
· Submit Hadoop Jobs Interactively (p. 522)
· Add More than 256 Steps to a Cluster (p. 525)

Add Steps Using the CLI and Console

You can add steps to a cluster using the AWS CLI, the Amazon EMR CLI, or the console. Using the Amazon EMR console, you can add steps to a cluster when the cluster is created. You can also use the console to add steps to a running cluster if the cluster was created with the auto-terminate option disabled. Disabling the auto-terminate option creates a long-running cluster.

Add steps using the console

Whether you add steps during cluster creation or while the cluster is running, the process is similar to the following procedure.

To add a step to a running cluster using the Amazon EMR console

1. In the Amazon EMR console, on the Cluster List page, click the link for your cluster.
2. On the Cluster Details page, expand the Steps section, and then click Add step.
3. Type appropriate values in the fields in the Add Step dialog, and then click Add. Depending on the step type, the options are different. The following example shows options for adding a Pig step.


Note
Action on failure determines what action to take if the step fails. You can choose to continue processing, cancel and wait, or terminate the cluster. For more information, see Managing Cluster Termination (p. 503).

Add Steps Using the AWS CLI

The following procedures demonstrate adding steps to a newly-created cluster and to a running cluster using the AWS CLI. In both examples, the --steps subcommand is used to add steps to the cluster.

To add a step during cluster creation

· Type the following command to create a cluster and add a step:

aws emr create-cluster --applications Name=string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--steps Type=string,Name="string",ActionOnFailure=string,Args=[-f,pathtoscript,-p,INPUT=pathtoinputdata,-p,OUTPUT=pathtooutputdata,argument1,argument2] \
--no-auto-terminate

For example, to add a Pig step during cluster creation:

aws emr create-cluster --applications Name=Pig --ami-version 2.4.7 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/] \
--no-auto-terminate

Note The list of arguments changes depending on the type of step.

The --no-auto-terminate parameter keeps the cluster running even when all steps have been completed, unless you manually terminate it. The output is a cluster identifier similar to the following:

{ "ClusterId": "j-2AXXXXXXGAPLF" }

To add a step to a running cluster

· Type the following command to add a step to a running cluster. To run this command, you must know the cluster ID.


aws emr add-steps --cluster-id string --steps Type=string,Name="string",Args=[-f,pathtoscript,-p,INPUT=pathtoinputdata,-p,OUTPUT=pathtooutputdata,argument1,argument2]

For example, to add a Pig step to a running cluster:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=PIG,Name="Pig Program",Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/]

The output is a step identifier similar to the following:

{ "StepIds": [ "s-Y9XXXXXXAPMD" ] }

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Add Steps Using the Amazon EMR CLI

This section describes the methods for adding steps to a cluster using the Amazon EMR CLI. You can add steps to a running cluster only if you used the --alive parameter when you created the cluster. This parameter creates a long-running cluster by keeping the cluster active even after the completion of your steps.

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

To add a step to a running cluster using the Amazon EMR CLI

· In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce -j JobFlowID \ --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \ --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \ --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \ --arg hdfs:///cloudburst/output/1 \ --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

· Windows users:

ruby elastic-mapreduce -j JobFlowID --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg hdfs:///cloudburst/output/1 --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

This command adds a step that downloads and runs a JAR file. The arguments are passed to the main function in the JAR file. If your JAR file does not have a manifest, specify the JAR file's main class using the --main-class option.

Wait for Steps to Complete

When you submit steps to a cluster using the Amazon EMR CLI, you can specify that the CLI should wait until the cluster has completed all pending steps before accepting additional commands. This can be useful, for example, if you are using a step to copy data from Amazon S3 into HDFS and need to be sure that the copy operation is complete before you run the next step in the cluster. You do this by specifying the --wait-for-steps parameter after you submit the copy step.
Note
The AWS CLI does not have an option comparable to the --wait-for-steps parameter.

The --wait-for-steps parameter does not ensure that the step completes successfully, just that it has finished running. If, as in the earlier example, you need to ensure the step was successful before submitting the next step, check the cluster status. If the step failed, the cluster is in the FAILED status.

Although you can add the --wait-for-steps parameter in the same CLI command that adds a step to the cluster, it is best to add it in a separate CLI command. This ensures that the --wait-for-steps argument is parsed and applied after the step is created.

To wait until a step completes

· Add the --wait-for-steps parameter to the cluster. This is illustrated in the following example, where JobFlowID is the cluster identifier that Amazon EMR returned when you created the cluster. The JAR, main class, and arguments specified in the first CLI command are from the Word Count sample application; this command adds a step to the cluster. The second CLI command causes the cluster to wait until all of the currently pending steps have completed before accepting additional commands.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce -j JobFlowID \ --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \ --main-class org.myorg.WordCount \ --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \ --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \ --arg hdfs:///cloudburst/output/1 \ --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16


./elastic-mapreduce -j JobFlowID --wait-for-steps

· Windows users:

ruby elastic-mapreduce -j JobFlowID --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --main-class org.myorg.WordCount --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg hdfs:///cloudburst/output/1 --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

ruby elastic-mapreduce -j JobFlowID --wait-for-steps
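Because the AWS CLI has no equivalent of --wait-for-steps (as noted earlier), AWS CLI users can approximate the behavior by polling the cluster until no steps remain pending or running. The following is a minimal sketch, not part of the Amazon EMR tooling; the cluster ID is a placeholder and the 30-second polling interval is arbitrary.

#!/bin/bash
# Poll until the cluster has no PENDING or RUNNING steps.
CLUSTER_ID="j-2AXXXXXXGAPLF"

while true; do
  # Count the steps that are still pending or running.
  ACTIVE=$(aws emr list-steps --cluster-id "$CLUSTER_ID" \
    --query "Steps[?Status.State=='PENDING' || Status.State=='RUNNING'].Id" \
    --output text | wc -w)
  if [ "$ACTIVE" -eq 0 ]; then
    echo "All steps have finished."
    break
  fi
  echo "$ACTIVE step(s) still active; waiting..."
  sleep 30
done

As with --wait-for-steps, this only tells you that the steps have stopped running; check the individual step states to confirm that they completed successfully.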

Submit Hadoop Jobs Interactively

In addition to adding steps to a cluster, you can connect to the master node using an SSH client, the AWS CLI, or the Amazon EMR CLI, and interactively submit Hadoop jobs. For example, you can use PuTTY to establish an SSH connection with the master node and submit interactive Hive queries which are compiled into one or more Hadoop jobs.

You can submit jobs interactively to the master node even if you have 256 active steps running on the cluster. Note however that log records associated with interactively submitted jobs are included in the "step created jobs" section of the currently running step's controller log. For more information about step logs, see View Log Files (p. 449).

The following examples demonstrate interactively submitting Hadoop jobs and Hive jobs to the master node. The process for submitting jobs for other programming frameworks (such as Pig) is similar to these examples.

To submit Hadoop jobs interactively using the AWS CLI

· You can submit Hadoop jobs interactively using the AWS CLI by establishing an SSH connection in the CLI command (using the ssh subcommand).

To copy a JAR file from your local machine to the master node's file system, type the following command:

aws emr put --cluster-id string --key-pair-file string --src string

For example, to copy a JAR file from a Windows machine to your cluster:

aws emr put --cluster-id j-2A6HXXXXXXL7J --key-pair-file "C:\Users\username\Desktop\Keys\mykey.ppk" --src "C:\Users\username\myjar.jar"

To create an SSH connection and submit the Hadoop job, type the following command:

aws emr ssh --cluster-id string --key-pair-file string --command "string"

For example, to run myjar.jar:


aws emr ssh --cluster-id j-2A6HXXXXXXL7J --key-pair-file "C:\Users\username\Desktop\Keys\mykey.ppk" --command "hadoop jar myjar.jar"

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To interactively submit Hadoop jobs using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· To interactively submit Hadoop jobs using the Amazon EMR CLI, use the --ssh parameter to create an SSH connection to the master node and set the value to the command you want to run.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. This command uses the --scp parameter to copy the JAR file myjar.jar from your local machine to the master node of cluster JobFlowID and runs the command using an SSH connection. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --scp myjar.jar --ssh "hadoop jar myjar.jar"

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --scp myjar.jar --ssh "hadoop jar myjar.jar"

To interactively submit Hive jobs using the AWS CLI

In addition to submitting jobs to the master node via JAR files, you can submit jobs by interacting with one of the Hadoop programming frameworks running on the master node. For example, you can interactively submit Hive queries or Pig transformations at the command line, or you can submit scripts to the cluster for processing. Your commands or scripts are then compiled into one or more Hadoop jobs.

The following procedure demonstrates running a Hive script using the AWS CLI.

1. If Hive is not installed on the cluster, type the following command to install it.

aws emr install-applications --cluster-id string --apps Name=string

For example:

aws emr install-applications --cluster-id j-2A6HXXXXXXL7J --apps Name=Hive


2. Create a Hive script file containing the queries or commands to run. The following example script named my-hive.q creates two tables, aTable and anotherTable, and copies the contents of aTable to anotherTable, replacing all data.

---- sample Hive script file: my-hive.q ----
create table aTable (aColumn string);
create table anotherTable like aTable;
insert overwrite table anotherTable select * from aTable;

3. Type the following commands to run the script from the command line using the ssh subcommand.

To copy the script file from your local machine to the master node's file system, type the following command:

aws emr put --cluster-id string --key-pair-file string --src string

For example, to copy my-hive.q from a Windows machine to your cluster:

aws emr put --cluster-id j-2A6HXXXXXXL7J --key-pair-file "C:\Users\username\Desktop\Keys\mykey.ppk" --src "C:\Users\username\my-hive.q"

To create an SSH connection and submit the Hive script, type the following command:

aws emr ssh --cluster-id string --key-pair-file string --command "string"

For example, to submit my-hive.q:

aws emr ssh --cluster-id j-2A6HXXXXXXL7J --key-pair-file "C:\Users\username\Desktop\Keys\mykey.ppk" --command "hive -f my-hive.q"

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To interactively submit Hive jobs using the Amazon EMR CLI

In the directory where you installed the Amazon EMR CLI, type the following commands. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

1. If Hive is not already installed, type the following command to install it.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --hive-interactive

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --hive-interactive


2. Create a Hive script file containing the queries or commands to run. The following example script named my-hive.q creates two tables, aTable and anotherTable, and copies the contents of aTable to anotherTable, replacing all data.

---- sample Hive script file: my-hive.q ----
create table aTable (aColumn string);
create table anotherTable like aTable;
insert overwrite table anotherTable select * from aTable;

3. Type the following command, using the --scp parameter to copy the script from your local machine to the master node and the --ssh parameter to create an SSH connection and submit the Hive script for processing.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow JobFlowID --scp my-hive.q \
--ssh "hive -f my-hive.q"

· Windows users:

ruby elastic-mapreduce --jobflow JobFlowID --scp my-hive.q --ssh "hive -f my-hive.q"

Add More than 256 Steps to a Cluster

Beginning with AMI 3.1.1 (Hadoop 2.x) and AMI 2.4.8 (Hadoop 1.x), you can submit an unlimited number of steps over the lifetime of a long-running cluster, but only 256 can be active or pending at any given time. For earlier AMI versions, the total number of steps that can be processed by a cluster is limited to 256 (including system steps such as install Hive and install Pig). For more information, see Submit Work to a Cluster (p. 517).

You can use one of several methods to overcome the 256 step limit in pre-3.1.1 and pre-2.4.8 AMIs:

1. Have each step submit several jobs to Hadoop. This does not allow you unlimited steps in pre-3.1.1 and pre-2.4.8 AMIs, but it is the easiest solution if you need a fixed number of steps greater than 256.
2. Write a workflow program that runs in a step on a long-running cluster and submits jobs to Hadoop (see the sketch following this list). You could have the workflow program either:
   · Listen to an Amazon SQS queue to receive information about new steps to run.
   · Check an Amazon S3 bucket on a regular schedule for files containing information about the new steps to run.
3. Write a workflow program that runs on an EC2 instance outside of Amazon EMR and submits jobs to your long-running clusters using SSH.
4. Connect to your long-running cluster via SSH and submit Hadoop jobs using the Hadoop API. For more information, see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobClient.html.
5. Connect to the master node using an SSH client (such as PuTTY or OpenSSH) and manually submit jobs to the cluster or use the ssh subcommand in the AWS CLI to both connect and submit jobs. For more information about establishing an SSH connection with the master node, see Connect to the Master Node Using SSH (p. 481). For more information about interactively submitting Hadoop jobs, see Submit Hadoop Jobs Interactively (p. 522).
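The following is a minimal sketch of option 2, intended to run as a single long-lived step on the cluster: it polls an Amazon SQS queue and submits each job it receives directly to Hadoop, so the job submissions do not count against the step limit. The queue URL and the message format (a local JAR path followed by its arguments) are assumptions made for illustration.

#!/bin/bash
# Hypothetical work queue; replace with your own queue URL.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/111122223333/emr-job-queue"

while true; do
  # Fetch at most one message. The receipt handle contains no whitespace, so read
  # it first and let the rest of the line (the message body) fall into BODY.
  read -r RECEIPT BODY <<< "$(aws sqs receive-message \
    --queue-url "$QUEUE_URL" \
    --max-number-of-messages 1 \
    --query 'Messages[0].[ReceiptHandle,Body]' \
    --output text)"

  # The AWS CLI prints "None" when the queue returns no messages.
  if [ -z "$RECEIPT" ] || [ "$RECEIPT" = "None" ]; then
    sleep 60
    continue
  fi

  # The message body is assumed to be "path/to/application.jar arg1 arg2 ...".
  hadoop jar $BODY

  # Delete the message so the job is not submitted again.
  aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
done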


Associate an Elastic IP Address with a Cluster

Topics
· Assign an Elastic IP Address to a Cluster Using the Amazon EMR CLI (p. 526)
· View Allocated Elastic IP Addresses using Amazon EC2 (p. 528)
· Manage Elastic IP Addresses using Amazon EC2 (p. 528)

Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance. You control the addresses associated with your account until you choose to explicitly release them.

You can associate one Elastic IP address with only one cluster at a time. To ensure that our customers are efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated with your account are not mapped to a cluster or EC2 instance. When Elastic IP addresses are mapped to an instance, they are free of charge.

For more information about using IP addresses in AWS, see the Using Elastic IP Addresses section in the Amazon Elastic Compute Cloud User Guide.

To help you manage your resources, you can change the dynamically assigned IP address of the master node of your running cluster to a static Elastic IP address. Elastic IP addresses enable you to quickly remap the dynamically assigned IP address of the cluster's master node to a static IP address. An Elastic IP address is associated with your AWS account, not with a particular cluster. You control your Elastic IP address until you choose to explicitly release it. For more information about Elastic IP addresses, see Using Instance IP Addresses in the Amazon Elastic Compute Cloud User Guide.

By default, the master node of your running cluster is assigned a dynamic IP address that is reachable from the Internet. The dynamic IP address is associated with the master node of your running cluster until it is stopped, terminated, or replaced with an Elastic IP address. When a cluster with an Elastic IP address is stopped or terminated, the Elastic IP address is not released and remains associated with your AWS account.

Assign an Elastic IP Address to a Cluster Using the Amazon EMR CLI

Using the Amazon EMR CLI, you can allocate an Elastic IP address and assign it to either a new or running cluster. Amazon EMR does not support assignment of Elastic IP addresses from the Amazon EMR console, the AWS CLI, or the Amazon EMR API.

After you assign an Elastic IP address to a cluster, it may take one or two minutes before the instance is available from the assigned address.

To assign an Elastic IP address to a new cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· Create a cluster and add the --eip parameter.

For information about how to create a cluster using the Amazon EMR CLI, see Plan an Amazon EMR Cluster (p. 30).

The CLI allocates an Elastic IP address and waits until the Elastic IP address is successfully assigned to the cluster. This assignment can take up to two minutes to complete.


Note
If you want to use a previously allocated Elastic IP address, use the --eip parameter followed by your allocated Elastic IP address. If the allocated Elastic IP address is in use by another cluster, the other cluster loses the Elastic IP address and is assigned a new dynamic IP address.

To assign an Elastic IP address to a running cluster using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

1. If you do not currently have a running cluster, create a cluster.

For information about creating a cluster, see Plan an Amazon EMR Cluster (p. 30).
2. Identify your cluster:

Your cluster must have a public DNS name before you can assign an Elastic IP address. Typically, a cluster is assigned a public DNS name one or two minutes after launching the cluster.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --list

· Windows users:

ruby elastic-mapreduce --list

The output looks similar to the following.

j-SLRI9SCLK7UC STARTING ec2-75-101-168-82.compute-1.amazonaws.com New Job Flow PENDING Streaming Job

The response includes the cluster ID and the public DNS name. You need the cluster ID to perform the next step.
3. Allocate and assign an Elastic IP address to the cluster:

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626). If you assign an Elastic IP address that is currently associated with another cluster, the other cluster is assigned a new dynamic IP address.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce JobFlowId --eip

· Windows users:


ruby elastic-mapreduce JobFlowId --eip

This allocates an Elastic IP address and associates it with the named cluster.
Note
If you want to use a previously allocated Elastic IP address, include your Elastic IP address, Elastic_IP, as follows: In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce JobFlowId --eip Elastic_IP

· Windows users:

ruby elastic-mapreduce JobFlowId --eip Elastic_IP

You have successfully assigned an Elastic IP address to your cluster.

View Allocated Elastic IP Addresses using Amazon EC2

After you have allocated an Elastic IP address, you can reuse it on other clusters. For more information about how to identify your currently allocated IP addresses, see Using Elastic IP Addresses in the Amazon Elastic Compute Cloud User Guide.
Note
By default, each AWS customer has a limit of five Elastic IP addresses that can be associated with their account. If you would like to increase this limit, please submit a Request to Increase Elastic IP Address Limit (https://aws.amazon.com/support/createCase?type=service_limit_increase&serviceLimitIncreaseType=elastic-ips) to increase your maximum number of Elastic IP addresses.

Manage Elastic IP Addresses using Amazon EC2

Amazon EC2 allows you to manage your Elastic IP addresses from the Amazon EC2 console, the Amazon EC2 command line interface, and the Amazon EC2 API.

For more information about using Amazon EC2 to create and manage your Elastic IP addresses, see Using Elastic IP Addresses in the Amazon Elastic Compute Cloud User Guide.
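For example, assuming you have the AWS CLI configured, the following Amazon EC2 commands illustrate the kind of Elastic IP address management the console also provides; the address shown is a placeholder.

# Allocate a new Elastic IP address to your account (use --domain vpc for a VPC address).
aws ec2 allocate-address

# List the Elastic IP addresses currently allocated to your account.
aws ec2 describe-addresses

# Release an address you no longer need (EC2-Classic addresses are released by public IP;
# VPC addresses are released by --allocation-id instead).
aws ec2 release-address --public-ip 203.0.113.25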


Automate Recurring Clusters with AWS Data Pipeline

AWS Data Pipeline is a service that automates the movement and transformation of data. You can use it to schedule moving input data into Amazon S3 and to schedule launching clusters to process that data. For example, consider the case where you have a web server recording traffic logs. If you want to run a weekly cluster to analyze the traffic data, you can use AWS Data Pipeline to schedule those clusters. AWS Data Pipeline manages data-driven workflows, so one task (launching the cluster) can depend on another task (moving the input data to Amazon S3). It also has robust retry functionality.

For more information about AWS Data Pipeline, see the AWS Data Pipeline Developer Guide, especially the tutorials regarding Amazon EMR:

· Tutorial: Launch an Amazon EMR Job Flow
· Getting Started: Process Web Logs with AWS Data Pipeline, Amazon EMR, and Hive
· Tutorial: Amazon DynamoDB Import and Export Using AWS Data Pipeline


Troubleshoot a Cluster

A cluster hosted by Amazon EMR runs in a complex ecosystem made up of several types of open-source software, custom application code, and Amazon Web Services. An issue in any of these parts can cause the cluster to fail or take longer than expected to complete. The following topics will help you figure out what has gone wrong in your cluster and give you suggestions on how to fix it.

Topics
· What Tools are Available for Troubleshooting? (p. 530)
· Troubleshoot a Failed Cluster (p. 532)
· Troubleshoot a Slow Cluster (p. 536)
· Common Errors in Amazon EMR (p. 542)

When you are developing a new Hadoop application, we recommend that you enable debugging and process a small but representative subset of your data to test the application. You may also want to run the application step-by-step to test each step separately. For more information, see Configure Logging and Debugging (Optional) (p. 160) and Step 5: Test the Cluster Step by Step (p. 535).

What Tools are Available for Troubleshooting?

There are several tools you can use to gather information about your cluster to help determine what went wrong. Some require that you initialize them when you launch the cluster; others are available for every cluster.

Topics
· Tools to Display Cluster Details (p. 530)
· Tools to View Log Files (p. 531)
· Tools to Monitor Cluster Performance (p. 531)

Tools to Display Cluster Details

You can use any of the Amazon EMR interfaces (console, CLI or API) to retrieve detailed information about a cluster. For more information, see View Cluster Details (p. 442).
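For example, if you are working from the AWS CLI, the following commands retrieve cluster details; the cluster ID is a placeholder.

# List the clusters in your account, including their states.
aws emr list-clusters

# Show detailed information about a single cluster.
aws emr describe-cluster --cluster-id j-2AXXXXXXGAPLF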


Amazon EMR Console Details Pane

The console displays all of the clusters you've launched in the past two weeks, regardless of whether they are active or terminated. If you click on a cluster, the console displays a details pane with information about that cluster.

Amazon EMR Command Line Interface

You can locate details about a cluster from the CLI using the --describe argument.

Amazon EMR API

You can locate details about a cluster from the API using the DescribeJobFlows action.

Tools to View Log Files

Amazon EMR and Hadoop both generate log files as the cluster runs. You can access these log files from several different tools, depending on the configuration you specified when you launched the cluster. For more information, see Configure Logging and Debugging (Optional) (p. 160).

Log Files on the Master Node

Every cluster publishes log files to the /mnt/var/log/ directory on the master node. These log files are only available while the cluster is running.

Log Files Archived to Amazon S3

If you launch the cluster and specify an Amazon S3 log path, the cluster copies the log files stored in /mnt/var/log/ on the master node to Amazon S3 in 5-minute intervals. This ensures that you have access to the log files even after the cluster is terminated. Because the files are archived in 5-minute intervals, the last few minutes of a suddenly terminated cluster may not be available.

Amazon EMR Console Debugging Tool

The console has a Debug button that you can push to open a GUI interface to browse an archived copy of the log files stored in Amazon S3. For this functionality to be available, you must have specified an Amazon S3 log path and enabled debugging when you launched the cluster.

When you launch a cluster with debugging enabled, Amazon EMR creates an index of those log files. When you click Debug, it provides a graphical interface that you can use to browse the indexed log files.

Tools to Monitor Cluster Performance

Amazon EMR provides several tools to monitor the performance of your cluster.

Hadoop Web Interfaces

Every cluster publishes a set of web interfaces on the master node that contain information about the cluster. You can access these web pages by using an SSH tunnel to connect to them on the master node. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).


CloudWatch Metrics

Every cluster reports metrics to CloudWatch. CloudWatch is a web service that tracks metrics and that you can use to set alarms on those metrics. For more information, see Monitor Metrics with CloudWatch (p. 455).

Ganglia

Ganglia is a cluster monitoring tool. To have this available, you have to install Ganglia on the cluster when you launch it. After you've done so, you can monitor the cluster as it runs by using an SSH tunnel to connect to the Ganglia UI running on the master node. For more information, see Monitor Performance with Ganglia (p. 472).
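For example, one way to install Ganglia at launch is with a bootstrap action. The following sketch assumes the commonly documented install-ganglia bootstrap action path; verify the path for your region and AMI version before relying on it.

aws emr create-cluster --ami-version 2.4.7 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/install-ganglia,Name="Install Ganglia" \
--no-auto-terminate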

Troubleshoot a Failed Cluster

This section walks you through the process of troubleshooting a cluster that has failed. This means that the cluster terminated with an error code. If the cluster is still running, but is taking a long time to return results, see Troubleshoot a Slow Cluster (p. 536) instead.

Topics
· Step 1: Gather Data About the Issue (p. 532)
· Step 2: Check the Environment (p. 533)
· Step 3: Look at the Last State Change (p. 534)
· Step 4: Examine the Log Files (p. 534)
· Step 5: Test the Cluster Step by Step (p. 535)

Step 1: Gather Data About the Issue

The first step in troubleshooting a cluster is to gather information about what went wrong and the current status and configuration of the cluster. This information will be used in the following steps to confirm or rule out possible causes of the issue.

Define the Problem

A clear definition of the problem is the first place to begin. Some questions to ask yourself:

· What did I expect to happen? What happened instead?
· When did this problem first occur? How often has it happened since?
· Has anything changed in how I configure or run my cluster?

Cluster Details

The following cluster details are useful in helping track down issues. For more information on how to gather this information, see View Cluster Details (p. 442).

· Identifier of the cluster. (Also called a job flow identifier.)
· Region and availability zone the cluster was launched into.
· State of the cluster, including details of the last state change.
· AMI version used to launch the cluster. If the version is listed as "latest", you can map this to a version number at AMI Versions Supported in Amazon EMR (p. 60).


· Type and number of EC2 instances specified for the master, core, and task nodes.

Step 2: Check the Environment

Amazon EMR operates as part of an ecosystem of web services and open-source software. Things that affect those dependencies can impact the performance of Amazon EMR.

Topics
· Check for Service Outages (p. 533)
· Check Usage Limits (p. 533)
· Check the AMI Version (p. 533)
· Check the Amazon VPC Subnet Configuration (p. 534)

Check for Service Outages

Amazon EMR uses several Amazon Web Services internally. It runs virtual servers on Amazon EC2, stores data and scripts on Amazon S3, indexes log files in Amazon SimpleDB, and reports metrics to CloudWatch. Events that disrupt these services are rare, but when they occur, they can cause issues in Amazon EMR.

Before you go further, check the Service Health Dashboard. Check the region where you launched your cluster to see whether there are disruption events in any of these services.

Check Usage Limits

If you are launching a large cluster, have launched many clusters simultaneously, or you are an IAM user sharing an AWS account with other users, the cluster may have failed because you exceeded an AWS service limit.

Amazon EC2 limits the number of virtual server instances running on a single AWS region to 20 on-demand or reserved instances. If you launch a cluster with more than 20 nodes, or launch a cluster that causes the total number of EC2 instances active on your AWS account to exceed 20, the cluster will not be able to launch all of the EC2 instances it requires and may fail. When this happens, Amazon EMR returns an EC2 QUOTA EXCEEDED error. You can request that AWS increase the number of EC2 instances that you can run on your account by submitting a Request to Increase Amazon EC2 Instance Limit application.
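To see how close you are to the default limit before launching, you can count your running instances in the region. The following is one way to do it with the AWS CLI; it assumes your default region is the one the cluster uses.

# Count EC2 instances currently in the "running" state.
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text | wc -w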

Another thing that may cause you to exceed your usage limits is the delay between when a cluster is terminated and when it releases all of its resources. Depending on its configuration, it may take up to 5-20 minutes for a cluster to fully terminate and release allocated resources. If you are getting an EC2 QUOTA EXCEEDED error when you attempt to launch a cluster, it may be because resources from a recently terminated cluster may not yet have been released. In this case, you can either request that your Amazon EC2 quota be increased, or you can wait twenty minutes and re-launch the cluster.

Amazon S3 limits the number of buckets created on an account to 100. If your cluster creates a new bucket that exceeds this limit, the bucket creation will fail and may cause the cluster to fail.

Check the AMI Version

Compare the Amazon machine image (AMI) that you used to launch the cluster with the latest Amazon EMR AMI version. Each release of the Amazon EMR AMI includes improvements such as new features, patches, and bug fixes. The issue that is affecting your cluster may have already been fixed in the latest AMI version. If possible, re-run your cluster using the latest AMI version. For more information about the AMI versions supported by Amazon EMR and the changes made in each version, see AMI Versions Supported in Amazon EMR (p. 60).


Check the Amazon VPC Subnet Configuration

If your cluster was launched in an Amazon VPC subnet, the subnet needs to be configured as described in Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166). In addition, check that the subnet you launch the cluster into has enough free elastic IP addresses to assign one to each node in the cluster.

Step 3: Look at the Last State Change

The last state change provides information about what occurred the last time the cluster changed state. This often has information that can tell you what went wrong as a cluster changes state to FAILED. For example, if you launch a streaming cluster and specify an output location that already exists in Amazon S3, the cluster will fail with a last state change of "Streaming output directory already exists".

You can locate the last state change value from the console by viewing the details pane for the cluster, from the CLI using the --describe argument, or from the API using the DescribeJobFlows action. For more information, see View Cluster Details (p. 442).

Step 4: Examine the Log Files

The next step is to examine the log files in order to locate an error code or other indication of the issue that your cluster experienced. For information on the log files available, where to find them, and how to view them, see View Log Files (p. 449).

It may take some investigative work to determine what happened. Hadoop runs the work of the jobs in task attempts on various nodes in the cluster. Amazon EMR can initiate speculative task attempts, terminating the other task attempts that do not complete first. This generates significant activity that is logged to the controller, stderr and syslog log files as it happens. In addition, multiple task attempts are running simultaneously, but a log file can only display results linearly.

Start by checking the bootstrap action logs for errors or unexpected configuration changes during the launch of the cluster. From there, look in the step logs to identify Hadoop jobs launched as part of a step with errors. Examine the Hadoop job logs to identify the failed task attempts. The task attempt log will contain details about what caused a task attempt to fail.
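If your cluster archives logs to Amazon S3, it can be convenient to copy them to your local machine and search them there. The following is a sketch using the AWS CLI; the bucket, log prefix, and cluster ID are placeholders for the log URI you configured when you launched the cluster.

# Download the archived logs for a cluster.
aws s3 cp s3://mybucket/logs/j-2AXXXXXXGAPLF/ ./emr-logs/ --recursive

# Search the downloaded logs for failure notices.
grep -ri "Job Failed" ./emr-logs/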

The following sections describe how to use the various log files to identify errors in your cluster.

Check the Bootstrap Action Logs

Bootstrap actions run scripts on the cluster as it is launched. They are commonly used to install additional software on the cluster or to alter configuration settings from the default values. Checking these logs may provide insight into errors that occurred during set up of the cluster as well as configuration settings changes that could affect performance.

Check the Step Logs

There are four types of step logs.

· controller - Contains files generated by Amazon Elastic MapReduce (Amazon EMR) that arise from errors encountered while trying to run your step. If your step fails while loading, you can find the stack trace in this log. Errors loading or accessing your application are often described here, as are missing mapper file errors.
· stderr - Contains error messages that occurred while processing the step. Application loading errors are often described here. This log sometimes contains a stack trace.
· stdout - Contains status generated by your mapper and reducer executables. Application loading errors are often described here. This log sometimes contains application error messages.


· syslog - Contains logs from non-Amazon software, such as Apache and Hadoop. Streaming errors are often described here.

Check stderr for obvious errors. If stderr displays a short list of errors, the step came to a quick stop with an error thrown. This is most often caused by an error in the mapper and reducer applications being run in the cluster.

Examine the last lines of controller and syslog for notices of errors or failures. Follow any notices about failed tasks, particularly if it says "Job Failed".

Check the Task Attempt Logs

If the previous analysis of the step logs turned up one or more failed tasks, investigate the logs of the corresponding task attempts for more detailed error information.

Check the Hadoop Daemon Logs

In rare cases, Hadoop itself might fail. To see if that is the case, you must look at the Hadoop daemon logs. They are located at /mnt/var/log/hadoop/ on each node or under daemons/ in log files archived to Amazon S3. Note that not all cluster nodes run all daemons.

You can use the JobTracker logs to map a failed task attempt to the node it was run on. Once you know the node associated with the task attempt, you can check the health of the EC2 instance hosting that node to see if there were any issues such as running out of CPU or memory.

Step 5: Test the Cluster Step by Step

A useful technique when you are trying to track down the source of an error is to restart the cluster and submit the steps to it one by one. This lets you check the results of each step before processing the next one, and gives you the opportunity to correct and re-run a step that has failed. This also has the advantage that you only load your input data once.

To test a cluster step by step

1. Launch a new cluster, with both keep alive and termination protection enabled. Keep alive keeps the cluster running after it has processed all of its pending steps. Termination protection prevents a cluster from shutting down in the event of an error. For more information, see Choose the Cluster Lifecycle: Long-Running or Transient (p. 123) and Managing Cluster Termination (p. 503).
2. Submit a step to the cluster. For more information, see Submit Work to a Cluster (p. 517).
3. When the step completes processing, check for errors in the step log files. For more information, see Step 4: Examine the Log Files (p. 534). The fastest way to locate these log files is by connecting to the master node and viewing the log files there. The step log files do not appear until the step runs for some time, finishes, or fails.
4. If the step succeeded without error, run the next step. If there were errors, investigate the error in the log files. If it was an error in your code, make the correction and re-run the step. Continue until all steps run without error.
5. When you are done debugging the cluster, and want to terminate it, you will have to manually terminate it. This is necessary because the cluster was launched with termination protection enabled. For more information, see Managing Cluster Termination (p. 503).


Troubleshoot a Slow Cluster

This section walks you through the process of troubleshooting a cluster that is still running, but is taking a long time to return results. For more information about what to do if the cluster has terminated with an error code, see Troubleshoot a Failed Cluster (p. 532).

Amazon EMR enables you to specify the number and kind of virtual server instances in the cluster. These specifications are the primary means of affecting the speed with which your data processing completes. One thing you might consider is re-running the cluster, this time specifying EC2 instances with greater resources, or specifying a larger number of virtual servers in the cluster. For more information, see Choose the Number and Type of Virtual Servers (p. 33).

The following topics walk you through the process of identifying alternative causes of a slow cluster.

Topics
· Step 1: Gather Data About the Issue (p. 536)
· Step 2: Check the Environment (p. 537)
· Step 3: Examine the Log Files (p. 538)
· Step 4: Check Cluster and Instance Health (p. 539)
· Step 5: Check for Arrested Groups (p. 540)
· Step 6: Review Configuration Settings (p. 540)
· Step 7: Examine Input Data (p. 542)

Step 1: Gather Data About the Issue

The first step in troubleshooting a cluster is to gather information about what went wrong and the current status and configuration of the cluster. This information will be used in the following steps to confirm or rule out possible causes of the issue.

Define the Problem

A clear definition of the problem is the first place to begin. Some questions to ask yourself:

· What did I expect to happen? What happened instead?
· When did this problem first occur? How often has it happened since?
· Has anything changed in how I configure or run my cluster?

Cluster Details

The following cluster details are useful in helping track down issues. For more information on how to gather this information, see View Cluster Details (p. 442).

· Identifier of the cluster. (Also called a job flow identifier.)
· Region and availability zone the cluster was launched into.
· State of the cluster, including details of the last state change.
· AMI version used to launch the cluster. If the version is listed as "latest", you can map this to a version number at AMI Versions Supported in Amazon EMR (p. 60).
· Type and number of EC2 instances specified for the master, core, and task nodes.


Step 2: Check the Environment

Topics
· Check for Service Outages (p. 537)
· Check Usage Limits (p. 537)
· Check the AMI Version (p. 537)
· Check the Amazon VPC Subnet Configuration (p. 538)
· Restart the Cluster (p. 538)

Check for Service Outages

Amazon EMR uses several Amazon Web Services internally. It runs virtual servers on Amazon EC2, stores data and scripts on Amazon S3, indexes log files in Amazon SimpleDB, and reports metrics to CloudWatch. Events that disrupt these services are rare, but when they occur, they can cause issues in Amazon EMR.

Before you go further, check the Service Health Dashboard. Check the region where you launched your cluster to see whether there are disruption events in any of these services.

Check Usage Limits

If you are launching a large cluster, have launched many clusters simultaneously, or you are an IAM user sharing an AWS account with other users, the cluster may have failed because you exceeded an AWS service limit.

Amazon EC2 limits the number of virtual server instances running on a single AWS region to 20 on-demand or reserved instances. If you launch a cluster with more than 20 nodes, or launch a cluster that causes the total number of EC2 instances active on your AWS account to exceed 20, the cluster will not be able to launch all of the EC2 instances it requires and may fail. When this happens, Amazon EMR returns an EC2 QUOTA EXCEEDED error. You can request that AWS increase the number of EC2 instances that you can run on your account by submitting a Request to Increase Amazon EC2 Instance Limit application.

Another thing that may cause you to exceed your usage limits is the delay between when a cluster is terminated and when it releases all of its resources. Depending on its configuration, it may take up to 5-20 minutes for a cluster to fully terminate and release allocated resources. If you are getting an EC2 QUOTA EXCEEDED error when you attempt to launch a cluster, it may be because resources from a recently terminated cluster may not yet have been released. In this case, you can either request that your Amazon EC2 quota be increased, or you can wait twenty minutes and re-launch the cluster.

Amazon S3 limits the number of buckets created on an account to 100. If your cluster creates a new bucket that exceeds this limit, the bucket creation will fail and may cause the cluster to fail.

Check the AMI Version

Compare the Amazon machine image (AMI) that you used to launch the cluster with the latest Amazon EMR AMI version. Each release of the Amazon EMR AMI includes improvements such as new features, patches, and bug fixes. The issue that is affecting your cluster may have already been fixed in the latest AMI version. If possible, re-run your cluster using the latest AMI version. For more information about the AMI versions supported by Amazon EMR and the changes made in each version, see AMI Versions Supported in Amazon EMR (p. 60).


Check the Amazon VPC Subnet Configuration

If your cluster was launched in an Amazon VPC subnet, the subnet needs to be configured as described in Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166). In addition, check that the subnet you launch the cluster into has enough free elastic IP addresses to assign one to each node in the cluster.

Restart the Cluster

The slowdown in processing may be caused by a transient condition. Consider terminating and restarting the cluster to see if performance improves.

Step 3: Examine the Log Files

The next step is to examine the log files in order to locate an error code or other indication of the issue that your cluster experienced. For information on the log files available, where to find them, and how to view them, see View Log Files (p. 449).

It may take some investigative work to determine what happened. Hadoop runs the work of the jobs in task attempts on various nodes in the cluster. Amazon EMR can initiate speculative task attempts, terminating the other task attempts that do not complete first. This generates significant activity that is logged to the controller, stderr and syslog log files as it happens. In addition, multiple task attempts are running simultaneously, but a log file can only display results linearly.

Start by checking the bootstrap action logs for errors or unexpected configuration changes during the launch of the cluster. From there, look in the step logs to identify Hadoop jobs launched as part of a step with errors. Examine the Hadoop job logs to identify the failed task attempts. The task attempt log will contain details about what caused a task attempt to fail.

The following sections describe how to use the various log files to identify errors in your cluster.

Check the Bootstrap Action Logs

Bootstrap actions run scripts on the cluster as it is launched. They are commonly used to install additional software on the cluster or to alter configuration settings from the default values. Checking these logs may provide insight into errors that occurred during set up of the cluster as well as configuration settings changes that could affect performance.

Check the Step Logs

There are four types of step logs.

· controller - Contains files generated by Amazon Elastic MapReduce (Amazon EMR) that arise from errors encountered while trying to run your step. If your step fails while loading, you can find the stack trace in this log. Errors loading or accessing your application are often described here, as are missing mapper file errors.
· stderr - Contains error messages that occurred while processing the step. Application loading errors are often described here. This log sometimes contains a stack trace.
· stdout - Contains status generated by your mapper and reducer executables. Application loading errors are often described here. This log sometimes contains application error messages.
· syslog - Contains logs from non-Amazon software, such as Apache and Hadoop. Streaming errors are often described here.

Check stderr for obvious errors. If stderr displays a short list of errors, the step came to a quick stop with an error thrown. This is most often caused by an error in the mapper and reducer applications being run in the cluster.


Examine the last lines of controller and syslog for notices of errors or failures. Follow any notices about failed tasks, particularly if it says "Job Failed".

Check the Task Attempt Logs

If the previous analysis of the step logs turned up one or more failed tasks, investigate the logs of the corresponding task attempts for more detailed error information.

Check the Hadoop Daemon Logs

In rare cases, Hadoop itself might fail. To see if that is the case, you must look at the Hadoop daemon logs. They are located at /mnt/var/log/hadoop/ on each node or under daemons/ in log files archived to Amazon S3. Note that not all cluster nodes run all daemons.

You can use the JobTracker logs to map a failed task attempt to the node it was run on. Once you know the node associated with the task attempt, you can check the health of the EC2 instance hosting that node to see if there were any issues such as running out of CPU or memory.

Step 4: Check Cluster and Instance Health

An Amazon EMR cluster is made up of nodes running on Amazon EC2 virtual server instances. If those instances become resource-bound (such as running out of CPU or memory), experience network connectivity issues, or are terminated, the speed of cluster processing suffers.

There are up to three types of nodes in a cluster:

· master node - manages the cluster. If it experiences a performance issue, the entire cluster is affected.
· core nodes - process map-reduce tasks and maintain the Hadoop Distributed Filesystem (HDFS). If one of these nodes experiences a performance issue, it can slow down HDFS operations as well as map-reduce processing. You can add additional core nodes to a cluster to improve performance, but cannot remove core nodes. For more information, see Resize a Running Cluster (p. 508).
· task nodes - process map-reduce tasks. These are purely computational resources and do not store data. You can add task nodes to a cluster to speed up performance, or remove task nodes that are not needed. For more information, see Resize a Running Cluster (p. 508).

When you look at the health of a cluster, you should look at both the performance of the cluster overall, as well as the performance of individual virtual server instances. There are several tools you can use:

Check Cluster Health with CloudWatch

Every Amazon EMR cluster reports metrics to CloudWatch. These metrics provide summary performance information about the cluster, such as the total load, HDFS utilization, running tasks, remaining tasks, corrupt blocks, and more. Looking at the CloudWatch metrics gives you the big picture about what is going on with your cluster and can provide insight into what is causing the slow down in processing. In addition to using CloudWatch to analyze an existing performance issue, you can set alarms that cause CloudWatch to alert you if a future performance issue occurs. For more information, see Monitor Metrics with CloudWatch (p. 455).
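You can also pull these metrics from the command line. The following sketch uses the AWS CLI and assumes the IsIdle metric and the JobFlowId dimension; check the CloudWatch metrics list for your cluster before relying on specific metric names. The cluster ID and time range are placeholders.

aws cloudwatch get-metric-statistics \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-2AXXXXXXGAPLF \
  --start-time 2014-05-01T00:00:00 \
  --end-time 2014-05-01T06:00:00 \
  --period 300 \
  --statistics Average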

Check Cluster and Instance Health with Ganglia

The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your cluster, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual nodes in the cluster. For more information about the Ganglia open-source project, go to http://ganglia.info/. To have Ganglia available on your cluster, you have to install it using a bootstrap action when you launch the cluster. For more information about using Ganglia with Amazon EMR, see Monitor Performance with Ganglia (p. 472).

Check Job and HDFS Health with Hadoop Web Interfaces

Hadoop provides a series of web interfaces you can use to view information. For more information about how to access these web interfaces, see View Web Interfaces Hosted on Amazon EMR Clusters (p. 487).

· JobTracker - provides information about the progress of jobs being processed by the cluster. You can use this interface to identify when a job has become stuck.
· HDFS NameNode - provides information about the percentage of HDFS utilization and available space on each node. You can use this interface to identify when HDFS is becoming resource bound and requires additional capacity.
· TaskTracker - provides information about the tasks of the job being processed by the cluster. You can use this interface to identify when a task has become stuck.

Check Instance Health with Amazon EC2

Another way to look for information about the status of the virtual server instances in your cluster is to use the Amazon EC2 console. Because each node in the cluster runs on an EC2 instance, you can use tools provided by Amazon EC2 to check their status. For more information, see View Cluster Instances in Amazon EC2 (p. 454). Step 5: Check for Arrested Groups

An instance group becomes arrested when it encounters too many errors while trying to launch nodes. For example, if new nodes repeatedly fail while performing bootstrap actions, the instance group will, after some time, go into the ARRESTED state rather than continuously attempt to provision new nodes.

A node could fail to come up if:

· Hadoop or the cluster is somehow broken and does not accept a new node into the cluster
· A bootstrap action fails on the new node
· The node is not functioning correctly and fails to check in with Hadoop

If an instance group is in the ARRESTED state, and the cluster is in a WAITING state, you can add a cluster step to reset the desired number of slave nodes. Adding the step resumes processing of the cluster and puts the instance group back into a RUNNING state.

For more information about how to reset a cluster in an arrested state, see Arrested State (p. 512). Step 6: Review Configuration Settings

Configuration settings specify details about how a cluster runs, such as how many times to retry a task and how much memory is available for sorting. When you launch a cluster using Amazon EMR, there are Amazon EMR-specific settings in addition to the standard Hadoop configuration settings. The configuration settings are stored on the master node of the cluster. You can check the configuration settings to ensure that your cluster has the resources it requires to run efficiently.

Amazon EMR defines default Hadoop configuration settings that it uses to launch a cluster. The values are based on the AMI and the virtual server instance type you specify for the cluster. For more information, see Hadoop Configuration Reference (p. 564). You can modify the configuration settings from the default values using a bootstrap action or by specifying new values in job execution parameters. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110) and Configuration of hadoop-user-env.sh (p. 568). To determine whether a bootstrap action changed the configuration settings, check the bootstrap action logs.

Amazon EMR logs the Hadoop settings used to execute each job. The log data is stored in a file named job_job-id_conf.xml under the /mnt/var/log/hadoop/history/ directory of the master node, where job-id is replaced by the identifier of the job. If you've enabled log archiving, this data is copied to Amazon S3 in the logs/date/jobflow-id/jobs folder, where date is the date the job ran, and jobflow-id is the identifier of the cluster.
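For example, to review the settings a particular job actually ran with, you can pull the archived configuration file down from Amazon S3. The bucket, date, cluster ID, and job ID below are placeholders.

# List the archived job configuration files for a cluster.
aws s3 ls s3://mybucket/logs/2014-05-01/j-2AXXXXXXGAPLF/jobs/

# Download one job's configuration file and inspect a setting of interest.
aws s3 cp s3://mybucket/logs/2014-05-01/j-2AXXXXXXGAPLF/jobs/job_201405011234_0001_conf.xml .
grep -A 1 "mapred.reduce.tasks.speculative.execution" job_201405011234_0001_conf.xml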

The following Hadoop job configuration settings are especially useful for investigating performance issues. For more information about the Hadoop configuration settings and how they affect the behavior of Hadoop, go to http://hadoop.apache.org/docs/.

Configuration Setting Description dfs.replication The number of HDFS nodes to which a single block (like the hard drive block) is copied to in order to produce a RAID-like environment. Determines the number of HDFS nodes which contain a copy of the block. io.sort.mb Total memory available for sorting. This value should be 10x io.sort.factor. This setting can also be used to calculate total memory used by task node by figuring io.sort.mb multiplied by mapred.tasktracker.ap.tasks.maximum. io.sort.spill.percent Used during sort, at which point the disk will start to be used because the allotted sort memory is getting full. mapred.child.java.opts Deprecated. Use mapred.map.child.java.opts and mapred.reduce.child.java.opts instead. The Java options TaskTracker uses when launching a JVM for a task to execute within. A common parameter is ª-Xmxº for setting max memory size. mapred.map.child.java.opts The Java options TaskTracker uses when launching a JVM for a map task to execute within. A common parameter is ª-Xmxº for setting max memory heap size. mapred.map.tasks.speculative.execution Determines whether map task attempts of the same task may be launched in parallel. mapred.reduce.tasks.speculative.execution Determines whether reduce task attempts of the same task may be launched in parallel. mapred.map.max.attempts The maximum number of times a map task can be attempted. If all fail, then the map task is marked as failed. mapred.reduce.child.java.opts The Java options TaskTracker uses when launching a JVM for a reduce task to execute within. A common parameter is ª-Xmxº for setting max memory heap size. mapred.reduce.max.attempts The maximum number of times a reduce task can be attempted. If all fail, then the map task is marked as failed. mapred.reduce.slowstart.completed.maps The amount of maps tasks that should complete before reduce tasks are attempted. Not waiting long enough may cause ªToo many fetch-failureº errors in attempts. mapred.reuse.jvm.num.tasks A task runs within a single JVM. Specifies how many tasks may reuse the same JVM.


mapred.tasktracker.map.tasks.maximum The maximum number of map tasks that can execute in parallel on a task node.

mapred.tasktracker.reduce.tasks.maximum The maximum number of reduce tasks that can execute in parallel on a task node.
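As a worked example of the memory estimate mentioned for io.sort.mb, suppose a task node is configured with io.sort.mb set to 200 MB and mapred.tasktracker.map.tasks.maximum set to 8 (both hypothetical values). The sort buffers alone can then consume roughly 200 MB x 8 = 1,600 MB on that node, before accounting for the rest of each task's JVM heap.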

If your cluster tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and by reducing your job tracker heap size.

Step 7: Examine Input Data

Look at your input data. Is it distributed evenly among your key values? If your data is heavily skewed toward one or a few key values, the processing load may be mapped to a small number of nodes while other nodes sit idle. This imbalanced distribution of work can result in slower processing times.

An example of an imbalanced data set would be running a cluster to alphabetize words, but having a data set that contained only words beginning with the letter "a". When the work was mapped out, the node processing values beginning with "a" would be overwhelmed, while nodes processing words beginning with other letters would go idle.

Common Errors in Amazon EMR

There are many reasons why a cluster might fail or be slow in processing data. The following sections list the most common issues and suggestions for fixing them.

Topics
· Input and Output Errors (p. 542)
· Permissions Errors (p. 545)
· Memory Errors (p. 546)
· Resource Errors (p. 547)
· Streaming Cluster Errors (p. 551)
· Custom JAR Cluster Errors (p. 552)
· Hive Cluster Errors (p. 552)
· VPC Errors (p. 553)
· GovCloud-related Errors (p. 554)

Input and Output Errors

The following errors are common in cluster input and output operations.

Topics
· Does your path to Amazon Simple Storage Service (Amazon S3) have at least three slashes? (p. 543)
· Are you trying to recursively traverse input directories? (p. 543)
· Does your output directory already exist? (p. 543)
· Are you trying to specify a resource using an HTTP URL? (p. 543)
· Are you referencing an Amazon S3 bucket using an invalid name format? (p. 543)
· Are you using a custom jar when running DistCp? (p. 543)
· Are you experiencing trouble loading data to or from Amazon S3? (p. 544)

Does your path to Amazon Simple Storage Service (Amazon S3) have at least three slashes?

When you specify an Amazon S3 bucket, you must include a terminating slash at the end of the URL. For example, instead of referencing a bucket as "s3n://myawsbucket", you should use "s3n://myawsbucket/"; otherwise, Hadoop fails your cluster in most cases.

Are you trying to recursively traverse input directories?

Hadoop does not recursively search input directories for files. If you have a directory structure such as /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt, etc. and you specify /corpus/ as the input parameter to your cluster, Hadoop does not find any input files because the /corpus/ directory is empty and Hadoop does not check the contents of the subdirectories. Similarly, Hadoop does not recursively check the subdirectories of Amazon S3 buckets.

The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in subdirectories.

Does your output directory already exist?

If you specify an output path that already exists, Hadoop fails the cluster in most cases. This means that if you run a cluster once and then run it again with exactly the same parameters, it will likely work the first time and then never again; after the first run, the output path exists and causes all successive runs to fail.
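One way to avoid this is to remove the previous run's output before resubmitting the job. The following commands are a sketch only; the bucket, prefix, and HDFS path are hypothetical, and the hadoop fs syntax shown is the Hadoop 1.x form.

aws s3 rm s3://mybucket/output/ --recursive

hadoop fs -rmr hdfs:///output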

Are you trying to specify a resource using an HTTP URL?

Hadoop does not accept resource locations specified using the http:// prefix. You cannot reference a resource using an HTTP URL. For example, passing in http://mysite/myjar.jar as the JAR parameter causes the cluster to fail. For more information about how to reference files in Amazon EMR, see File Systems compatible with Amazon EMR (p. 95).

Are you referencing an Amazon S3 bucket using an invalid name format?

If you attempt to use a bucket name such as "myawsbucket.1" with Amazon EMR, your cluster will fail because Amazon EMR requires that bucket names be valid RFC 2396 host names; the name cannot end with a number. In addition, because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-). For more information about how to format Amazon S3 bucket names, see Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

Are you using a custom jar when running DistCp?

You cannot run DistCp by specifying a JAR residing on the AMI. Instead, use the samples/distcp/distcp.jar file in the elasticmapreduce Amazon S3 bucket. The following example shows how to call the Amazon EMR version of DistCp. Replace j-ABABABABABAB with the identifier of your cluster.

Note
DistCp is deprecated on Amazon EMR; we recommend that you use S3DistCp instead. For more information about S3DistCp, see Distributed Copy Using S3DistCp (p. 389).


In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-ABABABABABAB \
--jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
--arg s3n://elasticmapreduce/samples/wordcount/input \
--arg hdfs:///samples/wordcount/input

· Windows users:

ruby elastic-mapreduce --jobflow j-ABABABABABAB --jar s3n://elasticmapreduce/samples/distcp/distcp.jar --arg s3n://elasticmapreduce/samples/wordcount/input --arg hdfs:///samples/wordcount/input

Are you experiencing trouble loading data to or from Amazon S3?

Amazon S3 is the most popular input and output source for Amazon EMR. A common mistake is to treat Amazon S3 as you would a typical file system. There are differences between Amazon S3 and a file system that you need to take into account when running your cluster.

· Data changes in Amazon S3 may not be available to all clients immediately, depending on the consistency model. Amazon S3 uses two consistency models: eventual and read-after-write. Eventual consistency means that data changes may take a moment to be available to all clients. Read-after-write consistency means that PUTs of new objects are available immediately, but other operations still use eventual consistency. The consistency model that Amazon S3 uses depends on the region you specified when you created the bucket. For more information, see Amazon S3 Concepts. Your application needs to be able to handle the delays associated with eventual consistency.
· If an internal error occurs in Amazon S3, your application needs to handle this gracefully and retry the operation.
· If calls to Amazon S3 take too long to return, your application may need to reduce the frequency at which it calls Amazon S3.
· Listing all the objects in an Amazon S3 bucket is an expensive call. Your application should minimize the number of times it does this.

There are several ways you can improve how your cluster interacts with Amazon S3.

· Launch your cluster using the latest AMI. This version has the latest improvements to how Amazon EMR accesses Amazon S3. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).
· Use S3DistCp to move objects in and out of Amazon S3. S3DistCp implements error handling, retries, and back-offs to match the requirements of Amazon S3. For more information, see Distributed Copy Using S3DistCp (p. 389).
· Design your application with eventual consistency in mind. Use HDFS for intermediate data storage while the cluster is running, and use Amazon S3 only to input the initial data and output the final results.
· If your clusters will commit 200 or more transactions per second to Amazon S3, contact support to prepare your bucket for greater transactions per second and consider using the key partition strategies described in Amazon S3 Performance Tips & Tricks.
· Set the Hadoop configuration setting io.file.buffer.size to 65536. This causes Hadoop to spend less time seeking through Amazon S3 objects. (See the example after this list.)
· Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.
· If you are running a Hive cluster, see Are you having trouble loading data to or from Amazon S3 into Hive? (p. 553).
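For example, the following AWS CLI command is a sketch of applying both of the Hadoop settings mentioned above at cluster launch by using the configure-hadoop bootstrap action. The cluster name, AMI version, and instance configuration are hypothetical, and the -c (core-site) and -m (mapred-site) option flags are assumed here, following the -m usage shown in this guide's other configure-hadoop examples.

aws emr create-cluster --name "S3-tuned cluster" --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Tune S3 access",Args=["-c","io.file.buffer.size=65536","-m","mapred.map.tasks.speculative.execution=false","-m","mapred.reduce.tasks.speculative.execution=false"] \
--no-auto-terminate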

For additional information, see Amazon S3 Error Best Practices in the Amazon Simple Storage Service Developer Guide.

Permissions Errors

The following errors are common when using permissions or credentials.

Topics
· Are you passing the correct credentials into SSH? (p. 545)
· If you are using IAM, do you have the proper Amazon EC2 policies set? (p. 546)

Are you passing the correct credentials into SSH?

If you are unable to use SSH to connect to the master node, it is most likely an issue with your security credentials.

First, check that the .pem file containing your SSH key has the proper permissions. You can use chmod to change the permissions on your .pem file as shown in the following example, where you would replace mykey.pem with the name of your own .pem file.

chmod og-rwx mykey.pem

The second possibility is that you are not using the keypair you specified when you created the cluster. This is easy to do if you have created multiple key pairs. Check the cluster details in the Amazon EMR console (or use the --describe option in the CLI) for the name of the keypair that was specified when the cluster was created.

After you have verified that you are using the correct key pair and that permissions are set correctly on the .pem file, you can use the following command to connect to the master node with SSH, where you would replace mykey.pem with the name of your .pem file and master-public-DNS-name with the public DNS name of the master node (available through the --describe option in the CLI or through the Amazon EMR console).

Important
You must use the login name hadoop when you connect to an Amazon EMR cluster node; otherwise, an error similar to "Server refused our key" may occur.

ssh -i mykey.pem hadoop@master-public-DNS-name

For more information, see Connect to the Master Node Using SSH (p. 481).


If you are using IAM, do you have the proper Amazon EC2 policies set?

Because Amazon EMR uses EC2 instances as nodes, IAM users of Amazon EMR also need to have certain Amazon EC2 policies set in order for Amazon EMR to be able to manage those instances on the IAM user's behalf. If you do not have the required permissions set, Amazon EMR returns the error: "User account is not authorized to call EC2."

For more information about the Amazon EC2 policies your IAM account needs to set to run Amazon EMR, see Set Access Policies for IAM Users (p. 147).

Memory Errors

Memory tuning issues may appear as one or more of the following symptoms, depending on the circumstances:

· The job gets stuck in the mapper or reducer phase.
· The job hangs for long periods of time or finishes much later than expected.
· The job completely fails or terminates.

These symptoms are typical when the master node or the slave nodes run out of memory.

Limit the number of simultaneous map/reduce tasks

· In Amazon EMR, a JVM runs for each slot, and thus a high number of map slots requires a large amount of memory. The following bootstrap action arguments control the number of maps/reduces that run simultaneously on a TaskTracker and, as a result, control the overall amount of memory consumed:

--args -s,mapred.tasktracker.map.tasks.maximum=maximum number of simultaneous map tasks
--args -s,mapred.tasktracker.reduce.tasks.maximum=maximum number of simultaneous reduce tasks
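For example, the following Amazon EMR CLI command is a sketch of launching a cluster with these bootstrap action arguments. The cluster name, instance settings, and the limits of 2 map tasks and 1 reduce task are hypothetical; substitute values appropriate for your instance type.

./elastic-mapreduce --create --alive --name "Reduced task slots" \
--num-instances 3 --instance-type m1.xlarge \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args -s,mapred.tasktracker.map.tasks.maximum=2 \
--args -s,mapred.tasktracker.reduce.tasks.maximum=1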

Increase the JVM heap size and task timeout

· Avoid memory issues by tuning the JVM heap size for Hadoop processes, because even a large instance type such as m1.xlarge can run out of memory using the default settings. For example, an instance type of m1.xlarge assigns 1 GB of memory per task by default. However, if you decrease the number of tasks to one per node, you can increase the memory allocated to the heap with the following bootstrap action:

s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args -m,mapred.child.java.opts=-Xmxamount of memory in MB

Even after tuning the heap size, it can be useful to increase the mapred.task.timeout setting to make Hadoop wait longer before it begins terminating processes.
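The following is a sketch of combining both settings in a single configure-hadoop bootstrap action; the 2048 MB heap size and the 30-minute (1,800,000 ms) timeout are hypothetical values:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.child.java.opts=-Xmx2048m,-m,mapred.task.timeout=1800000"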

Increase NameNode heap size

· There are times when the NameNode can run out of memory as well. The NameNode keeps all of the metadata for the HDFS file system in memory. The larger the HDFS file system is, in terms of blocks and files, the more memory the NameNode needs. Also, if DataNode availability is not consistent, the NameNode begins to queue up blocks that are in invalid states. It is rare that NameNodes get out-of-memory errors, but it is something to watch for if you have a very large HDFS file system or an HDFS directory with a large number of entries. The NameNode JVM heap size can be increased with the following bootstrap action:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=amount of memory in MB

Resource Errors

The following errors are commonly caused by constrained resources on the cluster.

Topics
· Do you have enough HDFS space for your cluster? (p. 547)
· Are you seeing "EC2 Quota Exceeded" errors? (p. 547)
· Are you seeing "Too many fetch-failures" errors? (p. 548)
· Are you seeing "File could only be replicated to 0 nodes instead of 1" errors? (p. 549)
· Are your TaskTracker nodes being blacklisted? (p. 549)

Do you have enough HDFS space for your cluster?

If you do not, Amazon EMR returns the following error: "Cannot replicate block, only managed to replicate to zero nodes." This error occurs when you generate more data in your cluster than can be stored in HDFS. You will see this error only while the cluster is running, because when the cluster ends it releases the HDFS space it was using.

The amount of HDFS space available to a cluster depends on the number and type of Amazon EC2 instances that are used as core nodes. All of the disk space on each Amazon EC2 instance is available to be used by HDFS. For more information about the amount of local storage for each EC2 instance type, see Instance Types and Families in the Amazon EC2 User Guide for Linux Instances.

The other factor that can affect the amount of HDFS space available is the replication factor, which is the number of copies of each data block that are stored in HDFS for redundancy. The replication factor increases with the number of nodes in the cluster: there are 3 copies of each data block for a cluster with 10 or more nodes, 2 copies of each block for a cluster with 4 to 9 nodes, and 1 copy (no redundancy) for clusters with 3 or fewer nodes. The total HDFS space available is divided by the replication factor. In some cases, such as increasing the number of nodes from 9 to 10, the increase in replication factor can actually cause the amount of available HDFS space to decrease.

For example, a cluster with ten core nodes of type m1.large would have 2833 GB of space available to HDFS ((10 nodes X 850 GB per node)/replication factor of 3).

If your cluster exceeds the amount of space available to HDFS, you can add additional core nodes to your cluster or use data compression to create more HDFS space. If your cluster is one that can be stopped and restarted, you may consider using core nodes of a larger Amazon EC2 instance type. You might also consider adjusting the replication factor. Be aware, though, that decreasing the replication factor reduces the redundancy of HDFS data and your cluster's ability to recover from lost or corrupted HDFS blocks.

Are you seeing "EC2 Quota Exceeded" errors?

If you are getting messages that you are exceeding your Amazon EC2 instance quota, this may be for one of several reasons. Depending on configuration differences, it may take up to 5-20 minutes for previous clusters to terminate and release allocated resources. If you are getting an EC2 QUOTA EXCEEDED error when you attempt to launch a cluster, it may be because resources from a recently terminated cluster have not yet been released. Furthermore, if you attempt to resize an instance group, you may also encounter this error when your new target size is greater than the current instance quota for the account. In these cases, you can terminate and relaunch the cluster with a smaller target size. In all cases, you can terminate unused cluster resources or EC2 instances, request that your Amazon EC2 quota be increased, or wait to re-launch a cluster.

Note
You cannot issue more than one resize request to the same cluster. Therefore, if your first request fails, you may have to terminate your current cluster and launch another cluster with the desired number of instances.

Are you seeing "Too many fetch-failures" errors?

The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.

There are several reasons the output may not be available:

· The prerequisite task is still processing. This is often a map task.
· The data may be unavailable due to poor network connectivity if the data is located on a different virtual server.
· If HDFS is used to retrieve the output, there may be an issue with HDFS.

The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.

One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.

If you are seeing this issue, there are several things you can try:

· Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time (see the example after this list). For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).
· Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
· Use a combiner class to minimize the amount of output that needs to be fetched.
· Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
· Review the CPU and memory resources of the virtual servers in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Choose the Number and Type of Virtual Servers (p. 33).
· Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.

Jetty is an open-source HTTP server that is used for machine to machine communications within a Hadoop cluster.
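For example, the following AWS CLI command is a sketch of raising mapred.reduce.slowstart.completed.maps so that reducers do not start until 90 percent of the map tasks have completed. The cluster name, instance configuration, and the value 0.90 are hypothetical.

aws emr create-cluster --name "Delayed reduce start" --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Delay reducers",Args=["-m","mapred.reduce.slowstart.completed.maps=0.90"] \
--no-auto-terminate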

Are you seeing "File could only be replicated to 0 nodes in- stead of 1" errors?

When a file is written to HDFS, it is replicated to multiple core nodes. When you see this error, it means that the NameNode daemon does not have any available DataNode instances to write data to in HDFS. In other words, block replication is not taking place. This error can be caused by a number of issues:

· The HDFS filesystem may have run out of space. This is the most likely cause.
· DataNode instances may not have been available when the job was run.
· DataNode instances may have been blocked from communication with the master node.
· Instances in the core instance group might not be available.
· Permissions may be missing. For example, the JobTracker daemon may not have permissions to create job tracker information.
· The reserved space setting for a DataNode instance may be insufficient. Check whether this is the case by checking the dfs.datanode.du.reserved configuration setting.

To check whether this issue is caused by HDFS running out of disk space, look at the HDFSUtilization metric in CloudWatch. If this value is too high, you can add additional core nodes to the cluster. Keep in mind that you can only add core nodes to a cluster; you cannot remove them. If you have a cluster that you think might run out of HDFS disk space, you can set an alarm in CloudWatch to alert you when the value of HDFSUtilization rises above a certain level. For more information, see Resize a Running Cluster (p. 508) and Monitor Metrics with CloudWatch (p. 455).
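The following command is a sketch of creating such an alarm with the Amazon CloudWatch command line tools; the cluster ID, the 90 percent threshold, and the Amazon SNS topic ARN that receives the alarm are all hypothetical values.

mon-put-metric-alarm --alarm-name "EMR-HDFS-Utilization" --metric-name HDFSUtilization \
--namespace AWS/ElasticMapReduce --dimensions "JobFlowId=j-ABABABABABAB" \
--statistic Average --period 300 --evaluation-periods 1 \
--threshold 90 --comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:hdfs-alerts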

If HDFS running out of space was not the issue, check the DataNode logs, the NameNode logs, and network connectivity for other issues that could have prevented HDFS from replicating data. For more information, see View Log Files (p. 449).

Are your TaskTracker nodes being blacklisted?

A TaskTracker node is a node in the cluster that accepts map and reduce tasks. These are assigned by a JobTracker daemon. The JobTracker monitors the TaskTracker node through a heartbeat.

There are a couple of situations in which the JobTracker daemon blacklists a TaskTracker node, removing it from the pool of nodes available to process tasks:

· If the TaskTracker node has not sent a heartbeat to the JobTracker daemon in the past 10 minutes (600,000 milliseconds). This time period can be configured using the mapred.tasktracker.expiry.interval configuration setting. For more information about changing Hadoop configuration settings, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).
· If the TaskTracker node has more than 4 failed tasks. You can change this to a higher value by modifying the mapred.max.tracker.failures configuration parameter. Other configuration settings you might want to change are the settings that control how many times to attempt a task before marking it as failed: mapred.map.max.attempts for map tasks and mapreduce.reduce.maxattempts for reduce tasks. For more information about changing Hadoop configuration settings, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).


You can use the CloudWatch CLI to view the number of blacklisted TaskTracker nodes. The command for doing so is shown in the following example. For more information, see Amazon CloudWatch Command Line Interface Reference.

mon-get-stats NoOfBlackListedTaskTrackers --dimensions JobFlowId=JobFlowID --statistics Maximum --namespace AWS/ElasticMapReduce

The following example shows how to launch a cluster and use a bootstrap action to set the value of mapred.max.tracker.failures to 7, instead of the default 4.

AWS CLI

Type the following command using the AWS CLI:

aws emr create-cluster --name string --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--bootstrap-action Path=string,Name=string,Args=[arg1,arg2] \
--no-auto-terminate

For example:

aws emr create-cluster --name "Modified mapred.max.tracker.failures" --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Modified mapred.max.tracker.failures",Args=["-m","mapred.max.tracker.failures=7"] \
--no-auto-terminate

Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Modified mapred.max.tracker.fail ures" \ --num-instances 2 --slave-instance-type m1.large --master-instance-type m1.large \ --key-pair mykeypair --debug --log-uri s3://mybucket/logs \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --bootstrap-name "Modified mapred.max.tracker.failures" \ --args "-m,mapred.max.tracker.failures=7"

· Windows users:


ruby elastic-mapreduce --create --alive --name "Modified mapred.max.tracker.failures" --num-instances 2 --slave-instance-type m1.large --master-instance-type m1.large --key-pair mykeypair --debug --log-uri s3://mybucket/logs --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Modified mapred.max.tracker.failures" --args "-m,mapred.max.tracker.failures=7"

When you launch a cluster using the preceding example, you can connect to the master node and see the changed configuration setting in /home/hadoop/conf/mapred-site.xml. The modified line will appear as shown in the following example.

<property><name>mapred.max.tracker.failures</name><value>7</value></property>

Streaming Cluster Errors

You can usually find the cause of a streaming error in a syslog file, which you can link to from the Steps pane.

The following errors are common to streaming clusters.

Topics
· Is data being sent to the mapper in the wrong format? (p. 551)
· Is your script timing out? (p. 551)
· Are you passing in invalid streaming arguments? (p. 551)
· Did your script exit with an error? (p. 552)

Is data being sent to the mapper in the wrong format?

To check if this is the case, look for an error message in the syslog file of a failed task attempt in the task attempt logs. For more information, see View Log Files (p. 449).

Is your script timing out?

The default timeout for a mapper or reducer script is 600 seconds. If your script takes longer than this, the task attempt will fail. You can verify this is the case by checking the syslog file of a failed task attempt in the task attempt logs. For more information, see View Log Files (p. 449).

You can change the time limit by setting a new value for the mapred.task.timeout configuration setting. This setting specifies the number of milliseconds after which Amazon EMR will terminate a task that has not read input, written output, or updated its status string. You can update this value by passing an additional streaming argument -jobconf mapred.task.timeout=800000.
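For example, the following is a sketch of running the Hadoop streaming utility on the master node with the longer timeout. The bucket, mapper, and reducer locations are hypothetical, and the path to the streaming JAR shown is typical for Hadoop 1.x AMIs and may differ on your AMI version.

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
-input s3://mybucket/input -output s3://mybucket/output \
-mapper s3://mybucket/mapper.py -reducer s3://mybucket/reducer.py \
-jobconf mapred.task.timeout=800000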

Are you passing in invalid streaming arguments?

Hadoop streaming supports only the following arguments. If you pass in arguments other than those listed below, the cluster will fail.

-blockAutoGenerateCacheFiles
-cacheArchive
-cacheFile
-cmdenv
-combiner
-debug
-input
-inputformat
-inputreader
-jobconf
-mapper
-numReduceTasks
-output
-outputformat
-partitioner
-reducer
-verbose

In addition, Hadoop streaming only recognizes arguments passed in using Java syntax; that is, preceded by a single hyphen. If you pass in arguments preceded by a double hyphen, the cluster will fail.

Did your script exit with an error?

If your mapper or reducer script exits with an error, you can locate the error in the stderr file of the task attempt logs of the failed task attempt. For more information, see View Log Files (p. 449).

Custom JAR Cluster Errors

The following errors are common to custom JAR clusters.

Topics
· Is your JAR throwing an exception before creating a job? (p. 552)
· Is your JAR throwing an error inside a map task? (p. 552)

Is your JAR throwing an exception before creating a job?

If the main program of your custom JAR throws an exception while creating the Hadoop job, the best place to look is the syslog file of the step logs. For more information, see View Log Files (p. 449).

Is your JAR throwing an error inside a map task?

If your custom JAR and mapper throw an exception while processing input data, the best place to look is the syslog file of the task attempt logs. For more information, see View Log Files (p. 449).

Hive Cluster Errors

You can usually find the cause of a Hive error in the syslog file, which you link to from the Steps pane. If you can't determine the problem there, check the Hadoop task attempt error message, which you link to from the Task Attempts pane.

The following errors are common to Hive clusters.

Topics
· Are you using the latest version of Hive? (p. 553)
· Did you encounter a syntax error in the Hive script? (p. 553)
· Did a job fail when running interactively? (p. 553)
· Are you having trouble loading data to or from Amazon S3 into Hive? (p. 553)

Are you using the latest version of Hive?

The latest version of Hive has all the current patches and bug fixes and may resolve your issue. For more information, see Supported Hive Versions (p. 247).

Did you encounter a syntax error in the Hive script?

If a step fails, look at the stdout file of the logs for the step that ran the Hive script. If the error is not there, look at the syslog file of the task attempt logs for the task attempt that failed. For more information, see View Log Files (p. 449).

Did a job fail when running interactively?

If you are running Hive interactively on the master node and the cluster failed, look at the syslog entries in the task attempt log for the failed task attempt. For more information, see View Log Files (p. 449).

Are you having trouble loading data to or from Amazon S3 into Hive?

If you are having trouble accessing data in Amazon S3, first check the possible causes listed in Are you experiencing trouble loading data to or from Amazon S3? (p. 544). If none of those issues are the cause, consider the following options specific to Hive.

· Make sure you are using the latest version of Hive. The latest version of Hive has all the current patches and bug fixes and may resolve your issue. For more information, see Supported Hive Versions (p. 247).
· Using INSERT OVERWRITE requires listing the contents of the Amazon S3 bucket or folder. This is an expensive operation. If possible, manually prune the path instead of having Hive list and delete the existing objects.
· Pre-cache the results of an Amazon S3 list operation locally on the cluster. You do this in HiveQL with the following command: set hive.optimize.s3.query=true;.
· Use static partitions where possible.

VPC Errors

The following errors are common to VPC configuration in Amazon EMR.

Topics
· Invalid Subnet Configuration (p. 553)
· Missing DHCP Options Set (p. 554)

Invalid Subnet Configuration

On the Cluster Details page, in the Status field, you see an error similar to the following:

The subnet configuration was invalid: Cannot find route to InternetGateway in main RouteTable rtb-id for vpc vpc-id.


To solve this problem, you must create an Internet Gateway and attach it to your VPC. For more information, see Adding an Internet Gateway to Your VPC.

Alternatively, verify that you have configured your VPC with Enable DNS resolution and Enable DNS hostname support enabled. For more information, see Using DNS with Your VPC.

Missing DHCP Options Set

You see a step failure in the cluster system log (syslog) with an error similar to the following:

ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_id' doesn't exist in RM.

or

ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_id' doesn't exist in RM.

To solve this problem, you must configure a VPC that includes a DHCP Options Set whose parameters are set to the following values:

Note
If you use the AWS GovCloud (US) Region, set domain-name to us-gov-west-1.compute.internal instead of the value used in the following example.

· domain-name = ec2.internal

Use ec2.internal if your region is us-east-1. For other regions, use region-name.compute.internal. For example, in us-west-2, use domain-name=us-west-2.compute.internal.

· domain-name-servers = AmazonProvidedDNS

For more information, see DHCP Options Sets.

GovCloud-related Errors

The AWS GovCloud (US) Region differs from other regions in its security, configuration, and default settings. As a result, use the following checklist to troubleshoot Amazon EMR errors that are specific to the AWS GovCloud (US) Region before using more general troubleshooting recommendations.

· Verify that your IAM roles are correctly configured. For more information, see Configure IAM Roles for Amazon EMR (p. 150).
· Ensure that your VPC configuration has correctly configured DNS resolution/hostname support, Internet Gateway, and DHCP Option Set parameters. For more information, see VPC Errors (p. 553).

If these steps do not solve the problem, continue with the steps for troubleshooting common Amazon EMR errors. For more information, see Common Errors in Amazon EMR (p. 542).


Write Applications that Launch and Manage Clusters

Topics
· End-to-End Amazon EMR Java Source Code Sample (p. 555)
· Common Concepts for API Calls (p. 559)
· Use SDKs to Call Amazon EMR APIs (p. 561)

You can access the functionality provided by the Amazon EMR API by calling wrapper functions in one of the AWS SDKs. The AWS SDKs provide language-specific functions that wrap the web service's API and simplify connecting to the web service, handling many of the connection details for you. For more information about calling Amazon EMR using one of the SDKs, see Use SDKs to Call Amazon EMR APIs (p. 561).

Important
The maximum request rate for Amazon EMR is one request every ten seconds.

End-to-End Amazon EMR Java Source Code Sample

Developers can call the Amazon EMR API using custom Java code to do the same things possible with the Amazon EMR console or CLI. This section provides the end-to-end steps necessary to install the AWS Toolkit for Eclipse and run a fully functional Java source code sample that adds steps to an Amazon EMR cluster.

Note
This example focuses on Java, but Amazon EMR also supports several programming languages with a collection of Amazon EMR SDKs. For more information, see Use SDKs to Call Amazon EMR APIs (p. 561).

This Java source code example demonstrates how to perform the following tasks using the Amazon EMR API:

· Retrieve AWS credentials and send them to Amazon EMR to make API calls
· Configure a new custom step and a new predefined step
· Add new steps to an existing Amazon EMR cluster
· Retrieve cluster step IDs from a running cluster

Note
This sample demonstrates how to add steps to an existing cluster and thus requires that you have an active cluster on your account. To create a simple cluster to run the sample on, see Launch the Cluster (p. 15).

Before you begin, install a version of the Eclipse IDE for Java EE Developers that matches your computer platform. For more information, go to Eclipse Downloads.

Next, install the Database Development plug-in for Eclipse.

To install the Database Development Eclipse plug-in

1. Open the Eclipse IDE.
2. Click Help and click Install New Software.
3. In the Work with: field, type http://download.eclipse.org/releases/kepler or the path that matches the version number of your Eclipse IDE.
4. In the items list, click Database Development and click Finish.
5. Restart Eclipse when prompted.

Next, install the AWS Toolkit for Eclipse to make the helpful, pre-configured source code project templates available.

To install the AWS Toolkit for Eclipse

1. Open the Eclipse IDE.
2. Click Help and click Install New Software.
3. In the Work with: field, type http://aws.amazon.com/eclipse.
4. In the items list, click AWS Toolkit for Eclipse and click Finish.
5. Restart Eclipse when prompted.

Next, create a new AWS Java project and run the sample Java source code.

To create a new AWS Java project

1. Open the Eclipse IDE.
2. Click File, click New, and click Other.
3. In the Select a wizard dialog, choose AWS Java Project, and click Next.


4. In the New AWS Java Project dialog, in the Project name: field, enter the name of your new project, for example EMR-sample-code.
5. Click Configure AWS accounts..., enter your public and private access keys, and click Finish. For more information about creating access keys, see How Do I Get Security Credentials? in the Amazon Web Services General Reference.
Note
You should not embed access keys directly in code. The Amazon EMR SDK allows you to put access keys in known locations so that you do not have to keep them in code.
6. In the new Java project, right-click the src folder, click New, and click Class.

7. In the Java Class dialog, in the Name field, enter a name for your new class, for example main.


8. In the Which method stubs would you like to create? section, choose public static void main(String[] args) and click Finish.
9. Enter the Java source code inside your new class and add the appropriate import statements for the classes and methods in the sample. For your convenience, the full source code listing is shown below.
Note
In the following sample code, replace the example cluster ID (j-1HTE8WKS7SODR) with a valid cluster ID in your account. In addition, replace the example Amazon S3 path (s3://mybucket/my-jar-location1) with the valid path to your JAR. Lastly, replace the example class name (com.my.Main1) with the correct name of the class in your JAR, if applicable.

import java.io.IOException;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

public class main {

    public static void main(String[] args) {

        AWSCredentials credentials = null;
        try {
            credentials = new PropertiesCredentials(
                main.class.getResourceAsStream("AwsCredentials.properties"));
        } catch (IOException e1) {
            System.out.println("Credentials were not properly entered into AwsCredentials.properties.");
            System.out.println(e1.getMessage());
            System.exit(-1);
        }

        AmazonElasticMapReduce client = new AmazonElasticMapReduceClient(credentials);

        // Predefined steps. See StepFactory for the list of predefined steps.
        StepConfig hive = new StepConfig("Hive", new StepFactory().newInstallHiveStep());

        // A custom step
        HadoopJarStepConfig hadoopConfig1 = new HadoopJarStepConfig()
            .withJar("s3://mybucket/my-jar-location1")
            .withMainClass("com.my.Main1") // optional main class; this can be omitted if the jar above has a manifest
            .withArgs("--verbose");        // optional list of arguments
        StepConfig customStep = new StepConfig("Step1", hadoopConfig1);

        AddJobFlowStepsResult result = client.addJobFlowSteps(new AddJobFlowStepsRequest()
            .withJobFlowId("j-1HTE8WKS7SODR")
            .withSteps(hive, customStep));
        System.out.println(result.getStepIds());
    }
}

10. Click Run, Run As, and click Java Application.
11. If the sample runs correctly, a list of IDs for the new steps appears in the Eclipse IDE console window. The correct output is similar to the following:

[s-39BLQZRJB2E5E, s-1L6A4ZU2SAURC]

Common Concepts for API Calls

Topics
· Endpoints for Amazon EMR (p. 559)
· Specifying Cluster Parameters in Amazon EMR (p. 559)
· Availability Zones in Amazon EMR (p. 560)
· How to Use Additional Files and Libraries in Amazon EMR Clusters (p. 560)
· Amazon EMR Sample Applications (p. 560)

When you write an application that calls the Amazon Elastic MapReduce (Amazon EMR) API, there are several concepts that apply when calling one of the wrapper functions of an SDK.

Endpoints for Amazon EMR

An endpoint is a URL that is the entry point for a web service. Every web service request must contain an endpoint. The endpoint specifies the AWS region where clusters are created, described, or terminated. It has the form elasticmapreduce.regionname.amazonaws.com. If you specify the general endpoint (elasticmapreduce.amazonaws.com), Amazon EMR directs your request to an endpoint in the default region. For accounts created on or after March 8, 2013, the default region is us-west-2; for older accounts, the default region is us-east-1.
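For example, with the AWS CLI you can direct a request to a specific region's endpoint with the global --region option, or point at an explicit endpoint URL with --endpoint-url; the us-west-2 region shown here is just an illustrative choice:

aws emr list-clusters --region us-west-2

aws emr list-clusters --endpoint-url https://elasticmapreduce.us-west-2.amazonaws.com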

For more information about the endpoints for Amazon EMR, see Regions and Endpoints in the Amazon Web Services General Reference.

Specifying Cluster Parameters in Amazon EMR

The Instances parameters enable you to configure the type and number of EC2 instances to create nodes to process the data. Hadoop spreads the processing of the data across multiple cluster nodes. The master node is responsible for keeping track of the health of the core and task nodes and polling the nodes for job result status. The core and task nodes do the actual processing of the data. If you have a single-node cluster, the node serves as both the master and a core node.

The KeepJobAlive parameter in a RunJobFlow request determines whether to terminate the cluster when it runs out of cluster steps to execute. Set this value to False when you know that the cluster is running as expected. When you are troubleshooting the job flow and adding steps while the cluster execution is suspended, set the value to True. This reduces the amount of time and expense of uploading the results to Amazon Simple Storage Service (Amazon S3), only to repeat the process after modifying a step to restart the cluster.


If KeepJobAlive is true, after successfully getting the cluster to complete its work, you must send a TerminateJobFlows request or the cluster continues to run and generate AWS charges.

For more information about parameters that are unique to RunJobFlow, see RunJobFlow. For more information about the generic parameters in the request, see Common Request Parameters.

Availability Zones in Amazon EMR

Amazon EMR uses EC2 instances as nodes to process clusters. These EC2 instances have locations composed of Availability Zones and regions. Regions are dispersed and located in separate geographic areas. Availability Zones are distinct locations within a region insulated from failures in other Availability Zones. Each Availability Zone provides inexpensive, low-latency network connectivity to other Availability Zones in the same region. For a list of the regions and endpoints for Amazon EMR, see Regions and Endpoints in the Amazon Web Services General Reference.

The AvailabilityZone parameter specifies the general location of the cluster. This parameter is optional and, in general, we discourage its use. When AvailabilityZone is not specified, Amazon EMR automatically picks the best AvailabilityZone value for the cluster. You might find this parameter useful if you want to colocate your instances with other existing running instances, and your cluster needs to read or write data from those instances. For more information, see the Amazon EC2 User Guide for Linux Instances.

How to Use Additional Files and Libraries in Amazon EMR Clusters

There are times when you might like to use additional files or custom libraries with your mapper or reducer applications. For example, you might like to use a library that converts a PDF file into plain text.

To cache a file for the mapper or reducer to use when using Hadoop streaming

· In the JAR args field, add the following argument:

-cacheFile s3n://bucket/path_to_executable#local_path

The file, local_path, is placed in the working directory of the mapper, which can then reference the file.
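For example, a streaming step's argument list might cache a hypothetical PDF-to-text helper script alongside hypothetical input, output, mapper, and reducer locations:

-input s3://mybucket/input -output s3://mybucket/output \
-mapper mapper.py -reducer reducer.py \
-cacheFile s3n://mybucket/lib/pdf2text.py#pdf2text.py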

Amazon EMR Sample Applications

AWS provides tutorials that show you how to create complete applications, including:

· Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances
· Parsing Logs with Apache Pig and Elastic MapReduce
· Finding Similar Items with Amazon EMR, Python, and Hadoop Streaming
· ItemSimilarity
· Word Count Example

For more Amazon EMR code examples, go to Sample Code & Libraries.


Use SDKs to Call Amazon EMR APIs

Topics
· Using the AWS SDK for Java to Create an Amazon EMR Cluster (p. 561)
· Using the AWS SDK for .Net to Create an Amazon EMR Cluster (p. 562)
· Using the Java SDK to Sign an API Request (p. 563)

The AWS SDKs provide functions that wrap the API and take care of many of the connection details, such as calculating signatures, handling request retries, and error handling. The SDKs also contain sample code, tutorials, and other resources to help you get started writing applications that call AWS. Calling the wrapper functions in an SDK can greatly simplify the process of writing an AWS application.

For more information about how to download and use the AWS SDKs, go to Sample Code & Libraries. SDKs are currently available for the following languages/platforms:

· Java
· Node.js
· PHP
· Python
· Ruby
· Windows and .NET

Using the AWS SDK for Java to Create an Amazon EMR Cluster

The AWS SDK for Java provides three packages with Amazon Elastic MapReduce (Amazon EMR) functionality:

· com.amazonaws.services.elasticmapreduce
· com.amazonaws.services.elasticmapreduce.model
· com.amazonaws.services.elasticmapreduce.util

For more information about these packages, go to the AWS SDK for Java API Reference.

The following example illustrates how the SDKs can simplify programming with Amazon EMR. The code sample below uses the StepFactory object, a helper class for creating common Amazon EMR step types, to create an interactive Hive cluster with debugging enabled.

Note
If you are adding IAM user visibility to a new cluster, call RunJobFlow and set VisibleToAllUsers=true; otherwise, IAM users cannot view the cluster.

AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

StepFactory stepFactory = new StepFactory();

StepConfig enabledebugging = new StepConfig()
    .withName("Enable debugging")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

StepConfig installHive = new StepConfig()
    .withName("Install Hive")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newInstallHiveStep());

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Hive Interactive")
    .withSteps(enabledebugging, installHive)
    .withLogUri("s3://myawsbucket/")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("keypair")
        .withHadoopVersion("0.20")
        .withInstanceCount(5)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small"));

RunJobFlowResult result = emr.runJobFlow(request);

Using the AWS SDK for .Net to Create an Amazon EMR Cluster

The following example illustrates how the SDKs can simplify programming with Amazon EMR. The code sample below uses the StepFactory object, a helper class for creating common Amazon EMR step types, to create an interactive Hive cluster with debugging enabled.

Note
If you are adding IAM user visibility to a new cluster, call RunJobFlow and set VisibleToAllUsers=true; otherwise, IAM users cannot view the cluster.

var emrClient = AWSClientFactory.CreateAmazonElasticMapReduceClient(RegionEndpoint.USWest2);
var stepFactory = new StepFactory();

var enabledebugging = new StepConfig{
    Name = "Enable debugging",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = stepFactory.NewEnableDebuggingStep()
};

var installHive = new StepConfig{
    Name = "Install Hive",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = stepFactory.NewInstallHiveStep()
};

var instanceConfig = new JobFlowInstancesConfig{
    Ec2KeyName = "keypair",
    HadoopVersion = "0.20",
    InstanceCount = 5,
    KeepJobFlowAliveWhenNoSteps = true,
    MasterInstanceType = "m1.small",
    SlaveInstanceType = "m1.small"
};

var request = new RunJobFlowRequest{
    Name = "Hive Interactive",
    Steps = {enabledebugging, installHive},
    LogUri = "s3://myawsbucket",
    Instances = instanceConfig
};

var result = emrClient.RunJobFlow(request);

Using the Java SDK to Sign an API Request

Amazon EMR uses the Signature Version 4 signing process to construct signed requests to AWS. For more information, see Signature Version 4 Signing Process.


Hadoop Configuration Reference

Apache Hadoop runs on the EC2 instances launched in a cluster. Amazon Elastic MapReduce (Amazon EMR) defines a default configuration for Hadoop, depending on the Amazon Machine Image (AMI) that you specify when you launch the cluster. For more information about the supported AMI versions, see Choose an Amazon Machine Image (AMI) (p. 53).

The following sections describe the various configuration settings and mechanisms available in Amazon EMR.

Topics
· JSON Configuration Files (p. 564)
· Configuration of hadoop-user-env.sh (p. 568)
· Hadoop 2.2.0 and 2.4.0 Default Configuration (p. 569)
· Hadoop 1.0.3 Default Configuration (p. 592)
· Hadoop 20.205 Default Configuration (p. 607)
· Hadoop Memory-Intensive Configuration Settings (Legacy AMI 1.0.1 and earlier) (p. 614)
· Hadoop Default Configuration (AMI 1.0) (p. 617)
· Hadoop 0.20 Streaming Configuration (p. 625)

JSON Configuration Files

Topics
· Node Settings (p. 564)
· Cluster Configuration (p. 566)

When Amazon EMR creates a Hadoop cluster, each node contains a pair of JSON files containing configuration information about the node and the currently running cluster. These files are in the /mnt/var/lib/info directory and are accessible to scripts running on the node.

Node Settings

Settings for an Amazon EMR cluster node are contained in the instance.json file.

The following table describes the contents of the instance.json file.


Parameter Description

isMaster Indicates that this is the master node. Type: Boolean

isRunningNameNode Indicates that this node is running the Hadoop name node daemon. Type: Boolean

isRunningDataNode Indicates that this node is running the Hadoop data node daemon. Type: Boolean

isRunningJobTracker Indicates that this node is running the Hadoop job tracker daemon. Type: Boolean

isRunningTaskTracker Indicates that this node is running the Hadoop task tracker daemon. Type: Boolean

Hadoop 2.2.0 adds the following parameters to the instance.json file.

Parameter Description

isRunningResourceManager Indicates that this node is running the Hadoop resource manager daemon. Type: Boolean

isRunningNodeManager Indicates that this node is running the Hadoop node manager daemon. Type: Boolean

The following example shows the contents of an instance.json file:

{ "instanceGroupId":"Instance_Group_ID", "isMaster": Boolean, "isRunningNameNode": Boolean, "isRunningDataNode": Boolean, "isRunningJobTracker": Boolean, "isRunningTaskTracker": Boolean }

To read settings in instance.json with a bootstrap action using the AWS CLI

This procedure uses a run-if bootstrap action to demonstrate how to execute the command line function echo to display the string running on master node by evaluating the JSON file parameter instance.isMaster in the instance.json file.

· Type the following command to read settings in instance.json. The KeyName attribute must be specified in order to connect to the master node via SSH:

aws emr create-cluster --instance-count integer --instance-type string - -ami-version string \ --ec2-attributes KeyName=string --bootstrap-action Path=string,Name=string,Args=[arg1,arg2]


For example:

aws emr create-cluster --instance-count 3 --instance-type m3.xlarge --ami-version 3.2.1 \
--ec2-attributes KeyName=keyname --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/run-if,Name="Run-if Bootstrap",Args=["instance.isMaster=true", "echo 'Running on master node'"]

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To read settings in instance.json with a bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

This procedure uses a run-if bootstrap action to demonstrate how to execute the command line function echo to display the string running on master node by evaluating the JSON file parameter instance.isMaster in the instance.json file.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "RunIf" \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \ --bootstrap-name "Run only on master" \ --args "instance.isMaster=true,echo,'Running on master node'"

· Windows users:

ruby elastic-mapreduce --create --alive --name "RunIf" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --bootstrap-name "Run only on master" --args "instance.isMaster=true,echo,'Running on master node'"

Cluster Configuration

Information about the currently running cluster is contained in the job-flow.json file.

The following table describes the contents of the job-flow.json file.

Parameter Description

JobFlowID Contains the ID for the cluster. Type: String


jobFlowCreationInstant Contains the time that the cluster was created. Type: Long

instanceCount Contains the number of nodes in an instance group. Type: Integer

masterInstanceId Contains the ID for the master node. Type: String

masterPrivateDnsName Contains the private DNS name of the master node. Type: String

masterInstanceType Contains the EC2 instance type of the master node. Type: String

slaveInstanceType Contains the EC2 instance type of the slave nodes. Type: String

hadoopVersion Contains the version of Hadoop running on the cluster. Type: String

instanceGroups A list of objects specifying each instance group in the cluster:
· instanceGroupId - unique identifier for this instance group. Type: String
· instanceGroupName - user-defined name of the instance group. Type: String
· instanceRole - one of Master, Core, or Task. Type: String
· instanceType - the Amazon EC2 type of the node, such as "m1.small". Type: String
· requestedInstanceCount - the target number of nodes for this instance group. Type: Long

The following example shows the contents of a job-flow.json file.

{ "jobFlowId":"JobFlowID", "jobFlowCreationInstant": CreationInstanceID, "instanceCount": Count, "masterInstanceId":"MasterInstanceID", "masterPrivateDnsName":"Name", "masterInstanceType":"Amazon_EC2_Instance_Type", "slaveInstanceType":"Amazon_EC2_Instance_Type", "hadoopVersion":"Version", "instanceGroups": [ { "instanceGroupId":"InstanceGroupID", "instanceGroupName":"Name", "instanceRole":"Master",

API Version 2009-03-31 567 Amazon Elastic MapReduce Developer Guide Configuration of hadoop-user-env.sh

"marketType":"Type", "instanceType":"AmazonEC2InstanceType", "requestedInstanceCount": Count}, } { "instanceGroupId":"InstanceGroupID", "instanceGroupName":"Name", "instanceRole":"Core", "marketType":"Type", "instanceType":"AmazonEC2InstanceType", "requestedInstanceCount": Count}, } { "instanceGroupId":"InstanceGroupID", "instanceGroupName":"Name", "instanceRole":"Task", "marketType":"Type", "instanceType":"AmazonEC2InstanceType", "requestedInstanceCount": Count } ] }

Configuration of hadoop-user-env.sh

When you run a Hadoop daemon or job, a number of scripts are executed as part of the initialization process. The executable hadoop is actually an alias for a Bash script called /home/hadoop/bin/hadoop. This script is responsible for setting up the Java classpath, configuring the Java memory settings, determining which main class to run, and executing the actual Java process.

As part of the Hadoop configuration, the hadoop script executes a file called conf/hadoop-env.sh. The hadoop-env.sh script can set various environment variables. The conf/hadoop-env.sh script is used so that the main bin/hadoop script remains unmodified. Amazon EMR creates a hadoop-env.sh script on every node in a cluster in order to configure the amount of memory for every Hadoop daemon launched.

You can create a script, conf/hadoop-user-env.sh, to override the default Hadoop settings that Amazon EMR configures.

You should put your custom overrides for the Hadoop environment variables in conf/hadoop-user-env.sh. Custom overrides could include items such as changes to Java memory or naming additional JAR files in the classpath. The script is also where Amazon EMR writes data when you use a bootstrap action to configure memory or to specify additional Java arguments.
Note If you want your custom classpath to override the original classpath, you should set the environment variable HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH value specified in hadoop-user-env.sh is first.

Examples of environment variables that you can specify in hadoop-user-env.sh include:

· export HADOOP_DATANODE_HEAPSIZE="128"
· export HADOOP_JOBTRACKER_HEAPSIZE="768"
· export HADOOP_NAMENODE_HEAPSIZE="256"
· export HADOOP_OPTS="-server"
· export HADOOP_TASKTRACKER_HEAPSIZE="512"

In addition, Hadoop 2.2.0 adds the following new environment variables that you can specify in hadoop-user-env.sh:

· YARN_RESOURCEMANAGER_HEAPSIZE="128"
· YARN_NODEMANAGER_HEAPSIZE="768"

For more information, go to the Hadoop MapReduce Next Generation - Cluster Setup topic on the Hadoop Apache website.

A bootstrap action runs before Hadoop starts and before any steps are run. In some cases, it is necessary to configure the Hadoop environment variables referenced in the Hadoop launch script.

If the script /home/hadoop/conf/hadoop-user-env.sh exists when Hadoop launches, Amazon EMR executes this script and any options are inherited by bin/hadoop.

For example, to add a JAR file to the beginning of the Hadoop daemon classpath, you can use a bootstrap action such as:

#!/bin/bash
export HADOOP_USER_CLASSPATH_FIRST=true;
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
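To apply a script like this to a new cluster, upload it to an Amazon S3 bucket that you own and reference it as a bootstrap action when you create the cluster. The following AWS CLI sketch uses a hypothetical bucket and script name (mybucket and add-jar-to-classpath.sh):

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname \
--bootstrap-action Path=s3://mybucket/add-jar-to-classpath.sh,Name="Prepend JAR to classpath"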

For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Hadoop 2.2.0 and 2.4.0 Default Configuration

Topics
· Hadoop Configuration (Hadoop 2.2.0, 2.4.0) (p. 569)
· HDFS Configuration (Hadoop 2.2.0) (p. 578)
· Task Configuration (Hadoop 2.2.0) (p. 578)
· Intermediate Compression (Hadoop 2.2.0) (p. 591)

This section describes the default configuration settings that Amazon EMR uses to configure a Hadoop cluster launched with Hadoop 2.2.0. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

Hadoop Configuration (Hadoop 2.2.0, 2.4.0)

The following tables list the default configuration settings for each EC2 instance type in clusters launched with Hadoop 2.2.0 on Amazon EMR. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

m1.medium

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 384

YARN_PROXYSERVER_HEAPSIZE 192


YARN_NODEMANAGER_HEAPSIZE 256

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 256

HADOOP_NAMENODE_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 192

m1.large

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 768

YARN_PROXYSERVER_HEAPSIZE 384

YARN_NODEMANAGER_HEAPSIZE 512

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 512

HADOOP_NAMENODE_HEAPSIZE 768

HADOOP_DATANODE_HEAPSIZE 384

m1.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 1024

YARN_PROXYSERVER_HEAPSIZE 512

YARN_NODEMANAGER_HEAPSIZE 768

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1024

HADOOP_NAMENODE_HEAPSIZE 2304

HADOOP_DATANODE_HEAPSIZE 384

m2.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 1536

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1024

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1024

HADOOP_NAMENODE_HEAPSIZE 3072

HADOOP_DATANODE_HEAPSIZE 384

m2.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 1536

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1024

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 6144

HADOOP_DATANODE_HEAPSIZE 384

m2.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2048

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1536

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_DATANODE_HEAPSIZE 384

m3.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2396

YARN_PROXYSERVER_HEAPSIZE 2396

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2396

HADOOP_NAMENODE_HEAPSIZE 1740

HADOOP_DATANODE_HEAPSIZE 757

m3.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2703

YARN_PROXYSERVER_HEAPSIZE 2703

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2703


HADOOP_NAMENODE_HEAPSIZE 3276

HADOOP_DATANODE_HEAPSIZE 1064

c1.medium

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 192

YARN_PROXYSERVER_HEAPSIZE 96

YARN_NODEMANAGER_HEAPSIZE 128

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 128

HADOOP_NAMENODE_HEAPSIZE 192

HADOOP_DATANODE_HEAPSIZE 96

c1.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 768

YARN_PROXYSERVER_HEAPSIZE 384

YARN_NODEMANAGER_HEAPSIZE 512

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 512

HADOOP_NAMENODE_HEAPSIZE 768

HADOOP_DATANODE_HEAPSIZE 384

c3.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2124

YARN_PROXYSERVER_HEAPSIZE 2124

YARN_NODEMANAGER_HEAPSIZE 2124

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2124

HADOOP_NAMENODE_HEAPSIZE 972

HADOOP_DATANODE_HEAPSIZE 588

c3.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2396

YARN_PROXYSERVER_HEAPSIZE 2396

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2396

HADOOP_NAMENODE_HEAPSIZE 1740

HADOOP_DATANODE_HEAPSIZE 757

c3.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2703

YARN_PROXYSERVER_HEAPSIZE 2703

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2703

HADOOP_NAMENODE_HEAPSIZE 3276

HADOOP_DATANODE_HEAPSIZE 1064

c3.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 3317

YARN_PROXYSERVER_HEAPSIZE 3317

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 3317

HADOOP_NAMENODE_HEAPSIZE 6348

HADOOP_DATANODE_HEAPSIZE 1679

cc1.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2048

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1536

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536


HADOOP_NAMENODE_HEAPSIZE 3840

HADOOP_DATANODE_HEAPSIZE 384

cc2.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2048

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1536

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_DATANODE_HEAPSIZE 384

cg1.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2048

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1536

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 3840

HADOOP_DATANODE_HEAPSIZE 384

cr1.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 7086

YARN_PROXYSERVER_HEAPSIZE 7086

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 7086

HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_DATANODE_HEAPSIZE 4096

hi1.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 3328

YARN_PROXYSERVER_HEAPSIZE 3328

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 3328

HADOOP_NAMENODE_HEAPSIZE 6400

HADOOP_DATANODE_HEAPSIZE 1689

hs1.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2048

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1536

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_DATANODE_HEAPSIZE 384

i2.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2713

YARN_PROXYSERVER_HEAPSIZE 2713

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2713

HADOOP_NAMENODE_HEAPSIZE 3328

HADOOP_DATANODE_HEAPSIZE 1075

i2.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 3338

YARN_PROXYSERVER_HEAPSIZE 3338

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 3338


HADOOP_NAMENODE_HEAPSIZE 6451

HADOOP_DATANODE_HEAPSIZE 1699

i2.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 4587

YARN_PROXYSERVER_HEAPSIZE 4587

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 4587

HADOOP_NAMENODE_HEAPSIZE 12697

HADOOP_DATANODE_HEAPSIZE 2949

i2.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 7086

YARN_PROXYSERVER_HEAPSIZE 7086

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 7086

HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_DATANODE_HEAPSIZE 4096

g2.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 1536

YARN_PROXYSERVER_HEAPSIZE 1024

YARN_NODEMANAGER_HEAPSIZE 1024

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 1024

HADOOP_NAMENODE_HEAPSIZE 2304

HADOOP_DATANODE_HEAPSIZE 384

r3.xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 2713

YARN_PROXYSERVER_HEAPSIZE 2713

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2713

HADOOP_NAMENODE_HEAPSIZE 3328

HADOOP_DATANODE_HEAPSIZE 1075

r3.2xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 3338

YARN_PROXYSERVER_HEAPSIZE 3338

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 3338

HADOOP_NAMENODE_HEAPSIZE 6451

HADOOP_DATANODE_HEAPSIZE 1699

r3.4xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 4587

YARN_PROXYSERVER_HEAPSIZE 4587

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 4587

HADOOP_NAMENODE_HEAPSIZE 12697

HADOOP_DATANODE_HEAPSIZE 2949

r3.8xlarge

Parameter Value

YARN_RESOURCEMANAGER_HEAPSIZE 7086

YARN_PROXYSERVER_HEAPSIZE 7086

YARN_NODEMANAGER_HEAPSIZE 2048

HADOOP_JOB_HISTORYSERVER_HEAPSIZE 7086


HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_DATANODE_HEAPSIZE 4096

HDFS Configuration (Hadoop 2.2.0)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

Parameter Definition Default Value

dfs.block.size The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode. Default: 134217728 (128 MB)

dfs.replication This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate. Amazon EMR automatically calculates the replication factor based on cluster size. To overwrite the default value, use a configure-hadoop bootstrap action. Default: 1 for clusters with fewer than four nodes, 2 for clusters with fewer than ten nodes, 3 for all other clusters
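For example, to force a replication factor of 2 regardless of cluster size, you could pass the setting to the configure-hadoop bootstrap action when you create the cluster. The sketch below assumes the bootstrap action's -h flag, which applies a key-value pair to hdfs-site.xml; confirm the flag against the configure-hadoop documentation for your AMI version before using it.

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Set HDFS replication",Args=["-h","dfs.replication=2"]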

Task Configuration (Hadoop 2.2.0)

Topics
· Task JVM Memory Settings (AMI 3.0.0) (p. 578)
· Avoiding Cluster Slowdowns (AMI 3.0.0) (p. 589)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Task JVM Memory Settings (AMI 3.0.0)

Hadoop 2.2.0 uses two parameters to configure memory for map and reduce tasks: mapreduce.map.java.opts and mapreduce.reduce.java.opts, respectively. These replace the single configuration option used in previous Hadoop versions: mapred.child.java.opts.

The defaults for these settings per instance type are shown in the following tables.

m1.medium

Configuration Option Default Value

mapreduce.map.java.opts -Xmx512m
mapreduce.reduce.java.opts -Xmx768m
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 1024

yarn.app.mapreduce.am.resource.mb 1024
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 2048
yarn.nodemanager.resource.memory-mb 2048

m1.large

Configuration Option Default Value

mapreduce.map.java.opts -Xmx512m
mapreduce.reduce.java.opts -Xmx1024m
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 1536
yarn.app.mapreduce.am.resource.mb 1536
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 3072
yarn.nodemanager.resource.memory-mb 5120

m1.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx512m
mapreduce.reduce.java.opts -Xmx1536m
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 2048
yarn.app.mapreduce.am.resource.mb 2048
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 12288

m2.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx864m
mapreduce.reduce.java.opts -Xmx1536m
mapreduce.map.memory.mb 1024

mapreduce.reduce.memory.mb 2048
yarn.app.mapreduce.am.resource.mb 2048
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 7168
yarn.nodemanager.resource.memory-mb 14336

m2.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.app.mapreduce.am.resource.mb 2560
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 30720

m3.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1152m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1440
mapreduce.reduce.memory.mb 2880
yarn.scheduler.minimum-allocation-mb 1440
yarn.scheduler.maximum-allocation-mb 11520
yarn.nodemanager.resource.memory-mb 11520

m3.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1152m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1440

mapreduce.reduce.memory.mb 2880
yarn.scheduler.minimum-allocation-mb 1440
yarn.scheduler.maximum-allocation-mb 23040
yarn.nodemanager.resource.memory-mb 23040

m2.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 61440

c1.medium

Configuration Option Default Value

io.sort.mb 100
mapreduce.map.java.opts -Xmx288m
mapreduce.reduce.java.opts -Xmx288m
mapreduce.map.memory.mb 512
mapreduce.reduce.memory.mb 512
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 512
yarn.nodemanager.resource.memory-mb 1024

c1.xlarge

Configuration Option Default Value

io.sort.mb 150
mapreduce.map.java.opts -Xmx864m
mapreduce.reduce.java.opts -Xmx1536m
mapreduce.map.memory.mb 1024

mapreduce.reduce.memory.mb 2048
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 2048
yarn.nodemanager.resource.memory-mb 5120

c3.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1126m
mapreduce.reduce.java.opts -Xmx2252m
mapreduce.map.memory.mb 1408
mapreduce.reduce.memory.mb 2816
yarn.scheduler.minimum-allocation-mb 1408
yarn.scheduler.maximum-allocation-mb 5632
yarn.nodemanager.resource.memory-mb 5632

c3.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1152m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1440
mapreduce.reduce.memory.mb 2880
yarn.scheduler.minimum-allocation-mb 1440
yarn.scheduler.maximum-allocation-mb 11520
yarn.nodemanager.resource.memory-mb 11520

c3.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1152m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1440
mapreduce.reduce.memory.mb 2880
yarn.scheduler.minimum-allocation-mb 1440

yarn.scheduler.maximum-allocation-mb 23040
yarn.nodemanager.resource.memory-mb 23040

c3.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1331m
mapreduce.reduce.java.opts -Xmx2662m
mapreduce.map.memory.mb 1664
mapreduce.reduce.memory.mb 3328
yarn.scheduler.minimum-allocation-mb 1664
yarn.scheduler.maximum-allocation-mb 53248
yarn.nodemanager.resource.memory-mb 53248

cc1.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.app.mapreduce.am.resource.mb 2560
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 5120
yarn.nodemanager.resource.memory-mb 20480

cg1.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.app.mapreduce.am.resource.mb 2560
yarn.scheduler.minimum-allocation-mb 256

yarn.scheduler.maximum-allocation-mb 5120
yarn.nodemanager.resource.memory-mb 20480

cc2.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.app.mapreduce.am.resource.mb 2560
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 56320

cr1.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx6042m
mapreduce.reduce.java.opts -Xmx12084m
mapreduce.map.memory.mb 7552
mapreduce.reduce.memory.mb 15104
yarn.scheduler.minimum-allocation-mb 7552
yarn.scheduler.maximum-allocation-mb 241664
yarn.nodemanager.resource.memory-mb 241664

g2.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx512m
mapreduce.reduce.java.opts -Xmx1536m
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 2048
yarn.app.mapreduce.am.resource.mb 2048
yarn.scheduler.minimum-allocation-mb 256

yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 12288

hi1.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2688m
mapreduce.reduce.java.opts -Xmx5376m
mapreduce.map.memory.mb 3360
mapreduce.reduce.memory.mb 6720
yarn.scheduler.minimum-allocation-mb 3360
yarn.scheduler.maximum-allocation-mb 53760
yarn.nodemanager.resource.memory-mb 53760

hs1.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx1280m
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.map.memory.mb 1536
mapreduce.reduce.memory.mb 2560
yarn.app.mapreduce.am.resource.mb 2560
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 8192
yarn.nodemanager.resource.memory-mb 56320

i2.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2342m
mapreduce.reduce.java.opts -Xmx4684m
mapreduce.map.memory.mb 2928
mapreduce.reduce.memory.mb 5856
yarn.scheduler.minimum-allocation-mb 2928
yarn.scheduler.maximum-allocation-mb 23424

yarn.nodemanager.resource.memory-mb 23424

i2.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2714m
mapreduce.reduce.java.opts -Xmx5428m
mapreduce.map.memory.mb 3392
mapreduce.reduce.memory.mb 6784
yarn.scheduler.minimum-allocation-mb 3392
yarn.scheduler.maximum-allocation-mb 54272
yarn.nodemanager.resource.memory-mb 54272

i2.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2918m
mapreduce.reduce.java.opts -Xmx5836m
mapreduce.map.memory.mb 3648
mapreduce.reduce.memory.mb 7296
yarn.scheduler.minimum-allocation-mb 3648
yarn.scheduler.maximum-allocation-mb 116736
yarn.nodemanager.resource.memory-mb 116736

i2.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx3021m
mapreduce.reduce.java.opts -Xmx6042m
mapreduce.map.memory.mb 15974
mapreduce.reduce.memory.mb 16220
yarn.scheduler.minimum-allocation-mb 3776
yarn.scheduler.maximum-allocation-mb 241664
yarn.nodemanager.resource.memory-mb 241664

r3.xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2342m
mapreduce.reduce.java.opts -Xmx4684m
mapreduce.map.memory.mb 2982
mapreduce.reduce.memory.mb 5856
yarn.scheduler.minimum-allocation-mb 2928
yarn.scheduler.maximum-allocation-mb 23424
yarn.nodemanager.resource.memory-mb 23424

r3.2xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx2714m
mapreduce.reduce.java.opts -Xmx5428m
mapreduce.map.memory.mb 3392
mapreduce.reduce.memory.mb 6784
yarn.scheduler.minimum-allocation-mb 3392
yarn.scheduler.maximum-allocation-mb 54272
yarn.nodemanager.resource.memory-mb 54272

r3.4xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx5837m
mapreduce.reduce.java.opts -Xmx11674m
mapreduce.map.memory.mb 7296
mapreduce.reduce.memory.mb 14592
yarn.scheduler.minimum-allocation-mb 7296
yarn.scheduler.maximum-allocation-mb 116736
yarn.nodemanager.resource.memory-mb 116736

r3.8xlarge

Configuration Option Default Value

mapreduce.map.java.opts -Xmx6042m

mapreduce.reduce.java.opts -Xmx12084m
mapreduce.map.memory.mb 7552
mapreduce.reduce.memory.mb 15104
yarn.scheduler.minimum-allocation-mb 7552
yarn.scheduler.maximum-allocation-mb 241664
yarn.nodemanager.resource.memory-mb 241664

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure that all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

To modify JVM settings via a bootstrap action using the AWS CLI

To modify JVM settings using the AWS CLI, type the --bootstrap-action parameter and specify the settings in the arguments list.

· To configure infinite JVM reuse, type the following command:

aws emr create-cluster --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--ec2-attributes KeyName=string --bootstrap-action Path=string,Name=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Configuring infinite JVM reuse",Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"]

Note Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse tasks.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.


To modify JVM settings via a bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "JVM infinite reuse" \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --bootstrap-name "Configuring infinite JVM reuse" \ --args "-m,mapred.job.reuse.jvm.num.tasks=-1"

· Windows users:

ruby elastic-mapreduce --create --alive --name "JVM infinite reuse" - -bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Configuring infinite JVM reuse" --args "-m,mapred.job.reuse.jvm.num.tasks=-1"

Note Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse tasks.

Avoiding Cluster Slowdowns (AMI 3.0.0)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the cluster progresses, some machines complete their tasks while others are still running. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. However, if you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.3. You can override these settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Speculative Execution Parameters

Parameter Default Setting

mapred.map.tasks.speculative.execution true
mapred.reduce.tasks.speculative.execution true


To disable reducer speculative execution via a bootstrap action using the AWS CLI

To disable reduce speculative execution using the AWS CLI, type the --bootstrap-action parameter and specify the arguments.

· Type the following command to disable reducer speculative execution:

aws emr create-cluster --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--ec2-attributes KeyName=string --bootstrap-action Path=string,Name=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Disable reducer speculative execution",Args=["-m","mapred.reduce.tasks.speculative.execution=false"]

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To disable reducer speculative execution via a bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Reducer speculative execution" \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --bootstrap-name "Disable reducer speculative execution" \ --args "-m,mapred.reduce.tasks.speculative.execution=false"

· Windows users:

ruby elastic-mapreduce --create --alive --name "Reducer speculative execu tion" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure- hadoop --bootstrap-name "Disable reducer speculative execution" --args "-m,mapred.reduce.tasks.speculative.execution=false"


Intermediate Compression (Hadoop 2.2.0)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many clusters. To reduce this bottleneck, Amazon EMR enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter Value

mapred.compress.map.output true
mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec

To disable intermediate compression or change the compression codec via a bootstrap action using the AWS CLI

To disable intermediate compression or to change the intermediate compression codec using the AWS CLI, type the --bootstrap-action parameter and specify the arguments.

1. To disable compression, type the following command:

aws emr create-cluster --ami-version string \
--instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
--ec2-attributes KeyName=string --bootstrap-action Path=string,Name=string,Args=[arg1,arg2]

For example:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Disable compression",Args=["-m","mapred.compress.map.output=false"]

2. To change the intermediate compression codec from Snappy to Gzip, type the following command:

aws emr create-cluster --ami-version 3.1.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--ec2-attributes KeyName=keyname --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Change compression codec",Args=["-m","mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"]


For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To disable intermediate compression or change the compression codec via a bootstrap action using the Amazon EMR CLI

Note The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

· In the directory where you installed the Amazon EMR CLI, type the following command. Use mapred.compress.map.output=false to disable intermediate compression. Use mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec to change the compression codec to Gzip. Both arguments are presented below. For more information, see the Command Line Interface Reference for Amazon EMR (p. 626).

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --create --alive --name "Disable compression" \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --bootstrap-name "Disable compression" \ --args "-m,mapred.compress.map.output=false" \ --args "-m,mapred.map.output.compression.codec=org.apache.hadoop.io.com press.GzipCodec"

· Windows users:

ruby elastic-mapreduce --create --alive --name "Disable compression" - -bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Disable compression" --args "-m,mapred.compress.map.out put=false" --args "-m,mapred.map.output.compression.codec=org.apache.ha doop.io.compress.GzipCodec"

Hadoop 1.0.3 Default Configuration

Topics
· Hadoop Configuration (Hadoop 1.0.3) (p. 593)
· HDFS Configuration (Hadoop 1.0.3) (p. 602)
· Task Configuration (Hadoop 1.0.3) (p. 603)
· Intermediate Compression (Hadoop 1.0.3) (p. 606)

This section describes the default configuration settings that Amazon EMR uses to configure a Hadoop cluster launched with Hadoop 1.0.3. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).


Hadoop Configuration (Hadoop 1.0.3)

The following Amazon EMR default configuration settings for clusters launched with Amazon EMR AMI 2.3 are appropriate for most workloads.

If your cluster tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size.

The following tables list the default configuration settings for each EC2 instance type in clusters launched with the Amazon EMR AMI version 2.3. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

m1.small

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 576

HADOOP_NAMENODE_HEAPSIZE 192

HADOOP_TASKTRACKER_HEAPSIZE 192

HADOOP_DATANODE_HEAPSIZE 96
mapred.child.java.opts -Xmx288m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 1152

HADOOP_NAMENODE_HEAPSIZE 384

HADOOP_TASKTRACKER_HEAPSIZE 192

HADOOP_DATANODE_HEAPSIZE 192
mapred.child.java.opts -Xmx576m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.large

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 2304

HADOOP_NAMENODE_HEAPSIZE 768

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384

mapred.child.java.opts -Xmx864m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

m1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6912

HADOOP_NAMENODE_HEAPSIZE 2304

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx768m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 3

m2.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 9216

HADOOP_NAMENODE_HEAPSIZE 3072

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx2304m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

m2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 18432

HADOOP_NAMENODE_HEAPSIZE 6144

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx2688m
mapred.tasktracker.map.tasks.maximum 6

mapred.tasktracker.reduce.tasks.maximum 2

m2.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 36864

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx2304m
mapred.tasktracker.map.tasks.maximum 14
mapred.tasktracker.reduce.tasks.maximum 4

m3.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3686

HADOOP_NAMENODE_HEAPSIZE 1740

HADOOP_TASKTRACKER_HEAPSIZE 686

HADOOP_DATANODE_HEAPSIZE 757
mapred.child.java.opts -Xmx1440m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2

m3.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6758

HADOOP_NAMENODE_HEAPSIZE 3276

HADOOP_TASKTRACKER_HEAPSIZE 839

HADOOP_DATANODE_HEAPSIZE 1064
mapred.child.java.opts -Xmx1440m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 4

c1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 576

HADOOP_NAMENODE_HEAPSIZE 192

HADOOP_TASKTRACKER_HEAPSIZE 192

HADOOP_DATANODE_HEAPSIZE 96
mapred.child.java.opts -Xmx288m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

c1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 2304

HADOOP_NAMENODE_HEAPSIZE 768

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx384m
mapred.tasktracker.map.tasks.maximum 7
mapred.tasktracker.reduce.tasks.maximum 2

c3.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 2124

HADOOP_NAMENODE_HEAPSIZE 972

HADOOP_TASKTRACKER_HEAPSIZE 588

HADOOP_DATANODE_HEAPSIZE 588
mapred.child.java.opts -Xmx1408m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

c3.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3686


HADOOP_NAMENODE_HEAPSIZE 1740

HADOOP_TASKTRACKER_HEAPSIZE 686

HADOOP_DATANODE_HEAPSIZE 757
mapred.child.java.opts -Xmx1440m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2

c3.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6758

HADOOP_NAMENODE_HEAPSIZE 3276

HADOOP_TASKTRACKER_HEAPSIZE 839

HADOOP_DATANODE_HEAPSIZE 1064
mapred.child.java.opts -Xmx1440m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 4

c3.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 12902

HADOOP_NAMENODE_HEAPSIZE 6348

HADOOP_TASKTRACKER_HEAPSIZE 1146

HADOOP_DATANODE_HEAPSIZE 1679
mapred.child.java.opts -Xmx1664m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 8

cc1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 7680

HADOOP_NAMENODE_HEAPSIZE 3840

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx912m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

cc2.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 30114

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx1536m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 6

cg1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 7680

HADOOP_NAMENODE_HEAPSIZE 3840

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx864m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

cr1.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 50585

HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_TASKTRACKER_HEAPSIZE 2048

HADOOP_DATANODE_HEAPSIZE 4096
mapred.child.java.opts -Xmx7552m

mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 8

hi1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 30114

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx1536m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 6

hs1.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 30114

HADOOP_NAMENODE_HEAPSIZE 12288

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx1536m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 6


g2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6912

HADOOP_NAMENODE_HEAPSIZE 2304

HADOOP_TASKTRACKER_HEAPSIZE 384

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx768m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 3

i2.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6860

HADOOP_NAMENODE_HEAPSIZE 3328

HADOOP_TASKTRACKER_HEAPSIZE 844

HADOOP_DATANODE_HEAPSIZE 1075
mapred.child.java.opts -Xmx2928m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2

i2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 13107

HADOOP_NAMENODE_HEAPSIZE 6451

HADOOP_TASKTRACKER_HEAPSIZE 1157

HADOOP_DATANODE_HEAPSIZE 1699
mapred.child.java.opts -Xmx3392m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 4

i2.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 25600


HADOOP_NAMENODE_HEAPSIZE 12697

HADOOP_TASKTRACKER_HEAPSIZE 1781

HADOOP_DATANODE_HEAPSIZE 2949
mapred.child.java.opts -Xmx3648m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 8

i2.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 50585

HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_TASKTRACKER_HEAPSIZE 2048

HADOOP_DATANODE_HEAPSIZE 4096
mapred.child.java.opts -Xmx3776m
mapred.tasktracker.map.tasks.maximum 48
mapred.tasktracker.reduce.tasks.maximum 16

r3.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 6860

HADOOP_NAMENODE_HEAPSIZE 3328

HADOOP_TASKTRACKER_HEAPSIZE 844

HADOOP_DATANODE_HEAPSIZE 1075
mapred.child.java.opts -Xmx2928m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2

r3.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 13107

HADOOP_NAMENODE_HEAPSIZE 6451

HADOOP_TASKTRACKER_HEAPSIZE 1157


HADOOP_DATANODE_HEAPSIZE 1699
mapred.child.java.opts -Xmx3392m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 4

r3.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 25600

HADOOP_NAMENODE_HEAPSIZE 12697

HADOOP_TASKTRACKER_HEAPSIZE 1781

HADOOP_DATANODE_HEAPSIZE 2949
mapred.child.java.opts -Xmx7296m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 4

r3.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 50585

HADOOP_NAMENODE_HEAPSIZE 25190

HADOOP_TASKTRACKER_HEAPSIZE 2048

HADOOP_DATANODE_HEAPSIZE 4096
mapred.child.java.opts -Xmx7552m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 8

HDFS Configuration (Hadoop 1.0.3)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

Parameter Definition Default Value

dfs.block.size The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode. Default: 134217728 (128 MB)

dfs.replication This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate. Amazon EMR automatically calculates the replication factor based on cluster size. To overwrite the default value, use a configure-hadoop bootstrap action. Default: 1 for clusters with fewer than four nodes, 2 for clusters with fewer than ten nodes, 3 for all other clusters

Task Configuration (Hadoop 1.0.3)

Topics
· Tasks per Machine (p. 603)
· Tasks per Job (AMI 2.3) (p. 604)
· Task JVM Settings (AMI 2.3) (p. 605)
· Avoiding Cluster Slowdowns (AMI 2.3) (p. 606)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

· mapred.tasktracker.map.tasks.maximum
· mapred.tasktracker.reduce.tasks.maximum

Amazon EMR provides defaults that are entirely dependent on the EC2 instance type. The following table shows the default settings for clusters launched with AMIs after 2.4.6.

EC2 Instance Name Mappers Reducers

m1.small 2 1
m1.medium 2 1
m1.large 3 1
m1.xlarge 8 3
m2.xlarge 3 1
m2.2xlarge 6 2
m2.4xlarge 14 4
m3.xlarge 6 1
m3.2xlarge 12 3
c1.medium 2 1
c1.xlarge 7 2

c3.xlarge 6 1
c3.2xlarge 12 3
c3.4xlarge 24 6
c3.8xlarge 48 12
cc1.4xlarge 12 3
cc2.8xlarge 24 6
hi1.4xlarge 24 6
hs1.8xlarge 24 6
cg1.4xlarge 12 3
g2.2xlarge 8 4
i2.xlarge 12 3
i2.2xlarge 12 3
i2.4xlarge 24 6
i2.8xlarge 48 12
r3.xlarge 6 1
r3.2xlarge 12 3
r3.4xlarge 24 6
r3.8xlarge 48 12

Note The number of default mappers is based on the memory available on each EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.
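For example, if you raise the mapper count on a core node, you could shrink the per-task heap in the same configure-hadoop bootstrap action so the node does not overcommit memory. The following Amazon EMR CLI sketch follows the -m (mapred-site) pattern used elsewhere in this guide; the specific values are illustrative rather than recommendations.

./elastic-mapreduce --create --alive --name "More mappers, smaller heap" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Adjust mappers and task JVM heap" \
--args "-m,mapred.tasktracker.map.tasks.maximum=4" \
--args "-m,mapred.child.java.opts=-Xmx576m"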

Tasks per Job (AMI 2.3)

When your cluster runs, Hadoop creates a number of map and reduce tasks. These settings determine the number of tasks that can run simultaneously on your cluster. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the number of map and reduce tasks.

The parameters for configuring the reducer setting are described in the following table.

Parameter Description

mapred.map.tasks Target number of map tasks to run. The actual number of tasks created is sometimes different than this number.

mapred.map.tasksperslot Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.

mapred.reduce.tasks Number of reduce tasks to run.

mapred.reduce.tasksperslot Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon EMR. They only take effect if mapred.*.tasks is not defined. The order of precedence is:

1. mapred.map.tasks set by the Hadoop job
2. mapred.map.tasks set in mapred-conf.xml on the master node
3. mapred.map.tasksperslot if neither of those is defined
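For example, to set a cluster-wide default of two reduce tasks per reduce slot, you could pass mapred.reduce.tasksperslot to the configure-hadoop bootstrap action, following the same -m (mapred-site) pattern used elsewhere in this guide. In the AWS CLI sketch below, the AMI version, instance types, and key pair name are placeholders.

aws emr create-cluster --ami-version 2.4.2 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
--ec2-attributes KeyName=keyname \
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="Set reduce tasks per slot",Args=["-m","mapred.reduce.tasksperslot=2"]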

Task JVM Settings (AMI 2.3)

You can configure the amount of heap space for tasks as well as other JVM options with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this location, with the defaults per instance type shown in the following table.

Amazon EC2 Instance Name Default JVM value

m1.small -Xmx288m
m1.medium -Xmx576m
m1.large -Xmx864m
m1.xlarge -Xmx768m
c1.medium -Xmx288m
c1.xlarge -Xmx384m
m2.xlarge -Xmx2304m
m2.2xlarge -Xmx2688m
m2.4xlarge -Xmx2304m
cc1.4xlarge -Xmx912m
cc2.8xlarge -Xmx1536m
hi1.4xlarge -Xmx2048m
hs1.8xlarge -Xmx1536m
cg1.4xlarge -Xmx864m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure that all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

Note Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse tasks.

Avoiding Cluster Slowdowns (AMI 2.3)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the cluster progresses, some machines complete their tasks while others are still running. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. However, if you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.3. You can override these settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Speculative Execution Parameters

Parameter Default Setting

mapred.map.tasks.speculative.execution true
mapred.reduce.tasks.speculative.execution true
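For example, to turn speculative execution off for both mappers and reducers on an AMI 2.x cluster, you could pass both settings to the configure-hadoop bootstrap action with the Amazon EMR CLI, mirroring the pattern shown in the Hadoop 2.2.0 section of this guide:

./elastic-mapreduce --create --alive --name "Disable speculative execution" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Disable speculative execution" \
--args "-m,mapred.map.tasks.speculative.execution=false" \
--args "-m,mapred.reduce.tasks.speculative.execution=false"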

Intermediate Compression (Hadoop 1.0.3)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many clusters. To reduce this bottleneck, Amazon EMR enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter Value

mapred.compress.map.output true
mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec


Hadoop 20.205 Default Configuration

Topics
· Hadoop Configuration (Hadoop 20.205) (p. 607)
· HDFS Configuration (Hadoop 20.205) (p. 611)
· Task Configuration (Hadoop 20.205) (p. 611)
· Intermediate Compression (Hadoop 20.205) (p. 614)

This section describes the default configuration settings that Amazon EMR uses to configure a Hadoop cluster launched with Hadoop 20.205. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

Hadoop Configuration (Hadoop 20.205)

The following Amazon EMR default configuration settings for clusters launched with Amazon EMR AMI 2.0 or 2.1 are appropriate for most workloads.

If your cluster tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size.

The following tables list the default configuration settings for each EC2 instance type in clusters launched with the Amazon EMR AMI version 2.0 or 2.1. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

m1.small

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 768

HADOOP_NAMENODE_HEAPSIZE 256

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 128

mapred.child.java.opts -Xmx384m

mapred.tasktracker.map.tasks.maximum 2

mapred.tasktracker.reduce.tasks.maximum 1

m1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 512

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 256

mapred.child.java.opts -Xmx768m

mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.large

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3072

HADOOP_NAMENODE_HEAPSIZE 1024

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1152m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

m1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 9216

HADOOP_NAMENODE_HEAPSIZE 3072

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1024m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 3

c1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 768

HADOOP_NAMENODE_HEAPSIZE 256

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx384m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

c1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3072

HADOOP_NAMENODE_HEAPSIZE 1024

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 7
mapred.tasktracker.reduce.tasks.maximum 2

m2.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 12288

HADOOP_NAMENODE_HEAPSIZE 4096

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx3072m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

m2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 24576

HADOOP_NAMENODE_HEAPSIZE 8192

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx3584m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2

m2.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 49152


HADOOP_NAMENODE_HEAPSIZE 16384

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx3072m
mapred.tasktracker.map.tasks.maximum 14
mapred.tasktracker.reduce.tasks.maximum 4

cc1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 10240

HADOOP_NAMENODE_HEAPSIZE 5120

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1216m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

cc2.8xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 40152

HADOOP_NAMENODE_HEAPSIZE 16384

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx2048m
mapred.tasktracker.map.tasks.maximum 24
mapred.tasktracker.reduce.tasks.maximum 6

cg1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 10240

HADOOP_NAMENODE_HEAPSIZE 5120

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1152m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

HDFS Configuration (Hadoop 20.205)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

dfs.block.size
Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
Default Value: 134217728 (128 MB)

dfs.replication
Definition: This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate. Amazon EMR automatically calculates the replication factor based on cluster size. To overwrite the default value, use a configure-hadoop bootstrap action.
Default Value: 1 for clusters < four nodes; 2 for clusters < ten nodes; 3 for all other clusters
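For example, the following sketch overrides the replication factor at cluster launch, assuming the -h (hdfs-site) shortcut of the configure-hadoop bootstrap action; the instance settings are placeholders:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-h,dfs.replication=2"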

Task Configuration (Hadoop 20.205)

Topics
· Tasks per Machine (p. 611)
· Tasks per Job (AMI 2.0 and 2.1) (p. 612)
· Task JVM Settings (AMI 2.0 and 2.1) (p. 613)
· Avoiding Cluster Slowdowns (AMI 2.0 and 2.1) (p. 613)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

· mapred.tasktracker.map.tasks.maximum · mapred.tasktracker.reduce.tasks.maximum

Amazon EMR provides defaults that are entirely dependent on the EC2 instance type. The following table shows the default settings for clusters launched with AMI 2.0 or 2.1.


Amazon EC2 Instance Name Mappers Reducers
m1.small 2 1
m1.medium 2 1
m1.large 3 1
m1.xlarge 8 3
c1.medium 2 1
c1.xlarge 7 2
m2.xlarge 3 1
m2.2xlarge 6 2
m2.4xlarge 14 4
cc1.4xlarge 12 3
cc2.8xlarge 24 6
cg1.4xlarge 12 3

Note
The number of default mappers is based on the memory available on each EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.

Tasks per Job (AMI 2.0 and 2.1)

When your cluster runs, Hadoop creates a number of map and reduce tasks. These numbers determine how many tasks can run simultaneously while your cluster runs. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.

Parameter Description
mapred.map.tasks Target number of map tasks to run. The actual number of tasks created is sometimes different than this number.
mapred.map.tasksperslot Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.
mapred.reduce.tasks Number of reduce tasks to run.
mapred.reduce.tasksperslot Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon EMR. They only take effect if mapred.*.tasks is not defined. The order of precedence is:


1. mapred.map.tasks set by the Hadoop job
2. mapred.map.tasks set in mapred-conf.xml on the master node
3. mapred.map.tasksperslot if neither of the above are defined
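For example, when submitting a streaming step with the Amazon EMR CLI, you could pin the number of reducers for the job with the --jobconf option; the bucket names and mapper script below are placeholders:

./elastic-mapreduce --create --stream --input s3://mybucket/input --output s3://mybucket/output --mapper s3://mybucket/mapper.py --reducer aggregate --jobconf mapred.reduce.tasks=10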

Task JVM Settings (AMI 2.0 and 2.1)

You can configure the amount of heap space for tasks as well as other JVM options with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this location, with the defaults per instance type shown in the following table.

Amazon EC2 Instance Name Default JVM value
m1.small -Xmx384m
m1.medium -Xmx768m
m1.large -Xmx1152m
m1.xlarge -Xmx1024m
c1.medium -Xmx384m
c1.xlarge -Xmx512m
m2.xlarge -Xmx3072m
m2.2xlarge -Xmx3584m
m2.4xlarge -Xmx3072m
cc1.4xlarge -Xmx1216m
cc2.8xlarge -Xmx2048m
cg1.4xlarge -Xmx1152m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.
Note
Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse the JVM between tasks.
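For example, the following sketch (placeholder instance settings) launches a cluster that disables JVM reuse through the configure-hadoop bootstrap action:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,mapred.job.reuse.jvm.num.tasks=1"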

Avoiding Cluster Slowdowns (AMI 2.0 and 2.1)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the cluster progresses, some machines complete their tasks. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. However, if you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.0 or 2.1. You can override these settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Speculative Execution Parameters

Parameter Default Setting

mapred.map.tasks.speculative.execution true

mapred.reduce.tasks.speculative.execution true

Intermediate Compression (Hadoop 20.205)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many clusters. To reduce this bottleneck, Amazon EMR enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter Value

mapred.compress.map.output true

mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec

Hadoop Memory-Intensive Configuration Settings (Legacy AMI 1.0.1 and earlier)

Note
The memory-intensive settings are set by default in AMI 2.0.0 and later. You should only need to adjust these settings for AMI versions 1.0.1 and earlier.

The Amazon EMR default configuration settings are appropriate for most workloads. However, based on your cluster's specific memory and processing requirements, you might want to modify the configuration settings.

For example, if your cluster tasks are memory-intensive, you can use fewer tasks per core node and reduce your job tracker heap size. A predefined bootstrap action is available to configure your cluster on startup.
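As a sketch, that predefined memory-intensive bootstrap action can be applied at launch as shown below; the Amazon S3 path is the published location of the legacy memory-intensive configuration, and the instance settings are placeholders:

./elastic-mapreduce --create --alive --instance-type m1.xlarge --instance-count 5 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive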

The following tables list the recommended configuration settings for each EC2 instance type. The default configurations for the cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads; therefore, the recommended configuration settings for these instances are not listed.

m1.small

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 512

HADOOP_NAMENODE_HEAPSIZE 512

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 512

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 256
mapred.child.java.opts -Xmx768m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.large

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3072

HADOOP_NAMENODE_HEAPSIZE 1024

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1024m
mapred.tasktracker.map.tasks.maximum 3
mapred.tasktracker.reduce.tasks.maximum 1

m1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 9216


HADOOP_NAMENODE_HEAPSIZE 3072

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1024m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 3

c1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 768

HADOOP_NAMENODE_HEAPSIZE 512

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

c1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 2048

HADOOP_NAMENODE_HEAPSIZE 1024

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 7
mapred.tasktracker.reduce.tasks.maximum 2

m2.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 4096

HADOOP_NAMENODE_HEAPSIZE 2048

HADOOP_TASKTRACKER_HEAPSIZE 512


HADOOP_DATANODE_HEAPSIZE 512

mapred.child.java.opts -Xmx3072m

mapred.tasktracker.map.tasks.maximum 3

mapred.tasktracker.reduce.tasks.maximum 1

m2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 8192

HADOOP_NAMENODE_HEAPSIZE 4096

HADOOP_TASKTRACKER_HEAPSIZE 1024

HADOOP_DATANODE_HEAPSIZE 1024

mapred.child.java.opts -Xmx4096m

mapred.tasktracker.map.tasks.maximum 6

mapred.tasktracker.reduce.tasks.maximum 2

m2.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 8192

HADOOP_NAMENODE_HEAPSIZE 8192

HADOOP_TASKTRACKER_HEAPSIZE 1024

HADOOP_DATANODE_HEAPSIZE 1024

mapred.child.java.opts -Xmx4096m

mapred.tasktracker.map.tasks.maximum 14

mapred.tasktracker.reduce.tasks.maximum 4

Hadoop Default Configuration (AMI 1.0)

Topics
· Hadoop Configuration (AMI 1.0) (p. 618)
· HDFS Configuration (AMI 1.0) (p. 621)
· Task Configuration (AMI 1.0) (p. 622)
· Intermediate Compression (AMI 1.0) (p. 624)


This section describes the default configuration settings that Amazon EMR uses to configure a Hadoop cluster launched with Amazon Machine Image (AMI) version 1.0. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

Hadoop Configuration (AMI 1.0)

The following Amazon EMR default configuration settings are appropriate for most workloads.

If your cluster tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size. These and other memory-intensive configuration settings are described in Hadoop Memory-Intensive Configuration Settings (Legacy AMI 1.0.1 and earlier) (p. 614).

The following tables list the default configuration settings for each EC2 instance type in clusters launched with Amazon EMR AMI version 1.0. For more information about the AMI versions supported by Amazon EMR, see Choose an Amazon Machine Image (AMI) (p. 53).

m1.small

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 768

HADOOP_NAMENODE_HEAPSIZE 256

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx725m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 1536

HADOOP_NAMENODE_HEAPSIZE 512

HADOOP_TASKTRACKER_HEAPSIZE 256

HADOOP_DATANODE_HEAPSIZE 256
mapred.child.java.opts -Xmx1152m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1

m1.large

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3072

HADOOP_NAMENODE_HEAPSIZE 1024


HADOOP_TASKTRACKER_HEAPSIZE 1536

HADOOP_DATANODE_HEAPSIZE 256
mapred.child.java.opts -Xmx1600m
mapred.tasktracker.map.tasks.maximum 4
mapred.tasktracker.reduce.tasks.maximum 2

m1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 9216

HADOOP_NAMENODE_HEAPSIZE 3072

HADOOP_TASKTRACKER_HEAPSIZE 3072

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1600m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 4

c1.medium

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 768

HADOOP_NAMENODE_HEAPSIZE 256

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 256
mapred.child.java.opts -Xmx362m
mapred.tasktracker.map.tasks.maximum 4
mapred.tasktracker.reduce.tasks.maximum 2

c1.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 3072

HADOOP_NAMENODE_HEAPSIZE 1024

HADOOP_TASKTRACKER_HEAPSIZE 1536

HADOOP_DATANODE_HEAPSIZE 512

mapred.child.java.opts -Xmx747m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 4

m2.xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 12288

HADOOP_NAMENODE_HEAPSIZE 4096

HADOOP_TASKTRACKER_HEAPSIZE 3072

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx2048m
mapred.tasktracker.map.tasks.maximum 4
mapred.tasktracker.reduce.tasks.maximum 2

m2.2xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 24576

HADOOP_NAMENODE_HEAPSIZE 8192

HADOOP_TASKTRACKER_HEAPSIZE 3072

HADOOP_DATANODE_HEAPSIZE 1024
mapred.child.java.opts -Xmx3200m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 4

m2.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 49152

HADOOP_NAMENODE_HEAPSIZE 16384

HADOOP_TASKTRACKER_HEAPSIZE 3072

HADOOP_DATANODE_HEAPSIZE 2048
mapred.child.java.opts -Xmx3733m
mapred.tasktracker.map.tasks.maximum 16

mapred.tasktracker.reduce.tasks.maximum 8

cc1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 10240

HADOOP_NAMENODE_HEAPSIZE 3840

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx1024m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

cg1.4xlarge

Parameter Value

HADOOP_JOBTRACKER_HEAPSIZE 10240

HADOOP_NAMENODE_HEAPSIZE 5120

HADOOP_TASKTRACKER_HEAPSIZE 512

HADOOP_DATANODE_HEAPSIZE 512
mapred.child.java.opts -Xmx1024m
mapred.tasktracker.map.tasks.maximum 12
mapred.tasktracker.reduce.tasks.maximum 3

HDFS Configuration (AMI 1.0)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

dfs.block.size
Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
Default Value: 134217728 (128 MB)


dfs.replication
Definition: This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate. Amazon EMR automatically calculates the replication factor based on cluster size. To overwrite the default value, use a configure-hadoop bootstrap action.
Default Value: 1 for clusters < four nodes; 2 for clusters < ten nodes; 3 for all other clusters

Task Configuration (AMI 1.0)

Topics
· Tasks per Machine (p. 622)
· Tasks per Job (AMI 1.0) (p. 623)
· Task JVM Settings (AMI 1.0) (p. 623)
· Avoiding Cluster Slowdowns (AMI 1.0) (p. 624)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

· mapred.tasktracker.map.tasks.maximum · mapred.tasktracker.reduce.tasks.maximum

Amazon EMR provides defaults that are entirely dependent on the EC2 instance type. The following table shows the default settings.

Amazon EC2 Instance Name Mappers Reducers
m1.small 2 1
m1.medium 2 1
m1.large 4 2
m1.xlarge 8 4
c1.medium 4 2
c1.xlarge 8 4
m2.xlarge 4 2
m2.2xlarge 8 4
m2.4xlarge 16 8
cc1.4xlarge 12 3
cg1.4xlarge 12 3


Note
The number of default mappers is based on the memory available on each EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.

Tasks per Job (AMI 1.0)

When your cluster runs, Hadoop creates a number of map and reduce tasks. These numbers determine how many tasks can run simultaneously while your cluster runs. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.

Parameter Description
mapred.map.tasks Target number of map tasks to run. The actual number of tasks created is sometimes different than this number.
mapred.map.tasksperslot Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.
mapred.reduce.tasks Number of reduce tasks to run.
mapred.reduce.tasksperslot Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon EMR. They only take effect if mapred.*.tasks is not defined. The order of precedence is:

1. mapred.map.tasks set by the Hadoop job
2. mapred.map.tasks set in mapred-conf.xml on the master node
3. mapred.map.tasksperslot if neither of the above are defined

Task JVM Settings (AMI 1.0)

You can configure the amount of heap space for tasks as well as other JVM options with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this location, with the defaults per instance type shown in the following table.

Amazon EC2 Instance Name Default JVM value
m1.small -Xmx725m
m1.medium -Xmx1152m
m1.large -Xmx1600m
m1.xlarge -Xmx1600m
c1.medium -Xmx362m
c1.xlarge -Xmx747m


m2.xlarge -Xmx2048m
m2.2xlarge -Xmx3200m
m2.4xlarge -Xmx3733m
cc1.4xlarge -Xmx1024m
cg1.4xlarge -Xmx1024m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.
Note
Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse the JVM between tasks.

Avoiding Cluster Slowdowns (AMI 1.0)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the cluster progresses, some machines complete their tasks. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. However, if you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and disables it for reducers in AMI 1.0. You can override these settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Speculative Execution Parameters

Parameter Default Setting
mapred.map.tasks.speculative.execution true
mapred.reduce.tasks.speculative.execution false

Intermediate Compression (AMI 1.0)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many clusters. To reduce this bottleneck, Amazon EMR enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the LZO codec.


You can modify the default compression settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter Value

mapred.compress.map.output true

mapred.map.output.compression.codec com.hadoop.compression.lzo.LzoCodec

Hadoop 0.20 Streaming Configuration

Hadoop 0.20 and later supports the three streaming parameters described in the following table, in addition to the version 0.18 parameters.

Parameter Description

-files Specifies comma-separated files to copy to the MapReduce cluster.

-archives Specifies comma-separated archives to restore to the compute machines.

-D KEY=VALUE Sets a Hadoop configuration variable. KEY is a Hadoop configuration, such as mapred.map.tasks, and VALUE is the new value.

The -files and -archives parameters are similar to -cacheFile and -cacheArchive of Hadoop 0.18, except that they accept comma-separated values.
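As an illustrative sketch, these parameters appear before the streaming job options when you run a job from the master node. The streaming JAR location shown is the one typically present on Amazon EMR AMIs, and the Amazon S3 paths are placeholders:

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -D mapred.reduce.tasks=5 -files s3://mybucket/scripts/mapper.py -mapper mapper.py -reducer aggregate -input s3://mybucket/input -output s3://mybucket/output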


Command Line Interface Reference for Amazon EMR

The Amazon EMR command line interface (CLI) is a tool you can use to launch and manage clusters from the command line.
Note
The AWS Command Line Interface version 1.4 provides support for Amazon EMR. We recommend you use the AWS CLI by downloading and installing the latest version. For more information, see http://aws.amazon.com/cli/.

Topics
· Install the Amazon EMR Command Line Interface (p. 626)
· How to Call the Command Line Interface (p. 632)
· Command Line Interface Options (p. 632)
· Command Line Interface Releases (p. 644)

Install the Amazon EMR Command Line Interface

To install the command line interface, complete the following tasks:

Topics
· Installing Ruby (p. 626)
· Verifying the RubyGems package management framework (p. 627)
· Installing the Command Line Interface (p. 628)
· Configuring Credentials (p. 628)
· SSH Setup and Configuration (p. 631)

Installing Ruby

The Amazon EMR CLI works with Ruby versions 1.8.7, 1.9.3, and 2.0. If your machine does not have Ruby installed, download one of those versions for use with the CLI.


To install Ruby

1. Download and install Ruby:

· Linux and Unix users can download Ruby 1.8.7 from http://www.ruby-lang.org/en/news/2010/06/23/ruby-1-8-7-p299-released/, Ruby 1.9.3 from https://www.ruby-lang.org/en/news/2014/02/24/ruby-1-9-3-p545-is-released/, and Ruby 2.0 from https://www.ruby-lang.org/en/news/2014/02/24/ruby-2-0-0-p451-is-released/.
· Windows users can install the Ruby versions from http://rubyinstaller.org/downloads/. During the installation process, select the check boxes to add Ruby executables to your PATH environment variable and to associate .rb files with this Ruby installation.
· Mac OS X comes with Ruby installed. You can check the version as shown in the following step.

2. Verify that Ruby is running by typing the following at the command prompt:

ruby -v

The Ruby version is shown, confirming that you installed Ruby. The output should be similar to the following:

ruby 1.8.7 (2012-02-08 patchlevel 358) [universal-darwin11.0]

Verifying the RubyGems package management framework

The Amazon EMR CLI requires RubyGems version 1.8 or later.

To verify the RubyGems installation and version

· To check whether RubyGems is installed, run the following command from a terminal window. If RubyGems is installed, this command displays its version information.

gem -v

If you don't have RubyGems installed, you must download and install RubyGems before you can install the Amazon EMR CLI.

To install RubyGems on Linux/Unix/Mac OS

1. Download and extract RubyGems version 1.8 or later from RubyGems.org.
2. Install RubyGems using the following command.

sudo ruby setup.rb


Installing the Command Line Interface

To download the Amazon EMR CLI

1. Create a new directory to install the Amazon EMR CLI into. From the command-line prompt, enter the following:

mkdir elastic-mapreduce-cli

2. Download the Amazon EMR files:

a. Go to http://aws.amazon.com/developertools/2264.
b. Click Download.
c. Save the file in your newly created directory.

To install the Amazon EMR CLI

1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file:

· Linux, UNIX, and Mac OS X users, from the command-line prompt, enter the following:

unzip elastic-mapreduce-ruby.zip

· Windows users, from Windows Explorer, open the elastic-mapreduce-ruby.zip file and select Extract all files.

Configuring Credentials

The Amazon EMR credentials file can provide information required for many commands. You can also store command parameters in the file so you don't have to repeatedly enter that information at the command line each time you create a cluster.

Your credentials are used to calculate the signature value for every request you make. Amazon EMR automatically looks for your credentials in the file credentials.json. It is convenient to edit the credentials.json file and include your AWS credentials. An AWS key pair is a security credential similar to a password, which you use to securely connect to your instance when it is running. We recommend that you create a new key pair to use with this guide.

To create your credentials file

1. Create a file named credentials.json in the directory where you unzipped the Amazon EMR CLI.
2. Add the following lines to your credentials file:

{ "access_id": "Your AWS Access Key ID", "private_key": "Your AWS Secret Access Key",

API Version 2009-03-31 628 Amazon Elastic MapReduce Developer Guide Configuring Credentials

"key-pair": "Your key pair name", "key-pair-file": "The path and name of your PEM file", "log_uri": "A path to a bucket you own on Amazon S3, such as, s3n://mylog- uri/", "region": "The region of your cluster, either us-east-1, us-west-2, us-west- 1, eu-west-1, eu-central-1, ap-northeast-1, ap-southeast-1, ap-southeast-2, or sa-east-1" }

Note the name of the region. You use this region to create your Amazon EC2 key pair and your Amazon S3 bucket. For more information about regions supported by Amazon EMR, see Regions and Endpoints in the Amazon Web Services General Reference.

The next sections explain how to create and find your credentials.

AWS Security Credentials

AWS uses security credentials to help protect your data. This section shows you how to view your security credentials so you can add them to your credentials.json file.

For CLI access, you need an access key ID and secret access key. Use IAM user access keys instead of AWS root account access keys. IAM lets you securely control access to AWS services and resources in your AWS account. For more information about creating access keys, see How Do I Get Security Credentials? in the AWS General Reference.

Set your access_id parameter to the value of your access key ID and set your private_key parameter to the value of your secret access key.

To create an Amazon EC2 key pair

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. From the EC2 Dashboard, select the region you used in your credentials.json file, then click Key Pair.
3. On the Key Pairs page, click Create Key Pair.
4. Enter a name for your key pair, such as, mykeypair.
5. Click Create.
6. Save the resulting PEM file in a safe location.

In your credentials.json file, change the key-pair parameter to your Amazon EC2 key pair name and change the key-pair-file parameter to the location and name of your PEM file. This PEM file is what the CLI uses as the default for the Amazon EC2 key pair for the EC2 instances it creates when it launches a cluster.

Amazon S3 Bucket

The log-uri parameter specifies a location in Amazon S3 for the Amazon EMR results and log files from your cluster. The value of the log-uri parameter is an Amazon S3 bucket that you create for this purpose.

To create an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.


2. Click Create Bucket.
3. In the Create a Bucket dialog box, enter a bucket name, such as mylog-uri.

This name should be globally unique, and cannot be the same name used by another bucket. For more information about valid bucket names, see http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.
4. Select the Region for your bucket.

If your Amazon EMR region is... Select the Amazon S3 region...

us-east-1 US Standard

us-west-2 Oregon

us-west-1 Northern California

eu-west-1 Ireland

eu-central-1 Frankfurt

ap-northeast-1 Japan

ap-southeast-1 Singapore

ap-southeast-2 Sydney

sa-east-1 Sao Paulo

us-gov-west-1 GovCloud

Note
To use the AWS GovCloud region, contact your AWS business representative. You can't create an AWS GovCloud account on the AWS website. You must engage directly with AWS and sign an AWS GovCloud (US) Enterprise Agreement. For more information, see the AWS GovCloud (US) Product Page.
5. Click Create.
Note
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR cluster logs.

You have created a bucket with the URI s3n://mylog-uri/.

After creating your bucket, set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and give authenticated users read access.

To set permissions on an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In the Buckets pane, right-click the bucket you just created.
3. Select Properties.
4. In the Properties pane, select the Permissions tab.
5. Click Add more permissions.
6. Select Authenticated Users in the Grantee field.
7. To the right of the Grantee field, select List.


8. Click Save.

You have now created a bucket and assigned it permissions. Set your log-uri parameter to this bucket's URI as the location for Amazon EMR to upload your logs and results.

SSH Setup and Configuration

Configure your SSH credentials for use with either SSH or PuTTY. This step is required.

To configure your SSH credentials

· Configure your computer to use SSH:

· Linux, UNIX, and Mac OS X users, set the permissions on the PEM file for your Amazon EC2 key pair. For example, if you saved the file as mykeypair.pem, the command looks like:

chmod og-rwx mykeypair.pem

· Windows users
a. Windows users use PuTTY to connect to the master node. Download PuTTYgen.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
b. Launch PuTTYgen.
c. Click Load. Select the PEM file you created earlier.
d. Click Open.
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported.
f. Click Save private key to save the key in the PPK format.
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
h. Enter a name for your PuTTY private key, such as, mykeypair.ppk.
i. Click Save.
j. Exit the PuTTYgen application.

Verify installation of the Amazon EMR CLI

· In the directory where you installed the Amazon EMR CLI, run the following commands from the command line:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --version

· Windows users:

ruby elastic-mapreduce --version

If the CLI is correctly installed and the credentials properly configured, the CLI should display its version number represented as a date. The output should look similar to the following:


Version 2012-12-17

How to Call the Command Line Interface

The syntax that you use to run the command line interface (CLI) differs slightly depending on the operating system you use. In the following examples, commands are issued in either a terminal (Linux, UNIX, and Mac OS X) or a command (Windows) interface. Both examples assume that you are running the commands from the directory where you unzipped the Amazon EMR CLI.

In the Linux/UNIX/Mac OS X version of the CLI call, you use a period and slash (./) to indicate that the script is located in the current directory. The operating system automatically detects that the script is a Ruby script and uses the correct libraries to interpret the script. In the Windows version of the call, using the current directory is implied, but you have to explicitly specify which scripting engine to use by prefixing the call with "ruby".

Aside from the preceding operating system-specific differences in how you call the CLI Ruby script, the way you pass options to the CLI is the same. For more information about the CLI options, see Command Line Interface Options (p. 632).

In the directory where you installed the Amazon EMR CLI, issue commands in one of the following formats, depending on your operating system.

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce Options

· Windows users:

ruby elastic-mapreduce Options
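For example, the following command (shown in the Linux, UNIX, and Mac OS X form; Windows users would start the line with ruby instead of ./) lists details for your clusters that are currently starting, running, or waiting:

./elastic-mapreduce --list --active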

Command Line Interface Options

The Amazon EMR command line interface (CLI) supports the following options, arranged according to function. Options that fit into more than one category are listed multiple times.

Topics
· Common Options (p. 633)
· Uncommon Options (p. 634)
· Options Common to All Step Types (p. 634)
· Short Options (p. 634)
· Adding and Modifying Instance Groups (p. 635)
· Adding JAR Steps to Job Flows (p. 635)
· Adding JSON Steps to Job Flows (p. 635)
· Adding Streaming Steps to Job Flows (p. 635)
· Assigning an Elastic IP Address to the Master Node (p. 636)
· Contacting the Master Node (p. 636)
· Creating Job Flows (p. 637)


· HBase Options (p. 638)
· Hive Options (p. 640)
· Impala Options (p. 640)
· Listing and Describing Job Flows (p. 641)
· Passing Arguments to Steps (p. 642)
· Pig Options (p. 642)
· Specific Steps (p. 643)
· Specifying Bootstrap Actions (p. 643)
· Tagging (p. 643)
· Terminating job flows (p. 643)

Common Options

-a ACCESS_ID Sets the AWS access identifier. --access-id ACCESS_ID Sets the AWS access identifier. --credentials CREDENTIALS_FILE Specifies the credentials file that contains the AWS access identifier and the AWS private key to use when contacting Amazon EMR.

For CLI access, you need an access key ID and secret access key. Use IAM user access keys instead of AWS root account access keys. IAM lets you securely control access to AWS services and resources in your AWS account. For more information about creating access keys, see How Do I Get Security Credentials? in the AWS General Reference. --help Displays help information from the CLI. --http-proxy HTTP_PROXY HTTP proxy server address host[:port]. --http-proxy-user USER The username supplied to the HTTP proxy. --http-proxy-pass PASS The password supplied to the HTTP proxy. --jobflow JOB_FLOW_IDENTIFIER Specifies the cluster with the given cluster identifier.

Usage: View Cluster Details (p. 442), Submit Work to a Cluster (p. 517), Resize a Running Cluster (p. 508) , Change the Number of Spot Instances in a Cluster (p. 47) -j JOB_FLOW_IDENTIFIER Specifies the cluster with the given cluster identifier.

Usage: View Cluster Details (p. 442), Submit Work to a Cluster (p. 517), Resize a Running Cluster (p. 508) , Change the Number of Spot Instances in a Cluster (p. 47) --log-uri Specifies the Amazon S3 bucket to receive log files. Used with --create.

Usage: View HBase Log Files (p. 348) --private-key PRIVATE_KEY Specifies the AWS private key to use when contacting Amazon EMR. --trace Traces commands made to the web service.

API Version 2009-03-31 633 Amazon Elastic MapReduce Developer Guide Uncommon Options

--verbose Turns on verbose logging of program interaction. --version Displays the version of the CLI.

Usage: Command Line Interface Releases (p. 644)

Uncommon Options

--apps-path APPLICATION_PATH Specifies the Amazon S3 path to the base of the Amazon EMR bucket to use, for example: s3://elasticmapreduce. --endpoint ENDPOINT Specifies the Amazon EMR endpoint to connect to. --debug Prints stack traces when exceptions occur.

Options Common to All Step Types

--no-wait Don't wait for the master node to start before executing SCP or SSH, or assigning an elastic IP address. --key-pair-file FILE_PATH The path to the local PEM file of the Amazon EC2 key pair to set as the connection credential when you launch the cluster.

Short Options

-a ACCESS_ID Sets the AWS access identifier. -c CREDENTIALS_FILE Specifies the credentials file that contains the AWS access identifier and the AWS private key to use when contacting Amazon EMR.

For CLI access, you need an access key ID and secret access key. Use IAM user access keys instead of AWS root account access keys. IAM lets you securely control access to AWS services and resources in your AWS account. For more information about creating access keys, see How Do I Get Security Credentials? in the AWS General Reference. -h Displays help information from the CLI. -j JOB_FLOW_IDENTIFIER Specifies the cluster with the given cluster identifier.

Usage: View Cluster Details (p. 442), Submit Work to a Cluster (p. 517), Resize a Running Cluster (p. 508), Change the Number of Spot Instances in a Cluster (p. 47) -p PRIVATE_KEY Specifies the AWS private key to use when contacting Amazon EMR. -v Turns on verbose logging of program interaction.

API Version 2009-03-31 634 Amazon Elastic MapReduce Developer Guide Adding and Modifying Instance Groups

Adding and Modifying Instance Groups

--add-instance-group INSTANCE_ROLE Adds an instance group to an existing cluster. The role may be task only.

Usage: Resize a Running Cluster (p. 508), Change the Number of Spot Instances in a Cluster (p. 47) --modify-instance-group INSTANCE_GROUP_ID Modifies an existing instance group.

Usage: Resize a Running Cluster (p. 508), Change the Number of Spot Instances in a Cluster (p. 47)

Adding JAR Steps to Job Flows

--jar JAR_FILE_LOCATION Specifies the location of a Java archive (JAR) file. Typically, the JAR file is stored in an Amazon S3 bucket.

Usage: Resize a Running Cluster (p. 508), Distributed Copy Using S3DistCp (p. 389) --main-class Specifies the JAR file's main class. This parameter is not needed if your JAR file has a manifest.

Usage: Submit Work to a Cluster (p. 517)

Adding JSON Steps to Job Flows

--json JSON_FILE Adds a sequence of steps stored in the specified JSON file to the cluster. --param VARIABLE=VALUE ARGS Substitutes the string VARIABLE with the string VALUE in the JSON file.

Adding Streaming Steps to Job Flows

--cache FILE_LOCATION#NAME_OF_FILE_IN_CACHE Adds an individual file to the distributed cache.

Usage: Import files with Distributed Cache (p. 130) --cache-archive LOCATION#NAME_OF_ARCHIVE Adds an archive file to the distributed cache

Usage: Import files with Distributed Cache (p. 130) --ec2-instance-ids-to-terminate INSTANCE_ID Use with --terminate and --modify-instance-group to specify the instances in the core and task instance groups to terminate. This allows you to shrink the number of core instances by terminating specific instances of your choice rather than those chosen by Amazon EMR. --input LOCATION_OF_INPUT_DATA Specifies the input location for the cluster.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200)

API Version 2009-03-31 635 Amazon Elastic MapReduce Developer Guide Assigning an Elastic IP Address to the Master Node

--instance-count INSTANCE_COUNT Sets the count of nodes for an instance group.

Usage: Resize a Running Cluster (p. 508), Change the Number of Spot Instances in a Cluster (p. 40) --instance-type INSTANCE_TYPE Sets the type of EC2 instance to create nodes for an instance group.

Usage: Resize a Running Cluster (p. 508), Launch Spot Instances in a Cluster (p. 40) --jobconf KEY=VALUE Specifies jobconf arguments to pass to a streaming cluster, for example mapred.task.timeout=800000. --mapper LOCATION_OF_MAPPER_CODE The name of a Hadoop built-in class or the location of a mapper script.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200) --output LOCATION_OF_JOB_FLOW_OUTPUT Specifies the output location for the cluster.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200) --reducer REDUCER The name of a Hadoop built-in class or the location of a reducer script.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200) --stream Used with --create and --arg to launch a streaming cluster. Note The --arg option must immediately follow the --stream option.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200)
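For example, the following sketch combines these options to launch a streaming cluster. The mapper and input shown are the public Amazon EMR word count sample paths, and the output location is a placeholder you must replace with a bucket you own:

./elastic-mapreduce --create --stream --name "Streaming word count" --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/wordcount/output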

Assigning an Elastic IP Address to the Master Node

--eip ELASTIC_IP Associates an elastic IP address to the master node. If no elastic IP address is specified, the CLI allocates a new elastic IP address and associates it to the master node. For more information, see Associate an Elastic IP Address with a Cluster (p. 526).

Contacting the Master Node

--get SOURCE Copies the specified file from the master node using SCP. --logs Displays the step logs for the step most recently executed. --put SOURCE Copies a file to the master node using SCP. --scp FILE_TO_COPY Copies a file from your local directory to the master node of the cluster.

Usage: Submit Work to a Cluster (p. 517)

API Version 2009-03-31 636 Amazon Elastic MapReduce Developer Guide Creating Job Flows

--socks Uses SSH to create a tunnel to the master node of the specified cluster. You can then use this as a SOCKS proxy to view web interfaces hosted on the master node.

Usage: Option 2, Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding (p. 491), Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node (p. 494) --ssh COMMAND Uses SSH to connect to the master node of the specified cluster and, optionally, run a command. This option requires that you have an SSH client, such as OpenSSH, installed on your desktop.

Usage: Connect to the Master Node Using SSH (p. 481) --to DESTINATION Specifies the destination location when copying files to and from the master node using SCP.
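For example, the following sketches open an SSH session to the master node and copy a local file to it; cluster-id and myfile.jar are placeholders for your cluster identifier and file:

./elastic-mapreduce -j cluster-id --ssh

./elastic-mapreduce -j cluster-id --scp myfile.jar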

Creating Job Flows

--alive Used with --create to launch a cluster that continues running even after completing all its steps. Interactive clusters require this option.

Usage: Submit Work to a Cluster (p. 517) --ami-version AMI_VERSION Used with --create to specify the version of the AMI to use when launching the cluster.This setting also determines the version of Hadoop to install, because the --hadoop-version parameter is no longer supported.

Usage: Choose an Amazon Machine Image (AMI) (p. 53) --availability-zone AVAILABILITY_ZONE The Availability Zone to launch the cluster in. For more information about Availability Zones supported by Amazon EMR, see Regions and Endpoints in the Amazon Web Services General Reference. --bid-price BID_PRICE The bid price, in U.S. dollars, for a group of Spot Instances.

Usage: Launch Spot Instances in a Cluster (p. 40) --create Launches a new cluster.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200), Launch a Cluster and Submit Hive Work (p. 260), Launch a Cluster and Submit Pig Work (p. 307), Launch a Cluster and Submit a Cascading Step (p. 210), Install HBase on an Amazon EMR Cluster (p. 318) --hadoop-version VERSION Specify the version of Hadoop to install. --info INFO Specifies additional information during cluster creation. --instance-group INSTANCE_GROUP_TYPE Sets the instance group type. A type is MASTER, CORE, or TASK.

Usage: Resize a Running Cluster (p. 508) --jobflow-role IAM_ROLE_NAME Launches the EC2 instances of a cluster with the specified IAM role.

Configure IAM Roles for Amazon EMR (p. 150) --service-role IAM_ROLE_NAME Launches the Amazon EMR service with the specified IAM role.

API Version 2009-03-31 637 Amazon Elastic MapReduce Developer Guide HBase Options

Configure IAM Roles for Amazon EMR (p. 150) --key-pair KEY_PAIR_PEM_FILE The name of the Amazon EC2 key pair to set as the connection credential when you launch the cluster. --master-instance-type INSTANCE_TYPE The type of EC2 instances to launch as the master nodes in the cluster.

Usage: Build Binaries Using Amazon EMR (p. 193) --name "JOB_FLOW_NAME" Specifies a name for the cluster. This can only be set when the jobflow is created.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200), Launch a Cluster and Submit Hive Work (p. 260), Launch a Cluster and Submit Pig Work (p. 307), Launch a Cluster and Submit a Cascading Step (p. 210), Install HBase on an Amazon EMR Cluster (p. 318) --num-instances NUMBER_OF_INSTANCES Used with --create and --modify-instance-group to specify the number of EC2 instances in the cluster.

Usage: Launch a Cluster and Submit a Streaming Step (p. 200), Launch a Cluster and Submit Hive Work (p. 260), Launch a Cluster and Submit Pig Work (p. 307), Launch a Cluster and Submit a Cascading Step (p. 210), Install HBase on an Amazon EMR Cluster (p. 318), Change the Number of Spot Instances in a Cluster (p. 47) --plain-output Returns the cluster identifier from the create step as simple text. --region REGION Specifies the region in which to launch the cluster.

Usage: Choose an AWS Region (p. 30) --slave-instance-type The type of EC2 instances to launch as the slave nodes in the cluster. --subnet EC2-SUBNET_ID Launches a cluster in an Amazon VPC subnet.

Usage: Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166) --visible-to-all-users BOOLEAN Makes the instances in an existing cluster visible to all IAM users of the AWS account that launched the cluster.

Usage: Configure IAM User Permissions (p. 144) --with-supported-products PRODUCT Installs third-party software on an Amazon EMR cluster; for example, installing a third-party distribution of Hadoop. It accepts optional arguments for the third-party software to read and act on. It is used with --create to launch the cluster that can use the specified third-party applications. The 2013-03-19 and newer versions of the Amazon EMR CLI accept optional arguments using the --args parameter. --with-termination-protection Used with --create to launch the cluster with termination protection enabled.

Usage: Managing Cluster Termination (p. 503)

HBase Options

--backup-dir BACKUP_LOCATION The directory where an HBase backup exists or should be created.

API Version 2009-03-31 638 Amazon Elastic MapReduce Developer Guide HBase Options

Usage: Back Up and Restore HBase (p. 326) --backup-version VERSION_NUMBER Specifies the version number of an existing HBase backup to restore.

Usage: Back Up and Restore HBase (p. 326) --consistent Pauses all write operations to the HBase cluster during the backup process, to ensure a consistent backup.

Usage: Back Up and Restore HBase (p. 326) --full-backup-time-interval INTERVAL An integer that specifies the period of time units to elapse between automated full backups of the HBase cluster.

Usage: Back Up and Restore HBase (p. 326) --full-backup-time-unit TIME_UNIT The unit of time to use with --full-backup-time-interval to specify how often automatically scheduled HBase backups should run. This can take any one of the following values: minutes, hours, days.

Usage: Back Up and Restore HBase (p. 326) --hbase Used to launch an HBase cluster.

Usage: Install HBase on an Amazon EMR Cluster (p. 318) --hbase-backup Creates a one-time backup of HBase data to the location specified by --backup-dir.

Usage: Back Up and Restore HBase (p. 326) --hbase-restore Restores a backup from the location specified by --backup-dir and (optionally) the version specified by --backup-version.

Usage: Back Up and Restore HBase (p. 326) --hbase-schedule-backup Schedules an automated backup of HBase data.

Usage: Back Up and Restore HBase (p. 326) --incremental-backup-time-interval TIME_INTERVAL An integer that specifies the period of time units to elapse between automated incremental backups of the HBase cluster. Used with --hbase-schedule-backup, this parameter creates regularly scheduled incremental backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --incremental-backup-time-unit.

Usage: Back Up and Restore HBase (p. 326) --incremental-backup-time-unit TIME_UNIT The unit of time to use with --incremental-backup-time-interval to specify how often automatically scheduled incremental HBase backups should run. This can take any one of the following values: minutes, hours, days.

Usage: Back Up and Restore HBase (p. 326) --disable-full-backups Turns off scheduled full HBase backups by passing this flag into a call with --hbase-schedule-backup.

API Version 2009-03-31 639 Amazon Elastic MapReduce Developer Guide Hive Options

Usage: Back Up and Restore HBase (p. 326) --disable-incremental-backups Turns off scheduled incremental HBase backups by passing this flag into a call with --hbase-schedule-backup.

Usage: Back Up and Restore HBase (p. 326) --start-time START_TIME Specifies the time that an HBase backup schedule should start. If this is not set, the first backup begins immediately. This should be in ISO date-time format. You can use this to ensure your first data load process has completed before performing the initial backup or to have the backup occur at a specific time each day.

Usage: Back Up and Restore HBase (p. 326)
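For example, the following sketch schedules weekly full backups and daily incremental backups for an existing HBase cluster; cluster-id, the interval values, and the backup location are placeholders:

./elastic-mapreduce -j cluster-id --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/hbase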

Hive Options

--hive-interactive Used with --create to launch a cluster with Hive installed.

Usage: Using Hive Interactively or in Batch Mode (p. 259) --hive-script HIVE_SCRIPT_LOCATION The Hive script to run in the cluster.

Usage: Using Hive Interactively or in Batch Mode (p. 259) --hive-site HIVE_SITE_LOCATION Installs the configuration values in hive-site.xml in the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.

Usage: Create a Hive Metastore Outside the Cluster (p. 268), Additional Features of Hive in Amazon EMR (p. 239) --hive-versions HIVE_VERSIONS The Hive version or versions to load. This can be a Hive version number or "latest" to load the latest version. When you specify more than one Hive version, separate the versions with a comma.

Usage: Supported Hive Versions (p. 247)
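For example, the following sketch launches a long-running interactive cluster with a specific Hive version; the cluster name is a placeholder, and 0.11.0.2 is one of the versions listed in Supported Hive Versions (p. 247):

./elastic-mapreduce --create --alive --name "Hive cluster" --hive-interactive --hive-versions 0.11.0.2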

Impala Options

--impala-conf OPTIONS
Use with the --create and --impala-interactive options to provide command-line parameters for Impala to parse.

The parameters are key/value pairs in the format "key1=value1,key2=value2,...". For example, to set the Impala start-up options IMPALA_BACKEND_PORT and IMPALA_MEM_LIMIT, use the following command:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --impala-conf "IMPALA_BACKEND_PORT=22001,IMPALA_MEM_LIMIT=70%"

--impala-interactive
Use with the --create option to launch an Amazon EMR cluster with Impala installed.

Usage: Launch the Cluster (p. 276)


--impala-output PATH
Use with the --impala-script option to store Impala script output to an Amazon S3 bucket using the syntax --impala-output s3-path.

--impala-script [SCRIPT]
Use with the --create option to add a step to a cluster to run an Impala query file stored in Amazon S3 using the syntax --impala-script s3-path. For example:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-script s3://my-bucket/script-name.sql --impala-output s3://my-bucket/ --impala-conf "IMPALA_MEM_LIMIT=50%"

When using --impala-script with --create, the --impala-version and --impala-conf options will also function. It is acceptable, but unnecessary, to use --impala-interactive and --impala-script in the same command when creating a cluster. The effect is equivalent to using --impala-script alone.

Alternatively, you can add a step to an existing cluster, but you must already have installed Impala on the cluster. For example:

./elastic-mapreduce -j cluster-id --impala-script s3://my-bucket/script---.sql --impala-output s3://my-bucket/

If you try to use --impala-script to add a step to a cluster where Impala is not installed, you will get an error message similar to Error: Impala is not installed.

--impala-version IMPALA_VERSION
The version of Impala to be installed.
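As a further illustration, the following sketch pins the Impala version when launching an interactive cluster; Impala 1.2.1 is shown here only as an example version:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --impala-version 1.2.1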

Listing and Describing Job Flows

--active
Modifies a command to apply only to clusters in the RUNNING, STARTING, or WAITING states. Used with --list.
Usage: View Cluster Details (p. 442)

--all
Modifies a command to apply to all clusters, regardless of status. Used with --list, it lists all the clusters created in the last two weeks.

--created-after=DATETIME
Lists all clusters created after the specified time and date, in XML date-time format.

--created-before=DATETIME
Lists all clusters created before the specified time and date, in XML date-time format.

--describe
Returns information about the specified cluster or clusters.
Usage: View Cluster Details (p. 442)

--list
Lists clusters created in the last two days.
Usage: View Cluster Details (p. 442)

--no-steps
Prevents the CLI from listing steps when listing clusters.

--print-hive-version
Prints the version of Hive that is currently active on the cluster.

Usage: Supported Hive Versions (p. 247)


--state JOB_FLOW_STATE
Specifies the state of the cluster. The cluster state is one of the following values: STARTING, RUNNING, WAITING, TERMINATED.

Usage: View Cluster Details (p. 442)
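For example, the following sketch lists only active clusters (those in the RUNNING, STARTING, or WAITING states) and omits their steps:

./elastic-mapreduce --list --active --no-steps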

Passing Arguments to Steps

--arg ARG
Passes in a single argument value to a script or application running on the cluster.
Note: When used in a Hadoop streaming cluster, if you use --arg options, they must immediately follow the --stream option.
Usage: Launch a Cluster and Submit Hive Work (p. 260), Launch a Cluster and Submit Pig Work (p. 307), Launch a Cluster and Submit a Cascading Step (p. 210), Submit Work to a Cluster (p. 517), Create Bootstrap Actions to Install Additional Software (Optional) (p. 110)

--args ARG1,ARG2,ARG3,...
Passes in multiple arguments, separated by commas, to a script or application running on the cluster. This is a shorthand for specifying multiple --arg options. The --args option does not support escaping for the comma character (,). To pass arguments containing the comma character (,), use the --arg option, which does not treat the comma as a separator. The argument string may be surrounded with double quotes. In addition, you can use double quotes when passing arguments containing whitespace characters.
Note: When used in a Hadoop streaming cluster, if you use the --args option, it must immediately follow the --stream option.
Usage: Launch a Cluster and Submit Hive Work (p. 260), Launch a Cluster and Submit Pig Work (p. 307), Launch a Cluster and Submit a Cascading Step (p. 210), Submit Work to a Cluster (p. 517), Create Bootstrap Actions to Install Additional Software (Optional) (p. 110)

--step-action
Specifies the action the cluster should take when the step finishes. This can be one of CANCEL_AND_WAIT, TERMINATE_JOB_FLOW, or CONTINUE.

--step-name
Specifies a name for a cluster step.
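As an illustration, the following two sketches are equivalent; each passes the same two placeholder arguments to a custom JAR step (using the --jar option for custom JAR steps), first with repeated --arg options and then with the --args shorthand:

./elastic-mapreduce --create --jar s3://mybucket/wordcount.jar --arg s3://mybucket/input --arg s3://mybucket/output

./elastic-mapreduce --create --jar s3://mybucket/wordcount.jar --args s3://mybucket/input,s3://mybucket/output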

Pig Options

--pig-interactive
Used with --create to launch a cluster with Pig installed.
Usage: Launch a Cluster and Submit Pig Work (p. 307)

--pig-script PIG_SCRIPT_LOCATION
The Pig script to run in the cluster.
Usage: Launch a Cluster and Submit Pig Work (p. 307)

--pig-versions VERSION
Specifies the version or versions of Pig to install on the cluster. If specifying more than one version of Pig, separate the versions with commas.

Usage: Supported Pig Versions (p. 300)
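For example, the following sketch (the cluster name is a placeholder) launches an interactive cluster with Pig 0.11.1, one of the versions listed in Supported Pig Versions (p. 300):

./elastic-mapreduce --create --alive --name "Pig cluster" --pig-interactive --pig-versions 0.11.1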


Specific Steps

--enable-debugging
Used with --create to launch a cluster with debugging enabled.
Usage: Configure Logging and Debugging (Optional) (p. 160)

--resize-jobflow
Adds a step to resize the cluster.

--script SCRIPT_LOCATION
Specifies the location of a script. Typically, the script is stored in an Amazon S3 bucket.

--wait-for-steps
Causes the cluster to wait until a step has completed.

Usage: Submit Work to a Cluster (p. 517)
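For example, the following sketch launches a long-running cluster with debugging enabled; the log bucket is a placeholder, and the --log-uri option supplies the Amazon S3 location for log files, which enabling debugging requires:

./elastic-mapreduce --create --alive --enable-debugging --log-uri s3://mybucket/logs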

Specifying Bootstrap Actions

--bootstrap-action LOCATION_OF_bootstrap_ACTION_SCRIPT
Used with --create to specify a bootstrap action to run when the cluster launches. The location of the bootstrap action script is typically a location in Amazon S3. You can add more than one bootstrap action to a cluster.
Usage: Create Bootstrap Actions to Install Additional Software (Optional) (p. 110)

--bootstrap-name bootstrap_NAME
Sets the name of the bootstrap action.

Usage: Create Bootstrap Actions to Install Additional Software (Optional) (p. 110)
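For example, the following sketch runs a hypothetical bootstrap action script stored in Amazon S3 and assigns it a readable name; both the script location and the name are placeholders:

./elastic-mapreduce --create --alive --bootstrap-action s3://mybucket/scripts/install-extra-tools.sh --bootstrap-name "Install extra tools"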

Tagging

--tag
Manages tags associated with Amazon EMR resources.

Usage: Tagging Amazon EMR Clusters (p. 172)
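For example, the following sketch adds a tag when creating a cluster, assuming the key=value form described in Tagging Amazon EMR Clusters (p. 172); the key and value shown are placeholders:

./elastic-mapreduce --create --alive --tag "costCenter=marketing"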

Terminating job flows

--set-termination-protection TERMINATION_PROTECTION_STATE
Enables or disables termination protection on the specified cluster or clusters. To enable termination protection, set this value to true. To disable termination protection, set this value to false.
Usage: Managing Cluster Termination (p. 503)

--terminate
Terminates the specified cluster or clusters.

Usage: Terminate a Cluster (p. 499)
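For example, the following sketches (the cluster ID is a placeholder) first disable termination protection on a cluster and then terminate it:

./elastic-mapreduce --set-termination-protection false -j j-EXAMPLECLUSTERID

./elastic-mapreduce --terminate -j j-EXAMPLECLUSTERID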


Command Line Interface Releases

The following table lists the releases and changes in CLI versions. The Amazon EMR CLI uses the release date as its version number.

Release Date Description

2014-05-15 Adds --ec2-instance-ids-to-terminate option. Adds support for Signature Version 4 signing with CLI. Fixes a security issue.

2013-12-10 Adds support for Impala on Amazon EMR. For more information, see Analyze Data with Impala (p. 274).

2013-12-02 Adds support for Amazon EMR tags. For more information, see Tagging Amazon EMR Clusters (p. 172).

2013-10-07 Replaces the versioned HBase path with a version-less symlink so that HBase can be installed and used for both Hadoop 1.x and Hadoop 2.x.

2013-07-08 Fixes a bug that ignores any hard-coded Pig version number and incorrectly uses the latest Pig version.

2013-03-19 Improved support for launching clusters on third-party applications with a new --supported-product parameter that accepts custom user arguments.

2012-12-17 Adds support for IAM roles.

2012-09-18 Adds support for setting the visibility of clusters for IAM users with the --visible-to-all-users and --set-visible-to-all-users flags.

2012-08-22 Improved SSL certificate verification.

2012-07-30 Adds support for Hadoop 1.0.3.

2012-07-09 Adds support for specifying the major and minor AMI version and automatically getting the AMI that matches those specifications and contains the latest patches.

2012-06-12 Adds support for HBase and MapR.

2012-04-09 Adds support for Pig 0.9.1, Pig versioning, and Hive 0.7.1.4.

2012-03-13 Adds support for Hive 0.7.1.3.

2012-02-28 Adds support for Hive 0.7.1.2.

2011-12-08 Adds support for Amazon Machine Image (AMI) versioning, Hadoop 0.20.205, Hive 0.7.1, and Pig 0.9.1. The default AMI version is the latest AMI version available.

2011-11-30 Fixes support for Amazon Elastic IP addresses.

2011-08-08 Adds support for running a cluster on Spot Instances.

2011-01-24 Fixes bugs in the --json command processing and the list option.


2011-12-08 Adds support for Hive 0.7.

2011-11-11 Fixes issues in the processing of pig and hive arguments and the --main-class argument to the custom jar step.

2011-10-19 Adds support for resizing running clusters. Substantially reworked argument processing to be more consistent and unit testable.

2011-09-16 Adds support for fetching files from Amazon EMR.

2011-06-02 Adds support for Hadoop 0.20, Hive 0.5 and Pig 0.6.

2011-04-07 Adds support for bootstrap actions.

To display the version of the Amazon EMR CLI currently installed

· In the directory where you installed the Amazon EMR CLI, run the following commands from the command line:

· Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --version

· Windows users:

ruby elastic-mapreduce --version

If the CLI is correctly installed and the credentials properly configured, the CLI should display its version number represented as a date. The output should look similar to the following:

Version 2012-12-17


Document History

The following table describes the important changes to the documentation since the last release of Amazon Elastic MapReduce (Amazon EMR).

API version: 2009-03-31

Latest documentation update: September 17, 2014

Change (Release Date): Description

Consistent view (September 17, 2014): Amazon EMR supports EMRFS consistent view. For more information, see the section called "Consistent View" (p. ?).

AMIs 2.4.8, 3.1.2, and 3.2.1 (September 16, 2014): Amazon EMR supports these new images. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

AMI 3.1.1 (August 15, 2014): Amazon EMR supports AMI 3.1.1. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

AMI 2.4.7 (July 30, 2014): Amazon EMR supports AMI 2.4.7. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

AMI 2.4.6 (May 15, 2014): Amazon EMR supports AMI 2.4.6. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

AMI 3.1.0 (May 15, 2014): Amazon EMR supports AMI 3.1.0. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

AMI 2.4.5 (March 27, 2014): Amazon EMR supports AMI 2.4.5. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Elastic Load Balancing Access Logs with Amazon EMR (March 6, 2014): Added a tutorial for processing access logs produced by Elastic Load Balancing. For more information, see Analyze Elastic Load Balancing Log Data (p. 428).

AMI 3.0.4 (February 20, 2014): Amazon EMR supports AMI 3.0.4 and a connector for Amazon Kinesis. For more information, see AMI Versions Supported in Amazon EMR (p. 60).


AMI 3.0.3 (February 11, 2014): Amazon EMR supports AMI 3.0.3. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Hive 0.11.0.2 (February 11, 2014): Amazon EMR supports Hive 0.11.0.2. For more information, see Supported Hive Versions (p. 247).

Impala 1.2.1 (December 12, 2013): Amazon EMR supports Impala 1.2.1 with Hadoop 2. For more information, see Analyze Data with Impala (p. 274).

AMI 3.0.2 (December 12, 2013): Amazon EMR supports AMI 3.0.2. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Amazon EMR tags (December 5, 2013): Amazon EMR supports tagging on Amazon EMR clusters. For more information, see Tagging Amazon EMR Clusters (p. 172).

CLI version 2013-12-02 (December 5, 2013): Adds support for Amazon EMR tags. For more information, see Command Line Interface Releases (p. 644).

AMI 3.0.1 (November 8, 2013): Amazon EMR supports AMI 3.0.1. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

New Amazon EMR console (November 6, 2013): A new management console is available for Amazon EMR. The new console is much faster and has powerful new features, including:

· Resizing a running cluster (that is, adding or removing instances)
· Cloning the launch configurations for running or terminated clusters
· Hadoop 2 support, including custom Amazon CloudWatch metrics
· Targeting specific Availability Zones
· Creating clusters with IAM roles
· Submitting multiple steps (before and after cluster creation)
· New console help portal with integrated documentation search

MapR 3.0.2 (November 6, 2013): Amazon EMR supports MapR 3.0.2. For more information, see Using the MapR Distribution for Hadoop (p. 181).

Hadoop 2.2.0 (October 29, 2013): Amazon EMR supports Hadoop 2.2.0. For more information, see Hadoop 2.2.0 New Features (p. 89).

AMI 3.0.0 (October 29, 2013): Amazon EMR supports AMI 3.0.0. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

CLI version 2013-10-07 (October 7, 2013): Maintenance update for the Amazon EMR CLI. For more information, see Command Line Interface Releases (p. 644).

AMI 2.4.2 (October 7, 2013): Amazon EMR supports AMI 2.4.2. For more information, see AMI Versions Supported in Amazon EMR (p. 60).


AMI 2.4.1 (August 20, 2013): Amazon EMR supports AMI 2.4.1. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Hive 0.11.0.1 (August 2, 2013): Amazon EMR supports Hive 0.11.0.1. For more information, see Supported Hive Versions (p. 247).

Hive 0.11.0 (August 1, 2013): Amazon EMR supports Hive 0.11.0. For more information, see Supported Hive Versions (p. 247).

Pig 0.11.1.1 (August 1, 2013): Amazon EMR supports Pig 0.11.1.1. For more information, see Supported Pig Versions (p. 300).

AMI 2.4 (August 1, 2013): Amazon EMR supports AMI 2.4. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

MapR 2.1.3 (August 1, 2013): Amazon EMR supports MapR 2.1.3. For more information, see Using the MapR Distribution for Hadoop (p. 181).

MapR M7 Edition (July 11, 2013): Amazon EMR supports MapR M7 Edition. For more information, see Using the MapR Distribution for Hadoop (p. 181).

CLI version 2013-07-08 (July 11, 2013): Maintenance update to the Amazon EMR CLI version 2013-07-08. For more information, see Command Line Interface Releases (p. 644).

Pig 0.11.1 (July 1, 2013): Amazon EMR supports Pig 0.11.1. Pig 0.11.1 adds support for JDK 7, Hadoop 2, and more. For more information, see Supported Pig Versions (p. 300).

Hive 0.8.1.8 (June 18, 2013): Amazon EMR supports Hive 0.8.1.8. For more information, see Supported Hive Versions (p. 247).

AMI 2.3.6 (May 17, 2013): Amazon EMR supports AMI 2.3.6. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Hive 0.8.1.7 (May 2, 2013): Amazon EMR supports Hive 0.8.1.7. For more information, see Supported Hive Versions (p. 247).

Improved documentation organization, new table of contents, and new topics (April 29, 2013): Updated documentation organization with a restructured table of contents and many new topics for better ease of use and to accommodate customer feedback.

AMI 2.3.5 (April 26, 2013): Amazon EMR supports AMI 2.3.5. For more information, see AMI Versions Supported in Amazon EMR.

M1 Medium Amazon EC2 Instances (April 18, 2013): Amazon EMR supports m1.medium instances. For more information, see Hadoop 2.2.0 and 2.4.0 Default Configuration (p. 569).

MapR 2.1.2 (April 18, 2013): Amazon Elastic MapReduce supports MapR 2.1.2. For more information, see Using the MapR Distribution for Hadoop (p. 181).

AMI 2.3.4 (April 16, 2013): Deprecated.


AWS GovCloud (US) (April 9, 2013): Adds support for AWS GovCloud (US). For more information, see AWS GovCloud (US).

Supported Product User Arguments (March 19, 2013): Improved support for launching job flows on third-party applications with a new --supported-product CLI option that accepts custom user arguments. For more information, see Launch an Amazon EMR cluster with MapR using the console (p. 181).

Amazon VPC (March 11, 2013): Amazon Elastic MapReduce supports two platforms on which you can launch the EC2 instances of your job flow: EC2-Classic and EC2-VPC. For more information, see Amazon VPC.

AMI 2.3.3 (March 1, 2013): Amazon Elastic MapReduce supports AMI 2.3.3. For more information, see AMI Versions Supported in Amazon EMR.

High I/O Instances (February 14, 2013): Amazon Elastic MapReduce supports hi1.4xlarge instances. For more information, see Hadoop 2.2.0 and 2.4.0 Default Configuration (p. 569).

AMI 2.3.2 (February 7, 2013): Amazon Elastic MapReduce supports AMI 2.3.2. For more information, see AMI Versions Supported in Amazon EMR.

New introduction and tutorial (January 9, 2013): Added sections that describe Amazon EMR and a tutorial that walks you through your first streaming cluster. For more information, see What is Amazon EMR? (p. 1) and Get Started: Count Words with Amazon EMR (p. 14).

CLI Reference (January 8, 2013): Added CLI reference. For more information, see Command Line Interface Reference for Amazon EMR (p. 626).

AMI 2.3.1 (December 24, 2012): Amazon Elastic MapReduce supports AMI 2.3.1. For more information, see AMI Versions Supported in Amazon EMR.

High Storage Instances (December 20, 2012): Amazon Elastic MapReduce supports hs1.8xlarge instances. For more information, see Hadoop 2.2.0 and 2.4.0 Default Configuration (p. 569).

IAM Roles (December 20, 2012): Amazon Elastic MapReduce supports IAM Roles. For more information, see Configure IAM Roles for Amazon EMR (p. 150).

Hive 0.8.1.6 (December 20, 2012): Amazon Elastic MapReduce supports Hive 0.8.1.6. For more information, see Supported Hive Versions (p. 247).

AMI 2.3.0 (December 20, 2012): Amazon Elastic MapReduce supports AMI 2.3.0. For more information, see AMI Versions Supported in Amazon EMR.

AMI 2.2.4 (December 6, 2012): Amazon Elastic MapReduce supports AMI 2.2.4. For more information, see AMI Versions Supported in Amazon EMR.

AMI 2.2.3 (November 30, 2012): Amazon Elastic MapReduce supports AMI 2.2.3. For more information, see AMI Versions Supported in Amazon EMR.

Hive 0.8.1.5 (November 30, 2012): Amazon Elastic MapReduce supports Hive 0.8.1.5. For more information, see Analyze Data with Hive (p. 236).


Asia Pacific (Sydney) region (November 12, 2012): Adds support for Amazon EMR in the Asia Pacific (Sydney) region.

Visible To All IAM Users (October 1, 2012): Added support for making a cluster visible to all IAM users on an AWS account. For more information, see Configure IAM User Permissions (p. 144).

Hive 0.8.1.4 (September 17, 2012): Updates the HBase client on Hive clusters to version 0.92.0 to match the version of HBase used on HBase clusters. This fixes issues that occurred when connecting to an HBase cluster from a Hive cluster.

AMI 2.2.1 (August 30, 2012):
· Fixes an issue with HBase backup functionality.
· Enables multipart upload by default for files larger than the Amazon S3 block size specified by fs.s3n.blockSize. For more information, see Configure Multipart Upload for Amazon S3 (p. 128).

AMI 2.1.4 (August 30, 2012):
· Fixes issues in the native Amazon S3 file system.
· Enables multipart upload by default. For more information, see Configure Multipart Upload for Amazon S3 (p. 128).

Hadoop 1.0.3, AMI 2.2.0, Hive 0.8.1.3, Pig 0.9.2.2 (August 6, 2012): Support for Hadoop 1.0.3. For more information, see Supported Hadoop Versions (p. 86).

AMI 2.1.3 (August 6, 2012): Fixes issues with HBase.

AMI 2.1.2 (August 6, 2012): Support for Amazon CloudWatch metrics when using MapR.

AMI 2.1.1 (July 9, 2012): Improves the reliability of log pushing, adds support for HBase in Amazon VPC, and improves DNS retry functionality.

Major-Minor AMI Versioning (July 9, 2012): Improves AMI versioning by adding support for major-minor releases. Now you can specify the major-minor version for the AMI and always have the latest patches applied. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

Hive 0.8.1.2 (July 9, 2012): Fixes an issue with duplicate data in large clusters.

S3DistCp 1.0.5 (June 27, 2012): Provides better support for specifying the version of S3DistCp to use.

Store Data with HBase (June 12, 2012): Amazon EMR supports HBase, an open source, non-relational, distributed database modeled after Google's BigTable. For more information, see Store Data with HBase (p. 316).

Launch a Cluster on the MapR Distribution for Hadoop (June 12, 2012): Amazon EMR supports MapR, an open, enterprise-grade distribution that makes Hadoop easier and more dependable. For more information, see Using the MapR Distribution for Hadoop (p. 181).


Connect to the Master Node in an Amazon EMR Cluster (June 12, 2012): Added information about how to connect to the master node using both SSH and a SOCKS proxy. For more information, see Connect to the Cluster (p. 481).

Hive 0.8.1 (May 30, 2012): Amazon Elastic MapReduce supports Hive 0.8.1. For more information, see Analyze Data with Hive (p. 236).

HParser (April 30, 2012): Added information about running Informatica HParser on Amazon EMR. For more information, see Parse Data with HParser (p. 180).

AMI 2.0.5 (April 19, 2012): Enhancements to performance and other updates. For more information, see AMI Versions Supported in Amazon EMR (p. 60).

Pig 0.9.2 (April 9, 2012): Amazon Elastic MapReduce supports Pig 0.9.2. Pig 0.9.2 adds support for user-defined functions written in Python and other improvements. For more information, see Pig Version Details (p. 304).

Pig versioning (April 9, 2012): Amazon Elastic MapReduce supports the ability to specify the Pig version when launching a cluster. For more information, see Process Data with Pig (p. 300).

Hive 0.7.1.4 (April 9, 2012): Amazon Elastic MapReduce supports Hive 0.7.1.4. For more information, see Analyze Data with Hive (p. 236).

AMI 1.0.1 (April 3, 2012): Updates sources.list to the new location of the Lenny distribution in archive.debian.org.

Hive 0.7.1.3 (March 13, 2012): Support for a new version of Hive, version 0.7.1.3. This version adds the dynamodb.retry.duration variable, which you can use to configure the timeout duration for retrying Hive queries. This version of Hive also supports setting the DynamoDB endpoint from within the Hive command-line application.

Support for IAM in the console (February 28, 2012): Support for AWS Identity and Access Management (IAM) in the Amazon EMR console. Improvements for S3DistCp and support for Hive 0.7.1.2 are also included.

Support for CloudWatch Metrics (January 31, 2012): Support for monitoring cluster metrics and setting alarms on metrics.

Support for S3DistCp (January 19, 2012): Support for distributed copy using S3DistCp.

Support for DynamoDB (January 18, 2012): Support for exporting and querying data stored in DynamoDB.

AMI 2.0.2 and Hive 0.7.1.1 (January 17, 2012): Support for Amazon EMR AMI 2.0.2 and Hive 0.7.1.1.

Cluster Compute Eight Extra Large (cc2.8xlarge) (December 21, 2011): Support for Cluster Compute Eight Extra Large (cc2.8xlarge) instances in clusters.

Hadoop 0.20.205 (December 11, 2011): Support for Hadoop 0.20.205. For more information, see Supported Hadoop Versions (p. 86).


Pig 0.9.1 (December 11, 2011): Support for Pig 0.9.1. For more information, see Supported Pig Versions (p. 300).

AMI versioning (December 11, 2011): You can now specify which version of the Amazon EMR AMI to use to launch your cluster. All EC2 instances in the cluster will be initialized with the AMI version that you specify. For more information, see Choose an Amazon Machine Image (AMI) (p. 53).

Amazon EMR clusters on Amazon VPC (December 11, 2011): You can now launch Amazon EMR clusters inside your Amazon Virtual Private Cloud (Amazon VPC) for greater control over network configuration and access. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) (p. 166).

Spot Instances (August 19, 2011): Support for launching cluster instance groups as Spot Instances added. For more information, see Lower Costs with Spot Instances (Optional) (p. 37).

Hive 0.7.1 (July 25, 2011): Support for Hive 0.7.1 added. For more information, see Supported Hive Versions (p. 247).

Termination Protection (April 14, 2011): Support for a new Termination Protection feature. For more information, see Managing Cluster Termination (p. 503).

Tagging (March 9, 2011): Support for Amazon EC2 tagging. For more information, see View Cluster Instances in Amazon EC2 (p. 454).

IAM Integration (February 21, 2011): Support for AWS Identity and Access Management. For more information, see Configure IAM User Permissions (p. 144).

Elastic IP Support (February 21, 2011): Support for Elastic IP addresses. For more information, see Associate an Elastic IP Address with a Cluster (p. 526).

Environment Configuration (February 21, 2011): Expanded sections on Environment Configuration and Performance Tuning. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Distributed Cache (February 21, 2011): For more information about using DistributedCache to upload files and libraries, see Import files with Distributed Cache (p. 130).

How to build modules using Amazon EMR (February 21, 2011): For more information, see Build Binaries Using Amazon EMR (p. 193).

Amazon S3 multipart upload (January 6, 2010): Support for Amazon S3 multipart upload through the AWS SDK for Java. For more information, see Configure Multipart Upload for Amazon S3 (p. 128).

Hive 0.70 (December 8, 2010): Support for Hive 0.70 and concurrent versions of Hive 0.5 and Hive 0.7 on the same cluster. Note: You need to update the Amazon EMR command line interface to resize running job flows and modify instance groups. For more information, see Analyze Data with Hive (p. 236).


JDBC Drivers for Hive (December 8, 2010): Support for JDBC with Hive 0.5 and Hive 0.7. For more information, see Use the Hive JDBC Driver (p. 271).

Support HPC (November 14, 2010): Support for cluster compute instances. For more information, see Virtual Server Configurations (p. 35).

Bootstrap Actions (November 14, 2010): Expanded content and samples for bootstrap actions. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) (p. 110).

Cascading clusters (November 14, 2010): Description of Cascading cluster support. For more information, see Launch a Cluster and Submit a Cascading Step (p. 210) and Process Data Using Cascading (p. 210).

Resize Running Cluster (October 19, 2010): Support for resizing a running cluster. New node types, task and core, replace the slave node. For more information, see What is Amazon EMR? (p. 1) and Resize a Running Cluster (p. 508).

Appendix: Configuration Options (October 19, 2010): Expanded information on configuration options available in Amazon EMR. For more information, see Hadoop Configuration Reference (p. 564).

Guide revision (October 19, 2010): This release features a reorganization of the Amazon Elastic MapReduce Developer Guide.
