Università degli Studi di Roma “Tor Vergata”

MapReduce and Hadoop

Course: Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), A.Y. 2016/17

Valeria Cardellini

The reference Big Data stack

[Figure: the reference Big Data stack, with the layers High-level Interfaces, Support / Integration, Data Processing, Data Storage, Resource Management]

The Big Data landscape

The open-source landscape for Big Data

Processing Big Data

• Some frameworks and platforms to process Big Data
  – Relational DBMS: old-fashioned, no longer applicable at this scale
  – MapReduce & Hadoop: store and process data sets at massive scale (especially Volume + Variety), also known as batch processing
  – Data stream processing: process fast data (in real time) as the data is being generated, without storing it

MapReduce

Parallel programming: background

• Parallel programming
  – Break processing into parts that can be executed concurrently on multiple processors
• Challenge
  – Identify tasks that can run concurrently and/or groups of data that can be processed concurrently
  – Not all problems can be parallelized!

Parallel programming: background

• Simplest environment for parallel programming
  – No dependency among data
• Data can be split into equal-size chunks
  – Each process can work on a chunk
  – Master/worker approach (sketched below)
• Master
  – Initializes the array and splits it according to the number of workers
  – Sends each worker its sub-array
  – Receives the results from each worker
• Worker
  – Receives a sub-array from the master
  – Performs processing
  – Sends results to the master
• Single Program, Multiple Data (SPMD): a technique to achieve parallelism
  – The most common style of parallel programming
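The master/worker scheme above can be sketched in a few lines of plain Java (an illustrative example only; class and variable names are made up and not part of any framework): the master splits an array into equal-size chunks, each worker thread processes its chunk, and the master collects and combines the partial results.

    import java.util.Arrays;

    public class MasterWorkerSum {

        // Worker: receives a sub-array (a range of the shared array) and sums it
        static class Worker extends Thread {
            private final int[] data;
            private final int from, to;   // chunk boundaries assigned by the master
            long partialSum;

            Worker(int[] data, int from, int to) {
                this.data = data; this.from = from; this.to = to;
            }

            @Override
            public void run() {
                for (int i = from; i < to; i++) partialSum += data[i];
            }
        }

        public static void main(String[] args) throws InterruptedException {
            // Master: initializes the array and splits it among the workers
            int[] data = new int[1_000_000];
            Arrays.fill(data, 1);
            int numWorkers = 4;
            int chunk = data.length / numWorkers;

            Worker[] workers = new Worker[numWorkers];
            for (int w = 0; w < numWorkers; w++) {
                int from = w * chunk;
                int to = (w == numWorkers - 1) ? data.length : from + chunk;
                workers[w] = new Worker(data, from, to);
                workers[w].start();                  // hand the sub-array to the worker
            }

            // Master: receives the partial results and combines them
            long total = 0;
            for (Worker w : workers) { w.join(); total += w.partialSum; }
            System.out.println("Total = " + total); // expected: 1000000
        }
    }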

Key idea behind MapReduce: Divide and conquer

• A feasible approach to tackling large-data problems
  – Partition a large problem into smaller sub-problems
  – Independent sub-problems are executed in parallel
  – Combine the intermediate results from each individual worker
• The workers can be:
  – Threads in a processor core
  – Cores in a multi-core processor
  – Multiple processors in a machine
  – Many machines in a cluster
• The implementation details of divide and conquer are complex

Divide and conquer: how?

• Decompose the original problem into smaller, parallel tasks
• Schedule tasks on workers distributed in a cluster, taking into account:
  – Data locality
  – Resource availability
• Ensure workers get the data they need
• Coordinate synchronization among workers
• Share partial results
• Handle failures

Key idea behind MapReduce: Scale out, not up!

• For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
  – The cost of super-computers is not linear
  – Datacenter efficiency is a difficult problem to solve, but there have been recent improvements
• Processing data is quick, I/O is very slow
• Sharing vs. shared nothing:
  – Sharing: manage a common/global state
  – Shared nothing: independent entities, no common state
• Sharing is difficult:
  – Synchronization, deadlocks
  – Finite bandwidth to access data from a SAN

  – Temporal dependencies are complicated (restarts)

MapReduce

• Programming model for processing huge data sets over thousands of servers
  – Originally proposed by Google in 2004: “MapReduce: simplified data processing on large clusters” http://bit.ly/2iq7jlY
  – Based on a shared-nothing approach
• Also an associated implementation (framework) of the distributed system that runs the corresponding programs
• Some examples of applications for Google:
  – Web indexing
  – Reverse Web-link graph
  – Distributed sort
  – Web access statistics

MapReduce: programmer view

• MapReduce hides system-level details
  – Key idea: separate the what from the how
  – MapReduce abstracts away the “distributed” part of the system
  – Such details are handled by the framework

• Programmers get a simple API
  – They don't have to worry about handling:
    • Parallelization
    • Data distribution
    • Load balancing
    • Fault tolerance

Typical Big Data problem

• Iterate over a large number of records
• Extract something of interest from each record
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

Key idea: provide a functional abstraction of the two Map and Reduce operations

MapReduce: model

• Processing occurs in two phases: Map and Reduce
  – Functional programming roots (e.g., Lisp)
• Map and Reduce are defined by the programmer
• Input and output: sets of key-value pairs
• Programmers specify two functions: map and reduce

• map(k1, v1) → [(k2, v2)]

• reduce(k2, [v2]) → [(k3, v3)]
  – (k, v) denotes a (key, value) pair
  – […] denotes a list
  – Keys do not have to be unique: different pairs can have the same key
  – Normally the keys of input elements are not relevant
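As a minimal sketch (illustrative only, not Hadoop's actual API; the Pair record and interface names are made up for this example), the two signatures above can be written as Java interfaces, with a List modelling […]:

    import java.util.List;

    // (k, v) pair; [...] is modelled as a List
    record Pair<K, V>(K key, V value) {}

    // map(k1, v1) -> [(k2, v2)]
    interface MapFunction<K1, V1, K2, V2> {
        List<Pair<K2, V2>> map(K1 key, V1 value);
    }

    // reduce(k2, [v2]) -> [(k3, v3)]
    interface ReduceFunction<K2, V2, K3, V3> {
        List<Pair<K3, V3>> reduce(K2 key, List<V2> values);
    }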

Map

• Execute a function on a set of key-value pairs (input shard) to create a new list of values:
    map(in_key, in_value) → list(out_key, intermediate_value)
• Map calls are distributed across machines by automatically partitioning the input data into M shards
• The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function

Reduce

• Combine values in sets to create a new value:
    reduce(out_key, list(intermediate_value)) → list(out_value)

Your first MapReduce example

• Example: sum-of-squares (sum the squares of the numbers from 1 to n) in MapReduce fashion
• Map function: map square [1,2,3,4,5]
  returns [1,4,9,16,25]
• Reduce function: reduce [1,4,9,16,25]
  returns 55 (the sum of the squared elements)
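For comparison, the same sum-of-squares computation expressed with ordinary Java streams, just to make the map/reduce analogy concrete (this is plain sequential code, no Hadoop involved):

    import java.util.List;
    import java.util.stream.Collectors;

    public class SumOfSquares {
        public static void main(String[] args) {
            List<Integer> input = List.of(1, 2, 3, 4, 5);

            // "Map": square each element -> [1, 4, 9, 16, 25]
            List<Integer> squares = input.stream().map(x -> x * x).collect(Collectors.toList());

            // "Reduce": sum the squared elements -> 55
            int sum = squares.stream().reduce(0, Integer::sum);

            System.out.println(squares + " -> " + sum);
        }
    }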

MapReduce program

• A MapReduce program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – The input data set, stored on the underlying distributed file system
    • The input will not fit on a single computer's disk

• Each MapReduce job is divided by the system into smaller units called tasks – Map tasks – Reduce tasks

• The output of a MapReduce job is also stored on the underlying distributed file system

MapReduce computation

1. Some number of Map tasks are each given one or more chunks of data from a distributed file system.
2. These Map tasks turn the chunk into a sequence of key-value pairs.
   – The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
3. The key-value pairs from each Map task are collected by a master controller and sorted by key.
4. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
5. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way.
   – The manner of combination of values is determined by the code written by the user for the Reduce function.
6. The output key-value pairs from each reducer are written persistently back onto the distributed file system.
7. The output ends up in r files, where r is the number of reducers.
   – Such output may be the input to a subsequent MapReduce phase.

Where the magic happens

• Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys
  – Intermediate data arrive at each reducer in order, sorted by the key
  – No ordering is guaranteed across reducers
• Intermediate keys are transient
  – They are not stored on the distributed file system
  – They are “spilled” to the local disk of each machine in the cluster

[Figure: input splits as (k, v) pairs → Map function → intermediate (k’, v’) pairs → Reduce function → final (k’’, v’’) output pairs]

MapReduce computation: the complete picture

A simplified view of MapReduce: example

• Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phases lies a barrier that involves a large distributed sort and group by

“Hello World” in MapReduce: WordCount

• Problem: count the number of occurrences of each word in a large collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where:
  – Keys are the words of the document and values are equal to 1: (w1, 1), (w2, 1), … , (wn, 1)
• Shuffle and sort: group by key and generate pairs of the form (w1, [1, 1, … , 1]), … , (wn, [1, 1, … , 1])
• Reduce: add up all the values and emit (w1, k), … , (wn, l)
• Output: (w, m) pairs where:
  – w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents

WordCount in practice

[Figure: the Map, Shuffle and Reduce stages of WordCount]
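To make the Map → Shuffle → Reduce flow concrete before looking at the pseudo-code, here is a toy single-process Java sketch of WordCount (purely illustrative: the shuffle is simulated with an in-memory sorted map instead of a distributed framework):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ToyWordCount {
        public static void main(String[] args) {
            List<String> documents = List.of("the quick brown fox", "the lazy dog", "the fox");

            // Map phase: emit (word, 1) for every word of every document
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String doc : documents)
                for (String word : doc.split("\\s+"))
                    intermediate.add(Map.entry(word, 1));

            // Shuffle and sort: group all intermediate values by key (sorted by key)
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : intermediate)
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

            // Reduce phase: sum the list of 1s for each word and emit (word, count)
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int count = entry.getValue().stream().mapToInt(Integer::intValue).sum();
                System.out.println(entry.getKey() + "\t" + count);
            }
        }
    }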

WordCount: Map

• Map emits each word in the document with an associated value equal to “1”

    Map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, “1”);

WordCount: Reduce

• Reduce adds up all the “1” emitted for a given word

    Reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

• This is pseudo-code; for the complete code of the example see the MapReduce paper

Another example: WordLengthCount

• Problem: count how many words of each length exist in a collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where the key is the length of a word and the value is the word itself: (i, w1), … , (j, wn)
• Shuffle and sort: group by key and generate pairs of the form (1, [w1, … , wk]), … , (n, [wr, … , ws])
• Reduce: count the number of words in each list and emit: (1, l), … , (p, m)
• Output: (l, n) pairs, where l is a length and n is the total number of words of length l in the input documents
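The same recipe, sketched as plain Java functions in the style of the abstract model above (illustrative only; the main method simulates the shuffle with an in-memory group-by, exactly as in the toy WordCount shown earlier):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordLengthCount {

        // Map: for each word of the document, emit (length of the word, the word itself)
        static List<Map.Entry<Integer, String>> map(String document) {
            List<Map.Entry<Integer, String>> out = new ArrayList<>();
            for (String word : document.split("\\s+"))
                if (!word.isEmpty())
                    out.add(Map.entry(word.length(), word));
            return out;
        }

        // Reduce: for each length, emit (length, number of words having that length)
        static Map.Entry<Integer, Integer> reduce(Integer length, List<String> words) {
            return Map.entry(length, words.size());
        }

        public static void main(String[] args) {
            List<String> docs = List.of("the quick brown fox", "a lazy dog");

            // Shuffle and sort simulated with an in-memory group-by on the intermediate key
            Map<Integer, List<String>> grouped = new HashMap<>();
            for (String doc : docs)
                for (Map.Entry<Integer, String> pair : map(doc))
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

            for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
                Map.Entry<Integer, Integer> result = reduce(e.getKey(), e.getValue());
                System.out.println(result.getKey() + "\t" + result.getValue());
            }
        }
    }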

MapReduce: execution overview

Apache Hadoop

What is Apache Hadoop?

• Open-source software framework for reliable, scalable, distributed data-intensive computing
  – Originally developed by Yahoo!
• Goal: storage and processing of data sets at massive scale
• Infrastructure: clusters of commodity hardware
• Core components:
  – HDFS: Hadoop Distributed File System
  – Hadoop YARN
  – Hadoop MapReduce
• Includes a number of related projects
  – Among which Apache Pig, Apache Hive, Apache HBase
• Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo! and many others

Hadoop runs on clusters

• Compute nodes are stored on racks
  – 8-64 compute nodes per rack
• There can be many racks of compute nodes
• The nodes on a single rack are connected by a network, typically gigabit Ethernet
• Racks are connected by another level of network or a switch
• The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication
• Compute nodes can fail! Solution:
  – Files are stored redundantly

  – Computations are divided into tasks

Hadoop runs on commodity hardware

• Pros
  – You are not tied to expensive, proprietary offerings from a single vendor
  – You can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster
  – Scale out, not up
• Cons
  – Significant failure rates
  – Solution: replication!

Commodity hardware

• A typical specification of commodity hardware:
  – Processor: 2 quad-core 2-2.5 GHz CPUs
  – Memory: 16-24 GB ECC RAM
  – Storage: 4 × 1 TB SATA disks
  – Network: Gigabit Ethernet
• Yahoo! has a huge installation of Hadoop:
  – > 100,000 CPUs in > 40,000 computers
  – Used to support research for Ad Systems and Web Search
  – Also used to run scaling tests to support the development of Hadoop

Hadoop core

• HDFS ✓ already examined

• Hadoop YARN ✓ already examined

• Hadoop MapReduce
  – System to write applications which process large data sets in parallel on large clusters (> 1000 nodes) of commodity hardware in a reliable, fault-tolerant manner


Hadoop core: MapReduce

• Programming paradigm for expressing distributed computations over multiple servers
• The powerhouse behind most of today's big data processing
• Also used in other MPP environments and NoSQL databases (e.g., Vertica and MongoDB)
• Improving Hadoop usage:
  – Improving programmability: Pig and Hive
  – Improving data access: HBase, Sqoop and Flume

MapReduce data flow: no Reduce task

MapReduce data flow: single Reduce task

MapReduce data flow: multiple Reduce tasks

Optional Combine task

• Many MapReduce jobs are limited by the cluster bandwidth
  – How can the data transferred between map and reduce tasks be minimized?
• Use a Combine task
  – An optional localized reducer that applies a user-provided method
  – Performs data aggregation on the intermediate data of the same key for the Map task's output, before transmitting the result to the Reduce task
  – Takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection pairs

Programming languages for Hadoop

• Java is the default programming language
• In Java you write a program with at least three parts:
  1. A main method, which configures the job and launches it
     • Set the number of reducers
     • Set the mapper and reducer classes
     • Set the partitioner
     • Set other Hadoop configurations
  2. A Mapper class
     • Takes (k, v) inputs, writes (k, v) outputs
  3. A Reducer class

     • Takes (k, Iterator[v]) inputs, and writes (k, v) outputs (see the sketch below)
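Below is a hedged sketch of such a three-part Java program, close to the classic WordCount example referenced in the next slides; it uses the standard Hadoop MapReduce Java API, with tokenization kept deliberately simple:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCount {

        // (2) The Mapper class: takes (offset, line) inputs, writes (word, 1) outputs
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // (3) The Reducer class: takes (word, [1, 1, ...]) inputs, writes (word, count) outputs
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        // (1) The main method: configures the job and launches it
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);      // optional local aggregation (Combine task)
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(2);                       // number of reducers (default is 1)
            job.setPartitionerClass(HashPartitioner.class); // the default partitioner, set explicitly here
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }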

Programming languages for Hadoop

• Hadoop Streaming: an API to write map and reduce functions in programming languages other than Java
• Uses Unix standard streams as the interface between Hadoop and the MapReduce program
• Allows using any language (e.g., Python) that can read standard input (stdin) and write to standard output (stdout) to write the MapReduce program
  – Standard input is the stream of data going into a program; it comes from the keyboard
  – Standard output is the stream where a program writes its output data; it is the text terminal
  – The reducer interface for streaming is different than in Java: instead of receiving reduce(k, Iterator[v]), your script is sent one line per value, including the key (see the illustration below)
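Streaming is normally used with scripting languages such as Python (see the WordCount slide below), but the protocol itself is just line-oriented text on stdin/stdout. Purely to illustrate that protocol, here is a streaming-style WordCount reducer written in Java: it assumes the usual tab-separated “key<TAB>value” lines, already sorted by key, one line per value.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StreamingWordCountReducer {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String currentWord = null;
            int currentCount = 0;

            String line;
            while ((line = in.readLine()) != null) {
                // Each input line is "key<TAB>value"; lines arrive sorted by key
                String[] parts = line.split("\t", 2);
                String word = parts[0];
                int count = Integer.parseInt(parts[1].trim());

                if (word.equals(currentWord)) {
                    currentCount += count;          // same key: keep accumulating
                } else {
                    if (currentWord != null)        // key changed: emit the previous total
                        System.out.println(currentWord + "\t" + currentCount);
                    currentWord = word;
                    currentCount = count;
                }
            }
            if (currentWord != null)                // emit the last key
                System.out.println(currentWord + "\t" + currentCount);
        }
    }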

WordCount on Hadoop

• Let's analyze the WordCount code on Hadoop
• The code in Java: http://bit.ly/1m9xHym

• The same code in Python
  – Uses the Hadoop Streaming API for passing data between the Map and Reduce code via standard input and standard output
  – See Michael Noll's tutorial http://bit.ly/19YutB1

Hadoop installation

• Some straightforward alternatives
  – A tutorial on manual single-node installation to run Hadoop on Ubuntu Linux: http://bit.ly/1eMFVFm
  – Cloudera CDH, Hortonworks HDP, MapR
    • Integrated Hadoop-based stacks containing all the components needed for production, tested and packaged to work together
    • Include at least Hadoop, YARN, HDFS (not in MapR), HBase, Hive, Pig, Sqoop, Flume, ZooKeeper, Kafka, Spark

Hadoop installation

• Cloudera CDH
  – See QuickStarts: http://bit.ly/2oMHob4
  – Single-node cluster: requires a previously installed hypervisor (VMware, KVM, or VirtualBox)
  – Multi-node cluster: requires a previously installed Docker (container technology, https://www.docker.com)
  – QuickStarts are not for use in production
• Hortonworks HDP
  – Requires 6.5 GB of disk space and a Linux distribution
  – Can be installed either manually or using Ambari (an open-source management platform for Hadoop)
  – Hortonworks Sandbox on a VM
    • Requires either VirtualBox, VMware or Docker
• MapR

  – Own distributed file system rather than HDFS

Recall the YARN architecture

• Global ResourceManager (RM)
• A set of per-application ApplicationMasters (AMs)
• A set of NodeManagers (NMs)

YARN job execution

Data locality optimization

• Hadoop tries to run each map task on a data-local node (data locality optimization)
  – So as not to use cluster bandwidth
  – Otherwise rack-local
  – Off-rack as the last choice

Hadoop configuration

• Tuning Hadoop clusters for good performance is something of a black art
  – It involves the HW systems, the SW systems (including the OS), and the tuning/optimization of Hadoop infrastructure components, e.g.:
    • The optimal number of disks that maximizes I/O bandwidth
    • BIOS parameters
    • File system type (on Linux, EXT4 is better)
    • Open file handles
    • The optimal HDFS block size
    • JVM settings for Java heap usage and garbage collection

See http://bit.ly/2nT7TIn for some detailed hints

How many mappers and reducers

• Can be set in a configuration file (JobConfFile) or during command-line execution
  – For mappers, it is just a hint
• Number of mappers
  – Driven by the number of HDFS blocks in the input files
  – You can adjust the HDFS block size to adjust the number of maps
• Number of reducers
  – Can be user-defined (the default is 1)

See http://bit.ly/2oK0D5A

Hadoop in the Cloud

• Pros:
  – Gain Cloud scalability and elasticity
  – No need to manage and provision the infrastructure and the platform
• Main challenges:
  – Moving data to the Cloud
    • Latency is not zero (because of the speed of light)!
    • Minor issue: network bandwidth
  – Data security and privacy

Amazon Elastic MapReduce (EMR)

• Distributes the computational work across a cluster of virtual servers running in the Amazon cloud (EC2 instances)
• Cluster managed with Hadoop
• Input and output: Amazon S3, DynamoDB
• Not only Hadoop: also Hive, Pig and Spark

Create EMR cluster

Steps:
1. Provide cluster name
2. Select EMR release and applications to install
3. Select instance type
4. Select number of instances
5. Select EC2 key pair to connect securely to the cluster
See http://amzn.to/2nnujp5

EMR cluster details

Hadoop on Google Cloud Platform

• Distributes the computational work across a cluster of virtual servers running in the Google cloud
• Cluster managed with Hadoop
• Input and output: Google Cloud Storage, HDFS
• Not only Hadoop: also Spark

References

• Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004. http://static.googleusercontent.com/media/research.google.com/it//archive/mapreduce-osdi04.pdf
• White, “Hadoop: The Definitive Guide”, 4th edition, O'Reilly, 2015.
• Miner and Shook, “MapReduce Design Patterns”, O'Reilly, 2012.
• Leskovec, Rajaraman, and Ullman, “Mining of Massive Datasets”, Chapter 2, 2014. http://www.mmds.org
