
27/04/2015

Fundamentals of Distributed Systems

CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2015

Lecture 4: DFS & MapReduce I

Aidan Hogan aidhog@.com

Inside Google circa 1997/98

MASSIVE DATA PROCESSING (THE GOOGLE WAY …)

Inside Google circa 2015

Google’s Cooling System


Google’s Recycling Initiative

Google Architecture (ca. 1998)

Information Retrieval
• Crawling
• Inverted indexing
• Word-counts
• Link-counts
• greps/sorts
• PageRank
• Updates …

Google Engineering

• Massive amounts of data
• Each task needs communication protocols
• Each task needs fault tolerance
• Multiple tasks running concurrently
• Ad hoc solutions would repeat the same code

Google Engineering

• Google File System (GFS)
  – Store data across multiple machines
  – Transparent Distributed File System
  – Replication / Self-healing
• MapReduce
  – Programming abstraction for distributed tasks
  – Handles fault tolerance
  – Supports many “simple” distributed tasks!
• BigTable, Pregel, Percolator, Dremel …

Google Re-Engineering

• Google File System (GFS)
• MapReduce (built on GFS)
• BigTable


What is a File-System?

• Breaks files into chunks (or clusters)
• Remembers the sequence of clusters
• Records directory/file structure
• Tracks file meta-data
  – File size
  – Last access
  – Permissions
  – Locks

What is a Distributed File-System?

• Same thing, but distributed
• What would transparency / flexibility / reliability / performance / scalability mean for a distributed file system?
  – Transparency: Like a normal file-system
  – Flexibility: Can mount new machines
  – Reliability: Has replication
  – Performance: Fast storage/retrieval
  – Scalability: Can store a lot of data / support a lot of machines

Google File System (GFS)

• Files are huge
• Files often read or appended
  – Writes in the middle of a file not (really) supported
• Concurrency important
• Failures are frequent
• Streaming important

GFS: Pipelined Writes

• 64 MB per chunk
• 64-bit label for each chunk
• Assume replication factor of 3

Example (Master’s in-memory file system):
  /blue.txt [3 chunks]: 1: {A1, C1, E1}, 2: {A1, B1, D1}, 3: {B1, D1, E1}
  /orange.txt [2 chunks]: 1: {B1, D1, E1}, 2: {A1, C1, E1}

[Diagram: the chunks of blue.txt (150 MB: 3 chunks) and orange.txt (100 MB: 2 chunks) spread across the chunk-servers (slaves) A1, B1, C1, D1, E1]
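As a quick worked example of the chunking arithmetic above, here is a minimal sketch (the 150 MB size comes from the blue.txt example; the class name is just illustrative):

```java
public class ChunkMath {
  public static void main(String[] args) {
    final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB per chunk

    long blueTxt = 150L * 1024 * 1024;           // blue.txt is 150 MB
    // number of chunks = ceiling(fileSize / CHUNK_SIZE): 150 MB -> 3 chunks
    long chunks = (blueTxt + CHUNK_SIZE - 1) / CHUNK_SIZE;
    System.out.println("chunks: " + chunks);     // 3

    // with a replication factor of 3, each chunk is stored on 3 chunk-servers
    System.out.println("replicas to store: " + (chunks * 3));  // 9

    // a read at byte offset 100 MB falls into chunk index 1 (the second chunk)
    long offset = 100L * 1024 * 1024;
    System.out.println("chunk index: " + (offset / CHUNK_SIZE));
  }
}
```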

GFS: Pipelined Writes (In Words)

1. Client asks Master to write a file
2. Master returns a primary chunkserver and the secondary chunkservers
3. Client writes to the primary chunkserver and tells it the secondary chunkservers
4. Primary chunkserver passes data onto a secondary chunkserver, which passes on …
5. When finished, a message comes back through the pipeline that all chunkservers have written
   – Otherwise the client sends again

GFS: Fault Tolerance

[Diagram: chunk-server C1 fails; the Master re-replicates its chunks on the remaining chunk-servers and updates its in-memory file table accordingly, e.g. /blue.txt chunk 1 becomes {A1, B1, E1} and /orange.txt chunk 2 becomes {A1, D1, E1}]


GFS: Fault Tolerance (In Words)

• Master sends regular “Heartbeat” pings
• If a chunkserver doesn’t respond:
  1. Master finds out what chunks it had
  2. Master assigns a new chunkserver for each chunk
  3. Master tells the new chunkserver to copy from a specific existing chunkserver
• Chunks are prioritised by number of remaining replicas, then by demand

GFS: Direct Reads

[Diagram: a client looking for /blue.txt asks the Master, which looks up the chunk locations (1, 2, 3) in its in-memory file table; the client then reads each chunk directly from the chunk-servers (slaves)]

GFS: Direct Reads (In Words)

1. Client asks Master for the file
2. Master returns the location of a chunk
   – Returns a ranked list of replicas
3. Client reads the chunk directly from a chunkserver
4. Client asks Master for the next chunk

Software makes this transparent for the client!

GFS: Modification Consistency

Masters assign leases to one replica: a “primary replica”

Client wants to change a file:
1. Client asks Master for the replicas (incl. primary)
2. Master returns replica info to the client
3. Client sends the change data
4. Client asks the primary to execute the changes
5. Primary asks the secondaries to change
6. Secondaries acknowledge to the primary
7. Primary acknowledges to the client

Remember: Concurrency!
Data & Control Decoupled

GFS: Rack Awareness

[Diagram: a core switch connecting the rack switches of Rack A, Rack B and Rack C]


GFS: Rack Awareness

[Diagram: a core switch connecting Rack A {A1, A2, A3, A4, A5} and Rack B {B1, B2, B3, B4, B5}; /orange.txt chunk 1 is stored on {A1, A4, B3} and chunk 2 on {A5, B1, B5}]

GFS: Rack Awareness (In Words)

• Make sure replicas are not all on the same rack
  – In case the rack switch fails!
• But communication can be slower:
  – Within a rack: pass one switch (rack switch)
  – Across racks: pass three switches (two rack switches and a core switch)
• (Typically) pick two racks with low traffic
  – Two nodes in the same rack, one node in another rack
  – (Assuming 3× replication)
• Only necessary if there is more than one rack! 

GFS: Other Operations

Rebalancing: Spread storage out evenly

Deletion:
• Just rename the file with a hidden file name
  – To recover, rename back to the original version
  – Otherwise, three days later it will be wiped

Monitoring Stale Replicas: A dead slave reappears with old data: the master keeps version info and will recycle old chunks

GFS: Weaknesses?

What do you see as the core weaknesses of the Google File System?

• Master node is a single point of failure
  – Use hardware replication
  – Logs and checkpoints!
• Master node is a bottleneck
  – Use a more powerful machine
  – Minimise master-node traffic
• Master-node meta-data is kept in memory
  – Each chunk needs 64 bytes
  – Chunk data can be queried from each slave

GFS: White-Paper

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


Google Re-Engineering

Google File System (GFS)

HDFS

• HDFS-to-GFS:
  – Data-node = Chunkserver/Slave
  – Name-node = Master

• HDFS does not support modifications

• Otherwise pretty much the same, except …
  – GFS is proprietary (hidden in Google)
  – HDFS is open source (Apache!)

HDFS Interfaces

Google Re-Engineering

Google File System (GFS)

MapReduce

GOOGLE’S MAP-REDUCE


MapReduce in Google

• Divide & Conquer:
  1. Word count
  2. Total searches per user
  3. PageRank
  4. Inverted-indexing
• How could we do a distributed top-k word count?
  – Better partitioning method?

MapReduce: Word Count

[Diagram: Input → Distr. File Sys. → Map() → (Partition/Shuffle) → (Distr. Sort) → Reduce() → Output; three reducers handle the key ranges a–k, l–q and r–z, producing counts such as “a 10,023”, “aa 6,234”, “lab 8,123”, “label 983”, “rag 543”, “rat 1,233”]

MapReduce (in more detail)

1. Input: Read from the cluster (e.g., a DFS)
   – Chunks raw data for mappers
   – Maps raw data to initial (key_in, value_in) pairs
   What might Input do in the word-count case?

2. Map: For each (key_in, value_in) pair, generate zero-to-many (key_map, value_map) pairs
   – key_in / value_in can be of a different type to key_map / value_map
   What might Map do in the word-count case?

3. Partition: Assign sets of key_map values to reducer machines
   How might Partition work in the word-count case?

4. Shuffle: Data are moved from mappers to reducers (e.g., using the DFS)

5. Comparison/Sort: Each reducer sorts the data by key using a comparison function
   – Sort is taken care of by the framework

MapReduce (in more detail)

6. Reduce: Takes a bag of (keymap, valuemap) pairs with the same keymap value, and produces zero-to-many outputs for each bag – Typically zero-or-one outputs

How might Reduce work in the word-count case?

7. Output: Merge-sorts the results from the reducers / writes to stable storage

MapReduce: Word Count PseudoCode
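A minimal single-machine sketch of the word-count Map and Reduce functions in plain Java (not the Hadoop API, which follows later); the class name and sample input are illustrative, and the partition/shuffle/sort phases are simulated with a sorted map:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class WordCountSketch {

  // Map: for each input line, emit a (word, 1) pair for every word in the line
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.toLowerCase().split("\\s+")) {
      pairs.add(Map.entry(word, 1));
    }
    return pairs;
  }

  // Reduce: given one word and the bag of 1s emitted for it, output the total
  static int reduce(String word, List<Integer> counts) {
    int sum = 0;
    for (int c : counts) {
      sum += c;
    }
    return sum;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a rose is a rose", "is a rose");

    // partition/shuffle/sort: group the map outputs by key (here: one sorted map)
    SortedMap<String, List<Integer>> grouped = new TreeMap<>();
    for (String line : input) {
      for (Map.Entry<String, Integer> pair : map(line)) {
        grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }

    // output: one (word, count) line per key
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
  }
}
```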


MapReduce: Scholar Example

Assume that in Google Scholar we have inputs like:

paperA1 citedBy paperB1

How can we use MapReduce to count the total incoming citations per paper?
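One possible answer, sketched as a single-machine simulation of the map and reduce logic (the sample triples and class name are made up for illustration): for each `X citedBy Y` triple the mapper would emit `(X, 1)`, and the reducer would sum the 1s per paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CitationCountSketch {
  public static void main(String[] args) {
    // hypothetical sample of "X citedBy Y" triples (X receives the citation)
    List<String> input = Arrays.asList(
        "paperA1 citedBy paperB1",
        "paperA1 citedBy paperC1",
        "paperB1 citedBy paperC1");

    // "map": emit (citedPaper, 1) for every triple;
    // "shuffle" + "reduce": group by paper and sum, simulated here with a HashMap
    Map<String, Integer> citations = new HashMap<>();
    for (String line : input) {
      String cited = line.split(" ")[0];        // the paper being cited
      citations.merge(cited, 1, Integer::sum);  // one more incoming citation
    }

    // output: (paper, total incoming citations), e.g. paperA1 -> 2, paperB1 -> 1
    citations.forEach((paper, n) -> System.out.println(paper + "\t" + n));
  }
}
```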

MapReduce as a Dist. Sys.

• Transparency: Abstracts physical machines
• Flexibility: Can mount new machines; can run a variety of types of jobs
• Reliability: Tasks are monitored by a master node using a heart-beat; dead jobs restart
• Performance: Depends on the application code, but exploits parallelism!
• Scalability: Depends on the application code, but can serve as the basis for massive data processing!

MapReduce: Benefits for Programmers

• Takes care of the low-level implementation:
  – Easy to handle inputs and output
  – No need to handle network communication
  – No need to write sorts or joins
• Abstracts machines (transparency)
  – Fault tolerance (through heart-beats)
  – Abstracts physical locations
  – Add / remove machines
  – Load balancing

MapReduce: Benefits for Programmers

Time for more important things …

HADOOP OVERVIEW


Hadoop Architecture

[Diagram: a Client talks to the NameNode and the JobTracker; beneath them sit DataNode 1 … DataNode n and JobNode 1 … JobNode n]

HDFS: Traditional / SPOF

1. NameNode appends edits to a log file (e.g., copy dfs/blue.txt, rm dfs/orange.txt, rmdir dfs/, mkdir new/, mv new/ dfs/)
2. SecondaryNameNode copies the log file and the fsimage, makes a checkpoint, and copies the image back
3. NameNode loads the image on start-up and makes the remaining edits

SecondaryNameNode is not a backup NameNode

What is the secondary name-node?

• Name-node quickly logs all file-system actions in a sequential (but messy) way
• Secondary name-node keeps the main fsimage file up-to-date based on the logs
• When the primary name-node boots back up, it loads the fsimage file and applies the remaining log to it
• Hence the secondary name-node helps make boot-ups faster, helps keep the file-system image up-to-date, and takes load away from the primary

Hadoop: High Availability

[Diagram: an Active NameNode and a Standby NameNode each run a JournalManager over their own fsimage; file-system edits are written in order to JournalNode 1 … JournalNode n, from which the standby replays them]

1. Input/Output (cmd)

> hdfs dfs

PROGRAMMING WITH HADOOP


1. Input/Output (Java)

• Creates a file system for the default configuration
• Check if the file exists; if so, delete it
• Create the file and write a message
• Open and read it back
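A minimal sketch of what that input/output code might look like with Hadoop’s FileSystem API (the file name demo.txt and the message are made up; error handling omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // creates a file system handle for the default configuration
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("demo.txt");      // hypothetical file name

    // check if the file exists; if so, delete it (false = not recursive)
    if (fs.exists(file)) {
      fs.delete(file, false);
    }

    // create the file and write a message
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("Hello HDFS");
    out.close();

    // open the file and read the message back
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();

    fs.close();
  }
}
```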

2. Map

• Mapper interface (Writable for values)
• OutputCollector will collect the Map key/value pairs
  – Same order of types as the Mapper’s output key/value
• “Reporter” can provide counters and progress to the client (not needed in the running example)
• Emit output
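A sketch of such a word-count Mapper with the classic org.apache.hadoop.mapred API (the class name is illustrative; the running example on the original slides may differ in detail):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // key: byte offset of the line in the input; value: the line of text
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // emit (word, 1)
    }
  }
}
```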

3. Partition

• (WritableComparable for keys/values)
  – Same as before
  – Needed to sort keys
  – Needed for the default partition function
• Partitioner interface
  – (This happens to be the default partition method!)
  – (not needed in the running example)
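A sketch of a Partitioner that reproduces the default hash-based partitioning for the word-count keys (class name illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordCountPartitioner implements Partitioner<Text, IntWritable> {

  @Override
  public void configure(JobConf job) {
    // nothing to configure for this partitioner
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // same idea as Hadoop's default HashPartitioner:
    // mask the sign bit, then take the key's hash modulo the number of reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```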


4. Shuffle

5. Sort/Comparison

• Methods in WritableComparator
• (not needed in the running example)
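The running example only needs built-in keys such as Text, but a custom key type would implement WritableComparable so that the framework can serialise and sort it. A hedged sketch with a hypothetical (word, year) composite key:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// hypothetical composite key: (word, year), sorted by word and then by year
public class WordYearKey implements WritableComparable<WordYearKey> {

  private String word;
  private int year;

  public WordYearKey() { }                        // no-arg constructor required

  public WordYearKey(String word, int year) {
    this.word = word;
    this.year = year;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialise
    out.writeUTF(word);
    out.writeInt(year);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialise
    word = in.readUTF();
    year = in.readInt();
  }

  @Override
  public int compareTo(WordYearKey other) {       // used by the framework's sort
    int cmp = word.compareTo(other.word);
    return (cmp != 0) ? cmp : Integer.compare(year, other.year);
  }

  @Override
  public int hashCode() {                         // used by the default partitioner
    return 31 * word.hashCode() + year;
  }
}
```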

6. Reduce

• Reducer interface (a sketch follows below)
• reduce() receives: the key, an iterator over all values for that key, an output key–value pair collector, and a reporter
• Write to output

7. Output / Input (Java)

• Creates a file system for the default configuration
• Check if the file exists; if so, delete it
• Create the file and write a message
• Open and read it back
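A sketch of the corresponding word-count Reducer with the same classic API (again, the class name is illustrative):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // sum all the 1s emitted by the mappers for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // emit (word, total count)
  }
}
```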

7. Output (Java)

Control Flow

• Create a JobClient and a JobConf, and pass it the main class
• Set the type of the output key and value in the configuration
• Set input and output paths
• Set the mapper class
• Set the reducer class (and optionally a “combiner”)
• Pass the configuration to the client and run
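Putting the control flow together, a sketch of a driver using JobClient and JobConf (WordCountMapper/WordCountReducer refer to the sketches above; input and output paths are taken from the command line):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // create a JobConf and pass it the main class
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // set the type of the output key and value in the configuration
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // set the mapper class, and the reducer class (also reused as combiner)
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);
    conf.setReducerClass(WordCountReducer.class);

    // set input and output paths
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // pass the configuration to the client and run
    JobClient.runJob(conf);
  }
}
```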


More in Hadoop: Combiner

• Map-side “mini-reduction”
• Keeps a fixed-size buffer in memory
• Reduce within that buffer
  – e.g., count words in the buffer
  – Lessens bandwidth needs
• In Hadoop: can simply use the Reducer class 

More in Hadoop: Reporter

• Reporter has a group of maps of counters
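In the word-count case the combiner can be registered exactly as in the driver sketch above (conf.setCombinerClass(WordCountReducer.class)), since summing counts is associative and commutative. For counters, a hedged variant of the earlier mapper that reports a hypothetical counter for empty lines:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      // counters live in named groups; totals are aggregated across all map tasks
      reporter.incrCounter("WordCount", "EMPTY_LINES", 1);
      return;
    }
    for (String token : line.split("\\s+")) {
      word.set(token);
      output.collect(word, ONE);   // emit (word, 1) as before
    }
  }
}
```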

More in Hadoop: Chaining Jobs

• Sometimes we need to chain jobs
• In Hadoop, can pass a set of Jobs to the client
  – x.addDependingJob(y)

More in Hadoop: Distributed Cache

• Some tasks need “global knowledge”
  – For example, a white-list of conference venues and journals that should be considered in the citation count
  – Typically small
• Use a distributed cache:
  – Makes data available locally to all nodes
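A sketch of how a small white-list file might be registered with the distributed cache from the driver and located again inside a task (the HDFS path is hypothetical; the classic org.apache.hadoop.filecache API is assumed):

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetup.class);

    // register a small file (e.g., a venue white-list) with the distributed cache;
    // Hadoop copies it to the local disk of every node running a task
    DistributedCache.addCacheFile(new URI("/share/venue-whitelist.txt"), conf);

    // ... set mapper/reducer/paths and run as in the driver sketch above ...
  }

  // inside a Mapper/Reducer, configure(JobConf) could locate the local copies:
  static Path[] localCopies(JobConf job) throws java.io.IOException {
    return DistributedCache.getLocalCacheFiles(job);
  }
}
```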

RECAP

Distributed File Systems

• Google File System (GFS)
  – Master and Chunkslaves
  – Replicated pipelined writes
  – Direct reads
  – Minimising master traffic
  – Fault-tolerance: self-healing
  – Rack awareness
  – Consistency and modifications
• Hadoop Distributed File System (HDFS)
  – NameNode and DataNodes


MapReduce

1. Input
2. Map
3. Partition
4. Shuffle
5. Comparison/Sort
6. Reduce
7. Output

MapReduce/GFS Revision

• GFS: distributed file system
  – Implemented as HDFS
• MapReduce: distributed processing framework
  – Implemented as Hadoop

Hadoop

• FileSystem

• Mapper

• OutputCollector

• Writable, WritableComparable

• Partitioner

• Reducer

• JobClient/JobConf
