
27/04/2015

Fundamentals of Distributed Systems

CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2015

Lecture 4: DFS & MapReduce I

Aidan Hogan aidhog@.com

Inside Google circa 1997/98

MASSIVE DATA PROCESSING (THE GOOGLE WAY …)

Inside Google circa 2015

Google’s Cooling System


Google’s Recycling Initiative

Google Architecture (ca. 1998)

Information Retrieval
• Crawling
• Inverted indexing
• Word-counts
• Link-counts
• greps/sorts
• PageRank
• Updates …

Google Engineering

• Massive amounts of data
• Each task needs communication protocols
• Each task needs fault tolerance
• Multiple tasks running concurrently
• Ad hoc solutions would repeat the same code

Google Engineering

• Google File System (GFS)
  – Store data across multiple machines
  – Transparent Distributed File System
  – Replication / Self-healing
• MapReduce
  – Programming abstraction for distributed tasks
  – Handles fault tolerance
  – Supports many “simple” distributed tasks!
• BigTable, Pregel, Percolator, Dremel …

Google Re-Engineering

• Google File System (GFS)
• MapReduce (built on GFS)
• BigTable


What is a File-System?

• Breaks files into chunks (or clusters)
• Remembers the sequence of clusters
• Records directory/file structure
• Tracks file meta-data
  – File size
  – Last access
  – Permissions
  – Locks

What is a Distributed File-System?

• Same thing, but distributed
• What would transparency / flexibility / reliability / performance / scalability mean for a distributed file system?
  – Transparency: Like a normal file-system
  – Flexibility: Can mount new machines
  – Reliability: Has replication
  – Performance: Fast storage/retrieval
  – Scalability: Can store a lot of data / support a lot of machines

Google File System (GFS)

• Files are huge
• Files often read or appended
  – Writes in the middle of a file not (really) supported
• Concurrency important
• Failures are frequent
• Streaming important

GFS: Pipelined Writes

• 64 MB per chunk
• 64-bit label for each chunk
• Assume replication factor of 3

Example (Master’s in-memory file system):
  /blue.txt [3 chunks]: 1: {A1, C1, E1}, 2: {A1, B1, D1}, 3: {B1, D1, E1}
  /orange.txt [2 chunks]: 1: {B1, D1, E1}, 2: {A1, C1, E1}

[Diagram: the chunks of blue.txt (150 MB: 3 chunks) and orange.txt (100 MB: 2 chunks) spread across the chunk-servers (slaves) A1, B1, C1, D1, E1]
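As a quick worked example of the chunking arithmetic above, here is a minimal sketch (the 150 MB size comes from the blue.txt example; the class name is just illustrative):

```java
public class ChunkMath {
  public static void main(String[] args) {
    final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB per chunk

    long blueTxt = 150L * 1024 * 1024;           // blue.txt is 150 MB
    // number of chunks = ceiling(fileSize / CHUNK_SIZE): 150 MB -> 3 chunks
    long chunks = (blueTxt + CHUNK_SIZE - 1) / CHUNK_SIZE;
    System.out.println("chunks: " + chunks);     // 3

    // with a replication factor of 3, each chunk is stored on 3 chunk-servers
    System.out.println("replicas to store: " + (chunks * 3));  // 9

    // a read at byte offset 100 MB falls into chunk index 1 (the second chunk)
    long offset = 100L * 1024 * 1024;
    System.out.println("chunk index: " + (offset / CHUNK_SIZE));
  }
}
```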

GFS: Pipelined Writes (In Words)

1. Client asks Master to write a file
2. Master returns a primary chunkserver and the secondary chunkservers
3. Client writes to the primary chunkserver and tells it the secondary chunkservers
4. Primary chunkserver passes data onto a secondary chunkserver, which passes on …
5. When finished, a message comes back through the pipeline that all chunkservers have written
   – Otherwise the client sends again

GFS: Fault Tolerance

[Diagram: chunk-server C1 fails; the Master re-replicates its chunks on the remaining chunk-servers and updates its in-memory file table accordingly, e.g. /blue.txt chunk 1 becomes {A1, B1, E1} and /orange.txt chunk 2 becomes {A1, D1, E1}]


GFS: Fault Tolerance (In Words)

• Master sends regular “Heartbeat” pings
• If a chunkserver doesn’t respond:
  1. Master finds out what chunks it had
  2. Master assigns a new chunkserver for each chunk
  3. Master tells the new chunkserver to copy from a specific existing chunkserver
• Chunks are prioritised by number of remaining replicas, then by demand

GFS: Direct Reads

[Diagram: a client looking for /blue.txt asks the Master, which looks up the chunk locations (1, 2, 3) in its in-memory file table; the client then reads each chunk directly from the chunk-servers (slaves)]

GFS: Direct Reads (In Words)

1. Client asks Master for the file
2. Master returns the location of a chunk
   – Returns a ranked list of replicas
3. Client reads the chunk directly from a chunkserver
4. Client asks Master for the next chunk

Software makes this transparent for the client!

GFS: Modification Consistency

Masters assign leases to one replica: a “primary replica”

Client wants to change a file:
1. Client asks Master for the replicas (incl. primary)
2. Master returns replica info to the client
3. Client sends the change data
4. Client asks the primary to execute the changes
5. Primary asks the secondaries to change
6. Secondaries acknowledge to the primary
7. Primary acknowledges to the client

Remember: Concurrency!
Data & Control Decoupled

GFS: Rack Awareness

[Diagram: a core switch connecting the rack switches of Rack A, Rack B and Rack C]


GFS: Rack Awareness

[Diagram: a core switch connecting Rack A {A1, A2, A3, A4, A5} and Rack B {B1, B2, B3, B4, B5}; /orange.txt chunk 1 is stored on {A1, A4, B3} and chunk 2 on {A5, B1, B5}]

GFS: Rack Awareness (In Words)

• Make sure replicas are not all on the same rack
  – In case the rack switch fails!
• But communication can be slower:
  – Within a rack: pass one switch (rack switch)
  – Across racks: pass three switches (two rack switches and a core switch)
• (Typically) pick two racks with low traffic
  – Two nodes in the same rack, one node in another rack
  – (Assuming 3× replication)
• Only necessary if there is more than one rack! 

GFS: Other Operations

Rebalancing: Spread storage out evenly

Deletion:
• Just rename the file with a hidden file name
  – To recover, rename back to the original version
  – Otherwise, three days later it will be wiped

Monitoring Stale Replicas: A dead slave reappears with old data: the master keeps version info and will recycle old chunks

GFS: Weaknesses?

What do you see as the core weaknesses of the Google File System?

• Master node is a single point of failure
  – Use hardware replication
  – Logs and checkpoints!
• Master node is a bottleneck
  – Use a more powerful machine
  – Minimise master-node traffic
• Master-node meta-data is kept in memory
  – Each chunk needs 64 bytes
  – Chunk data can be queried from each slave

GFS: White-Paper

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


Google Re-Engineering

Google File System (GFS)

HDFS

• HDFS-to-GFS:
  – Data-node = Chunkserver/Slave
  – Name-node = Master

• HDFS does not support modifications

• Otherwise pretty much the same, except …
  – GFS is proprietary (hidden in Google)
  – HDFS is open source (Apache!)

HDFS Interfaces

Google Re-Engineering

Google File System (GFS)

MapReduce

GOOGLE’S MAP-REDUCE


MapReduce in Google

• Divide & Conquer:
  1. Word count
  2. Total searches per user
  3. PageRank
  4. Inverted-indexing
• How could we do a distributed top-k word count?
  – Better partitioning method?

MapReduce: Word Count

[Diagram: Input → Distr. File Sys. → Map() → (Partition/Shuffle) → (Distr. Sort) → Reduce() → Output; three reducers handle the key ranges a–k, l–q and r–z, producing counts such as “a 10,023”, “aa 6,234”, “lab 8,123”, “label 983”, “rag 543”, “rat 1,233”]

MapReduce (in more detail)

1. Input: Read from the cluster (e.g., a DFS)
   – Chunks raw data for mappers
   – Maps raw data to initial (key_in, value_in) pairs
   What might Input do in the word-count case?

2. Map: For each (key_in, value_in) pair, generate zero-to-many (key_map, value_map) pairs
   – key_in / value_in can be of a different type to key_map / value_map
   What might Map do in the word-count case?

3. Partition: Assign sets of key_map values to reducer machines
   How might Partition work in the word-count case?

4. Shuffle: Data are moved from mappers to reducers (e.g., using the DFS)

5. Comparison/Sort: Each reducer sorts the data by key using a comparison function
   – Sort is taken care of by the framework

MapReduce (in more detail)

6. Reduce: Takes a bag of (keymap, valuemap) pairs with the same keymap value, and produces zero-to-many outputs for each bag – Typically zero-or-one outputs

How might Reduce work in the word-count case?

7. Output: Merge-sorts the results from the reducers / writes to stable storage

MapReduce: Word Count PseudoCode
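A minimal single-machine sketch of the word-count Map and Reduce functions in plain Java (not the Hadoop API, which follows later); the class name and sample input are illustrative, and the partition/shuffle/sort phases are simulated with a sorted map:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class WordCountSketch {

  // Map: for each input line, emit a (word, 1) pair for every word in the line
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.toLowerCase().split("\\s+")) {
      pairs.add(Map.entry(word, 1));
    }
    return pairs;
  }

  // Reduce: given one word and the bag of 1s emitted for it, output the total
  static int reduce(String word, List<Integer> counts) {
    int sum = 0;
    for (int c : counts) {
      sum += c;
    }
    return sum;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a rose is a rose", "is a rose");

    // partition/shuffle/sort: group the map outputs by key (here: one sorted map)
    SortedMap<String, List<Integer>> grouped = new TreeMap<>();
    for (String line : input) {
      for (Map.Entry<String, Integer> pair : map(line)) {
        grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }

    // output: one (word, count) line per key
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
  }
}
```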


MapReduce: Scholar Example

Assume that in Google Scholar we have inputs like:

paperA1 citedBy paperB1

How can we use MapReduce to count the total incoming citations per paper?
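One possible answer, sketched as a single-machine simulation of the map and reduce logic (the sample triples and class name are made up for illustration): for each `X citedBy Y` triple the mapper would emit `(X, 1)`, and the reducer would sum the 1s per paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CitationCountSketch {
  public static void main(String[] args) {
    // hypothetical sample of "X citedBy Y" triples (X receives the citation)
    List<String> input = Arrays.asList(
        "paperA1 citedBy paperB1",
        "paperA1 citedBy paperC1",
        "paperB1 citedBy paperC1");

    // "map": emit (citedPaper, 1) for every triple;
    // "shuffle" + "reduce": group by paper and sum, simulated here with a HashMap
    Map<String, Integer> citations = new HashMap<>();
    for (String line : input) {
      String cited = line.split(" ")[0];        // the paper being cited
      citations.merge(cited, 1, Integer::sum);  // one more incoming citation
    }

    // output: (paper, total incoming citations), e.g. paperA1 -> 2, paperB1 -> 1
    citations.forEach((paper, n) -> System.out.println(paper + "\t" + n));
  }
}
```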

MapReduce as a Dist. Sys.

• Transparency: Abstracts physical machines
• Flexibility: Can mount new machines; can run a variety of types of jobs
• Reliability: Tasks are monitored by a master node using a heart-beat; dead jobs restart
• Performance: Depends on the application code, but exploits parallelism!
• Scalability: Depends on the application code, but can serve as the basis for massive data processing!

MapReduce: Benefits for Programmers

• Takes care of the low-level implementation:
  – Easy to handle inputs and output
  – No need to handle network communication
  – No need to write sorts or joins
• Abstracts machines (transparency)
  – Fault tolerance (through heart-beats)
  – Abstracts physical locations
  – Add / remove machines
  – Load balancing

MapReduce: Benefits for Programmers

Time for more important things …

HADOOP OVERVIEW


Hadoop Architecture

[Diagram: a Client talks to the NameNode and the JobTracker; beneath them sit DataNode 1 … DataNode n and JobNode 1 … JobNode n]

HDFS: Traditional / SPOF

1. NameNode appends edits to a log file (e.g., copy dfs/blue.txt, rm dfs/orange.txt, rmdir dfs/, mkdir new/, mv new/ dfs/)
2. SecondaryNameNode copies the log file and the fsimage, makes a checkpoint, and copies the image back
3. NameNode loads the image on start-up and makes the remaining edits

SecondaryNameNode is not a backup NameNode

What is the secondary name-node?

• Name-node quickly logs all file-system actions in a sequential (but messy) way
• Secondary name-node keeps the main fsimage file up-to-date based on the logs
• When the primary name-node boots back up, it loads the fsimage file and applies the remaining log to it
• Hence the secondary name-node helps make boot-ups faster, helps keep the file-system image up-to-date, and takes load away from the primary

Hadoop: High Availability

[Diagram: an Active NameNode and a Standby NameNode each run a JournalManager over their own fsimage; file-system edits are written in order to JournalNode 1 … JournalNode n, from which the standby replays them]

1. Input/Output (cmd)

> hdfs dfs

PROGRAMMING WITH HADOOP


1. Input/Output (Java)

• Creates a file system for the default configuration
• Check if the file exists; if so, delete it
• Create the file and write a message
• Open and read it back
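A minimal sketch of what that input/output code might look like with Hadoop’s FileSystem API (the file name demo.txt and the message are made up; error handling omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // creates a file system handle for the default configuration
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("demo.txt");      // hypothetical file name

    // check if the file exists; if so, delete it (false = not recursive)
    if (fs.exists(file)) {
      fs.delete(file, false);
    }

    // create the file and write a message
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("Hello HDFS");
    out.close();

    // open the file and read the message back
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();

    fs.close();
  }
}
```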

2. Map

• Mapper interface (Writable for values)
• OutputCollector will collect the Map key/value pairs
  – Same order of types as the Mapper’s output key/value
• “Reporter” can provide counters and progress to the client (not needed in the running example)
• Emit output
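A sketch of such a word-count Mapper with the classic org.apache.hadoop.mapred API (the class name is illustrative; the running example on the original slides may differ in detail):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // key: byte offset of the line in the input; value: the line of text
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // emit (word, 1)
    }
  }
}
```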

3. Partition

• (WritableComparable for keys/values)
  – Same as before
  – Needed to sort keys
  – Needed for the default partition function
• Partitioner interface
  – (This happens to be the default partition method!)
  – (not needed in the running example)
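A sketch of a Partitioner that reproduces the default hash-based partitioning for the word-count keys (class name illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordCountPartitioner implements Partitioner<Text, IntWritable> {

  @Override
  public void configure(JobConf job) {
    // nothing to configure for this partitioner
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // same idea as Hadoop's default HashPartitioner:
    // mask the sign bit, then take the key's hash modulo the number of reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```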


4. Shuffle

5. Sort/Comparison

• Methods in WritableComparator
• (not needed in the running example)
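The running example only needs built-in keys such as Text, but a custom key type would implement WritableComparable so that the framework can serialise and sort it. A hedged sketch with a hypothetical (word, year) composite key:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// hypothetical composite key: (word, year), sorted by word and then by year
public class WordYearKey implements WritableComparable<WordYearKey> {

  private String word;
  private int year;

  public WordYearKey() { }                        // no-arg constructor required

  public WordYearKey(String word, int year) {
    this.word = word;
    this.year = year;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialise
    out.writeUTF(word);
    out.writeInt(year);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialise
    word = in.readUTF();
    year = in.readInt();
  }

  @Override
  public int compareTo(WordYearKey other) {       // used by the framework's sort
    int cmp = word.compareTo(other.word);
    return (cmp != 0) ? cmp : Integer.compare(year, other.year);
  }

  @Override
  public int hashCode() {                         // used by the default partitioner
    return 31 * word.hashCode() + year;
  }
}
```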

6. Reduce

• Reducer interface (a sketch follows below)
• reduce() receives: the key, an iterator over all values for that key, an output key–value pair collector, and a reporter
• Write to output

7. Output / Input (Java)

• Creates a file system for the default configuration
• Check if the file exists; if so, delete it
• Create the file and write a message
• Open and read it back
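A sketch of the corresponding word-count Reducer with the same classic API (again, the class name is illustrative):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // sum all the 1s emitted by the mappers for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // emit (word, total count)
  }
}
```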

7. Output (Java)

Control Flow

• Create a JobClient and a JobConf, and pass it the main class
• Set the type of the output key and value in the configuration
• Set input and output paths
• Set the mapper class
• Set the reducer class (and optionally a “combiner”)
• Pass the configuration to the client and run
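Putting the control flow together, a sketch of a driver using JobClient and JobConf (WordCountMapper/WordCountReducer refer to the sketches above; input and output paths are taken from the command line):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // create a JobConf and pass it the main class
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // set the type of the output key and value in the configuration
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // set the mapper class, and the reducer class (also reused as combiner)
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);
    conf.setReducerClass(WordCountReducer.class);

    // set input and output paths
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // pass the configuration to the client and run
    JobClient.runJob(conf);
  }
}
```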


More in Hadoop: Combiner

• Map-side “mini-reduction”
• Keeps a fixed-size buffer in memory
• Reduce within that buffer
  – e.g., count words in the buffer
  – Lessens bandwidth needs
• In Hadoop: can simply use the Reducer class 

More in Hadoop: Reporter

• Reporter has a group of maps of counters
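In the word-count case the combiner can be registered exactly as in the driver sketch above (conf.setCombinerClass(WordCountReducer.class)), since summing counts is associative and commutative. For counters, a hedged variant of the earlier mapper that reports a hypothetical counter for empty lines:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      // counters live in named groups; totals are aggregated across all map tasks
      reporter.incrCounter("WordCount", "EMPTY_LINES", 1);
      return;
    }
    for (String token : line.split("\\s+")) {
      word.set(token);
      output.collect(word, ONE);   // emit (word, 1) as before
    }
  }
}
```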

More in Hadoop: Chaining Jobs

• Sometimes we need to chain jobs
• In Hadoop, can pass a set of Jobs to the client
  – x.addDependingJob(y)

More in Hadoop: Distributed Cache

• Some tasks need “global knowledge”
  – For example, a white-list of conference venues and journals that should be considered in the citation count
  – Typically small
• Use a distributed cache:
  – Makes data available locally to all nodes
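A sketch of how a small white-list file might be registered with the distributed cache from the driver and located again inside a task (the HDFS path is hypothetical; the classic org.apache.hadoop.filecache API is assumed):

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetup.class);

    // register a small file (e.g., a venue white-list) with the distributed cache;
    // Hadoop copies it to the local disk of every node running a task
    DistributedCache.addCacheFile(new URI("/share/venue-whitelist.txt"), conf);

    // ... set mapper/reducer/paths and run as in the driver sketch above ...
  }

  // inside a Mapper/Reducer, configure(JobConf) could locate the local copies:
  static Path[] localCopies(JobConf job) throws java.io.IOException {
    return DistributedCache.getLocalCacheFiles(job);
  }
}
```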

RECAP

Distributed File Systems

• Google File System (GFS)
  – Master and Chunkslaves
  – Replicated pipelined writes
  – Direct reads
  – Minimising master traffic
  – Fault-tolerance: self-healing
  – Rack awareness
  – Consistency and modifications
• Hadoop Distributed File System (HDFS)
  – NameNode and DataNodes


MapReduce

1. Input
2. Map
3. Partition
4. Shuffle
5. Comparison/Sort
6. Reduce
7. Output

MapReduce/GFS Revision

• GFS: distributed file system
  – Implemented as HDFS
• MapReduce: distributed processing framework
  – Implemented as Hadoop

Hadoop

• FileSystem

• Mapper

• OutputCollector

• Writable, WritableComparable

• Partitioner

• Reducer

• JobClient/JobConf
