
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
OSDI 2004
Presented by Anders Miltner

Problem

• When applied to large amounts of data on large clusters, simple algorithms become more complicated
  • Parallelized computations
  • Distributed data
  • Failure handling
• How can we abstract away these common difficulties, allowing the programmer to write simple code for their simple algorithms?

Core Ideas

• Require the programmer to state their problem in terms of two functions, map and reduce
• Handle the scalability concerns inside the system, instead of requiring the programmer to handle them
• Require the programmer to write code only for the easy-to-understand functionality

Map, Reduce, and MapReduce

• map: (k1, v1) -> list (k2, v2)
• reduce: (k2, list v2) -> list (v2)
• MapReduce: list (k1, v1) -> list (k2, list v2)

Example

• Word Count in a lot of documents (runnable sketch below)
• map emits each word with an associated count of occurrences (just "1")
• reduce sums together all counts emitted for a particular word

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
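A minimal runnable rendering of the same pipeline in Python (a sketch of the programming model, not the paper's C++ library; all names here are illustrative). It also shows the composed signature from the third bullet: a list of (k1, v1) pairs in, grouped (k2, list v2) results out.

from collections import defaultdict

# map: (k1, v1) -> list (k2, v2)
def word_count_map(doc_name, contents):
    return [(word, "1") for word in contents.split()]

# reduce: (k2, list v2) -> list (v2)
def word_count_reduce(word, counts):
    return [str(sum(int(c) for c in counts))]

# Toy in-process driver: groups intermediate values by key, standing in
# for the distributed shuffle the real library performs across machines.
def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog and the fox")]
print(map_reduce(docs, word_count_map, word_count_reduce))
# {'the': ['3'], 'quick': ['1'], 'brown': ['1'], 'fox': ['2'], 'lazy': ['1'], 'dog': ['1'], 'and': ['1']}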

Architecture

• Large clusters of commodity PCs
• Input files partitioned into M splits
• Intermediate keys partitioned into R distributions (partitioning sketch below)
• Written to disk, remotely read by workers

[Figure 1: Execution overview — the user program forks a master and many workers; the master assigns map and reduce tasks; map workers read input splits and write intermediate files to their local disks; reduce workers remote-read those files and write the final output files]
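A sketch of what "partitioned into R" means on the map side. The paper's default partitioning function is hash(key) mod R; everything else in this Python illustration is an assumption, not the C++ implementation.

R = 4  # number of reduce partitions, chosen by the user

def default_partition(key, r=R):
    # Default partitioning function: hash(key) mod R. A real implementation
    # uses a hash that is stable across machines, unlike Python's built-in hash.
    return hash(key) % r

# Each map worker buckets its intermediate pairs into R local files this way,
# so every occurrence of a key lands in the same numbered partition and is
# later remote-read by exactly one reduce worker.
def bucket_intermediate(pairs, r=R):
    buckets = {i: [] for i in range(r)}
    for key, value in pairs:
        buckets[default_partition(key, r)].append((key, value))
    return buckets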

Master vs Worker

• Only 1 master, many (thousands of) workers
• Workers do the computations of maps and reduces
• The master handles failures and assigns tasks to workers

Failure Detection and Recovery

• Master pings workers to determine whether they have failed (sketch below)
• Map worker failure:
  • Re-execute its map tasks, even completed ones (their output lives on the failed worker's local disk)
• Reduce worker failure:
  • Re-execute if not completed; don't otherwise (completed output is already in the global file system)
• Master failure:
  • Do nothing (the current implementation simply aborts the computation; the client can retry)
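A hedged sketch of the recovery rule behind those bullets, assuming a simple in-memory task table on the master (names and structure are illustrative, not the paper's code): map tasks on a dead worker are reset even if completed, because their output sat on that worker's local disk; completed reduce tasks stay done, because their output is already in the global file system.

from dataclasses import dataclass

@dataclass
class Task:
    kind: str            # "map" or "reduce"
    state: str = "idle"  # "idle", "in_progress", or "completed"
    worker: str = ""     # id of the worker it is (or was) running on

def handle_worker_failure(tasks, dead_worker):
    for t in tasks:
        if t.worker != dead_worker:
            continue
        if t.kind == "map":
            # Completed and in-progress map tasks are both re-executed:
            # their output lives on the failed machine's local disk.
            t.state, t.worker = "idle", ""
        elif t.state == "in_progress":
            # In-progress reduce tasks are rescheduled; completed reduce
            # tasks are not, since their output is in the global file system.
            t.state, t.worker = "idle", ""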

Locality

• Use GFS for storage of the input data
• Map tasks assigned to workers on or near a replica of their input split, minimizing network transfer (sketch below)
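A sketch of that locality preference in Python (illustrative only; the real scheduler also prefers same-rack replicas and handles much more): when handing out a map task, prefer an idle worker that already stores a GFS replica of the task's input split.

def pick_worker_for_map_task(split_replicas, idle_workers):
    # split_replicas: set of worker ids holding a GFS replica of the input split
    # idle_workers: list of currently idle worker ids
    idle_workers = list(idle_workers)
    for worker in idle_workers:
        if worker in split_replicas:
            return worker  # data-local: the map task reads its input from local disk
    # No local replica available: fall back to any idle worker
    # (the real scheduler prefers one on the same rack as a replica).
    return idle_workers[0] if idle_workers else None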

Task Granularity

• Map phase has M pieces
• Reduce phase has R pieces
• M and R should be much larger than the number of machines (see the worked estimate of the master's bookkeeping below)
• Improves dynamic load balancing
• Speeds up worker failure recovery
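A rough sense of why M and R cannot grow without bound, using the typical job sizes the paper mentions (M = 200,000 and R = 5,000 on 2,000 workers) and its estimate of roughly one byte of master state per map/reduce task pair:

  O(M × R) state ≈ 200,000 × 5,000 = 10^9 pairs ≈ 1 GB of memory on the master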

Backup Tasks

• Stragglers happen
  • Hard disk issues
  • Bad scheduling
• Near the end of the job, schedule backup executions of the remaining in-progress tasks (sketch below)
• A task completes when either the primary or the backup execution completes
• Done intelligently, so the extra resource usage is only a few percent
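A sketch of the backup-execution rule (illustrative Python; the threshold and data structures are assumptions, not values from the paper): once the operation is close to completion, schedule a duplicate execution of every task still in progress, and count a task as done as soon as either copy finishes.

def tasks_to_back_up(task_states, nearly_done=0.95):
    # task_states: dict of task id -> "idle" | "in_progress" | "completed"
    completed = sum(1 for s in task_states.values() if s == "completed")
    if completed / len(task_states) < nearly_done:
        return []  # too early: backups would mostly waste resources
    return [tid for tid, s in task_states.items() if s == "in_progress"]

def on_task_finished(task_states, task_id):
    # Primary and backup race; whichever reports first marks the task
    # completed, and the later duplicate completion is simply ignored.
    task_states[task_id] = "completed"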

Refinements

• Partitioning Function
• Ordering Guarantees
• Combiner Function (example below)
• Input and Output Types
• Side-effects
• Skipping Bad Records
• Local Execution
• Status Information
• Counters
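As a concrete example of one refinement, the combiner function does partial merging on the map worker before intermediate data crosses the network; for word count it is essentially the reduce function applied locally. A rough Python sketch (the actual interface is C++):

from collections import defaultdict

def word_count_combiner(pairs):
    # Partial merge on the map worker: collapse repeated ("word", count)
    # pairs into a single ("word", partial_sum) pair per word, shrinking
    # the data that must be written locally and shuffled to reducers.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += int(count)
    return [(word, str(total)) for word, total in partial.items()]

# e.g. [("the", "1"), ("the", "1"), ("fox", "1")] -> [("the", "2"), ("fox", "1")]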

Performance Setup

• Measured on an 1800-machine cluster
• 2 GHz processors
• 4 GB memory
• Two 160 GB disks per machine
• Gigabit Ethernet

Performance

• Grep
  • 10^10 100-byte records scanned in 150 s
• Sort
  • 10^10 100-byte records sorted in 891 s
  • Without backup tasks: 1283 s, a 44% increase in elapsed time

[Figure 3 from the paper: data transfer rates over time for the sort program — (a) normal execution, (b) no backup tasks, (c) 200 tasks killed]
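For scale, a back-of-envelope reading of those numbers (my arithmetic, not a figure reported by the paper):

  10^10 records × 100 bytes = 1 TB of input
  Grep: 1 TB / 150 s ≈ 6.7 GB/s of aggregate read bandwidth across the cluster
  Sort: the same terabyte is read, shuffled, sorted, and written 2-way replicated in 891 s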

Pros/Cons

• Pros
  • Nice abstraction
  • Resilient to failures
  • Lots of different extensions for different functionality
  • Fast for moderate amounts of data
• Cons
  • Doesn't scale to giant amounts of data (the master keeps O(M*R) state)
  • Lots of disk I/O
  • Each stage's batch must complete before the next stage starts
  • Not all problems can be expressed in terms of map and reduce

Further Work

• MapReduce Online
  • Pipelines the stages
• Google's Cloud Dataflow
  • "Simple, powerful model for building batch and streaming parallel data processing pipelines"
• Hadoop
  • Uses MapReduce alongside its own distributed storage (HDFS)
  • JobTracker and TaskTracker
• Piccolo
  • In-memory
• Dryad
  • Directed graph of vertices and channels