
contributed articles

doi:10.1145/1629175.1629198

Communications of the ACM 53, 1 (January 2010), 72–77.

MapReduce: A Flexible Data Processing Tool

by Jeffrey Dean and Sanjay Ghemawat

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

MapReduce is a programming model for processing and generating large data sets.4 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations.10,11

To help illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code like the following pseudocode:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word.

MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing required inter-machine communication. MapReduce allows programmers with no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use, and more than 100,000 MapReduce jobs are executed on Google's clusters every day.
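The pseudocode above can be traced end to end with a small, single-machine simulation. The following Java sketch is our own illustration, not the MapReduce API; the class name WordCountSim is invented, and the map, shuffle, and reduce phases are folded into in-memory data structures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    // Simulates the map phase: emit a (word, 1) pair for each word
    // in each document.
    static List<Map.Entry<String, Integer>> map(List<String> documents) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                if (!word.isEmpty()) {
                    intermediate.add(Map.entry(word, 1));
                }
            }
        }
        return intermediate;
    }

    // Simulates shuffle plus reduce: group the pairs by word,
    // then sum the counts for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> countWords(List<String> documents) {
        return reduce(map(documents));
    }
}
```

In the real system the intermediate (word, 1) pairs are partitioned across many machines between the two phases; the grouping performed here by `merge` is what the shuffle does at scale.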

Compared to Parallel Databases
The query languages built into parallel database systems are also used to express the type of computations supported by MapReduce. A 2009 paper by Andrew Pavlo et al. (referred to here as the "comparison paper"13) compared the performance of MapReduce and parallel databases. It evaluated the open source Hadoop implementation10 of the MapReduce programming model, DBMS-X (an unidentified commercial database system), and Vertica (a column-store database system from a company co-founded by one of the authors of the comparison paper). Earlier blog posts by some of the paper's authors characterized MapReduce as "a major step backwards."5,6 In this article, we address several misconceptions about MapReduce in these three publications:

- MapReduce cannot use indices and implies a full scan of all input data;
- MapReduce input and outputs are always simple files in a file system; and
- MapReduce requires the use of inefficient textual data formats.

We also discuss other important issues:

- MapReduce is storage-system independent and can process data without first requiring it to be loaded into a database. In many cases, it is possible to run 50 or more separate MapReduce analyses in complete passes over the data before it is possible to load the data into a database and complete a single analysis;
- Complicated transformations are often easier to express in MapReduce than in SQL; and
- Many conclusions in the comparison paper were based on implementation and evaluation shortcomings not fundamental to the MapReduce model; we discuss these shortcomings later in this article.

We encourage readers to read the original MapReduce paper4 and the comparison paper13 for more context.

Heterogenous Systems
Many production environments contain a mix of storage systems. Customer data may be stored in a relational database, and user requests may be logged to a file system. Furthermore, as such environments evolve, data may migrate to new storage systems. MapReduce provides a simple model for analyzing data in such heterogenous systems. End users can extend MapReduce to support a new storage system by defining simple reader and writer implementations that operate on the storage system. Examples of supported storage systems are files stored in distributed file systems,7 database query results,2,9 data stored in Bigtable,3 and structured input files (such as B-trees). A single MapReduce operation easily processes and combines data from a variety of storage systems.

Now consider a system in which a parallel DBMS is used to perform all data analysis. The input to such analysis must first be copied into the parallel DBMS. This loading phase is inconvenient. It may also be unacceptably slow, especially if the data will be analyzed only once or twice after being loaded. For example, consider a batch-oriented Web-crawling-and-indexing system that fetches a set of Web pages and generates an inverted index. It seems awkward and inefficient to load the set of fetched pages into a database just so they can be read through once to generate an inverted index. Even if the cost of loading the input into a parallel DBMS is acceptable, we still need an appropriate loading tool. Here is another place MapReduce can be used; instead of writing a custom loader with its own ad hoc parallelization and fault-tolerance support, a simple MapReduce program can be written to load the data into the parallel DBMS.
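The reader extension point described above can be pictured as a small interface. The names below (`RecordReader`, `Readers`, `fromLines`) are our own illustration, not an actual MapReduce or Hadoop API; the point is only that any storage system able to yield key/value records can feed a MapReduce with no loading step:

```java
import java.util.List;

// A minimal sketch of a pluggable input source: anything that can yield
// key/value records can serve as MapReduce input.
interface RecordReader<K, V> {
    boolean next();   // advance to the next record; false at end of input
    K key();
    V value();
}

public class Readers {
    // Adapts an in-memory list of lines (standing in for a log file) to the
    // reader interface; a database or Bigtable reader would look identical
    // from the MapReduce runtime's point of view.
    public static RecordReader<Long, String> fromLines(List<String> lines) {
        return new RecordReader<Long, String>() {
            private int pos = -1;
            public boolean next() { return ++pos < lines.size(); }
            public Long key()     { return (long) pos; }
            public String value() { return lines.get(pos); }
        };
    }
}
```

A writer implementation is the mirror image: a sink that accepts key/value records, so a MapReduce can also write its output directly into, say, a parallel DBMS.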
Indices
The comparison paper incorrectly said that MapReduce cannot take advantage of pregenerated indices, leading to skewed benchmark results in the paper. For example, consider a large data set partitioned into a collection of nondistributed databases, perhaps using a hash function. An index can be added to each database, and the result of running a database query using this index can be used as an input to MapReduce. If the data is stored in D database partitions, we will run D database queries that will become the D inputs to the MapReduce execution. Indeed, some of the authors of Pavlo et al. have pursued this approach in their more recent work.1

Another example of the use of indices is a MapReduce that reads from Bigtable. If the data needed maps to a sub-range of the Bigtable row space, we would need to read only that sub-range instead of scanning the entire table. Furthermore, like Vertica and other column-store databases, we will read data only from the columns needed for this analysis, since Bigtable can store data segregated by columns.

Yet another example is the processing of log data within a certain date range; see the Join task discussion in the comparison paper, where the Hadoop benchmark reads through 155 million records to process the 134,000 records that fall within the date range of interest. Nearly every logging system we are familiar with rolls over to a new log file periodically and embeds the rollover time in the name of each log file. Therefore, we can easily run a MapReduce operation over just the log files that may potentially overlap the specified date range, instead of reading all log files.
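The log-file example can be made concrete. Assuming a hypothetical naming convention where each file name embeds its rollover date, such as `access.2000-01-17.log`, a job can select only the files that may overlap the query's date range before any record is read (a simplified sketch; a production job would widen the window to cover records that span a rollover):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFileFilter {
    private static final Pattern DATE = Pattern.compile("(\\d{4}-\\d{2}-\\d{2})");

    // Returns only the log files whose embedded rollover date falls inside
    // [start, end]; every other file is skipped without ever being opened.
    public static List<String> select(List<String> fileNames,
                                      LocalDate start, LocalDate end) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            Matcher m = DATE.matcher(name);
            if (m.find()) {
                LocalDate d = LocalDate.parse(m.group(1));
                if (!d.isBefore(start) && !d.isAfter(end)) {
                    selected.add(name);
                }
            }
        }
        return selected;
    }
}
```

Applied to the Join task above, this kind of filename filter is why a MapReduce need not read 155 million records to find the 134,000 in range.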
Complex Functions
Map and Reduce functions are often fairly simple and have straightforward SQL equivalents. However, in many cases, especially for Map functions, the function is too complicated to be expressed easily in a SQL query, as in the following examples:

- Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document;
- Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth;
- Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries;
- Processing all road segments in the world and rendering map tile images that display these segments for Google Maps; and
- Fault-tolerant parallel execution of programs written in higher-level languages (such as Sawzall14 and Pig Latin12) across a collection of input data.

Conceptually, such user-defined functions (UDFs) can be combined with SQL queries, but the experience reported in the comparison paper indicates that UDF support is either buggy (in DBMS-X) or missing (in Vertica). These concerns may go away over the long term, but for now, MapReduce is a better framework for doing more complicated tasks (such as those listed earlier) than the selection and aggregation that are SQL's forte.
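The first of these examples gives a feel for a Map function that is awkward in SQL but short in a general-purpose language. Here is a single-machine sketch of our own (the regex-based link extraction is a deliberate simplification of real HTML parsing, and the map and reduce steps are folded together for brevity):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkCounter {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Map phase: emit (target, 1) for every outgoing link in every document.
    // Reduce phase: sum the counts per target document.
    public static Map<String, Integer> inDegrees(Iterable<String> htmlDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : htmlDocs) {
            Matcher m = HREF.matcher(doc);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Expressing the extraction half of this in pure SQL requires either string-manipulation gymnastics or a UDF, which is exactly where the comparison paper reported buggy or missing support.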

Structured Data and Schemas
Pavlo et al. did raise a good point that schemas are helpful in allowing multiple applications to share the same data. For example, consider the following schema from the comparison paper:

    CREATE TABLE Rankings (
      pageURL VARCHAR(100) PRIMARY KEY,
      pageRank INT,
      avgDuration INT );

The corresponding Hadoop benchmarks in the comparison paper used an inefficient and fragile textual format with different attributes separated by vertical-bar characters:

    137|http://www.somehost.com/index.html|602

In contrast to ad hoc, inefficient formats, virtually all MapReduce operations at Google read and write data in the Protocol Buffer format.8 A high-level language describes the input and output types, and compiler-generated code is used to hide the details of encoding/decoding from application code. The corresponding protocol buffer description for the Rankings data would be:

    message Rankings {
      required string pageurl = 1;
      required int32 pagerank = 2;
      required int32 avgduration = 3;
    }

The following Map function fragment processes a Rankings record:

    Rankings r = new Rankings();
    r.parseFrom(value);
    if (r.getPagerank() > 10) { ... }

The protocol buffer framework allows types to be upgraded (in constrained ways) without requiring existing applications to be changed (or even recompiled or rebuilt). This level of schema support has proved sufficient for allowing thousands of Google engineers to share the same evolving data types.

Furthermore, the implementation of protocol buffers uses an optimized binary representation that is more compact and much faster to encode and decode than the textual formats used by the Hadoop benchmarks in the comparison paper. For example, the automatically generated code to parse a Rankings protocol buffer record runs in 20 nanoseconds per record, as compared to the 1,731 nanoseconds required per record to parse the textual input format used in the Hadoop benchmark mentioned earlier. These measurements were obtained on a JVM running on a 2.4GHz Intel Core 2 Duo. The Java code fragments used for the benchmark runs were:

    // Fragment 1: protocol buffer parsing
    for (int i = 0; i < numIterations; i++) {
      rankings.parseFrom(value);
      pagerank = rankings.getPagerank();
    }

    // Fragment 2: text format parsing (extracted from Benchmark1.java,
    // from the source code posted by Pavlo et al.)
    for (int i = 0; i < numIterations; i++) {
      String data[] = value.toString().split("\\|");
      pagerank = Integer.valueOf(data[0]);
    }

Given the 80-fold difference in this record-parsing benchmark, we suspect the absolute numbers for the Hadoop benchmarks in the comparison paper are inflated and cannot be used to reach conclusions about fundamental differences in the performance of MapReduce and parallel DBMS.

Fault Tolerance
The MapReduce implementation uses a pull model for moving data between mappers and reducers, as opposed to a push model where mappers write directly to reducers. Pavlo et al. correctly pointed out that the pull model can result in the creation of many small files and many disk seeks to move data between mappers and reducers. Implementation tricks like batching, sorting, and grouping of intermediate data and smart scheduling of reads are used by Google's MapReduce implementation to mitigate these costs.
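The source of the many small files is easy to see: each of M map tasks splits its intermediate output into one bucket per reduce task, so a naive shuffle writes on the order of M*R separate spill files. A toy partitioner makes the bookkeeping concrete (illustrative only; the class name and method shapes are our own, not Google's or Hadoop's API):

```java
import java.util.ArrayList;
import java.util.List;

public class Partitioner {
    // Assigns an intermediate key to one of R reduce partitions. With M map
    // tasks each producing R such partitions, a naive shuffle creates M*R
    // files -- the cost that batching, sorting, and grouping amortize.
    public static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Splits one mapper's intermediate keys into R per-reducer buckets;
    // each bucket would become one spill file in the naive implementation.
    public static List<List<String>> bucketize(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partition(k, numReducers)).add(k);
        }
        return buckets;
    }
}
```

Because the partition function is deterministic, every mapper routes the same key to the same reducer, which is what makes the per-key grouping of the reduce phase possible.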


MapReduce implementations tend not to use a push model due to the fault-tolerance properties required by Google's developers. Most MapReduce executions over large data sets encounter at least a few failures; apart from hardware and software problems, Google's cluster scheduling system can preempt MapReduce tasks by killing them to make room for higher-priority tasks. In a push model, failure of a reducer would force re-execution of all Map tasks.

We suspect that as data sets grow larger, analyses will require more computation, and fault tolerance will become more important. There are already more than a dozen distinct data sets at Google more than 1PB in size, and dozens more hundreds of TBs in size, that are processed daily using MapReduce. Outside of Google, many users listed on the Hadoop users list11 are handling data sets of multiple hundreds of terabytes or more. Clearly, as data sets continue to grow, more users will need a fault-tolerant system like MapReduce that can be used to process these large data sets efficiently and effectively.

Performance
Pavlo et al. compared the performance of the Hadoop MapReduce implementation to two database implementations; here, we discuss the performance differences of the various systems:

Engineering considerations. Startup overhead and sequential scanning speed are indicators of maturity of implementation and engineering tradeoffs, not fundamental differences in programming models. These differences are certainly important but can be addressed in a variety of ways. For example, startup overhead can be addressed by keeping worker processes live, waiting for the next MapReduce invocation, an optimization added more than a year ago to Google's MapReduce implementation.

Google has also addressed sequential scanning performance with a variety of performance optimizations by, for example, using an efficient binary-encoding format for structured data (protocol buffers) instead of inefficient textual formats.

Reading unnecessary data. The comparison paper says, "MR is always forced to start a query with a scan of the entire input file." MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification. Examples of input specifications are:

- All records in a set of files;
- All records with a visit-date in the range [2000-01-15..2000-01-22]; and
- All data in Bigtable table T whose "language" column is "Turkish."

The input may require a full scan over a set of files, as Pavlo et al. suggested, but alternate implementations are often used. For example, the input may be a database with an index that provides efficient filtering or an indexed file structure (such as daily log files used for efficient date-based filtering of log data).

This mistaken assumption about MapReduce affects three of the five benchmarks in the comparison paper (the selection, aggregation, and join tasks) and invalidates the conclusions in the paper about the relative performance of MapReduce and parallel databases.

Merging results. The measurements of Hadoop in all five benchmarks in the comparison paper included the cost of a final phase to merge the results of the initial MapReduce into one file. In practice, this merging is unnecessary, since the next consumer of MapReduce output is usually another MapReduce that can easily operate over the set of files produced by the first MapReduce, instead of requiring a single merged input. Even if the consumer is not another MapReduce, the reducer processes in the initial MapReduce can write directly to a merged destination (such as a Bigtable or parallel database table).

Data loading. The DBMS measurements in the comparison paper demonstrated the high cost of loading input data into a database before it is analyzed. For many of the benchmarks in the comparison paper, the time needed to load the input data into a parallel database is five to 50 times the time needed to analyze the data via Hadoop. Put another way, for some of the benchmarks, starting with data in a collection of files on disk, it is possible to run 50 separate MapReduce analyses over the data before it is possible to load the data into a database and complete a single analysis. Long load times may not matter if many queries will be run on the data after loading, but this is often not the case; data sets are often generated, processed once or twice, and then discarded. For example, the Web-search index-building system described in the MapReduce paper4 is a sequence of MapReduce phases where the output of most phases is consumed by one or two subsequent MapReduce phases.
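To make the "no full scan required" point under "Reading unnecessary data" concrete, here is a toy version of the third input specification: selecting rows of a simulated table whose "language" column is "Turkish." The names and row representation are our own; the point is that an input-interface implementation backed by an index could yield exactly these rows without touching the rest of the table:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TableScanSpec {
    // A row is a map of column name -> value, standing in for a Bigtable row.
    // The specification admits only rows matching the column predicate; an
    // indexed implementation can skip non-matching rows entirely, so the
    // full scan shown here is one possible implementation, not a requirement.
    public static List<Map<String, String>> matching(
            List<Map<String, String>> rows, String column, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (value.equals(row.get(column))) {
                out.add(row);
            }
        }
        return out;
    }
}
```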

Conclusion
The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems. In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis. However, a few useful lessons can be drawn from this discussion:

Startup latency. MapReduce implementations should strive to reduce startup latency by using techniques like worker processes that are reused across different invocations;

Data shuffling. Careful attention must be paid to the implementation of the data-shuffling phase to avoid generating O(M*R) seeks in a MapReduce with M map tasks and R reduce tasks;

Textual formats. MapReduce users should avoid using inefficient textual formats;

Natural indices. MapReduce users should take advantage of natural indices (such as timestamps in log file names) whenever possible; and

Unmerged output. Most MapReduce output should be left unmerged, since there is no benefit to merging if the next consumer is another MapReduce program.

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogenous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.

References
1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases (Lyon, France, 2009); http://db.cs.yale.edu/hadoopdb/
2. Aster Data Systems, Inc. In-Database MapReduce for Rich Analytics; http://www.asterdata.com/product/.php
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating System Design and Implementation (Seattle, WA, Nov. 6–8). Usenix Association, 2006; http://labs.google.com/papers/bigtable.html
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, CA, Dec. 6–8). Usenix Association, 2004; http://labs.google.com/papers/mapreduce.html
5. DeWitt, D. and Stonebraker, M. MapReduce: A Major Step Backwards blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
6. DeWitt, D. and Stonebraker, M. MapReduce II blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
7. Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 19–22). ACM Press, New York, 2003; http://labs.google.com/papers/gfs.html
8. Google. Protocol Buffers: Google's Data Interchange Format. Documentation and open source release; http://code.google.com/p/protobuf/
9. Greenplum. Greenplum MapReduce: Bringing Next-Generation Analytics Technology to the Enterprise; http://www.greenplum.com/resources/mapreduce/
10. Hadoop. Documentation and open source release; http://hadoop.apache.org/core/
11. Hadoop. Users list; http://wiki.apache.org/hadoop/PoweredBy
12. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Auckland, New Zealand, June 2008); http://hadoop.apache.org/pig/
13. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference (Providence, RI, June 29–July 2). ACM Press, New York, 2009; http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
14. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13, 4, 227–298; http://labs.google.com/papers/sawzall.html

Jeffrey Dean ([email protected]) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.

Sanjay Ghemawat ([email protected]) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.

© 2010 ACM 0001-0782/10/0100 $10.00
