MapReduce: A Flexible Data Processing Tool
contributed articles
DOI:10.1145/1629175.1629198
Communications of the ACM | January 2010 | Vol. 53 | No. 1

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

BY JEFFREY DEAN AND SANJAY GHEMAWAT

MAPREDUCE IS A programming model for processing and generating large data sets.4 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations.10,11

To help illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code like the following pseudocode:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.

MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing required inter-machine communication. MapReduce allows programmers with no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use, and more than 100,000 MapReduce jobs are executed on Google's clusters every day.

Compared to Parallel Databases
The query languages built into parallel database systems are also used to express the type of computations supported by MapReduce. A 2009 paper by Andrew Pavlo et al. (referred to here as the "comparison paper"13) compared the performance of MapReduce and parallel databases. It evaluated the open source Hadoop implementation10 of the MapReduce programming model, DBMS-X (an unidentified commercial database system), and Vertica (a column-store database system from a company co-founded by one of the authors of the comparison paper). Earlier blog posts by some of the paper's authors characterized MapReduce as "a major step backwards."5,6 In this article, we address several misconceptions about MapReduce in these three publications:

˲ MapReduce cannot use indices and implies a full scan of all input data;
˲ MapReduce input and outputs are always simple files in a file system; and
˲ MapReduce requires the use of inefficient textual data formats.

We also discuss other important issues:

˲ MapReduce is storage-system independent and can process data without first requiring it to be loaded into a database. In many cases, it is possible to run 50 or more separate MapReduce analyses in complete passes over the data before it is possible to load the data into a database and complete a single analysis;
˲ Complicated transformations are often easier to express in MapReduce than in SQL; and
˲ Many conclusions in the comparison paper were based on implementation and evaluation shortcomings not fundamental to the MapReduce model; we discuss these shortcomings later in this article.

We encourage readers to read the original MapReduce paper4 and the comparison paper13 for more context.

Heterogeneous Systems
Many production environments contain a mix of storage systems. Customer data may be stored in a relational database, and user requests may be logged to a file system. Furthermore, as such environments evolve, data may migrate to new storage systems. MapReduce provides a simple model for analyzing data in such heterogeneous systems. End users can extend MapReduce to support a new storage system by defining simple reader and writer implementations that operate on the storage system. Examples of supported storage systems are files stored in distributed file systems,7 database query results,2,9 data stored in Bigtable,3 and structured input files (such as B-trees). A single MapReduce operation easily processes and combines data from a variety of storage systems.

Now consider a system in which a parallel DBMS is used to perform all data analysis. The input to such analysis must first be copied into the parallel DBMS. This loading phase is inconvenient. It may also be unacceptably slow, especially if the data will be analyzed only once or twice after being loaded. For example, consider a batch-oriented Web-crawling-and-indexing system that fetches a set of Web pages and generates an inverted index. It seems awkward and inefficient to load the set of fetched pages into a database just so they can be read through once to generate an inverted index. Even if the cost of loading the input into a parallel DBMS is acceptable, we still need an appropriate loading tool. Here is another place MapReduce can be used; instead of writing a custom loader with its own ad hoc parallelization and fault-tolerance support, a simple MapReduce program can be written to load the data into the parallel DBMS.

Indices
The comparison paper incorrectly said that MapReduce cannot take advantage of pregenerated indices, leading to skewed benchmark results in the paper. For example, consider a large data set partitioned into a collection of nondistributed databases, perhaps using a hash function. An index can be added to each database, and the result of running a database query using this index can be used as an input to MapReduce. If the data is stored in D database partitions, we will run D database queries that will become the D inputs to the MapReduce execution. Indeed, some of the authors of Pavlo et al. have pursued this approach in their more recent work.11

Another example of the use of indices is a MapReduce that reads from Bigtable. If the data needed maps to a sub-range of the Bigtable row space, we would need to read only that sub-range instead of scanning the entire Bigtable. Furthermore, like Vertica and other column-store databases, we will read data only from the columns needed for this analysis, since Bigtable can store data segregated by columns.

Yet another example is the processing of log data within a certain date range; see the Join task discussion in the comparison paper, where the Hadoop benchmark reads through 155 million records to process the 134,000 records that fall within the date range of interest. Nearly every logging system we are familiar with rolls over to a new log file periodically and embeds the rollover time in the name of each log file. Therefore, we can easily run a MapReduce operation over just the log files that may potentially overlap the specified date range, instead of reading all log files.

Complex Functions
Map and Reduce functions are often fairly simple and have straightforward SQL equivalents. However, in many cases, especially for Map functions, the function is too complicated to be expressed easily in a SQL query, as in the following examples:

˲ Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document;
˲ Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth;
˲ Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries;
˲ Processing all road segments in the world and rendering map tile images that display these segments for Google Maps; and
˲ Fault-tolerant parallel execution of programs written in higher-level languages (such as Sawzall14 and Pig Latin12) across a collection of input data.

Conceptually, such user-defined functions (UDFs) can be combined with SQL queries, but the experience reported in the comparison paper indicates that UDF support is either buggy (in DBMS-X) or missing (in Vertica). These concerns may go away over the long term, but for now, MapReduce is a better framework for doing more complicated tasks (such as those listed earlier) than the selection and aggregation that are SQL's forte.

Structured Data and Schemas
Pavlo et al. did raise a good point that schemas are helpful in allowing multiple applications to share the same data. For example, consider the following schema from the comparison paper:

    CREATE TABLE Rankings (
      pageURL VARCHAR(100) PRIMARY KEY,

The implementation of protocol buffers uses an optimized binary representation that is more compact and much faster to encode and decode than the textual formats used by the Hadoop benchmarks in the comparison paper. For example, the automatically generated code to parse a Rankings protocol buffer record runs in 20 nanoseconds per record as compared to the 1,731 nanoseconds required per record to parse the textual input format used in the Hadoop benchmark mentioned earlier.
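The word-count example described earlier can be illustrated with a minimal, single-process simulation of the map, shuffle, and reduce phases. This is only a sketch of the programming model's semantics (Python is used for illustration; a real MapReduce run distributes each phase across many machines and streams data rather than holding it in memory):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    # key: a word; values: a list of counts
    return str(sum(int(v) for v in values))

def mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by intermediate key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    # Reduce phase: one reduce call per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

docs = {"d1": "the quick fox", "d2": "the lazy dog"}
counts = mapreduce(docs, map_fn, reduce_fn)
# counts["the"] == "2"; every other word maps to "1"
```

The framework, not the user, owns the grouping step in the middle; the user supplies only `map_fn` and `reduce_fn`, which is why the model parallelizes so mechanically.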
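The pluggable reader idea from the Heterogeneous Systems discussion can be sketched as follows. The interface and class names here are hypothetical illustrations, not Google's actual API: the point is only that the framework iterates over (key, value) records and is oblivious to whether they came from a flat file, a database query result, or a table scan:

```python
class RecordReader:
    """Hypothetical reader interface: yields (key, value) records."""
    def records(self):
        raise NotImplementedError

class TextFileReader(RecordReader):
    # Reads line-oriented data; key = line number, value = line text.
    def __init__(self, lines):
        self.lines = lines
    def records(self):
        for i, line in enumerate(self.lines):
            yield (i, line)

class KeyValueStoreReader(RecordReader):
    # Reads rows from an in-memory stand-in for a table scan.
    def __init__(self, table):
        self.table = table
    def records(self):
        yield from sorted(self.table.items())

def run_map(readers, map_fn):
    # The map phase consumes records without caring about their origin.
    out = []
    for reader in readers:
        for key, value in reader.records():
            out.extend(map_fn(key, value))
    return out

pairs = run_map(
    [TextFileReader(["to be"]), KeyValueStoreReader({"row1": "or not"})],
    lambda key, value: [(w, 1) for w in value.split()],
)
# pairs == [("to", 1), ("be", 1), ("or", 1), ("not", 1)]
```

Supporting a new storage system means writing one more `RecordReader` subclass; the map and reduce code is unchanged, which is the storage-system independence the article emphasizes.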
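The log-rollover point in the Indices section can also be made concrete. Assuming a hypothetical naming scheme in which each log file embeds its rollover date as log-YYYYMMDD, a driver can prune the input set to files that may overlap the query range before a single record is read (a production system would widen the window by one rollover period to catch records near the boundaries):

```python
import re

def select_log_files(filenames, start, end):
    """Keep only log files whose embedded date falls in [start, end].

    Assumes the hypothetical log-YYYYMMDD naming scheme; zero-padded
    YYYYMMDD strings compare correctly as plain strings.
    """
    pattern = re.compile(r"log-(\d{8})$")
    selected = []
    for name in filenames:
        m = pattern.search(name)
        if m and start <= m.group(1) <= end:
            selected.append(name)
    return selected

files = ["log-20091229", "log-20091230", "log-20100101", "log-20100115"]
# Only files that can contain records from the first week of 2010:
print(select_log_files(files, "20100101", "20100107"))
# → ['log-20100101']
```

This is filename-level index use: the date range prunes whole files, so the MapReduce reads a small fraction of the log data rather than scanning every record, which is exactly the optimization the comparison paper's Join benchmark omitted.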