A Bit More Parallelism
COS 326
David Walker
Princeton University

slides copyright 2013-2015 David Walker and Andrew W. Appel
permission granted to reuse these slides for non-commercial educational purposes

Last Time: Parallel Collections

The parallel sequence abstraction is powerful:
• tabulate
• nth
• length
• map
• split
• treeview
• scan
  – used to implement prefix-sum
  – clever 2-phase implementation
  – used to implement filters
• sorting

PARALLEL COLLECTIONS IN THE "REAL WORLD"

Big Data

If Google wants to index all the web pages (or images or gmails or google docs or ...) in the world, they have a lot of work to do
• Same with Facebook for all the facebook pages/entries
• Same with Twitter
• Same with Amazon
• Same with ...

Many of these tasks come down to map, filter, fold, reduce, scan

Google Map-Reduce

Google MapReduce (2004): a fault tolerant, massively parallel functional programming paradigm
– based on our friends "map" and "reduce"

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.

Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

[...] given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple compu- [...]
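
The programming model in the abstract (a map function that emits intermediate key/value pairs, and a reduce function that merges all values sharing a key) can be sketched on a single machine. This is only an illustrative sketch in Python, not Google's implementation: it has no distribution and no fault tolerance, and the names `map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical, chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def word_count_map(doc_name: str, text: str) -> Iterable[Tuple[str, int]]:
    """Map function: emit the intermediate pair (word, 1) for every word."""
    for word in text.lower().split():
        yield (word, 1)


def word_count_reduce(word: str, counts: List[int]) -> int:
    """Reduce function: merge all values emitted for the same key."""
    return sum(counts)


def map_reduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Sequential stand-in for the map, group-by-key, and reduce phases."""
    # Map phase: apply map_fn to each input pair and group the
    # intermediate values by their intermediate key.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Reduce phase: each key is reduced independently of the others,
    # which is what would let a real system run the reductions in parallel.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


# Hypothetical input: (document name, document contents) pairs.
docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
counts = map_reduce(docs, word_count_map, word_count_reduce)
```

Because `word_count_map` looks at one document at a time and `word_count_reduce` looks at one key at a time, neither depends on shared mutable state; that independence is what the slides' "map" and "reduce" rely on for parallelism.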