Data Science: Bigtable, MapReduce and Google File System


International Journal of Computer Trends and Technology (IJCTT) – volume 16 number 3 – Oct 2014

Data Science: Bigtable, MapReduce and Google File System

Karan B. Maniar¹, Chintan B. Khatri²
¹,²Atharva College of Engineering, University of Mumbai, Mumbai, India

Abstract

Data science is the practice of extending research findings and drawing conclusions from data[1]. Bigtable is built on several Google technologies[2]. MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster[3]. The Google File System is designed to provide efficient, reliable access to data using large clusters of commodity hardware[4]. This paper discusses Bigtable, MapReduce and the Google File System, and briefly reviews the top 10 algorithms in data mining.

Keywords— Data Science, Bigtable, MapReduce, Google File System, Top 10 algorithms in data mining.

1. INTRODUCTION

Bigtable is a distributed storage system for managing structured data that is designed to scale to an immense size. The growth of the Internet has resulted in the establishment and wide usage of many Internet-based applications[5]. Google started the development of Bigtable in 2003 to address the demands of those applications; applications that require a large storage volume use Bigtable[6].

MapReduce is the programming model used for processing and generating large data sets. It is inspired by the map and reduce functions that are ubiquitous in functional programming. It is easy to use, although MapReduce programs are not always fast[7].

The Google File System (GFS) is a scalable distributed file system for large, data-intensive applications. Its two primary advantages are that the hardware on which it runs is inexpensive and that it delivers high performance. GFS is a vital tool that empowers Google to keep innovating and tackling problems that span the whole Internet[8].

2. BIGTABLE

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. This scale is too large for most commercial databases, and handling it with them would be very costly. Bigtable provides a simple data model and supports dynamic control over data layout and format. It is scalable and self-managing at the same time. Wide applicability, scalability, high performance, and high availability are some of the goals that Bigtable has achieved. Numerous Google products use Bigtable, e.g. Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. Bigtable maps two arbitrary string values, a row key and a column key, together with a timestamp to an associated arbitrary byte array, which makes the map three-dimensional[6]. The Bigtable API provides many functions: single-row transactions are supported, cells can be used as integer counters, and tables and column families can be created and deleted. A Bigtable is not a relational database. One of the dimensions of each table is a time field, which permits versioning and garbage collection. Bottlenecks and inefficiencies can be eliminated as they emerge[5].
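To make the three-dimensional map concrete, the following is a minimal Python sketch of the (row key, column key, timestamp) -> value model described above. The TinyBigtable class and its method names are purely illustrative assumptions, not part of any real Bigtable client API.

import time

class TinyBigtable:
    # Toy model of Bigtable's map: (row key, column key, timestamp) -> value.
    # Illustrative sketch only: real Bigtable adds column families, tablets,
    # and garbage-collection policies on top of this basic map.
    def __init__(self):
        self.cells = {}  # {(row, column): {timestamp: value}}

    def put(self, row, column, value, ts=None):
        # Each write is stored under its own timestamp (versioning).
        ts = ts if ts is not None else time.time_ns()
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, ts=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if ts is None:
            return versions[max(versions)]           # newest version
        older = [t for t in versions if t <= ts]     # newest at or before ts
        return versions[max(older)] if older else None

table = TinyBigtable()
table.put("com.cnn.www", "contents:", b"<html>v1</html>")
table.put("com.cnn.www", "contents:", b"<html>v2</html>")
print(table.get("com.cnn.www", "contents:"))  # b'<html>v2</html>'

Keeping old versions under their timestamps is what allows the garbage collection mentioned above: a policy such as "keep the last three versions" can simply prune the per-cell timestamp map.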
3. MAPREDUCE

The term MapReduce refers to the two distinct tasks that Hadoop programs perform: the map job and the reduce job. The map job takes a set of data and converts it into another set of data in which individual elements are broken down into tuples. The reduce job then takes the output of the map job as input and combines those data tuples into a smaller set of tuples. As the name suggests, the map job always precedes the reduce job[9]. There have been many improvements over the TeraSort record runs, among them a 30%+ faster shuffle and better merging. MapReduce also makes it possible to reuse task slots. It is also used for medical image analysis, because modern imaging techniques such as 3D and 4D produce large files that require a large amount of storage and network bandwidth. The word count application is a common example of MapReduce.
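As an illustration of the word count example mentioned above, here is a minimal Python sketch that mimics the MapReduce contract without any Hadoop machinery: the map job emits (word, 1) tuples and the reduce job sums them per word. The function names map_phase and reduce_phase are illustrative assumptions.

from collections import defaultdict

def map_phase(document):
    # Map job: break the input into tuples, emitting (word, 1) per word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce job: combine the map output into a smaller set of tuples,
    # one (word, total count) entry per distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["the map job precedes the reduce job", "map then reduce"]
mapped = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(mapped))
# {'the': 2, 'map': 2, 'job': 2, 'precedes': 1, 'reduce': 2, 'then': 1}

In a real Hadoop job, the framework shuffles and groups the map output by key between the two phases and runs many map and reduce tasks in parallel across the cluster.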
4. GOOGLE FILE SYSTEM

The Google File System is closed-source software developed by Google; the codename of its newer version is Colossus. Its interface is a familiar file system interface. Since there are hundreds of servers in a GFS cluster, a few of them may be unavailable at any given time. To keep the system functioning, two strategies are used: fast recovery and replication. In fast recovery, the master and the chunkservers restore their state and start immediately, irrespective of how they were terminated. Replication is of two types: chunk replication and master replication. In chunk replication, every chunk is replicated on many chunkservers on different racks. In master replication, the master state is replicated for dependability[10].

5. TOP 10 ALGORITHMS IN DATA MINING[11]

A. C4.5
C4.5 generates decision trees, which can be used for classification; it is also referred to as a statistical classifier.

B. The k-means algorithm
The k-means algorithm is used to partition a given set of data into k clusters. It is an iterative procedure, and the number of clusters k is decided by the user.

C. Support vector machines
Support vector machines cope well with errors during execution. They perform two operations: analysing data and recognizing patterns. Their two primary uses are classification and estimating relationships among variables.

D. The Apriori algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases.

E. The EM algorithm
EM stands for expectation–maximization, an iterative algorithm. It is used to find maximum likelihood estimates in statistical models that depend on variables that are not directly observed.

F. PageRank
PageRank is a search ranking algorithm that uses the hyperlinks on the Web.

G. AdaBoost
The AdaBoost algorithm combines many simple classifiers and treats them as a whole rather than as single entities. It is very simple and has a great ability to make accurate predictions.

H. kNN: k-nearest neighbour classification
kNN is used for classification and regression. An unlabelled query vector is assigned the label that is most frequent among the k training samples nearest to that query point, where k is a user-defined number.

I. Naive Bayes
Using Bayes' theorem, we can find the probability of an event given that another event has already occurred.

J. CART: Classification and Regression Trees
CART is a method for building classification and regression trees. They are used for predicting continuous dependent variables (regression) and categorical variables (classification).

LITERATURE SURVEY

1. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Discusses PageRank, a method for rating web pages objectively and mechanically.
2. 2001. Random Forests. Shows that random forests are an effective tool in prediction.
3. 2003. The Google File System. Shows that the Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.
4. 2004. MapReduce: Simplified Data Processing on Large Clusters. Shows that the MapReduce programming model has been successfully used at Google for many different purposes.
5. 2004. A Storage Architecture for Early Discard in Interactive Search. Examines the concept of early discard for interactive search of unindexed data.
6. 2005. Interpreting the Data: Parallel Analysis with Sawzall. Presents a system for automating the parallel analysis of large data sets.
7. 2006. Integrating Compression and Execution in Column-Oriented Database Systems. Discusses how to extend C-Store, a column-oriented DBMS, with a compression sub-system.
8. 2006. Bigtable: A Distributed Storage System for Structured Data. Describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, along with the design and implementation of Bigtable.
9. 2007. Dynamo: Amazon's Highly Available Key-value Store. Presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience.
10. 2007. An Engineering Perspective. Describes selected algorithmic and engineering problems encountered, and the solutions found for them.
11. 2008. Top 10 Algorithms in Data Mining. Presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006.
12. 2009. ReadWrite, http://www.readwriteweb.com/enterprise/2009/02/is-therelational-database-doomed.php. Discusses the problems with relational databases.
13. 2010. Google Big Table. Describes the differences between Bigtable and relational databases, focusing on the different data models they use.
14. 2012. A Few Useful Things to Know about Machine Learning. Summarizes twelve key lessons that machine learning researchers and practitioners have learned.

CONCLUSION

Bigtable provides good performance and high availability. The MapReduce model is easy to use, problems are easily expressible in it, and its implementation scales to large clusters of machines. The Google File System is the technology that enables a very powerful general-purpose cluster. Thus, the basics of Bigtable, MapReduce and the Google File System were discussed in brief.

REFERENCES
1. en.wikipedia.org/wiki/Data_science
2. en.wikipedia.org/wiki/BigTable
3. en.wikipedia.org/wiki/MapReduce
4.