The Family of Mapreduce and Large Scale Data Processing Systems

Total Page:16

File Type:pdf, Size:1020Kb

The Family of Mapreduce and Large Scale Data Processing Systems The Family of MapReduce and Large Scale Data Processing Systems Sherif Sakr Anna Liu Ayman G. Fayoumi, NICTA and University of New NICTA and University of New King Abdulaziz University South Wales South Wales Jeddah, Saudia Arabia Sydney, Australia Sydney, Australia [email protected] [email protected] [email protected] ABSTRACT information. For example, Facebook collects 15 TeraBytes of data In the last two decades, the continuous increase of computational each day into a PetaByte-scale data warehouse [116]. Jim Gray, power has produced an overwhelming flow of data which has called called the shift a "fourth paradigm" [61]. The first three paradigms for a paradigm shift in the computing architecture and large scale were experimental, theoretical and, more recently, computational data processing mechanisms. MapReduce is a simple and power- science. Gray argued that the only way to cope with this paradigm ful programming model that enables easy development of scalable is to develop a new generation of computing tools to manage, vi- parallel applications to process vast amounts of data on large clus- sualize and analyze the data flood. In general, current computer ters of commodity machines. It isolates the application from the architectures are increasingly imbalanced where the latency gap details of running a distributed program such as issues on data dis- between multi-core CPUs and mechanical hard disks is growing tribution, scheduling and fault tolerance. However, the original im- every year which makes the challenges of data-intensive comput- plementation of the MapReduce framework had some limitations ing much harder to overcome [15]. Hence, there is a crucial need that have been tackled by many research efforts in several followup for a systematic and generic approach to tackle these problems with works after its introduction. This article provides a comprehensive an architecture that can also scale into the foreseeable future [110]. survey for a family of approaches and mechanisms of large scale In response, Gray argued that the new trend should instead focus on data processing mechanisms that have been implemented based on supporting cheaper clusters of computers to manage and process all the original idea of the MapReduce framework and are currently this data instead of focusing on having the biggest and fastest single gaining a lot of momentum in both research and industrial commu- computer. nities. We also cover a set of introduced systems that have been In general, the growing demand for large-scale data mining and implemented to provide declarative programming interfaces on top data analysis applications has spurred the development of novel of the MapReduce framework. In addition, we review several large solutions from both the industry (e.g., web-data analysis, click- scale data processing systems that resemble some of the ideas of stream analysis, network-monitoring log analysis) and the sciences the MapReduce framework for different purposes and application (e.g., analysis of data produced by massive-scale simulations, sen- scenarios. Finally, we discuss some of the future research direc- sor deployments, high-throughput lab equipment). Although par- allel database systems [40] serve some of these data analysis ap- tions for implementing the next generation of MapReduce-like so- 1 2 3 lutions. plications (e.g. Teradata , SQL Server PDW , Vertica , Green- plum4, ParAccel5, Netezza6), they are expensive, difficult to ad- minister and lack fault-tolerance for long-running queries [104]. 1. INTRODUCTION MapReduce [37] is a framework which is introduced by Google We live in the era of Big Data where we are witnessing a contin- for programming commodity computer clusters to perform large- uous increase on the computational power that produces an over- scale data processing in a single pass. The framework is designed whelming flow of data which has called for a paradigm shift in such that a MapReduce cluster can scale to thousands of nodes in the computing architecture and large scale data processing mech- a fault-tolerant manner. One of the main advantages of this frame- work is its reliance on a simple and powerful programming model. arXiv:1302.2966v1 [cs.DB] 13 Feb 2013 anisms. Powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology are putting massive vol- In addition, it isolates the application developer from all the com- umes of data into the hands of scientists. For example, the Large plex details of running a distributed program such as: issues on data Synoptic Survey Telescope [1] generates on the order of 30 Ter- distribution, scheduling and fault tolerance [103]. aBytes of data every day. Many enterprises continuously collect Recently, there has been a great deal of hype about cloud com- large datasets that record customer interactions, product sales, re- puting [10]. In principle, cloud computing is associated with a sults from advertising campaigns on the Web, and other types of new paradigm for the provision of computing infrastructure. This paradigm shifts the location of this infrastructure to more central- ized and larger scale datacenters in order to reduce the costs associ- 1http://teradata.com/ 2http://www.microsoft.com/sqlserver/en/us/solutions- technologies/data-warehousing/pdw.aspx 3http://www.vertica.com/ 4http://www.greenplum.com/ 5http://www.paraccel.com/ 6 . http://www-01.ibm.com/software/data/netezza/ ated with the management of hardware and software resources. In map(String key, String value): reduce(String key, Iterator values): particular, cloud computing has promised a number of advantages // key: document name // key: a word for hosting the deployments of data-intensive applications such as: // value: document contents // values: a list of counts • Reduced time-to-market by removing or simplifying the time- for each word w in value: int result = 0; EmitIntermediate(w, “1”); consuming hardware provisioning, purchasing and deploy- for each v in values: result += ParseInt(v); ment processes. Emit(AsString(result)); • Reduced monetary cost by following a pay-as-you-go busi- ness model. • Unlimited (virtually) throughput by adding servers if the work- Figure 1: An Example MapReduce Program [37]. load increases. In principle, the success of many enterprises often rely on their abil- ity to analyze expansive volumes of data. In general, cost-effective computation takes a set of key/value pairs input and produces a set processing of large datasets is a nontrivial undertaking. Fortunately, of key/value pairs as output. The user of the MapReduce frame- MapReduce frameworks and cloud computing have made it easier work expresses the computation using two functions: Map and Re- than ever for everyone to step into the world of big data. This tech- duce. The Map function takes an input pair and produces a set of nology combination has enabled even small companies to collect intermediate key/value pairs. The MapReduce framework groups and analyze terabytes of data in order to gain a competitive edge. together all intermediate values associated with the same interme- For example, the Amazon Elastic Compute Cloud (EC2)7 is offered diate key I and passes them to the Reduce function. The Reduce as a commodity that can be purchased and utilised. In addition, function receives an intermediate key I with its set of values and Amazon has also provided the Amazon Elastic MapReduce8 as an merges them together. Typically just zero or one output value is online service to easily and cost-effectively process vast amounts of produced per Reduce invocation. The main advantage of this model data without the need to worry about time-consuming set-up, man- is that it allows large computations to be easily parallelized and re- agement or tuning of computing clusters or the compute capacity executed to be used as the primary mechanism for fault tolerance. upon which they sit. Hence, such services enable third-parties to Figure 1 illustrates an example MapReduce program expressed in perform their analytical queries on massive datasets with minimum pseudo-code for counting the number of occurrences of each word effort and cost by abstracting the complexity entailed in building in a collection of documents. In this example, the map function and maintaining computer clusters. emits each word plus an associated count of occurrences while the The implementation of the basic MapReduce architecture had reduce function sums together all counts emitted for a particular some limitations. Therefore, several research efforts have been word. In principle, the design of the MapReduce framework has triggered to tackle these limitations by introducing several advance- considered the following main principles [126]: ments in the basic architecture in order to improve its performance. • Low-Cost Unreliable Commodity Hardware: Instead of us- This article provides a comprehensive survey for a family of ap- ing expensive, high-performance, reliable symmetric mul- proaches and mechanisms of large scale data analysis mechanisms tiprocessing (SMP) or massively parallel processing (MPP) that have been implemented based on the original idea of the MapRe- machines equipped with high-end network and storage sub- duce framework and are currently gaining a lot of momentum in systems, the MapReduce framework is designed to run on both research and industrial communities. In particular, the remain- large clusters of commodity hardware. This hardware is man- der of this article is organized as follows. Section 2 describes the aged and powered by open-source operating systems and util- basic architecture of the MapReduce framework. Section 3 dis- ities so that the cost is low. cusses several techniques that have been proposed to improve the • Extremely Scalable RAIN Cluster: Instead of using central- performance and capabilities of the MapReduce framework from ized RAID-based SAN or NAS storage systems, every MapRe- different perspectives. Section 4 covers several systems that sup- duce node has its own local off-the-shelf hard drives. These port a high level SQL-like interface for the MapReduce framework.
Recommended publications
  • Annual Report 2018
    Pakistan Telecommunication Company Limited Company Telecommunication Pakistan PTCL PAKISTAN ANNUAL REPORT 2018 REPORT ANNUAL /ptcl.official /ptclofficial ANNUAL REPORT Pakistan Telecommunication /theptclcompany Company Limited www.ptcl.com.pk PTCL Headquarters, G-8/4, Islamabad, Pakistan Pakistan Telecommunication Company Limited ANNUAL REPORT 2018 Contents 01COMPANY REVIEW 03FINANCIAL STATEMENTS CONSOLIDATED Corporate Vision, Mission & Core Values 04 Auditors’ Report to the Members 129-135 Board of Directors 06-07 Consolidated Statement of Financial Position 136-137 Corporate Information 08 Consolidated Statement of Profit or Loss 138 The Management 10-11 Consolidated Statement of Comprehensive Income 139 Operating & Financial Highlights 12-16 Consolidated Statement of Cash Flows 140 Chairman’s Review 18-19 Consolidated Statement of Changes in Equity 141 Group CEO’s Message 20-23 Notes to and Forming Part of the Consolidated Financial Statements 142-213 Directors’ Report 26-45 47-46 ہ 2018 Composition of Board’s Sub-Committees 48 Attendance of PTCL Board Members 49 Statement of Compliance with CCG 50-52 Auditors’ Review Report to the Members 53-54 NIC Peshawar 55-58 02STATEMENTS FINANCIAL Auditors’ Report to the Members 61-67 Statement of Financial Position 68-69 04ANNEXES Statement of Profit or Loss 70 Pattern of Shareholding 217-222 Statement of Comprehensive Income 71 Notice of 24th Annual General Meeting 223-226 Statement of Cash Flows 72 Form of Proxy 227 Statement of Changes in Equity 73 229 Notes to and Forming Part of the Financial Statements 74-125 ANNUAL REPORT 2018 Vision Mission To be the leading and most To be the partner of choice for our admired Telecom and ICT provider customers, to develop our people in and for Pakistan.
    [Show full text]
  • Apigee X Migration Offering
    Apigee X Migration Offering Overview Today, enterprises on their digital transformation journeys are striving for “Digital Excellence” to meet new digital demands. To achieve this, they are looking to accelerate their journeys to the cloud and revamp their API strategies. Businesses are looking to build APIs that can operate anywhere to provide new and seamless cus- tomer experiences quickly and securely. In February 2021, Google announced the launch of the new version of the cloud API management platform Apigee called Apigee X. It will provide enterprises with a high performing, reliable, and global digital transformation platform that drives success with digital excellence. Apigee X inte- grates deeply with Google Cloud Platform offerings to provide improved performance, scalability, controls and AI powered automation & security that clients need to provide un-parallel customer experiences. Partnerships Fresh Gravity is an official partner of Google Cloud and has deep experience in implementing GCP products like Apigee/Hybrid, Anthos, GKE, Cloud Run, Cloud CDN, Appsheet, BigQuery, Cloud Armor and others. Apigee X Value Proposition Apigee X provides several benefits to clients for them to consider migrating from their existing Apigee Edge platform, whether on-premise or on the cloud, to better manage their APIs. Enhanced customer experience through global reach, better performance, scalability and predictability • Global reach for multi-region setup, distributed caching, scaling, and peak traffic support • Managed autoscaling for runtime instance ingress as well as environments independently based on API traffic • AI-powered automation and ML capabilities help to autonomously identify anomalies, predict traffic for peak seasons, and ensure APIs adhere to compliance requirements.
    [Show full text]
  • Mapreduce: Simplified Data Processing On
    MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat [email protected], [email protected] Google, Inc. Abstract given day, etc. Most such computations are conceptu- ally straightforward. However, the input data is usually MapReduce is a programming model and an associ- large and the computations have to be distributed across ated implementation for processing and generating large hundreds or thousands of machines in order to finish in data sets. Users specify a map function that processes a a reasonable amount of time. The issues of how to par- key/value pair to generate a set of intermediate key/value allelize the computation, distribute the data, and handle pairs, and a reduce function that merges all intermediate failures conspire to obscure the original simple compu- values associated with the same intermediate key. Many tation with large amounts of complex code to deal with real world tasks are expressible in this model, as shown these issues. in the paper. As a reaction to this complexity, we designed a new Programs written in this functional style are automati- abstraction that allows us to express the simple computa- cally parallelized and executed on a large cluster of com- tions we were trying to perform but hides the messy de- modity machines. The run-time system takes care of the tails of parallelization, fault-tolerance, data distribution details of partitioning the input data, scheduling the pro- and load balancing in a library. Our abstraction is in- gram's execution across a set of machines, handling ma- spired by the map and reduce primitives present in Lisp chine failures, and managing the required inter-machine and many other functional languages.
    [Show full text]
  • Apache Hadoop Goes Realtime at Facebook
    Apache Hadoop Goes Realtime at Facebook Dhruba Borthakur Joydeep Sen Sarma Jonathan Gray Kannan Muthukkaruppan Nicolas Spiegelberg Hairong Kuang Karthik Ranganathan Dmytro Molkov Aravind Menon Samuel Rash Rodrigo Schmidt Amitanand Aiyer Facebook {dhruba,jssarma,jgray,kannan, nicolas,hairong,kranganathan,dms, aravind.menon,rash,rodrigo, amitanand.s}@fb.com ABSTRACT 1. INTRODUCTION Facebook recently deployed Facebook Messages, its first ever Apache Hadoop [1] is a top-level Apache project that includes user-facing application built on the Apache Hadoop platform. open source implementations of a distributed file system [2] and Apache HBase is a database-like layer built on Hadoop designed MapReduce that were inspired by Googles GFS [5] and to support billions of messages per day. This paper describes the MapReduce [6] projects. The Hadoop ecosystem also includes reasons why Facebook chose Hadoop and HBase over other projects like Apache HBase [4] which is inspired by Googles systems such as Apache Cassandra and Voldemort and discusses BigTable, Apache Hive [3], a data warehouse built on top of the applications requirements for consistency, availability, Hadoop, and Apache ZooKeeper [8], a coordination service for partition tolerance, data model and scalability. We explore the distributed systems. enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the At Facebook, Hadoop has traditionally been used in conjunction system, and how this solution has significant advantages over the with Hive for storage and analysis of large data sets. Most of this sharded MySQL database scheme used in other applications at analysis occurs in offline batch jobs and the emphasis has been on Facebook and many other web-scale companies.
    [Show full text]
  • New Zealand Reseller Update: June 2021 JUNE
    New Zealand Reseller Update: June 2021 JUNE All the stock, all the updates, all you need. Always speak to your Synnex rep before quoting customer If you have colleagues not receiving this monthly Google deck but would like to, please have them sign up here Follow Chrome Enterprise on LinkedIn Channel news June update Questions about Switching to Chrome Promotions Marketing Case Studies Training Product Launches & Stock updates Channel news Chrome OS in Action: Chrome Enterprise has announced new solutions Chrome Demo Tool is live and open for partner sign ups! to accelerate businesses move to Chrome OS On October 20, Chrome OS announced new solutions to help businesses deploy Chromebooks Demo Tool Guide and Chrome OS devices faster, while keeping their employees focused on what matters most. Each solution solves a real-world challenge we know businesses are facing right now and will help them support their distributed workforce. The Chrome Demo Tool is a new tool for Google for Education ● Chrome OS Readiness Tool: Helps businesses segment their workforce and identify and Chrome Enterprise partners with numerous pre-configured which Windows devices are ready to adopt Chrome OS (available 2021). options to demo top Chrome features including single sign-on ● Chrome Enterprise Recommended: Program that identifies verified apps for the Chrome OS environment. (SSO), parallels, zero touch enrollment (ZTE), and many more ● Zero-touch enrollment: Allow businesses to order devices that are already corporate that are coming soon. enrolled so they can drop ship directly to employees. ● Parallels Desktop: Gives businesses access to full-featured Windows and legacy apps locally on Chrome OS.
    [Show full text]
  • Google Search Techniques
    Google Search Techniques Google Search Techniques Disclaimer: Using Google to search the Internet will locate resources that are available to the public. While these resources are good for some purposes, serious research and academic work often requires access to databases, articles and books that, if they are available online, are only accessible by subscription. Fortunately, the UMass Library subscribes to most of these services. To access these resources online, go to the UMass Library Web site (library.umass.edu). For the best possible help finding information on any topic, talk to a reference librarian in person. They can help you find the resources you need and can teach you some fantastic techniques for doing your own searches. For a complete guide to Google’s features go to http://www.google.com/help/ Simple Search Strategies Google keeps the specifics of its page-ranking techniques secret, but here are a few things we know about what makes pages appear at the top of your search: - your search terms appears in the title of the web page - your search terms appear in links that lead to that page - your search terms appear in the content of the page (especially in headers) When you choose the search terms you enter into Google, think about the titles you would expect to see on these pages or that you would see in links to these pages. The more well-known your search target, the more easy it will be to find. Obscure topics or topics that share terms with more common topics will take more work to find.
    [Show full text]
  • Google/ Doubleclick REGULATION (EC) No 139/2004 MERGER PROCEDURE Article 8(1)
    EN This text is made available for information purposes only. A summary of this decision is published in all Community languages in the Official Journal of the European Union. Case No COMP/M.4731 – Google/ DoubleClick Only the English text is authentic. REGULATION (EC) No 139/2004 MERGER PROCEDURE Article 8(1) Date: 11-03-2008 COMMISSION OF THE EUROPEAN COMMUNITIES Brussels, 11/03/2008 C(2008) 927 final PUBLIC VERSION COMMISSION DECISION of 11/03/2008 declaring a concentration to be compatible with the common market and the functioning of the EEA Agreement (Case No COMP/M.4731 – Google/ DoubleClick) (Only the English text is authentic) Table of contents 1 INTRODUCTION .....................................................................................................4 2 THE PARTIES...........................................................................................................5 3 THE CONCENTRATION.........................................................................................6 4 COMMUNITY DIMENSION ...................................................................................6 5 MARKET DESCRIPTION......................................................................................6 6 RELEVANT MARKETS.........................................................................................17 6.1. Relevant product markets ............................................................................17 6.1.1. Provision of online advertising space.............................................17 6.1.2. Intermediation in
    [Show full text]
  • Sanitation Hackathon (P131958)
    Report No: ACS8614 World Public Disclosure Authorized IT based innovation in rural/urban WSS - Sanitation Hackathon (P131958) 26 May 2014 Public Disclosure Authorized TWIWP OTHER Public Disclosure Authorized Public Disclosure Authorized Standard Disclaimer: This volume is a product of the staff of the International Bank for Reconstruction and Development/ The World Bank. The findings, interpretations, and conclusions expressed in this paper do not necessarily reflect the views of the Executive Directors of The World Bank or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries. Copyright Statement: The material in this publication is copyrighted. Copying and/or transmitting portions or all of this work without permission may be a violation of applicable law. The International Bank for Reconstruction and Development/ The World Bank encourages dissemination of its work and will normally grant permission to reproduce portions of the work promptly. For permission to photocopy or reprint any part of this work, please send a request with complete information to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA, telephone 978-750-8400, fax 978-750-4470, http://www.copyright.com/. All other queries on rights and licenses, including subsidiary rights, should be addressed to the Office of the Publisher, The World Bank, 1818 H Street NW, Washington, DC 20433, USA, fax 202-522-2422, e-mail [email protected].
    [Show full text]
  • Usaid/Cambodia Development Innovations Project
    USAID/CAMBODIA DEVELOPMENT INNOVATIONS PROJECT QUARTERLY PROGRESS REPORT: APRIL 1 TO JUNE 30, 2014 JUNE 30, 2014 This publication was produced for review by the United States Agency for International Development. It was prepared by DAI. USAID/CAMBODIA DEVELOPMENT INNOVATIONS PROJECT QUARTERLY PROGRESS REPORT: JANUARY 1- MARCH 31, 2014 Program Title: USAID/Cambodia Development Innovations Sponsoring USAID Office: USAID/Cambodia Cooperative Agree. Number: AID-442-A-13-00003 Contractor: DAI Date of Publication: April 30, 2014 Author: DAI ** Please note that the change of the project name in the cooperative agreement from Social Innovation Lab Kampuchea to Development Innovations. The authors’ views expressed in this publication do not necessarily reflect the views of the United States Agency for International Development or the United States Government. CONTENTS CONTENTS ............................................................................................................................. v ABBREVIATIONS ................................................................................................................... vi EXECUTIVE SUMMARY ........................................................................................................... 8 1.0 INTRODUCTION ............................................................................................................. 10 1.1 OBJECTIVES AND KEY RESULTS ......................................................................................................... 10 1.2 OVERVIEW OF QUARTERLY
    [Show full text]
  • 2006 Marketing Advertising
    A SUPPLEMENT TO 2006 FACT PACK 4th ANNUAL GUIDE TO ADVERTISING MARKETING Published February 27, 2006 © Copyright 2006 Crain Communications Inc. 2 | Advertising Age | FactPack FactPack | Advertising Age | 3 FACT PACK 2006 CONTENTS TOP LINE DATA ON THE ADVERTISING AND MEDIA INDUSTRIES In a pdf version, click anywhere on the items below to jump directly to the page. GENERAL MOTORS CORP. is the top marketer by ad spending in the U.S. but who ranks Advertising & Marketing first on a global basis? A spot for Fox TV’s American Idol on Wednesdays at prime- Top five U.S. advertisers and their agencies . .6-7 time commands the most dollars per :30 ($518,466), but how much more is that than Top U.S. advertisers . .8 a spot for runner-up CSI:Crime Scene Investigation on CBS-TV the following night? And what about that growth in a Super Bowl :30 spot since the $42,000 average cost Top U.S. megabrands . .9 paid per :30 at Super Bowl I in 1967? Omnicom Group may be the world’s biggest U.S. ad spending totals by media . .10 marketing organization but how do its agency networks stack up against their com- Top U.S. advertisers by media . .11-13 petition? How big and far-reaching are those multifaceted media goliaths? It’s all in Top global marketers and spending in top 10 countries . .14-15 the FactPack, whether in print form on your desk, or a click away on your computer or network. Consumer brand market share leaders in select categories .
    [Show full text]
  • The Economic Advantage of Google Bigquery On- Demand Serverless Analytics
    Enterprise Strategy Group | Getting to the bigger truth.™ ESG Economic Validation White Paper The Economic Advantage of Google BigQuery On- Demand Serverless Analytics By Aviv Kaufmann, Senior ESG Validation Analyst; and Nik Rouda, Senior Analyst April 2017 This ESG White Paper was commissioned by Google and is distributed under license From ESG. © 2017 by The Enterprise Strategy Group, Inc. All Rights Reserved. White Paper: The Economic Advantage oF Google BigQuery On-demand Serverless Analytics 2 Contents The Challenge: Economical Insight ......................................................................................................................................... 3 The Solution: Google BigQuery .............................................................................................................................................. 4 Google BigQuery versus Alternative Solutions ....................................................................................................................... 5 ESG’s Economic Value Audit Process ..................................................................................................................................... 6 Economic Benefits of Google BigQuery .................................................................................................................................. 6 ESG Economic Validation ....................................................................................................................................................... 8 Modeled Scenario
    [Show full text]
  • Creating World-Class Developer Experiences 2
    Inside the API Product Mindset PART 1 Creating World-Class Developer Experiences 2 3 4 Field-tested best practices Real-world use cases Developer experience checklist Table of contents Inside the API product mindset ................................................................................ 03 Understanding developers—internal and external—as API customers .................. 05 Field-tested best practices ..................................................................................... 07 Build an easy-to-use, self-service developer portal to drive adoption ......... 07 Create a community of developers ................................................................... 08 Never stop improving ......................................................................................... 09 Real-world use cases ............................................................................................... 10 How AccuWeather reached new audiences with its developer portal ......... 10 Developer experience checklist ............................................................................. 11 About Apigee API management ............................................................................... 12 Inside the API product mindset Application programming interfaces, or APIs, are the de facto mechanism for connecting applications, data, and systems—but they’re also much more. APIs abstract backend complexity behind a consistent interface, which means they not only allow one kind of software to talk to another, even if neither was designed
    [Show full text]