The Family of MapReduce and Large Scale Data Processing Systems
Sherif Sakr (NICTA and University of New South Wales, Sydney, Australia; [email protected])
Anna Liu (NICTA and University of New South Wales, Sydney, Australia; [email protected])
Ayman G. Fayoumi (King Abdulaziz University, Jeddah, Saudi Arabia; [email protected])

ABSTRACT

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as issues of data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

1. INTRODUCTION

We live in the era of Big Data, where we are witnessing a continuous increase in computational power that produces an overwhelming flow of data, which in turn has called for a paradigm shift in computing architecture and large scale data processing mechanisms. Powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology are putting massive volumes of data into the hands of scientists. For example, the Large Synoptic Survey Telescope [1] generates on the order of 30 TeraBytes of data every day. Many enterprises continuously collect large datasets that record customer interactions, product sales, results from advertising campaigns on the Web, and other types of information. For example, Facebook collects 15 TeraBytes of data each day into a PetaByte-scale data warehouse [116]. Jim Gray called this shift a "fourth paradigm" [61]; the first three paradigms were experimental, theoretical and, more recently, computational science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multi-core CPUs and mechanical hard disks grows every year, which makes the challenges of data-intensive computing much harder to overcome [15]. Hence, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future [110]. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data, instead of focusing on having the biggest and fastest single computer.

In general, the growing demand for large-scale data mining and data analysis applications has spurred the development of novel solutions from both industry (e.g., web-data analysis, click-stream analysis, network-monitoring log analysis) and the sciences (e.g., analysis of data produced by massive-scale simulations, sensor deployments, high-throughput lab equipment). Although parallel database systems [40] serve some of these data analysis applications (e.g., Teradata (http://teradata.com/), SQL Server PDW (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx), Vertica (http://www.vertica.com/), Greenplum (http://www.greenplum.com/), ParAccel (http://www.paraccel.com/), Netezza (http://www-01.ibm.com/software/data/netezza/)), they are expensive, difficult to administer and lack fault-tolerance for long-running queries [104].
MapReduce [37] is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed such that a MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. One of the main advantages of this framework is its reliance on a simple and powerful programming model. In addition, it isolates the application developer from all the complex details of running a distributed program, such as issues of data distribution, scheduling and fault tolerance [103].

Recently, there has been a great deal of hype about cloud computing [10]. In principle, cloud computing is associated with a new paradigm for the provision of computing infrastructure. This paradigm shifts the location of this infrastructure to more centralized and larger scale datacenters in order to reduce the costs associated with the management of hardware and software resources. In particular, cloud computing has promised a number of advantages for hosting the deployments of data-intensive applications, such as:

• Reduced time-to-market, by removing or simplifying the time-consuming hardware provisioning, purchasing and deployment processes.
• Reduced monetary cost, by following a pay-as-you-go business model.
• Virtually unlimited throughput, by adding servers as the workload increases.

In principle, the success of many enterprises often relies on their ability to analyze expansive volumes of data. In general, cost-effective processing of large datasets is a nontrivial undertaking. Fortunately, MapReduce frameworks and cloud computing have made it easier than ever for everyone to step into the world of big data. This technology combination has enabled even small companies to collect and analyze terabytes of data in order to gain a competitive edge. For example, the Amazon Elastic Compute Cloud (EC2) is offered as a commodity that can be purchased and utilised. In addition, Amazon has also provided Amazon Elastic MapReduce as an online service to easily and cost-effectively process vast amounts of data without the need to worry about time-consuming set-up, management or tuning of computing clusters, or the compute capacity upon which they sit. Hence, such services enable third parties to perform their analytical queries on massive datasets with minimum effort and cost by abstracting the complexity entailed in building and maintaining computer clusters.

The original implementation of the basic MapReduce architecture had some limitations, and several research efforts have since tackled them by introducing advancements to the basic architecture that improve its performance. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data analysis that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. In particular, the remainder of this article is organized as follows. Section 2 describes the basic architecture of the MapReduce framework. Section 3 discusses several techniques that have been proposed to improve the performance and capabilities of the MapReduce framework from different perspectives. Section 4 covers several systems that support a high-level SQL-like interface for the MapReduce framework.

2. MAPREDUCE FRAMEWORK: BASIC ARCHITECTURE

In the MapReduce programming model, the computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The user of the MapReduce framework expresses the computation using two functions: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce framework groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function receives an intermediate key I with its set of values and merges them together; typically, just zero or one output value is produced per Reduce invocation. The main advantage of this model is that it allows large computations to be easily parallelized, with re-execution serving as the primary mechanism for fault tolerance.
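To make the data flow of this model concrete, the following single-process Python sketch simulates the map, group-by-key (shuffle) and reduce phases. It is a minimal illustration only: the function run_mapreduce and the other names are ours rather than part of any MapReduce implementation, and a real deployment would run the map and reduce invocations in parallel across a cluster with a distributed shuffle.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        """Single-process simulation of the MapReduce execution model.

        map_fn(key, value) yields intermediate (key, value) pairs;
        reduce_fn(key, values) merges all values of one intermediate key.
        """
        # Map phase: every input pair yields a set of intermediate pairs,
        # which are grouped by intermediate key (the "shuffle").
        groups = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        # Reduce phase: merge the values associated with each intermediate key.
        return {ikey: reduce_fn(ikey, ivalues)
                for ikey, ivalues in groups.items()}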
Figure 1 illustrates an example MapReduce program, expressed in pseudo-code, for counting the number of occurrences of each word in a collection of documents. In this example, the map function emits each word plus an associated count of occurrences, while the reduce function sums together all counts emitted for a particular word; a runnable rendering of this example is given at the end of this section.

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

    Figure 1: An Example MapReduce Program [37].

In principle, the design of the MapReduce framework has considered the following main principles [126]:

• Low-Cost Unreliable Commodity Hardware: Instead of using expensive, high-performance, reliable symmetric multiprocessing (SMP) or massively parallel processing (MPP) machines equipped with high-end network and storage subsystems, the MapReduce framework is designed to run on large clusters of commodity hardware. This hardware is managed and powered by open-source operating systems and utilities so that the cost is low.
• Extremely Scalable RAIN Cluster: Instead of using centralized RAID-based SAN or NAS storage systems, every MapReduce node has its own local off-the-shelf hard drives.
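As promised above, the word-count program of Figure 1 can be expressed against the run_mapreduce sketch shown earlier in this section. This is an illustrative rendering with made-up sample documents, not code from any of the surveyed systems:

    def word_count_map(doc_name, contents):
        # Mirrors Figure 1's map: emit (w, "1") for each word w in the document.
        for word in contents.split():
            yield word, "1"

    def word_count_reduce(word, counts):
        # Mirrors Figure 1's reduce: sum all counts emitted for this word.
        return sum(int(v) for v in counts)

    docs = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog jumps the fence")]
    print(run_mapreduce(word_count_map, word_count_reduce, docs))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1,
    #  'jumps': 1, 'fence': 1}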