ID 204 Comparison Between Hadoop and Spark

Proceedings of the International Conference on Industrial Engineering and Operations Management, Bangkok, Thailand, March 5-7, 2019

Comparison between Hadoop and Spark

Houssam BENBRAHIM, Hanaâ HACHIMI and Aouatif AMINE
GS Laboratory "LGS", BOSS-Team, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco.
[email protected], [email protected] and [email protected]

Abstract

Big Data is a set of technologies that aims to capture, store, manage, and analyze large volumes of data of different types: structured, semi-structured, and unstructured. These data are characterized by the rule of the 5 Vs: Volume, Velocity, Variety, Veracity, and Value. To analyze large amounts of information coming from several sources, the technological world of Big Data relies on clearly identified tools, notably the Hadoop Framework and Apache Spark. Hadoop provides massive data storage with the Hadoop Distributed File System (HDFS) model and analysis with the MapReduce model, on a cluster of one or more machines. Apache Spark analyzes distributed data, but it does not include its own storage system. This article presents a comparison between the Big Data analysis methods offered by Hadoop and Spark: their architectures, operating modes, and performance. We first give a general overview of Big Data technology. We then discuss the Hadoop Framework and its components, HDFS and MapReduce, the first element of our comparison: we study their methodology, their data processing model, and their architecture, and we end the section by presenting the Hadoop ecosystem. We continue with the Apache Spark Framework, the second element of our study, and expose its features, its ecosystem, and its method of analyzing large data sets. Finally, we compare these two elements, Hadoop and Spark, in a detailed study.

Keywords

Big Data, Hadoop, Spark, HDFS, MapReduce, Cluster, Analysis methods, Comparison.

1. Introduction

The term Big Data was raised by the Meta Group research firm (now Gartner) in 2001 (Laney, 2001). This concept has profoundly transformed our society and has become a new technology focus in both science and industry. We are clearly living in the era of the data deluge, evidenced by the huge volume of data collected from different sources and by the growth rate of the data generated (Corp, 2014). The question is: where does this data come from? The movement began in the 2000s with the birth of Web 2.0, the emergence of social networks, and the development of smartphones, and continued with the shift towards connected objects (Institut Montaigne, 2015). Every message, document, image, and video shared, for example on Facebook, Twitter, or Google, represents data. The IDC (International Data Corporation) report (Gantz & Reinsel, 2011) predicts that, from 2005 to 2020, the overall volume of data will grow from 130 exabytes to 40,000 exabytes, roughly doubling every two years. The Big Data approach has a strong impact on many sectors: finance, health, the public sector, and more (Corp, 2016).
In fact, data is presented as a new gold that should be discovered and exploited. For instance, the McKinsey report (McKinsey & Company, 2011) states that the potential value of global personal location data is estimated at $100 billion in revenue to service providers over the next ten years, and at as much as $700 billion in value to consumer and business end users. Today, the biggest problem is the analysis of massive data with all the constraints and difficulties that surround it. This challenge also offers new opportunities in information management: understanding both structured and unstructured data directly improves results.

The field of data analysis is based on powerful technology; at its heart we find a framework known as Hadoop MapReduce. This module works in a parallel environment and performs advanced processing functions in a short time. Introduced by Google in 2004 (Dean & Ghemawat, 2004), MapReduce helps programmers perform data analysis by delegating and managing the data over an architecture that can include thousands of computers running at the same time, called a «cluster» (LeMagIT/TechTarget, 2014). We also have Spark, a Big Data processing framework and an open-source project of the Apache Foundation, built to perform sophisticated analysis and designed for speed and ease of use (Apache Spark). Apache Spark offers a complete and unified ecosystem that meets the needs of Big Data processing for data sets that vary in nature (text, graph, ...) as well as in source (batch, streaming, or real-time) (What is Apache Spark?).

In this paper, we start with the definition of Big Data and its 5 V properties. We continue with Hadoop's storage and analysis technology. Next, we discuss Apache Spark and its processing model. Following that, we present a comparison between these two technologies, Hadoop and Spark. Finally, we outline a brief conclusion and suggestions for further research.

2. Big Data: definition and characteristics

2.1 Definition

It is sometimes difficult to agree on a single definition of Big Data. In 2011, the McKinsey & Company firm defined Big Data as follows: "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze" (McKinsey & Company, 2011). Likewise, IDC defines Big Data as: "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis" (Gantz & Reinsel, 2011). IBM, in turn, defines Big Data as: "Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills." (IBM). There are thus several definitions of Big Data, each focusing on a specific aspect of the phenomenon.
2.2 Characteristics

In the early 2000s, and more precisely in 2001, the analyst Doug Laney noted in his research at Meta Group (Laney, 2001) that data growth challenges and opportunities come down to three Vs (Russom, 2011) (Sagiroglu & Sinanc, 2013) (McAfee & Brynjolfsson, 2012) (Besse & Loubes, 2014):

• Volume: we are talking about a large volume of data. Data is growing exponentially, leading forecasters to predict that 40 zettabytes will be generated in 2020 (EMC Corporation). These data are collected from a variety of sources: computers and smartphones connected to the internet, access to social networks, interconnected objects, etc. With this large volume generated, the challenge is to store and analyze Big Data in real time (Corp, 2015).

• Velocity: the processing of huge amounts of data in a short time, sometimes in real time, with very advanced technologies like Hadoop and Spark. The added value of the large volumes of data collected diminishes quickly with time; the data have no value if they are not used quickly.

• Variety: all types and forms of data (structured or unstructured). The data comes in various formats: text documents, emails, pictures, audio, video, location data, log files, and more. The data collected are very diverse in nature.

These three Vs, Volume, Velocity, and Variety, are the original properties of Big Data. The two newer Vs are (Demchenko, Ngo, & Membrey, 2013) (Teboul & Berthier, 2015) (SAS Company):

• Value: the exploitation of data to draw out information. What we want is to turn data into value; this fourth property is the most important V in the era of Big Data, since it means money and recovered benefits for businesses.

• Veracity: the reliability of data. With such high volume, velocity, and variety of data, quality and accuracy are less controllable. Data shared via the internet and social networks are uncertain and unpredictable in nature, containing abbreviations, typos, and colloquial speech, and are not necessarily correct. Big Data technologies work with these types of data to bring order to them and to combine analytics that correspond to user needs.

These five elements are the main keys to understanding the world of Big Data. The goal is to process a large volume of data in a short time with innovative technologies, in order to exploit untapped information. Big Data can thus be described by five characteristics (Demchenko, Grosso, De Laat, & Membrey, 2013), represented in Figure 1.

Figure 1. The 5 Vs of Big Data

3. Hadoop Framework

3.1 Hadoop Core

Apache Hadoop is open-source software written in Java that supports massive data storage and processing on a cluster of several machines (Apache™ Hadoop). In 2004, Google published a research article on the MapReduce algorithm, designed to carry out analytical operations at large scale across multiple servers, as well as on the cluster's file system, the Google File System (GFS) (Dean & Ghemawat, 2004). Doug Cutting, who worked on the development of the free search engine Apache Lucene, was inspired by the concepts described in Google's article and decided to replicate, in open source, the tools Google had developed for its needs (LeMagIT/TechTarget, 2014).
Hadoop Core offers a distributed file system that gives high-speed access to application data and, on the other hand, a system for the parallel analysis of large data sets (Intel IT Center, 2013).
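To make the two processing models concrete, below is a minimal word-count sketch. It is our own illustration, not code from the paper: the file names, paths, and the sample command are assumptions. The first two scripts follow the Hadoop Streaming convention, in which the mapper and reducer exchange tab-separated key-value pairs over standard input and output while the framework sorts the intermediate pairs between the two phases.

    #!/usr/bin/env python3
    # mapper.py -- map phase: emit one (word, 1) pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- reduce phase: the framework delivers pairs sorted by
    # key, so counts for the same word arrive adjacent and can be summed
    # in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A job of this kind would be submitted with something like: hadoop jar hadoop-streaming.jar -input /input -output /output -mapper mapper.py -reducer reducer.py -files mapper.py,reducer.py (jar and HDFS paths assumed). For contrast, the same computation expressed with Spark's RDD API, as a sketch assuming a running Spark installation and hypothetical HDFS paths:

    # wordcount_spark.py -- the same job chained in one in-memory pipeline.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("hdfs:///input")           # read input splits
                .flatMap(lambda line: line.split())  # map: one record per word
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))    # shuffle and sum per key
    counts.saveAsTextFile("hdfs:///output")
    sc.stop()

The contrast developed later in this comparison is already visible here: Hadoop materializes the intermediate pairs between the map and reduce phases, while Spark expresses the whole pipeline as one program over distributed in-memory collections.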