A Technological Survey on Apache Spark and Hadoop Technologies

INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020 ISSN 2277- 8616 A Technological Survey On Apache Spark And Hadoop Technologies. Dr MD NADEEM AHMED, AASIF AFTAB, MOHAMMAD MAZHAR NEZAMI Abstract: These days, whether it is mid-level or multilevel organizations alike accumulate enormous volume of data, and the only intension collecting these data is to: extract meaningful data called value through advanced level of data mining or analytics, and apply in decision making by personalized advertisement targeting , making the huge profit in business and extracting the rapidly using big data technologies. Big data due to its several features like value, volume, velocity, Variability and variety put further numerous challenges In this paper, we have completed an investigation of different huge information related advancements and we look at the advantages and disadvantages of these big data technologies and a comparative study of different proposed authors works over performance optimization of Big data related technologies has been done. Index Terms: Hadoop, Apache Spark, Survey, Performance, HDFS,mapReduce,Bigdata —————————— —————————— 1 INTRODUCTION technology Since introduced by Google in 2004. Hadoop is The Total data up to 90's century is now today’s sample an open-source usage of MapReduce. In different data data. According to Eric Schmidt, down of civilization till 2003 analytic use cases it has been applied such as reporting, there was five (5) Exabyte of data/information but now that OLAP and web data search machine learning, data mining, amount of data/information is created only in two (2) days social networking analysis and retrieval of information. To because data/information is growing much faster than we magnify and remove the disadvantages of Map Reduce expected. The reason of grow out of these data/information Apache Spark was developed. Apache Spark shall be is that it is coming from every corner of the whole world. considered as advancement of Map Reduce. Apache Spark Twitter process 340 million messages weekly. Data can process data 10x occasions quicker than guide Reduce generation in the last one year is equivalent to data on Disk and 100x occasions quicker than Map Reduce on generated in last 15 year. Facebook user generates 2.7 memory. This can be achieved by minimizing the number of billion comments and likes. Amazon S3 storage adds more reading/compose activities to plate. It stores the moderate than one billion objects biweekly. E bay stores 90 petabytes handling information in memory of data about customer transactions. Enterprise information . measures is no more in Tera bytes and Peta bytes but Exa and Zeta bytes. Because of progressive rise of data, it is Big data compromise of three broad components: - very important how to fastly retrieve information from Big Data in the research institutes and enterprise. Presently, the system of Hadoop ecosystem has been more largely accepted by scientist. This ecosystem fuse HDFS, Map Reduce, Hive, HBase and Pig and so on. Pig and Hive are called batch Processing. Big Data—the term portraying the collection of new information from sources, for example, online individual movement, business exchanges, and sensor systems—it contains numerous trademark, for example, data as high speed, high volume, and high- assortment. For raising the decision making and perceptivity, this information high speed, high volume and high-assortment resources of data demanded imaginative structure and savvy information preparing according to BD’s definition of Gartner. (gartner.com, 2013). For big data processing MapReduce [15] has become a recognized 2 LITERATURE SURVEY ____________________ In 2012, Floratou et al. [6] took a gander at Hive versus a Ph.D. (CS). similar database from Microsoft - SQL Server applying Lecturer (CS) TPC-H specification. The outcomes demonstrate shows Lecturer, Department of Computer Sc. that at the four scale factors SQL Server is continually IFTM University, India speedier than Hive for all TPC-H tests. Although when the College of CS/IT College of science and Arts Balqarn dataset is litter the normal speedup of SQL Server over [email protected] Hive is greater. Floratou et al. [7] in 2014, did another test Jazan University contemplate: differentiating Hive against Impala applying a University of Bisha, Saudia Arabia TPC-H like specification and two TPC-DS revived outstanding tasks at hand. The outcome exhibited that 2.1X [email protected] to 2.8X speedier than Hive on Tez (Hive-Tez) for the TPC-H [email protected] tests and Impala is 3.3X to 4.4X faster than Hive on MapReduce (Hive-MR). 3100 IJSTR©2020 www.ijstr.org INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020 ISSN 2277- 8616 appreciate the property. Jingmin Li In [10] composed the In [8] , Yongqiang He , Rubao Lee , Yin Huai , Zheng Shao , ongoing information investigation framework in light of the Jain, N , Xiaodong Zhang , Zhiwei Xu has made RCFile to Impala. They clarified the purpose why Impala has been achieve brisk data stacking, speedy request planning and chosen by them by looking at the Hive. Impala inquiry exceedingly capable limit space utilize .This record proficiency is around 2~3 times than Hive. In [11], Lei Gu structure has good position of level line vertical section considered the Hadoop and Spark. They establish that store structure and store structure. We will take this record regardless of the way that Spark is all things considered association to test three-type request devices. In [9],they speedier than Hadoop in iterative sets of operation, it needs thought about some high effective circulated parallel to bear for additional memory usage. The speed of the databases and Hive ,and tune up the performance with Spark advantage is crippled precisely when the memory utilizing several framework parameters gave by Hive, for isn't adequately sufficient to store as of late created direct example, HDFS Block Size ,Parallel Processing Slot results. So we should consider the execution of the Number and Partitions. We can get from their framework for memory, when we took a gander at three-type question presenting distinctive request on comprehensively devices. S. No. Paper Author Advantages Issues Note the similarity or dissimilarity Intended to upgrade between three-type question SQL in the three kind apparatuses in different inquiry apparatuses and file format effect on the memory look at the distinction Performance Comparison and CPU, lastly, we observe of the when advancement. 1 of Hive & Impala and Xiaopeng Li, Wenli Zhou document design for the inquiry Additionally, need to Spark SQL time, talk about that the examine other file query speed of Impala, Parquet format and alternative record group is taken made by techniques of Spark SQL is the quickest compression. Here based on the TPCH benchmark author compares the The execution of SQL- execution of three agent SQL-on- on-Hadoop frameworks The Performance of SQL- Xiongpai Qin, Yueguo Hadoop frameworks. can be additionally 2 on-Hadoop Systems: An Chen*, Jun Chen, Shuai Li, Impala performs much better than improved by applying Experimental Study Jiesi Liu, Huijie Zhang Hive and Spark. Performance of further developed SQL-on-Hadoop systems parallel database remarkably increased by using systems. pipeline way of querying. More performance can be achieved by isolating executing time of the Gesture based communication and gadgets (client cost of Performance issue and gamification of many application the program)from the Jay Kiruthika, Dr Souheli 3 Query Optimization in Big has led to increase in 3D storage. SQL query, Khaddaj Multidimensional Data Here author compares the cost of experimenting on large 3D storage and Time execution. 3 D databases involved complex structure will produce more results to work on Here authors introduced an exhibition expectation system Need to check the the Which runs job on Apache Spark I/O cost expectation for platform, Author demonstrate an alternate models for assessing the execution arrangement of of occupation by mirror the utilization. This may be Performance Prediction for Kewen Wang, Mohammad execution of genuine employment happening because of it 4 Apache Spark Platform Maifi Hasan Khan on a little scale on a continuous is unable to track bunch network action in . For execution of time and memory enough subtleties in a prophecy precision is observed to little scale be high, For different applications impersonation. the I/O cost prediction shows variation. Cross-Platform Resource Dazhao Cheng, Xiaobo Author has noticed that if in YARN While deploying more 5 Scheduling Zhou, Palden Lama, Jun clusters, by running Spark and processing paradigms for Spark and MapReduce Wu, and Changjun Jiang MapReduce causes noteworthy need to explore more 3101 IJSTR©2020 www.ijstr.org INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020 ISSN 2277- 8616 on YARN execution debasement in light of cross-platform resource the fact that to the scheduling proposal semantic gap between the dynamic (e.g. Stormand Hive, Pig application demands and the ,Shark, ,) on Hadoop reservation-based resource YARN allocation scheme of YARN. Therefore, a cross-platform resource scheduling middleware has been developed and designed , iKayak, that focused to enhance the utilization of cluster resource and for Spark-on-YARN deployment enhance application performance. By using concept of Authors implementation

A Technological Survey on Apache Spark and Hadoop Technologies

Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Building Machine Learning Inference Pipelines at Scale

Evaluation of SPARQL Queries on Apache Flink

Flare: Optimizing Apache Spark with Native Compilation

Large Scale Querying and Processing for Property Graphs Phd Symposium∗

Apache Spark Solution Guide

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Debugging Spark Applications a Study on Debugging Techniques of Spark Developers

Adrian Florea Software Architect / Big Data Developer at Pentalog [email protected]

Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem

Mobile Storm: Distributed Real-Time Stream Processing for Mobile Clouds

Analysis of Real Time Stream Processing Systems Considering Latency Patricio Córdova University of Toronto [email protected]