Big Data Analysis: Ap Spark Perspective

Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 15 Issue 1 Version 1.0 Year 2015 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172 & Print ISSN: 0975-4350 Big Data Analysis: Ap Spark Perspective By Abdul Ghaffar Shoro & Tariq Rahim Soomro SZABIST Dubai Campus, United Arab Emirates Abstract- Big Data have gained enormous attention in recent years. Analyzing big data is very common requirement today and such requirements become nightmare when analyzing of bulk data source such as twitter twits are done, it is really a big challenge to analyze the bulk amount of twits to get relevance and different patterns of information on timely manner. This paper will explore the concept of Big Data Analysis and recognize some meaningful information from some sample big data source, such as Twitter twits, using one of industries emerging tool, known as Spark by Apache. Keywords : big data analysis, twitter, apache spark, apache hadoop, open source. GJCST-C Classification : D.2.11, H.2.8 BigDataAnalysisApSparkPerspective Strictly as per the compliance and regulations of: © 2015. Abdul Ghaffar Shoro & Tariq Rahim Soomro. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/), permitting all noncommercial use, distribution, and reproduction inany medium, provided the original work is properly cited. Big Data Analysis: Ap Spark Perspective Abdul Ghaffar Shoro α & Tariq Rahim Soomro σ Abstract- Big Data have gained enormous attention in recent is pretty fast at both running programs as well as writing years. Analyzing big data is very common requirement today data. Spark supports in-memory computing, that and such requirements become nightmare when analyzing of enables it to query data much faster compared to disk- bulk data source such as twitter twits are done, it is really a big based engines such as Hadoop, and also it offers a 2015 challenge to analyze the bulk amount of twits to get relevance general execution model that can optimize arbitrary ear and different patterns of information on timely manner. This Y paper will explore the concept of Big Data Analysis and operator graph [1]. This paper organized as follows: recognize some meaningful information from some sample big section 2 focus on literature review exploring the Big data source, such as Twitter twits, using one of industries Data Analysis & its tools and recognize some 7 emerging tool, known as Spark by Apache. meaningful information from some sample big data Keywords: big data analysis, twitter, apache spark, source, such as Twitter feeds, using one of industries apache hadoop, open source. emerging tool, Apache Spark along with justification of using Spark; section 3 will discuss material and method; I. Introduction section 4 will discuss the results of analyzing of big data n today’s computer age, our life has become pretty using Spark; and finally discussion and future work will much dependent on technological gadgets and more be highlighted in section 5. or less all aspects of human life, such as personal, I I Version V Issue I II. Literature Review social and professional are fully covered with technology. More or less all the above aspects are a) Big Data dealing with some sort of data; due to immense A very popular description for the exponential increase in complexity of data due to rapid growth growth and availability of huge amount of data with all X Volume ) required speed and variety have originated new possible variety is popularly termed as Big Data. This is C ( challenges in the life of data management. This is where one of the most spoke about terms in today’s Big Data term has given a birth. Accessing, Analyzing, automated world and perhaps big data is becoming of Securing and Storing big data are one of most spoken equal importance to business and society as the terms in today’s technological world. Big Data analysis Internet has been. It is widely believed and proved that is a process of gathering data from different resources more data leads to more accurate analysis, and of and then organizing that data in meaning full way and course more accurate analysis could lead to more then analyzing those big sets of data to discover legitimate, timely and confident decision making, as a meaningful facts and figures from that data collection. result, better judgment and decisions more likely means This analysis of data not only helps to determine the higher operational efficiencies, reduced risk and cost hidden facts and figures of information in bulk of big reductions [2]. Big Data researchers visualize big data data, but also it provides with categorize the data or as follows: rank the data with respect to important of information it i. Volume-wise provides. In short big data analysis is the process of This is the one of the most important factors, finding knowledge from bulk variety of data. Twitter as contributed to emergence of big data. Data volume is organization itself processes approximately 10k tweets multiplying to various factors. Organizations and per second before publishing them for public, they governments has been recording transactional data for analyze all this data with this extreme fast rate, to ensure decades, social media continuously pumping steams of every tweet is following decency policy and restricted unstructured data, automation, sensors data, machine- words are filtered out from tweets. All this analyzing to-machine data and so much more. Formerly, data obal Journal of Computer Science and Technology Technology and Science Computer of Journal obal process must be done in real time to avoid delays in storage was itself an issue, but thanks to advance and l publishing twits live for public; for example business like affordable storage devices, today, storage itself is not a G Forex Trading analyze social data to predict future big challenge but volume still contributes to other public trends. To analyze such huge data it is required challenges, such as, determining the relevance within to use some kind of analysis tool. This paper focuses massive data volumes as well as collecting valuable on open source tool Apache Spark. Spark is a cluster information from data using analysis [3]. computing system from Apache with incubator status; ii. Velocity-wise this tool is specialized at making data analysis faster, it Volume of data is challenge but the pace at Author α σ: Department of Computing, SZABIST Dubai Campus, which it is increasing is a serious challenge to be dealt Dubai, UAE. e-mails: [email protected], [email protected] with time and efficiency. The Internet streaming, RFID ©2015 Global Journals Inc. (US) Big Data Analysis: Ap Spark Perspective tags, automation and sensors, robotics and much more i. Apache Hive technology facilities, are actually driving the need to deal Hive is a data warehousing infrastructure, which with huge pieces of data in real time. So velocity of data runs on top of Hadoop. It provides a language called increase is one of big data challenge with standing in Hive QL to organize, aggregate and run queries on the front of every big organization today [4]. data. Hive QL is similar to SQL, using a declarative programming model [7]. This differentiates the language iii. Variety-wise from Pig Latin, which uses a more procedural approach. Rapidly growing huge volume of data is a big In Hive QL as in SQL the desired final results are challenge but the variety of data is bigger challenge. described in one big query. In contrast, using Pig Latin, 2015 Data is growing in variety of formats, structured, un- the query is built up step by step as a sequence of structured, relational and non-relational, different files ear assignment operations. Apache Hive enables Y systems, videos, images, multimedia, financial data, developers specially SQL developers to write queries in aviation data and scientific data etc. Now the challenge Hive Query Language HQL. HQL is similar to standard is to find means to correlate all variety of data timely to 8 query language. HQL queries can be broken down by get value from this data. Today huge numbers of Hive to communicate to MapReduce jobs executed organizations are striving to get better solutions to this across a Hadoop Cluster. challenge [3]. ii. Apache Pig iv. Variability-wise Pig is a tool or in fact a platform to analyze huge Rapidly growing data with increasing variety is volumes of big data. Substantial parallelization of tasks what makes big data challenging but ups and downs in is a very key feature of Pig programs, which enables this trend of big data flow is also a big challenge, social them to handle massive data sets [7]. While Pig and V Issue I Version I I I Version V Issue media response to global events drives huge volumes Hive are meant to perform similar tasks [8]. The Pig is of data and it is required to be analyzed on time before better suited for the data preparation phase of data trend changes. Global events impact on financial processing, while Hive fits the data warehousing and markets, this overhead increase more while dealing with presentation scenario better. The idea is that as data is Volume X Volume un-structured data [5]. incrementally collected, it is first cleaned up using the ) C tools provided by Pig and then stored. From that point ( v. Complexity-wise All above factors make big data a really on Hive is used to run ad-hoc queries analyzing the challenge, huge volumes, continuously multiplying with data. During this work the incremental buildup of a data increasing variety of sources, and with unpredicted warehouse is not enabled and both data preparation trends.

Big Data Analysis: Ap Spark Perspective

Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Building Machine Learning Inference Pipelines at Scale

Evaluation of SPARQL Queries on Apache Flink

Flare: Optimizing Apache Spark with Native Compilation

Large Scale Querying and Processing for Property Graphs Phd Symposium∗

Apache Spark Solution Guide

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Debugging Spark Applications a Study on Debugging Techniques of Spark Developers

Adrian Florea Software Architect / Big Data Developer at Pentalog [email protected]

Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem

Mobile Storm: Distributed Real-Time Stream Processing for Mobile Clouds

Analysis of Real Time Stream Processing Systems Considering Latency Patricio Córdova University of Toronto [email protected]