A Real Time Trending Topics Detection System for Twitter Social Network

International Journal of Computer Applications (0975 – 8887) Volume 152 – No.4, October 2016

Twitrends: A Real Time Trending Topics Detection System for Twitter Social Network

Cosmina Ivan
Department of Computer Science
Technical University of Cluj-Napoca
Cluj County, Romania

Andrei Moldovan
Department of Computer Science
Technical University of Cluj-Napoca
Cluj County, Romania

ABSTRACT
Big Data processing applications have become popular in the last few years. One of the main reasons is that the data generated by current systems and applications is more complex, arrives at a higher speed, and grows exponentially in volume. Another reason is that the traditional methods for data processing and storage have become obsolete, while the new tools and frameworks bring many advantages. Social networks need to process big volumes of data, and their users care about the speed and quality of that processing. We propose an initial approach for processing data from the Twitter social network, in a system which allows a real-time classification of tweets based on topics and user location. With this approach we argue that in a dynamic world, where data increases exponentially and the processing needs to be very fast, the proposed system is capable of determining trending topics in real time.

Keywords
Twitter; trending topics; real-time; geolocation; Big Data

1. INTRODUCTION
Over the last few years, both the volume of data that needs processing and storage, and the variety of sources that provide the data, have increased exponentially. This phenomenon is due to technological evolution and was embraced by well-known companies like Amazon, Google, Twitter, etc., under the name Big Data, because the size and complexity of the data became difficult, even impossible, to manage using traditional data processing systems. The speed at which data is created and needs to be processed and stored has also increased continuously in recent years. The most popular tools and frameworks for real-time stream data processing in the Big Data era will be presented. The paper presents the Big Data concept by defining it from different perspectives, then the various processing models, followed by an analysis of the current industrial and research frameworks. The purpose of this paper is to present the necessary concepts regarding the different Big Data processing models, and to propose and implement a system which makes use of two new stream processing frameworks, Apache Storm and Heron, in order to determine the most discussed subjects on the Twitter social network and to classify them at the geolocation level. To obtain relevant outcomes, the processing needs to be done in real time, with an allowed latency on the order of seconds, using as input a real data stream from the Twitter social network. In this respect, another objective was to propose and implement a classification method that obtains results as accurate as possible by selecting relevant information from the vast amounts of data produced by the social network.

The processing model will involve the following steps: 1) reading the data stream, 2) processing the data in real time, 3) providing real-time data output and analytics. The frameworks Apache Storm and Heron are two open-source, scalable systems developed for deployment in clusters, used recently in real-time Big Data processing and analysis. A cluster is a group of independent servers that collaborate as a unified system in order to offer greater availability and scalability. Horizontal scalability is obtained by allocating more nodes to the cluster, so that the processing is spread over multiple hardware resources.

This paper is organized as follows: Section 2 covers the research we have done on existing stream processing conceptual frameworks and implementations, with a focus on Apache Storm and Heron. Section 3 presents the proposed system design and implementation, in terms of the proposed architectural topology, the description of the functional components, the execution model, and implementation details. Section 4 presents a validation of the system, and Section 5 contains conclusions and future developments.

2. BIBLIOGRAPHIC RESEARCH
The concept of Big Data is relatively new, having become popular in the last decade, and its definition is relatively complex because of the properties that characterize the immense and continuously growing volume of data.

Despite the increased interest, a universally accepted definition for this concept is not established yet. According to MIT Technology Review (2013): "a data set, which is defined as high today, will be with a great certitude considered small in the near future" [1]. The size of data sets is often assessed relative to the technology currently available for processing them. In the absence of a well-established definition, representative players on the market have contributed to the Big Data phenomenon with their own approaches and implementations. For example, Oracle says that "Big Data is derived significantly from business traditionally based on relational databases, correlated with new sources of unstructured data", while Intel considers that "Big Data Opportunities occur in organizations that generate an average of 300 terabytes of data per week" [2].

The most popular way to characterize Big Data is based on the 3 V's: volume, variety and velocity. Volume refers to the size of the data, which has increased exponentially over the last few years, a trend that continues. Velocity describes the frequency at which the data are generated and received. Variety is one of the most important characteristics, as it describes the diversity in content and representation. A fourth V was added recently, namely veracity, which refers to reliability, accuracy and precision taken together [3].

Due to the increasing volume of data that needs to be created, processed and analyzed continuously, the traditional processing and storage methods and technologies became obsolete, and new models and their counterpart frameworks, such as Storm, Spark, Flink or Heron [4], are used for data stream processing.

2.1 Batch processing using MapReduce
The majority of batch processing systems were based on the programming paradigm known as the MapReduce processing model, which was introduced by Google and implemented as open source in Apache Hadoop [4], a framework which integrates various batch processing technologies. MapReduce divides the processing into two major stages, Map and Reduce, each receiving a key-value pair as input, whose type is established by the programmer, and each returning a key-value pair as a result. Batch processing based on the MapReduce model presumes reading a set of data of the size of the batch. When the batch is full (or when the scheduler forces the execution of some processing), the data is submitted for processing, which yields a "framed" model of execution, either in terms of time (at imposed intervals) or in terms of the volume of data (imposed by the size of the batch). Although Hadoop is a good framework for the purpose it was developed for, the model has some limitations, for example the possibility that the problem and data cannot be expressed as key-value pairs, or the delay in producing the output, which is determined by the size of the data that has to be processed and by the computing power of the system.

2.2 Stream processing
As an alternative, the more recent data stream processing, also known as real-time processing, involves a continuous processing of the input data. In this context, real time can be analyzed from two points of view: that of the data and that of the final user. From the data perspective, the term real time refers to processing the data as soon as they are received, so that the results obtained after the analysis are always current. From the perspective of the final user [5], the definition of the Big Data concept correlated with real time is based on the time necessary for the system to respond to an [...]

[...] processing of a big volume of data, and the latency introduced by that is resolved through the service level in real time. It can be observed that both models, the batch model and the stream processing model, bring advantages and disadvantages. Batch processing allows the analysis of a big volume of data, but introduces latency in obtaining results, a problem which is solved by stream processing at the price that data is available only for a limited period of time. The hybrid lambda architecture offers a combined approach that tries to overcome all these disadvantages, but at the price of significant complexity.

2.4 New stream processing frameworks
The main reason that led to the development of a great number of frameworks for stream processing was the inherent limitation of batch processing, namely the high latency it introduces. Real-time processing requires the processing of a continuous flow of data, so that the results are available to the final user with minimum latency. The well-known systems that offer stream data processing are Apache Storm, Heron, Splunk, Spark Streaming and many others [8]. As the first two are of interest here and will be described in more detail, the rest will be presented only briefly.

Splunk is a platform that can be used for real-time analysis of machine-generated big data, and for processing structured and unstructured files. Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations. Spark Streaming is an extension of the base Spark framework that offers a high-level API in Java, Python and R for Big Data processing.

Apache Storm [9,10] is an open-source distributed system for Big Data processing. Storm includes a series of features such as horizontal scalability, fault tolerance, guaranteed data processing and support for different languages.

The basic concepts in Storm are topology, tuple, stream, bolt and spout, and their functional definitions will be presented.
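To make the Storm vocabulary above more concrete, the following is a minimal sketch in plain Python, not Storm's actual API and not the implementation used in this paper. It assumes the usual Storm meanings of the terms: a tuple is a named list of values, a stream is a sequence of tuples, a spout is a source of streams, a bolt consumes a stream and may emit a new one, and a topology is the wiring of spouts and bolts. All names (Tweet, tweet_spout, hashtag_bolt, count_bolt, run_topology) and the sample data are hypothetical, and taking words that start with "#" as topics is an illustrative assumption.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class Tweet(NamedTuple):
    """A tuple in the Storm sense: a named list of values."""
    text: str
    location: str


def tweet_spout(raw: Iterable[Tuple[str, str]]) -> Iterator[Tweet]:
    """Spout stand-in: turns an external source into a stream of tuples."""
    for text, location in raw:
        yield Tweet(text, location)


def hashtag_bolt(stream: Iterable[Tweet]) -> Iterator[Tuple[str, str]]:
    """Bolt stand-in: emits one (location, hashtag) pair per hashtag found."""
    for tweet in stream:
        for word in tweet.text.split():
            if word.startswith("#") and len(word) > 1:
                yield tweet.location, word.lower()


def count_bolt(stream: Iterable[Tuple[str, str]]) -> Counter:
    """Terminal bolt stand-in: keeps a running count per (location, hashtag)."""
    counts: Counter = Counter()
    for key in stream:
        counts[key] += 1
    return counts


def run_topology(raw: Iterable[Tuple[str, str]]) -> Counter:
    """Topology stand-in: wires spout -> hashtag bolt -> counting bolt."""
    return count_bolt(hashtag_bolt(tweet_spout(raw)))


if __name__ == "__main__":
    # Hypothetical sample input: (tweet text, user location) pairs.
    sample = [
        ("Snow in #Cluj today", "RO"),
        ("#cluj marathon #running", "RO"),
        ("Hello #Cluj", "US"),
    ]
    for (location, tag), n in run_topology(sample).most_common():
        print(location, tag, n)
```

Because the stages are chained generators, each tweet flows through the whole pipeline as soon as it is read, which mirrors the continuous, per-tuple processing of the stream model. In a real Storm or Heron cluster the spouts and bolts would run as parallel tasks on several nodes rather than in one process.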
