A Scalable Framework for Multilevel Streaming Data Analytics Using Deep Learning

A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning Shihao Ge Haruna Isah Farhana Zulkernine Shahzad Khan School of Computing School of Computing School of Computing Gnowit Inc. Queen’s University Queen’s University Queen’s University Ottawa, ON, Canada Kingston, ON, Canada Kingston, ON, Canada Kingston, ON, Canada [email protected] [email protected] [email protected] [email protected] Abstract—The rapid growth of data in velocity, volume, value, Historically, data acquisition and processing requires variety, and veracity has enabled exciting new opportunities several time-consuming steps and traditional analytical and presented big challenges for businesses of all types. processes are limited to using stored and structured data. The Recently, there has been considerable interest in developing time it takes for these steps to complete and drive any systems for processing continuous data streams with the actionable decisions are often delayed to the extent that any increasing need for real-time analytics for decision support in action taken from the analysis is more reactive than proactive the business, healthcare, manufacturing, and security. The in nature [1]. The extraction of information from streaming analytics of streaming data usually relies on the output of text data and tasks involving Natural Language Processing offline analytics on static or archived data. However, (NLP) such as multilingual document classification, news businesses and organizations like our industry partner Gnowit, strive to provide their customers with real time market deduplication, sentiment analysis, and language translation information and continuously look for a unified analytics often requires processing millions of documents in a timely framework that can integrate both streaming and offline manner [2]. These challenges led to streaming analytics, a analytics in a seamless fashion to extract knowledge from large new programming paradigm designed to facilitate real-time volumes of hybrid streaming data. We present our study on analysis and action on data when an event occurs [6]. In designing a multilevel streaming text data analytics framework traditional computing, we access stored data to answer by comparing leading edge scalable open-source, distributed, evolving and dynamic analytic questions. With stream and in-memory technologies. We demonstrate the functionality computing, we generally deploy parallel or distributed of the framework for a use case of multilevel text analytics applications powered with machine learning models that using deep learning for language understanding and sentiment continuously analyze streams of data [1]. analysis including data indexing and query processing. Our There are typically two phases of streaming analytics framework combines Spark streaming for real time text involving large scale and high-velocity multimedia data: processing, the Long Short Term Memory (LSTM) deep first, offline data modeling, which involves the analysis of learning model for higher level sentiment analysis, and other historical data to build models and second, streaming tools for SQL-based analytical processing to provide a scalable analytics, which involves the application of the trained solution for multilevel streaming text analytics. model on live data streams [1]. For instance, indexing a corpus of documents can be implemented very efficiently Keywords-deep learning; natural language processing; news with offline processing, but a streaming approach offers the media; sentiment analysis; unstructured data competitive advantage of timeliness [6]. However, media analytics is challenging because of the unstructured and I. INTRODUCTION noisy nature of media articles. In addition, many NLP The world today is generating an inconceivable amount algorithms are compute-intensive and take a long time for of data every minute [1]. The volume of news data from both execution even for low-latency situations. These challenges mainstream and social media sources is enormous which call for a paradigm shift in the computing architecture for includes billions of archived documents with millions being large scale streaming data processing [2]. Open-source added daily. This digital information is available mainly in a solutions that can process document streams while semi-structured or unstructured format [2] containing a maintaining high levels of data throughput and a low level of wealth of information on what is happening around us across response latency are desirable. the world and our perspectives on these events [3]. Social The goal of this study is to develop a distributed media platforms like Facebook and Twitter have now framework for scaling-up the analysis of media articles to become an inseparable part of human communication and a keep pace with the rate of growth of news streams. The source of a huge amount of data that includes opinions, contributions of this work are as follows. First, we propose a feelings or general information regarding matters of interest scalable framework for multilevel streaming analytics of [4]. Some of this data may lose its value or be lost forever if social media data by leveraging distributed open-source tools not processed immediately. It is, therefore, important to and deep learning architectures. Second, we demonstrate the develop scalable systems that can ingest and analyze utility of the framework for multilevel analytics using unstructured data on a continuous basis [5]. Twitter data. The paper is organized as follows. Section II presents the B. Challenges in Streaming News Data Analytics research problem with a use case scenario and necessary The traditional decision support approaches were based background information including a literature study on on analysis of static or past data to assess goodness and streaming text data analytics. Next, we present our proposed applicability of derived solutions to evolving and dynamic streaming media analytics framework and its features in problems. Recent machine learning approaches train models section III. Section IV provides details about the based on previously collected data and deploy that on stream implementation of the framework while Section V reports computing pipelines to analyze ever-changing stream of data our experimental and evaluation results. Finally, we [1]. The low-latency, scalability, fault-tolerance, and conclude in Section VI with concluding remarks and a list of reliability stipulations of streaming applications introduce future work. new semantics for analytics and also raises new operational challenges [7]. II. BACKGROUND Many traditional machine learning algorithms are Offline machine learning has been a valuable analytical iterative and require multiple passes over the data. For technique for decades. The process involves extracting instance, stochastic gradient descent optimization algorithm relevant information or intelligence from large stored for classification and regression tasks involve repeated scans datasets to predict and prevent future actions or behaviour. over the input data to reach convergence [8]. For these However, with increasing access to continuously generated algorithms to work in a streaming scenario, they need to be real time data, fast just-in-time analytics can help detecting modified for “one pass” processing with compromises in and preventing crimes, internet frauds, and make the most of accuracy. Streaming algorithms must also be designed to market situations in online trading, which require a sub- incorporate strategies for handling temporary peaks in data millisecond reaction time. Critical decisions are made based speed or volume, resource constraints, and concept drifts, a on knowledge from both past and the new data. Therefore, phenomenon that occurs as data evolves and can impact algorithms are developed to train decision models based on predictive models. Many streaming analytics engines exist past data, which are then deployed on streaming data to today some of which are marketed by large software vendors classify situations into different categories such as critical or while others were created as a part of open-source research non-critical or requiring specific actions to enable both projects. The most popular generic open source tools which proactive and reactive business intelligence [1]. support streaming analytics are Spark Streaming and Flink A. Use Case Scenario [6]. These tools enable the development of scalable Our industry-academic collaboration with Gnowit, a distributed streaming algorithms and their execution on media monitoring and analytics company is striving to multiple machines. However, the unstructured and noisy develop an efficient and scalable analytics framework that nature of media articles makes media analytics very can facilitate complex multilevel predictive analytics for challenging. real-time streaming data from a variety of sources including Unstructured data such as news articles from blogs and the Web and social media data. The following are the social media are typically complex and require specialized functional requirements of the analytics pipeline: algorithms that need to be integrated with the streams • Ingest and integrate streaming news articles data from applications [1]. A fundamental part of this study is to several sources such as RSS feeds, Twitter Streaming incorporate a scalable real-time sentiment analysis model in API, and other specialized sources. the stream analytics pipeline for scoring the opinions

A Scalable Framework for Multilevel Streaming Data Analytics Using Deep Learning

Neural Networks (AI) (WBAI028-05) Lecture Notes

Predrnn: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal Lstms

Mxnet-R-Reference-Manual.Pdf

Jssjournal of Statistical Software MMMMMM YYYY, Volume VV, Issue II

Verification of RNN-Based Neural Agent-Environment Systems

Anurag Soni's Blog

Deep Learning Architectures for Sequence Processing

Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges

Julia: Next Generation Language Scipy 2019, Universidad De Los Andes

Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inﬂow Forecasting

Neural Network Architectures and Activation Functions: a Gaussian Process Approach

HGC: a Fast Hierarchical Graph-Based Clustering Method