Design and Implementation of a Component-based Distributed System for Text Mining in Social Networks by Yu Huang A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical and Computer Engineering University of Ontario Institute of Technology Oshawa, Ontario, Canada © Yu Huang, 2016 Abstract Design and Implementation of a Component-based Distributed System for Text Mining in Social Networks Yu Huang Advisor: University of Ontario Institute of Technology, 2016 Professor Qusay H. Mahmoud This report presents the design and implementation of a component-based distributed system for text mining in social networks. The system consists of three main types of components, data collection, data processing and data visualization. Three possible frameworks explore simple linear architecture, message feedback architecture, Kafka centric architecture and provide implementations of them. The final system adopts Kafka- centric architecture in which all components are connected through Kafka brokers. In terms of functionality, data collection components are responsible for collecting data from Twitter and producing messages to Kafka brokers. Data processing components contain a series of basic text mining topologies. Based on JavaScript libraries, data visualization is presented on web pages and allows users to interact with graphs and charts. In order to improve the scalability and performance of text mining, the project selects Apache Storm framework to implement data processing components. In this report, we evaluate the availability of Kafka and Storm, the rates of data collection components and the performance of data processing components. The experimental results demonstrate our system is available and scalable, and the component-based structure of this system enables it to be extended easily. ii Dedication To my parents and girlfriend. They mean a lot to me! iii Acknowledgements I would like to thank my parents and girlfriend first. Without their efforts and encouragement, I wouldn’t have the opportunity to study here. I would like to thank my supervisor, Dr. Qusay H. Mahmoud as well, who gave me instructions and offered advice through the project. I would also like to express my gratitude to Mr. Mark Neville, who is the writing specialist at the UOIT student learning centre. He gave me a lot of advice on English writing. Last but not least, I would like to thank all the people I met in my life. iv Table of Contents Abstract .............................................................................................................................. ii Acknowledgements .......................................................................................................... iv Acronyms and Abbreviations ......................................................................................... ix List of Figures .....................................................................................................................x List of Tables ................................................................................................................... xii Chapter 1 Introduction ....................................................................................................1 Motivation .............................................................................................................2 Problem Statement ................................................................................................2 Contributions .........................................................................................................3 Report Outline .......................................................................................................4 Chapter 2 Background and Related Work .....................................................................6 Social Networks ....................................................................................................6 2.1.1. Social Network Data ..................................................................................... 7 Text Mining ...........................................................................................................8 2.2.1. Normal steps ................................................................................................. 8 2.2.2. TF-IDF .......................................................................................................... 9 Distributed Stream Data Processing ......................................................................9 2.3.1. Apache Storm................................................................................................ 9 2.3.2. Twitter Heron .............................................................................................. 11 2.3.3. Spark Streaming .......................................................................................... 11 2.3.4. ZooKeeper................................................................................................... 12 Data Visualization ...............................................................................................13 Messaging Techniques ........................................................................................13 v 2.5.1. Kafka ........................................................................................................... 15 Related Work.......................................................................................................15 2.6.1. Social Mining .............................................................................................. 15 2.6.2. Data Visualization ....................................................................................... 17 2.6.3. Related Frameworks ................................................................................... 18 2.6.4. Related Products ......................................................................................... 20 Summary .............................................................................................................20 Chapter 3 Proposed Solution .........................................................................................21 Overview .............................................................................................................21 3.1.1. System Architectures .................................................................................. 22 Details of Architecture ........................................................................................25 Scalability, Expandability, and Availability .......................................................26 Trend Analysis ....................................................................................................27 Summary .............................................................................................................30 Chapter 4 Prototype Implementation ...........................................................................31 Overview .............................................................................................................31 Selection of Technologies ...................................................................................32 System Implementation .......................................................................................34 Component Implementation ................................................................................39 4.4.1. Data Collection Components ...................................................................... 39 4.4.2. Data Processing Components ..................................................................... 41 4.4.3. Data Visualization Components ................................................................. 47 4.4.4. Data Persistence Components ..................................................................... 49 Component Management ....................................................................................49 Data Format .........................................................................................................51 Challenges and Solutions ....................................................................................51 Summary .............................................................................................................52 vi Chapter 5 Evaluation and Results .................................................................................54 Experiment Environments ...................................................................................54 5.1.1. Vagrant and VirtualBox .............................................................................. 54 5.1.2. AWS EC2.................................................................................................... 55 Deployment .........................................................................................................56 5.2.1. Single Machine ........................................................................................... 56 5.2.2. Server Configuration ................................................................................... 57 5.2.3. Deployment of ZooKeeper Cluster ............................................................. 58 5.2.4. Building Storm Cluster ............................................................................... 59 5.2.5. Building Kafka Cluster ............................................................................... 59 Twitter Stream Throughput Evaluation...............................................................60
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages103 Page
-
File Size-