Implementing a Stream Processing Engine
Final Year Project
School of Computer Science
BSc (Hons) Computer Science with Industrial Experience

Adam Dyk
Supervised by Sandra Sampaio
May 2016

Abstract

This report presents the work undertaken to study the relationship between social network posts and election outcomes. Through the implementation of a general-purpose stream processing engine, a constant stream of data is used to compute statistics pertaining to presidential candidates. An initial background stage determined appropriate sources of data, examined previous solutions and raised awareness of possible challenges. The system was subsequently designed through consultation with stakeholders, with implementation following shortly after. The output of the solution was compared to election results, and possible trends and patterns were identified. After comparing the system output with current US presidential election results, it is deemed likely that a relationship exists between social media postings and election outcomes. However, the extent to which social media can be used for the purpose of election polling is yet to be determined.

Page 1 of 59

Acknowledgments

I would like to sincerely thank Dr. Sandra Sampaio for her continuous dedication and commitment throughout this project. Her advice and help have been invaluable. Likewise, I would like to extend my thanks to Mr. Jock McNaught for his feedback during presentations and on project deliverables, which helped me to improve the final system.
Table of Contents

Abstract
Acknowledgments
Chapter 1: Introduction
1.1 Chapter Overview
1.2 Defining the Problem
1.3 Project Overview
1.4 Motivation
1.5 Project Aim
1.6 Report Structure
Chapter 2: Background
2.1 Chapter Overview
2.2 Intro to Big Data
2.3 Social Networks
2.3.1 Facebook
2.3.2 Twitter
2.3.3 LinkedIn
2.3.4 Pinterest
2.3.5 Google+
2.4 Stream Processing
2.5 Big Data in Elections
2.5.1 Project Orca
2.5.2 Project Narwhal
2.5.3 Cambridge Analytica
2.5.4 NationBuilder
2.6 Big Data Challenges
2.6.1 Heterogeneity
2.6.2 Inconsistency and Incompleteness
2.6.3 Scale
2.6.4 Timeliness
2.6.5 Privacy and Data Ownership
2.6.6 The Human Perspective: Visualization and Collaboration
2.7 Chapter Summary
Chapter 3: Design
3.1 Chapter Overview
3.2 Requirements
3.2.1 Stakeholders
3.2.2 Requirement Elicitation
3.2.2 Functional Requirements
3.2.2 Non-Functional Requirements
3.3 Selected Technologies
3.3.1 Data Source
3.3.2 Computing Framework
3.3.2 GeoNames
3.4 System Output
3.4.1 Candidate Buzz Rating
3.4.2 Average Retweet Response Time
3.4.3 Average Retweet Distance Traveled
3.5 System Architecture
3.6 Chapter Summary
Chapter 4: Implementation
4.1 Chapter Overview
4.2 Development Decisions
4.2.1 Programming Language
4.2.2 Short Iterations
4.3 Data Acquisition
4.3.1 OAuth Authentication
4.3.2 Filtering Tweets
4.3.2 Stream Acquisition
4.4 Information Extraction and Cleaning
4.4.1 Status Format
4.4.2 Data Formatting
4.4.3 Retweeted Statuses
4.5 Data Integration, Aggregation and Representation
4.5.1 MapReduce
4.5.2 Haversine Distance Formula
4.6 Chapter Summary
Chapter 5: Analysis
5.1 Chapter Overview
5.2 GeoLocation Unavailability
5.3 US Presidential Election Results
5.4 Candidate Buzz Rating Analysis
5.5 Average Retweet Response Time Analysis
5.6 Average Retweet Distance Traveled Analysis
5.7 Chapter Summary
Chapter 6: Evaluation
6.1 Chapter Overview
6.2 Achievements
6.3 Challenges
6.4 Skills Attained
6.5 Limitations
6.6 Possible Improvements
6.7 Conclusion
References
Appendix
Section A: Candidate Buzz Rating Data
Section B: Average Retweet Response Time Data

Chapter 1: Introduction

1.1 Chapter Overview

This chapter introduces the underlying problem of predicting election results. It then covers the proposed solution, the motivation and aim of the project, and the structure of the report.

1.2 Defining the Problem

The practice of holding elections goes back as far as Ancient Rome, but it became prevalent throughout Europe and North America from the beginning of the 17th century. An election is defined as “the formal process of selecting a person for public office or of accepting or rejecting a political proposition by voting”.[1] Ever since the drafting of the Declaration of Independence, the United States of America has been a sovereign republic[2], empowering its people to select their representatives, including the president. In the buildup to election day, election polls are conducted by a plethora of independent research and news organizations. Despite refinements to the way these polls are structured and conducted, they continue to be biased, inaccurate and unreliable.
Furthermore, these processes are often difficult and time-consuming. As a result, it is essential to research new and improved methods of conducting election polls.[3]

1.3 Project Overview

According to recent studies, our generation spends an alarming amount of time using online social networks. The average person now spends 1.66 hours a day browsing their social media accounts.[4] Ideally, the views shared by users could be used to study a candidate’s popularity and, therefore, the chances of a successful campaign. Thus, this project aims to evaluate public opinion on United States presidential candidates using postings on social media networks. Due to the sheer amount of data generated, the use of Big Data technology is required. Currently, there are a number of emerging applications designed for real-time processing. These systems can be designed with the intention of analyzing user posts. As a result, the project involves the creation of a stream processing engine intended for on-the-fly processing of social network posts.

1.4 Motivation

Today, we produce data at an ever-increasing rate. Between 2012 and 2014, the number of Google search queries doubled from 2 million to 4 million per minute.[5] Consequently, we must design systems that are able to process such amounts of information in real time. The prediction of election outcomes was deemed a suitable subject of study because of the importance of elections across all nations. Citizens cast votes to select high-stature individuals who will ultimately represent their respective countries and make decisions affecting their daily lives. The United States was selected as the country of choice for several reasons. The predominant factor was the timing of the presidential elections, with voting days occurring across all states in the first half of 2016. Thus, the dates allowed sufficient data to be gathered before voting occurred and an adequate comparison to be made once results were published.
Furthermore, the United States is currently listed as the third most populous nation[6], providing an ample user