Implementing a Stream Processing Engine

Final Year Project
School of Computer Science
BSc (Hons) Computer Science with Industrial Experience

Adam Dyk
Supervised by Sandra Sampaio
May 2016

Abstract

This report presents work undertaken to study the relationship between social network posts and election outcomes. Through the implementation of a general-purpose stream processing engine, a constant stream of data is used to compute statistics pertaining to presidential candidates. An initial background stage determined appropriate sources of data, examined previous solutions and raised awareness of possible challenges. The system was subsequently designed through consultation with stakeholders, with implementation following shortly after. The output of the solution was compared to election results, and possible trends and patterns were identified. After comparing the system output with current US presidential election results, it is deemed likely that a relationship exists between social media postings and election outcomes. However, the extent to which social media can be used for the purpose of election polling is yet to be determined.

Acknowledgments

I would like to sincerely thank Dr. Sandra Sampaio for her continuous dedication and commitment throughout this project. Her advice and help have been invaluable. Likewise, I would like to extend my thanks to Mr. Jock McNaught for his feedback during presentations and on project deliverables, which helped me to improve the final system.

Table of Contents

Abstract
Acknowledgments
Chapter 1: Introduction
1.1 Chapter Overview
1.2 Defining the Problem
1.3 Project Overview
1.4 Motivation
1.5 Project Aim
1.6 Report Structure
Chapter 2: Background
2.1 Chapter Overview
2.2 Intro to Big Data
2.3 Social Networks
2.3.1 Facebook
2.3.2 Twitter
2.3.3 LinkedIn
2.3.4 Pinterest
2.3.5 Google+
2.4 Stream Processing
2.5 Big Data in Elections
2.5.1 Project Orca
2.5.2 Project Narwhal
2.5.3 Cambridge Analytica
2.5.4 NationBuilder
2.6 Big Data Challenges
2.6.1 Heterogeneity
2.6.2 Inconsistency and Incompleteness
2.6.3 Scale
2.6.4 Timeliness
2.6.5 Privacy and Data Ownership
2.6.6 The Human Perspective: Visualization and Collaboration
2.7 Chapter Summary
Chapter 3: Design
3.1 Chapter Overview
3.2 Requirements
3.2.1 Stakeholders
3.2.2 Requirement Elicitation
3.2.3 Functional Requirements
3.2.4 Non-Functional Requirements
3.3 Selected Technologies
3.3.1 Data Source
3.3.2 Computing Framework
3.3.3 GeoNames
3.4 System Output
3.4.1 Candidate Buzz Rating
3.4.2 Average Retweet Response Time
3.4.3 Average Retweet Distance Traveled
3.5 System Architecture
3.6 Chapter Summary
Chapter 4: Implementation
4.1 Chapter Overview
4.2 Development Decisions
4.2.1 Programming Language
4.2.2 Short Iterations
4.3 Data Acquisition
4.3.1 OAuth Authentication
4.3.2 Filtering Tweets
4.3.3 Stream Acquisition
4.4 Information Extraction and Cleaning
4.4.1 Status Format
4.4.2 Data Formatting
4.4.3 Retweeted Statuses
4.5 Data Integration, Aggregation and Representation
4.5.1 MapReduce
4.5.2 Haversine Distance Formula
4.6 Chapter Summary
Chapter 5: Analysis
5.1 Chapter Overview
5.2 GeoLocation Unavailability
5.3 US Presidential Election Results
5.4 Candidate Buzz Rating Analysis
5.5 Average Retweet Response Time Analysis
5.6 Average Retweet Distance Traveled Analysis
5.7 Chapter Summary
Chapter 6: Evaluation
6.1 Chapter Overview
6.2 Achievements
6.3 Challenges
6.4 Skills Attained
6.5 Limitations
6.6 Possible Improvements
6.7 Conclusion
References
Appendix
Section A: Candidate Buzz Rating Data
Section B: Average Retweet Response Time Data

Chapter 1: Introduction

1.1 Chapter Overview

This chapter introduces the underlying problem of predicting election results. It then covers the proposed solution, the motivation and aim of the project, and the structure of the report.

1.2 Defining the Problem

The practice of holding elections goes back as far as Ancient Rome, but it became prevalent more recently, at the beginning of the 17th century, throughout Europe and North America. An election is defined as “the formal process of selecting a person for public office or of accepting or rejecting a political proposition by voting”.[1] Ever since the drafting of the Declaration of Independence, the United States of America has been a sovereign republic[2], which empowers its people to select their representatives, including the president. In the build-up to election day, election polls are conducted by a plethora of independent research and news organizations. Despite refinements to the way these polls are structured and conducted, they continue to be biased, inaccurate and unreliable.
Furthermore, these processes are often difficult and time-consuming. As a result, it is essential to research alternative and improved methods of conducting election polls.[3]

1.3 Project Overview

According to recent studies, our generation spends an alarming amount of time using online social networks. Nowadays, the average person spends 1.66 hours a day browsing through their social media accounts.[4] Ideally, the views shared by users could be used to study a candidate’s popularity and therefore their chances of a successful campaign. Thus, this project aims to evaluate public opinion on United States presidential candidates using postings on social media networks. Due to the sheer amount of data generated, the use of Big Data technology is required. Currently, there are a number of emerging applications designed for real-time processing. These systems can be designed with the intention of analyzing user posts. As a result, the project involves the creation of a stream processing engine intended for on-the-fly processing of social network posts.

1.4 Motivation

Today, we produce data at an ever-increasing rate. Between 2012 and 2014, Google’s number of search queries doubled from 2 to 4 million per minute.[5] Consequently, we must design systems that are able to process such amounts of information in real time. The prediction of election outcomes was deemed a suitable subject of study due to the importance of elections across all nations. Citizens cast votes to select high-stature individuals who will ultimately represent their respective countries and make decisions affecting their daily lives. The United States was selected as the country of choice for several reasons. The predominant factor was the timing of the presidential elections, with voting days occurring across all states in the first half of 2016. Thus, the dates allowed for sufficient data gathering before voting occurred and an adequate comparison once results were published.
Furthermore, the United States is currently listed as the 3rd most populous nation[6], providing an ample user