A Visual Analytics System for Making Sense of Real-Time Twitter Streams
Total Page:16
File Type:pdf, Size:1020Kb
Western University Scholarship@Western Electronic Thesis and Dissertation Repository 1-31-2020 10:00 AM A Visual Analytics System for Making Sense of Real-Time Twitter Streams Amir HaghighatiMaleki The University of Western Ontario Supervisor Sedig, Kamran The University of Western Ontario Co-Supervisor Haque, Anwar The University of Western Ontario Graduate Program in Computer Science A thesis submitted in partial fulfillment of the equirr ements for the degree in Master of Science © Amir HaghighatiMaleki 2020 Follow this and additional works at: https://ir.lib.uwo.ca/etd Part of the Artificial Intelligence and Robotics Commons, Computer and Systems Architecture Commons, Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons, Other Computer Engineering Commons, Science and Technology Studies Commons, Social Statistics Commons, Software Engineering Commons, and the Systems Architecture Commons Recommended Citation HaghighatiMaleki, Amir, "A Visual Analytics System for Making Sense of Real-Time Twitter Streams" (2020). Electronic Thesis and Dissertation Repository. 6809. https://ir.lib.uwo.ca/etd/6809 This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected]. Abstract Through social media platforms, massive amounts of data are being produced. Twitter, as one such platform, enables users to post “tweets” on an unprecedented scale. Once analyzed by machine learning (ML) techniques and in aggregate, Twitter data can be an invaluable resource for gaining insight. However, when applied to real-time data streams, due to covariate shifts in the data (i.e., changes in the distributions of the inputs of ML algorithms), existing ML approaches result in different types of biases and provide uncertain outputs. This thesis describes a visual analytics system (i.e., a tool that combines data visualization, human-data interaction, and ML) to help users make sense of the real-time streams on Twitter. As proofs of concept, public-health and political discussions were analyzed. The system not only provides categorized and aggregate results but also enables the stakeholders to diagnose and to heuristically suggest fixes for the errors in the outcome. Keywords Visual Analytics, Stream Processing, Real-Time Twitter Analysis, Human-Information Interaction ii Summary for Lay Audience Through social media platforms, massive amounts of data are being produced. Twitter, as a microblogging social media platform, enables users to post short updates as “tweets” on an unprecedented scale. Once analyzed by using machine learning (ML) techniques and in aggregate, Twitter data can be an invaluable resource for gaining insight. However, when applied to real-time data streams, due to covariate shifts in the data (i.e., changes in the distributions of the inputs of ML algorithms), existing ML approaches result in different types of biases and provide uncertain outputs. This thesis describes a visual analytics system (i.e., a tool that combines data visualization, human-data interaction, and ML) to help users monitor, analyze, and make sense of the streams of discussions on Twitter in a real-time manner. This system helps the users to understand “who” is talking about “what” and “how” and/or “why” a tweet is posted. As case-studies, we use public-health and election discussions to demonstrate the capabilities enabled by the system. The system then not only provides categorized and aggregate results of such discussions but also enables the stakeholders to diagnose and to heuristically suggest fixes for the errors in the outcome, resulting in a more detailed understanding of the discussions. iii Acknowledgments Words cannot describe my gratefulness to my wonderful supervisors, Dr. Kamran Sedig and Dr. Anwar Haque. I would like to specifically express my sincerest gratitude to Dr. Sedig without whose exemplary, sound advice this work would not have been possible. I am truly indebted to your dedication, life-lessons, and fatherly mentorship. No one can hope for a better supervisor. I am also thankful to Wenjun Chen for her generous help during the initial course of this project. Finally, I offer my deepest gratitude to my parents, Rahele and Akbar, without whose sacrifices and vision of the future, I could not even be where I am today. iv Table of Contents Abstract ............................................................................................................................... ii Summary for Lay Audience ............................................................................................... iii Acknowledgments.............................................................................................................. iv Table of Contents ................................................................................................................ v List of Tables .................................................................................................................... vii List of Figures .................................................................................................................. viii List of Appendices .............................................................................................................. x Chapter 1 ............................................................................................................................. 1 1 Introduction .................................................................................................................... 1 1.1 Twitter: a Microblogging Platform ......................................................................... 2 1.2 Usages and Challenges ........................................................................................... 4 1.2.1 Twitter for Health Research ........................................................................ 6 1.2.2 Twitter for Election Campaigns .................................................................. 7 1.3 Constraints on Data Collection ............................................................................... 8 1.4 Research Questions ............................................................................................... 10 Chapter 2 ........................................................................................................................... 12 2 Background .................................................................................................................. 12 2.1 Approaches for Twitter Data Analysis ................................................................. 12 2.1.1 Assessing “who” ....................................................................................... 13 2.1.2 Assessing “what” ...................................................................................... 14 2.1.3 Assessing “how” ....................................................................................... 14 2.1.4 Assessing “why” ....................................................................................... 15 2.2 Stream Processing ................................................................................................. 16 2.2.1 Covariate Shift .......................................................................................... 19 v 2.3 Visual Analytics .................................................................................................... 20 2.3.1 Computational Tools and Techniques ...................................................... 21 2.3.2 Visualizations ............................................................................................ 21 2.3.3 Interactions ................................................................................................ 22 2.3.4 Human-Information Interaction ................................................................ 22 Chapter 3 ........................................................................................................................... 23 3 System Design .............................................................................................................. 23 3.1 Data Flow Design: Pipeline Architecture ............................................................. 24 3.1.1 Pre-Processor ............................................................................................ 26 3.1.2 Sentiment Analyzer ................................................................................... 27 3.1.3 Label Predictor (Classifier) ....................................................................... 27 3.2 Interface Design .................................................................................................... 30 3.2.1 A Progressive Web Application................................................................ 30 3.2.2 Analytics Page .......................................................................................... 32 3.2.3 Compare Page ........................................................................................... 34 3.2.4 Shuffler Page ............................................................................................. 36 3.3 Case Studies .......................................................................................................... 43 3.3.1 Public-Health Research ............................................................................