
ABSTRACT

Title of dissertation: DISCOVERING CREDIBLE EVENTS IN NEAR REAL TIME FROM SOCIAL MEDIA STREAMS

Cody Buntain, Doctor of Philosophy, 2016

Dissertation directed by: Professor Jennifer Golbeck
School of Information

Recent reliance on social media platforms as major sources of news and information, both for journalists and the larger population and especially during times of crisis, motivates the need for better methods of identifying and tracking high-impact events in these social media streams. Social media's volume, velocity, and democratization of information (leading to limited quality controls) complicate rapid discovery of these events and one's ability to trust the content posted about these events. This dissertation addresses these complications in four stages, using Twitter as a model social platform. The first stage analyzes Twitter's response to major crises, specifically terrorist attacks in Western countries, showing these high-impact events do not significantly impact message or user volume. Instead, these events drive changes in Twitter's topic distribution, with conversation, retweets, and hashtags relevant to these events experiencing significant, rapid, and short-lived bursts in frequency. Furthermore, conversation participants tend to prefer information from local authorities/organizations/media over national or international sources, with accounts for local police or local newspapers often emerging as central in the networks of interaction. Building on these results, the second stage in this dissertation presents and evaluates a set of features that capture these topical bursts associated with crises by modeling bursts in frequency for individual tokens in the Twitter stream. The resulting streaming algorithm is capable of discovering notable moments across a series of major sports competitions using Twitter's public stream without relying on domain- or language-specific information or models.
Furthermore, results demonstrate models trained on sporting competition data perform well when transferred to earthquake identification. This streaming algorithm is then extended in this dissertation's third stage to support real-time event tracking and summarization. This real-time algorithm leverages new distributed processing technology to operate at scale and is evaluated against a collection of other community-developed information retrieval systems, where it performs comparably. Further experiments also show this real-time burst detection algorithm can be integrated with these other information retrieval systems to increase overall performance. The final stage then investigates automated methods for evaluating credibility in social media streams by leveraging two existing data sets. These two data sets measure different types of credibility (veracity versus perception), and results show veracity is negatively correlated with the amount of disagreement in and length of a conversation, and perceptions of credibility are influenced by the number of links to other pages, shared media about the event, and the number of verified users participating in the discussion. Contributions made across these four stages are then usable in the relatively new fields of computational journalism and crisis informatics, which seek to improve news gathering and crisis response by leveraging new technologies and data sources like machine learning and social media.

DISCOVERING CREDIBLE EVENTS IN NEAR REAL TIME FROM SOCIAL MEDIA STREAMS

by

Cody Buntain

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2016

Advisory Committee:
Professor Jennifer Golbeck, Chair/Advisor
Professor Jimmy Lin
Professor Nicholas Diakopoulos
Professor Hector Corrada Bravo
Professor V.S. Subrahmanian

© Copyright by Cody Buntain 2016

Dedication

To the two most important people in my life: Leslie and Leigh. Their support, encouragement, and patience cannot be adequately thanked.

Acknowledgments

This work simply would not exist without the encouragement of my advisor, Jennifer Golbeck. Her support and trust allowed me to find my own path as a scientist, and she was always there with a light hand to guide me when I became too lost or discouraged. Jen has been and will continue to be a role model for me in my academic career.

I have benefitted from the insight of several other mentors over the course of this work as well. This research owes a great debt to Jimmy Lin, without whose assistance it would have been a more difficult, slower, and more expensive effort. His allowances for my access to large stores of data, state-of-the-art processing infrastructure, and a community of researchers contributed significantly to the investigations detailed herein and spurred my interest in large-scale data science. A special thanks is also owed to the remaining members of my committee: Nick Diakopoulos, Hector Corrada Bravo, and V.S. Subrahmanian. Their excellent questions, insights, and suggestions contributed greatly to this document's final form.

Given this research's focus on social networks, I would be remiss not to thank my own social network, whose role as an intellectual sounding board and emotional support cannot be overstated. Thanks to Phil, Robin, Sana, both Steves, Jay, Matt, Brenna, Kris, both Amandas, Adam, Greg, and Christine. Likewise, the trust and collaboration I received from Sandy and Michael Ring and my previous colleagues at Pikewerks contributed a great deal to my growth as a scientist and researcher.

Finally, I want to thank my family: my mother, my sister, and Leigh for their continued encouragement and patience. Without my mother's incredible efforts to raise my sister and me as a single mother, none of this would be possible.
Thank you.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Contributions
  1.2 Dissertation Roadmap
2 Relevant Work on Social Media Analysis
  2.1 Social Media, Terrorism, and Crisis Informatics
  2.2 Event Detection
  2.3 Credibility and Social Media
3 Twitter Response to Terrorism
  3.1 Events and Data Collection
  3.2 Twitter Behavioral Analysis
  3.3 Results
  3.4 Observations
  3.5 Consequences for Event Detection Algorithms
    3.5.1 Other Future Work
    3.5.2 Limitations
4 Language-Agnostic Event Discovery in Streams
  4.1 Moment Discovery Defined
    4.1.1 Problem Definition
    4.1.2 The LABurst Model
    4.1.3 Temporal Features
      4.1.3.1 Frequency Regression
      4.1.3.2 Frequency Differences
      4.1.3.3 Inter-Arrival Time
      4.1.3.4 Entropy
      4.1.3.5 Interaction Graph Density
      4.1.3.6 Term-Frequency, Inverse Document Frequency (TF-IDF)
      4.1.3.7 Term-Frequency, Proportional Document Frequency (TF-PDF)
      4.1.3.8 BursT Score
    4.1.4 Bursty Token Classification
  4.2 Evaluation Framework
    4.2.1 Accuracy in Event Discovery
      4.2.1.1 Sporting Competitions
      4.2.1.2 Burst Detection Baselines
      4.2.1.3 Evaluating Accuracy
    4.2.2 Domain Independence
  4.3 Data Collection
  4.4 Experimental Results
    4.4.1 Setting Model Parameters
    4.4.2 Ablation Study
    4.4.3 Event Discovery Results
    4.4.4 Composite Results
    4.4.5 Earthquake Detection
  4.5 Comparative Analysis
    4.5.1 Identifying Event-Related Tokens
    4.5.2 Discovering Unanticipated Moments
    4.5.3 Addressing the Super Bowl
  4.6 Limitations and Extensions
  4.7 Conclusions
5 Real-Time Event Discovery
  5.1 Real-Time Extensions
    5.1.1 Processing the Twitter Stream
    5.1.2 Identifying Bursty Tokens
    5.1.3 Moment Summarization
  5.2 Real-Time Topic Tracking
    5.2.1 Query Construction and Expansion
    5.2.2 Filtering the Twitter Sample Stream
    5.2.3 Topic-Specific Summarization
  5.3 Evaluating Real-Time Topic Tracking
  5.4 NIST Evaluation Results
  5.5 Ensembles with RTTBurst
    5.5.1 Gating with RTTBurst
  5.6 Observations on RTTBurst and Ensembles
  5.7 Conclusions
6 Evaluating Truth in Social Media
  6.1 Data Set Descriptions
    6.1.1 The PHEME Rumor Data Set
    6.1.2 The CREDBANK Data Set
      6.1.2.1 Twitter Data Acquisition
      6.1.2.2 Labeling CREDBANK Topics