Topics, Events, Stories in Social Media
Ting Hua
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Applications
Chang-Tien Lu, Chair
Naren Ramakrishnan
Ing-Ray Chen
Chandan K. Reddy
Zhenhui Jessie Li
Dec 15, 2017
Falls Church, Virginia
Keywords: Social media, Topic modeling, Event Detection
Copyright 2017, Ting Hua

Topics, Events, and Stories in Social Media
Ting Hua
(ABSTRACT)
This thesis focuses on developing methods for social media analysis. Specifically, five directions are proposed here: 1) semi-supervised detection of targeted-domain events, 2) the study of topical interactions among multiple datasets, 3) discriminative learning to identify common and distinctive topics, 4) epidemic modeling for flu forecasting, with simulation driven by signals from social media data, and 5) storyline generation for massive unorganized documents.
For the first method, existing solutions in spatiotemporal event detection are mostly supervised approaches that require expensive human labeling effort. The contributions of our proposed work include: (1) Developed a semi-supervised framework, (2) Designed a novel label generation method, and (3) Proposed an innovative multinomial spatial-scan algorithm.
For the second method, most traditional solutions in topic modeling are designed to analyze formal documents such as news reports, and cannot handle noisy social media data efficiently and effectively. The contributions of the proposed work for the second task include: (1) Proposed a novel generative model jointly considering Twitter and news data in one unified framework, (2) Designed an effective algorithm for model parameter inference, and (3) Explored real-world applications by utilizing the outputs of the proposed model.
Discriminative learning is the basis for comparative thinking; however, most related previous studies only work in scenarios involving two datasets. The third proposed work contributes in the following aspects: (1) Proposed a Bayesian model to identify common and distinct topics for multiple datasets, (2) Developed efficient parameter inference algorithms based on Gibbs Sampling, and (3) Evaluated the proposed model on various datasets with comparison to important baselines.
Existing work on epidemic modeling either cannot guarantee the timeliness of disease surveillance or cannot accurately characterize the underlying mechanism of flu spreading. The contributions of the fourth task include: (1) Proposed a novel integrated framework combining computational epidemiology and social media mining, (2) Designed an effective algorithm for model parameter inference, and (3) Compared the proposed method with important baselines on various datasets.
In the field of storyline generation, traditional solutions cannot clearly represent the underlying structure of related events. At the same time, most of them require human-recognized labels as inputs. The contributions for this work include: (1) Proposed a generative framework for storyline detection, (2) Developed efficient parameter inference algorithms, and (3) Utilized the proposed model to analyze real-world cases.
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, IARPA, DoI/NBC, or the US Government.

Topics, Events, and Stories in Social Media
Ting Hua
(GENERAL AUDIENCE ABSTRACT)
The rise of “big data”, especially social media data (e.g., Twitter, Facebook, Youtube), gives new opportunities for understanding human behavior. Consequently, novel computing methods for mining patterns in social media data are desired. By applying these approaches, it has become possible to aggregate publicly available data to capture the triggers underlying events, detect on-going trends, and forecast future happenings. This dissertation provides comprehensive studies of social media data analysis. The goals of the dissertation include: early event detection, future event prediction, and event chain organization. Specifically, these goals are achieved through efforts in the following aspects: (1) semi-supervised and unsupervised methods are developed to collect early signals from social media data and detect on-going events; (2) graphical models are proposed to model the interaction and comparison among multiple datasets; (3) traditional computational methods are combined with newly emerging social media data analysis for the purpose of fast epidemic prediction; (4) events at different time stamps are organized into event chains via novel probabilistic models. The effectiveness of our approaches is evaluated using various datasets, such as Twitter posts and news articles. In addition, interesting case studies are provided to show the models’ abilities in real-world exploration.

To my mother, for her love and support and spirit.

Acknowledgments
I am grateful to many friends and colleagues for their great support of my Ph.D. study. First and foremost, I express my deepest gratitude to my advisor and mentor, Dr. Chang-Tien Lu, for his advice and support. Dr. Lu is the best advisor a Ph.D. student can hope for. His great skill in advising is a combination of intelligence, patience, and support. His guidance and wisdom made my Ph.D. both interesting and productive. Dr.
Lu can understand the details of my work quickly and capture the key values. Dr. Lu also helped me a lot to make my presentations clear, simple, and easy to follow. Dr. Lu is a hard-working man who spent most of his time in the lab. I will leave VT with his valuable advice and respectable qualities, which will continue to benefit me both in life and in research. I am also very thankful to Dr. Naren Ramakrishnan. He continuously gave me helpful suggestions on research directions all these years. He broadened my research views, and I always felt inspired during our conversations. I want to thank Dr. Ing-Ray Chen for his great advice and for holding me to a high standard in presentation during our collaboration. His efforts turned initial class proposals into great research publications. I want to thank Dr. Reddy for his help, advice, and guidance on multiple works. He is a knowledgeable professor who can always give me new information about the most up-to-date research. Thank you to Dr. Zhenhui Li. Your insightful feedback and comments always gave me new perspectives from which to rethink my work. Also, I need to thank Johnny Cash, Bob Dylan, and Nirvana; Nietzsche and Camus; Marguerite Duras, Gabriel Garcia Marquez, and Jorge Luis Borges. With them, this dissertation was delayed by at least one year. However, without them, it could never have been started or finished.

Contents
1 Introduction
  1.1 Research Issues
    1.1.1 Twitter Event Detection
    1.1.2 Underlying Factors behind Social Media and News
    1.1.3 Learning Common and Distinctive Topics from Multiple Datasets
    1.1.4 Seeding Simulation with Updates from Social Media Data
    1.1.5 Storyline Generation using Social Media
  1.2 Goals and Contributions
  1.3 Organization
2 Twitter Event Detection
  2.1 STED: Semi-Supervised Targeted Event Detection
    2.1.1 Introduction
    2.1.2 Framework and Methods
      Automatical Label Creation and Expansion
      Twitter Text Classification
      Location Estimation
    2.1.3 Demonstration
    2.1.4 Conclusion
  2.2 Automatic Targeted-Domain Spatiotemporal Event Detection in Twitter
    2.2.1 Introduction
    2.2.2 Related Work
      Event detection in newswire documents
      General-domain event detection in Twitter
      Targeted-domain event detection in Twitter
      Distant supervision and transfer learning
    2.2.3 Framework and Problem Formulation
      Framework
      Problem Formulation
    2.2.4 Automatic Label Generation
      Feature Extraction
      Relevancy Ranking
      Textual Similarity
      Spatial Similarity
      Temporal Similarity
      Label Refinement
    2.2.5 Spatiotemporal Event Detection
      Tweet Classifier
      Event Location Estimation
    2.2.6 Results
      Datasets and evaluation metrics
      Methods for Comparison
      Parameter settings
      Performance Analysis
      Overall Relevance Evaluation
      Evaluation of the Tweet Classifier
      Case Study
    2.2.7 Conclusion
3 Underlying Factors behind Social Media and News
  3.1 Analyzing Civil Unrest through Social Media
    3.1.1 Introduction
    3.1.2 Event-related Tweet Extraction
    3.1.3 Identifying Contributing Factors
    3.1.4 Event Evolution Analysis
    3.1.5 Conclusion
  3.2 Topical Analysis of Interactions Between News and Social Media
    3.2.1 Introduction
    3.2.2 Related Work
      Topic Modeling on Short Texts
      Transfer Knowledge in Multiple Datasets
      Mining Time Series and Topic Evolution
    3.2.3 Problem statement and Model
    3.2.4 Problem Statement
      Model
    3.2.5 Inference via Gibbs Sampling
    3.2.6 Discovery for topic lags and influence
      Topic distribution differences
      Topic temporal patterns
      Topic influence
      Key news reports and tweets
    3.2.7 Experiment
      Dataset
      Results of modeling performance
      Results of topic evolution discovery
  3.3 Conclusion
4 A Probabilistic Model for Discovering Common and Distinctive Topics from Multiple Datasets
  4.1 Introduction
  4.2 Related Work
    4.2.1 Traditional Topic Models
    4.2.2 Discriminative Topic Modeling
    4.2.3 Global and Local Aspects Mining
  4.3 Proposed Method
    4.3.1 Problem Statement
    4.3.2 Model Definition
  4.4 Inference . ..